Provided by: anfo_0.98-9_amd64

NAME

       anfo-tool - process native ANFO binary files

SYNOPSIS

       anfo-tool [ option | pattern ... ]

DESCRIPTION

       anfo-tool filters, processes and converts the files created by anfo.  Every pattern on the
       command line is wildcard expanded.  For every input file (or the standard input, if no
       pattern is given), anfo-tool builds a chain of input filters.  It then merges these input
       streams in one of several ways and splits the result up into multiple output streams, each
       of which can have a chain of output filters applied.

OPTIONS

   General Options
       These  options  apply  globally and modify the behavior of the whole program.  They can be
       placed anywhere in the command line.

       -V, --version
              Print version number and exit.

       -q, --quiet
              Suppress all output except fatal error messages.

       -v, --verbose
              Produce more output, including progress indicators for most operations.

       --debug
              Produce debugging output in addition to progress information.

       -n, --dry-run
              Parse command line, optionally print a description of the intended operations, then
              exit.

       --vmem X
              Limit  virtual  memory to X megabytes.  If memory runs out, anfo-tool tries to free
              up memory by forgetting about big files, e.g. genomes.  Use this  option  to  avoid
              swapping  or  out-of-memory  conditions  when  operations  involve  big or multiple
              genomes.

   Setting Parameters
       A parameter can be set multiple times on the command  line  and  will  overwrite  previous
       settings.   Any  filter  option  that  needs a parameter picks up the last definition that
       appeared before the filter option.
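
       The "last definition wins" rule can be pictured with a small Python model.  This is only a
       sketch of the behavior described above, not anfo-tool's own code: the (name, value) token
       layout and the helper bind_parameters are hypothetical; the defaults are the ones listed
       for the individual options below.

```python
# Hypothetical model of parameter binding; not taken from the anfo sources.
PARAM_DEFAULTS = {"slope": 7.5, "intercept": 20.0, "context": 0, "genome": None}


def bind_parameters(tokens):
    """Walk (name, value) tokens from left to right.

    "set-<param>" tokens update the current parameter table, "clear-genome"
    resets the genome, and every other token is treated as a filter that
    receives a snapshot of the table as it stands at that point.
    """
    params = dict(PARAM_DEFAULTS)
    bound = []
    for name, value in tokens:
        if name == "clear-genome":
            params["genome"] = None
        elif name.startswith("set-"):
            params[name[len("set-"):]] = value
        else:
            bound.append((name, value, dict(params)))
    return bound


filters = bind_parameters([
    ("set-slope", 5.0),
    ("set-slope", 6.0),      # overwrites the previous setting
    ("filter-score", None),  # picks up slope 6.0
    ("set-slope", 9.0),      # only affects filters that follow
    ("filter-score", None),  # picks up slope 9.0
])
for name, _, params in filters:
    print(name, params["slope"], params["intercept"])
```

       In this model a parameter set after a filter option never reaches that filter, which is
       why the order of options on the command line matters.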

       --set-slope S
              Set the slope parameter to S.  The slope is used together with the intercept  where
              filters apply to alignment scores; alignments scoring no worse than slope * (length
              - intercept) are considered good.  The default is 7.5.

       --set-intercept L
              Set the intercept parameter to L.  The intercept is used together  with  the  slope
              where  filters  apply to alignment scores; alignments scoring no worse than slope *
              (length - intercept) are considered good.  The default is 20.
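
              The threshold shared by slope and intercept can be written out as a small Python
              function.  This is a sketch of the rule as stated here, not the program's code.  It
              assumes that lower scores are better, so that "no worse than" means "less than or
              equal to"; the name is_good is made up for illustration.

```python
def is_good(score, length, slope=7.5, intercept=20.0):
    """Return True if an alignment passes the slope/intercept test.

    Assumption: scores are penalties, so smaller values are better.
    """
    return score <= slope * (length - intercept)


# With the defaults, a 60-base read may score at most 7.5 * (60 - 20).
limit = 7.5 * (60 - 20)
print(limit, is_good(limit, 60), is_good(limit + 1, 60))
```

              Under this reading, a read no longer than the intercept can only pass with a score
              of zero or less.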

       --set-context C
              Set the context parameter to C.  The context is the number of surrounding bases  of
              the reference included when printing alignments in text form.  The default is 0.

       --set-genome G
              Set the genome parameter to G.  Many filters will only consider the best alignments
              to this specific genome if it is set.  If no  genome  is  set,  the  globally  best
              alignment is used.

       --clear-genome
              Clear  the  genome  parameter.   Filters  apply  to  the  globally  best  alignment
              afterwards.

   Filter Options
       Filters can be applied before merging the inputs or after splitting them back up.

       -s, --sort-pos=n
              Sort by alignment position while buffering no more than n  MiB  in  memory.   If  a
              genome is set, alignments to that genome are used.

       -S, --sort-name=n
              Sort by read name while buffering no more than n MiB in memory.

       -l, --filter-length=L
              Retain alignments only for reads that are at least L bases long.  The reads
              themselves are kept.

       -f, --filter-score
              Retain alignments only if their score is good enough.  Uses slope and intercept.

       --filter-mapq=Q
              Remove alignments with mapping quality below Q.

       -h, --filter-hit=SEQ
              Keep only reads that have a hit to a sequence named SEQ.  If SEQ  is  empty,  reads
              are  kept  if they have any hit.  If the genome parameter is set, only hits to that
              genome count.

       --delete-hit=SEQ
              Delete alignments to SEQ.  If SEQ is empty, all alignments  are  deleted.   If  the
              genome parameter is set, only alignments to that genome are deleted.

       --filter-qual=Q
              Mask  out  bases  with quality below Q.  Such a base is replaced by the N ambiguity
              code.
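
              The masking step can be sketched in a few lines of Python.  The representation of
              a read as a string plus a list of integer qualities is an assumption made for the
              example, as is the name mask_low_quality.

```python
def mask_low_quality(bases, quals, min_q):
    """Replace every base whose quality is below min_q with 'N'."""
    return "".join(b if q >= min_q else "N" for b, q in zip(bases, quals))


print(mask_low_quality("ACGT", [30, 10, 20, 5], 20))
```

              A base whose quality equals Q is kept, since only qualities below Q are masked.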

       --multiplicity=N
              Keep only reads of molecules that have been sequenced at least N times.  Reads  are
              considered to come from the same original molecule if their aligned coordinates are
              identical.
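
              As an illustration, the rule can be modeled in Python by counting reads per set of
              aligned coordinates.  The (name, coordinates) tuples and the helper
              filter_multiplicity are hypothetical and only serve the example.

```python
from collections import Counter


def filter_multiplicity(reads, n):
    """Keep reads whose aligned coordinates occur at least n times.

    reads is a list of (name, coords) pairs; coords is any hashable value,
    here (sequence, start, end).
    """
    counts = Counter(coords for _, coords in reads)
    return [read for read in reads if counts[read[1]] >= n]


reads = [
    ("r1", ("chr1", 100, 150)),
    ("r2", ("chr1", 100, 150)),
    ("r3", ("chr1", 300, 360)),
]
print(filter_multiplicity(reads, 2))
```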

       --subsample=F
              Subsample a fraction F of the results.  Every read is independently and randomly
              chosen to be kept or not.
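
              Independent sampling means that each read is kept with probability F, so the size
              of the result is only approximately F times the input.  A minimal Python sketch of
              that idea follows; the seed argument is added here for reproducibility and does
              not correspond to any anfo-tool option.

```python
import random


def subsample(reads, fraction, seed=None):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]


sample = subsample(range(10000), 0.5, seed=1)
print(len(sample))
```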

       --inside-regions=FILE
              Read  a  list  of  regions  from  FILE,  then  keep only alignments that overlap an
              annotated region.

       --outside-regions=FILE
              Read a list of regions from FILE, then keep only alignments that do not overlap  an
              annotated region.
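
              Both region filters rest on the same overlap test and differ only in which side
              they keep.  The Python sketch below assumes half-open intervals on a named
              sequence; the actual file format and coordinate convention are not specified here,
              and the helper names are made up.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def filter_regions(alignments, regions, inside=True):
    """Keep alignments that overlap (inside=True) or avoid (inside=False) all regions.

    alignments and regions are lists of (sequence, start, end) tuples.
    """
    kept = []
    for seq, start, end in alignments:
        hit = any(
            seq == r_seq and overlaps(start, end, r_start, r_end)
            for r_seq, r_start, r_end in regions
        )
        if hit == inside:
            kept.append((seq, start, end))
    return kept


regions = [("chr1", 100, 200)]
alignments = [("chr1", 150, 250), ("chr1", 200, 260), ("chr2", 150, 250)]
print(filter_regions(alignments, regions, inside=True))
print(filter_regions(alignments, regions, inside=False))
```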

   Special Filters
       -d, --rmdup=Q
              Remove PCR duplicates and clamp quality scores to Q.  Two reads are considered
              duplicates if their aligned coordinates are identical.  If a genome is set, the
              best alignment to that genome is used, else the globally best alignment.  Both
              alignments must be good, as determined by slope and intercept.  For a set of
              duplicates, a consensus is called, generally increasing the quality scores.  If a
              resulting quality score exceeds Q, it is set to Q.  This filter requires the input
              to be sorted by alignment coordinate on the selected genome.
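
              The need for sorted input and the clamping step can be illustrated in Python.
              This sketch does not reproduce the consensus calling, whose details are not
              described here; it only groups adjacent reads with identical coordinates and
              clamps a list of qualities.  The record layout and helper names are assumptions.

```python
from itertools import groupby


def group_duplicates(sorted_reads):
    """Yield lists of adjacent reads that share the same aligned coordinates.

    groupby only merges neighbors, so duplicates are found only if the
    input is sorted by coordinate.
    """
    for _, group in groupby(sorted_reads, key=lambda read: read["coords"]):
        yield list(group)


def clamp_qualities(quals, max_q):
    """Limit every quality score to at most max_q."""
    return [min(q, max_q) for q in quals]


reads = [
    {"name": "r1", "coords": ("chr1", 100, 150)},
    {"name": "r2", "coords": ("chr1", 100, 150)},
    {"name": "r3", "coords": ("chr1", 200, 260)},
]
groups = list(group_duplicates(reads))
print([len(g) for g in groups], clamp_qualities([10, 45, 60], 40))
```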

       --duct-tape=NAME
              Duct-tape overlapping alignments into contigs and call a consensus for them.  If a
              genome is set, alignments to that genome are used, else the globally best
              alignments.  This filter requires input to be sorted by alignment coordinate on
              the genome.  Output is a set of contigs, every position gets assigned a consensus
              base, a quality score and likelihoods for every possible diallele.  (It is called
              duct-taping because it kind of looks like an assembly, but is not nearly as
              solid.)

       --edit-header=ED
              Invoke the editor ED on the text representation of the stream's header.  This can
              be used to clean up headers that have accumulated too much cruft.

   Merging Filters
       Exactly one merging filter should be given on the command line.  All filter options
       occurring before it are part of the input filter chains; all further filters become
       output chains.  If no merging filter is given, --concat is assumed, and all filters are
       input filters.
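
       The way the merging filter divides the command line can be modeled in Python.  This is a
       hypothetical sketch of the rule above, working on a plain list of filter names; it does
       not model output options or parameters, and split_chains is not part of anfo.

```python
MERGING_FILTERS = {"concat", "merge", "join", "mega-merge"}


def split_chains(filter_names):
    """Split filter names into (input chain, merging filter, output chain).

    Without an explicit merging filter, "concat" is assumed and every
    filter belongs to the input chain.
    """
    for i, name in enumerate(filter_names):
        if name in MERGING_FILTERS:
            return filter_names[:i], name, filter_names[i + 1:]
    return list(filter_names), "concat", []


print(split_chains(["filter-score", "sort-pos", "merge", "rmdup"]))
print(split_chains(["filter-score"]))
```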

       -c, --concat
              Concatenate all input streams in the order they appear on the command line.

       -m, --merge
              Merge  sorted  input streams, producing a sorted result.  All inputs must be sorted
              in the same way.
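
              Merging already sorted streams is the classic k-way merge.  The Python sketch
              below uses heapq.merge from the standard library on plain (sequence, position)
              tuples; it illustrates the technique and makes no claim about how anfo-tool
              implements it.

```python
import heapq


def merge_sorted(streams, key=None):
    """Lazily merge several individually sorted streams into one sorted stream."""
    return heapq.merge(*streams, key=key)


a = [("chr1", 10), ("chr1", 50), ("chr2", 5)]
b = [("chr1", 20), ("chr2", 1)]
merged = list(merge_sorted([a, b]))
print(merged)
```

              The result is sorted only if every input uses the same ordering, which mirrors the
              requirement stated for this option.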

       -j, --join
              Join input streams and retain the single best hit to each genome.  Every input
              stream must contain a record for every read; reads are buffered in memory until
              all of their hits are collected.  Joining therefore works well if all inputs are
              in nearly the same order.  If reads are missing from some streams, joining them
              will waste memory.
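
              The buffering behavior can be sketched in Python.  The model below reads the
              streams round-robin, keeps a read in memory until every stream has delivered its
              record, and retains the best hit per genome.  It assumes that lower scores are
              better and uses a made-up record layout of (read name, {genome: score}).

```python
def join_streams(streams):
    """Yield (name, best hits per genome) once all streams have reported a read."""
    n_streams = len(streams)
    pending = {}  # reads still waiting for records from other streams
    iterators = [iter(s) for s in streams]
    while iterators:
        for it in list(iterators):
            record = next(it, None)
            if record is None:       # this stream is exhausted
                iterators.remove(it)
                continue
            name, hits = record
            entry = pending.setdefault(name, {"seen": 0, "best": {}})
            entry["seen"] += 1
            for genome, score in hits.items():
                old = entry["best"].get(genome)
                if old is None or score < old:
                    entry["best"][genome] = score
            if entry["seen"] == n_streams:
                yield name, pending.pop(name)["best"]


a = [("r1", {"hg": 10}), ("r2", {"hg": 5})]
b = [("r1", {"hg": 7, "pt": 3}), ("r2", {"pt": 9})]
joined = list(join_streams([a, b]))
print(joined)
```

              In this model a read that never appears in one of the streams stays in the buffer
              until the end, which corresponds to the wasted memory mentioned above.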

       --mega-merge
              Merge many streams such as  those  produced  by  running  anfo-sge.   Streams  that
              operated on the same reads are joined, then everything is merged.

   Output Options
       If an output option is given on the command line, the current output filter chain is ended
       and a new one is started.  If no output option is given, a textual representation  of  the
       final stream is written to stdout.  All output options accept - to write to stdout.

       -o, --output FILE
              Write  native  binary  stream  (a  compressed protobuf message) to FILE.  Writing a
              binary stream and reading it back in is lossless.

       --output-text FILE
              Write protobuf text stream to FILE.  If the  necessary  genomes  are  available,  a
              textual  representation of the alignments is included.  If the context parameter is
              set, that many additional bases of the reference upstream and downstream  from  the
              alignment are included.

       --output-sam=FILE
              Write alignments in SAM format to FILE.

       --output-glz FILE
              Write contigs in GLZ 0.9 format to FILE.  Generating GLZ only works after
              application of --duct-tape; every contig becomes a GLZ record.

       --output-3aln FILE
              Write contigs in a table based format to FILE.  The  format  is  still  subject  to
              change, see the source code for detailed documentation.

       --output-fasta FILE
              Write alignments(!) in FastA format to FILE.  Alignments are written as a pair of
              reference and query sequence; aligned coordinates are indicated in the description
              of the query sequence.  If the context parameter is set, that many additional bases
              of the reference upstream and downstream from the  alignment  are  included.   This
              format  is  not  suggested  for  any  serious  use,  it  exists  to  support legacy
              applications.

       --output-fastq FILE
              Write  sequences(!)  in  FastQ  format  to   FILE.    Writing   FastQ   effectively
              reconstitutes the input to ANFO if no filtering was done on the results.

       --output-table FILE
              Write per-alignment statistics to FILE.  The file has three columns: sequence
              length, alignment score, difference to next best alignment.  It is mainly useful to
              analyze/visualize the distribution of alignment scores.

       --stats FILE
              Write simple statistics to FILE.  This results in some simple summary statistics of
              a whole stream: number of aligned sequences, average length, GC content.
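
              The three summary values can be computed as in the following Python sketch.  It
              works on plain sequence strings and is only meant to make the definitions
              concrete; the exact output layout of --stats is not described here, and the
              function name summarize is hypothetical.

```python
def summarize(sequences):
    """Return count, mean length and GC fraction for a list of sequences."""
    count = len(sequences)
    total = sum(len(s) for s in sequences)
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return {
        "count": count,
        "mean_length": total / count if count else 0.0,
        "gc_content": gc / total if total else 0.0,
    }


print(summarize(["GGCC", "AATT"]))
```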

ENVIRONMENT

       ANFO_PATH
              Colon separated list of directories searched for genome and index files.
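
              A search along such a list can be sketched in Python as follows.  Whether
              anfo-tool also looks in the current directory or in built-in defaults is not
              stated here, so the sketch searches only the listed directories; the helper
              find_in_anfo_path and the file name used below are made up for the example.

```python
import os


def find_in_anfo_path(filename, environ=os.environ):
    """Return the first match for filename in the ANFO_PATH directories, or None."""
    for directory in environ.get("ANFO_PATH", "").split(":"):
        if not directory:            # skip empty entries such as "a::b"
            continue
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


print(find_in_anfo_path("example.dna", {"ANFO_PATH": "/nonexistent-dir"}))
```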

       ANFO_TEMP
              Temporary space used for sorting of large files.

FILES

       /etc/popt
              The system wide configuration file for popt(3).   anfo-tool  identifies  itself  as
              "anfo-tool" to popt.

       ~/.popt
              Per user configuration file for popt(3).

BUGS

       The command line of this tool is way too complicated and its semantics are
       counterintuitive.  Using anfo-tool is probably best avoided in most cases; the Guile
       bindings should provide a much more scalable and easier to understand interface.

AUTHOR

       Udo Stenzel <udo_stenzel@eva.mpg.de>

SEE ALSO

       anfo(1), fa2dna(1), popt(3), fasta(5)