Provided by: anfo_0.98-7build2_amd64

NAME

       anfo-tool - process native ANFO binary files

SYNOPSIS

       anfo-tool [ option | pattern ... ]

DESCRIPTION

       anfo-tool filters, processes and converts the files created by anfo.  Every pattern on the command line
       is wildcard expanded.  For every input file (or the standard input, if no pattern is given), anfo-tool
       builds a chain of input filters.  It then merges these input streams in one of several ways and splits
       the result up into multiple output streams, each of which can have a chain of output filters applied.

OPTIONS

   General Options
       These  options  apply globally and modify the behavior of the whole program.  They can be placed anywhere
       in the command line.

       -V, --version
              Print version number and exit.

       -q, --quiet
              Suppress all output except fatal error messages.

       -v, --verbose
              Produce more output, including progress indicators for most operations.

       --debug
              Produce debugging output in addition to progress information.

       -n, --dry-run
              Parse command line, optionally print a description of the intended operations, then exit.

       --vmem X
              Limit virtual memory to X megabytes.  If memory runs out, anfo-tool tries to  free  up  memory  by
              forgetting  about  big  files,  e.g.  genomes.  Use this option to avoid swapping or out-of-memory
              conditions when operations involve big or multiple genomes.

   Setting Parameters
       A parameter can be set multiple times on the command line and  will  overwrite  previous  settings.   Any
       filter option that needs a parameter picks up the last definition that appeared before the filter option.

       --set-slope S
              Set  the  slope parameter to S.  The slope is used together with the intercept where filters apply
              to alignment scores; alignments scoring no worse than slope * (length - intercept) are  considered
              good.  The default is 7.5.

       --set-intercept L
              Set  the  intercept  parameter  to L.  The intercept is used together with the slope where filters
              apply to alignment scores; alignments scoring no worse than  slope  *  (length  -  intercept)  are
              considered good.  The default is 20.
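
              The acceptance rule shared by the slope and intercept parameters can be sketched as follows.
              This is an illustration only, not anfo-tool's code: the function name is_good is hypothetical,
              and the sketch assumes that a lower score is better, so that "no worse than" becomes "less
              than or equal to".

```python
def is_good(score, length, slope=7.5, intercept=20):
    """Return True if score is no worse than slope * (length - intercept).

    Assumption: lower scores are better, so "no worse" is modelled as "<=".
    """
    return score <= slope * (length - intercept)


# With the defaults, a 60-base read has a threshold of 7.5 * (60 - 20) = 300.
print(is_good(250, 60))  # True
print(is_good(350, 60))  # False
```

              Under this assumption, reads no longer than the intercept can only pass with a score of zero
              or below.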

       --set-context C
              Set  the  context parameter to C.  The context is the number of surrounding bases of the reference
              included when printing alignments in text form.  The default is 0.

       --set-genome G
              Set the genome parameter to G.  Many filters will  only  consider  the  best  alignments  to  this
              specific genome if it is set.  If no genome is set, the globally best alignment is used.

       --clear-genome
              Clear the genome parameter.  Filters apply to the globally best alignment afterwards.

   Filter Options
       Filters can be applied before merging the inputs or after splitting the merged stream back up.

       -s, --sort-pos=n
              Sort  by  alignment  position  while  buffering no more than n MiB in memory.  If a genome is set,
              alignments to that genome are used.

       -S, --sort-name=n
              Sort by read name while buffering no more than n MiB in memory.

       -l, --filter-length=L
              Retain alignments only for reads at least L bases long.  The reads themselves are kept.

       -f, --filter-score
              Retain alignments only if their score is good enough.  Uses slope and intercept.

       --filter-mapq=Q
              Remove alignments with mapping quality below Q.

       -h, --filter-hit=SEQ
              Keep only reads that have a hit to a sequence named SEQ.  If SEQ is empty, reads are kept if  they
              have any hit.  If the genome parameter is set, only hits to that genome count.

       --delete-hit=SEQ
              Delete  alignments  to SEQ.  If SEQ is empty, all alignments are deleted.  If the genome parameter
              is set, only alignments to that genome are deleted.

       --filter-qual=Q
              Mask out bases with quality below Q.  Such a base is replaced by the N ambiguity code.
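
              The masking step can be pictured with the short sketch below.  The function name and the
              representation of a read as a string of bases plus a list of integer qualities are assumptions
              made for illustration; they do not describe anfo-tool's internal data structures.

```python
def mask_low_quality(bases, quals, q):
    """Replace every base whose quality is below q with the ambiguity code N."""
    return "".join(b if s >= q else "N" for b, s in zip(bases, quals))


# Qualities 10 and 5 are below the cutoff of 20, so those bases are masked.
print(mask_low_quality("ACGT", [30, 10, 25, 5], 20))  # ANGN
```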

       --multiplicity=N
              Keep only reads of molecules that have been sequenced at least N times.  Reads are  considered  to
              come from the same original molecule if their aligned coordinates are identical.
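
              A minimal sketch of the grouping idea, assuming each read is a (name, coordinates) pair and
              that coordinates are compared for exact equality.  The names used here are hypothetical, and
              the real filter may handle details such as strand or unaligned reads differently.

```python
from collections import Counter


def filter_multiplicity(reads, n):
    """Keep reads whose aligned coordinates are shared by at least n reads.

    reads is a list of (name, coords) pairs; coords is any hashable value,
    for example a (sequence, start, end) tuple.
    """
    counts = Counter(coords for _, coords in reads)
    return [(name, coords) for name, coords in reads if counts[coords] >= n]


reads = [
    ("r1", ("chr1", 100, 150)),
    ("r2", ("chr1", 100, 150)),
    ("r3", ("chr1", 200, 250)),
]
# r1 and r2 share coordinates, r3 was seen only once.
print([name for name, _ in filter_multiplicity(reads, 2)])  # ['r1', 'r2']
```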

       --subsample=F
              Subsample a fraction F of the results.  Every read is independently and randomly chosen to be
              kept or not.
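
              Independent per-read selection means that the number of reads kept is itself random and only
              approximately F times the input size.  A sketch, assuming a hypothetical helper and Python's
              standard random number generator rather than whatever generator anfo-tool uses:

```python
import random


def subsample(reads, fraction, seed=None):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so fraction 0 keeps nothing and 1 keeps all.
    return [r for r in reads if rng.random() < fraction]


kept = subsample(list(range(1000)), 0.1, seed=1)
print(len(kept))  # roughly 100, the exact count varies with the seed
```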

       --inside-regions=FILE
              Read a list of regions from FILE, then keep only alignments that overlap an annotated region.

       --outside-regions=FILE
              Read a list of regions from FILE, then keep only alignments  that  do  not  overlap  an  annotated
              region.

   Special Filters
       -d, --rmdup=Q
              Remove PCR duplicates and clamp quality scores to Q.  Two reads are considered to be duplicates if
              their aligned coordinates are identical.  If a genome is set, the best alignment to that genome is
              used, else the globally best alignment.  Both alignments must be good, as determined by slope and
              intercept.  For a set of duplicates, a consensus is called, generally increasing the quality
              scores.  If a resulting quality score exceeds Q, it is set to Q.  This filter requires the input
              to be sorted by alignment coordinate on the selected genome.

       --duct-tape=NAME
              Duct-tape overlapping alignments into contigs and call a consensus for them.  If a genome is set,
              alignments to that genome are used, else the globally best alignments.  This filter requires
              input to be sorted by alignment coordinate on the genome.  Output is a set of contigs, every
              position gets assigned a consensus base, a quality score and likelihoods for every possible
              diallele.  (It is called duct-taping because it kind of looks like an assembly, but is not nearly
              as solid.)

       --edit-header=ED
              Invoke the editor ED on the text representation of the stream's header.  This can be used to clean
              up headers that have accumulated too much cruft.

   Merging Filters
       Exactly one merging filter should be given on the command line.  All filter options occurring before it
       are part of the input filter chains; all further filters become output chains.  If no merging filter is
       given, --concat is assumed, and all filters are input filters.
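
       How the position of the merging filter partitions the filter options can be sketched as below.  This
       is a simplified model, not anfo-tool's parser: the function name is hypothetical, and the sketch
       looks only at filter options, ignoring patterns, parameter settings and output options.

```python
MERGING = {"-c", "--concat", "-m", "--merge", "-j", "--join", "--mega-merge"}


def split_chains(filter_options):
    """Return (input_filters, merging_filter, output_filters)."""
    for i, opt in enumerate(filter_options):
        if opt in MERGING:
            # Everything before the merging filter belongs to the input chains,
            # everything after it to the output chains.
            return filter_options[:i], opt, filter_options[i + 1:]
    # No merging filter given: --concat is assumed, all filters are input filters.
    return list(filter_options), "--concat", []


print(split_chains(["--filter-score", "--merge", "--rmdup=40"]))
# (['--filter-score'], '--merge', ['--rmdup=40'])
```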

       -c, --concat
              Concatenate all input streams in the order they appear on the command line.

       -m, --merge
              Merge sorted input streams, producing a sorted result.  All inputs must be sorted in the same way.
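
              Merging already sorted streams needs only one pending record per stream.  The sketch below
              uses Python's heapq.merge to show the idea; modelling records as (sequence, position) tuples
              is an assumption for illustration and is unrelated to the native ANFO record format.

```python
import heapq


def merge_sorted(streams, key=None):
    """Lazily merge iterables that are each already sorted by `key`."""
    return heapq.merge(*streams, key=key)


a = [("chr1", 100), ("chr1", 300)]
b = [("chr1", 200), ("chr2", 50)]
print(list(merge_sorted([a, b])))
# [('chr1', 100), ('chr1', 200), ('chr1', 300), ('chr2', 50)]
```

              If the inputs are not all sorted by the same key, the result of such a merge is not sorted
              either, which is why the requirement above matters.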

       -j, --join
              Join input streams and retain the single best hits to each genome.  Every input stream must
              contain a record for every read.  Reads are buffered in memory until all of their hits are
              collected, so joining works well if all inputs are nearly in the same order.  If reads are
              missing from some streams, joining them will waste memory.

       --mega-merge
              Merge  many streams such as those produced by running anfo-sge.  Streams that operated on the same
              reads are joined, then everything is merged.

   Output Options
       If an output option is given on the command line, the current output filter chain is ended and a new  one
       is  started.   If  no  output option is given, a textual representation of the final stream is written to
       stdout.  All output options accept - to write to stdout.

       -o, --output FILE
              Write native binary stream (a compressed protobuf message) to FILE.  Writing a binary  stream  and
              reading it back in is lossless.

       --output-text FILE
              Write  protobuf  text  stream  to  FILE.   If  the  necessary  genomes  are  available,  a textual
              representation of the alignments is  included.   If  the  context  parameter  is  set,  that  many
              additional bases of the reference upstream and downstream from the alignment are included.

       --output-sam=FILE
              Write alignments in SAM format to FILE.

       --output-glz FILE
              Write contigs in GLZ 0.9 format to FILE.  Generating GLZ only works after application of
              --duct-tape; every contig becomes a GLZ record.

       --output-3aln FILE
              Write contigs in a table based format to FILE.  The format is still subject  to  change,  see  the
              source code for detailed documentation.

       --output-fasta FILE
              Write alignments(!) in FastA format to FILE.  Alignments are written as a pair of reference and query
              sequence, aligned coordinates are indicated in the description of  the  query  sequence.   If  the
              context parameter is set, that many additional bases of the reference upstream and downstream from
              the alignment are included.  This format is not suggested  for  any  serious  use,  it  exists  to
              support legacy applications.

       --output-fastq FILE
              Write  sequences(!) in FastQ format to FILE.  Writing FastQ effectively reconstitutes the input to
              ANFO if no filtering was done on the results.

       --output-table FILE
              Write per-alignment statistics to FILE.  The file has three columns: sequence length, alignment
              score,  difference  to  next  best  alignment.   It  is  mainly  useful  to  analyze/visualize the
              distribution of alignment scores.

       --stats FILE
              Write simple statistics to FILE.  This results in  some  simple  summary  statistics  of  a  whole
              stream: number of aligned sequences, average length, GC content.

ENVIRONMENT

       ANFO_PATH
              Colon separated list of directories searched for genome and index files.

       ANFO_TEMP
              Temporary space used for sorting of large files.

FILES

       /etc/popt
              The  system  wide  configuration  file for popt(3).  anfo-tool identifies itself as "anfo-tool" to
              popt.

       ~/.popt
              Per user configuration file for popt(3).

BUGS

       The command line of this tool is way too complicated and its semantics are counterintuitive.  Using
       anfo-tool is probably best avoided in most cases; the guile bindings should provide a much more scalable
       and easier to understand interface.

AUTHOR

       Udo Stenzel <udo_stenzel@eva.mpg.de>

SEE ALSO

       anfo(1), fa2dna(1), popt(3), fasta(5)