Provided by: pbgenomicconsensus_2.0.0+20151210-1_all

NAME

       variantCaller - variant-calling algorithms for PacBio sequencing data

SYNOPSIS

       variantCaller.py is invoked from the command line.  For example, a simple invocation is:

          variantCaller.py -j8 --algorithm=quiver \
                           -r lambdaNEB.fa        \
                           -o variants.gff        \
                           aligned_reads.cmp.h5

       which requests that variant calling proceed,

       • using 8 worker processes,

       • employing the quiver algorithm,

       • taking input from the file aligned_reads.cmp.h5,

       • using the FASTA file lambdaNEB.fa as the reference,

       • and writing output to variants.gff (see pbgff(5)).

       A particularly useful option is --referenceWindow/-w: this option allows the user to
       direct the tool to perform variant calling exclusively on a window of the reference
       genome.

OPTIONS

          variantCaller.py --help

       will provide a help message explaining all available options.

NOTES

   Input and output
       variantCaller.py requires two input files:

       • A file of reference-aligned reads in PacBio's standard cmp.h5 format;

       • A FASTA file that has been processed by ReferenceUploader.

       The tool's output is formatted in the GFF format, as described in pbgff(5).  External
       tools can be used to convert the GFF file to a VCF or BED file, two other standard
       interchange formats for variant calling.

       NOTE:
          Input cmp.h5 file requirements

          variantCaller.py requires its input cmp.h5 file to be sorted.  An unsorted file can
          be sorted using the tool cmpH5Sort.py.

          The quiver(1) algorithm in variantCaller requires its input cmp.h5  file  to  have  the
          following pulse features:

              • InsQV,

              • SubsQV,

              • DelQV,

              • DelTag,

              • MergeQV.

          The plurality(1) algorithm can be run on cmp.h5 files that lack these features.
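
          The feature requirement above can be expressed as a simple check.  The following
          is a minimal sketch, not part of variantCaller.py: the function names are
          hypothetical, and how the feature names are read from a cmp.h5 file is left
          outside the sketch.

```python
# Pulse features that quiver requires, as listed in this note.
QUIVER_REQUIRED_FEATURES = ("InsQV", "SubsQV", "DelQV", "DelTag", "MergeQV")


def missing_quiver_features(available):
    """Return the required features absent from `available`, in listed order.

    `available` is any iterable of feature names found in the input file
    (obtaining that list is assumed to happen elsewhere).
    """
    present = set(available)
    return [name for name in QUIVER_REQUIRED_FEATURES if name not in present]


def usable_algorithms(available):
    """Plurality needs none of the features; quiver needs all of them."""
    if missing_quiver_features(available):
        return ["plurality"]
    return ["plurality", "quiver"]
```

          For example, a file carrying only InsQV and SubsQV would be reported as missing
          DelQV, DelTag and MergeQV, leaving plurality as the only usable algorithm.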

       The input file is the main argument to variantCaller.py, while the output file is provided
       as an argument to the -o flag.  For example,

          variantCaller.py aligned_reads.cmp.h5 -r lambda.fa  -o variants.gff

       will read input from aligned_reads.cmp.h5, using the reference lambda.fa, and send  output
       to  the  file  variants.gff.   The  extension  of  the filename provided to the -o flag is
       meaningful, as  it  determines  the  output  file  format.   The  file  formats  presently
       supported, by extension, are

       .gff   GFFv3 format

       .txt   a simplified human-readable format used primarily by the developers

       If the -o flag is not provided, the default behavior is to write to a file named
       variants.gff in the current directory.
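
       The extension rule can be illustrated with a short sketch.  This is not the tool's
       own code: the names FORMATS, DEFAULT_OUTPUT and output_format are hypothetical, and
       rejecting an unknown extension with an exception is an assumption made for the
       example.

```python
import os

# Extensions and formats as listed above.
FORMATS = {
    ".gff": "GFFv3",
    ".txt": "simplified human-readable text",
}

# Default output file when -o is not given, as stated above.
DEFAULT_OUTPUT = "variants.gff"


def output_format(path=None):
    """Return the output format implied by the filename extension."""
    if path is None:
        path = DEFAULT_OUTPUT
    extension = os.path.splitext(path)[1]
    if extension not in FORMATS:
        # Assumption: this sketch treats unsupported extensions as an error.
        raise ValueError("unsupported output extension: %r" % extension)
    return FORMATS[extension]
```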

       NOTE:
          variantCaller.py does not modify its input cmp.h5 file in any way.  This is in contrast
          to  previous variant callers in use at PacBio, which would write a consensus dataset to
          the input cmp.h5 file.

   Available algorithms
       At this time there are two algorithms available for variant calling: plurality and quiver.

       Plurality is a simple and very fast procedure that merely tallies the most  frequent  read
       base or bases found in alignment with each reference base, and reports deviations from the
       reference as potential variants.
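
       The tallying idea can be sketched for a single reference position as follows.  This
       is a simplified illustration rather than the plurality implementation: it reports
       only one most frequent base, the function name is hypothetical, and the use of "-"
       to mark a deleted base is a convention assumed for the example.

```python
from collections import Counter


def plurality_call(ref_base, read_bases):
    """Return (consensus_base, is_variant) for one reference position.

    `read_bases` holds the read bases aligned to this reference base;
    "-" stands for a deletion (assumed convention for this sketch).
    """
    if not read_bases:
        # No coverage: this sketch falls back to the reference base.
        return ref_base, False
    counts = Counter(read_bases)
    # most_common(1) yields the most frequent base; among equal counts the
    # base encountered first is returned.
    consensus, _ = counts.most_common(1)[0]
    return consensus, consensus != ref_base
```

       With the reads C, C, C and A aligned to a reference base A, the sketch returns C
       and flags the position as a potential variant.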

       Quiver is a more complex procedure based  on  algorithms  originally  developed  for  CCS.
       Quiver  leverages  the  quality  values (QVs) provided by upstream processing tools, which
       provide insight into whether insertions/deletions/substitutions were deemed  likely  at  a
       given read position.  Use of quiver requires the ConsensusCore library as well as a
       trained parameter set, which will be loaded from a standard location (TBD).  Quiver can
       be thought
       of as a QV-aware local-realignment procedure.

       Both  algorithms  are expected to converge to zero errors (miscalled variants) as coverage
       increases; however, quiver should converge much faster (i.e., fewer errors at low
       coverage), and should provide greater variant detection power at a given error level.

   Confidence values
       Both  quiver  and  plurality  make a confidence metric available for every position of the
       consensus sequence.  The confidence should be interpreted as a phred-transformed posterior
       probability that the consensus call is incorrect; i.e.

          QV = -10 \log_{10}(p_{err})

       variantCaller.py clips reported QV values at 93, because larger values cannot be
       encoded in a standard FASTQ file.
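
       The transformation and the clipping can be sketched as below.  The function name is
       hypothetical, and rounding to the nearest integer is an assumption of the sketch;
       the text above specifies only the formula and the ceiling of 93.

```python
import math

MAX_QV = 93  # ceiling stated above


def phred_qv(p_err):
    """Convert an error probability into a clipped, phred-scaled QV."""
    if not 0.0 <= p_err <= 1.0:
        raise ValueError("p_err must be a probability in [0, 1]")
    if p_err == 0.0:
        # log10(0) is undefined; a zero error probability maps to the ceiling.
        return MAX_QV
    qv = -10.0 * math.log10(p_err)
    # Assumption: round to the nearest integer before clipping.
    return min(MAX_QV, int(round(qv)))
```

       For instance, an error probability of 0.001 corresponds to QV 30, while a
       probability of 1e-12 (QV 120 before clipping) is reported as 93.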

   Chemistry specificity
       The Quiver algorithm parameters are trained per-chemistry.   SMRTanalysis  software  loads
       metadata into the cmp.h5 to indicate the chemistry used per movie.  Quiver sees this table
       and automatically chooses the appropriate parameter set to use.   This  selection  can  be
       overridden by a command line flag.

       When multiple chemistries are represented in the reads in a cmp.h5, Quiver will model each
       read appropriately using the parameter  set  for  its  chemistry,  thus  yielding  optimal
       results.

   Performance Requirements
       variantCaller.py  performs  variant  calling  in  parallel using multiple processes.  Work
       splitting and inter-process communication are handled  using  the  Python  multiprocessing
       module.   Work  can  be  split  among  an  arbitrary  number  of  processes  (using the -j
       command-line flag), but for best performance one should use no more worker processes  than
       there are CPUs in the host computer.
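
       One way to follow this advice from a wrapper script is to cap the -j value at the
       CPU count reported by the multiprocessing module.  The helper below is a
       hypothetical sketch that only assembles the argument list, using the flags shown in
       the SYNOPSIS; it does not run the tool.

```python
import multiprocessing


def build_command(cmp_h5, reference, output, algorithm="quiver", workers=None):
    """Build a variantCaller.py argument list with a capped worker count."""
    cpus = multiprocessing.cpu_count()
    if workers is None:
        workers = cpus
    # Never request fewer than one worker or more workers than CPUs.
    workers = max(1, min(workers, cpus))
    return [
        "variantCaller.py",
        "-j%d" % workers,
        "--algorithm=%s" % algorithm,
        "-r", reference,
        "-o", output,
        cmp_h5,
    ]
```

       The resulting list could then be passed to subprocess.call() or a similar
       launcher.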

       The  running  time  of  the plurality algorithm should not exceed the runtime of the BLASR
       process that produced the cmp.h5. The running time of  the  quiver  algorithm  should  not
       exceed 4x the runtime of BLASR.

       The amount of core memory (RAM) used among all the Python processes launched by a
       variantCaller.py run should not exceed the size of the uncompressed input .cmp.h5 file.

SEE ALSO

       quiver(1) plurality(1) pbgff(5) blasr(1)

                                          February 2016                          VARIANTCALLER(1)