Provided by: pbgenomicconsensus_2.1.0-1_all bug

NAME

       variantCaller - variant-calling algorithms for PacBio sequencing data

SYNOPSIS

       variantCaller is invoked from the command line.  For example, a simple invocation is:

              variantCaller -j8 --algorithm=quiver \
                                -r lambdaNEB.fa        \
                                -o variants.gff        \
                                aligned_reads.cmp.h5

       which  requests  that variant calling proceed, - using 8 worker processes, - employing the
       quiver algorithm, - taking input from the file aligned_reads.cmp.h5,  -  using  the  FASTA
       file lambdaNEB.fa as the reference, - and writing output to variants.gff (see pbgff(5)).

       A  particularly  useful  option  is  --referenceWindow/-w:  this option allows the user to
       direct the tool to perform variant calling  exclusively  on  a  window  of  the  reference
       genome, where the

OPTIONS

              variantCaller --help

       will provide a help message explaining all available options.

NOTES

   Input and output
       variantCaller requires two input files:

       • A file of reference-aligned reads in PacBio's standard cmp.h5 format;

       • A FASTA file that has been processed by ReferenceUploader.

       The  tool's  output  is formatted in the GFF format, as described in (how to link to other
       file?).  External tools can be used to convert the GFF file to a  VCF  or  BED  file---two
       other standard interchange formats for variant calling.

              note

              Input cmp.h5 file requirements

              variantCaller requires its input cmp.h5 file to be be sorted.  An unsorted file can
              be sorting using the tool cmpH5Sort.py.

              The quiver(1) algorithm in variantCaller requires its input cmp.h5 file to have the
              following pulse features: - InsQV, - SubsQV, - DelQV, - DelTag, - MergeQV.

              The plurality(1) algorithm can be run on cmp.h5 files that lack these features.

       The input file is the main argument to variantCaller, while the output file is provided as
       an argument to the -o flag.  For example,

              variantCaller aligned_reads.cmp.h5 -r lambda.fa  -o variants.gff

       will read input from aligned_reads.cmp.h5, using the reference lambda.fa, and send  output
       to  the  file  variants.gff.   The  extension  of  the filename provided to the -o flag is
       meaningful, as  it  determines  the  output  file  format.   The  file  formats  presently
       supported, by extension, are

       .gff   GFFv3 format

       .txt   a simplified human readable format used primarily by the developers

       If the -o flag is not provided, the default behavior is to output to a variants.gff in the
       current directory.

              note

              variantCaller does not modify its input  cmp.h5  file  in  any  way.   This  is  in
              contrast  to  previous  variant  callers  in  use  at  PacBio,  which would write a
              consensus dataset to the input cmp.h5 file.

   Available algorithms
       At this time there are two algorithms available for variant calling: plurality and quiver.

       Plurality is a simple and very fast procedure that merely tallies the most  frequent  read
       base or bases found in alignment with each reference base, and reports deviations from the
       reference as potential variants.

       Quiver is a more complex procedure based  on  algorithms  originally  developed  for  CCS.
       Quiver  leverages  the  quality  values (QVs) provided by upstream processing tools, which
       provide insight into whether insertions/deletions/substitutions were deemed  likely  at  a
       given  read  position.   Use  of  quiver  requires  the  ConsensusCore  and ConsensusCore2
       libraries as well as trained parameter set, which will be loaded from a standard  location
       (TBD).   Arrow  and  Quiver can be thought of as local-realignment procedures (QV-aware in
       the case of Quiver).

       Both algorithms are expected to converge to zero errors (miscalled variants)  as  coverage
       increases;  however  quiver  should  converge  much  faster  (i.e.,  fewer  errors  at low
       coverage), and should provide greater variant detection power at a given error level.

   Confidence values
       Both quiver and plurality make a confidence metric available for  every  position  of  the
       consensus sequence.  The confidence should be interpreted as a phred-transformed posterior
       probability that the consensus call is incorrect; i.e.

              QV = −10log~10~(p~err~)

       variantCaller clips reported QV values at  93---larger  values  cannot  be  encoded  in  a
       standard FASTQ file.

   Chemistry specificity
       The  Quiver  algorithm  parameters are trained per-chemistry.  SMRTanalysis software loads
       metadata into the cmp.h5 to indicate the chemistry used per movie.  Quiver sees this table
       and  automatically  chooses  the  appropriate parameter set to use.  This selection can be
       overridden by a command line flag.

       When multiple chemistries are represented in the reads in a cmp.h5, Quiver will model each
       read  appropriately  using  the  parameter  set  for  its chemistry, thus yielding optimal
       results.

   Performance Requirements
       variantCaller performs  variant  calling  in  parallel  using  multiple  processes.   Work
       splitting  and  inter-process  communication  are handled using the Python multiprocessing
       module.  Work can  be  split  among  an  arbitrary  number  of  processes  (using  the  -j
       command-line  flag), but for best performance one should use no more worker processes than
       there are CPUs in the host computer.

       The running time of the plurality algorithm should not exceed the  runtime  of  the  BLASR
       process  that  produced  the  cmp.h5.  The running time of the quiver algorithm should not
       exceed 4x the runtime of BLASR.

       The amount of core memory (RAM)  used  among  all  the  python  processes  launched  by  a
       variantCaller run should not exceed the size of the uncompressed input .cmp.h5 file.

EXAMPLES

   Basic running instructions
       Basic usage---using 8 CPUs to compute consensus of mapped reads and variants relative to a
       reference---is as follows:

              % quiver -j8 aligned_reads{.cmp.h5, .bam, .fofn, or .xml} \
              >     -r reference{.fasta or .xml} -o variants.gff        \
              >     -o consensus.fasta -o consensus.fastq

       quiver is a shortcut for variantCaller --algorithm=quiver.  Naturally, to  use  arrow  you
       could use the arrow shortcut or variantCaller --algorithm=arrow.

       in  this  example  we perform haploid consensus and variant calling on the mapped reads in
       the aligned_reads.bam which was aligned to reference.fasta.  The reference.fasta  is  only
       used  for  designating  variant  calls,  not  for  computing the consensus.  The consensus
       quality score for every position can be found in the output FASTQ file.

       Note that 2.3 SMRTanalysis does not support "dataset" input (FOFN or XML files); those who
       need  this  feature  should  wait for the forthcoming release of SMRTanalysis 3.0 or build
       from GitHub sources.

   Running a large-scale resequencing/polishing job in SMRTanalysis
       2.3

       We do not recommend attempting to construct a single giant cmp.h5 file and then processing
       it  on  a  single node.  This is inefficient and users attempting to do this have run into
       many problems with the instability of the HDF5 library (which PacBio is moving away  from,
       in favor of BAM.)

       To  run a large-scale resequencing job (>50 megabase genome @ 50x coverage,nominally), you
       want to spread the computation load across multiple nodes in your computing cluster.

       The smrtpipe workflow engine in SMRTanalysis 2.3 provides a convenient workflow automating
       this---it  will  automatically spread the load for both mapping and quiver jobs among your
       available cluster nodes.  This is accessible via the SMRTportal UI; the  simplest  way  to
       set  up  and  run  thse  workflows  is  via  tha UI.  Nonetheless, we include command-line
       instructions for completeness.

       If you have to run the smrtpipe workflow manually from the command line, a  recipe  is  as
       folows:

       1. Make  sure  the reference you will align and compare against is present in a SMRTportal
          "reference repository".  Even if  you  don't  want  to  use  SMRTportal,  you  need  to
          build/import  the  reference  appropriately,  and  the  simplest  way to do that is via
          SMRTportal.  If you don't have a SMRTportal instance, you can use the referenceUploader
          command to prepare your reference repository.

       2. Prepare  an  "input.fofn"  file listing, one-per-line, each "bax.h5" file in your input
          data set.

       3. Convert the "input.fofn" to an "input.xml" file that SMRTpipe can understand:

           $ fofnToSmrtpipeInput.py input.fofn > input.xml

       4. Prepare your "params.xml" file.  Here is a params.xml template  (./params-template.xml)
          you can use; you should just need to edit the reference path.

       5. Activate your SMRTanalysis environment, and invoke smrtpipe:

           $  source  <SMRT Analysis>/etc/setup.sh $ smrtpipe.py --distribute --params=params.xml
           xml:input.xml

       6. After  successful  execution  is  complete,  the  results  should   be   available   as
          data/consensus.fast[aq].gz and data/variants.gff.gz, etc.

       Please     consult     the     SMRTpipe    reference    manual    (http://www.pacb.com/wp-
       content/uploads/2015/09/SMRT-Pipe-Reference-Guide.pdf) for further information.

       Note that resequencing  (mapping  reads  against  a  reference  genome  and  then  calling
       consensus  and identifying variants) and polishing (mapping reads against a draft assembly
       and then taking the consensus output as  the  final,  polished,  assembly)  are  the  same
       algorithmic  operation, the only effective difference is that the "variants.gff" output is
       not biologically meaningful in the polishing case---it just records the  edits  that  were
       made to the draft to produce the polished assembly.

   Running a large-scale quiver/arrow job in SMRTanalysis 3.0+
       (Forthcoming)

SEE ALSO

       quiver(1) arrow(1) plurality(1) pbgff(5) blasr(1)

                                                                                 variantCaller(1)