Provided by: pbgenomicconsensus_2.1.0-1_all bug

NAME

       variantCaller - variant-calling algorithms for PacBio sequencing data

SYNOPSIS

       variantCaller is invoked from the command line.  For example, a simple invocation is:

              variantCaller -j8 --algorithm=quiver \
                                -r lambdaNEB.fa        \
                                -o variants.gff        \
                                aligned_reads.cmp.h5

       which  requests  that  variant  calling proceed, - using 8 worker processes, - employing the quiver algo‐
       rithm, - taking input from the file aligned_reads.cmp.h5, - using the FASTA file lambdaNEB.fa as the ref‐
       erence, - and writing output to variants.gff (see pbgff(5)).

       A particularly useful option is --referenceWindow/-w: this option allows the user to direct the  tool  to
       perform variant calling exclusively on a window of the reference genome, where the

OPTIONS

              variantCaller --help

       will provide a help message explaining all available options.

NOTES

   Input and output
       variantCaller requires two input files:

       • A file of reference-aligned reads in PacBio's standard cmp.h5 format;

       • A FASTA file that has been processed by ReferenceUploader.

       The tool's output is formatted in the GFF format, as described in (how to link to other file?).  External
       tools  can  be used to convert the GFF file to a VCF or BED file---two other standard interchange formats
       for variant calling.

              note

              Input cmp.h5 file requirements

              variantCaller requires its input cmp.h5 file to be be sorted.  An unsorted file can be sorting us‐
              ing the tool cmpH5Sort.py.

              The quiver(1) algorithm in variantCaller requires its input cmp.h5  file  to  have  the  following
              pulse features: - InsQV, - SubsQV, - DelQV, - DelTag, - MergeQV.

              The plurality(1) algorithm can be run on cmp.h5 files that lack these features.

       The input file is the main argument to variantCaller, while the output file is provided as an argument to
       the -o flag.  For example,

              variantCaller aligned_reads.cmp.h5 -r lambda.fa  -o variants.gff

       will  read  input  from  aligned_reads.cmp.h5, using the reference lambda.fa, and send output to the file
       variants.gff.  The extension of the filename provided to the -o flag is meaningful, as it determines  the
       output file format.  The file formats presently supported, by extension, are

       .gff   GFFv3 format

       .txt   a simplified human readable format used primarily by the developers

       If the -o flag is not provided, the default behavior is to output to a variants.gff in the current direc‐
       tory.

              note

              variantCaller  does  not modify its input cmp.h5 file in any way.  This is in contrast to previous
              variant callers in use at PacBio, which would write a consensus dataset to the input cmp.h5 file.

   Available algorithms
       At this time there are two algorithms available for variant calling: plurality and quiver.

       Plurality is a simple and very fast procedure that merely tallies the most frequent read  base  or  bases
       found in alignment with each reference base, and reports deviations from the reference as potential vari‐
       ants.

       Quiver  is  a  more complex procedure based on algorithms originally developed for CCS.  Quiver leverages
       the quality values (QVs) provided by upstream processing tools, which provide insight into whether inser‐
       tions/deletions/substitutions were deemed likely at a given read position.  Use of  quiver  requires  the
       ConsensusCore  and ConsensusCore2 libraries as well as trained parameter set, which will be loaded from a
       standard location (TBD).  Arrow and Quiver can be thought of as local-realignment procedures (QV-aware in
       the case of Quiver).

       Both algorithms are expected to converge to zero errors (miscalled variants) as coverage increases;  how‐
       ever  quiver should converge much faster (i.e., fewer errors at low coverage), and should provide greater
       variant detection power at a given error level.

   Confidence values
       Both quiver and plurality make a confidence metric available for every  position  of  the  consensus  se‐
       quence.   The confidence should be interpreted as a phred-transformed posterior probability that the con‐
       sensus call is incorrect; i.e.

              QV = −10log~10~(p~err~)

       variantCaller clips reported QV values at 93---larger values cannot be encoded in a standard FASTQ file.

   Chemistry specificity
       The Quiver algorithm parameters are trained per-chemistry.  SMRTanalysis software loads metadata into the
       cmp.h5 to indicate the chemistry used per movie.  Quiver sees this table and  automatically  chooses  the
       appropriate parameter set to use.  This selection can be overridden by a command line flag.

       When multiple chemistries are represented in the reads in a cmp.h5, Quiver will model each read appropri‐
       ately using the parameter set for its chemistry, thus yielding optimal results.

   Performance Requirements
       variantCaller  performs  variant  calling  in  parallel using multiple processes.  Work splitting and in‐
       ter-process communication are handled using the Python multiprocessing module.  Work can be  split  among
       an  arbitrary  number  of processes (using the -j command-line flag), but for best performance one should
       use no more worker processes than there are CPUs in the host computer.

       The running time of the plurality algorithm should not exceed the runtime of the BLASR process that  pro‐
       duced the cmp.h5.  The running time of the quiver algorithm should not exceed 4x the runtime of BLASR.

       The  amount  of  core  memory  (RAM)  used among all the python processes launched by a variantCaller run
       should not exceed the size of the uncompressed input .cmp.h5 file.

EXAMPLES

   Basic running instructions
       Basic usage---using 8 CPUs to compute consensus of mapped reads and variants relative to a reference---is
       as follows:

              % quiver -j8 aligned_reads{.cmp.h5, .bam, .fofn, or .xml} \
              >     -r reference{.fasta or .xml} -o variants.gff        \
              >     -o consensus.fasta -o consensus.fastq

       quiver is a shortcut for variantCaller --algorithm=quiver.  Naturally, to use arrow you could use the ar‐
       row shortcut or variantCaller --algorithm=arrow.

       in this  example  we  perform  haploid  consensus  and  variant  calling  on  the  mapped  reads  in  the
       aligned_reads.bam which was aligned to reference.fasta.  The reference.fasta is only used for designating
       variant  calls,  not  for computing the consensus.  The consensus quality score for every position can be
       found in the output FASTQ file.

       Note that 2.3 SMRTanalysis does not support "dataset" input (FOFN or XML files); those who need this fea‐
       ture should wait for the forthcoming release of SMRTanalysis 3.0 or build from GitHub sources.

   Running a large-scale resequencing/polishing job in SMRTanalysis
       2.3

       We do not recommend attempting to construct a single giant cmp.h5 file and then processing it on a single
       node.  This is inefficient and users attempting to do this have run into many problems with the instabil‐
       ity of the HDF5 library (which PacBio is moving away from, in favor of BAM.)

       To run a large-scale resequencing job (>50 megabase genome @ 50x coverage,nominally), you want to  spread
       the computation load across multiple nodes in your computing cluster.

       The smrtpipe workflow engine in SMRTanalysis 2.3 provides a convenient workflow automating this---it will
       automatically  spread the load for both mapping and quiver jobs among your available cluster nodes.  This
       is accessible via the SMRTportal UI; the simplest way to set up and run thse workflows  is  via  tha  UI.
       Nonetheless, we include command-line instructions for completeness.

       If you have to run the smrtpipe workflow manually from the command line, a recipe is as folows:

       1. Make  sure  the  reference  you  will  align and compare against is present in a SMRTportal "reference
          repository".  Even if you don't want to use SMRTportal, you need to build/import the reference  appro‐
          priately, and the simplest way to do that is via SMRTportal.  If you don't have a SMRTportal instance,
          you can use the referenceUploader command to prepare your reference repository.

       2. Prepare an "input.fofn" file listing, one-per-line, each "bax.h5" file in your input data set.

       3. Convert the "input.fofn" to an "input.xml" file that SMRTpipe can understand:

           $ fofnToSmrtpipeInput.py input.fofn > input.xml

       4. Prepare  your  "params.xml"  file.  Here is a params.xml template (./params-template.xml) you can use;
          you should just need to edit the reference path.

       5. Activate your SMRTanalysis environment, and invoke smrtpipe:

           $ source <SMRT Analysis>/etc/setup.sh $ smrtpipe.py --distribute --params=params.xml xml:input.xml

       6. After successful execution is complete, the results should be available as  data/consensus.fast[aq].gz
          and data/variants.gff.gz, etc.

       Please  consult  the SMRTpipe reference manual (http://www.pacb.com/wp-content/uploads/2015/09/SMRT-Pipe-
       Reference-Guide.pdf) for further information.

       Note that resequencing (mapping reads against a reference genome and then calling consensus and identify‐
       ing variants) and polishing (mapping reads against a draft assembly and then taking the consensus  output
       as  the  final,  polished, assembly) are the same algorithmic operation, the only effective difference is
       that the "variants.gff" output is not biologically meaningful in the polishing case---it just records the
       edits that were made to the draft to produce the polished assembly.

   Running a large-scale quiver/arrow job in SMRTanalysis 3.0+
       (Forthcoming)

SEE ALSO

       quiver(1) arrow(1) plurality(1) pbgff(5) blasr(1)

                                                                                                variantCaller(1)