
Provided by: pbgenomicconsensus_2.0.0+20151210-1_all

NAME

       variantCaller - variant-calling algorithms for PacBio sequencing data

SYNOPSIS

       variantCaller.py is invoked from the command line.  For example, a simple invocation is:

          variantCaller.py -j8 --algorithm=quiver \
                           -r lambdaNEB.fa        \
                           -o variants.gff        \
                           aligned_reads.cmp.h5

       which requests that variant calling proceed,

       • using 8 worker processes,

       • employing the quiver algorithm,

       • taking input from the file aligned_reads.cmp.h5,

       • using the FASTA file lambdaNEB.fa as the reference,

       • and writing output to variants.gff (see pbgff(5)).
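
       The invocation above can also be assembled from a script.  The following is a minimal sketch, not
       part of the tool: build_variant_caller_command is a hypothetical helper name, and it uses only the
       flags shown in the example above (-j, --algorithm, -r, -o).

```python
import shlex


def build_variant_caller_command(aligned_reads, reference,
                                 output="variants.gff",
                                 algorithm="quiver", workers=8):
    """Return an argv list mirroring the example invocation (hypothetical helper)."""
    if algorithm not in ("quiver", "plurality"):
        raise ValueError("unknown algorithm: %r" % (algorithm,))
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return [
        "variantCaller.py",
        "-j%d" % workers,              # number of worker processes
        "--algorithm=%s" % algorithm,  # quiver or plurality
        "-r", reference,               # reference FASTA
        "-o", output,                  # output file; extension selects the format
        aligned_reads,                 # input cmp.h5 is the main argument
    ]


cmd = build_variant_caller_command("aligned_reads.cmp.h5", "lambdaNEB.fa")
print(" ".join(shlex.quote(arg) for arg in cmd))
# To run it (requires variantCaller.py on PATH):
#   import subprocess; subprocess.check_call(cmd)
```

       Building the command as a list rather than a single string avoids shell-quoting problems when file
       names contain spaces.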

       A particularly useful option is --referenceWindow/-w, which directs the tool to perform variant
       calling exclusively on a window of the reference genome.

OPTIONS

          variantCaller.py --help

       will provide a help message explaining all available options.

NOTES

   Input and output
       variantCaller.py requires two input files:

       • A file of reference-aligned reads in PacBio's standard cmp.h5 format;

       • A FASTA file that has been processed by ReferenceUploader.

       The tool's output is formatted in the GFF format, as described in pbgff(5).  External tools can be
       used to convert the GFF file to a VCF or BED file, two other standard interchange formats for
       variant calling.

       NOTE:
          Input cmp.h5 file requirements

          variantCaller.py requires its input cmp.h5 file to be sorted.  An unsorted file can be sorted
          using the tool cmpH5Sort.py.

          The  quiver(1)  algorithm  in variantCaller requires its input cmp.h5 file to have the following pulse
          features:

              • InsQV,

              • SubsQV,

              • DelQV,

              • DelTag,

              • MergeQV.

          The plurality(1) algorithm can be run on cmp.h5 files that lack these features.
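
       The feature requirement above can be expressed as a simple pre-flight check.  This is a sketch only:
       the function names are hypothetical, the caller is assumed to have already obtained the list of
       pulse-feature names present in the cmp.h5 file, and falling back to plurality is an illustrative
       policy rather than documented tool behavior.

```python
# Pulse features that quiver requires, as listed above.
QUIVER_REQUIRED_FEATURES = ("InsQV", "SubsQV", "DelQV", "DelTag", "MergeQV")


def missing_quiver_features(available_features):
    """Return the required features absent from available_features, in list order."""
    available = set(available_features)
    return [name for name in QUIVER_REQUIRED_FEATURES if name not in available]


def usable_algorithm(available_features):
    """Illustrative policy: use quiver only when every required feature is present."""
    return "plurality" if missing_quiver_features(available_features) else "quiver"


# Example with a hypothetical file that lacks MergeQV.
features_in_file = ["InsQV", "SubsQV", "DelQV", "DelTag"]
print(missing_quiver_features(features_in_file))
print(usable_algorithm(features_in_file))
```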

       The input file is the main argument to variantCaller.py, while the output file is provided as an argument
       to the -o flag.  For example,

          variantCaller.py aligned_reads.cmp.h5 -r lambda.fa  -o variants.gff

       will  read  input  from  aligned_reads.cmp.h5, using the reference lambda.fa, and send output to the file
       variants.gff.  The extension of the filename provided to the -o flag is meaningful, as it determines  the
       output file format.  The file formats presently supported, by extension, are

       .gff   GFFv3 format

       .txt   a simplified human-readable format used primarily by the developers

       If the -o flag is not provided, the default behavior is to write output to a file named
       variants.gff in the current directory.
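
       The extension-to-format rule can be sketched as a small lookup.  This is an illustration of the
       rule as described here, not the tool's own code; the names OUTPUT_FORMATS and output_format are
       hypothetical, and rejecting unknown extensions is an assumption.

```python
import os

# Extensions and formats as listed above.
OUTPUT_FORMATS = {
    ".gff": "GFFv3",
    ".txt": "simplified human-readable text",
}
DEFAULT_OUTPUT = "variants.gff"  # used when -o is not given


def output_format(path=None):
    """Return the format implied by the output file name's extension."""
    if path is None:
        path = DEFAULT_OUTPUT
    extension = os.path.splitext(path)[1]
    if extension not in OUTPUT_FORMATS:
        # Assumption: an unrecognized extension is treated as an error.
        raise ValueError("unsupported output extension: %r" % (extension,))
    return OUTPUT_FORMATS[extension]


print(output_format("variants.gff"))
print(output_format("debug.txt"))
print(output_format())
```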

       NOTE:
          variantCaller.py does not modify its input cmp.h5 file in any way.  This is in  contrast  to  previous
          variant callers in use at PacBio, which would write a consensus dataset to the input cmp.h5 file.

   Available algorithms
       At this time there are two algorithms available for variant calling: plurality and quiver.

       Plurality  is  a  simple and very fast procedure that merely tallies the most frequent read base or bases
       found in alignment with each reference base, and reports  deviations  from  the  reference  as  potential
       variants.
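
       The tallying idea can be illustrated with a toy example.  This sketch is a deliberate
       simplification and not the tool's implementation: it assumes the read bases aligned to each
       reference position have already been collected into one string per position, it ignores
       insertions and deletions, and it uses 0-based positions.  The function names are hypothetical.

```python
from collections import Counter


def plurality_call(column_bases):
    """Return the most frequent base(s) in one alignment column, sorted; [] if empty."""
    counts = Counter(column_bases)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(base for base, n in counts.items() if n == top)


def plurality_variants(reference, columns):
    """Report (position, reference base, called bases) wherever the call differs."""
    variants = []
    for position, (ref_base, column) in enumerate(zip(reference, columns)):
        call = plurality_call(column)
        if call and call != [ref_base]:
            variants.append((position, ref_base, call))
    return variants


reference = "ACGT"
columns = ["AAA", "TTC", "GGG", "TT"]  # read bases aligned to each reference base
print(plurality_variants(reference, columns))
```

       Here only position 1 is reported, because two of the three reads show T where the reference has C.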

       Quiver  is  a  more complex procedure based on algorithms originally developed for CCS.  Quiver leverages
       the quality values (QVs) provided by upstream  processing  tools,  which  provide  insight  into  whether
       insertions/deletions/substitutions  were  deemed likely at a given read position.  Use of quiver requires
       the ConsensusCore library as well as a trained parameter set, which will be loaded from a standard location
       (TBD).  Quiver can be thought of as a QV-aware local-realignment procedure.

       Both  algorithms  are  expected  to  converge  to zero errors (miscalled variants) as coverage increases;
       however, quiver should converge much faster (i.e., with fewer errors at low coverage), and should provide
       greater variant detection power at a given error level.

   Confidence values
       Both  quiver  and  plurality  make  a  confidence  metric  available  for every position of the consensus
       sequence.  The confidence should be interpreted as a phred-transformed  posterior  probability  that  the
       consensus call is incorrect; i.e.

          QV = -10 \log_{10}(p_{err})

       variantCaller.py clips reported QV values at 93, because larger values cannot be encoded in a
       standard FASTQ file.
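
       The transformation and the clipping can be sketched as follows.  The function names are
       hypothetical, and rounding to the nearest integer is an assumption made for illustration.  The
       limit of 93 follows from the usual FASTQ encoding, which stores a quality value q as the ASCII
       character with code 33 + q; the last printable ASCII character, "~", has code 126.

```python
import math

MAX_QV = 93  # 33 + 93 == 126, the code of "~", the last printable ASCII character


def error_probability_to_qv(p_err, max_qv=MAX_QV):
    """Phred-transform an error probability and clip the result at max_qv."""
    if not 0.0 <= p_err <= 1.0:
        raise ValueError("p_err must lie in [0, 1]")
    if p_err == 0.0:
        return max_qv  # the transform is unbounded here, so report the clip value
    qv = -10.0 * math.log10(p_err)
    return min(int(round(qv)), max_qv)  # rounding is an assumption of this sketch


def qv_to_fastq_char(qv):
    """Encode a QV in the range 0..93 as a Phred+33 FASTQ quality character."""
    if not 0 <= qv <= MAX_QV:
        raise ValueError("QV out of encodable range")
    return chr(33 + qv)


print(error_probability_to_qv(0.001))
print(error_probability_to_qv(1e-12))
print(qv_to_fastq_char(MAX_QV))
```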

   Chemistry specificity
       The Quiver algorithm parameters are trained per-chemistry.  SMRTanalysis software loads metadata into the
       cmp.h5  to  indicate  the chemistry used per movie.  Quiver sees this table and automatically chooses the
       appropriate parameter set to use.  This selection can be overridden by a command line flag.

       When multiple chemistries are represented in  the  reads  in  a  cmp.h5,  Quiver  will  model  each  read
       appropriately using the parameter set for its chemistry, thus yielding optimal results.

   Performance Requirements
       variantCaller.py  performs  variant  calling  in  parallel  using multiple processes.  Work splitting and
       inter-process communication are handled using the Python multiprocessing module.  Work can be split among
       an  arbitrary  number  of processes (using the -j command-line flag), but for best performance one should
       use no more worker processes than there are CPUs in the host computer.
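
       A wrapper script could apply this advice before choosing a value for -j.  The sketch below is
       illustrative only; effective_worker_count is a hypothetical helper and not part of the tool.

```python
import multiprocessing


def effective_worker_count(requested):
    """Cap the requested number of worker processes at the host's CPU count."""
    if requested < 1:
        raise ValueError("at least one worker process is required")
    return min(requested, multiprocessing.cpu_count())


workers = effective_worker_count(8)
print("-j%d" % workers)
```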

       The running time of the plurality algorithm should not exceed the  runtime  of  the  BLASR  process  that
       produced the cmp.h5. The running time of the quiver algorithm should not exceed 4x the runtime of BLASR.

       The total amount of core memory (RAM) used by all the Python processes launched by a
       variantCaller.py run should not exceed the size of the uncompressed input .cmp.h5 file.
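
       The runtime and memory targets above can be written down as simple checks, for example in a
       benchmarking script.  This is a sketch under stated assumptions: the function names are
       hypothetical, and the measurements (run times in seconds, sizes in bytes) are assumed to have
       been collected elsewhere.

```python
# Runtime limits relative to the BLASR run that produced the cmp.h5, as stated above.
RUNTIME_FACTOR = {"plurality": 1.0, "quiver": 4.0}


def within_runtime_budget(algorithm, caller_seconds, blasr_seconds):
    """True if the variant caller ran no longer than its allowed multiple of BLASR."""
    return caller_seconds <= RUNTIME_FACTOR[algorithm] * blasr_seconds


def within_memory_budget(total_rss_bytes, uncompressed_cmp_h5_bytes):
    """True if the combined RAM of all processes stays within the input's size."""
    return total_rss_bytes <= uncompressed_cmp_h5_bytes


print(within_runtime_budget("quiver", caller_seconds=350.0, blasr_seconds=100.0))
print(within_runtime_budget("plurality", caller_seconds=350.0, blasr_seconds=100.0))
```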

SEE ALSO

       quiver(1) plurality(1) pbgff(5) blasr(1)

                                                  February 2016                                 VARIANTCALLER(1)