Ubuntu Manpage: variantCaller - variant-calling algorithms for PacBio sequencing data

Provided by: pbgenomicconsensus_2.1.0-1_all

NAME

       variantCaller - variant-calling algorithms for PacBio sequencing data

SYNOPSIS

       variantCaller is invoked from the command line.  For example, a simple invocation is:

              variantCaller -j8 --algorithm=quiver \
                                -r lambdaNEB.fa        \
                                -o variants.gff        \
                                aligned_reads.cmp.h5

       which  requests  that variant calling proceed, - using 8 worker processes, - employing the
       quiver algorithm, - taking input from the file aligned_reads.cmp.h5,  -  using  the  FASTA
       file lambdaNEB.fa as the reference, - and writing output to variants.gff (see pbgff(5)).

       A  particularly  useful  option  is  --referenceWindow/-w:  this option allows the user to
       direct the tool to perform variant calling  exclusively  on  a  window  of  the  reference
       genome, where the

OPTIONS

              variantCaller --help

       will provide a help message explaining all available options.

NOTES

Input and output
variantCaller requires two input files:

• A file of reference-aligned reads in PacBio's standard cmp.h5 format;

• A FASTA file that has been processed by ReferenceUploader.

The tool's output is formatted in the GFF format, as described in (how to link to other
file?). External tools can be used to convert the GFF file to a VCF or BED file---two
other standard interchange formats for variant calling.

note

Input cmp.h5 file requirements

variantCaller requires its input cmp.h5 file to be be sorted. An unsorted file can
be sorting using the tool cmpH5Sort.py.

The quiver(1) algorithm in variantCaller requires its input cmp.h5 file to have the
following pulse features: - InsQV, - SubsQV, - DelQV, - DelTag, - MergeQV.

The plurality(1) algorithm can be run on cmp.h5 files that lack these features.

The input file is the main argument to variantCaller, while the output file is provided as
an argument to the -o flag. For example,

variantCaller aligned_reads.cmp.h5 -r lambda.fa -o variants.gff

will read input from aligned_reads.cmp.h5, using the reference lambda.fa, and send output
to the file variants.gff. The extension of the filename provided to the -o flag is
meaningful, as it determines the output file format. The file formats presently
supported, by extension, are

.gff GFFv3 format

.txt a simplified human readable format used primarily by the developers

If the -o flag is not provided, the default behavior is to output to a variants.gff in the
current directory.

note

variantCaller does not modify its input cmp.h5 file in any way. This is in
contrast to previous variant callers in use at PacBio, which would write a
consensus dataset to the input cmp.h5 file.

Available algorithms
At this time there are two algorithms available for variant calling: plurality and quiver.

Plurality is a simple and very fast procedure that merely tallies the most frequent read
base or bases found in alignment with each reference base, and reports deviations from the
reference as potential variants.

Quiver is a more complex procedure based on algorithms originally developed for CCS.
Quiver leverages the quality values (QVs) provided by upstream processing tools, which
provide insight into whether insertions/deletions/substitutions were deemed likely at a
given read position. Use of quiver requires the ConsensusCore and ConsensusCore2
libraries as well as trained parameter set, which will be loaded from a standard location
(TBD). Arrow and Quiver can be thought of as local-realignment procedures (QV-aware in
the case of Quiver).

Both algorithms are expected to converge to zero errors (miscalled variants) as coverage
increases; however quiver should converge much faster (i.e., fewer errors at low
coverage), and should provide greater variant detection power at a given error level.

Confidence values
Both quiver and plurality make a confidence metric available for every position of the
consensus sequence. The confidence should be interpreted as a phred-transformed posterior
probability that the consensus call is incorrect; i.e.

QV = −10log~10~(p~err~)

variantCaller clips reported QV values at 93---larger values cannot be encoded in a
standard FASTQ file.

Chemistry specificity
The Quiver algorithm parameters are trained per-chemistry. SMRTanalysis software loads
metadata into the cmp.h5 to indicate the chemistry used per movie. Quiver sees this table
and automatically chooses the appropriate parameter set to use. This selection can be
overridden by a command line flag.

When multiple chemistries are represented in the reads in a cmp.h5, Quiver will model each
read appropriately using the parameter set for its chemistry, thus yielding optimal
results.

Performance Requirements
variantCaller performs variant calling in parallel using multiple processes. Work
splitting and inter-process communication are handled using the Python multiprocessing
module. Work can be split among an arbitrary number of processes (using the -j
command-line flag), but for best performance one should use no more worker processes than
there are CPUs in the host computer.

The running time of the plurality algorithm should not exceed the runtime of the BLASR
process that produced the cmp.h5. The running time of the quiver algorithm should not
exceed 4x the runtime of BLASR.

The amount of core memory (RAM) used among all the python processes launched by a
variantCaller run should not exceed the size of the uncompressed input .cmp.h5 file.

EXAMPLES

Basic running instructions
Basic usage---using 8 CPUs to compute consensus of mapped reads and variants relative to a
reference---is as follows:

% quiver -j8 aligned_reads{.cmp.h5, .bam, .fofn, or .xml} \
> -r reference{.fasta or .xml} -o variants.gff \
> -o consensus.fasta -o consensus.fastq

quiver is a shortcut for variantCaller --algorithm=quiver. Naturally, to use arrow you
could use the arrow shortcut or variantCaller --algorithm=arrow.

in this example we perform haploid consensus and variant calling on the mapped reads in
the aligned_reads.bam which was aligned to reference.fasta. The reference.fasta is only
used for designating variant calls, not for computing the consensus. The consensus
quality score for every position can be found in the output FASTQ file.

Note that 2.3 SMRTanalysis does not support "dataset" input (FOFN or XML files); those who
need this feature should wait for the forthcoming release of SMRTanalysis 3.0 or build
from GitHub sources.

Running a large-scale resequencing/polishing job in SMRTanalysis
2.3

We do not recommend attempting to construct a single giant cmp.h5 file and then processing
it on a single node. This is inefficient and users attempting to do this have run into
many problems with the instability of the HDF5 library (which PacBio is moving away from,
in favor of BAM.)

To run a large-scale resequencing job (>50 megabase genome @ 50x coverage,nominally), you
want to spread the computation load across multiple nodes in your computing cluster.

The smrtpipe workflow engine in SMRTanalysis 2.3 provides a convenient workflow automating
this---it will automatically spread the load for both mapping and quiver jobs among your
available cluster nodes. This is accessible via the SMRTportal UI; the simplest way to
set up and run thse workflows is via tha UI. Nonetheless, we include command-line
instructions for completeness.

If you have to run the smrtpipe workflow manually from the command line, a recipe is as
folows:

1. Make sure the reference you will align and compare against is present in a SMRTportal
"reference repository". Even if you don't want to use SMRTportal, you need to
build/import the reference appropriately, and the simplest way to do that is via
SMRTportal. If you don't have a SMRTportal instance, you can use the referenceUploader
command to prepare your reference repository.

2. Prepare an "input.fofn" file listing, one-per-line, each "bax.h5" file in your input
data set.

3. Convert the "input.fofn" to an "input.xml" file that SMRTpipe can understand:

$ fofnToSmrtpipeInput.py input.fofn > input.xml

4. Prepare your "params.xml" file. Here is a params.xml template (./params-template.xml)
you can use; you should just need to edit the reference path.

5. Activate your SMRTanalysis environment, and invoke smrtpipe:

$ source <SMRT Analysis>/etc/setup.sh $ smrtpipe.py --distribute --params=params.xml
xml:input.xml

6. After successful execution is complete, the results should be available as
data/consensus.fast[aq].gz and data/variants.gff.gz, etc.

Please consult the SMRTpipe reference manual (http://www.pacb.com/wp-
content/uploads/2015/09/SMRT-Pipe-Reference-Guide.pdf) for further information.

Note that resequencing (mapping reads against a reference genome and then calling
consensus and identifying variants) and polishing (mapping reads against a draft assembly
and then taking the consensus output as the final, polished, assembly) are the same
algorithmic operation, the only effective difference is that the "variants.gff" output is
not biologically meaningful in the polishing case---it just records the edits that were
made to the draft to produce the polished assembly.

Running a large-scale quiver/arrow job in SMRTanalysis 3.0+
(Forthcoming)

NAME

SYNOPSIS

OPTIONS

NOTES

EXAMPLES

SEE ALSO