xenial (7) quiver-faq.7.gz

Provided by: pbgenomicconsensus_2.0.0+20151210-1_all bug

NAME

       quiver-faq - frequently asked questions regarding the quiver genomic consensus caller

QUESTIONS

   What are EviCons? GenomicConsensus? Quiver? Plurality?
       GenomicConsensus  is the current PacBio consensus and variant calling suite.  It contains a main program,
       variantCaller.py, which provides two consensus / variant calling algorithms: Plurality and Quiver.  These
       algorithms  can be run by calling variantCaller.py --algorithm=[quiver|plurality] or by going through the
       convenience wrapper scripes quiver and plurality.

       EviCons was the previous generation PacBio variant caller (removed in software release v1.3.1).

       A separate package called ConsensusCore is a C++ library where all the computation behind Quiver is  done
       (and is transparent to the user after installation).

   What is Plurality?
       Plurality  is  a  very  simple  variant  calling  algorithm: it stacks up the aligned reads (alignment as
       produced by BLASR, or alternate mapping tool), and for each column under a reference base, calls the most
       abundant  (i.e.,  the  plurality)  read  base  (or bases, or deletion) as the consensus at that reference
       position.

   Why is Plurality a weak algorithm?
       Plurality does not perform any local realignment.  This means it  is  heavily  biased  by  the  alignment
       produced  by  the  mapper  (BLASR, typically).  It also means that it is insensitive at detecting indels.
       Consider this example:

          Reference    AAAA
                       ----
            Aligned    A-AA
              reads    AA-A
                       -AAA
                       ----
          Plurality    AAAA
          consensus

       Note here that every read has a deletion and the correct consensus call would be "AAA", but  due  to  the
       mapper's  freedom  in  gap-placement  at the single-read level, the plurality sequence is "AAAA"---so the
       deletion is missed.  Local realignment, which plurality does not do, but which  could  be  considered  as
       implicit  in  the Quiver algorithm, essentially pushes the gaps here to the same column, thus identifying
       the deletion.  While plurality could be adjusted to  use  a  simple  "gap  normalizing"  realignment,  in
       practice  noncognate  extras  (spurious non-homopolymer base calls) in the midst of homopolymer runs pose
       challenges.

   What is Quiver?
       Quiver is a more sophisticated algorithm that finds the maximum likelihood template sequence given PacBio
       reads  of  the  template.   PacBio  reads  are  modeled  using  a  conditional random field approach that
       prescribes a probability to a read given a template sequence.  In addition to the base sequence  of  each
       read,  Quiver uses several additional QV covariates that the basecaller provides.  Using these covariates
       provides additional information about each read, allowing more accurate consensus calls.

       Quiver does not use the alignment provided by the mapper (BLASR, typically), except for  determining  how
       to  group  reads  together at a macro level.  It implicitly performs its own realignment, so it is highly
       sensitive to all variant types, including indels---for example, it resolves the example above with ease.

       The name Quiver reflects a consensus-calling algorithm that is QV-aware.

   What does Quiver put in its output files?
       There are three output files from Quiver:

       • A consensus FASTA file containing the consensus sequence

       • A consensus FASTQ file containing the consensus sequence with quality annotations

       • A variants GFF file containing a filtered, annotated list of variants identified

       It is important to note that the variants included in the  output  variants  GFF  file  are  filtered  by
       coverage  and  quality, so not all variants that are apparent in comparing the reference to the consensus
       FASTA output will correspond to variants in the output variants GFF file.

       To enable all output files, the following can be run (for example):

          % quiver -j16 aligned_reads.cmp.h5 -r ref.fa
                 -o   consensus.fa                             -o    consensus.fq                             -o
                 variants.gff

       The extension is used to determine the output file format.

   What does it mean that Quiver's consensus is de novo?
       Quiver's consensus is de novo in the sense that the reference and the reference alignment are not used to
       inform the consensus output.  Only the reads factor into the determination of the consensus.

       The only time the reference sequence is used to make consensus calls - when the --noEvidenceConsensusCall
       flag is set to reference or lowercasereference (the default)- is when there is no effective coverage in a
       genomic   window,   so   Quiver   has   no   evidence   for   computing   consensus.    One    can    set
       --noEvidenceConsensusCall=nocall to avoid using the reference even in zero coverage regions.

   What is Quiver's accuracy?
       Quiver's  expected  accuracy  is  a  function  of  coverage  and  chemistry.  The C2 chemistry (no longer
       available), P6-C4 and P4-C2 chemistries provide the most accuracy.  Nominal consensus accuracy levels are
       as follows:

                               ┌─────────────────┬─────────────────────────────┬───────┐
                               │Coverage         │ Expected consensus accuracy │       │
                               ├─────────────────┼─────────────────────────────┼───────┤
                               │C2, P4-C2, P6-C4 │ P5-C3                       │       │
                               ├─────────────────┼─────────────────────────────┼───────┤
                               │10x              │ > Q30                       │ > Q30 │
                               ├─────────────────┼─────────────────────────────┼───────┤
                               │20x              │ > Q40                       │ > Q40 │
                               ├─────────────────┼─────────────────────────────┼───────┤
                               │40x              │ > Q50                       │ > Q45 │
                               ├─────────────────┼─────────────────────────────┼───────┤
                               │60-80x           │ ~ Q60                       │ > Q55 │
                               └─────────────────┴─────────────────────────────┴───────┘

       The "Q" values referred to are Phred-scaled quality values:

          q = -10 \log_{10} p_{error}

       for  instance,  Q50  corresponds  to  a  p_error  of  0.00001---an  accuracy  of 99.999%.  These accuracy
       expectations are based on routine  validations  performed  on  multiple  bacterial  genomes  before  each
       chemistry release.

   What are the residual errors after applying Quiver?
       If  there  are  errors  remaining  applying Quiver, they will almost invariably be homopolymer run-length
       errors (insertions or deletions).

   Does Quiver need to know what sequencing chemistry was used?
       At present, the Quiver model is trained per-chemistry, so it is very  important  that  Quiver  knows  the
       sequencing chemistries used.

       If  SMRT  Analysis software was used to build the cmp.h5 file, the cmp.h5 will be loaded with information
       about the sequencing chemistry used for each SMRT Cell, and Quiver will automatically identify the  right
       parameters to use.

       If  custom  software  was  used to build the cmp.h5, or an override of Quiver's autodetection is desired,
       then the chemistry or model must be explicity entered. For example:

          % quiver -p P4-C2 ...
          % quiver -p P4-C2.AllQVsMergingByChannelModel ...

   Can a mix of chemistries be used in a cmp.h5 file for Quiver?
       Yes!  Quiver automatically sees the chemistry per-SMRT Cell, so it can figure out  the  right  parameters
       for each read and model them appropriately.

       Chemistry  mixtures  of  P6-C4,  P4-C2, P5-C3, and C2 are supported.  If other chemistries are mixed in a
       cmp.h5, Quiver will give undefined results.  However, Quiver  can  still  be  used  on  any  cmp.h5  file
       containing sequencing reads from a single chemistry.

   What are the QVs that Quiver uses?
       Quiver  uses  additional  QV  tracks  provided  by  the basecaller.  These QVs may be looked at as little
       breadcrumbs that are left behind by the basecaller to help identify positions where it  was  likely  that
       errors  of  a  given type occurred.  Formally, the QVs for a given read are vectors of the same length as
       the number of bases called; the QVs used are as follows:

          • DeletionQV

          • InsertionQV

          • MergeQV

          • SubstitutionQV

          • DeletionTag

       To find out if your cmp.h5 file is loaded with these QV tracks, run the command

          % h5ls -rv aligned_reads.cmp.h5

       and look for the QV track names in the output.  If your cmp.h5 file is  lacking  some  of  these  tracks,
       Quiver will still run, though it will issue a warning that its performance will be suboptimal.

   Why is Quiver making errors in some region?
       The  most  likely  cause  for  true errors made by Quiver is that the coverage in the region was low.  If
       there is 5x coverage over a 1000-base region, then 10 errors in that region can be expected.

       It is important to understand that the effective coverage available to Quiver is not  the  full  coverage
       apparent  in  plots---Quiver  and  Plurality  both  filter  out ambiguously mapped reads by default.  The
       remaining coverage after filtering is  called  the  /effective  coverage/.   See  the  next  section  for
       discussion of MapQV.

       If  you  have  verified  that  there  is  high effective coverage in the region in question, it is highly
       possible---given the high accuracy Quiver can achieve---that the apparent errors  actually  reflect  true
       sequence  variants.   Inspect  the  FASTQ  output  file  to  ensure  that  the  region was called at high
       confidence; if an erroneous sequence variant is being called at high confidence, please report a  bug  to
       us.

   What does Quiver do for genomic regions with no effective coverage?
       For regions with no effective coverage, no variants are outputted, and the FASTQ confidence is 0.

       The  output  in  the  FASTA  and  FASTQ  consensus  sequence  tracks  is  dependent on the setting of the
       --noEvidenceConsensusCall flag.  Assuming the reference in the window is "ACGT", the options are:

                              ┌──────────────────────────────────────┬──────────────────┐
                              │--noEvidenceConsensusCall=...         │ Consensus output │
                              ├──────────────────────────────────────┼──────────────────┤
                              │nocall (default in 1.4)               │ NNNN             │
                              ├──────────────────────────────────────┼──────────────────┤
                              │reference                             │ ACGT             │
                              ├──────────────────────────────────────┼──────────────────┤
                              │lowercasereference (new post 1.4, and │ acgt             │
                              │the default)                          │                  │
                              └──────────────────────────────────────┴──────────────────┘

   What is MapQV and why is it important?
       MapQV  is a single scalar Phred-scaled QV per aligned read that reflects the mapper's degree of certainty
       that the read aligned to this part of the reference and not some other.  Unambigously mapped  reads  will
       have  a  high  MapQV (typically 255), while a read that was equally likely to have come from two parts of
       the reference would have a MapQV of 3.

       MapQV is pretty important when you want highly accurate variant calls.  Quiver and Plurality both  filter
       out aligned reads with a MapQV below 20 (by default), so as not to call a variant using data of uncertain
       genomic origin.

       This can be problematic if using Quiver to get a consensus sequence.  If the genome of interest  contains
       long  (relative  to  the library insert size) highly-similar repeats, the effective coverage (after MapQV
       filtering) may be reduced in the repeat regions---this is termed these MapQV dropouts.  If  the  coverage
       is sufficiently reduced in these regions, Quiver will not call consensus in these regions---see What does
       Quiver do for genomic regions with no effective coverage?.

       If you want to use ambiguously mapped reads in computing a consensus for a  denovo  assembly,  the  MapQV
       filter  can  be  turned  off entirely.  In this case, the consensus for each instance of a genomic repeat
       will be calculated using reads that may actually be from other instances of  the  repeat,  so  the  exact
       trustworthiness  of  the  consensus  in  that  region  may be suspect.  The next section describes how to
       disable the MapQV filter.

   How can the MapQV filter be turned off and when should it be?
       The MapQV filter can be disabled using the flag  --mapQvThreshold=0  (shorthand:  -m=0).   If  running  a
       Quiver  job  via  SMRT  Portal,  this can be done by unchecking the "Use only unambiguously mapped reads"
       option. Consider this in de novo assembly projects,  but  it  is  not  recommended  for  variant  calling
       applications.

   How can variant calls made by Quiver be inspected or validated?
       When  in  doubt,  it is easiest to inspect the region in a tool like SMRT View, which enables you to view
       the reads aligned to the region.  Deletions and substitutions should be fairly  easy  to  spot;  to  view
       insertions, right-click on the reference base and select "View Insertions Before...".

   What are the filtering parameters that Quiver uses?
       Quiver limits read coverage, filters reads by MapQV, and filters variants by quality and coverage.

       • The overall read coverage used to call consensus in every window is 100x by default, but can be changed
         using -X=value.

       • The  MapQV  filter,  by  default,  removes  reads  with  MapQV  <  20.   This   is   configured   using
         --mapQvThreshold=value / -m=value

       • Variants are only called if the read coverage of the site exceeds 5x, by default---this is configurable
         using -x=value.  Further, they will not be called if the  confidence  (Phred-scaled)  does  not  exceed
         40---configurable using -q=value.

   What happens when the sample is a mixture, or diploid?
       At  present,  Quiver  assumes  a  haploid  sample,  and  the  behavior  of  Quiver  on sample mixtures or
       diploid/polyploid samples is undefined.  The program will not crash,  but  the  output  results  are  not
       guaranteed to accord with any one of the haplotypes in the sample, as opposed to a potential patchwork.

   Why would I want to iterate the mapping+Quiver process?
       Some  customers  using  Quiver  for  polishing highly repetitive genomes have found that if they take the
       consensus FASTA output of Quiver, use it as a new reference, and then perform mapping and Quiver again to
       get a new consensus, they get improved results from the second round of Quiver.

       This  can  be  explained by noting that the output of the first round of Quiver is more accurate than the
       initial draft consensus output by the assembler, so the second round's mapping to  the  Quiver  consensus
       can  be  more  sensitive  in  mapping  reads  from  repetitive regions.  This can then result in improved
       consensus in those repetitive regions, because the reads have been assigned more correctly to their  true
       genomic  loci.   However there is also a possibility that the potential shifting of reads around from one
       rounds' mapping to the next might alter borderline  (low  confidence)  consensus  calls  even  away  from
       repetitive regions.

       We recommend the (mapping+Quiver) iteration for customers polishing repetitive genomes, and it could also
       prove useful for resequencing applications.  However we caution that this is  very  much  an  exploratory
       procedure and we make no guarantees about its performance.  In particular, borderline consensus calls can
       change when the procedure is iterated, and the procedure is not guaranteed to be convergent.

   Is iterating the (mapping+Quiver) process a convergent procedure?
       We have seen many examples where (mapping+Quiver), repeated many times, is  evidently  not  a  convergent
       procedure.   For  example,  a variant call may be present in iteration n, absent in n+1, and then present
       again in n+2.  It is possible for subtle changes in mapping to change the  set  of  reads  examined  upon
       inspecting  a  genomic  window,  and therefore result in a different consensus sequence there.  We expect
       this to be the case primarily for "borderline" (low confidence) base calls.

SEE ALSO

       quiver(1) plurality(1) variantCaller(1) pbgff(5)