Ubuntu Manpage: ipdSummary - Detect DNA base-modifications from kinetic signatures.

Provided by: kineticstools_0.6.1+20161222-1ubuntu1_all

NAME

       ipdSummary - Detect DNA base-modifications from kinetic signatures.

DESCRIPTION

kineticsTool loads IPDs observed at each position in the genome, and compares those IPDs to value
expected for unmodified DNA, and outputs the result of this statistical test. The expected IPD value for
unmodified DNA can come from either an in-silico control or an amplified control. The in silico control
is trained by PacBio and shipped with the package. It predicts predicts the IPD using the local sequence
context around the current position. An amplified control dataset is generated by sequencing unmodified
DNA with the same sequence as the test sample. An amplified control sample is usually generated by
whole-genome amplification of the original sample.

Modification Detection
The basic mode of kineticsTools does an independent comparison of IPDs at each position on the genome,
for each strand, and emits various statistics to CSV and GFF (after applying a significance filter).

Modifications Identification
kineticsTools also has a Modification Identification mode that can decode multi-site IPD 'fingerprints'
into a reduced set of calls of specific modifications. This feature has the following benefits:

• Different modifications occurring on the same base can be distinguished (for example m5C and
m4C)

• The signal from one modification is combined into one statistic, improving sensitivity, removing
extra peaks, and correctly centering the call

OPTIONS

       Please call this program with --help to see the available options.

ALGORITHM

   Synthetic Control
       Studies of the relationship between IPD and sequence context reveal that most of the  variation  in  mean
       IPD  across  a genome can be predicted from a 12-base sequence context surrounding the active site of the
       DNA polymerase. The bounds of the relevant context window correspond to the window of DNA in contact with
       the polymerase, as seen in DNA/polymerase crystal structures.  To simplify the  process  of  finding  DNA
       modifications with PacBio data, the tool includes a pre-trained lookup table mapping 12-mer DNA sequences
       to mean IPDs observed in C2 chemistry.

   Filtering and Trimming
       kineticsTools  uses  the  Mapping  QV  generated  by  BLASR  and  stored  in  the  cmp.h5 or BAM file (or
       AlignmentSet) to ignore reads that aren't confidently mapped.  The default minimum Mapping QV required is
       10, implying that BLASR has 90\% confidence that the read is correctly mapped. Because of  the  range  of
       readlengths  inherent  in  PacBio  dataThis  can  be  changed  in using the --mapQvThreshold command line
       argument, or via the SMRTPortal configuration dialog for Modification Detection.

       There are a few features of PacBio  data  that  require  special  attention  in  order  to  achieve  good
       modification  detection performance.  kineticsTools inspects the alignment between the observed bases and
       the reference sequence -- in order for an IPD measurement to be included in the analysis, the PacBio read
       sequence must match the reference sequence for k around the cognate base. In the current module  k=1  The
       IPD distribution at some locus be thought of as a mixture between the 'normal' incorporation process IPD,
       which  is  sensitive  to  the  local  sequence  context and DNA modifications and a contaminating 'pause'
       process IPD which have a much longer duration (mean >10x longer than normal), but happen rarely  (~1%  of
       IPDs).   Note:  Our  current  understanding  is  that  pauses  do  not carry useful information about the
       methylation state of the DNA,  however  a  more  careful  analysis  may  be  warranted.  Also  note  that
       modifications  that  drastically  increase the Roughly 1% of observed IPDs are generated by pause events.
       Capping observed IPDs at the global 99th  percentile  is  motivated  by  theory  from  robust  hypothesis
       testing.   Some sequence contexts may have naturally longer IPDs, to avoid capping too much data at those
       contexts,  the  cap  threshold  is  adjusted  per  context  as  follows:  capThreshold  =   max(global99,
       5*modelPrediction, percentile(ipdObservations, 75))

   Statistical Testing
       We  test  the  hypothesis that IPDs observed at a particular locus in the sample have a longer means than
       IPDs observed at the same locus in unmodified DNA.   If  we  have  generated  a  Whole  Genome  Amplified
       dataset,  which  removes  DNA  modifications,  we  use a case-control, two-sample t-test.  This tool also
       provides a pre-calibrated 'synthetic control' model which predicts the unmodified IPD, given  a  12  base
       sequence context. In the synthetic control case we use a one-sample t-test, with an adjustment to account
       for error in the synthetic control model.

EXAMPLE USAGE

       Basic use with BAM input, GFF+HDF5 output:

          ipdSummary aligned.bam --reference ref.fasta --identify m6A,m4C --gff basemods.gff --csv_h5 kinetics.h5

       With cmp.h5 input, methyl fraction calculation and GFF+CSV output:

          ipdSummary aligned.cmp.h5 --reference ref.fasta --identify m6A,m4C --methylFraction --gff basemods.gff --csv kinetics.csv

INPUTS

   Aligned Reads
       A  standard  PacBio  alignment file - either AlignmentSet XML, BAM, or cmp.h5 - containing alignments and
       IPD information supplies the kinetic data used to perform modification detection.   The  standard  cmp.h5
       file of a SMRTportal jobs is data/aligned_read.cmp.h5.

   Reference Sequence
       The  tool requires the reference sequence used to perform alignments.  This can be either a FASTA file or
       a ReferenceSet XML.

OUTPUTS

       The modification detection  tool  provides  results  in  a  variety  of  formats  suitable  for  in-depth
       statistical  analysis,  quick  reference, and comsumption by visualization tools such as PacBio SMRTView.
       Results are generally indexed by reference position and reference strand.  In all cases the strand  value
       refers  to  the  strand  carrying the modification in DNA sample. Remember that the kinetic effect of the
       modification is observed in read sequences aligning to the opposite strand.  So  reads  aligning  to  the
       positive  strand  carry information about modification on the negative strand and vice versa, but in this
       toolkit we alway report the strand containing the putative modification.

       The following output options are available:

          • --gff FILENAME: GFF format

          • --csv FILENAME: comma-separated value format

          • --csv_h5 FILENAME: compact binary equivalent of CSV in HDF5 format

          • --bigwig FILENAME: BigWig file (mostly only useful for SMRTView)

   modifications.gff
       The    modifications.gff    is    compliant    with    the    GFF    Version    3    specification     (‐
       http://www.sequenceontology.org/gff3.shtml).  Each  template position / strand pair whose p-value exceeds
       the pvalue threshold appears as a row. The template position is 1-based, per the GFF  spec.   The  strand
       column  refers  to the strand carrying the detected modification, which is the opposite strand from those
       used to detect the modification. The GFF confidence column is a Phred-transformed pvalue of detection.

       Note on genome browser compatibility

       The modifications.gff file will not work directly with most genome browsers.  You  will  likely  need  to
       make  a  copy of the GFF file and convert the _seqid_ columns from the generic 'ref0000x' names generated
       by PacBio, to the FASTA headers present in the original reference  FASTA  file.   The  mapping  table  is
       written  in  the  header  of  the  modifications.gff  file in  #sequence-header tags.  This issue will be
       resolved in the 1.4 release of kineticsTools.

       The auxiliary data column of the GFF file contains  other  statistics  which  may  be  useful  downstream
       analysis or filtering.  In particular the coverage level of the reads used to make the call, and +/- 20bp
       sequence context surrounding the site.

       System Message: ERROR/3 (doc/manual.rst:, line 114)
              Malformed table.  Text in column margin in table line 2.

          ================  ===========
          Column      Description
          ================  ===========
          seqid     Fasta contig name
          source            Name of tool -- 'kinModCall'
          type                    Modification type -- in identification mode this will be m6A, m4C, or m5C for identified bases, or the generic tag 'modified_base' if a kinetic event was detected that does not match a known modification signature
          start                   Modification position on contig
          end                     Modification position on contig
          score                   Phred transformed p-value of detection - this is the single-site detection p-value
          strand                  Sample strand containing modification
          phase                   Not applicable
          attributes              Extra fields relevant to base mods. IPDRatio is traditional IPDRatio, context is the reference sequence -20bp to +20bp around the modification, and coverage level is the number of IPD observations used after Mapping QV filtering and accuracy filtering. If the row results from an identified modification we also include an identificationQv tag with the from the modification identification procedure. identificationQv is the phred-transformed probability of an incorrect identification, for bases that were identified as having a particular modification. frac, fracLow, fracUp are the estimated fraction of molecules carrying the modification, and the 5% confidence intervals of the estimate. The methylated fraction estimation is a beta-level feature, and should only be used for exploratory purposes.
          ================  ===========

   modifications.csv
       The  modifications.csv  file contains one row for each (reference position, strand) pair that appeared in
       the dataset with coverage at least x.  x defaults to 3, but is configurable with '--minCoverage' flag  to
       ipdSummary.py.  The  reference  position  index  is  1-based  for  compatibility  with the gff file the R
       environment.  Note that this output type scales poorly and is not recommended for large genomes; the HDF5
       output should perform much better in these cases.

   Output columns
       in-silico control mode
                             ┌─────────────────┬───────────────────────────────────────┐
                             │ Column          │ Description                           │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ refId           │ reference   sequence   ID   of   this │
                             │                 │ observation                           │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ tpl             │ 1-based template position             │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ strand          │ native  sample  strand where kinetics │
                             │                 │ were generated. '0' is the strand  of │
                             │                 │ the  original  FASTA, '1' is opposite │
                             │                 │ strand from FASTA                     │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ base            │ the cognate base at this position  in │
                             │                 │ the reference                         │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ score           │ Phred-transformed   pvalue   that   a │
                             │                 │ kinetic  deviation  exists  at   this │
                             │                 │ position                              │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ tMean           │ capped   mean   of   normalized  IPDs │
                             │                 │ observed at this position             │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ tErr            │ capped standard error  of  normalized │
                             │                 │ IPDs   observed   at   this  position │
                             │                 │ (standard deviation / sqrt(coverage)  │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ modelPrediction │ normalized mean IPD predicted by  the │
                             │                 │ synthetic   control  model  for  this │
                             │                 │ sequence context                      │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ ipdRatio        │ tMean / modelPrediction               │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ coverage        │ count of valid IPDs at this  position │
                             │                 │ (see Filtering section for details)   │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ frac            │ estimate of the fraction of molecules │
                             │                 │ that carry the modification           │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ fracLow         │ 2.5%   confidence   bound   of   frac │
                             │                 │ estimate                              │
                             ├─────────────────┼───────────────────────────────────────┤
                             │ fracUpp         │ 97.5%  confidence   bound   of   frac │
                             │                 │ estimate                              │
                             └─────────────────┴───────────────────────────────────────┘

       case-control mode
                             ───────────────────────────────────────────────────────────
                               Column            Description
                             ───────────────────────────────────────────────────────────
                               refId             reference   sequence   ID   of   this
                                                 observation
                             ───────────────────────────────────────────────────────────
                               tpl               1-based template position
                             ───────────────────────────────────────────────────────────
                               strand            native sample strand  where  kinetics
                                                 were  generated. '0' is the strand of
                                                 the original FASTA, '1'  is  opposite
                                                 strand from FASTA
                             ───────────────────────────────────────────────────────────
                               base              the  cognate base at this position in
                                                 the reference
                             ───────────────────────────────────────────────────────────
                               score             Phred-transformed   pvalue   that   a
                                                 kinetic   deviation  exists  at  this
                                                 position
                             ───────────────────────────────────────────────────────────
                               caseMean          mean of normalized case IPDs observed
                                                 at this position
                             ───────────────────────────────────────────────────────────
                               controlMean       mean  of  normalized   control   IPDs
                                                 observed at this position
                             ───────────────────────────────────────────────────────────
                               caseStd           standard   deviation   of  case  IPDs
                                                 observed at this position
                             ───────────────────────────────────────────────────────────
                               controlStd        standard deviation  of  control  IPDs
                                                 observed at this position
                             ───────────────────────────────────────────────────────────
                               ipdRatio          tMean / modelPrediction
                             ───────────────────────────────────────────────────────────
                               testStatistic     t-test statistic
                             ───────────────────────────────────────────────────────────
                               coverage          mean of case and control coverage
                             ───────────────────────────────────────────────────────────
                               controlCoverage   count  of  valid control IPDs at this
                                                 position (see Filtering  section  for
                                                 details)
                             ───────────────────────────────────────────────────────────
                               caseCoverage      count  of  valid  case  IPDs  at this
                                                 position (see Filtering  section  for
                                                 details)
                             ┌─────────────────┬───────────────────────────────────────┐
                             │                 │                                       │
SEE ALSO                     │                 │                                       │