Provided by: kineticstools_0.6.1+20161222-1ubuntu1_all

NAME

       ipdSummary - Detect DNA base-modifications from kinetic signatures.

DESCRIPTION

       kineticsTools loads IPDs observed at each position in the genome, compares those IPDs to
       the values expected for unmodified DNA, and outputs the result of this statistical test.
       The expected IPD value for unmodified DNA can come from either an in-silico control or an
       amplified control. The in-silico control is trained by PacBio and shipped with the
       package. It predicts the IPD using the local sequence context around the current
       position. An amplified control dataset is generated by sequencing unmodified DNA with the
       same sequence as the test sample. An amplified control sample is usually generated by
       whole-genome amplification of the original sample.

   Modification Detection
       The basic mode of kineticsTools does an independent comparison of IPDs at each position on
       the genome, for each strand, and emits various statistics to CSV and GFF (after applying a
       significance filter).

   Modification Identification
       kineticsTools also has a Modification Identification mode that can decode  multi-site  IPD
       'fingerprints' into a reduced set of calls of specific modifications. This feature has the
       following benefits:

              • Different modifications occurring on the same  base  can  be  distinguished  (for
                example m5C and m4C)

              • The  signal  from  one  modification  is  combined  into one statistic, improving
                sensitivity, removing extra peaks, and correctly centering the call

OPTIONS

       Please call this program with --help to see the available options.

ALGORITHM

   Synthetic Control
       Studies of the relationship between IPD and sequence  context  reveal  that  most  of  the
       variation  in  mean  IPD  across a genome can be predicted from a 12-base sequence context
       surrounding the active site of the DNA polymerase. The  bounds  of  the  relevant  context
       window  correspond  to  the  window  of  DNA  in  contact  with the polymerase, as seen in
       DNA/polymerase crystal structures.  To simplify the process of finding  DNA  modifications
       with  PacBio  data,  the  tool  includes  a  pre-trained  lookup  table mapping 12-mer DNA
       sequences to mean IPDs observed in C2 chemistry.

   Filtering and Trimming
       kineticsTools uses the Mapping QV generated by BLASR and stored in the cmp.h5 or BAM file
       (or AlignmentSet) to ignore reads that aren't confidently mapped. The default minimum
       Mapping QV required is 10, implying that BLASR has 90% confidence that the read is
       correctly mapped. This threshold can be changed using the --mapQvThreshold command line
       argument, or via the SMRTPortal configuration dialog for Modification Detection.
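
       The relationship between a Mapping QV of 10 and 90% confidence follows from the usual
       Phred scale. A minimal Python sketch of that conversion (the function names are
       illustrative and not part of kineticsTools):

```python
import math

def phred_to_error_probability(qv):
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-qv / 10.0)

def error_probability_to_phred(p):
    """Convert an error probability to a Phred-scaled quality value."""
    return -10.0 * math.log10(p)

# Default threshold described above: Mapping QV 10.
default_map_qv = 10
confidence = 1.0 - phred_to_error_probability(default_map_qv)
print(confidence)  # 0.9
```

       The same scale applies to the Phred-transformed p-values reported in the outputs.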

       A few features of PacBio data require special attention in order to achieve good
       modification detection performance. kineticsTools inspects the alignment between the
       observed bases and the reference sequence: for an IPD measurement to be included in the
       analysis, the PacBio read sequence must match the reference sequence for k bases around
       the cognate base. In the current module k=1.

       The IPD distribution at a locus can be thought of as a mixture of the 'normal'
       incorporation process IPD, which is sensitive to the local sequence context and DNA
       modifications, and a contaminating 'pause' process IPD, which has a much longer duration
       (mean >10x longer than normal) but happens rarely (~1% of IPDs). Note: Our current
       understanding is that pauses do not carry useful information about the methylation state
       of the DNA; however, a more careful analysis may be warranted.

       Roughly 1% of observed IPDs are generated by pause events. Capping observed IPDs at the
       global 99th percentile is motivated by theory from robust hypothesis testing. Some
       sequence contexts may have naturally longer IPDs; to avoid capping too much data at
       those contexts, the cap threshold is adjusted per context as follows:

          capThreshold = max(global99, 5*modelPrediction, percentile(ipdObservations, 75))
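
       The capThreshold formula above can be sketched in Python as follows. This is an
       illustration only: the percentile helper uses linear interpolation, and "capping" is
       assumed to mean clipping each IPD to the threshold rather than discarding it; neither
       detail is stated in this manual.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def cap_threshold(global99, model_prediction, ipd_observations):
    """Per-context cap: the largest of the three candidate thresholds."""
    return max(global99, 5 * model_prediction, percentile(ipd_observations, 75))

def capped_mean(ipd_observations, threshold):
    """Mean of the IPDs after clipping each one to the threshold (assumed)."""
    capped = [min(ipd, threshold) for ipd in ipd_observations]
    return sum(capped) / len(capped)

# Made-up normalized IPDs at one locus; the last value mimics a pause event.
ipds = [0.5, 0.8, 1.0, 1.2, 30.0]
threshold = cap_threshold(global99=4.0, model_prediction=1.0, ipd_observations=ipds)
print(threshold, capped_mean(ipds, threshold))
```

       In this made-up example the 5*modelPrediction term sets the cap, so the pause-like
       observation contributes 5.0 instead of 30.0 to the mean.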

   Statistical Testing
       We test the hypothesis that IPDs observed at a particular locus in the sample have a
       longer mean than IPDs observed at the same locus in unmodified DNA.  If we have generated
       a Whole Genome Amplified dataset, which removes DNA modifications, we use a  case-control,
       two-sample  t-test.   This  tool  also provides a pre-calibrated 'synthetic control' model
       which predicts the unmodified IPD, given a 12 base  sequence  context.  In  the  synthetic
       control  case  we  use a one-sample t-test, with an adjustment to account for error in the
       synthetic control model.
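
       The two tests can be sketched with the Python standard library. This is not the
       tool's implementation: the two-sample statistic assumes unequal variances (Welch
       form), and the model_error term added in quadrature is a hypothetical stand-in for
       the adjustment mentioned above, whose exact form this manual does not give.

```python
import math
from statistics import mean, variance

def two_sample_t(case_ipds, control_ipds):
    """Case-control t statistic, assuming unequal variances (Welch form)."""
    se2 = (variance(case_ipds) / len(case_ipds)
           + variance(control_ipds) / len(control_ipds))
    return (mean(case_ipds) - mean(control_ipds)) / math.sqrt(se2)

def one_sample_t(ipds, model_prediction, model_error=0.0):
    """One-sample t statistic against the synthetic control prediction.

    model_error is a hypothetical standard error of the control model,
    combined in quadrature with the standard error of the observed mean.
    """
    se2 = variance(ipds) / len(ipds) + model_error ** 2
    return (mean(ipds) - model_prediction) / math.sqrt(se2)

# Made-up normalized IPDs for illustration.
case = [2.0, 2.2, 1.8, 2.1, 1.9]
control = [1.0, 1.1, 0.9, 1.0, 1.0]
print(two_sample_t(case, control))
print(one_sample_t(case, model_prediction=1.0, model_error=0.2))
```

       A positive statistic indicates longer IPDs in the sample than in the control; allowing
       for model error widens the denominator and so makes the one-sample test more
       conservative.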

EXAMPLE USAGE

       Basic use with BAM input, GFF+HDF5 output:

          ipdSummary aligned.bam --reference ref.fasta --identify m6A,m4C --gff basemods.gff --csv_h5 kinetics.h5

       With cmp.h5 input, methyl fraction calculation and GFF+CSV output:

          ipdSummary aligned.cmp.h5 --reference ref.fasta --identify m6A,m4C --methylFraction --gff basemods.gff --csv kinetics.csv

INPUTS

   Aligned Reads
       A standard PacBio alignment file (AlignmentSet XML, BAM, or cmp.h5) containing
       alignments and IPD information supplies the kinetic data used to perform modification
       detection.  The standard cmp.h5 file of a SMRTPortal job is data/aligned_read.cmp.h5.

   Reference Sequence
       The tool requires the reference sequence used to perform alignments.  This can be either a
       FASTA file or a ReferenceSet XML.

OUTPUTS

       The modification detection tool provides results in a variety of formats suitable for
       in-depth statistical analysis, quick reference, and consumption by visualization tools
       such as PacBio SMRTView. Results are generally indexed by reference position and
       reference strand. In all cases the strand value refers to the strand carrying the
       modification in the DNA sample. Remember that the kinetic effect of the modification is
       observed in read sequences aligning to the opposite strand. So reads aligning to the
       positive strand carry information about modification on the negative strand and vice
       versa, but in this toolkit we always report the strand containing the putative
       modification.

       The following output options are available:

          • --gff FILENAME: GFF format

          • --csv FILENAME: comma-separated value format

          • --csv_h5 FILENAME: compact binary equivalent of CSV in HDF5 format

          • --bigwig FILENAME: BigWig file (mostly only useful for SMRTView)

   modifications.gff
       The modifications.gff file is compliant with the GFF Version 3 specification
       (http://www.sequenceontology.org/gff3.shtml). Each template position / strand pair whose
       detection p-value passes the p-value threshold appears as a row. The template position
       is 1-based, per the GFF spec. The strand column refers to the strand carrying the
       detected modification, which is the opposite strand from the reads used to detect the
       modification. The GFF confidence column is a Phred-transformed p-value of detection.

       Note on genome browser compatibility

       The modifications.gff file will not work directly with most genome browsers. You will
       likely need to make a copy of the GFF file and convert the seqid column from the
       generic 'ref0000x' names generated by PacBio to the FASTA headers present in the
       original reference FASTA file. The mapping table is written in the header of the
       modifications.gff file in #sequence-header tags. This issue will be resolved in the 1.4
       release of kineticsTools.
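
       The seqid conversion described above might be scripted along these lines. The sketch
       assumes each header line has the form "##sequence-header <generic id> <FASTA header>";
       that layout is an assumption and should be checked against the header of an actual
       modifications.gff file before use.

```python
def rename_seqids(gff_lines):
    """Replace generic seqids with FASTA headers read from the GFF header.

    Assumes header lines of the form (not confirmed by this manual):
        ##sequence-header <generic id> <FASTA header>
    """
    mapping = {}
    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            fields = line.lstrip("#").split(None, 2)
            if len(fields) == 3 and fields[0] == "sequence-header":
                mapping[fields[1]] = fields[2]
            out.append(line)
            continue
        columns = line.split("\t")
        # Column 1 of a GFF record is the seqid; unknown ids are left as-is.
        columns[0] = mapping.get(columns[0], columns[0])
        out.append("\t".join(columns))
    return out

# Made-up example records.
example = [
    "##gff-version 3",
    "##sequence-header ref000001 chr1",
    "ref000001\tkinModCall\tm6A\t100\t100\t45\t+\t.\tcoverage=50;IPDRatio=3.2",
]
renamed = rename_seqids(example)
print(renamed[2])
```

       Genome browsers often expect only the first word of a FASTA header, so a real script
       may need to truncate the mapped name accordingly.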

       The auxiliary data column of the GFF file contains other statistics which may be useful
       for downstream analysis or filtering, in particular the coverage level of the reads
       used to make the call and the +/- 20bp sequence context surrounding the site.

                          ┌────────────────┬──────────────────────────────────┐
                          │Column          │ Description                      │
                          ├────────────────┼──────────────────────────────────┤
                          │seqid           │ FASTA contig name                │
                          ├────────────────┼──────────────────────────────────┤
                          │source          │ Name of tool -- 'kinModCall'     │
                          ├────────────────┼──────────────────────────────────┤
                          │type            │ Modification type -- in          │
                          │                │ identification mode this will be │
                          │                │ m6A, m4C, or m5C for identified  │
                          │                │ bases, or the generic tag        │
                          │                │ 'modified_base' if a kinetic     │
                          │                │ event was detected that does not │
                          │                │ match a known modification       │
                          │                │ signature                        │
                          ├────────────────┼──────────────────────────────────┤
                          │start           │ Modification position on contig  │
                          ├────────────────┼──────────────────────────────────┤
                          │end             │ Modification position on contig  │
                          ├────────────────┼──────────────────────────────────┤
                          │score           │ Phred-transformed p-value of     │
                          │                │ detection - this is the          │
                          │                │ single-site detection p-value    │
                          ├────────────────┼──────────────────────────────────┤
                          │strand          │ Sample strand containing         │
                          │                │ modification                     │
                          ├────────────────┼──────────────────────────────────┤
                          │phase           │ Not applicable                   │
                          ├────────────────┼──────────────────────────────────┤
                          │attributes      │ Extra fields relevant to base    │
                          │                │ mods. IPDRatio is the traditional│
                          │                │ IPD ratio, context is the        │
                          │                │ reference sequence -20bp to +20bp│
                          │                │ around the modification, and     │
                          │                │ coverage is the number of IPD    │
                          │                │ observations used after Mapping  │
                          │                │ QV filtering and accuracy        │
                          │                │ filtering. If the row results    │
                          │                │ from an identified modification, │
                          │                │ an identificationQv tag from the │
                          │                │ modification identification      │
                          │                │ procedure is also included.      │
                          │                │ identificationQv is the          │
                          │                │ Phred-transformed probability of │
                          │                │ an incorrect identification, for │
                          │                │ bases that were identified as    │
                          │                │ having a particular modification.│
                          │                │ frac, fracLow, fracUp are the    │
                          │                │ estimated fraction of molecules  │
                          │                │ carrying the modification, and   │
                          │                │ the bounds of the 95% confidence │
                          │                │ interval of the estimate. The    │
                          │                │ methylated fraction estimation is│
                          │                │ a beta-level feature, and should │
                          │                │ only be used for exploratory     │
                          │                │ purposes.                        │
                          └────────────────┴──────────────────────────────────┘

   modifications.csv
       The modifications.csv file contains one row for each (reference position, strand) pair
       that appeared in the dataset with coverage of at least x. x defaults to 3, but is
       configurable with the '--minCoverage' flag to ipdSummary.py. The reference position
       index is 1-based for compatibility with the GFF file and the R environment. Note that
       this output type scales poorly and is not recommended for large genomes; the HDF5
       output should perform much better in these cases.

   Output columns
       in-silico control mode

                          ┌────────────────┬──────────────────────────────────┐
                          │Column          │ Description                      │
                          ├────────────────┼──────────────────────────────────┤
                          │refId           │ reference sequence  ID  of  this │
                          │                │ observation                      │
                          ├────────────────┼──────────────────────────────────┤
                          │tpl             │ 1-based template position        │
                          ├────────────────┼──────────────────────────────────┤
                          │strand          │ native   sample   strand   where │
                          │                │ kinetics were generated. '0'  is │
                          │                │ the   strand   of  the  original │
                          │                │ FASTA, '1'  is  opposite  strand │
                          │                │ from FASTA                       │
                          ├────────────────┼──────────────────────────────────┤
                          │base            │ the   cognate   base   at   this │
                          │                │ position in the reference        │
                          ├────────────────┼──────────────────────────────────┤
                          │score           │ Phred-transformed pvalue that  a │
                          │                │ kinetic deviation exists at this │
                          │                │ position                         │
                          ├────────────────┼──────────────────────────────────┤
                          │tMean           │ capped mean of  normalized  IPDs │
                          │                │ observed at this position        │
                          ├────────────────┼──────────────────────────────────┤
                          │tErr            │ capped    standard    error   of │
                          │                │ normalized IPDs observed at this │
                          │                │ position  (standard  deviation / │
                          │                │ sqrt(coverage))                  │
                          ├────────────────┼──────────────────────────────────┤
                          │modelPrediction │ normalized mean IPD predicted by │
                          │                │ the  synthetic control model for │
                          │                │ this sequence context            │
                          ├────────────────┼──────────────────────────────────┤
                          │ipdRatio        │ tMean / modelPrediction          │
                          ├────────────────┼──────────────────────────────────┤
                          │coverage        │ count  of  valid  IPDs  at  this │
                          │                │ position  (see Filtering section │
                          │                │ for details)                     │
                          ├────────────────┼──────────────────────────────────┤
                          │frac            │ estimate  of  the  fraction   of │
                          │                │ molecules    that    carry   the │
                          │                │ modification                     │
                          ├────────────────┼──────────────────────────────────┤
                          │fracLow         │ 2.5% confidence  bound  of  frac │
                          │                │ estimate                         │
                          ├────────────────┼──────────────────────────────────┤
                          │fracUpp         │ 97.5%  confidence  bound of frac │
                          │                │ estimate                         │
                          └────────────────┴──────────────────────────────────┘
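
       The in-silico control columns above can be filtered with a few lines of Python. The
       sketch below assumes the CSV header uses exactly the column names in the table; the
       sample rows and the score and coverage thresholds are made up for illustration.

```python
import csv
import io

def significant_sites(csv_text, min_score=20.0, min_coverage=25):
    """Return (refId, tpl, strand, ipdRatio) for rows passing both thresholds."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row in reader:
        if float(row["score"]) >= min_score and int(row["coverage"]) >= min_coverage:
            hits.append((row["refId"], int(row["tpl"]),
                         int(row["strand"]), float(row["ipdRatio"])))
    return hits

# Made-up rows in the column order of the table above.
sample = """refId,tpl,strand,base,score,tMean,tErr,modelPrediction,ipdRatio,coverage
ref000001,1,0,A,2,0.9,0.05,1.0,0.9,40
ref000001,2,0,A,45,3.0,0.20,1.0,3.0,38
ref000001,3,1,C,30,2.0,0.30,1.0,2.0,10
"""
hits = significant_sites(sample)
print(hits)
```

       For a real file, pass the file contents (or adapt the function to take an open file
       handle); for large genomes the HDF5 output is the better choice, as noted above.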

       case-control mode

                          ┌────────────────┬──────────────────────────────────┐
                          │Column          │ Description                      │
                          ├────────────────┼──────────────────────────────────┤
                          │refId           │ reference sequence  ID  of  this │
                          │                │ observation                      │
                          ├────────────────┼──────────────────────────────────┤
                          │tpl             │ 1-based template position        │
                          ├────────────────┼──────────────────────────────────┤
                          │strand          │ native   sample   strand   where │
                          │                │ kinetics were generated. '0'  is │
                          │                │ the   strand   of  the  original │
                          │                │ FASTA, '1'  is  opposite  strand │
                          │                │ from FASTA                       │
                          ├────────────────┼──────────────────────────────────┤
                          │base            │ the   cognate   base   at   this │
                          │                │ position in the reference        │
                          ├────────────────┼──────────────────────────────────┤
                          │score           │ Phred-transformed pvalue that  a │
                          │                │ kinetic deviation exists at this │
                          │                │ position                         │
                          ├────────────────┼──────────────────────────────────┤
                          │caseMean        │ mean  of  normalized  case  IPDs │
                          │                │ observed at this position        │
                          ├────────────────┼──────────────────────────────────┤
                          │controlMean     │ mean  of normalized control IPDs │
                          │                │ observed at this position        │
                          ├────────────────┼──────────────────────────────────┤
                          │caseStd         │ standard deviation of case  IPDs │
                          │                │ observed at this position        │
                          ├────────────────┼──────────────────────────────────┤
                          │controlStd      │ standard  deviation  of  control │
                          │                │ IPDs observed at this position   │
                          ├────────────────┼──────────────────────────────────┤
                          │ipdRatio        │ caseMean / controlMean           │
                          ├────────────────┼──────────────────────────────────┤
                          │testStatistic   │ t-test statistic                 │
                          ├────────────────┼──────────────────────────────────┤
                          │coverage        │ mean   of   case   and   control │
                          │                │ coverage                         │
                          ├────────────────┼──────────────────────────────────┤
                          │controlCoverage │ count  of  valid control IPDs at │
                          │                │ this  position  (see   Filtering │
                          │                │ section for details)             │
                          ├────────────────┼──────────────────────────────────┤
                          │caseCoverage    │ count of valid case IPDs at this │
                          │                │ position (see Filtering  section │
                          │                │ for details)                     │
                          └────────────────┴──────────────────────────────────┘

SEE ALSO

       summarizeModifications(1)

                                          December 2015                             IPDSUMMARY(1)