Provided by: kineticstools_0.5.1+20150828+git3aa3d96+dfsg-1_all bug

NAME

       ipdSummary - Detect DNA base-modifications from kinetic signatures.

DESCRIPTION

       kineticsTool  loads  IPDs observed at each position in the genome, and compares those IPDs
       to value expected for unmodified DNA, and outputs the result  of  this  statistical  test.
       The  expected IPD value for unmodified DNA can come from either an in-silico control or an
       amplified control. The in silico control  is  trained  by  PacBio  and  shipped  with  the
       package.  It predicts predicts the IPD using the local sequence context around the current
       position.  An amplified control dataset is generated by sequencing unmodified DNA with the
       same  sequence  as  the  test  sample. An amplified control sample is usually generated by
       whole-genome amplification of the original sample.

   Modification Detection
       The basic mode of kineticsTools does an independent comparison of IPDs at each position on
       the genome, for each strand, and emits various statistics to CSV and GFF (after applying a
       significance filter).

   Modifications Identification
       kineticsTools also has a Modification Identification mode that can decode  multi-site  IPD
       'fingerprints' into a reduced set of calls of specific modifications. This feature has the
       following benefits:

              • Different modifications occuring on the  same  base  can  be  distinguished  (for
                example m5C and m4C)

              • The  signal  from  one  modification  is  combined  into one statistic, improving
                sensitivity, removing extra peaks, and correctly centering the call

OPTIONS

       Please call this program with --help to see the available options.

ALGORITHM

   Synthetic Control
       Studies of the relationship between IPD and sequence  context  reveal  that  most  of  the
       variation  in  mean  IPD  across a genome can be predicted from a 12-base sequence context
       surrounding the active site of the DNA polymerase. The  bounds  of  the  relevant  context
       window  correspond  to  the  window  of  DNA  in  contact  with the polymerase, as seen in
       DNA/polymerase crystal structures.  To simplify the process of finding  DNA  modifications
       with  PacBio  data,  the  tool  includes  a  pre-trained  lookup  table mapping 12-mer DNA
       sequences to mean IPDs observed in C2 chemistry.

   Filtering and Trimming
       kineticsTools uses the Mapping QV generated by BLASR and stored  in  the  cmp.h5  file  to
       ignore  reads  that aren't confidently mapped.  The default minimum Mapping QV required is
       10, implying that BLASR has 90\% confidence that the read is correctly mapped. Because  of
       the  range  of  readlengths  inherent  in  PacBio  dataThis  can  be  changed in using the
       --mapQvThreshold command line argument, or via the  SMRTPortal  configuration  dialog  for
       Modification Detection.

       There are a few features of PacBio data that require special attention in order to acheive
       good modification detection performance.  kineticsTools inspects the alignment between the
       observed  bases  and  the  reference  sequence  --  in  order for an IPD measurement to be
       included in the analysis, the PacBio read sequence must match the reference sequence for k
       around  the  cognate base. In the current module k=1 The IPD distribution at some locus be
       thought of as a mixture between the 'normal' incorporation process IPD, which is sensitive
       to  the  local  sequence context and DNA modifications and a contaminating 'pause' process
       IPD which have a much longer duration (mean >10x longer than normal),  but  happen  rarely
       (~1%  of  IPDs).   Note:  Our  current  understanding  is  that pauses do not carry useful
       information about the methylation state of the DNA, however a more careful analysis may be
       warranted.  Also  note  that  modifications  that  drastically  increase the Roughly 1% of
       observed IPDs are generated by pause events. Capping observed  IPDs  at  the  global  99th
       percentile  is motivated by theory from robust hypothesis testing.  Some sequence contexts
       may have naturally longer IPDs, to avoid capping too much data at those contexts, the  cap
       threshold   is   adjusted   per   context   as   follows:   capThreshold  =  max(global99,
       5*modelPrediction, percentile(ipdObservations, 75))

   Statistical Testing
       We test the hypothesis that IPDs observed at a particular  locus  in  the  sample  have  a
       longer means than IPDs observed at the same locus in unmodified DNA.  If we have generated
       a Whole Genome Amplified dataset, which removes DNA modifications, we use a  case-control,
       two-sample  t-test.   This  tool  also provides a pre-calibrated 'synthetic control' model
       which predicts the unmodified IPD, given a 12 base  sequence  context.  In  the  synthetic
       control  case  we  use a one-sample t-test, with an adjustment to account for error in the
       synthetic control model.

INPUTS

   aligned_reads.cmp.h5
       A standard cmp.h5 file contain alignments and IPD information supplies  the  kinetic  data
       used  to perform modification detection.  The standard cmp.h5 file of a SMRTportal jobs is
       data/aligned_read.cmp.h5.

   Reference Sequence
       The tool requires the reference sequence used to perform alignments.  Currently this  must
       be supplied via the path to a SMRTportal reference repository entry.

OUTPUTS

       The  modification  detection  tool  provides  results in a variety of formats suitable for
       in-depth statistical analysis, quick reference, and  comsumption  by  visualization  tools
       such  as  PacBio  SMRTView.   Results  are  generally  indexed  by  reference position and
       reference strand.  In all cases the  strand  value  refers  to  the  strand  carrying  the
       modification  in  DNA  sample.  Remember  that  the  kinetic effect of the modification is
       observed in read sequences aligning to the opposite  strand.  So  reads  aligning  to  the
       positive  strand  carry  information  about  modification  on the negative strand and vice
       versa,  but  in  this  toolkit  we  alway  report  the  strand  containing  the   putative
       modification.

   modifications.csv
       The  modifications.csv  file  contains  one row for each (reference position, strand) pair
       that appeared in the dataset  with  coverage  at  least  x.   x  defaults  to  3,  but  is
       configurable  with  '--minCoverage' flag to ipdSummary.py. The reference position index is
       1-based for compatibility with the gff file the R environment.

   Output columns
       in-silico control mode

                          ┌────────────────┬──────────────────────────────────┐
                          │Column          │ Description                      │
                          ├────────────────┼──────────────────────────────────┤
                          │refId           │ reference sequence  ID  of  this │
                          │                │ observation                      │
                          ├────────────────┼──────────────────────────────────┤
                          │tpl             │ 1-based template position        │
                          ├────────────────┼──────────────────────────────────┤
                          │strand          │ native   sample   strand   where │
                          │                │ kinetics were generated. '0'  is │
                          │                │ the   strand   of  the  original │
                          │                │ FASTA, '1'  is  opposite  strand │
                          │                │ from FASTA                       │
                          ├────────────────┼──────────────────────────────────┤
                          │base            │ the   cognate   base   at   this │
                          │                │ position in the reference        │
                          ├────────────────┼──────────────────────────────────┤
                          │score           │ Phred-transformed pvalue that  a │
                          │                │ kinetic deviation exists at this │
                          │                │ position                         │
                          └────────────────┴──────────────────────────────────┘

                          │tMean           │ capped mean of  normalized  IPDs │
                          │                │ observed at this position        │
                          ├────────────────┼──────────────────────────────────┤
                          │tErr            │ capped    standard    error   of │
                          │                │ normalized IPDs observed at this │
                          │                │ position  (standard  deviation / │
                          │                │ sqrt(coverage)                   │
                          ├────────────────┼──────────────────────────────────┤
                          │modelPrediction │ normalized mean IPD predicted by │
                          │                │ the  synthetic control model for │
                          │                │ this sequence context            │
                          ├────────────────┼──────────────────────────────────┤
                          │ipdRatio        │ tMean / modelPrediction          │
                          ├────────────────┼──────────────────────────────────┤
                          │coverage        │ count  of  valid  IPDs  at  this │
                          │                │ position  (see Filtering section │
                          │                │ for details)                     │
                          ├────────────────┼──────────────────────────────────┤
                          │frac            │ estimate  of  the  fraction   of │
                          │                │ molecules    that    carry   the │
                          │                │ modification                     │
                          ├────────────────┼──────────────────────────────────┤
                          │fracLow         │ 2.5% confidence  bound  of  frac │
                          │                │ estimate                         │
                          ├────────────────┼──────────────────────────────────┤
                          │fracUpp         │ 97.5%  confidence  bound of frac │
                          │                │ estimate                         │
                          └────────────────┴──────────────────────────────────┘

       case-control mode

                          ┌────────────────┬──────────────────────────────────┐
                          │Column          │ Description                      │
                          ├────────────────┼──────────────────────────────────┤
                          │refId           │ reference sequence  ID  of  this │
                          │                │ observation                      │
                          ├────────────────┼──────────────────────────────────┤
                          │tpl             │ 1-based template position        │
                          ├────────────────┼──────────────────────────────────┤
                          │strand          │ native   sample   strand   where │
                          │                │ kinetics were generated. '0'  is │
                          │                │ the   strand   of  the  original │
                          │                │ FASTA, '1'  is  opposite  strand │
                          │                │ from FASTA                       │
                          ├────────────────┼──────────────────────────────────┤
                          │base            │ the   cognate   base   at   this │
                          │                │ position in the reference        │
                          ├────────────────┼──────────────────────────────────┤
                          │score           │ Phred-transformed pvalue that  a │
                          │                │ kinetic deviation exists at this │
                          │                │ position                         │
                          ├────────────────┼──────────────────────────────────┤
                          │caseMean        │ mean  of  normalized  case  IPDs │
                          │                │ observed at this position        │
                          ├────────────────┼──────────────────────────────────┤
                          │controlMean     │ mean  of normalized control IPDs │
                          │                │ observed at this position        │
                          ├────────────────┼──────────────────────────────────┤
                          │caseStd         │ standard deviation of case  IPDs │
                          │                │ observed at this position        │
                          ├────────────────┼──────────────────────────────────┤
                          │controlStd      │ standard  deviation  of  control │
                          │                │ IPDs observed at this position   │
                          └────────────────┴──────────────────────────────────┘

                          │ipdRatio        │ tMean / modelPrediction          │
                          ├────────────────┼──────────────────────────────────┤
                          │testStatistic   │ t-test statistic                 │
                          ├────────────────┼──────────────────────────────────┤
                          │coverage        │ mean   of   case   and   control │
                          │                │ coverage                         │
                          ├────────────────┼──────────────────────────────────┤
                          │controlCoverage │ count  of  valid control IPDs at │
                          │                │ this  position  (see   Filtering │
                          │                │ section for details)             │
                          ├────────────────┼──────────────────────────────────┤
                          │caseCoverage    │ count of valid case IPDs at this │
                          │                │ position (see Filtering  section │
                          │                │ for details)                     │
                          └────────────────┴──────────────────────────────────┘

   modifications.gff
       The   modifications.gff   is   compliant   with   the   GFF  Version  3  specification  (‐
       http://www.sequenceontology.org/gff3.shtml). Each template position /  strand  pair  whose
       p-value  exceeds  the pvalue threshold appears as a row. The template position is 1-based,
       per the GFF  spec.   The  strand  column  refers  to  the  strand  carrying  the  detected
       modification, which is the opposite strand from those used to detect the modification. The
       GFF confidence column is a Phred-transformed pvalue of detection.

       Note on genome browser compatibility

       The modifications.gff file will not work directly with most  genome  browsers.   You  will
       likely  need  to  make  a  copy  of  the GFF file and convert the _seqid_ columns from the
       generic 'ref0000x' names generated by PacBio, to the FASTA headers present in the original
       reference FASTA file.  The mapping table is written in the header of the modifications.gff
       file in  #sequence-header tags.  This issue  will  be  resolved  in  the  1.4  release  of
       kineticsTools.

       The  auxiliary  data  column of the GFF file contains other statistics which may be useful
       downstream analysis or filtering.  In particular the coverage level of the reads  used  to
       make the call, and +/- 20bp sequence context surrounding the site.

                            ┌───────────┬──────────────────────────────────┐
                            │Column     │ Description                      │
                            ├───────────┼──────────────────────────────────┤
                            │seqid      │ Fasta contig name                │
                            ├───────────┼──────────────────────────────────┤
                            │source     │ Name of tool -- 'kinModCall'     │
                            ├───────────┼──────────────────────────────────┤
                            │type       │ Modification    type    --    in │
                            │           │ identification mode this will be │
                            │           │ m6A,  m4C, or m5C for identified │
                            │           │ bases,  or   the   generic   tag │
                            │           │ 'modified_base'   if  a  kinetic │
                            │           │ event was detected that does not │
                            │           │ match   a   known   modification │
                            │           │ signature                        │
                            ├───────────┼──────────────────────────────────┤
                            │start      │ Modification position on contig  │
                            ├───────────┼──────────────────────────────────┤
                            │end        │ Modification position on contig  │
                            ├───────────┼──────────────────────────────────┤
                            │score      │ Phred  transformed  p-value   of │
                            │           │ detection    -   this   is   the │
                            │           │ single-site detection p-value    │
                            ├───────────┼──────────────────────────────────┤
                            │strand     │ Sample     strand     containing │
                            │           │ modification                     │
                            └───────────┴──────────────────────────────────┘

                            │phase      │ Not applicable                   │
                            ├───────────┼──────────────────────────────────┤
                            │attributes │ Extra  fields  relevant  to base │
                            │           │ mods.  IPDRatio  is  traditional │
                            │           │ IPDRatio,    context    is   the │
                            │           │ reference  sequence   -20bp   to │
                            │           │ +20bp  around  the modification, │
                            │           │ and coverage level is the number │
                            │           │ of  IPD  observations used after │
                            │           │ Mapping   QV    filtering    and │
                            │           │ accuracy  filtering.  If the row │
                            │           │ results   from   an   identified │
                            │           │ modification  we also include an │
                            │           │ identificationQv  tag  with  the │
                            │           │ from       the      modification │
                            │           │ identification        procedure. │
                            │           │ identificationQv      is     the │
                            │           │ phred-transformed probability of │
                            │           │ an incorrect identification, for │
                            │           │ bases that  were  identified  as │
                            │           │ having        a       particular │
                            │           │ modification.   frac,   fracLow, │
                            │           │ fracUp    are    the   estimated │
                            │           │ fraction of  molecules  carrying │
                            │           │ the  modification,  and  the  5% │
                            │           │ confidence  intervals   of   the │
                            │           │ estimate.     The     methylated │
                            │           │ fraction   estimation    is    a │
                            │           │ beta-level  feature,  and should │
                            │           │ only  be  used  for  exploratory │
                            │           │ purposes.                        │
                            └───────────┴──────────────────────────────────┘

   motifs.gff
       If  the Motif Finder tool is run, it will generate motifs.gff, which a reprocessed version
       of modifications.gff with the following changes. If a detected modification  occurs  on  a
       motif  detected  by  the  motif  finder, the modification is annotated with motif data. An
       attribute 'motif' is added containing the motif string, and an  attribute  'id'  is  added
       containing   the   motif   id,   which   is  the  motif  string  for  unpaired  motifs  or
       'motifString1/motifString2' for paired motifs. If a motif instance exists in  the  genome,
       but was not detected in modifications.gff, an entry is added to motifs.gff, indicating the
       presence of that motif and the kinetics that were observed at that site.

   motif_summary.csv
       If the Motif Finder tool is run, motif_summary.csv is generated, summarizing the  modified
       motifs  discovered  by  the  tool.  The  CSV contains one row per detected motif, with the
       following columns

                        ┌───────────────────┬──────────────────────────────────┐
                        │Column             │ Description                      │
                        ├───────────────────┼──────────────────────────────────┤
                        │motifString        │ Detected motif sequence          │
                        ├───────────────────┼──────────────────────────────────┤
                        │centerPos          │ Position     in     motif     of │
                        │                   │ modification (0-based)           │
                        ├───────────────────┼──────────────────────────────────┤
                        │fraction           │ Fraction  of  instances  of this │
                        │                   │ motif with modification QV above │
                        │                   │ the QV threshold                 │
                        ├───────────────────┼──────────────────────────────────┤
                        │nDetected          │ Number   of  instances  of  this │
                        │                   │ motif with above threshold       │
                        └───────────────────┴──────────────────────────────────┘

                        │nGenome            │ Number  of  instances  of   this │
                        │                   │ motif in reference sequence      │
                        ├───────────────────┼──────────────────────────────────┤
                        │groupTag           │ A  string  idetifying  the motif │
                        │                   │ grouping. For paired motifs this │
                        │                   │ is                               │
                        │                   │ "<motifString1>/<motifString2>", │
                        │                   │ For  unpaired motifs this equals │
                        │                   │ motifString                      │
                        ├───────────────────┼──────────────────────────────────┤
                        │partnerMotifString │ motifString  of   paired   motif │
                        │                   │ (motif                      with │
                        │                   │ reverse-complementary            │
                        │                   │ motifString)                     │
                        ├───────────────────┼──────────────────────────────────┤
                        │meanScore          │ Mean Modification Qv of detected │
                        │                   │ instances                        │
                        ├───────────────────┼──────────────────────────────────┤
                        │meanIpdRatio       │ Mean  IPD  ratio   of   detected │
                        │                   │ instances                        │
                        ├───────────────────┼──────────────────────────────────┤
                        │meanCoverage       │ Mean    coverage   of   detected │
                        │                   │ instances                        │
                        ├───────────────────┼──────────────────────────────────┤
                        │objectiveScore     │ Objective score of this motif in │
                        │                   │ the motif finder algorithm       │
                        └───────────────────┴──────────────────────────────────┘

SEE ALSO

       summarizeModifications(1)

                                          December 2015                             IPDSUMMARY(1)