xenial (1) ipdSummary.1.gz

Provided by: kineticstools_0.5.1+20150828+git3aa3d96+dfsg-1_all bug

NAME

       ipdSummary - Detect DNA base-modifications from kinetic signatures.

DESCRIPTION

       kineticsTool  loads  IPDs  observed  at  each  position  in  the genome, and compares those IPDs to value
       expected for unmodified DNA, and outputs the result of this statistical test.  The expected IPD value for
       unmodified  DNA  can come from either an in-silico control or an amplified control. The in silico control
       is trained by PacBio and shipped with the package. It predicts predicts the IPD using the local  sequence
       context  around the current position.  An amplified control dataset is generated by sequencing unmodified
       DNA with the same sequence as the test sample. An  amplified  control  sample  is  usually  generated  by
       whole-genome amplification of the original sample.

   Modification Detection
       The  basic  mode  of kineticsTools does an independent comparison of IPDs at each position on the genome,
       for each strand, and emits various statistics to CSV and GFF (after applying a significance filter).

   Modifications Identification
       kineticsTools also has a Modification Identification mode that can decode multi-site  IPD  'fingerprints'
       into a reduced set of calls of specific modifications. This feature has the following benefits:

              • Different modifications occuring on the same base can be distinguished (for example m5C and m4C)

              • The signal from one modification is combined into one statistic, improving sensitivity, removing
                extra peaks, and correctly centering the call

OPTIONS

       Please call this program with --help to see the available options.

ALGORITHM

   Synthetic Control
       Studies of the relationship between IPD and sequence context reveal that most of the  variation  in  mean
       IPD  across  a genome can be predicted from a 12-base sequence context surrounding the active site of the
       DNA polymerase. The bounds of the relevant context window correspond to the window of DNA in contact with
       the  polymerase,  as  seen  in DNA/polymerase crystal structures.  To simplify the process of finding DNA
       modifications with PacBio data, the tool includes a pre-trained lookup table mapping 12-mer DNA sequences
       to mean IPDs observed in C2 chemistry.

   Filtering and Trimming
       kineticsTools  uses  the Mapping QV generated by BLASR and stored in the cmp.h5 file to ignore reads that
       aren't confidently mapped.  The default minimum Mapping QV required is 10, implying that BLASR  has  90\%
       confidence  that  the  read  is  correctly mapped. Because of the range of readlengths inherent in PacBio
       dataThis can be changed in using the --mapQvThreshold  command  line  argument,  or  via  the  SMRTPortal
       configuration dialog for Modification Detection.

       There  are  a  few  features  of  PacBio  data  that  require  special attention in order to acheive good
       modification detection performance.  kineticsTools inspects the alignment between the observed bases  and
       the reference sequence -- in order for an IPD measurement to be included in the analysis, the PacBio read
       sequence must match the reference sequence for k around the cognate base. In the current module  k=1  The
       IPD distribution at some locus be thought of as a mixture between the 'normal' incorporation process IPD,
       which is sensitive to the local sequence context  and  DNA  modifications  and  a  contaminating  'pause'
       process  IPD  which have a much longer duration (mean >10x longer than normal), but happen rarely (~1% of
       IPDs).  Note: Our current understanding is  that  pauses  do  not  carry  useful  information  about  the
       methylation  state  of  the  DNA,  however  a  more  careful  analysis  may  be warranted. Also note that
       modifications that drastically increase the Roughly 1% of observed IPDs are generated  by  pause  events.
       Capping  observed  IPDs  at  the  global  99th  percentile  is motivated by theory from robust hypothesis
       testing.  Some sequence contexts may have naturally longer IPDs, to avoid capping too much data at  those
       contexts,   the  cap  threshold  is  adjusted  per  context  as  follows:  capThreshold  =  max(global99,
       5*modelPrediction, percentile(ipdObservations, 75))

   Statistical Testing
       We test the hypothesis that IPDs observed at a particular locus in the sample have a  longer  means  than
       IPDs  observed  at  the  same  locus  in  unmodified  DNA.  If we have generated a Whole Genome Amplified
       dataset, which removes DNA modifications, we use a  case-control,  two-sample  t-test.   This  tool  also
       provides  a  pre-calibrated  'synthetic control' model which predicts the unmodified IPD, given a 12 base
       sequence context. In the synthetic control case we use a one-sample t-test, with an adjustment to account
       for error in the synthetic control model.

INPUTS

   aligned_reads.cmp.h5
       A  standard  cmp.h5 file contain alignments and IPD information supplies the kinetic data used to perform
       modification detection.  The standard cmp.h5 file of a SMRTportal jobs is data/aligned_read.cmp.h5.

   Reference Sequence
       The tool requires the reference sequence used to perform alignments.  Currently this must be supplied via
       the path to a SMRTportal reference repository entry.

OUTPUTS

       The  modification  detection  tool  provides  results  in  a  variety  of  formats  suitable for in-depth
       statistical analysis, quick reference, and comsumption by visualization tools such  as  PacBio  SMRTView.
       Results  are generally indexed by reference position and reference strand.  In all cases the strand value
       refers to the strand carrying the modification in DNA sample. Remember that the  kinetic  effect  of  the
       modification  is  observed  in  read  sequences aligning to the opposite strand. So reads aligning to the
       positive strand carry information about modification on the negative strand and vice versa, but  in  this
       toolkit we alway report the strand containing the putative modification.

   modifications.csv
       The  modifications.csv  file contains one row for each (reference position, strand) pair that appeared in
       the dataset with coverage at least x.  x defaults to 3, but is configurable with '--minCoverage' flag  to
       ipdSummary.py.  The  reference  position  index  is  1-based  for  compatibility  with the gff file the R
       environment.

   Output columns
       in-silico control mode

                               ┌────────────────┬───────────────────────────────────────┐
                               │Column          │ Description                           │
                               ├────────────────┼───────────────────────────────────────┤
                               │refId           │ reference   sequence   ID   of   this │
                               │                │ observation                           │
                               ├────────────────┼───────────────────────────────────────┤
                               │tpl             │ 1-based template position             │
                               ├────────────────┼───────────────────────────────────────┤
                               │strand          │ native  sample  strand where kinetics │
                               │                │ were generated. '0' is the strand  of │
                               │                │ the  original  FASTA, '1' is opposite │
                               │                │ strand from FASTA                     │
                               ├────────────────┼───────────────────────────────────────┤
                               │base            │ the cognate base at this position  in │
                               │                │ the reference                         │
                               ├────────────────┼───────────────────────────────────────┤
                               │score           │ Phred-transformed   pvalue   that   a │
                               │                │ kinetic  deviation  exists  at   this │
                               │                │ position                              │
                               ├────────────────┼───────────────────────────────────────┤
                               │tMean           │ capped   mean   of   normalized  IPDs │
                               │                │ observed at this position             │
                               ├────────────────┼───────────────────────────────────────┤
                               │tErr            │ capped standard error  of  normalized │
                               │                │ IPDs   observed   at   this  position │
                               │                │ (standard deviation / sqrt(coverage)  │
                               ├────────────────┼───────────────────────────────────────┤
                               │modelPrediction │ normalized mean IPD predicted by  the │
                               │                │ synthetic   control  model  for  this │
                               │                │ sequence context                      │
                               ├────────────────┼───────────────────────────────────────┤
                               │ipdRatio        │ tMean / modelPrediction               │
                               └────────────────┴───────────────────────────────────────┘

                               │coverage        │ count of valid IPDs at this  position │
                               │                │ (see Filtering section for details)   │
                               ├────────────────┼───────────────────────────────────────┤
                               │frac            │ estimate of the fraction of molecules │
                               │                │ that carry the modification           │
                               ├────────────────┼───────────────────────────────────────┤
                               │fracLow         │ 2.5%   confidence   bound   of   frac │
                               │                │ estimate                              │
                               ├────────────────┼───────────────────────────────────────┤
                               │fracUpp         │ 97.5%   confidence   bound   of  frac │
                               │                │ estimate                              │
                               └────────────────┴───────────────────────────────────────┘

       case-control mode

                               ┌────────────────┬───────────────────────────────────────┐
                               │Column          │ Description                           │
                               ├────────────────┼───────────────────────────────────────┤
                               │refId           │ reference   sequence   ID   of   this │
                               │                │ observation                           │
                               ├────────────────┼───────────────────────────────────────┤
                               │tpl             │ 1-based template position             │
                               ├────────────────┼───────────────────────────────────────┤
                               │strand          │ native  sample  strand where kinetics │
                               │                │ were generated. '0' is the strand  of │
                               │                │ the  original  FASTA, '1' is opposite │
                               │                │ strand from FASTA                     │
                               ├────────────────┼───────────────────────────────────────┤
                               │base            │ the cognate base at this position  in │
                               │                │ the reference                         │
                               ├────────────────┼───────────────────────────────────────┤
                               │score           │ Phred-transformed   pvalue   that   a │
                               │                │ kinetic  deviation  exists  at   this │
                               │                │ position                              │
                               ├────────────────┼───────────────────────────────────────┤
                               │caseMean        │ mean of normalized case IPDs observed │
                               │                │ at this position                      │
                               ├────────────────┼───────────────────────────────────────┤
                               │controlMean     │ mean  of  normalized   control   IPDs │
                               │                │ observed at this position             │
                               ├────────────────┼───────────────────────────────────────┤
                               │caseStd         │ standard   deviation   of  case  IPDs │
                               │                │ observed at this position             │
                               ├────────────────┼───────────────────────────────────────┤
                               │controlStd      │ standard deviation  of  control  IPDs │
                               │                │ observed at this position             │
                               ├────────────────┼───────────────────────────────────────┤
                               │ipdRatio        │ tMean / modelPrediction               │
                               ├────────────────┼───────────────────────────────────────┤
                               │testStatistic   │ t-test statistic                      │
                               ├────────────────┼───────────────────────────────────────┤
                               │coverage        │ mean of case and control coverage     │
                               ├────────────────┼───────────────────────────────────────┤
                               │controlCoverage │ count  of  valid control IPDs at this │
                               │                │ position (see Filtering  section  for │
                               │                │ details)                              │
                               ├────────────────┼───────────────────────────────────────┤
                               │caseCoverage    │ count  of  valid  case  IPDs  at this │
                               │                │ position (see Filtering  section  for │
                               │                │ details)                              │
                               └────────────────┴───────────────────────────────────────┘

   modifications.gff
       The     modifications.gff    is    compliant    with    the    GFF    Version    3    specification    (‐
       http://www.sequenceontology.org/gff3.shtml). Each template position / strand pair whose  p-value  exceeds
       the  pvalue  threshold  appears as a row. The template position is 1-based, per the GFF spec.  The strand
       column refers to the strand carrying the detected modification, which is the opposite strand  from  those
       used to detect the modification. The GFF confidence column is a Phred-transformed pvalue of detection.

       Note on genome browser compatibility

       The  modifications.gff  file  will  not work directly with most genome browsers.  You will likely need to
       make a copy of the GFF file and convert the _seqid_ columns from the generic 'ref0000x'  names  generated
       by  PacBio,  to  the  FASTA  headers  present in the original reference FASTA file.  The mapping table is
       written in the header of the modifications.gff file  in   #sequence-header  tags.   This  issue  will  be
       resolved in the 1.4 release of kineticsTools.

       The  auxiliary  data  column  of  the  GFF  file contains other statistics which may be useful downstream
       analysis or filtering.  In particular the coverage level of the reads used to make the call, and +/- 20bp
       sequence context surrounding the site.

                                 ┌───────────┬───────────────────────────────────────┐
                                 │Column     │ Description                           │
                                 ├───────────┼───────────────────────────────────────┤
                                 │seqid      │ Fasta contig name                     │
                                 ├───────────┼───────────────────────────────────────┤
                                 │source     │ Name of tool -- 'kinModCall'          │
                                 ├───────────┼───────────────────────────────────────┤
                                 │type       │ Modification      type      --     in │
                                 │           │ identification mode this will be m6A, │
                                 │           │ m4C,  or m5C for identified bases, or │
                                 │           │ the generic tag 'modified_base' if  a │
                                 │           │ kinetic  event was detected that does │
                                 │           │ not  match   a   known   modification │
                                 │           │ signature                             │
                                 ├───────────┼───────────────────────────────────────┤
                                 │start      │ Modification position on contig       │
                                 ├───────────┼───────────────────────────────────────┤
                                 │end        │ Modification position on contig       │
                                 ├───────────┼───────────────────────────────────────┤
                                 │score      │ Phred    transformed    p-value    of │
                                 │           │ detection - this is  the  single-site │
                                 │           │ detection p-value                     │
                                 ├───────────┼───────────────────────────────────────┤
                                 │strand     │ Sample strand containing modification │
                                 ├───────────┼───────────────────────────────────────┤
                                 │phase      │ Not applicable                        │
                                 └───────────┴───────────────────────────────────────┘

                                 │attributes │ Extra  fields  relevant to base mods. │
                                 │           │ IPDRatio  is  traditional   IPDRatio, │
                                 │           │ context  is  the  reference  sequence │
                                 │           │ -20bp    to    +20bp    around    the │
                                 │           │ modification,  and  coverage level is │
                                 │           │ the number of IPD  observations  used │
                                 │           │ after   Mapping   QV   filtering  and │
                                 │           │ accuracy  filtering.   If   the   row │
                                 │           │ results     from     an    identified │
                                 │           │ modification  we  also   include   an │
                                 │           │ identificationQv  tag  with  the from │
                                 │           │ the    modification    identification │
                                 │           │ procedure.  identificationQv  is  the │
                                 │           │ phred-transformed probability  of  an │
                                 │           │ incorrect  identification,  for bases │
                                 │           │ that  were  identified  as  having  a │
                                 │           │ particular     modification.    frac, │
                                 │           │ fracLow,  fracUp  are  the  estimated │
                                 │           │ fraction  of  molecules  carrying the │
                                 │           │ modification, and the  5%  confidence │
                                 │           │ intervals   of   the   estimate.  The │
                                 │           │ methylated fraction estimation  is  a │
                                 │           │ beta-level  feature,  and should only │
                                 │           │ be used for exploratory purposes.     │
                                 └───────────┴───────────────────────────────────────┘

   motifs.gff
       If the Motif  Finder  tool  is  run,  it  will  generate  motifs.gff,  which  a  reprocessed  version  of
       modifications.gff  with  the  following changes. If a detected modification occurs on a motif detected by
       the motif finder, the modification is annotated with motif data. An attribute 'motif' is added containing
       the  motif  string, and an attribute 'id' is added containing the motif id, which is the motif string for
       unpaired motifs or 'motifString1/motifString2' for paired motifs. If  a  motif  instance  exists  in  the
       genome,  but  was  not  detected  in  modifications.gff,  an entry is added to motifs.gff, indicating the
       presence of that motif and the kinetics that were observed at that site.

   motif_summary.csv
       If the Motif Finder tool  is  run,  motif_summary.csv  is  generated,  summarizing  the  modified  motifs
       discovered by the tool. The CSV contains one row per detected motif, with the following columns

                             ┌───────────────────┬───────────────────────────────────────┐
                             │Column             │ Description                           │
                             ├───────────────────┼───────────────────────────────────────┤
                             │motifString        │ Detected motif sequence               │
                             ├───────────────────┼───────────────────────────────────────┤
                             │centerPos          │ Position  in  motif  of  modification │
                             │                   │ (0-based)                             │
                             ├───────────────────┼───────────────────────────────────────┤
                             │fraction           │ Fraction of instances of  this  motif │
                             │                   │ with  modification  QV  above  the QV │
                             │                   │ threshold                             │
                             ├───────────────────┼───────────────────────────────────────┤
                             │nDetected          │ Number of  instances  of  this  motif │
                             │                   │ with above threshold                  │
                             ├───────────────────┼───────────────────────────────────────┤
                             │nGenome            │ Number  of instances of this motif in │
                             │                   │ reference sequence                    │
                             ├───────────────────┼───────────────────────────────────────┤
                             │groupTag           │ A   string   idetifying   the   motif │
                             │                   │ grouping.  For  paired motifs this is │
                             │                   │ "<motifString1>/<motifString2>",  For │
                             │                   │ unpaired     motifs    this    equals │
                             │                   │ motifString                           │
                             └───────────────────┴───────────────────────────────────────┘

                             │partnerMotifString │ motifString of  paired  motif  (motif │
                             │                   │ with            reverse-complementary │
                             │                   │ motifString)                          │
                             ├───────────────────┼───────────────────────────────────────┤
                             │meanScore          │ Mean  Modification  Qv  of   detected │
                             │                   │ instances                             │
                             ├───────────────────┼───────────────────────────────────────┤
                             │meanIpdRatio       │ Mean IPD ratio of detected instances  │
                             ├───────────────────┼───────────────────────────────────────┤
                             │meanCoverage       │ Mean coverage of detected instances   │
                             ├───────────────────┼───────────────────────────────────────┤
                             │objectiveScore     │ Objective  score of this motif in the │
                             │                   │ motif finder algorithm                │
                             └───────────────────┴───────────────────────────────────────┘

SEE ALSO

       summarizeModifications(1)

                                                  December 2015                                    IPDSUMMARY(1)