Provided by: kineticstools_0.6.1+git20220223.1326a4d+dfsg-3_all bug

NAME

       ipdSummary - Detect DNA base-modifications from kinetic signatures.

DESCRIPTION

       kineticsTool  loads  IPDs observed at each position in the genome, and compares those IPDs
       to value expected for unmodified DNA, and outputs the result  of  this  statistical  test.
       The  expected IPD value for unmodified DNA can come from either an in-silico control or an
       amplified control. The in silico control  is  trained  by  PacBio  and  shipped  with  the
       package.  It predicts predicts the IPD using the local sequence context around the current
       position.  An amplified control dataset is generated by sequencing unmodified DNA with the
       same  sequence  as  the  test  sample. An amplified control sample is usually generated by
       whole-genome amplification of the original sample.

   Dependencies
       kineticsTools depends on:

              • pbcore ( <http://github.com/PacificBiosciences/pbcore> )

              • numpy

              • scipy

   Modification Detection
       The basic mode of kineticsTools does an independent comparison of IPDs at each position on
       the genome, for each strand, and emits various statistics to CSV and GFF (after applying a
       significance filter).

   Modifications Identification
       kineticsTools also has a Modification Identification mode that can decode  multi-site  IPD
       'fingerprints' into a reduced set of calls of specific modifications. This feature has the
       following benefits:

              • Different modifications occurring on the same  base  can  be  distinguished  (for
                example m5C and m4C)

              • The  signal  from  one  modification  is  combined  into one statistic, improving
                sensitivity, removing extra peaks, and correctly centering the call

OPTIONS

       Please call this program with --help to see the available options.

ALGORITHM

   Synthetic Control
       Studies of the relationship between IPD and sequence  context  reveal  that  most  of  the
       variation  in  mean  IPD  across a genome can be predicted from a 12-base sequence context
       surrounding the active site of the DNA polymerase. The  bounds  of  the  relevant  context
       window  correspond  to  the  window  of  DNA  in  contact  with the polymerase, as seen in
       DNA/polymerase crystal structures.  To simplify the process of finding  DNA  modifications
       with  PacBio  data,  the  tool  includes  a  pre-trained  lookup  table mapping 12-mer DNA
       sequences to mean IPDs observed in C2 chemistry.

   Filtering and Trimming
       kineticsTools uses the Mapping QV generated by BLASR or pbmm2 and stored in the alignments
       or BAM file (or AlignmentSet) to ignore reads that aren't confidently mapped.  The default
       minimum Mapping QV required is 10, implying that BLASR has 90\% confidence that  the  read
       is  correctly  mapped. Because of the range of readlengths inherent in PacBio dataThis can
       be changed in using the --mapQvThreshold command line  argument,  or  via  the  SMRTPortal
       configuration dialog for Modification Detection.

       There are a few features of PacBio data that require special attention in order to achieve
       good modification detection performance.  kineticsTools inspects the alignment between the
       observed  bases  and  the  reference  sequence  --  in  order for an IPD measurement to be
       included in the analysis, the PacBio read sequence must match the reference sequence for k
       around  the  cognate base. In the current module k=1 The IPD distribution at some locus be
       thought of as a mixture between the 'normal' incorporation process IPD, which is sensitive
       to  the  local  sequence context and DNA modifications and a contaminating 'pause' process
       IPD which have a much longer duration (mean >10x longer than normal),  but  happen  rarely
       (~1%  of  IPDs).   Note:  Our  current  understanding  is  that pauses do not carry useful
       information about the methylation state of the DNA, however a more careful analysis may be
       warranted.  Also  note  that  modifications  that  drastically  increase the Roughly 1% of
       observed IPDs are generated by pause events. Capping observed  IPDs  at  the  global  99th
       percentile  is motivated by theory from robust hypothesis testing.  Some sequence contexts
       may have naturally longer IPDs, to avoid capping too much data at those contexts, the  cap
       threshold   is   adjusted   per   context   as   follows:   capThreshold  =  max(global99,
       5*modelPrediction, percentile(ipdObservations, 75))

   Statistical Testing
       We test the hypothesis that IPDs observed at a particular  locus  in  the  sample  have  a
       longer means than IPDs observed at the same locus in unmodified DNA.  If we have generated
       a Whole Genome Amplified dataset, which removes DNA modifications, we use a  case-control,
       two-sample  t-test.   This  tool  also provides a pre-calibrated 'synthetic control' model
       which predicts the unmodified IPD, given a 12 base  sequence  context.  In  the  synthetic
       control  case  we  use a one-sample t-test, with an adjustment to account for error in the
       synthetic control model.

EXAMPLE USAGE

       Basic use with BAM input, GFF output:

          ipdSummary aligned.bam --reference ref.fasta --identify m6A,m4C --gff basemods.gff

       With dataset input, methyl fraction calculation and GFF+CSV output:

          ipdSummary mapped.alignmentset.xml --reference ref.fasta --identify m6A,m4C --methylFraction --gff basemods.gff --csv kinetics.csv

INPUTS

   Aligned Reads
       A standard PacBio alignment file - either AlignmentSet XML or BAM - containing  alignments
       and IPD information supplies the kinetic data used to perform modification detection.

   Reference Sequence
       The tool requires the reference sequence used to perform alignments.  This can be either a
       FASTA file or a ReferenceSet XML.

OUTPUTS

       The modification detection tool provides results in a  variety  of  formats  suitable  for
       in-depth  statistical  analysis,  quick  reference, and comsumption by visualization tools
       such as PacBio  SMRTView.   Results  are  generally  indexed  by  reference  position  and
       reference  strand.   In  all  cases  the  strand  value  refers to the strand carrying the
       modification in DNA sample. Remember that  the  kinetic  effect  of  the  modification  is
       observed  in  read  sequences  aligning  to  the opposite strand. So reads aligning to the
       positive strand carry information about modification  on  the  negative  strand  and  vice
       versa,   but  in  this  toolkit  we  alway  report  the  strand  containing  the  putative
       modification.

       The following output options are available:

          • --gff FILENAME: GFF format

          • --csv FILENAME: comma-separated value format

          • --bigwig FILENAME: BigWig file (mostly only useful for SMRTView)

       If you are running base modification analysis through SMRT Link or a pbsmrtpipe  pipeline,
       the GFF, HDF5, and BigWig outputs are automatically generated.

   modifications.gff
       The   modifications.gff   is   compliant   with   the   GFF   Version  3  specification  (
       <http://www.sequenceontology.org/gff3.shtml> ). Each template position / strand pair whose
       p-value  exceeds  the pvalue threshold appears as a row. The template position is 1-based,
       per the GFF  spec.   The  strand  column  refers  to  the  strand  carrying  the  detected
       modification, which is the opposite strand from those used to detect the modification. The
       GFF confidence column is a Phred-transformed pvalue of detection.

       Note on genome browser compatibility

       The modifications.gff file will not work directly with most  genome  browsers.   You  will
       likely  need  to  make  a  copy  of  the GFF file and convert the _seqid_ columns from the
       generic 'ref0000x' names generated by PacBio, to the FASTA headers present in the original
       reference FASTA file.  The mapping table is written in the header of the modifications.gff
       file in  #sequence-header tags.  This issue  will  be  resolved  in  the  1.4  release  of
       kineticsTools.

       The  auxiliary  data  column of the GFF file contains other statistics which may be useful
       downstream analysis or filtering.  In particular the coverage level of the reads  used  to
       make the call, and +/- 20bp sequence context surrounding the site.

                            ┌───────────┬──────────────────────────────────┐
                            │Column     │ Description                      │
                            ├───────────┼──────────────────────────────────┤
                            │seqid      │ Fasta contig name                │
                            ├───────────┼──────────────────────────────────┤
                            │source     │ Name of tool -- 'kinModCall'     │
                            ├───────────┼──────────────────────────────────┤
                            │type       │ Modification    type    --    in │
                            │           │ identification mode this will be │
                            │           │ m6A,  m4C, or m5C for identified │
                            │           │ bases,  or   the   generic   tag │
                            │           │ 'modified_base'   if  a  kinetic │
                            │           │ event was detected that does not │
                            │           │ match   a   known   modification │
                            │           │ signature                        │
                            ├───────────┼──────────────────────────────────┤
                            │start      │ Modification position on contig  │
                            ├───────────┼──────────────────────────────────┤
                            │end        │ Modification position on contig  │
                            ├───────────┼──────────────────────────────────┤
                            │score      │ Phred  transformed  p-value   of │
                            │           │ detection    -   this   is   the │
                            │           │ single-site detection p-value    │
                            ├───────────┼──────────────────────────────────┤
                            │strand     │ Sample     strand     containing │
                            │           │ modification                     │
                            ├───────────┼──────────────────────────────────┤
                            │phase      │ Not applicable                   │
                            ├───────────┼──────────────────────────────────┤
                            │attributes │ Extra  fields  relevant  to base │
                            │           │ mods.  IPDRatio  is  traditional │
                            │           │ IPDRatio,    context    is   the │
                            │           │ reference  sequence   -20bp   to │
                            │           │ +20bp  around  the modification, │
                            │           │ and coverage level is the number │
                            │           │ of  IPD  observations used after │
                            │           │ Mapping   QV    filtering    and │
                            │           │ accuracy  filtering.  If the row │
                            │           │ results   from   an   identified │
                            │           │ modification  we also include an │
                            │           │ identificationQv  tag  with  the │
                            │           │ from       the      modification │
                            │           │ identification        procedure. │
                            │           │ identificationQv      is     the │
                            │           │ phred-transformed probability of │
                            │           │ an incorrect identification, for │
                            │           │ bases that  were  identified  as │
                            │           │ having        a       particular │
                            │           │ modification.   frac,   fracLow, │
                            │           │ fracUp    are    the   estimated │
                            │           │ fraction of  molecules  carrying │
                            │           │ the  modification,  and  the  5% │
                            │           │ confidence  intervals   of   the │
                            │           │ estimate.     The     methylated │
                            │           │ fraction   estimation    is    a │
                            │           │ beta-level  feature,  and should │
                            │           │ only  be  used  for  exploratory │
                            │           │ purposes.                        │
                            └───────────┴──────────────────────────────────┘
   modifications.csv
       The  modifications.csv  file  contains  one row for each (reference position, strand) pair
       that appeared in the dataset  with  coverage  at  least  x.   x  defaults  to  3,  but  is
       configurable  with  '--minCoverage' flag to ipdSummary.py. The reference position index is
       1-based for compatibility with the gff file the R environment.  Note that this output type
       scales  poorly  and  is  not recommended for large genomes; the HDF5 output should perform
       much better in  these  cases.   We  have  preserved  the  CSV  option  to  support  legacy
       applications but this is no longer produce by the pipelines in SMRT Link/pbsmrtpipe.

   Output columns
       in-silico control mode

                          ┌────────────────┬──────────────────────────────────┐
                          │Column          │ Description                      │
                          ├────────────────┼──────────────────────────────────┤
                          │refId           │ reference  sequence  ID  of this │
                          │                │ observation                      │
                          ├────────────────┼──────────────────────────────────┤
                          │tpl             │ 1-based template position        │
                          ├────────────────┼──────────────────────────────────┤
                          │strand          │ native   sample   strand   where │
                          │                │ kinetics  were generated. '0' is │
                          │                │ the  strand  of   the   original │
                          │                │ FASTA,  '1'  is  opposite strand │
                          │                │ from FASTA                       │
                          ├────────────────┼──────────────────────────────────┤
                          │base            │ the   cognate   base   at   this │
                          │                │ position in the reference        │
                          ├────────────────┼──────────────────────────────────┤
                          │score           │ Phred-transformed  pvalue that a │
                          │                │ kinetic deviation exists at this │
                          │                │ position                         │
                          ├────────────────┼──────────────────────────────────┤
                          │tMean           │ capped  mean  of normalized IPDs │
                          │                │ observed at this position        │
                          ├────────────────┼──────────────────────────────────┤
                          │tErr            │ capped   standard    error    of │
                          │                │ normalized IPDs observed at this │
                          │                │ position (standard  deviation  / │
                          │                │ sqrt(coverage)                   │
                          ├────────────────┼──────────────────────────────────┤
                          │modelPrediction │ normalized mean IPD predicted by │
                          │                │ the synthetic control model  for │
                          │                │ this sequence context            │
                          ├────────────────┼──────────────────────────────────┤
                          │ipdRatio        │ tMean / modelPrediction          │
                          ├────────────────┼──────────────────────────────────┤
                          │coverage        │ count  of  valid  IPDs  at  this │
                          │                │ position (see Filtering  section │
                          │                │ for details)                     │
                          ├────────────────┼──────────────────────────────────┤
                          │frac            │ estimate   of  the  fraction  of │
                          │                │ molecules   that    carry    the │
                          │                │ modification                     │
                          ├────────────────┼──────────────────────────────────┤
                          │fracLow         │ 2.5%  confidence  bound  of frac │
                          │                │ estimate                         │
                          ├────────────────┼──────────────────────────────────┤
                          │fracUpp         │ 97.5% confidence bound  of  frac │
                          │                │ estimate                         │
                          └────────────────┴──────────────────────────────────┘
       case-control mode

                          ┌────────────────┬──────────────────────────────────┐
                          │Column          │ Description                      │
                          ├────────────────┼──────────────────────────────────┤
                          │refId           │ reference  sequence  ID  of this │
                          │                │ observation                      │
                          ├────────────────┼──────────────────────────────────┤
                          │tpl             │ 1-based template position        │
                          ├────────────────┼──────────────────────────────────┤
                          │strand          │ native   sample   strand   where │
                          │                │ kinetics  were generated. '0' is │
                          │                │ the  strand  of   the   original │
                          │                │ FASTA,  '1'  is  opposite strand │
                          │                │ from FASTA                       │
                          ├────────────────┼──────────────────────────────────┤
                          │base            │ the   cognate   base   at   this │
                          │                │ position in the reference        │
                          ├────────────────┼──────────────────────────────────┤
                          │score           │ Phred-transformed  pvalue that a │
                          │                │ kinetic deviation exists at this │
                          │                │ position                         │
                          ├────────────────┼──────────────────────────────────┤
                          │caseMean        │ mean  of  normalized  case  IPDs │
                          │                │ observed at this position        │
                          ├────────────────┼──────────────────────────────────┤
                          │controlMean     │ mean of normalized control  IPDs │
                          │                │ observed at this position        │
                          ├────────────────┼──────────────────────────────────┤
                          │caseStd         │ standard  deviation of case IPDs │
                          │                │ observed at this position        │
                          ├────────────────┼──────────────────────────────────┤
                          │controlStd      │ standard  deviation  of  control │
                          │                │ IPDs observed at this position   │
                          ├────────────────┼──────────────────────────────────┤
                          │ipdRatio        │ tMean / modelPrediction          │
                          ├────────────────┼──────────────────────────────────┤
                          │testStatistic   │ t-test statistic                 │
                          ├────────────────┼──────────────────────────────────┤
                          │coverage        │ mean   of   case   and   control │
                          │                │ coverage                         │
                          ├────────────────┼──────────────────────────────────┤
                          │controlCoverage │ count of valid control  IPDs  at │
                          │                │ this   position  (see  Filtering │
                          │                │ section for details)             │
                          ├────────────────┼──────────────────────────────────┤
                          │caseCoverage    │ count of valid case IPDs at this │
                          │                │ position  (see Filtering section │
                          │                │ for details)                     │
                          └────────────────┴──────────────────────────────────┘

SEE ALSO

       summarizeModifications(1)

                                          December 2015                             IPDSUMMARY(1)