Provided by: kineticstools_0.6.1+git20210811.276ca9c+dfsg-1build1_all 

NAME
ipdSummary - Detect DNA base-modifications from kinetic signatures.
DESCRIPTION
kineticsTool loads IPDs observed at each position in the genome, and compares those IPDs to value
expected for unmodified DNA, and outputs the result of this statistical test. The expected IPD value for
unmodified DNA can come from either an in-silico control or an amplified control. The in silico control
is trained by PacBio and shipped with the package. It predicts predicts the IPD using the local sequence
context around the current position. An amplified control dataset is generated by sequencing unmodified
DNA with the same sequence as the test sample. An amplified control sample is usually generated by
whole-genome amplification of the original sample.
Dependencies
kineticsTools depends on:
• pbcore (http://github.com/PacificBiosciences/pbcore)
• numpy
• scipy
Modification Detection
The basic mode of kineticsTools does an independent comparison of IPDs at each position on the genome,
for each strand, and emits various statistics to CSV and GFF (after applying a significance filter).
Modifications Identification
kineticsTools also has a Modification Identification mode that can decode multi-site IPD 'fingerprints'
into a reduced set of calls of specific modifications. This feature has the following benefits:
• Different modifications occurring on the same base can be distinguished (for example m5C and
m4C)
• The signal from one modification is combined into one statistic, improving sensitivity, removing
extra peaks, and correctly centering the call
OPTIONS
Please call this program with --help to see the available options.
ALGORITHM
Synthetic Control
Studies of the relationship between IPD and sequence context reveal that most of the variation in mean
IPD across a genome can be predicted from a 12-base sequence context surrounding the active site of the
DNA polymerase. The bounds of the relevant context window correspond to the window of DNA in contact with
the polymerase, as seen in DNA/polymerase crystal structures. To simplify the process of finding DNA
modifications with PacBio data, the tool includes a pre-trained lookup table mapping 12-mer DNA sequences
to mean IPDs observed in C2 chemistry.
Filtering and Trimming
kineticsTools uses the Mapping QV generated by BLASR or pbmm2 and stored in the alignments or BAM file
(or AlignmentSet) to ignore reads that aren't confidently mapped. The default minimum Mapping QV
required is 10, implying that BLASR has 90\% confidence that the read is correctly mapped. Because of the
range of readlengths inherent in PacBio dataThis can be changed in using the --mapQvThreshold command
line argument, or via the SMRTPortal configuration dialog for Modification Detection.
There are a few features of PacBio data that require special attention in order to achieve good
modification detection performance. kineticsTools inspects the alignment between the observed bases and
the reference sequence -- in order for an IPD measurement to be included in the analysis, the PacBio read
sequence must match the reference sequence for k around the cognate base. In the current module k=1 The
IPD distribution at some locus be thought of as a mixture between the 'normal' incorporation process IPD,
which is sensitive to the local sequence context and DNA modifications and a contaminating 'pause'
process IPD which have a much longer duration (mean >10x longer than normal), but happen rarely (~1% of
IPDs). Note: Our current understanding is that pauses do not carry useful information about the
methylation state of the DNA, however a more careful analysis may be warranted. Also note that
modifications that drastically increase the Roughly 1% of observed IPDs are generated by pause events.
Capping observed IPDs at the global 99th percentile is motivated by theory from robust hypothesis
testing. Some sequence contexts may have naturally longer IPDs, to avoid capping too much data at those
contexts, the cap threshold is adjusted per context as follows: capThreshold = max(global99,
5*modelPrediction, percentile(ipdObservations, 75))
Statistical Testing
We test the hypothesis that IPDs observed at a particular locus in the sample have a longer means than
IPDs observed at the same locus in unmodified DNA. If we have generated a Whole Genome Amplified
dataset, which removes DNA modifications, we use a case-control, two-sample t-test. This tool also
provides a pre-calibrated 'synthetic control' model which predicts the unmodified IPD, given a 12 base
sequence context. In the synthetic control case we use a one-sample t-test, with an adjustment to account
for error in the synthetic control model.
EXAMPLE USAGE
Basic use with BAM input, GFF output:
ipdSummary aligned.bam --reference ref.fasta --identify m6A,m4C --gff basemods.gff
With dataset input, methyl fraction calculation and GFF+CSV output:
ipdSummary mapped.alignmentset.xml --reference ref.fasta --identify m6A,m4C --methylFraction --gff basemods.gff --csv kinetics.csv
INPUTS
Aligned Reads
A standard PacBio alignment file - either AlignmentSet XML or BAM - containing alignments and IPD
information supplies the kinetic data used to perform modification detection.
Reference Sequence
The tool requires the reference sequence used to perform alignments. This can be either a FASTA file or
a ReferenceSet XML.
OUTPUTS
The modification detection tool provides results in a variety of formats suitable for in-depth
statistical analysis, quick reference, and comsumption by visualization tools such as PacBio SMRTView.
Results are generally indexed by reference position and reference strand. In all cases the strand value
refers to the strand carrying the modification in DNA sample. Remember that the kinetic effect of the
modification is observed in read sequences aligning to the opposite strand. So reads aligning to the
positive strand carry information about modification on the negative strand and vice versa, but in this
toolkit we alway report the strand containing the putative modification.
The following output options are available:
• --gff FILENAME: GFF format
• --csv FILENAME: comma-separated value format
• --bigwig FILENAME: BigWig file (mostly only useful for SMRTView)
If you are running base modification analysis through SMRT Link or a pbsmrtpipe pipeline, the GFF, HDF5,
and BigWig outputs are automatically generated.
modifications.gff
The modifications.gff is compliant with the GFF Version 3 specification (‐
http://www.sequenceontology.org/gff3.shtml). Each template position / strand pair whose p-value exceeds
the pvalue threshold appears as a row. The template position is 1-based, per the GFF spec. The strand
column refers to the strand carrying the detected modification, which is the opposite strand from those
used to detect the modification. The GFF confidence column is a Phred-transformed pvalue of detection.
Note on genome browser compatibility
The modifications.gff file will not work directly with most genome browsers. You will likely need to
make a copy of the GFF file and convert the _seqid_ columns from the generic 'ref0000x' names generated
by PacBio, to the FASTA headers present in the original reference FASTA file. The mapping table is
written in the header of the modifications.gff file in #sequence-header tags. This issue will be
resolved in the 1.4 release of kineticsTools.
The auxiliary data column of the GFF file contains other statistics which may be useful downstream
analysis or filtering. In particular the coverage level of the reads used to make the call, and +/- 20bp
sequence context surrounding the site.
──────────────────────────────────────────────────────
│ Column │ Description │
├────────────┼───────────────────────────────────────┤
│ seqid │ Fasta contig name │
├────────────┼───────────────────────────────────────┤
│ source │ Name of tool -- 'kinModCall' │
├────────────┼───────────────────────────────────────┤
│ type │ Modification type -- in │
│ │ identification mode this will be m6A, │
│ │ m4C, or m5C for identified bases, or │
│ │ the generic tag 'modified_base' if a │
│ │ kinetic event was detected that does │
│ │ not match a known modification │
│ │ signature │
├────────────┼───────────────────────────────────────┤
│ start │ Modification position on contig │
├────────────┼───────────────────────────────────────┤
│ end │ Modification position on contig │
├────────────┼───────────────────────────────────────┤
│ score │ Phred transformed p-value of │
│ │ detection - this is the single-site │
│ │ detection p-value │
├────────────┼───────────────────────────────────────┤
│ strand │ Sample strand containing modification │
├────────────┼───────────────────────────────────────┤
│ phase │ Not applicable │
├────────────┼───────────────────────────────────────┤
│ attributes │ Extra fields relevant to base mods. │
│ │ IPDRatio is traditional IPDRatio, │
│ │ context is the reference sequence │
│ │ -20bp to +20bp around the │
│ │ modification, and coverage level is │
│ │ the number of IPD observations used │
│ │ after Mapping QV filtering and │
│ │ accuracy filtering. If the row │
│ │ results from an identified │
│ │ modification we also include an │
│ │ identificationQv tag with the from │
│ │ the modification identification │
│ │ procedure. identificationQv is the │
│ │ phred-transformed probability of an │
│ │ incorrect identification, for bases │
│ │ that were identified as having a │
│ │ particular modification. frac, │
│ │ fracLow, fracUp are the estimated │
│ │ fraction of molecules carrying the │
│ │ modification, and the 5% confidence │
│ │ intervals of the estimate. The │
│ │ methylated fraction estimation is a │
│ │ beta-level feature, and should only │
│ │ be used for exploratory purposes. │
└────────────┴───────────────────────────────────────┘
modifications.csv
The modifications.csv file contains one row for each (reference position, strand) pair that appeared in
the dataset with coverage at least x. x defaults to 3, but is configurable with '--minCoverage' flag to
ipdSummary.py. The reference position index is 1-based for compatibility with the gff file the R
environment. Note that this output type scales poorly and is not recommended for large genomes; the HDF5
output should perform much better in these cases. We have preserved the CSV option to support legacy
applications but this is no longer produce by the pipelines in SMRT Link/pbsmrtpipe.
Output columns
in-silico control mode
───────────────────────────────────────────────────────────
Column Description
───────────────────────────────────────────────────────────
refId reference sequence ID of this
observation
───────────────────────────────────────────────────────────
tpl 1-based template position
───────────────────────────────────────────────────────────
strand native sample strand where kinetics
were generated. '0' is the strand of
the original FASTA, '1' is opposite
strand from FASTA
───────────────────────────────────────────────────────────
base the cognate base at this position in
the reference
───────────────────────────────────────────────────────────
score Phred-transformed pvalue that a
kinetic deviation exists at this
position
───────────────────────────────────────────────────────────
tMean capped mean of normalized IPDs
observed at this position
───────────────────────────────────────────────────────────
tErr capped standard error of normalized
IPDs observed at this position
(standard deviation / sqrt(coverage)
───────────────────────────────────────────────────────────
modelPrediction normalized mean IPD predicted by the
synthetic control model for this
sequence context
───────────────────────────────────────────────────────────
ipdRatio tMean / modelPrediction
───────────────────────────────────────────────────────────
coverage count of valid IPDs at this position
(see Filtering section for details)
───────────────────────────────────────────────────────────
frac estimate of the fraction of molecules
that carry the modification
───────────────────────────────────────────────────────────
fracLow 2.5% confidence bound of frac
estimate
───────────────────────────────────────────────────────────
fracUpp 97.5% confidence bound of frac
estimate
┌─────────────────┬───────────────────────────────────────┐
│ │ │
case-control mode │ │ │
├─────────────────┼───────────────────────────────────────┤
│ Column │ Description │
├─────────────────┼───────────────────────────────────────┤
│ refId │ reference sequence ID of this │
│ │ observation │
├─────────────────┼───────────────────────────────────────┤
│ tpl │ 1-based template position │
├─────────────────┼───────────────────────────────────────┤
│ strand │ native sample strand where kinetics │
│ │ were generated. '0' is the strand of │
│ │ the original FASTA, '1' is opposite │
│ │ strand from FASTA │
├─────────────────┼───────────────────────────────────────┤
│ base │ the cognate base at this position in │
│ │ the reference │
├─────────────────┼───────────────────────────────────────┤
│ score │ Phred-transformed pvalue that a │
│ │ kinetic deviation exists at this │
│ │ position │
├─────────────────┼───────────────────────────────────────┤
│ caseMean │ mean of normalized case IPDs observed │
│ │ at this position │
───────────────────────────────────────────────────────────
controlMean mean of normalized control IPDs
observed at this position
───────────────────────────────────────────────────────────
caseStd standard deviation of case IPDs
observed at this position
───────────────────────────────────────────────────────────
controlStd standard deviation of control IPDs
observed at this position
───────────────────────────────────────────────────────────
ipdRatio tMean / modelPrediction
───────────────────────────────────────────────────────────
testStatistic t-test statistic
───────────────────────────────────────────────────────────
coverage mean of case and control coverage
───────────────────────────────────────────────────────────
controlCoverage count of valid control IPDs at this
position (see Filtering section for
details)
───────────────────────────────────────────────────────────
caseCoverage count of valid case IPDs at this
position (see Filtering section for
details)
┌─────────────────┬───────────────────────────────────────┐
│ │ │
SEE ALSO │ │ │