lunar (1) pairtools.1.gz

Provided by: python3-pairtools_0.3.0-3.2build2_amd64 bug

NAME

       pairtools - pairtools Documentation

       pairtools  is  a  simple and fast command-line framework to process sequencing data from a
       Hi-C experiment. pairtools perform various operations on Hi-C pairs and occupy the  middle
       position in a typical Hi-C data processing pipeline:
         [image: The diagram of a typical processing pipeline for Hi-C data] [image] In a typical
         Hi-C pipeline, DNA sequences (reads) are aligned to the reference genome, converted into
         ligation junctions and binned, thus producing a Hi-C contact map..UNINDENT

         pairtools  aim  to  be  an  all-in-one  tool  for processing Hi-C pairs, and can perform
         following operations:

       • detect ligation junctions (a.k.a. Hi-C pairs) in aligned paired-end  sequences  of  Hi-C
         DNA molecules

       • sort .pairs files for downstream analyses

       • detect, tag and remove PCR/optical duplicates

       • generate extensive statistics of Hi-C datasets

       • select Hi-C pairs given flexibly defined criteria

       • restore .sam alignments from Hi-C pairs

       pairtools produce .pairs files compliant with the 4DN standard.

       The full list of available pairtools:

                            ┌────────────┬──────────────────────────────────┐
                            │Pairtool    │ Description                      │
                            ├────────────┼──────────────────────────────────┤
                            │dedup       │ Find   and   remove  PCR/optical │
                            │            │ duplicates.                      │
                            ├────────────┼──────────────────────────────────┤
                            │filterbycov │ Remove  pairs  from  regions  of │
                            │            │ high coverage.                   │
                            ├────────────┼──────────────────────────────────┤
                            │flip        │ Flip    pairs    to    get    an │
                            │            │ upper-triangular matrix.         │
                            ├────────────┼──────────────────────────────────┤
                            │markasdup   │ Tag pairs as duplicates.         │
                            ├────────────┼──────────────────────────────────┤
                            │merge       │ Merge   sorted   .pairs/.pairsam │
                            │            │ files.                           │
                            ├────────────┼──────────────────────────────────┤
                            │parse       │ Find ligation junctions in .sam, │
                            │            │ make .pairs.                     │
                            ├────────────┼──────────────────────────────────┤
                            │phase       │ Phase pairs mapped to a  diploid │
                            │            │ genome.                          │
                            ├────────────┼──────────────────────────────────┤
                            │restrict    │ Assign  restriction fragments to │
                            │            │ pairs.                           │
                            ├────────────┼──────────────────────────────────┤
                            │select      │ Select pairs according  to  some │
                            │            │ condition.                       │
                            ├────────────┼──────────────────────────────────┤
                            │sort        │ Sort a .pairs/.pairsam file.     │
                            └────────────┴──────────────────────────────────┘

                            │split       │ Split   a   .pairsam  file  into │
                            │            │ .pairs and .sam.                 │
                            ├────────────┼──────────────────────────────────┤
                            │stats       │ Calculate pairs statistics.      │
                            └────────────┴──────────────────────────────────┘

       Contents:

QUICKSTART

       Install pairtools and all of its dependencies using the  conda  package  manager  and  the
       bioconda channel for bioinformatics software.

          $ conda install -c conda-forge -c bioconda pairtools

       Setup a new test folder and download a small Hi-C dataset mapped to sacCer3 genome:

          $ mkdir /tmp/test-pairtools
          $ cd /tmp/test-pairtools
          $ wget https://github.com/mirnylab/distiller-test-data/raw/master/bam/MATalpha_R1.bam

       Additionally, we will need a .chromsizes file, a TAB-separated plain text table describing
       the names, sizes and the order of chromosomes in the genome assembly used during mapping:

          $ wget https://raw.githubusercontent.com/mirnylab/distiller-test-data/master/genome/sacCer3.reduced.chrom.sizes

       With pairtools parse, we can convert paired-end sequence alignments  stored  in  .sam/.bam
       format into .pairs, a TAB-separated table of Hi-C ligation junctions:

          $ pairtools parse -c sacCer3.reduced.chrom.sizes -o MATalpha_R1.pairs.gz --drop-sam MATalpha_R1.bam

       Inspect the resulting table:

          $ less MATalpha_R1.pairs.gz

INSTALLATION

   Requirements
       • Python 3.x

       • Python packages numpy and click

       • Command-line utilities sort (the Unix version), bgzip (shipped with tabix) and samtools.
         If available, pairtools can compress outputs with pbgzip and lz4.

   Install using conda
       We highly recommend using the conda package  manager  to  install  pre-compiled  pairtools
       together  with  all  its dependencies. To get it, you can either install the full Anaconda
       Python distribution or just the standalone conda package manager.

       With conda, you can install pre-compiled pairtools and all of its  dependencies  from  the
       bioconda channel:

          $ conda install -c conda-forge -c bioconda pairtools

   Install using pip
       Alternatively,  compile  and install pairtools and its Python dependencies from PyPI using
       pip:

          $ pip install pairtools

   Install the development version
       Finally, you can install the latest development version of pairtools from  github.  First,
       make a local clone of the github repository:

          $ git clone https://github.com/mirnylab/pairtools

       Then,  you  can  compile and install pairtools in the development mode, which installs the
       package without moving it to a system folder and thus allows  immediate  live-testing  any
       changes in the python code. Please, make sure that you have cython installed!

          $ cd pairtools
          $ pip install -e ./

PARSING SEQUENCE ALIGNMENTS INTO HI-C PAIRS

   Overview
       Hi-C  experiments  aim to measure the frequencies of contacts between all pairs of loci in
       the genome. In these experiments, the spacial structure of chromosomes if first fixed with
       formaldehyde  crosslinks,  after  which DNA is partially digested with restriction enzymes
       and then re-ligated back. Then,  DNA  is  shredded  into  smaller  pieces,  released  from
       nucleus,  sequenced and aligned to the reference genome. The resulting sequence alignments
       reveal if DNA molecules were formed through ligations between DNA from different locations
       in  the  genome.   These  ligation events imply that ligated loci were close to each other
       when the ligation enzyme was active, i.e. they formed "a contact".

       pairtools parse detects ligation events in the aligned sequences of DNA  molecules  formed
       in Hi-C experiments and reports them in the .pairs/.pairsam format.

   Terminology
       Throughout  this  document  we  will be using the same visual language to describe how DNA
       sequences (in the .fastq format) are transformed into sequence alignments (.sam/.bam)  and
       into ligation events (.pairs).
         [image:  The  visual  language  to  describe  transformation  of  Hi-C data] [image] DNA
         sequences (reads) are aligned to  the  reference  genome  and  converted  into  ligation
         events.UNINDENT

         Short-read  sequencing  determines  the  sequences  of  the both ends (or, sides) of DNA
         molecules (typically 50-300 bp), producing read pairs in .fastq  format  (shown  in  the
         first  row  on  the figure above).  In such reads, base pairs are reported from the tips
         inwards, which is also defined as the 5'->3' direction  (in  accordance  of  the  5'->3'
         direction of the DNA strand that sequence of the corresponding side of the read).

         Alignment  software  maps  both  reads  of  a  pair  to  the reference genome, producing
         alignments, i.e. segments of the reference genome with matching  sequences.   Typically,
         there  will be only two alignments per read pair, one on each side.  But, sometimes, the
         parts of one or both sides may map to different locations on the genome, producing  more
         than two alignments per DNA molecule (see Multiple ligations (walks)).

         pairtools  parse converts alignments into ligation events (aka Hi-C pairs aka pairs). In
         the simplest case, when each side has only one unique alignment  (i.e.  the  whole  side
         maps to a single unique segment of the genome), for each side, we report the chromosome,
         the genomic position of the outer-most (5') aligned base pair  and  the  strand  of  the
         reference  genome  that  the  read aligns to.  pairtools parse assigns to such pairs the
         type UU (unique-unique).

   Unmapped/multimapped reads
       Sometimes, one side or both sides of a read pair may not align to the reference genome:
         [image: Read pairs missing an alignment on one or both sides] [image] Read pairs missing
         an alignment on one or both sides.UNINDENT

         In  this case, pairtools parse fills in the chromosome of the corresponding side of Hi-C
         pair with !, the position with 0 and the strand with -.  Such pairs are reported as type
         NU (null-unique, when the other side has a unique alignment) or NN (null-null, when both
         sides lack any alignment).

         Similarly, when one or both sides map to many genome locations equally well (i.e.   have
         non-unique,  or,  multi-mapping  alignments),  pairtools parse reports the corresponding
         sides as (chromosome= !, position= 0, strand=  -)  and  type  MU  (multi-unique)  or  MM
         (multi-multi)  or  NM  (null-multi), depending on the type of the alignment on the other
         side.
         [image: Read pairs with a non-unique alignment on one or both sides] [image] Read  pairs
         with a non-unique (multi-) alignment on one side.UNINDENT

         pairtools  parse  calls  an  alignment  to  be  multi-mapping when its MAPQ score (which
         depends on the scoring gap between the two best candidate alignments for a  segment)  is
         equal or greater than the value specified with the --min-mapq flag (by default, 1).

   Multiple ligations (walks)
       Finally, a read pair may contain more than two alignments:
         [image:  A  sequenced  Hi-C  molecule  that was formed via multiple ligations] [image] A
         sequenced Hi-C molecule that was formed via multiple ligations.UNINDENT

         Molecules like these typically form via multiple ligation events and we call them  walks
         [1].  Currently,  pairtools  parse does not process such molecules and tags them as type
         WW.

   Interpreting gaps between alignments
       Reads that are only partially aligned to the genome can be interpreted  in  two  different
       ways.  One  possibility  is  to  assume  that  this  molecule  was formed via at least two
       ligations (i.e. it's a walk) but the  non-aligned  part  (a  gap)  was  missing  from  the
       reference  genome for one reason or another.  Another possibility is to simply ignore this
       gap (for example, because it could be an insertion or a technical artifact), thus assuming
       that our molecule was formed via a single ligation and has to be reported:
         [image:  A  gap  between alignments can be ignored or interpreted as a "null" alignment]
         [image] A gap between alignments can interpreted as  a  legitimate  segment  without  an
         alignment or simply ignored.UNINDENT

         Both  options have their merits, depending on a dataset, quality of the reference genome
         and sequencing. pairtools parse ignores shorter gaps and keeps  longer  ones  as  "null"
         alignments.  The  maximal  size of ignored gaps is set by the --max-inter-align-gap flag
         (by default, 20bp).

   Rescuing single ligations
       Importantly, some of DNA molecules containing only one ligation junction may still end  up
       with three alignments:
         [image: Not all read pairs with three alignments come from "walks"] [image] Not all read
         pairs with three alignments come from "walks".UNINDENT

         A molecule formed via a single ligation gets  three  alignments  when  one  of  the  two
         ligated  DNA  pieces  is  shorter  than  the  read  length,  such  that that read on the
         corresponding side sequences through the ligation junction and into the other piece [2].
         The  amount of such molecules depends on the type of the restriction enzyme, the typical
         size of DNA molecules in the Hi-C library and the read  length,  and  sometimes  can  be
         considerable.

         pairtools parse detects such molecules and rescues them (i.e.  changes their type from a
         walk to a single-ligation molecule). It tests walks with  three  aligments  using  three
         criteria:
         [image:  The  three  criteria  used  for  "rescue"]  [image]  The three criteria used to
         "rescue" three-alignment walks: cis, point towards each other, short distance.UNINDENT

       1. On the side with two alignments (the chimeric side), the  "inner"  (or,  3')  alignment
          must be on the same chromosome as the alignment on the non-chimeric side.

       2. The  "inner"  alignment on the chimeric side and the alignment on the non-chimeric side
          must point toward each other.

       3. These two alignments must be within the distance specified with the --max-molecule-size
          flag (by default, 2000bp).

       Sometimes,  the  "inner"  alignment on the chimeric side can be non-unique or "null" (i.e.
       when  the  unmapped  segment  is  longer  than  --max-inter-align-gap,  as  described   in
       Interpreting  gaps between alignments). pairtools parse ignores such alignments altogether
       and thus rescues such walks as well.
         [image: A walk with three alignments get rescued, when the middle alignment is multi- or
         null]  [image]  A  walk  with three alignments get rescued, when the middle alignment is
         multi- or null..UNINDENT

       [1]  Following the lead of C-walks

       [2]  This procedure was first introduced in HiC-Pro and the in Juicer .

SORTING PAIRS

       In order to enable efficient random access to Hi-C pairs, we flip and sort  pairs.   After
       sorting,  interactions  become arranged in the order of their genomic position, such that,
       for any given pair of regions, we easily find and extract all of their interactions.  And,
       after  flipping,  all  artificially  duplicated molecules (either during PCR or in optical
       sequencing) end up in adjacent rows in sorted lists of  interactions,  such  that  we  can
       easily identify and remove them.

   Sorting
       pairtools  sort arrange pairs in the order of (chrom1, chrom2, pos1, pos2).  This order is
       also known as block sorting, because all pairs  between  any  given  pair  of  chromosomes
       become  grouped  into one continuous block.  Additionally, pairtools sort also sorts pairs
       with identical positions by pair_type. This does not really do much for mapped reads,  but
       it nicely splits unmapped reads into blocks of null-mapped and multi-mapped reads.

       We note that there is an alternative to block sorting, called row sorting, where pairs are
       sorted by (chrom1, pos1, chrom2, pos2).  In pairtools sort, we prefer block-sorting  since
       it  cleanly separates cis interactions from trans ones and thus is a more optimal solution
       for typical use cases.

   Flipping
       In a typical paired-end experiment, side1 and side2 of a DNA molecule are defined  by  the
       order in which they got sequenced.  Since this order is essentially random, any given Hi-C
       pair, e.g.  (chr1, 1.1Mb; chr2, 2.1Mb), may appear in a reversed orientation, i.e.  (chr2,
       2.1Mb; chr1, 1.1Mb). If we were to preserve this order of sides, interactions between same
       loci would appear in two  different  locations  of  the  sorted  pair  list,  which  would
       complicate finding PCR/optical duplicates.

       To  ensure  that  Hi-C  pairs  with similar coordinates end up in the same location of the
       sorted list, we flip pairs, i.e. we choose side1 as  the  side  with  the  lowest  genomic
       coordinate.      Thus,    after    flipping,    for    trans    pairs    (chrom1!=chrom2),
       order(chrom1)<order(chrom2); and for cis pairs (chrom1==chrom2), pos1<=pos2.  In a  matrix
       representation, flipping is equal to reflecting the lower triangle of the Hi-C matrix onto
       its upper triangle, such that the resulting matrix is upper-triangular.

       In pairtools, flipping is done during parsing - that's  why  pairtools  parse  requires  a
       .chromsizes  file  that  specifies  the  order  of chromosomes for flipping.  Importantly,
       pairtools parse also flips one-sided  pairs  such  that  side1  is  always  unmapped;  and
       unmapped   pairs   such   that   side1   always   has   a   "poorer"  mapping  type  (i.e.
       null-mapping<multi-mapping).

   Chromosomal order for sorting and flipping
       Importantly, the  order  of  chromosomes  for  sorting  and  flipping  can  be  different.
       Specifically,  pairtools  sort  uses the lexicographic order for chromosomes (chr1, chr10,
       chr11, ..., chr2, chr21,...) instead of the "natural" order (chr1, chr2,  chr3,  ...);  at
       the  same  time,  flipping  is  done  in  pairsamtools  parse  using the chromosomal order
       specified by the user.

       pairtools sort uses the lexicographic order for sorting chromosomes.  This order  is  used
       universally  to  sorting  strings  in  all languages and tools [1], which makes it easy to
       design tools for indexing and searching in sorted pair lists.

       At the same time, pairtools parse uses a custom user-provided chromosomal  order  to  flip
       pairs.  For performance considerations, for flipping, we recommend ordering chromosomes in
       a way that will be used in plotting contact maps.

       [1]  Unfortunately, many existing genomes use rather unconventional choices in chromosomal
            naming  schemes.  For  example,  in  sacCer3,  chromosomes  are enumerated with Roman
            numerals; in dm3, big autosomes are split 4  different  contigs  each.  Thus,  it  is
            impossible  to  design  a  universal  algorithm  that  would  order  chromosomes in a
            "meaningful" way for all existing genomes.

FORMATS FOR STORING HI-C PAIRS

   .pairs
       .pairs is a simple tabular format for storing DNA contacts detected in a Hi-C  experiment.
       The detailed .pairs specification is defined by the 4DN Consortium.

       The body of a .pairs contains a table with a variable number of fields separated by a "\t"
       character (a horizontal tab). The .pairs specification fixes the content and the order  of
       the first seven columns:

                              ┌──────┬─────────┬──────────────────────────┐
                              │index │ name    │ description              │
                              ├──────┼─────────┼──────────────────────────┤
                              │1     │ read_id │ the  ID  of  the read as │
                              │      │         │ defined in fastq files   │
                              ├──────┼─────────┼──────────────────────────┤
                              │2     │ chrom1  │ the  chromosome  of  the │
                              │      │         │ alignment on side 1      │
                              ├──────┼─────────┼──────────────────────────┤
                              │3     │ pos1    │ the    1-based   genomic │
                              │      │         │ position     of      the │
                              │      │         │ outer-most  (5')  mapped │
                              │      │         │ bp on side 1             │
                              ├──────┼─────────┼──────────────────────────┤
                              │4     │ chrom2  │ the  chromosome  of  the │
                              │      │         │ alignment on side 2      │
                              ├──────┼─────────┼──────────────────────────┤
                              │5     │ pos2    │ the    1-based   genomic │
                              │      │         │ position     of      the │
                              │      │         │ outer-most  (5')  mapped │
                              │      │         │ bp on side 2             │
                              ├──────┼─────────┼──────────────────────────┤
                              │6     │ strand1 │ the   strand   of    the │
                              │      │         │ alignment on side 1      │
                              ├──────┼─────────┼──────────────────────────┤
                              │7     │ strand2 │ the    strand   of   the │
                              │      │         │ alignment on side 2      │
                              └──────┴─────────┴──────────────────────────┘

       A .pairs file starts with a header, an arbitrary number  of  lines  starting  with  a  "#"
       character.  By  convention,  the header lines have a format of "#field_name: field_value".
       The .pairs specification mandates  a  few  standard  header  lines  (e.g.,  column  names,
       chromosome  order,  sorting  order,  etc),  all  of  which  are automatically filled in by
       pairtools.

       The entries of a .pairs file can be flipped and sorted. "Flipping" means that the sides  1
       and  2 do not correspond to side1 and side2 in sequencing data.  Instead, side1 is defined
       as the side with the alignment with a lower sorting index (using the lexographic order for
       chromosome  names, followed by the numeric order for positions and the lexicographic order
       for pair types). This particular order  of  "flipping"  is  defined  as  "upper-triangular
       flipping",  or  "triu-flipping".  Finally,  pairs  are  typically block-sorted: i.e. first
       lexicographically by chrom1 and chrom2, then numerically by pos1 and pos2.

   Pairtools' flavor of .pairs
       .pairs files produced by pairtools extend .pairs format in a few ways.

       1. pairtools store null/ambiguous/chimeric alignments as chrom='!', pos=0, strand='-'.

       2. pairtools store the header of the source .sam files in the '#samheader:' fields of  the
          pairs  header.  When  multiple  .pairs  files  are merged, the respective '#samheader:'
          fields are checked for consistency and merged.

       3. Each pairtool applied to .pairs leaves a record in the '#samheader' fields (using a @PG
          sam tag), thus preserving the full history of data processing.

       4. pairtools append an extra column describing the type of a Hi-C pair:

                             ┌──────┬───────────┬─────────────────────────┐
                             │index │ name      │ description             │
                             ├──────┼───────────┼─────────────────────────┤
                             │8     │ pair_type │ the type of a Hi-C pair │
                             └──────┴───────────┴─────────────────────────┘

   Pair types
       pairtools use a simple two-character notation to define all possible pair types, according
       to the quality of alignment of  the  two  sides.  The  type  of  a  pair  can  be  defined
       unambiguously  using  the  table  below.  To  use  this  table, identify which side has an
       alignment of a "poorer" quality (unmapped < multimapped < unique alignment) and which side
       has a "better" alignment and find the corresponding row in the table.

     ┌───────────┬─────────────┬─────────────┬────────┬────────┬────────────────┬──────┬───────────┐
     │.          │ Less        │ More        │ .      │ .      │ .              │      │           │
     │           │ informative │ informative │        │        │                │      │           │
     │           │ alignment   │ alignment   │        │        │                │      │           │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │>2         │ Mapped      │ Unique      │ Mapped │ Unique │ Pair type      │ Code │ Sidedness │
     │alignments │             │             │        │        │                │      │           │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │✔          │ ❌          │ ❌          │ ❌     │ ❌     │ walk-walk      │ WW   │ 0 [1]     │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │❌         │ ❌          │             │ ❌     │        │ null           │ NN   │ 0         │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │❌         │ ❌          │             │ ❌     │        │ corrupt        │ XX   │ 0         │
     │           │             │             │        │        │                │      │ [2]_      │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │❌         │ ❌          │             │ ✔      │ ❌     │ null-multi     │ NM   │ 0         │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │✔          │ ❌          │             │ ✔      │ ✔      │ null-rescued   │ NR   │ 1 [3]     │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │❌         │ ❌          │             │ ✔      │ ✔      │ null-unique    │ NU   │ 1         │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │❌         │ ✔           │ ❌          │ ✔      │ ❌     │ multi-multi    │ MM   │ 0         │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │✔          │ ✔           │ ❌          │ ✔      │ ✔      │ multi-rescued  │ MR   │ 1 [3]     │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │❌         │ ✔           │ ❌          │ ✔      │ ✔      │ multi-unique   │ MU   │ 1         │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │✔          │ ✔           │ ✔           │ ✔      │ ✔      │ rescued-unique │ RU   │ 2 [3]     │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │✔          │ ✔           │ ✔           │ ✔      │ ✔      │ unique-rescued │ UR   │ 2 [3]     │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │❌         │ ✔           │ ✔           │ ✔      │ ✔      │ unique-unique  │ UU   │ 2         │
     ├───────────┼─────────────┼─────────────┼────────┼────────┼────────────────┼──────┼───────────┤
     │❌         │ ✔           │ ✔           │ ✔      │ ✔      │ duplicate      │ DD   │ 2         │
     │           │             │             │        │        │                │      │ [4]_      │
     └───────────┴─────────────┴─────────────┴────────┴────────┴────────────────┴──────┴───────────┘

       [1]  "walks",  or,  C-walks  are  Hi-C molecules formed via multiple ligation events which
            cannot be reported as a single pair.

       [2]  "corrupt" pairs are those with technical issues - e.g. missing a  FASTQ  sequence/SAM
            entry from one side of the molecule.

       [2]  "rescued"  pairs  have  two  non-overlapping alignments on one of the sides (referred
            below as the chimeric side/read), but the inner (3'-) one extends the only  alignment
            on the other side (referred as the non-chimeric side/read).  Such pairs form when one
            of the two ligated DNA fragments is shorter than the read length. In this  case,  one
            of  the  reads  contains  this  short  fragment  entirely, together with the ligation
            junction and a chunk of the other DNA fragment (thus, this read ends  up  having  two
            non-overlapping  alignments).   Following  the  procedure  introduced  in HiC-Pro and
            Juicer, pairtools parse rescues such Hi-C molecules, reports the position of  the  5'
            alignment  on the chimeric side, and tags them as "NU", "MU", "UR" or "RU" pair type,
            depending on the type of the 5' alignment on the chimeric side.  Such  molecules  can
            and  should be used in downstream analysis.  Read more on the rescue procedure in the
            section on parsing.

       [3]  pairtools dedup detects molecules that could be formed via PCR duplication  and  tags
            them as "DD" pair type. These pairs should be excluded from downstream analyses.

   .pairsam
       pairtools  also  define  .pairsam,  a valid extension of the .pairs format.  On top of the
       pairtools' flavor of .pairs,  .pairsam  format  adds  two  extra  columns  containing  the
       alignments from which the Hi-C pair was extracted:

                               ┌──────┬──────┬──────────────────────────┐
                               │index │ name │ description              │
                               ├──────┼──────┼──────────────────────────┤
                               │9     │ sam1 │ the  sam alignment(s) on │
                               │      │      │ side     1;     separate │
                               │      │      │ supplemental  alignments │
                               │      │      │ by NEXT_SAM              │
                               ├──────┼──────┼──────────────────────────┤
                               │10    │ sam2 │ the sam alignment(s)  on │
                               │      │      │ side     2;     separate │
                               │      │      │ supplemental  alignments │
                               │      │      │ by NEXT_SAM              │
                               └──────┴──────┴──────────────────────────┘

       Note  that,  normally,  the  fields  of  a sam alignment are separated by a horizontal tab
       character (\t), which we already use to separate .pairs columns. To  avoid  confusion,  we
       replace  the  tab  character  in  sam  entries stored in sam1 and sam2 columns with a UNIT
       SEPARATOR character (\031).

       Finally, sam1 and  sam2  can  store  multiple  .sam  alignments,  separated  by  a  string
       '\031NEXT_SAM\031'

TECHNICAL NOTES

       Designing  scientific  software  and  formats  requires  making a multitude of tantalizing
       technical decisions and compromises. Often, the reasons  behind  a  certain  decision  are
       non-trivial  and  convoluted,  involving  many  factors.   Here,  we collect the notes and
       observations made during the desing stage of pairtools and  provide  a  justification  for
       most  non-trivial  decisions.   We  hope  that  this document will elucidate the design of
       pairtools and may prove useful to developers in their future projects.

   .pairs format
       The motivation behind some  of  the  technical  decisions  in  the  pairtools'  flavor  of
       .pairs/.pairsam:

       • pairtools  can  store  SAM  entries  together with the Hi-C pair information in .pairsam
         files. Storing pairs and alignments in the same row enables easy tagging  and  filtering
         of paired-end alignments based on their Hi-C information.

       • pairtools  use  the  exclamation  mark  "!"  instead of '.' as 'chrom' of unmapped reads
         because it has the lowest lexicographic sorting order among all characters. The  use  of
         '0'  and  '-'  in  the 'pos' and 'strand' fields of unmapped reads allows us to keep the
         types of these fields as 'unsigned int' and enum{'+','-'}, respectively.

       • "rescued" pairs have two types "UR" and "RU" instead of just "RU". We chose this  design
         because  rescued pairs are two-sided and thus are flipped based on (chrom, pos), and not
         based on the side types. With two pair types "RU" and "UR", pairtools can keep track  of
         which side of the pair was rescued.

       • in  "rescued"  pairs,  the type "R" is assigned to the non-chimeric side.  This may seem
         counter-intuitive at first, since it is the chimeric side that gets  rescued,  but  this
         way  pairtools  can keep track of the type of the 5' alignment on the chimeric side (the
         alignment on the non-chimeric side has to be unique for the pair to be rescued).

       • pairtools rely on a text format, .pairs, instead of hdf5/parquet-based tables or  custom
         binaries. We went with a text format for a few reasons:

         • text  tables  enable  easy  access  to  data  from any language and any tool.  This is
           especially important at the level of Hi-C pairs, the "rawest"  format  of  information
           from a Hi-C experiment.

         • hdf5 and parquet have a few shortcomings that hinder their immediate use in pairtools.
           Specifically, hdf5 cannot  compress  variable-length  strings  (which  are,  in  turn,
           required  to  store sam alignments and some optional information on pairs) and parquet
           cannot append columns to existing files, modify datasets in place  or  store  multiple
           tables  in  one  file  (which  is required to keep table indices in the same file with
           pairs).

         • text tables have a set of well-developed and highly-optimized tools for sorting  (Unix
           sort), compression (bgzip/lz4) and random access (tabix).

         • text formats enable easy streaming between individual command-line tools.

         Having said that, text formats have many downsides - they are bulky when not compressed,
         compression and parsing requires extra computational resources, they cannot be  modified
         in  place  and  random  access requires extra tools. In the future, we plan to develop a
         binary format based on existing container formats, which would mitigate these downsides.

   CLI
       • many pairtools perform multiple actions at once, which contradicts the  "do  one  thing"
         philosophy of Unix command line. We packed multiple (albeit, related) functions into one
         tool to improve the performance of pairtools.  Specifically, given  the  large  size  of
         Hi-C  data,  a  significant  fraction  of  time  is  spent on compression/decompression,
         parsing, loading data into memory and sending  it  over  network  (for  cloud/clusters).
         Packing  multiple  functions  into  one tool cuts down the amount of such time consuming
         operations.

       • pairtools parse requires a .chromsizes file to know the order of chromosomes and perform
         pair flipping.

       • pairtools  use  bgzip  compression  by default instead of gzip. Using bgzip allows us to
         create an index with pairix and get random access to data.

       • paritools have an option to compress outputs with lz4.  Lz4  is  much  faster  and  only
         slighly  less  efficient  than  gzip.   This  makes lz4 a better choice for passing data
         between individual pairtools before producing final result  (which,  in  turn,  requires
         bgzip compression).

COMMAND-LINE API

Index

AUTHOR

       Mirny Lab

       2017-2022, Mirny Lab