bionic (1) pbbarcode.1.gz

Provided by: pbbarcode_0.8.0-4ubuntu1_amd64 bug

NAME

       pbbarcode - annotate PacBio sequencing reads with barcode information

DESCRIPTION

       The  pbbarcode  package  provides  utilities  for annotating individual ZMWs directly from a bas.h5 file,
       emitting fast[a|q] files for each barcode, labeling alignments stored  in  a  cmp.h5  file,  and  calling
       consensus on small amplicons (requires pbdagcon(1))

       At  the  moment,  Barcodes  can  be  scored  in  two different ways: symmetric and paired. Symmetric mode
       supports barcode designs with two identical barcodes on both sides of a SMRTbell, e.g., for barcodes  (A,
       B), molecules are labeled as A--A or B--B. The paired mode supports designs with two distinct barcodes on
       each side of the molecule, but neither barcode appears without its mate. The  minimum  example  is  given
       with  the  following  barcodes:  (ALeft,  ARight,  BLeft,  BRight),  where the following barcode sets are
       checked: ALeft--ARight, BLeft--BRight.

       It is important to highlight that a barcode  FASTA  file  specifies  a  list  of  available  barcodes  to
       evaluate.  Depending  on  the  scoring  mode,  the  barcodes  are grouped together in different ways. For
       instance, in the symmetric case, the number of  possible  barcode  outcomes  are  simply  the  number  of
       barcodes that are supplied to the routine in the FASTA file (see below for usage) plus an additional NULL
       barcode indicating that no barcode could be evaluated (denoted by: '--'). Labels  like  this  (A--A)  are
       used  in  the  final  outputs.  In  the paired mode, the number of possible barcode outcomes are half the
       number of the sequences in the FASTA file plus the NULL barcode.  The  NULL  barcode  indicates  that  no
       attempt  was  made  to  score the molecule or it was filtered out by the user's criteria. The majority of
       cases when a molecule is not scored are related to not observing any adapters. If a user has  executed  a
       "hot-start"  run,  the  user can try the '--scoreFirst' parameter to attempt to label the first adapter's
       barcode. This increases the yield of the labeleing procedure  at  the  expense  of  some  probably  false
       positives.

       The software is implemented as a standard python package. Barcodes are labeled according to the following
       high-level logic. For each molecule, all adapters are found. For each adapter, we align  (using  standard
       Smith-Watterman  alignment)  each barcode and its reverse complement to flanking sequence of the adapter.
       If two complete flanking sequences are available, we divide by 2, else 1 if only  one  flanking  sequence
       was  available (average score at adapter). This allows the scores across adapters to be on the same scale
       (chimera detection). Depending on the mode, we then determine which barcode(s) are maximally scoring.  We
       store  the  two  maximally  scoring  barcodes, the sum of their alignment scores across the adapters. The
       average barcode score then can be given approximately by: total-score/number-of-adapters. At the  moment,
       the alignment parameters are fixed at:

                                                  ┌──────────┬───────┐
                                                  │type      │ score │
                                                  ├──────────┼───────┤
                                                  │insertion │ -1    │
                                                  ├──────────┼───────┤
                                                  │deletion  │ -1    │
                                                  ├──────────┼───────┤
                                                  │mismatch  │ -2    │
                                                  ├──────────┼───────┤
                                                  │match     │ 2     │
                                                  └──────────┴───────┘

   Input and output
   labelZmws
          usage: pbbarcode labelZmws [-h] [--outDir OUTDIR] [--outFofn OUTFOFN]
                 [--adapterSidePad      ADAPTERSIDEPAD]     [--insertSidePad     INSERTSIDEPAD]     [--scoreMode
                 {symmetric,paired}]    [--maxAdapters    MAXADAPTERS]     [--scoreFirst]     [--startTimeCutoff
                 STARTTIMECUTOFF]   [--nZmws   NZMWS]   [--nProcs   NPROCS]  [--saveExtendedInfo]  barcode.fasta
                 input.fofn

          Creates a barcode.h5 file from base h5 files.

          positional arguments:
                 barcode.fasta         Input barcode fasta file input.fofn            Input base fofn

          optional arguments:

                 -h, --help
                        show this help message and exit

                 --outDir OUTDIR
                        Where    to    write    the    newly    created     barcode.h5     files.      (default:
                        /home/UNIXHOME/jbullard/projects/software/bioinformatics/tools/pbbarcode/doc)

                 --outFofn OUTFOFN
                        Write to outFofn (default: barcode.fofn)

                 --adapterSidePad ADAPTERSIDEPAD
                        Pad with adapterSidePad bases (default: 4)

                 --insertSidePad INSERTSIDEPAD
                        Pad with insertSidePad bases (default: 4)

                 --scoreMode {symmetric,paired}
                        The mode in which the barcodes should be scored.  (default: symmetric)

                 --maxAdapters MAXADAPTERS
                        Only score the first maxAdapters (default: 20)

                 --scoreFirst
                        Whether to try to score the leftmost barcode in a trace. (default: False)

                 --startTimeCutoff STARTTIMECUTOFF
                        Reads  must  start  before  this  value  in order to be included when scoreFirst is set.
                        (default: 10.0)

                 --nZmws NZMWS
                        Use the first n ZMWs for testing (default: -1)

                 --nProcs NPROCS
                        How many processes to use (default: 8)

                 --saveExtendedInfo
                        Whether to save extended information tothe barcode.h5 files; this information is  useful
                        for debugging and chimera detection (default: False)

       The labelZmws command takes an input.fofn representing a set of bas.h5 files to operate on. Additionally,
       it takes a barcode.fasta file. Depending on scoreMode, the FASTA file  will  be  processed  in  different
       ways. Specifically, in paired mode, each two consecutive barcodes in the file are considered a set.

       The  parameters,  adapterSidePad and insertSidePad represents how many bases should be considered on each
       side  of  the  putative  barcode.  These  parameters  are  constrained  such  that:  |adapterSidePad|   +
       |insertSidePad| + |barcode| < 65.

       Users  have  the option to specify a different output location for the various outputs. Specifically, for
       each bas.h5 file in input.fofn, a bc.h5 (barcode hdf5) file is generated. These files are listed  in  the
       file outFofn which is typically just called barcode.fofn. See below for a description of the barcode hdf5
       file.

   labelAlignments
          usage: pbbarcode labelAlignments [-h]
                 [--minAvgBarcodeScore MINAVGBARCODESCORE]  [--minNumBarcodes  MINNUMBARCODES]  [--minScoreRatio
                 MINSCORERATIO] barcode.fofn aligned_reads.cmp.h5

          Adds information about barcode alignments to a cmp.h5 file from a previous call to "labelZmws".

          positional arguments:
                 barcode.fofn           input barcode fofn file aligned_reads.cmp.h5  cmp.h5 file to add barcode
                 labels

          optional arguments:

                 -h, --help
                        show this help message and exit

                 --minAvgBarcodeScore MINAVGBARCODESCORE
                        ZMW Filter: exclude ZMW if average barcode score is less than this value (default: 0.0)

                 --minNumBarcodes MINNUMBARCODES
                        ZMW Filter: exclude ZMW if number of barcodes observed is less than this value (default:
                        1)

                 --minScoreRatio MINSCORERATIO
                        ZMW  Filter:  exclude  ZMWs  whose best score divided by the 2nd best score is less than
                        this ratio (default: 1.0)

       The labelAlignments command takes as input a barcode.fofn computed from a call to labelZMWs and a  cmp.h5
       file  where  the  barcode  information  is  written  to.  See  below for a description of the cmp.h5 file
       additions.

   emitFastqs
          usage: pbbarcode emitFastqs [-h] [--outDir output.dir] [--subreads]
                 [--unlabeledZmws]   [--trim   TRIM]   [--fasta]    [--minMaxInsertLength    MINMAXINSERTLENGTH]
                 [--hqStartTime     HQSTARTTIME]     [--minReadScore     MINREADSCORE]     [--minAvgBarcodeScore
                 MINAVGBARCODESCORE]   [--minNumBarcodes   MINNUMBARCODES]    [--minScoreRatio    MINSCORERATIO]
                 input.fofn barcode.fofn

          Takes a bas.h5 fofn and a barcode.h5 fofn and produces a fast[a|q] file for each barcode.

          positional arguments:
                 input.fofn             input  base or CCS fofn file barcode.fofn          input barcode.h5 fofn
                 file

          optional arguments:

                 -h, --help
                        show this help message and exit

                 --outDir output.dir output directory to write fastq files (default: /home/
                        UNIXHOME/jbullard/projects/software/bioinformatics/too ls/pbbarcode/doc)

                 --subreads
                        whether to produce fastq files for the subreads;the default is to  use  the  CCS  reads.
                        This  option  onlyapplies when input.fofn has both consensus and raw reads,otherwise the
                        read type from input.fofn will be returned. (default: False)

                 --unlabeledZmws
                        whether to emit a fastq file for the unlabeled  ZMWs.   These  are  the  ZMWs  where  no
                        adapters are found typically (default: False)

                 --trim TRIM
                        trim off barcodes and any excess constant sequence (default: 20)

                 --fasta
                        whether the files produced should be FASTA files asopposed to FASTQ (default: False)

                 --minMaxInsertLength MINMAXINSERTLENGTH
                        ZMW Filter: exclude ZMW if the longest subreadis less than this amount (default: 0)

                 --hqStartTime HQSTARTTIME
                        ZMW  Filter:  exclude  ZMW  if  start time of HQ regiongreater than this value (seconds)
                        (default: inf)

                 --minReadScore MINREADSCORE
                        ZMW Filter: exclude ZMW if readScore is less thanthis value (default: 0)

                 --minAvgBarcodeScore MINAVGBARCODESCORE
                        ZMW Filter: exclude ZMW if average barcode score is less than this value (default: 0.0)

                 --minNumBarcodes MINNUMBARCODES
                        ZMW Filter: exclude ZMW if number of barcodes observed is less than this value (default:
                        1)

                 --minScoreRatio MINSCORERATIO
                        ZMW  Filter:  exclude  ZMWs  whose best score divided by the 2nd best score is less than
                        this ratio (default: 1.0)

       The emitFastqs command takes as input both an input.fofn for the bas.h5 files as well as  a  barcode.fofn
       from  a  call  to  labelZmws. The optional parameter outDir dictates where the files will be written. For
       each detected barcode, a fast[a|q] file will be emitted with all of the reads for that barcode. The  trim
       parameter  dictates  how  much  of  the read should be trimmed off. The default parameter for trim is the
       length of the barcode (which is stored in the barcode hdf5 files). At the moment,  all  barcodes  in  the
       barcode  FASTA  file  must  be  the  same  length,  therefore only a constant trim value is supported. In
       practice, one can aggressively trim in order to ensure that extra bases aren't left on the ends of reads.
       Finally,  the  subreads  parameter  dictates  whether  subreads  or CCS reads should be returned with the
       default being the appropriate reads according to the input  file  type,  either  CCS  or  subreads.  This
       parameter  is  only  inspected  if  the  input.fofn contains both CCS and subread data, if the input.fofn
       contains only subread or CCS data then that is returned irrespective of the state  of  the  the  subreads
       parameter and a warning is issued.

   consensus
          usage: pbbarcode consensus [-h] [--subsample SUBSAMPLE] [--nZmws NZMWS]
                 [--outDir   OUTDIR]   [--keepTmpDir]   [--ccsFofn   CCSFOFN]   [--nProcs  NPROCS]  [--noQuiver]
                 [--minMaxInsertLength   MINMAXINSERTLENGTH]   [--hqStartTime    HQSTARTTIME]    [--minReadScore
                 MINREADSCORE]   [--minAvgBarcodeScore   MINAVGBARCODESCORE]  [--minNumBarcodes  MINNUMBARCODES]
                 [--minScoreRatio MINSCORERATIO] [--barcode BARCODE [BARCODE ...]]  input.fofn barcode.fofn

          Compute consensus sequences for each barcode.

          positional arguments:
                 input.fofn            input bas.h5 fofn file barcode.fofn          input bc.h5 fofn file

          optional arguments:

                 -h, --help
                        show this help message and exit

                 --subsample SUBSAMPLE
                        Subsample ZMWs (default: 1)

                 --nZmws NZMWS
                        Take n ZMWs (default: -1)

                 --outDir OUTDIR
                        Use this directory to output results (default: .)

                 --keepTmpDir --ccsFofn CCSFOFN     Obtain CCS data from ccsFofn instead of input.fofn
                     (default: )

                 --nProcs NPROCS
                        Use nProcs to execute. (default: 16)

                 --noQuiver --minMaxInsertLength MINMAXINSERTLENGTH
                     ZMW Filter: exclude ZMW if the longest subreadis less than this amount (default: 0)

                 --hqStartTime HQSTARTTIME
                        ZMW Filter: exclude ZMW if start time of HQ  regiongreater  than  this  value  (seconds)
                        (default: inf)

                 --minReadScore MINREADSCORE
                        ZMW Filter: exclude ZMW if readScore is less thanthis value (default: 0)

                 --minAvgBarcodeScore MINAVGBARCODESCORE
                        ZMW Filter: exclude ZMW if average barcode score is less than this value (default: 0.0)

                 --minNumBarcodes MINNUMBARCODES
                        ZMW Filter: exclude ZMW if number of barcodes observed is less than this value (default:
                        1)

                 --minScoreRatio MINSCORERATIO
                        ZMW Filter: exclude ZMWs whose best score divided by the 2nd best  score  is  less  than
                        this ratio (default: 1.0)

                 --barcode BARCODE [BARCODE ...]
                        Use this to extract consensus for just one barcode.  (default: None)

       The  emitFastqs  command takes as input both an input.fofn for the bas.h5 files as well as a barcode.fofn
       from a call to labelZmws. The results are a FASTA file with an entry  for  each  barcode  containing  the
       consensus amplicon sequence. This mode utilizes Quiver and pbdagcon to compute consensus.

       In  cases  where  the  amplicon  is fewer than 2.5k bases, using CCS data is quite helpful. The --ccsFofn
       allows one to pass directly the ccs files. In many cases, both the CCS and raw basecalls are in the  same
       file so you can check by passing the same parameter to input.fofn as to ccsFofn.

   Dependencies
       The      pbbarcode      package      depends     on     a     standard     pbcore     installation     (‐
       https://github.com/PacificBiosciences/pbcore). If one wishes to use the consensus tool, pbdagcon needs to
       be installed (https://github.com/PacificBiosciences/pbdagcon).

   Barcode HDF5 File
       The  barcode hdf5 file, bc.h5, represents a simple data store for barcode calls and their scores for each
       ZMW. Generally, a user need not interact with barcode hdf5 files, but  can  use  the  results  stored  in
       either  the  resulting  cmp.h5  file  or  fast[a|q]  files.  The barcode hdf5 file contains the following
       structure:

       /BarcodeCalls/best - (nZMWs, 6)[32-bit integer] dataset with the following columns:
          holeNumber,nAdapters,barcodeIdx1,barcodeScore1,barcodeIdx2,barcodeScore2

       Additionally, the best dataset has the following attributes:

                    ┌────────────┬─────────────────────────────────────────────────────────────────┐
                    │movieName   │ m120408_042614_richard_c100309392550000001523011508061222_s1_p0 │
                    ├────────────┼─────────────────────────────────────────────────────────────────┤
                    │columnNames │ holeNumber,nAdapters,barcodeIdx1,barcodeScore1,barcodeIdx2,     │
                    │            │ barcodeScore2                                                   │
                    ├────────────┼─────────────────────────────────────────────────────────────────┤
                    │scoreMode   │ [symmetric|paired]                                              │
                    ├────────────┼─────────────────────────────────────────────────────────────────┤
                    │barcodes    │ 'bc_1', 'bc_2', ...., 'bc_N'                                    │
                    └────────────┴─────────────────────────────────────────────────────────────────┘

       The two barcodeIdx1 and barcodeIdx2 columns are indices into barcodes attribute. The scoreMode is scoring
       mode used to align the barcodes. The barcodes attribute correspond to the barcode.fasta sequence names.

       Additionally, in some circumstances, it is useful to retain the entire history of the scoring, i.e., each
       barcode scored to each adapter across all ZMWs. In oder to retain this information, one must call:
          pbbarcode labelZmws --saveExtendedInfo ...

       In this mode, the resultant HDF5 file will have an  additional  dataset  under  the  BarcodeCalls  group,
       named: all. This dataset has the following format:

       /BarcodeCalls/all - (nbarcodes * nadapters[zmw_i], 4) forall i in 1 ... nZMWs
          `holeNumber, adapterIdx, barcodeIdx, score`

       The  adapterIdx  is  the index of the adapter along the molecule, i.e., adapterIdx 1 is the first adapter
       scored.

   Additions to the compare HDF5 (cmp.h5) File
       In addition to the barcode hdf5 file, a call  to  labelAlignments  will  annotate  a  cmp.h5  file.  This
       annotation is stored in ways consistent with the cmp.h5 file format. Specifically, a new group:
       /BarcodeInfo/
         ID   (nBarcodeLabels + 1, 1)[32-bit integer]
         Name (nBarcodeLabels + 1, 1)[variable length string]

       In  addition  to the /BarcodeInfo/ group, the key dataset which assigns alignments to barcodes is located
       at:

       /AlnInfo/Barcode (nAlignments, 3)[32-bit integer] with the following columns:
          index,count,bestIndex,bestScore,secondBestIndex,secondBestScore

       Here index refers to the index into the Name vector, score corresponds to the sum of the scores  for  the
       barcodes, and finally, count refers to the number of adapters found in the molecule.

                                                  December 2015                                     PBBARCODE(1)