Provided by: pbbarcode_0.8.0-4ubuntu1_amd64

NAME

       pbbarcode - annotate PacBio sequencing reads with barcode information

DESCRIPTION

       The pbbarcode package provides utilities for annotating individual ZMWs directly from a
       bas.h5 file, emitting fast[a|q] files for each barcode, labeling alignments stored in a
       cmp.h5 file, and calling consensus on small amplicons (requires pbdagcon(1)).

       At the moment, barcodes can be scored in two different ways: symmetric and paired.
       Symmetric mode supports barcode designs with the same barcode on both sides of a
       SMRTbell, e.g., for barcodes (A, B), molecules are labeled as A--A or B--B. Paired mode
       supports designs with a distinct barcode on each side of the molecule, where neither
       barcode appears without its mate. A minimal example is given by the following barcodes:
       (ALeft, ARight, BLeft, BRight), for which the following barcode sets are checked:
       ALeft--ARight, BLeft--BRight.

       Note that a barcode FASTA file specifies the list of available barcodes to evaluate.
       Depending on the scoring mode, the barcodes are grouped together in different ways. For
       instance, in the symmetric case, the number of possible barcode outcomes is simply the
       number of barcodes supplied to the routine in the FASTA file (see below for usage) plus
       an additional NULL barcode indicating that no barcode could be evaluated (denoted by
       '--'). Labels of the form A--A are used in the final outputs. In paired mode, the number
       of possible barcode outcomes is half the number of sequences in the FASTA file plus the
       NULL barcode. The NULL barcode indicates that no attempt was made to score the molecule
       or that it was filtered out by the user's criteria. In the majority of cases, a molecule
       is not scored because no adapters were observed. If a user has executed a "hot-start"
       run, the user can try the '--scoreFirst' parameter to attempt to label the first
       adapter's barcode. This increases the yield of the labeling procedure at the expense of
       some probable false positives.
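
       The grouping described above can be sketched in a few lines of Python. This is only an
       illustration, not pbbarcode's own code: the function name barcode_labels is
       hypothetical, and pairing consecutive FASTA entries in paired mode follows the
       labelZmws description further down.

```python
def barcode_labels(names, score_mode="symmetric"):
    """Return the possible output labels for a list of barcode names.

    Hypothetical helper for illustration; not part of pbbarcode's API.
    """
    if score_mode == "symmetric":
        # The same barcode is expected on both sides of the molecule: A--A.
        labels = ["%s--%s" % (name, name) for name in names]
    elif score_mode == "paired":
        if len(names) % 2 != 0:
            raise ValueError("paired mode needs an even number of barcodes")
        # Every two consecutive FASTA entries form one set: ALeft--ARight.
        labels = ["%s--%s" % (names[i], names[i + 1])
                  for i in range(0, len(names), 2)]
    else:
        raise ValueError("unknown score mode: %r" % (score_mode,))
    # The NULL barcode is always a possible outcome.
    return labels + ["--"]
```

       With N sequences in the FASTA file, this yields N + 1 labels in symmetric mode and
       N/2 + 1 labels in paired mode.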

       The software is implemented as a standard Python package. Barcodes are labeled according
       to the following high-level logic. For each molecule, all adapters are found. For each
       adapter, we align (using standard Smith-Waterman alignment) each barcode and its reverse
       complement to the flanking sequences of the adapter. If two complete flanking sequences
       are available, the score is divided by 2; if only one is available, it is divided by 1.
       This gives the average score at the adapter and puts the scores across adapters on the
       same scale (useful for chimera detection). Depending on the mode, we then determine
       which barcode(s) are maximally scoring. We store the two maximally scoring barcodes and
       the sum of their alignment scores across the adapters. The average barcode score is then
       given approximately by total-score/number-of-adapters. At the moment, the alignment
       parameters are fixed at:

                                          ┌──────────┬───────┐
                                          │type      │ score │
                                          ├──────────┼───────┤
                                          │insertion │ -1    │
                                          ├──────────┼───────┤
                                          │deletion  │ -1    │
                                          ├──────────┼───────┤
                                          │mismatch  │ -2    │
                                          ├──────────┼───────┤
                                          │match     │ 2     │
                                          └──────────┴───────┘
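
       As a rough illustration of the scoring step, the following Python sketch computes a
       Smith-Waterman local alignment score with the parameters from the table and averages
       it over the flanking sequences of one adapter. It is not the package's implementation.
       All function names are hypothetical, and taking the better of the forward and
       reverse-complement score for each flank is an assumption.

```python
MATCH, MISMATCH, INSERTION, DELETION = 2, -2, -1, -1

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def local_alignment_score(barcode, flank):
    """Best Smith-Waterman local alignment score of barcode against flank."""
    prev = [0] * (len(flank) + 1)
    best = 0
    for b in barcode:
        cur = [0]
        for j, f in enumerate(flank, start=1):
            diag = prev[j - 1] + (MATCH if b == f else MISMATCH)
            up = prev[j] + DELETION        # gap opposite a barcode base
            left = cur[j - 1] + INSERTION  # gap opposite a flank base
            cell = max(0, diag, up, left)  # local alignment never drops below 0
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def adapter_score(barcode, flanks):
    """Average score of one barcode at one adapter.

    flanks holds the one or two flanking sequences that were available.
    Assumption: each flank contributes the better of the forward and
    reverse-complement alignment.
    """
    rc = reverse_complement(barcode)
    total = sum(max(local_alignment_score(barcode, flank),
                    local_alignment_score(rc, flank))
                for flank in flanks)
    return total / len(flanks)  # divide by 2, or by 1 if only one flank
```

       A perfect match of a 16-base barcode scores 32 under these parameters, so average
       scores can be read against that maximum.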

   Input and output
   labelZmws
          usage: pbbarcode labelZmws [-h] [--outDir OUTDIR] [--outFofn OUTFOFN]
                 [--adapterSidePad ADAPTERSIDEPAD] [--insertSidePad  INSERTSIDEPAD]  [--scoreMode
                 {symmetric,paired}]       [--maxAdapters       MAXADAPTERS]       [--scoreFirst]
                 [--startTimeCutoff   STARTTIMECUTOFF]   [--nZmws   NZMWS]   [--nProcs    NPROCS]
                 [--saveExtendedInfo] barcode.fasta input.fofn

          Creates a barcode.h5 file from base h5 files.

          positional arguments:
                 barcode.fasta
                        Input barcode fasta file

                 input.fofn
                        Input base fofn

          optional arguments:

                 -h, --help
                        show this help message and exit

                 --outDir OUTDIR
                        Where  to  write  the  newly   created   barcode.h5   files.    (default:
                        /home/UNIXHOME/jbullard/projects/software/bioinformatics/tools/pbbarcode/doc)

                 --outFofn OUTFOFN
                        Write to outFofn (default: barcode.fofn)

                 --adapterSidePad ADAPTERSIDEPAD
                        Pad with adapterSidePad bases (default: 4)

                 --insertSidePad INSERTSIDEPAD
                        Pad with insertSidePad bases (default: 4)

                 --scoreMode {symmetric,paired}
                        The mode in which the barcodes should be scored.  (default: symmetric)

                 --maxAdapters MAXADAPTERS
                        Only score the first maxAdapters (default: 20)

                 --scoreFirst
                        Whether to try to score the leftmost barcode in a trace. (default: False)

                 --startTimeCutoff STARTTIMECUTOFF
                        Reads must  start  before  this  value  in  order  to  be  included  when
                        scoreFirst is set. (default: 10.0)

                 --nZmws NZMWS
                        Use the first n ZMWs for testing (default: -1)

                 --nProcs NPROCS
                        How many processes to use (default: 8)

                 --saveExtendedInfo
                        Whether to save extended information to the barcode.h5 files; this
                        information is useful for debugging and chimera detection (default:
                        False)

       The  labelZmws  command  takes an input.fofn representing a set of bas.h5 files to operate
       on. Additionally, it takes a barcode.fasta file. Depending on scoreMode,  the  FASTA  file
       will  be  processed  in different ways. Specifically, in paired mode, each two consecutive
       barcodes in the file are considered a set.

       The parameters adapterSidePad and insertSidePad represent how many bases should be
       considered on each side of the putative barcode. These parameters are constrained such
       that: |adapterSidePad| + |insertSidePad| + |barcode| < 65.
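
       The constraint is easy to check before launching a run. A minimal sketch in Python,
       where check_padding and MAX_WINDOW are hypothetical names chosen for this example:

```python
MAX_WINDOW = 65  # upper bound from the constraint above (exclusive)


def check_padding(adapter_side_pad, insert_side_pad, barcode_length):
    """Return the scored window size, or raise if it violates the constraint."""
    window = adapter_side_pad + insert_side_pad + barcode_length
    if window >= MAX_WINDOW:
        raise ValueError(
            "adapterSidePad + insertSidePad + barcode length must be < %d, got %d"
            % (MAX_WINDOW, window))
    return window
```

       With the default pads of 4 and 4, barcodes of up to 56 bases satisfy the constraint.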

       Users have the option to specify a different output  location  for  the  various  outputs.
       Specifically,  for  each  bas.h5  file  in  input.fofn,  a  bc.h5  (barcode  hdf5) file is
       generated. These files are listed in the file  outFofn  which  is  typically  just  called
       barcode.fofn. See below for a description of the barcode hdf5 file.

   labelAlignments
          usage: pbbarcode labelAlignments [-h]
                 [--minAvgBarcodeScore   MINAVGBARCODESCORE]   [--minNumBarcodes  MINNUMBARCODES]
                 [--minScoreRatio MINSCORERATIO] barcode.fofn aligned_reads.cmp.h5

          Adds information about barcode alignments to a cmp.h5 file  from  a  previous  call  to
          "labelZmws".

          positional arguments:
                 barcode.fofn
                        input barcode fofn file

                 aligned_reads.cmp.h5
                        cmp.h5 file to add barcode labels

          optional arguments:

                 -h, --help
                        show this help message and exit

                 --minAvgBarcodeScore MINAVGBARCODESCORE
                        ZMW Filter: exclude ZMW if average barcode score is less than this  value
                        (default: 0.0)

                 --minNumBarcodes MINNUMBARCODES
                        ZMW  Filter: exclude ZMW if number of barcodes observed is less than this
                        value (default: 1)

                 --minScoreRatio MINSCORERATIO
                        ZMW Filter: exclude ZMWs whose best score divided by the 2nd  best  score
                        is less than this ratio (default: 1.0)

       The labelAlignments command takes as input a barcode.fofn computed from a call to
       labelZmws and a cmp.h5 file to which the barcode information is written. See below for
       a description of the cmp.h5 file additions.
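
       The three ZMW filters can be read as a single predicate. The Python sketch below is an
       interpretation, not the package's code. It assumes that the average barcode score is
       total-score/number-of-adapters, that the "number of barcodes observed" equals the
       number of adapters scored, and that a ZMW with no positive second-best score passes
       the ratio filter. The name passes_filters is hypothetical.

```python
def passes_filters(n_adapters, best_score, second_score,
                   min_avg_barcode_score=0.0, min_num_barcodes=1,
                   min_score_ratio=1.0):
    """Return True if a ZMW survives the three barcode filters."""
    # A ZMW without scored adapters has no barcode call to keep.
    if n_adapters <= 0 or n_adapters < min_num_barcodes:
        return False
    # Average barcode score: summed best score over the number of adapters.
    if best_score / n_adapters < min_avg_barcode_score:
        return False
    # Ratio of best to second-best score (assumption: skipped when the
    # second-best score is not positive, to avoid dividing by zero).
    if second_score > 0 and best_score / second_score < min_score_ratio:
        return False
    return True
```

       For example, a ZMW with 4 adapters, a best score of 80 and a second-best score of 70
       passes with the defaults but is excluded at a ratio threshold of 1.5.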

   emitFastqs
          usage: pbbarcode emitFastqs [-h] [--outDir output.dir] [--subreads]
                 [--unlabeledZmws]      [--trim     TRIM]     [--fasta]     [--minMaxInsertLength
                 MINMAXINSERTLENGTH] [--hqStartTime  HQSTARTTIME]  [--minReadScore  MINREADSCORE]
                 [--minAvgBarcodeScore   MINAVGBARCODESCORE]   [--minNumBarcodes  MINNUMBARCODES]
                 [--minScoreRatio MINSCORERATIO] input.fofn barcode.fofn

          Takes a bas.h5 fofn and a barcode.h5 fofn  and  produces  a  fast[a|q]  file  for  each
          barcode.

          positional arguments:
                 input.fofn
                        input base or CCS fofn file

                 barcode.fofn
                        input barcode.h5 fofn file

          optional arguments:

                 -h, --help
                        show this help message and exit

                 --outDir output.dir
                        output directory to write fastq files (default:
                        /home/UNIXHOME/jbullard/projects/software/bioinformatics/tools/pbbarcode/doc)

                 --subreads
                        whether to produce fastq files for the subreads; the default is to use
                        the CCS reads. This option only applies when input.fofn has both
                        consensus and raw reads, otherwise the read type from input.fofn will be
                        returned. (default: False)

                 --unlabeledZmws
                        whether  to emit a fastq file for the unlabeled ZMWs.  These are the ZMWs
                        where no adapters are found typically (default: False)

                 --trim TRIM
                        trim off barcodes and any excess constant sequence (default: 20)

                 --fasta
                        whether the files produced should be FASTA files as opposed to FASTQ
                        (default: False)

                 --minMaxInsertLength MINMAXINSERTLENGTH
                        ZMW Filter: exclude ZMW if the longest subread is less than this amount
                        (default: 0)

                 --hqStartTime HQSTARTTIME
                        ZMW Filter: exclude ZMW if start time of HQ region greater than this
                        value (seconds) (default: inf)

                 --minReadScore MINREADSCORE
                        ZMW Filter: exclude ZMW if readScore is less than this value (default: 0)

                 --minAvgBarcodeScore MINAVGBARCODESCORE
                        ZMW  Filter: exclude ZMW if average barcode score is less than this value
                        (default: 0.0)

                 --minNumBarcodes MINNUMBARCODES
                        ZMW Filter: exclude ZMW if number of barcodes observed is less than  this
                        value (default: 1)

                 --minScoreRatio MINSCORERATIO
                        ZMW  Filter:  exclude ZMWs whose best score divided by the 2nd best score
                        is less than this ratio (default: 1.0)

       The emitFastqs command takes as input both an input.fofn for the bas.h5 files as well as a
       barcode.fofn  from  a  call to labelZmws. The optional parameter outDir dictates where the
       files will be written. For each detected barcode, a fast[a|q] file will  be  emitted  with
       all of the reads for that barcode. The trim parameter dictates how much of the read should
       be trimmed off. The default parameter for trim is the length  of  the  barcode  (which  is
       stored  in  the barcode hdf5 files). At the moment, all barcodes in the barcode FASTA file
       must be the same length, therefore only a constant trim value is supported.  In  practice,
       one  can  aggressively trim in order to ensure that extra bases aren't left on the ends of
       reads. Finally, the subreads parameter dictates whether subreads or CCS reads should be
       returned, with the default being the appropriate reads according to the input file type,
       either CCS or subreads. This parameter is only inspected if the input.fofn contains both
       CCS and subread data; if the input.fofn contains only subread or CCS data, then that is
       returned irrespective of the state of the subreads parameter and a warning is issued.
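
       To make the effect of a constant trim value concrete, here is a small Python sketch. It
       assumes that the same number of bases is removed from both ends of a read, which the
       text above suggests but does not state outright; trim_record is a hypothetical name
       and not a pbbarcode function.

```python
def trim_record(sequence, quality, trim):
    """Remove `trim` bases from both ends of a read and its quality string.

    Assumption: trimming is symmetric. Reads too short to survive the
    trim come back empty.
    """
    if trim < 0:
        raise ValueError("trim must be non-negative")
    if trim == 0:
        return sequence, quality
    if len(sequence) <= 2 * trim:
        return "", ""
    return sequence[trim:-trim], quality[trim:-trim]
```

       Choosing a trim value somewhat larger than the barcode length removes more of the
       constant sequence at the cost of a few insert bases.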

   consensus
          usage: pbbarcode consensus [-h] [--subsample SUBSAMPLE] [--nZmws NZMWS]
                 [--outDir  OUTDIR]  [--keepTmpDir]   [--ccsFofn   CCSFOFN]   [--nProcs   NPROCS]
                 [--noQuiver]     [--minMaxInsertLength     MINMAXINSERTLENGTH]    [--hqStartTime
                 HQSTARTTIME]      [--minReadScore      MINREADSCORE]       [--minAvgBarcodeScore
                 MINAVGBARCODESCORE]     [--minNumBarcodes    MINNUMBARCODES]    [--minScoreRatio
                 MINSCORERATIO] [--barcode BARCODE [BARCODE ...]]  input.fofn barcode.fofn

          Compute consensus sequences for each barcode.

          positional arguments:
                 input.fofn
                        input bas.h5 fofn file

                 barcode.fofn
                        input bc.h5 fofn file

          optional arguments:

                 -h, --help
                        show this help message and exit

                 --subsample SUBSAMPLE
                        Subsample ZMWs (default: 1)

                 --nZmws NZMWS
                        Take n ZMWs (default: -1)

                 --outDir OUTDIR
                        Use this directory to output results (default: .)

                 --keepTmpDir

                 --ccsFofn CCSFOFN
                        Obtain CCS data from ccsFofn instead of input.fofn (default: )

                 --nProcs NPROCS
                        Use nProcs to execute. (default: 16)

                 --noQuiver

                 --minMaxInsertLength MINMAXINSERTLENGTH
                        ZMW Filter: exclude ZMW if the longest subread is less than this amount
                        (default: 0)

                 --hqStartTime HQSTARTTIME
                        ZMW Filter: exclude ZMW if start time of HQ region greater than this
                        value (seconds) (default: inf)

                 --minReadScore MINREADSCORE
                        ZMW Filter: exclude ZMW if readScore is less than this value (default: 0)

                 --minAvgBarcodeScore MINAVGBARCODESCORE
                        ZMW Filter: exclude ZMW if average barcode score is less than this  value
                        (default: 0.0)

                 --minNumBarcodes MINNUMBARCODES
                        ZMW  Filter: exclude ZMW if number of barcodes observed is less than this
                        value (default: 1)

                 --minScoreRatio MINSCORERATIO
                        ZMW Filter: exclude ZMWs whose best score divided by the 2nd  best  score
                        is less than this ratio (default: 1.0)

                 --barcode BARCODE [BARCODE ...]
                        Use this to extract consensus for just one barcode.  (default: None)

       The consensus command takes as input both an input.fofn for the bas.h5 files as well as a
       barcode.fofn from a call to labelZmws. The result is a FASTA file with an entry for each
       barcode containing the consensus amplicon sequence. This mode utilizes Quiver and pbdagcon
       to compute consensus.

       In cases where the amplicon is shorter than 2.5k bases, using CCS data is quite helpful.
       The --ccsFofn option allows one to pass the CCS files directly. In many cases, both the
       CCS and raw basecalls are in the same file, so you can check by passing the same
       parameter to input.fofn as to ccsFofn.

   Dependencies
       The pbbarcode package depends on a standard pbcore installation
       (https://github.com/PacificBiosciences/pbcore). If one wishes to use the consensus tool,
       pbdagcon needs to be installed (https://github.com/PacificBiosciences/pbdagcon).

   Barcode HDF5 File
       The  barcode  hdf5 file, bc.h5, represents a simple data store for barcode calls and their
       scores for each ZMW. Generally, a user need not interact with barcode hdf5 files, but  can
       use the results stored in either the resulting cmp.h5 file or fast[a|q] files. The barcode
       hdf5 file contains the following structure:

       /BarcodeCalls/best - (nZMWs, 6)[32-bit integer] dataset with the following columns:
          holeNumber,nAdapters,barcodeIdx1,barcodeScore1,barcodeIdx2,barcodeScore2

       Additionally, the best dataset has the following attributes:

            ┌────────────┬─────────────────────────────────────────────────────────────────┐
            │movieName   │ m120408_042614_richard_c100309392550000001523011508061222_s1_p0 │
            ├────────────┼─────────────────────────────────────────────────────────────────┤
            │columnNames │ holeNumber,nAdapters,barcodeIdx1,barcodeScore1,barcodeIdx2,     │
            │            │ barcodeScore2                                                   │
            ├────────────┼─────────────────────────────────────────────────────────────────┤
            │scoreMode   │ [symmetric|paired]                                              │
            ├────────────┼─────────────────────────────────────────────────────────────────┤
            │barcodes    │ 'bc_1', 'bc_2', ...., 'bc_N'                                    │
            └────────────┴─────────────────────────────────────────────────────────────────┘

       The barcodeIdx1 and barcodeIdx2 columns are indices into the barcodes attribute. The
       scoreMode attribute is the scoring mode used to align the barcodes. The barcodes
       attribute corresponds to the barcode.fasta sequence names.
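
       For readers who do inspect the file, a row of the best dataset can be decoded with
       plain Python once the row and the barcodes attribute have been read (for example with
       an HDF5 library). The sketch below is illustrative only: describe_call is a
       hypothetical name, and treating the indices as 0-based is an assumption that should be
       checked against real data.

```python
COLUMNS = ("holeNumber", "nAdapters", "barcodeIdx1", "barcodeScore1",
           "barcodeIdx2", "barcodeScore2")


def describe_call(row, barcodes):
    """Turn one row of /BarcodeCalls/best into a labeled dictionary.

    row      -- six integers in the column order listed above
    barcodes -- the names stored in the 'barcodes' attribute
    Assumption: barcodeIdx1 and barcodeIdx2 are 0-based.
    """
    call = dict(zip(COLUMNS, row))
    call["barcodeName1"] = barcodes[call["barcodeIdx1"]]
    call["barcodeName2"] = barcodes[call["barcodeIdx2"]]
    n_adapters = call["nAdapters"]
    # Average score of the best barcode: total-score / number-of-adapters.
    call["avgScore1"] = call["barcodeScore1"] / n_adapters if n_adapters else 0.0
    return call
```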

       Additionally, in some circumstances, it is useful to retain the entire history of the
       scoring, i.e., each barcode scored to each adapter across all ZMWs. In order to retain
       this information, one must call:
          pbbarcode labelZmws --saveExtendedInfo ...

       In this mode,  the  resultant  HDF5  file  will  have  an  additional  dataset  under  the
       BarcodeCalls group, named: all. This dataset has the following format:

       /BarcodeCalls/all - (nbarcodes * nadapters[zmw_i], 4) forall i in 1 ... nZMWs
          `holeNumber, adapterIdx, barcodeIdx, score`

       The  adapterIdx  is the index of the adapter along the molecule, i.e., adapterIdx 1 is the
       first adapter scored.
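
       The extended rows can be folded back into per-ZMW totals. The following Python sketch
       sums each barcode's score across adapters and keeps the two highest totals per ZMW. It
       mirrors the description of the best dataset for the simple case of one score per
       barcode and may not reproduce the package's results exactly, particularly in paired
       mode. The name best_two_per_zmw is hypothetical.

```python
from collections import defaultdict


def best_two_per_zmw(rows):
    """Summarize (holeNumber, adapterIdx, barcodeIdx, score) rows per ZMW."""
    totals = defaultdict(lambda: defaultdict(int))
    adapters = defaultdict(set)
    for hole_number, adapter_idx, barcode_idx, score in rows:
        totals[hole_number][barcode_idx] += score
        adapters[hole_number].add(adapter_idx)

    summary = {}
    for hole_number, per_barcode in totals.items():
        # Highest total first; ties are broken by the lower barcode index.
        ranked = sorted(per_barcode.items(), key=lambda kv: (-kv[1], kv[0]))
        summary[hole_number] = {
            "nAdapters": len(adapters[hole_number]),
            "best": ranked[:2],
        }
    return summary
```

       Comparing the per-adapter rows of a single ZMW in the same way is one possible starting
       point for spotting chimeras, where different adapters favor different barcodes.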

   Additions to the compare HDF5 (cmp.h5) File
       In addition to the barcode hdf5 file, a call to labelAlignments will annotate a cmp.h5
       file. This annotation is stored in ways consistent with the cmp.h5 file format.
       Specifically, a new group is added:
       /BarcodeInfo/
         ID   (nBarcodeLabels + 1, 1)[32-bit integer]
         Name (nBarcodeLabels + 1, 1)[variable length string]

       In addition to the /BarcodeInfo/ group,  the  key  dataset  which  assigns  alignments  to
       barcodes is located at:

       /AlnInfo/Barcode (nAlignments, 3)[32-bit integer] with the following columns:
          index,count,bestIndex,bestScore,secondBestIndex,secondBestScore

       Here  index  refers to the index into the Name vector, score corresponds to the sum of the
       scores for the barcodes, and finally, count refers to the number of adapters found in  the
       molecule.

                                          December 2015                              PBBARCODE(1)