Provided by: bcbio_1.1.2-3_all

NAME

       bcbio_nextgen - bcbio_nextgen Documentation

       A Python toolkit providing best-practice pipelines for fully automated high throughput
       sequencing analysis. You write a high level configuration file specifying your inputs and
       analysis parameters. This input drives a parallel pipeline that handles distributed
       execution, idempotent processing restarts and safe transactional steps. The goal is to
       provide a shared community resource that handles the data processing component of
       sequencing analysis, providing researchers with more time to focus on the downstream
       biology.

       bcbio-nextgen  provides  best-practice pipelines for automated analysis of high throughput
       sequencing data with the goal of being:

       · Quantifiable: Doing good science requires being able to accurately assess the quality of
         results and re-verify approaches as new algorithms and software become available.

       · Analyzable: Results feed into tools to make it easy to query and visualize the results.

       · Scalable:  Handle  large  datasets  and  sample populations on distributed heterogeneous
         compute environments.

       · Reproducible: Track configuration, versions, provenance  and  command  lines  to  enable
         debugging, extension and reproducibility of results.

       · Community developed: The development process is fully open and sustained by contributors
         from multiple institutions. By working together on a shared framework, we  can  overcome
         the  challenges associated with maintaining complex pipelines in a rapidly changing area
         of research.

       · Accessible: Bioinformaticians, biologists and the general public should be able  to  run
         these  tools  on  inputs ranging from research materials to clinical samples to personal
         genomes.

USERS

       A sample of institutions using  bcbio-nextgen  for  solving  biological  problems.  Please
       submit your story if you're using the pipeline in your own research.

       · Harvard School of Public Health: We use bcbio-nextgen within the bioinformatics core for
         variant calling on large population studies related to human health like  Breast  Cancer
         and  Alzheimer's disease.  Increasing scalability of the pipeline has been essential for
         handling study sizes of more than 1400 whole genomes.

       · Massachusetts General Hospital: The Department of Molecular Biology uses the pipeline to
         automatically process samples coming off Illumina HiSeq instruments. Automated pipelines
         perform alignment and sample-specific analysis, with results directly  uploaded  into  a
         local Galaxy instance.

       · Science for Life Laboratory: The genomics core platform in the Swedish National
         Infrastructure (NGI) for genomics has crunched over 16 Tbp (terabase pairs) and
         processed almost 7000 samples from the beginning of 2013 until the end of July. UPPMAX,
         our cluster located in Uppsala, has run the pipeline in production since 2010.

       · Institute of Human Genetics, UCSF: The Genomics Core Facility utilizes bcbio-nextgen to
         process more than 2000 whole genome, exome, RNA-seq and ChIP-seq samples on various
         projects. This pipeline tremendously lowers the barrier to accessing next generation
         sequencing technology. The community engaged here is also very helpful in providing
         best-practice advice and up-to-date solutions to ease scientific discovery.

       · IRCCS "Mario Negri" Institute for Pharmacological Research: The  Translational  Genomics
         Unit  in  the Department of Oncology uses bcbio-nextgen for targeted resequencing (using
         an Illumina MiSeq) to  identify  mutations  and  other  variants  in  tumor  samples  to
         investigate  their  link to tumor progression, patient survival and drug sensitivity and
         resistance. A poster from the 2014 European Society of Human Genetics  meeting  provides
         more  details  on  usage in ovarian cancer. A paper on the study of longitudinal ovarian
         cancer biopsies, which makes extensive use of bcbio-nextgen, was published  in  2015  in
         Annals of Oncology.

       · The Translational Genomics Research Institute (TGen): Members of the Huentelman lab at
         TGen apply bcbio-nextgen to a wide variety of studies, with a major focus on the
         neurobiology of aging and neurodegeneration in collaboration with the Arizona
         Alzheimer's Consortium (AAC) and the McKnight Brain Research Foundation. We also use
         bcbio in studies of rare diseases in children through TGen's Center for Rare Childhood
         Disorders (C4RCD), and other rare diseases such as Multiple System Atrophy (MSA).
         bcbio-nextgen has also been instrumental in projects for TGen's Program for Canine
         Health & Performance (PCHP) and numerous RNA-seq projects using rodent models. Our work
         with bcbio started with a partnership with Dell, the Neuroblastoma and Medulloblastoma
         Translational Research Consortium (NMTRC) and TGen as part of a Phase I clinical trial
         in these rare childhood cancers.

       · Computer  Science  and  Artificial Intelligence Laboratory (CSAIL), MIT: The Gifford lab
         uses the bcbio-nextgen pipeline to analyze a variety of sequencing  datasets  for  their
         research in genetics and regulatory genomics (including the SysCode and stemcell.mit.edu
         projects).  The pipeline applies collaboratively-developed best practices  for  analysis
         as  well as computation, which enables the lab to run the pipeline on local clusters and
         Amazon EC2.

       · Sheffield Bioinformatics Core, The University of Sheffield: The Sheffield
         Bioinformatics Core is a relatively new core facility at The University of Sheffield,
         and bcbio has been instrumental in setting up a best-practice bioinformatics analysis
         service. We employ bcbio to automate the analyses of RNA-seq, small RNA and ChIP-seq
         datasets for researchers at The University of Sheffield and the NIHR Biomedical
         Research Centre. In conjunction with the bcbioRNASeq Bioconductor package, we deliver
         publication-quality reports to our researchers based on reproducible analyses.

AUTOMATED

       We provide an automated script that installs third party analysis tools,  required  genome
       data  and  python  library  dependencies  for  running human variant and RNA-seq analysis,
       bundled into an isolated directory or virtual environment:

          wget https://raw.github.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
          python bcbio_nextgen_install.py /usr/local/share/bcbio --tooldir=/usr/local \
            --genomes GRCh37 --aligners bwa --aligners bowtie2

       bcbio should install cleanly on Linux systems. For Mac OSX, we suggest trying bcbio-vm,
       which runs bcbio on docs-cloud or isolates all the third party tools inside a Docker
       container. bcbio-vm is still a work in progress, but it is worth trying because not all
       of the dependencies bcbio uses install cleanly on OSX.

       With the command line above, indexes and associated data files go in
       /usr/local/share/bcbio-nextgen and tools go in /usr/local. If you don't have write
       permissions for the /usr/local directories, you can install into a user directory like
       ~/local or use sudo chmod to give your standard user permissions. Please don't run the
       installer with sudo or as the root user. Do not use directories with ':' in the name;
       this is not POSIX compliant and will cause installation failures. See the example below
       for a non-root install.
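
       For example, a non-root installation that keeps everything under your home directory
       might look like this (paths are illustrative; any writable location works):

          python bcbio_nextgen_install.py $HOME/local/share/bcbio --tooldir=$HOME/local \
            --genomes GRCh37 --aligners bwa

       Combine this with the --isolate option described below if you do not want the installer
       to modify your ~/.bashrc.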

       The installation is highly customizable, and you can install additional software and  data
       later  using  bcbio_nextgen.py  upgrade.   Run  python  bcbio_nextgen_install.py  with  no
       arguments to see options for configuring the installation process. Some  useful  arguments
       are:

       · --isolate Avoid updating the user's ~/.bashrc if installing in a non-standard PATH. This
         facilitates creation of isolated modules without  disrupting  the  user's  environmental
         setup. Manually edit your ~/.bashrc to allow bcbio runs with:

            export PATH=/path_to_bcbio/bin:$PATH

       · --nodata Do not install genome data.

       The machine will need to have some basic requirements for installing and running bcbio:

       · Python 2.7, Python 3.x, or Python 2.6 plus the argparse dependency.

       · Basic system setup for unpacking files: tar, gzip, unzip, bzip2, xz-utils.

       · The git version control system (http://git-scm.com/)

       · wget for file retrieval (https://www.gnu.org/software/wget/)

       Optional tool specific requirements:

       · Java  1.7, needed when running GATK < 3.6 or MuTect. This must be available in your path
         so typing java -version resolves a 1.7 version. bcbio distributes Java 8 as part of  the
         anaconda installation for recent versions of GATK and MuTect2. You can override the Java
         8 installed with bcbio by setting BCBIO_JAVA_HOME=/path/to/your/javadir if you have  the
         java you want in /path/to/your/javadir/bin/java.

       · An  OpenGL  library,  like  Mesa  (On  Ubuntu/deb  systems:  libglu1-mesa, On RedHat/rpm
         systems: mesa-libGLU-devel). This is only required  for  cancer  heterogeneity  analysis
         with BubbleTree.

       · The Pisces tumor-only variant caller requires the Microsoft .NET runtime.

       The  bcbio-nextgen  Dockerfile  contains  the  packages  needed  to install on bare Ubuntu
       systems.

       The automated installer creates a fully integrated environment that allows simultaneous
       updates of the framework, third party tools and biological data. Compared with manual
       installation, this makes it possible to manage and evolve a consistent analysis
       environment as algorithms continue to evolve and improve. Installing this way is as
       isolated and self-contained as possible without virtual machines or lightweight system
       containers like Docker. The Upgrade section documents how to add genome data for
       supported bcbio genomes; for genome builds not included in the defaults, see the
       documentation on config-custom. Following installation, you should edit the pre-created
       system configuration file in /usr/local/share/bcbio-nextgen/galaxy/bcbio_system.yaml to
       match your local system or cluster configuration (see tuning-cores), as sketched below.
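
       As a sketch, the relevant portion of bcbio_system.yaml sets global defaults for cores,
       per-core memory and JVM options; the values below are illustrative and should be tuned
       to your machine:

          resources:
            default:
              cores: 16
              memory: 3G
              jvm_opts: ["-Xms750m", "-Xmx3500m"]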

UPGRADE

       We use the same automated installation process for performing upgrades of tools,  software
       and  data  in  place.  Since  there  are  multiple  targets and we want to avoid upgrading
       anything unexpectedly, we have specific arguments  for  each.  Generally,  you'd  want  to
       upgrade the code, tools and data together with:

          bcbio_nextgen.py upgrade -u stable --tools --data

       Tune the upgrade with these options:

       · -u  Type  of  upgrade to do for bcbio-nextgen code. stable gets the most recent released
         version and development retrieves the latest code from GitHub.

       · --datatarget Customize installed data or download additional files not included by
         default: see Customizing data installation.

       · --toolplus  Specify  additional  tools to include. See the section on Extra software for
         more details.

       · --genomes and --aligners options add additional aligner indexes to download and prepare.
         bcbio_nextgen.py upgrade -h lists available genomes and aligners. If you want to install
         multiple genomes or aligners at once, specify --genomes or  --aligners  multiple  times,
         like this: --genomes GRCh37 --genomes mm10 --aligners bwa --aligners bowtie2

       · Leave  out  the  --tools option if you don't want to upgrade third party tools. If using
         --tools, it will use the same directory as  specified  during  installation.  If  you're
         using  an  older  version  that  has  not  yet  gone  through  a  successful  upgrade or
         installation and saved the tool directory, you should manually specify --tooldir for the
         first upgrade. You can also pass --tooldir to install to a different directory.

       · Leave  out  the --data option if you don't want to get any upgrades of associated genome
         data.

       · Some aligners, such as STAR, don't ship pre-built indices due to their large file
         sizes. Set the number of cores to use for indexing with --cores 8. A combined example
         follows this list.
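
       Putting these options together, a typical upgrade that refreshes code, tools and data
       while adding a new genome build and aligner index might look like:

          bcbio_nextgen.py upgrade -u stable --tools --data \
            --genomes hg38 --aligners bwa --cores 8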

CUSTOMIZING DATA INSTALLATION

       bcbio installs associated data files for sequence processing, and you're able to customize
       this  to  install  larger  files  or  change  the  defaults.  Use  the  --datatarget  flag
       (potentially multiple times) to customize or add new targets.

       By default, bcbio will install data files for variation, rnaseq and smallrna, but you
       can sub-select a single one of these if you don't require other analyses. An example
       upgrade command follows the list. The available targets are:

       · variation  --  Data  files  required  for  variant  calling: SNPs, indels and structural
         variants. These include files for annotation like dbSNP, associated  files  for  variant
         filtering, coverage and annotation files.

       · rnaseq  --  Transcripts  and  indices for running RNA-seq. The transcript files are also
         used for annotating and prioritizing structural variants.

       · smallrna -- Data files for doing small RNA analysis.

       · gemini -- The GEMINI  framework  associates  publicly  available  metadata  with  called
         variants,  and  provides  utilities  for  query  and  analysis. This target installs the
         required GEMINI data files, including ExAC.

       · gnomad -- gnomAD is a large scale collection of genome variants, expanding  on  ExAC  to
         include whole genome and more exome inputs. This is a large 25Gb download, available for
         human genome builds GRCh37, hg19 and hg38.

       · cadd -- CADD evaluates the potential impact of variations. It is  freely  available  for
         non-commercial  research,  but  requires licensing for commercial usage. The download is
         30Gb and GEMINI will include CADD annotations if present.

       · vep -- Data files for the Variant Effects Predictor (VEP). To use VEP as an  alternative
         to the default installed snpEff, set vep in the variant-config configuration.

       · dbnsfp -- Like CADD, dbNSFP provides integrated and generalized metrics from multiple
         sources to help with prioritizing variations for follow up. The files are large: dbNSFP
         is 10Gb, expanding to 100Gb during preparation. VEP will use dbNSFP for annotation of
         VCFs if included.

       · dbscsnv -- dbscSNV includes all potential human SNVs within splicing consensus regions
         (−3 to +8 at the 5' splice site and −12 to +2 at the 3' splice site), i.e. scSNVs, with
         related functional annotations and two ensemble prediction scores for predicting their
         potential to alter splicing. VEP will use dbscSNV for annotation of VCFs if included.

       · battenberg -- Data files for Battenberg, which detects subclonality and copy number
         changes in whole genome cancer samples.

       · kraken -- Database for Kraken, optionally used for contamination detection.

       · ericscript -- Database for EricScript, used for gene fusion detection. Supports hg38,
         hg19 and GRCh37.
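
       For example, to add the gnomAD and VEP data files to an existing installation, you might
       run:

          bcbio_nextgen.py upgrade --data --datatarget gnomad --datatarget vep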

       For  somatic  analyses, bcbio includes COSMIC v68 for hg19 and GRCh37 only. Due to license
       restrictions, we cannot include updated versions of this dataset and hg38 support with the
       installer.  To  prepare  these datasets yourself you can use a utility script shipped with
       cloudbiolinux that downloads, sorts and merges the  VCFs,  then  copies  into  your  bcbio
       installation:

          export COSMIC_USER="your@registered.email.edu"
          export COSMIC_PASS="cosmic_password"
          bcbio_python prepare_cosmic.py 83 /path/to/bcbio

EXTRA SOFTWARE

       We're  not  able to automatically install some useful tools due to licensing restrictions,
       so we provide a mechanism to manually download and add these to  bcbio-nextgen  during  an
       upgrade with the --toolplus command line.

   GATK and MuTect/MuTect2
       bcbio includes an installation of GATK4, which is freely available for all uses. This is
       the default runner for HaplotypeCaller or MuTect2. If you want to use an older version of
       GATK, it requires manual installation. Older GATK is freely available for academic users
       but requires a license for commercial use. It is not freely redistributable, so it
       requires a manual download from the GATK download site. You also need to include
       tools_off: [gatk4] in your configuration for runs: see config-changing-defaults.

       To install GATK3, register with the pre-installed gatk bioconda wrapper:

          gatk-register /path/to/GenomeAnalysisTK.tar.bz2

       If you're not using the most recent post-3.6 version of GATK, or using  a  nightly  build,
       you can add --noversioncheck to the command line to skip comparisons to the GATK version.

       MuTect2 is distributed with GATK in versions 3.5 and later.

       To  install  versions  of  GATK < 3.6, download and unzip the latest version from the GATK
       distribution. Then make this jar available to bcbio-nextgen with:

          bcbio_nextgen.py upgrade --tools --toolplus gatk=/path/to/gatk/GenomeAnalysisTK.jar

       This will copy the jar and update your bcbio_system.yaml and manifest files to reflect the
       new version.

       MuTect also has similar licensing terms and requires a license for commercial use. After
       downloading the MuTect jar, make it available to bcbio:

          bcbio_nextgen.py upgrade --tools --toolplus mutect=/path/to/mutect/mutect-1.1.7.jar

       Note that MuTect does not provide an easy way to query for the current version, so your
       input jar needs to include the version in the name.

SYSTEM REQUIREMENTS

       bcbio-nextgen  provides a wrapper around external tools and data, so the actual tools used
       drive the system requirements. For small projects, it should install  on  workstations  or
       laptops  with  a  couple  Gb  of memory, and then scale as needed on clusters or multicore
       machines.

       Disk space requirements for the tools, including all system packages, are under 4Gb.
       Biological data requirements will depend on the genomes and aligner indices used, but a
       suggested install with GRCh37 and bowtie2/bwa indexes uses approximately 35Gb of storage
       during preparation and ~25Gb after:

          $ du -shc genomes/Hsapiens/GRCh37/*
          3.8G  bowtie2
          5.1G  bwa
          3.0G  rnaseq-2014-05-02
          3.0G  seq
          340M  snpeff
          4.2G  variation
          4.4G  vep
          23.5G total

TROUBLESHOOTING

   Proxy or firewall problems
       Some  steps  retrieve  third  party tools from GitHub, which can run into issues if you're
       behind a proxy or block git ports. To instruct git to use  https://  globally  instead  of
       git://:

          $ git config --global url.https://github.com/.insteadOf git://github.com/

   GATK or Java Errors
       Most  software  tools  used  by  bcbio require Java 1.8. bcbio distributes an OpenJDK Java
       build and uses it so you don't need to install anything. Older versions of  GATK  (<  3.6)
       and  MuTect  require  a locally installed Java 1.7. If you have version incompatibilities,
       you'll see errors like:

          Unsupported major.minor version 51.0

       Fixing this requires either installing Java 1.7 for old GATK and MuTect or avoiding
       pointing at an incorrect java (unset JAVA_HOME). You can also tweak the java used by
       bcbio, as described in the Automated installation section and sketched below.
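
       As a sketch, making a locally installed Java 1.7 visible for old GATK while avoiding a
       conflicting JAVA_HOME might look like this (the Java path is illustrative):

          unset JAVA_HOME
          export PATH=/usr/lib/jvm/java-7-openjdk-amd64/bin:$PATH
          java -version   # should now report a 1.7 version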

   ImportErrors
       Import errors with tracebacks containing Python libraries outside of the bcbio
       distribution (/path/to/bcbio/anaconda) are often due to other conflicting Python
       installations. bcbio tries to isolate itself as much as possible, but external libraries
       can get included during installation via the PYTHONHOME or PYTHONPATH environmental
       variables or local site libraries. These commands will temporarily unset those to get
       bcbio installed, after which it should ignore them automatically:

          $ unset PYTHONHOME
          $ unset PYTHONPATH
          $ export PYTHONNOUSERSITE=1

       Finally,  having  a  .pydistutils.cfg  file in your home directory can mess with where the
       libraries get installed. If you  have  this  file  in  your  home  directory,  temporarily
       renaming it to something else may fix your installation issue.

MANUAL PROCESS

       The manual process does not allow the in-place updates and management of third party
       tools that the automated installer makes possible. It's a more error-prone and
       labor-intensive process. If you find you can't use the installer, we'd love to hear why
       so we can make it more amenable to your system. If you'd like to develop against a bcbio
       installation, see the documentation on setting up a code-devel-infrastructure.

   Tool Requirements
       The  code  drives  a  number of next-generation sequencing analysis tools that you need to
       install on any machines involved in the processing.  The  CloudBioLinux  toolkit  provides
       automated scripts to help with installation for both software and associated data files:

          fab -f cloudbiolinux/fabfile.py -H localhost install_biolinux:flavor=ngs_pipeline_minimal

       You can also install them manually, adjusting locations in the resources section of your
       bcbio_system.yaml configuration file as needed. The CloudBioLinux infrastructure provides
       a full list of third party software installed with bcbio-nextgen in packages-conda.yaml,
       which lists all third party tools installed through Bioconda.

   Data requirements
       In  addition  to  existing  bioinformatics  software the pipeline requires associated data
       files for reference genomes, including pre-built indexes for aligners.  The  CloudBioLinux
       toolkit again provides an automated way to download and prepare these reference genomes:

          fab -f data_fabfile.py -H localhost -c your_fabricrc.txt install_data_s3:your_biodata.yaml

       The   biodata.yaml   file  contains  information  about  what  genomes  to  download.  The
       fabricrc.txt describes where to install the genomes by adjusting the data_files  variable.
       This  creates  a  tree  structure  that  includes  a set of Galaxy-style location files to
       describe locations of indexes:

          ├── galaxy
          │   ├── tool-data
          │   │   ├── alignseq.loc
          │   │   ├── bowtie_indices.loc
          │   │   ├── bwa_index.loc
          │   │   ├── sam_fa_indices.loc
          │   │   └── twobit.loc
          │   └── tool_data_table_conf.xml
          ├── genomes
          │   ├── Hsapiens
          │   │   ├── GRCh37
          │   │   └── hg19
          │   └── phiX174
          │       └── phix
          └── liftOver

       Individual genome directories contain indexes for aligners in  individual  sub-directories
       prefixed by the aligner name. This structured scheme helps manage aligners that don't have
       native Galaxy  .loc  files.  The  automated  installer  will  download  and  set  this  up
       automatically:

          `-- phix
              |-- bowtie
              |   |-- phix.1.ebwt
              |   |-- phix.2.ebwt
              |   |-- phix.3.ebwt
              |   |-- phix.4.ebwt
              |   |-- phix.rev.1.ebwt
              |   `-- phix.rev.2.ebwt
              |-- bowtie2
              |   |-- phix.1.bt2
              |   |-- phix.2.bt2
              |   |-- phix.3.bt2
              |   |-- phix.4.bt2
              |   |-- phix.rev.1.bt2
              |   `-- phix.rev.2.bt2
              |-- bwa
              |   |-- phix.fa.amb
              |   |-- phix.fa.ann
              |   |-- phix.fa.bwt
              |   |-- phix.fa.pac
              |   |-- phix.fa.rbwt
              |   |-- phix.fa.rpac
              |   |-- phix.fa.rsa
              |   `-- phix.fa.sa
              |-- novoalign
              |   `-- phix
              |-- seq
              |   |-- phix.dict
              |   |-- phix.fa
              |   `-- phix.fa.fai
              `-- ucsc
                  `-- phix.2bit

GERMLINE VARIANT CALLING

       bcbio  implements  configurable  SNP,  indel  and  structural variant calling for germline
       populations. We include whole genome and exome evaluations against  reference  calls  from
       the  Genome  in  a  Bottle  consortium  and  Illumina  Platinum  Genomes project, enabling
       continuous assessment of new alignment and variant calling algorithms. We regularly report
       on  these  comparisons and continue to improve approaches as the community makes new tools
       available. Here is some of the research that contributes to the current implementation:

       · An introduction to the variant evaluation framework. This includes a comparison  of  the
         bwa mem and novoalign aligners. We also compared the FreeBayes, GATK HaplotypeCaller and
         GATK UnifiedGenotyper variant callers.

       · An in-depth evaluation of FreeBayes and BAM post-alignment  processing.  We  found  that
         FreeBayes  quality  was  equal  to  GATK  HaplotypeCaller.  Additionally,  a lightweight
         post-alignment preparation method using only de-duplication  was  equivalent  to  GATK's
         recommended  Base Quality Score Recalibration (BQSR) and realignment around indels, when
         using good quality input datasets and callers that do local realignment.

       · Additional work to improve variant filtering, providing methods to remove low complexity
         regions  (LCRs)  that can bias indel results. We also tuned GATK's Variant Quality Score
         Recalibrator (VQSR) and compared it with cutoff-based soft filtering.  VQSR  requires  a
         large  number  of  variants  and  we use it in bcbio with GATK HaplotypeCaller when your
         algorithm-config contains high depth samples (coverage_depth is not  low)  and  you  are
         calling  on the whole genome (coverage_interval is genome) or have more than 50 regional
         or exome samples called concurrently.

       · An evaluation of joint  calling  with  GATK  HaplotypeCaller,  FreeBayes,  Platypus  and
         samtools.  This  validates  the  joint calling implementation, allowing scaling of large
         population germline experiments.  It  also  demonstrates  improved  performance  of  new
         callers: samtools 1.0 and Platypus.

       · Support for build 38 of the human genome, improving precision of detection thanks to the
         improved genome representation.

       bcbio automates post-variant calling  annotation  to  make  the  outputs  easier  to  feed
       directly  into  your  biological  analysis.  We  annotate  variant effects using snpEff or
       Variant Effect Predictor (VEP), and prepare a GEMINI  database  that  associates  variants
       with  multiple  external annotations in a SQL-based query interface. GEMINI databases have
       the most associated external information for human samples (GRCh37/hg19 and hg38) but  are
       available  for  any  organism  with  the  database populated using the VCF INFO column and
       predicted effects.

   Basic germline calling
       The best approach to build a  bcbio  docs-config  for  germline  calling  is  to  use  the
       automated-sample-config with one of the default templates:

       · FreeBayes template -- Call variants using FreeBayes with a minimal preparation pipeline.
         This  is  a  freely  available  unrestricted  pipeline  fully  included  in  the   bcbio
         installation.

       · GATK  HaplotypeCaller  template -- Run GATK best practices, including Base Quality Score
         Recalibration, realignment and HaplotypeCaller variant calling. This requires a  license
         from  Broad for commercial use. You need to manually install GATK along with bcbio using
         downloads from the GATK Broad site or Appistry (see toolplus-install).

       You may also want to enable Structural variant calling for detection of larger events,
       which works with either caller. Another good source of inspiration is the configuration
       files from the example-pipelines, which may help identify other configuration variables
       of interest. A more complex setup with multiple callers and resolution of ensemble calls
       is generally only useful with a small population where you are especially concerned
       about sensitivity. Single caller detection with FreeBayes or GATK HaplotypeCaller
       provides good resolution of events.
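
       As a concrete sketch, creating a project configuration from the FreeBayes template with
       the automated template system might look like this (project and sample file names are
       illustrative):

          bcbio_nextgen.py -w template freebayes-variant project1 sample1_1.fq sample1_2.fq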

   Population calling
       When  calling  multiple  samples,  we  recommend  calling  together  to  provide  improved
       sensitivity  and  a  fully  squared  off final callset. To associate samples together in a
       population add a metadata batch to the sample-configuration:

          - description: Sample1
            metadata:
              batch: Batch1
          - description: Sample2
            metadata:
              batch: Batch1

       Batching samples results in output VCFs and GEMINI databases containing all merged  sample
       calls. bcbio has two methods to call samples together:

       · Batch  or pooled calling -- This calls all samples simultaneously by feeding them to the
         variant  caller.  This  works  for  smaller  batch  sizes  (<  100  samples)  as  memory
         requirements  become  limiting  in larger pools. This is the default approach taken when
         you specify a variantcaller in the variant-config configuration.

       · Joint calling -- This calls samples independently, then combines them  together  into  a
         single  callset  by  integrating  the individual calls. This scales to larger population
         sizes by avoiding the computational bottlenecks of pooled calling.  We  recommend  joint
         calling  with  HaplotypeCaller  if  you  have a license for GATK usage, but also support
         joint calling with FreeBayes using a custom  implementation.  Specifying  a  jointcaller
         along  with  the  appropriate  variantcaller in the variant-config configuration enables
         this:

            - description: Sample1
              algorithm:
                variantcaller: gatk-haplotype
                jointcaller: gatk-haplotype-joint
              metadata:
                batch: Batch1
            - description: Sample2
              algorithm:
                variantcaller: gatk-haplotype
                jointcaller: gatk-haplotype-joint
              metadata:
                batch: Batch1

CANCER VARIANT CALLING

       bcbio supports somatic cancer calling with tumor and optionally matched normal pairs using
       multiple  SNP,  indel  and structural variant callers. A full evaluation of cancer calling
       validates callers against synthetic dataset 3 from the ICGC-TCGA DREAM  challenge.   bcbio
       uses  a  majority  voting ensemble approach to combining calls from multiple SNP and indel
       callers, and also flattens structural variant calls into a combined representation.

       The example configuration for the example-cancer validation is a good starting  point  for
       setting  up  a  tumor/normal run on your own dataset. The configuration works similarly to
       population based calling. Supply a consistent batch for tumor/normal pairs and  mark  them
       with the phenotype:

          - description: your-tumor
            algorithm:
              variantcaller: [vardict, strelka2, mutect2]
            metadata:
              batch: batch1
              phenotype: tumor
          - description: your-normal
            algorithm:
              variantcaller: [vardict, strelka2, mutect2]
            metadata:
              batch: batch1
              phenotype: normal

       Other config-cancer configuration options allow tweaking of the processing parameters. For
       pairs you want to analyze together, specify a consistent set of variantcaller options  for
       both samples.

       Cancer calling handles both tumor-normal paired calls and tumor-only calling. To specify a
       tumor-only sample, provide a single sample labeled with phenotype:  tumor.  Otherwise  the
       configuration and setup is the same as with paired analyses. For tumor-only samples, bcbio
       will try to remove likely germline variants present in  the  public  databases  like  1000
       genomes  and  ExAC,  and  not in COSMIC. This runs as long as you have a local GEMINI data
       installation (--datatarget gemini) and marks likely germline variants with  a  LowPriority
       filter. This post has more details on the approach and validation.
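
       A minimal tumor-only sample entry, modeled on the paired example above, might look like
       this (names are illustrative):

          - description: your-tumor-only
            algorithm:
              variantcaller: [vardict, mutect2]
            metadata:
              batch: batch1
              phenotype: tumor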

       The  standard  variant  outputs (sample-caller.vcf.gz) for tumor calling emphasize somatic
       differences, those likely variants unique to the cancer. If you have a  tumor-only  sample
       and  GEMINI data installed, it will also output sample-caller-germline.vcf.gz, which tries
       to identify germline background mutations based on presence in public  databases.  If  you
       have  tumor/normal  data  and  would  like  to also call likely germline mutations see the
       documentation on specifying a germline caller: Somatic with germline variants.

       We're actively working on improving calling to better account for  the  heterogeneity  and
       structural variability that define cancer genomes.

SOMATIC WITH GERMLINE VARIANTS

       For  tumor/normal  somatic  samples,  bcbio  can  call  both  somatic (tumor-specific) and
       germline (pre-existing) variants. The typical outputs of Cancer variant calling are likely
       somatic variants acquired by the cancer, but pre-existing germline risk variants are often
       also diagnostic.

       For tumor-only cases we suggest  running  standard  Cancer  variant  calling.   Tumor-only
       inputs  mix  somatic  and  germline  variants, making it difficult to separate events. For
       small variants (SNPs and indels) bcbio will attempt to distinguish  somatic  and  germline
       mutations using the presence of variants in population databases.

       To obtain both somatic and germline calls for your tumor/normal inputs, specify which
       callers to use for each step in the variant-config configuration:

          description: your-normal
          algorithm:
            variantcaller:
              somatic: vardict
              germline: freebayes

       bcbio does a single alignment for the normal sample, then splits at  the  variant  calling
       steps  using  this normal sample to do germline calling. In this example, the output files
       are:

       · your-tumor/your-tumor-vardict.vcf.gz -- Somatic calls from the tumor samples  using  the
         normal as background to subtract existing calls.

       · your-normal/your-normal-freebayes.vcf.gz -- Germline calls on the normal sample.

       Germline calling supports multiple callers, and other configuration options like
       ensemble and structural variant calling carry over from the rest of the configuration.
       For example, to use 3 callers for somatic and germline calling, create ensemble calls
       for both and include germline and somatic events from two structural variant callers:

          variantcaller:
             somatic: [vardict, strelka2, mutect2]
             germline: [freebayes, gatk-haplotype, strelka2]
          ensemble:
             numpass: 2
          svcaller: [manta, cnvkit]

       In addition to the somatic and germline outputs attached to the tumor  and  normal  sample
       outputs as described above, you'll get:

       · your-tumor/your-tumor-manta.vcf.gz   --   Somatic  structural  variant  calls  for  each
         specified svcaller. These will have genotypes for both the  tumor  and  normal  samples,
         with somatic calls labeled as PASS variants.

       · your-normal/your-normal-manta.vcf.gz  --  Germline  structural  variant  calls  for each
         specified svcaller. We expect these to be noisier than the somatic calls due to the lack
         of a reference sample to help remove technical noise.

STRUCTURAL VARIANT CALLING

       bcbio  can  detect  larger  structural variants like deletions, insertions, inversions and
       copy number changes for both germline population and  cancer  variant  calling,  based  on
       validation against existing truth sets:

       · Validation  of  germline  structural variant detection using multiple calling methods to
         validate against deletions in NA12878. This implements a pipeline that works  in  tandem
         with  SNP  and  indel  calling  to  detect  larger structural variations like deletions,
         duplications, inversions and copy number variants (CNVs).

       · Validation of tumor/normal calling  using  the  synthetic  DREAM  validation  set.  This
         includes   validation   of  additional  callers  against  duplications,  insertions  and
         inversions.

       To enable structural variant calling, specify svcaller options in the algorithm section of
       your configuration:

          - description: Sample
            algorithm:
              svcaller: [lumpy, manta, cnvkit]

       The best supported callers are Lumpy and Manta for paired end and split read calling,
       CNVkit for read-depth based CNV calling, and WHAM for association testing. We also
       support DELLY, another excellent paired end and split read caller, although it is slow
       on large whole genome datasets.

       In addition to results from individual callers, bcbio can  create  a  summarized  ensemble
       callset  using  MetaSV. We're actively working on improved structural variant reporting to
       highlight potential variants of interest.

RNA-SEQ

       bcbio can also be used to analyze RNA-seq data. It includes  steps  for  quality  control,
       adapter   trimming,   alignment,   variant   calling,   transcriptome  reconstruction  and
       post-alignment quantitation at the level of the gene and isoform.

       We recommend using the STAR aligner for all genomes where there are no alt alleles. For
       genomes such as hg38 that have alt alleles, hisat2 should be used, as it handles the alts
       correctly and STAR does not yet. Use Tophat2 only if you do not have enough RAM available
       to run STAR (about 30 GB).
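
       As a sketch, a sample entry selecting STAR for a standard RNA-seq run might look like
       this (sample details are illustrative):

          - files: [/full/path/to/sample_1.fastq, /full/path/to/sample_2.fastq]
            description: Sample1
            genome_build: GRCh37
            analysis: RNA-seq
            algorithm:
              aligner: star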

       Our current recommendation is to run adapter trimming only if using the Tophat2 aligner.
       Adapter trimming is very slow, and aligners that soft clip the ends of reads, such as
       STAR and hisat2, or algorithms using pseudoalignments like Sailfish, handle contaminant
       sequences at the ends properly, making trimming unnecessary. Tophat2 does not perform
       soft clipping, so if using Tophat2, trimming must still be done.

       Salmon, an extremely fast alignment-free method of quantitation, is run for all
       experiments. Salmon can accurately quantitate the expression of genes, even ones which
       are hard to quantitate with other methods (see this paper on Sailfish, which performs
       similarly to Salmon, for an example). Salmon can also quantitate at the transcript
       level, which can help gene-level analyses (see this paper for an example). We recommend
       using the Salmon quantitation rather than the counts from featureCounts to perform
       downstream quantification.

       Although  we  do  not  recommend  using the featureCounts based counts, the alignments are
       still useful because they give you many more quality  metrics  than  the  quasi-alignments
       from Salmon.

       After a bcbio RNA-seq run, the upload directory will contain a directory for each sample
       holding a BAM file of the aligned and unaligned reads, a sailfish directory with the
       output of Salmon, including TPM values, and a qc directory with plots from FastQC and
       qualimap.

       In addition to the directories for each sample, the upload directory contains a project
       directory with a YAML file describing summary statistics for each sample and some
       provenance data about the bcbio run. That directory also contains a combined.counts file
       with the featureCounts derived counts per sample.

FAST RNA-SEQ

       This mode of bcbio-nextgen quantitates transcript expression using Salmon and does nothing
       else. It is an order of magnitude faster or more than running the full  RNA-seq  analysis.
       The  cost  of  the  increased speed is that you will have much less information about your
       samples at the end of the run, which  can  make  troubleshooting  trickier.   Invoke  with
       analysis: fastrna-seq.
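
       A minimal sample entry for this mode, assuming the same YAML layout as the standard
       RNA-seq pipeline, might look like:

          - files: [/full/path/to/sample_1.fastq, /full/path/to/sample_2.fastq]
            description: Sample1
            genome_build: GRCh37
            analysis: fastrna-seq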

SINGLE-CELL RNA-SEQ

       bcbio-nextgen supports unique molecular identifier (UMI) based single-cell RNA-seq
       analyses. If your single-cell prep does not use unique molecular identifiers (UMIs), you
       can most likely just run the standard RNA-seq pipeline and use the results from that. The
       UMIs are used to discard reads which are likely PCR duplicates, which is very helpful for
       removing some of the PCR duplicate noise that can dominate single-cell experiments.

       Unlike the standard RNA-seq pipeline, the single-cell pipeline expects the FASTQ input
       files to not be separated by cellular barcode, so each file is a mix of cells identified
       by a cellular barcode (CB), with unique reads from a transcript identified by a UMI.
       bcbio-nextgen inspects each read, identifies the cellular barcode and UMI and puts them
       in the read name. The reads are then aligned to the transcriptome with RapMap and the
       number of reads aligning to each transcript is counted for each cellular barcode. The
       output is a table of counts for each input FASTQ file, with transcripts as the rows and
       cellular barcodes as the columns.

       Optionally the reads can be quantitated with kallisto to output  transcript  compatibility
       counts rather than counts per gene (TCC paper).

       To  extract the UMI and cellular barcodes from the read, bcbio-nextgen needs to know where
       the UMI and the cellular barcode are expected to  be  in  the  read.  Currently  there  is
       support  for two schemes, the inDrop system from the Harvard single-cell core facility and
       CEL-seq. If bcbio-nextgen does not support your UMI and barcoding scheme, please  open  up
       an issue and we will help implement support for it.

       Most  of  the  heavy  lifting  for  this  part of bcbio-nextgen is implemented in the umis
       repository.

SMALLRNA-SEQ

       bcbio-nextgen also implements a  configurable  best-practices  pipeline  for  smallRNA-seq
       quality  controls,  adapter  trimming,  miRNA/isomiR  quantification  and  other small RNA
       detection.

       · Adapter trimming:

         · atropos

         · dnapi for adapter de-novo detection

       · Sequence alignment:

         · STAR for genome annotation

          · bowtie, bowtie2 and hisat2 for genome annotation as options

       · Specific small RNA quantification (miRNA/tRNAs...):

         · seqbuster for miRNA annotation

         · MINTmap for tRNA fragments annotation

          · miRge2 for alternative small RNA quantification. To set up this tool, you need to
            install miRge2.0 manually and download the library data for your species. See the
            miRge2 documentation for how to install the tool and download the data. If you have
            the human folder at /mnt/data/human, the option to pass to resources will be
            /mnt/data. Then set up resources:

               resources:
                 mirge:
                   options: ["-lib $PATH_TO_PARENT_SPECIES_LIB"]

       · Quality control:

         · FastQC

       · Other small RNAs quantification:

         · seqcluster

         · mirDeep2 for miRNA prediction

       The pipeline generates an RMD template file inside the report folder that can be
       rendered with knitr; an example of the report is available online. The count table
       (counts_mirna.tsv) from miRBase miRNAs will be inside the mirbase or final project
       folder. Input files for the isomiRs package for isomiR analysis will be inside each
       sample's mirbase folder. If mirdeep2 can run, a count table (counts_mirna_novel.tsv) for
       novel miRNAs will be inside the mirdeep2 or final project folder. tdrmapper results will
       be inside each sample inside the tdrmapper or final project folder.

CHIP-SEQ

       The  bcbio-nextgen  implementation  of  ChIP-seq aligns, removes multimapping reads, calls
       peaks with a paired input file using MACS2 and outputs  a  set  of  greylist  regions  for
       filtering possible false peaks in regions of high depth in the input file.

       · Adapter trimming: atropos

       · Sequence alignment: bowtie2, bwa mem

       · Peak calling: macs2

       · Greylisting: chipseq-greylist

       · Quality control: FastQC
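
       As a sketch, enabling this pipeline for a sample might look like the following (the
       peakcaller key is an assumption to verify against your bcbio version):

          - description: your-chip
            analysis: chip-seq
            algorithm:
              aligner: bowtie2
              peakcaller: [macs2]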

STANDARD

       This pipeline implements alignment and QC tools. Furthermore, it will run qsignature to
       detect possible duplicated or mislabeled samples. It uses SNP signatures to create a
       distance matrix that makes it easy to group samples. The project yaml file will show the
       number of total samples analyzed, the number of very similar samples, and samples that
       could be duplicated.
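
       As a sketch, a sample entry for this pipeline might look like the following (the
       Standard analysis name and the mixup_check option are assumptions to verify against your
       bcbio version):

          - files: [/full/path/to/sample_1.fastq, /full/path/to/sample_2.fastq]
            description: Sample1
            genome_build: GRCh37
            analysis: Standard
            algorithm:
              mixup_check: qsignature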

   Configuration
       We  will assume that you installed bcbio-nextgen with the automated installer, and so your
       default bcbio_system.yaml file is configured correctly with all of the tools  pointing  to
       the  right  places. If that is the case, to run bcbio-nextgen on a set of samples you just
       need to set up a YAML file that describes your samples and what you would like  to  do  to
       them. Let's say that you have a single paired-end control lane, prepared with the Illumina
       TruSeq Kit from a human. Here is what a well-formed sample  YAML  file  for  that  RNA-seq
       experiment would look like:

          fc_date: '070113'
          fc_name: control_experiment
          upload:
            dir: final
          details:
            - files: [/full/path/to/control_1.fastq, /full/path/to/control_2.fastq]
              description: 'Control_rep1'
              genome_build: GRCh37
              analysis: RNA-seq
              algorithm:
                   aligner: tophat2
                   quality_format: Standard
                   trim_reads: read_through
                   adapters: [truseq, polya]
                   strandedness: unstranded

       fc_date and fc_name will be combined to form a prefix to name intermediate files, and
       can be set to whatever you like. upload is explained pretty well in the configuration
       documentation and the above will direct bcbio-nextgen to put the output files from the
       pipeline into the final directory. Under details is a list of sections, each describing
       a sample to process. You can set many parameters under each section, but most of the
       time just setting a few like the above is all that is necessary. analysis tells
       bcbio-nextgen to run the best-practice RNA-seq pipeline on this sample.

       In the above, since there are two files, control_1.fastq and control_2.fastq, the sample
       will automatically be run as paired-end data. If you have single-end data you can just
       supply one file and it will run as single-end. The description field will be used to
       eventually rename the files, so make it very evocative since you will be looking at it a
       lot later. genome_build is self-explanatory.

       Sometimes you need a little bit more flexibility than the standard pipeline, and the
       algorithm section has many options to fine-tune the pipeline's behavior. quality_format
       tells bcbio-nextgen what quality format your FASTQ inputs are using; if your samples
       were sequenced any time past 2009 or so, you probably want to set it to Standard.
       Adapter read-through is a problem in RNA-seq libraries, so we want to trim off possible
       adapter sequences on the ends of reads. trim_reads is therefore set to read_through,
       which will also trim off poor quality ends. Since your library is an RNA-seq library
       prepared with the TruSeq kit, the set of adapters to trim off are the TruSeq adapters
       and possible polyA tails, so adapters is set to both of those. strandedness can be set
       if your library was prepared in a strand-specific manner and can be set to firststrand,
       secondstrand or unstranded (the default).

   Multiple samples
       Let's say you have a set of mouse samples to analyze, where each sample is a single lane
       of single-end RNA-seq reads prepared using the Nextera kit. There are two case and two
       control samples. Here is a sample configuration file for that analysis:

          fc_date: '070113'
          fc_name: mouse_analysis
          upload:
            dir: final
          details:
            - files: [/full/path/to/control_rep1.fastq]
              description: 'Control_rep1'
              genome_build: mm10
              analysis: RNA-seq
              algorithm:
                   aligner: tophat2
                   quality_format: Standard
                   trim_reads: read_through
                   adapters: [nextera, polya]
            - files: [/full/path/to/control_rep2.fastq]
              description: 'Control_rep2'
              genome_build: mm10
              analysis: RNA-seq
              algorithm:
                   aligner: tophat2
                   quality_format: Standard
                   trim_reads: read_through
                   adapters: [nextera, polya]
            - files: [/full/path/to/case_rep1.fastq]
              description: 'Case_rep1'
              genome_build: mm10
              analysis: RNA-seq
              algorithm:
                   aligner: tophat2
                   quality_format: Standard
                   trim_reads: read_through
                   adapters: [nextera, polya]
            - files: [/full/path/to/case_rep2.fastq]
              description: 'Case_rep2'
              genome_build: mm10
              analysis: RNA-seq
              algorithm:
                   aligner: tophat2
                   quality_format: Standard
                   trim_reads: read_through
                   adapters: [nextera, polya]

       More samples are added just by adding more entries under the details section. This is
       tedious and error-prone to do by hand, so there is an automated template system for
       common experiments. You could set up the previous experiment by making a mouse version
       of the illumina-rnaseq template file and saving it to a local file such as
       illumina-mouse-rnaseq.yaml. Then you can set up the sample file using the templating
       system:

          bcbio_nextgen.py -w template illumina-mouse-rnaseq.yaml mouse_analysis
          /full/path/to/control_rep1.fastq /full/path/to/control_rep2.fastq
          /full/path/to/case_rep1.fastq /full/path/to/case_rep2.fastq

       If you had paired-end samples instead of single-end samples, you can still use the
       template system as long as the forward and reverse read filenames are the same apart
       from a _1 and _2. For example: control_1.fastq and control_2.fastq will be detected as
       paired and combined in the YAML file output by the templating system.

OVERVIEW

       1. Create  a  sample  configuration  file for your project (substitute the example BAM and
          fastq names below with the full path to your sample files):

             bcbio_nextgen.py -w template gatk-variant project1 sample1.bam sample2_1.fq sample2_2.fq

          This uses a standard template (GATK best practice variant calling) to automate creation
          of  a  full configuration for all samples. See automated-sample-config for more details
          on running the script, and manually edit the base template  or  final  output  file  to
          incorporate  project  specific  configuration.  The  example  pipelines  provide a good
          starting point and the sample-configuration documentation has full details on available
          options.

       2. Run analysis, distributed across 8 local cores:

             bcbio_nextgen.py bcbio_sample.yaml -n 8

       3. Read  the  docs-config  documentation for full details on adjusting both the sample and
          system configuration files to match your experiment and computational setup.

PROJECT DIRECTORY

       bcbio encourages a project structure like:

          my-project/
          ├── config
          ├── final
          └── work

       with the input configuration in the config directory, the outputs of the pipeline  in  the
       final  directory,  and  the  actual  processing  done  in  the  work  directory.  Run  the
       bcbio_nextgen.py script from inside the work directory to keep  all  intermediates  there.
       The  final  directory,  relative  to  the  parent  directory of the work directory, is the
       default location specified in the example configuration  files  and  gets  created  during
       processing. The final directory has all of the finished outputs and you can remove the
       work intermediates to clean up disk space after confirming the results. All of these
       locations are configurable and this project structure is only a recommendation.

LOGGING

       There are 3 logging files in the log directory within your working folder:

       · bcbio-nextgen.log  High  level logging information about the analysis.  This provides an
         overview of major processing steps and useful checkpoints for assessing run times.

       · bcbio-nextgen-debug.log Detailed information  about  processes  including  stdout/stderr
         from  third  party  software  and  error  traces for failures. Look here to identify the
         status of running pipelines or to debug errors. It labels each line with the hostname of
         the machine it ran on to ease debugging in distributed cluster environments.

       · bcbio-nextgen-commands.log Full command lines for all third party software tools run.
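
       For example, to follow the detailed progress of a running pipeline from within the work
       directory:

          tail -f log/bcbio-nextgen-debug.log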

EXAMPLE PIPELINES

       We  supply  example  input configuration files for validation and to help in understanding
       the pipeline.

   Whole genome trio (50x)
       This input configuration runs whole genome variant calling using bwa, GATK HaplotypeCaller
       and  FreeBayes.  It uses a father/mother/child trio from the CEPH NA12878 family: NA12891,
       NA12892, NA12878.  Illumina's Platinum genomes project has 50X whole genome sequencing  of
       the  three members. The analysis compares results against a reference NA12878 callset from
       NIST's Genome in a Bottle initiative.

       To run the analysis do:

          mkdir -p NA12878-trio-eval/config NA12878-trio-eval/input NA12878-trio-eval/work
          cd NA12878-trio-eval/config
          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-validate.yaml
          cd ../input
          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-validate-getdata.sh
          bash NA12878-trio-wgs-validate-getdata.sh
          cd ../work
          bcbio_nextgen.py ../config/NA12878-trio-wgs-validate.yaml -n 16

       This is a large whole genome  analysis  and  meant  to  test  both  pipeline  scaling  and
       validation  across  the  entire  genome.  It  can  take  multiple days to run depending on
       available cores. It requires 300GB for the input files and 1.3TB for the work
       directory. Smaller examples below exercise the pipeline with lower disk and
       computational requirements.

       We also have a more extensive evaluation  that  includes  2  additional  variant  callers,
       Platypus and samtools, and 3 different methods of calling variants: single sample, pooled,
       and incremental joint calling. This uses the same input data  as  above  but  a  different
       input configuration file:

          mkdir -p NA12878-trio-eval/work_joint
          cd NA12878-trio-eval/config
          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-joint.yaml
          cd ../work_joint
          bcbio_nextgen.py ../config/NA12878-trio-wgs-joint.yaml -n 16

   Exome with validation against reference materials
       This example calls variants on NA12878 exomes from EdgeBio's clinical sequencing pipeline,
       and compares them against reference materials from NIST's Genome in a  Bottle  initiative.
       This supplies a full regression pipeline to ensure consistency of calling between releases
       and updates of third party software. The pipeline performs  alignment  with  bwa  mem  and
       variant calling with FreeBayes, GATK UnifiedGenotyper and GATK HaplotypeCaller. Finally it
       integrates all 3 variant calling approaches into a combined ensemble callset.

       This is a large full exome example with multiple variant callers, so can take more than 24
       hours on machines using multiple cores.

       First  get  the  input  configuration  file, fastq reads, reference materials and analysis
       regions:

          mkdir -p NA12878-exome-eval
          cd NA12878-exome-eval
          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-exome-methodcmp-getdata.sh
          bash NA12878-exome-methodcmp-getdata.sh

       Finally run the analysis, distributed on 8 local cores, with:

          cd work
          bcbio_nextgen.py ../config/NA12878-exome-methodcmp.yaml -n 8

       The grading-summary.csv contains detailed comparisons of the results to the NIST reference
       materials, enabling rapid comparisons of methods.

   Cancer tumor normal
       This  example  calls  variants  using  multiple approaches in a paired tumor/normal cancer
       sample from the ICGC-TCGA DREAM challenge. It uses synthetic dataset 3 which has  multiple
       subclones,  enabling  detection  of  lower frequency variants. Since the dataset is freely
       available and has a truth set, this allows us to do a full evaluation of variant callers.

       To get the data:

          mkdir -p cancer-dream-syn3/config cancer-dream-syn3/input cancer-dream-syn3/work
          cd cancer-dream-syn3/config
          wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/cancer-dream-syn3.yaml
          cd ../input
          wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/cancer-dream-syn3-getdata.sh
          bash cancer-dream-syn3-getdata.sh

       Run with:

          cd ../work
          bcbio_nextgen.py ../config/cancer-dream-syn3.yaml -n 8

       The configuration and data retrieval script include downloads for both exome-only and
       whole genome analyses. Exome is enabled by default, but you can use the larger whole
       genome evaluation by uncommenting the relevant parts of the configuration and
       retrieval script.

   Cancer-like mixture with Genome in a Bottle samples
       This example simulates somatic cancer calling using a mixture of two Genome  in  a  Bottle
       samples, NA12878 as the "tumor" mixed with NA24385 as the background.  The Hartwig Medical
       Foundation and Utrecht Medical Center  generated  this  "tumor/normal"  pair  by  physical
       mixing  of  samples  prior  to  sequencing. The GiaB FTP directory has more details on the
       design and truth sets.  The sample has variants at 15% and 30%, providing the  ability  to
       look at lower frequency mutations.

       To get the data:

          wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/cancer-giab-na12878-na24385-getdata.sh
          bash cancer-giab-na12878-na24385-getdata.sh

       Then run the analysis with:

          cd work
          bcbio_nextgen.py ../config/cancer-giab-na12878-na24385.yaml -n 16

   Structural variant calling -- whole genome NA12878 (50x)
       This  example  runs  structural  variant  calling  with multiple callers (Lumpy, Manta and
       CNVkit), providing a combined output summary file and validation metrics  against  NA12878
       deletions. It uses the same NA12878 input as the whole genome trio example.

       To run the analysis do:

          mkdir -p NA12878-sv-eval
          cd NA12878-sv-eval
          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-sv-getdata.sh
          bash NA12878-sv-getdata.sh
          cd work
          bcbio_nextgen.py ../config/NA12878-sv.yaml -n 16

       This is a large whole genome analysis, and the timing and disk space requirements for
       the NA12878 trio analysis above apply here as well.

   RNAseq example
       This example aligns and creates count files for  use  with  downstream  analyses  using  a
       subset of the SEQC data from the FDA's Sequencing Quality Control project.

       Get the setup script and run it; this will download six samples from the SEQC
       project, three from the HBRR panel and three from the UHRR panel, requiring about
       100GB of disk space for the input files. It will also set up a configuration file for
       the run, using the templating system:

          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/rnaseq-seqc-getdata.sh
          bash rnaseq-seqc-getdata.sh

       Now go into the work directory and run the analysis:

          cd seqc/work
          bcbio_nextgen.py ../config/seqc.yaml -n 8

       This will run a full scale RNAseq experiment using Tophat2 as the aligner and will take  a
       long  time  to  finish  on  a  single machine. At the end it will output counts, Cufflinks
       quantitation and a set of QC results about each lane.  If  you  have  a  cluster  you  can
       parallelize it to speed it up considerably.

       A nice-looking standalone report of the bcbio-nextgen run can be generated using
       bcbio.rnaseq. Check that repository for details.

   Human genome build 38
       Validate variant calling on human genome build 38, using two different  builds  (with  and
       without  alternative  alleles) and three different validation datasets (Genome in a Bottle
       prepared with two methods and Illumina platinum genomes).  To run:

          mkdir -p NA12878-hg38-val
          cd NA12878-hg38-val
          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-hg38-validate-getdata.sh
          bash NA12878-hg38-validate-getdata.sh
          mkdir -p work
          cd work
          bcbio_nextgen.py ../config/NA12878-hg38-validate.yaml -n 16

   Whole genome (10x)
       An input configuration for running whole genome variant calling with bwa and GATK, using
       Illumina's  Platinum  genomes project (NA12878-illumina.yaml). See this blog post on whole
       genome scaling for expected run times and more information about the pipeline. To run  the
       analysis:

       · Create an input directory structure like:

            ├── config
            │   └── NA12878-illumina.yaml
            ├── input
            └── work

       · Retrieve inputs and comparison calls:

            cd input
            wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR091/ERR091571/ERR091571_1.fastq.gz
            wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR091/ERR091571/ERR091571_2.fastq.gz

       · Retrieve configuration input file:

            cd config
            wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-illumina.yaml

       · Run analysis on 16 core machine:

            cd work
            bcbio_nextgen.py ../config/NA12878-illumina.yaml -n 16

       · Examine   summary   of   concordance  and  discordance  to  comparison  calls  from  the
         grading-summary.csv file in the work directory.

TEST SUITE

       The test suite exercises the scripts driving the analysis, so it is a good starting
       point to ensure correct installation. Tests use the pytest framework. The tests are
       available in the bcbio source code:

          $ git clone https://github.com/bcbio/bcbio-nextgen.git

       There is a small wrapper script that finds py.test and other dependencies
       pre-installed with bcbio, which you can use to run tests:

          $ cd tests
          $ ./run_tests.sh

       You can use this to run specific test targets:

          $ ./run_tests.sh cancer
          $ ./run_tests.sh rnaseq
          $ ./run_tests.sh devel
          $ ./run_tests.sh docker

       Optionally,  you  can run pytest directly from the bcbio install to tweak more options. It
       will be in /path/to/bcbio/anaconda/bin/py.test. Pass -s to py.test to see the stdout  log,
       and -v to make py.test output more verbose. The tests are marked with labels which you can
       use to run a specific subset of the tests using the -m argument:

          $ py.test -m rnaseq

       To run unit tests:

          $ py.test tests/unit

       To run integration pipeline tests:

          $ py.test tests/integration

       To run tests which use bcbio_vm:

          $ py.test tests/bcbio_vm

       To see the test coverage, add the --cov=bcbio argument to py.test.
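
       These options combine. For example, to run the RNA-seq subset verbosely with live
       stdout and a coverage report:

          $ /path/to/bcbio/anaconda/bin/py.test -v -s -m rnaseq --cov=bcbio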

       By default the test suite will use your installed system configuration for running  tests,
       substituting  the  test  genome  information  instead of using full genomes. If you need a
       specific  testing  environment,  copy   tests/data/automated/post_process-sample.yaml   to
       tests/data/automated/post_process.yaml to provide a test-only configuration.

CONFIGURATION

       Two configuration files, in easy-to-write YAML format, specify details about your
       system and the samples to run:

       · bcbio_sample.yaml Details about a set of samples to process, including input  files  and
         analysis  options.  You configure these for each set of samples to process. This will be
         the main file  prepared  for  each  sample  run  and  the  documentation  below  details
         techniques to help prepare them.

       · bcbio_system.yaml  High  level  information  about  the  system,  including locations of
         installed programs like GATK and cores and memory usage (see tuning-cores). These  apply
         across multiple runs. The automated installer creates a ready-to-go system
         configuration file that can be manually edited to match the system. Find the file
         in the galaxy sub-directory within your installation data location (i.e.
         /usr/local/share/bcbio-nextgen/galaxy). To modify system parameters for a specific
         run, supply sample- or run-specific resources in your bcbio_sample.yaml file, as in
         the sketch below.
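
       As an illustrative sketch (the program name and memory values here are assumptions,
       not defaults), a run-specific resource override in bcbio_sample.yaml looks like:

          resources:
            gatk:
              jvm_opts: ["-Xms500m", "-Xmx4g"]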

       Commented  system  and  sample  example  files  are available in the config directory. The
       example-pipelines section contains additional examples of ready to run sample files.

AUTOMATED SAMPLE CONFIGURATION

       bcbio-nextgen provides a utility to create configuration files for multiple sample  inputs
       using  a base template. Start with one of the best-practice templates, or define your own,
       then apply to multiple samples using the template workflow command:

          bcbio_nextgen.py -w template freebayes-variant project1.csv sample1.bam sample2_1.fq sample2_2.fq

       · freebayes-variant is the name of the standard freebayes-variant.yaml  input,  which  the
         script  fetches  from  GitHub.  This argument can also be a path to a locally customized
         YAML configuration. In both cases, the script  replicates  the  single  sample  template
         configuration to all input samples.

       · project1.csv  is  a  comma separated value file containing sample metadata, descriptions
         and algorithm tweaks:

            samplename,description,batch,phenotype,sex,variant_regions
            sample1,ERR256785,batch1,normal,female,/path/to/regions.bed
            sample2,ERR256786,batch1,tumor,,/path/to/regions.bed

         The first column links the metadata to a specific input file. The template command tries
         to  identify  the samplename from read group information in a BAM file, or uses the base
         filename if no read group information is present. For  BAM  files,  this  would  be  the
         filename  without  the extension and path (/path/to/yourfile.bam => yourfile). For FASTQ
         files, the template functionality will identify pairs using standard conventions (_1 and
         _2,  including  Illumina  extensions  like  _R1), so use the base filename without these
          (/path/to/yourfile_R1.fastq => yourfile). Note that paired-end samples
          sequentially numbered without leading zeros (e.g., sample_1_1.fastq,
          sample_1_2.fastq, sample_2_1.fastq, sample_2_2.fastq) will likely not be parsed
          correctly; see #1919 for more info. In addition, . characters can be problematic,
          so it is better to avoid them except as the separator for the file extension.

          The remaining columns can contain:

            · description Changes the sample description, originally supplied by the file name or
              BAM  read  group,  to  this value. You can also set the lane, although this is less
              often done as the default sequential numbering works here.  See  the  documentation
              for Sample information on how these map to BAM read groups.

             · Algorithm parameters specific for this sample. If the column name matches an
               available algorithm parameter, then this value substitutes into the sample
               algorithm, replacing the defaults from the template. You can also change
               other information in the BAM read group through the algorithm parameters.
               See the Alignment configuration documentation for details on how these map
               to read group information.

            · Sample information metadata key/value pairs. Any columns not falling into the above
              cases will go into the metadata section. A ped specification will  allow  bcbio  to
              read  family, gender and phenotype information from a PED input file and use it for
              batch, sex and phenotype, respectively. The PED inputs supplement information  from
              the  standard  template file, so if you specify a value in the template CSV the PED
              information will no overwrite  it.  Alternatively,  ped  fields  can  be  specified
              directly  in  the metadata as columns. If family_id is specified it will be used as
              the family_id for that sample, otherwise batch will be used. The description column
              is  used  as  the individual_id column and the phenotype column will be used for as
              the affected column in the PED format:

                  samplename,description,phenotype,batch,sex,ethnicity,maternal_id,paternal_id,family_id
                  NA12878.bam,NA12878,-9,CEPH,female,-9,NA12892,NA12891,NA12878FAM

          Individual column items can contain booleans (true or false), integers, or lists
          (separated by semi-colons). These get converted into the expected type in the
          output YAML file. For instance, to specify a sample that should go into multiple
          batches:

            samplename,description,phenotype,batch
            normal.bam,two_normal,normal,Batch1;Batch2

         For dictionary inputs like somatic-w-germline-variants setups, you can separate items in
         a dictionary with colons and double colons, and also use semicolons for lists:

            samplename,description,phenotype,variantcaller
            tumor.bam,sample1,tumor,germline:freebayes;gatk-haplotype::somatic:vardict;freebayes

         The name of the metadata file, minus the .csv extension, is a short name identifying the
         current  project.  The  script  creates  a  project1  directory  containing  the  sample
         configuration in project1/config/project1.yaml.

       · The  remaining  arguments  are  input  BAM  or FASTQ files. The script pairs FASTQ files
         (identified by _1 and _2) and extracts sample names  from  input  BAMs,  populating  the
         files  and  description  field in the final configuration file. Specify the full path to
         sample files on your current machine.

       To make it easier to define your own project specific template, an optional first step  is
       to download and edit a local template. First retrieve a standard template:

          bcbio_nextgen.py -w template freebayes-variant project1

       This  pulls  the  current  GATK  best  practice variant calling template into your project
       directory in project1/config/project1-template.yaml. Manually edit  this  file  to  define
       your  options,  then  run  the  full  template creation for your samples, pointing to this
       custom configuration file:

          bcbio_nextgen.py -w template project1/config/project1-template.yaml project1.csv folder/*

       If your sample folder contains additional BAM or FASTQ files you do not wish to include in
       the  sample YAML configuration, you can restrict the output to only include samples in the
       metadata CSV with --only-metadata. The  output  will  print  warnings  about  samples  not
       present in the metadata file, then leave these out of the final output YAML:

          bcbio_nextgen.py -w template --only-metadata project1/config/project1-template.yaml project1.csv folder/*

MULTIPLE FILES PER SAMPLE

       In   case   you   have   multiple  FASTQ  or  BAM  files  for  each  sample  you  can  use
       bcbio_prepare_samples.py.  The main parameters are:

       · --out: the folder where the merged files will be written

       · --csv: the CSV file, exactly as described previously, but with one line per file
         to be merged, so each sample appears on as many lines as it has files:

            samplename,description,batch,phenotype,sex,variant_regions
            file1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
            file2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
            file1.fastq,sample2,batch1,tumor,,/path/to/regions.bed

       An example of usage is:

          bcbio_prepare_samples.py --out merged --csv project1.csv

       The script will create sample1.fastq and sample2.fastq in the merged folder, along
       with a new CSV file in the same folder as the input CSV: project1-merged.csv. This
       can then be used with bcbio:

          bcbio_nextgen.py -w template project1/config/project1-template.yaml project1-merged.csv merged/*fastq

       The new CSV file will look like:

          samplename,description,batch,phenotype,sex,variant_regions
          sample1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
          sample2.fastq,sample2,batch1,tumor,,/path/to/regions.bed

       It supports parallelization the same way bcbio_nextgen.py does:

          python $BCBIO_PATH/scripts/utils/bcbio_prepare_samples.py --out merged --csv project1.csv -t ipython -q queue_name -s lsf -n 1

       See more examples at parallelize pipeline.

       For paired reads, the CSV file should list all of the files:

          samplename,description,batch,phenotype,sex,variant_regions
          file1_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
          file2_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
          file1_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
          file2_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed

       The script will try to guess the paired files the same way that bcbio_nextgen.py -w
       template does. It detects paired files when the only difference between two filenames
       is _R1/_R2, -1/-2, _1/_2 or .1/.2.

       The output CSV, which is compatible with bcbio, will look like:

          samplename,description,batch,phenotype,sex,variant_regions
          sample1,sample1,batch1,normal,female,/path/to/regions.bed

SAMPLE INFORMATION

       The sample configuration file defines details of each sample to process:

          details:
            - analysis: variant2
              lane: 1
              description: Example1
              files: [in_pair_1.fq, in_pair_2.fq]
              genome_build: hg19
              algorithm:
                platform: illumina
              metadata:
                batch: Batch1
                sex: female
                platform_unit: flowcell-barcode.lane
                library: library_type

       · analysis Analysis method to use [variant2, RNA-seq, smallRNA-seq]

       · lane A unique number within the project. Corresponds to the ID parameter in the BAM read
         group.

       · description Unique name for this sample, corresponding to the SM parameter  in  the  BAM
         read group. Required.

       · files  A  list  of  files to process. This currently supports either a single end or two
         paired-end FASTQ files, or a single BAM file. It does not yet handle merging  BAM  files
         or more complicated inputs.

       · genome_build  Genome  build  to align to, which references a genome keyword in Galaxy to
         find location build files.

       · algorithm Parameters to configure algorithm inputs. Options  described  in  more  detail
         below:

         · platform Sequencing platform used. Corresponds to the PL parameter in BAM read groups.
           Optional, defaults to illumina.

       · metadata Additional descriptive metadata about the sample:

            · batch defines a group that the sample falls in.  We  perform  multi-sample  variant
              calling  on all samples with the same batch name. This can also be a list, allowing
              specification of a single normal sample to pair  with  multiple  tumor  samples  in
              paired cancer variant calling (batch: [MatchWithTumor1, MatchWithTumor2]).

            · sex specifies the sample gender used to correctly prepare X/Y chromosomes. Use male
              and female or PED style inputs (1=male, 2=female).

            · phenotype stratifies cancer samples into tumor and  normal  or  case/controls  into
              affected  and  unaffected.  Also  accepts  PED  style specifications (1=unaffected,
              2=affected). CNVkit uses case/control status to determine  how  to  set  background
              samples for CNV calling.

            · prep_method  A  free  text  description  of the method used in sample prep. Used to
              group together samples during CNV calling for background.  This is not required and
              when not present bcbio assumes all samples in an analysis use the same method.

            · svclass  defines  a  classification for a sample for use in SV case/control setups.
              When set as  control  will  put  samples  into  the  background  samples  used  for
              normalization.

            · ped   provides  a  PED  phenotype  file  containing  sample  phenotype  and  family
              information. Template creation uses this to supplement  batch,  sex  and  phenotype
              information  provided  in  the  template CSV. GEMINI database creation uses the PED
              file as input.

            · platform_unit -- Unique identifier for sample. Optional, defaults to  lane  if  not
              specified.

            · library -- Name of library preparation used. Optional, empty if not present.

            · validate_batch  --  Specify  a  batch  name to group samples together for preparing
              validation plots. This is useful  if  you  want  to  process  samples  in  specific
              batches, but include multiple batches into the same validation plot.

            · validate_combine  --  Specify  a  batch  name  to  combine multiple samples into an
              additional validation summary. Useful  for  larger  numbers  of  small  samples  to
              evaluate together.

UPLOAD

       The  upload  section  of  the  sample  configuration file describes where to put the final
       output files of the pipeline. At its simplest, you can configure bcbio-nextgen  to  upload
       results  to  a  local  directory,  for  example a folder shared amongst collaborators or a
       Dropbox account. You can also configure it to upload results  automatically  to  a  Galaxy
       instance,  to  Amazon  S3  or to iRODS. Here is the simplest configuration, uploading to a
       local directory:

          upload:
            dir: /local/filesystem/directory

       General parameters, always required:

       · method Upload method to employ. Defaults to local filesystem.  [filesystem, galaxy,  s3,
         irods]

       · dir Local filesystem directory to copy to.

       Galaxy parameters:

       · galaxy_url  URL  of  the  Galaxy  instance  to upload to. Upload assumes you are able to
         access a shared directory also present on the Galaxy machine.

       · galaxy_api_key User API key to access Galaxy: see the Galaxy API documentation.

       · galaxy_library Name of the Galaxy Data Library  to  upload  to.  You  can  specify  this
         globally  for  a  project  in  upload  or  for  individual samples in the sample details
         section.

       · galaxy_role Specific Galaxy access roles to assign to the  uploaded  datasets.  This  is
         optional  and will default to the access of the parent data library if not supplied. You
         can specify this globally for a project in upload  or  for  individual  samples  in  the
         sample details section. The Galaxy Admin documentation has more details about roles.

       Here is an example configuration for uploading to a Galaxy instance. This assumes you have
       a shared mounted filesystem that your Galaxy instance can also access:

          upload:
            method: galaxy
            dir: /path/to/shared/galaxy/filesystem/folder
            galaxy_url: http://url-to-galaxy-instance
            galaxy_api_key: YOURAPIKEY
            galaxy_library: data_library_to_upload_to

       Your Galaxy universe_wsgi.ini configuration needs to have allow_library_path_paste =  True
       set to enable uploads.

       S3 parameters:

       · bucket AWS bucket to direct output.

       · folder A folder path within the AWS bucket to prefix the output.

       · region AWS region name to use. Defaults to us-east-1

       · reduced_redundancy Flag to determine if we should store S3 data with reduced redundancy:
         cheaper but less reliable [false, true]

       For S3 access credentials, set the standard  environmental  variables,  AWS_ACCESS_KEY_ID,
       AWS_SECRET_ACCESS_KEY,  and  AWS_DEFAULT_REGION  or  use IAM access roles with an instance
       profile on EC2 to give your instances permission to create temporary S3 access.
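
       A minimal sketch of an S3 upload configuration (the bucket and folder names are
       placeholders):

          upload:
            method: s3
            bucket: your-project-bucket
            folder: results/project1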

       iRODS parameters:

       · folder Full directory name within iRODS to prefix the output.

       · resource (optional) iRODS resource name, if other than default.

       Example configuration:

          upload:
            method: irods
            dir: ../final
            folder: /irodsZone/your/path/
            resource: yourResourceName

       Uploads  to iRODS depend on a valid installation of the iCommands CLI, and a preconfigured
       connection through the iinit command.

GLOBALS

       You can define files used multiple times in the algorithm section of your configuration in
       a  top  level  globals dictionary. This saves copying and pasting across the configuration
       and makes it easier to manually adjust the configuration if inputs change:

          globals:
            my_custom_locations: /path/to/file.bed
          details:
            - description: sample1
              algorithm:
                variant_regions: my_custom_locations
            - description: sample2
              algorithm:
                variant_regions: my_custom_locations

ALGORITHM PARAMETERS

       The YAML configuration file provides a number of hooks to customize analysis in the sample
       configuration file. Place these under the algorithm keyword.

   Alignment
       · platform  Sequencing  platform used. Corresponds to the PL parameter in BAM read groups.
         Default 'Illumina'.

       · aligner Aligner to use: [bwa, bowtie, bowtie2, hisat2, minimap2, novoalign, snap,
         star, tophat2, false]. To use pre-aligned BAM files as inputs to the pipeline, set
         to false, which will also skip duplicate marking by default. Using pre-aligned
         inputs requires proper assignment of BAM read groups and sorting. The bam_clean
         argument can often resolve issues with problematic input BAMs (see the sketch at
         the end of this section).

       · bam_clean Clean an input BAM when skipping the alignment step. This handles adding
         read groups, sorting to a reference genome and filtering records that cause
         problems with GATK. Options:

            · remove_extracontigs -- Remove non-standard chromosomes (for human, anything that is
              not  chr1-22,X,Y)  from  the  BAM  file.  This  allows  compatibility  when the BAM
              reference genome has different contigs  from  the  reference  file  but  consistent
              ordering  for  standard chromosomes.  Also fixes the read groups in the BAM file as
              in fixrg. This is faster than the full picard cleaning option.

             · fixrg -- only adjust read groups, assuming everything else in the BAM file
               is compatible.

            · picard  --  Picard/GATK  based  cleaning.  Includes  read  group changes, fixing of
              problematic reads and re-ordering chromosome order to match the  reference  genome.
              To  fix  misencoded  input  BAMs  with  non-standard  scores, set quality_format to
              illumina.

       · bam_sort Allow sorting of input BAMs when skipping the alignment step (aligner set
         to false). Options are coordinate or queryname. Additional processing through
         standard pipelines requires coordinate sorted inputs. The default is to not do
         additional sorting and assume pre-sorted BAMs.

       · disambiguate For mixed or explant samples, provide a list of genome_build identifiers to
         check and remove from alignment. Currently supports  cleaning  a  single  organism.  For
         example,  with  genome_build:  hg19  and disambiguate: [mm10], it will align to hg19 and
         mm10, run disambiguation and discard reads confidently aligned to  mm10  and  not  hg19.
         Affects fusion detection when star is chosen as the aligner. The aligner must be
         set to a non-false value for this to run.

       · align_split_size Increase parallelization of alignment. As of 0.9.8, bcbio will try to
         determine  a  useful  parameter and you don't need to set this.  If you manually set it,
         bcbio will respect your specification. Set to false to avoid splitting entirely. If set,
         this  defines  the  number  of  records to feed into each independent parallel step (for
         example, 5000000 = 5 million reads per chunk). It  converts  the  original  inputs  into
         bgzip  grabix  indexed  FASTQ  files,  and then retrieves chunks for parallel alignment.
         Following alignment, it combines all chunks back into the final merged  alignment  file.
         This  allows  parallelization  at  the  cost  of additional work of preparing inputs and
         combining split outputs. The tradeoff makes sense when you have large files and lots  of
         distributed compute. When you have fewer large multicore machines this parameter may not
         help speed up processing.

       · quality_format Quality format of FASTQ or BAM inputs [standard, illumina]

       · strandedness For RNA-seq  libraries,  if  your  library  is  strand  specific,  set  the
         appropriate  flag from [unstranded, firststrand, secondstrand].  Defaults to unstranded.
         For dUTP marked libraries, firststrand is correct;  for  Scriptseq  prepared  libraries,
         secondstrand is correct.

       · save_diskspace  Remove  align prepped bgzip and split BAM files after merging into final
         BAMs.  Helps  reduce  space  on   limited   filesystems   during   a   run.   tools_off:
         [upload_alignment] may also be useful in conjunction with this. [false, true]
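
       Combining the options above, a hedged sketch for running from pre-aligned BAM inputs
       (the option values are illustrative, not recommendations):

          algorithm:
            aligner: false
            bam_clean: fixrg
            bam_sort: coordinate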

   Read trimming
       · trim_reads Trims low quality or adapter sequences at the ends of reads using
         atropos. adapters and custom_trim specify the sequences to trim. For RNA-seq, it's
         recommended to leave this as False unless running Tophat2. For variant calling, we
         recommend trimming only in special cases where standard soft-clipping does not
         resolve false positive problems. Supports trimming with atropos
         (https://github.com/jdidion/atropos) or fastp. fastp is currently not compatible
         with alignment splitting in variant calling and requires align_split_size: false.
         The old parameter read_through defaults to using atropos trimming. [False, atropos,
         fastp]. Defaults to False; see the sketch at the end of this section.

       · adapters If trimming adapter read through, trim a set of stock adapter sequences. Allows
         specification of multiple items in a list, for example [truseq, polya]  will  trim  both
         TruSeq  adapter  sequences  and  polyA  tails.  polyg  trimming  removes  high quality G
         stretches present in NovaSeq and NextSeq data. In the small RNA pipeline, bcbio will try
         to  detect  the  adapter  using DNApi. If you set up this parameter, then bcbio will use
         this value instead.  Choices: [truseq, illumina, nextera, polya, polyx, polyg, nextera2,
         truseq2].

         · nextera2: Illumina NEXTera DNA prep kit from NEB

         · truseq2: SMARTer Universal Low Input RNA Kit

       · custom_trim  A  list of sequences to trim from the end of reads, for example: [AAAATTTT,
         GGGGCCCC]

       · min_read_length Minimum read length to maintain when read_through trimming is set
         in trim_reads. Defaults to 25.

       · trim_ends Specify values for trimming at ends of reads, using a fast approach
         built into fastq preparation. This does not do quality or adapter trimming but will
         quickly cut off a defined set of values from either the 5' or 3' end of the first
         and second reads. Expects a list of 4 values: [5' trim read1, 3' trim read1, 5'
         trim read2, 3' trim read2]. Set values to 0 if you don't need trimming (e.g. [6, 0,
         12, 0] will trim 6bp from the start of read 1 and 12bp from the start of read 2).
         Only implemented for variant calling pipelines.
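
       For instance, a hedged trimming setup for a library with adapter read-through and
       polyA contamination (values illustrative):

          algorithm:
            trim_reads: atropos
            adapters: [truseq, polya]
            min_read_length: 30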

   Alignment postprocessing
       · mark_duplicates Mark duplicated reads [true, false]. If true, will perform
         streaming duplicate marking with biobambam's bammarkduplicates or bamsormadup. Uses
         samblaster as an alternative if you have paired reads and specify lumpy as an
         svcaller. Defaults to true for variant calling and false for RNA-seq and small RNA
         analyses. Also defaults to false if you're not doing alignment (aligner: false).
         See the sketch at the end of this section.

       · recalibrate  Perform base quality score recalibration on the aligned BAM file, adjusting
         quality scores to reflect alignments and known variants. Supports both GATK and Sentieon
         recalibration.  Defaults to false, no recalibration. [false, gatk, sentieon]

       · realign Perform GATK's realignment around indels on the aligned BAM file. Defaults to no
         realignment since realigning callers like FreeBayes and GATK HaplotypeCaller handle this
         as part of the calling process. [false, gatk]
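
       As a sketch, enabling GATK base recalibration while keeping the default duplicate
       marking and skipping realignment:

          algorithm:
            mark_duplicates: true
            recalibrate: gatk
            realign: false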

   Coverage information
       · coverage_interval  Regions  covered  by  sequencing. bcbio calculates this automatically
         from alignment coverage information, so you  only  need  to  specify  it  in  the  input
         configuration if you have specific needs or bcbio does not determine coverage correctly.
         genome specifies full genome sequencing, regional identifies  partial-genome  pull  down
         sequencing  like  exome  analyses,  and  amplicon  is partial-genome sequencing from PCR
         amplicon sequencing. This influences GATK options for filtering: we use Variant  Quality
         Score  Recalibration  when  set to genome, otherwise we apply cutoff-based soft filters.
         Also affects copy number calling with CNVkit, structural variant calling and deep  panel
         calling  in  cancer  samples,  where  we  tune  regional/amplicon  analyses  to maximize
         sensitivity.  [genome, regional, amplicon]

       · maxcov_downsample bcbio downsamples whole genome runs with >10x average coverage
         to a maximum coverage, avoiding slow runtimes in collapsed repeats and
         poly-A/T/G/C regions. This parameter specifies the multiplier of average coverage
         to downsample at. For example, 200 downsamples at 6000x coverage for a 30x whole
         genome. Set to false or 0 to disable downsampling. Currently defaults to false
         pending runtime improvements. See the sketch at the end of this section.

       · coverage_depth_min Minimum depth of coverage. When calculating regions to call in, bcbio
         may  exclude regions with less than this many reads. It is not a hard filter for variant
         calling, but rather a guideline for determining callable regions. It's primarily  useful
         when  trying to call on very low depth samples. Defaults to 4. Setting lower than 4 will
         trigger low-depth calling options for GATK.
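
       A hedged sketch for a whole genome run with downsampling enabled (values
       illustrative):

          algorithm:
            coverage_interval: genome
            maxcov_downsample: 200
            coverage_depth_min: 4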

   Analysis regions
       These  BED  files  define  the  regions  of  the  genome  to  analyze   and   report   on.
       variant_regions  adjusts  regions  for  small variant (SNP and indel) calling.  sv_regions
       defines regions for structural variant calling  if  different  than  variant_regions.  For
       coverage-based  quality  control  metrics,  we  first  use  coverage  if  specified,  then
       sv_regions if specified, then variant_regions. See the section on Input  file  preparation
       for  tips  on ensuring chromosome naming in these files match your reference genome. bcbio
       pre-installs some standard BED files for human analyses. Reference these using the  naming
       schemes described in the reference data repository.

       · variant_regions BED file of regions to call variants in.

       · sv_regions -- A specification of regions to target during structural variant calling. By
         default, bcbio  uses  regions  specified  in  variant_regions  but  this  allows  custom
         specification  for  structural  variant  calling. This can be a pointer to a BED file or
         special inputs: exons for only exon regions, transcripts for transcript regions (the min
         start  and  max  end  of exons) or transcriptsXXXX for transcripts plus a window of XXXX
         size  around  it.  The  size  can  be  an  integer  (transcripts1000)   or   exponential
         (transcripts1e5). This applies to CNVkit and heterogeneity analysis.

       · coverage  A  BED file of regions to check for coverage and completeness in QC reporting.
         This can also be a shorthand for a BED file installed by bcbio (see  Structural  variant
         calling for options).

       · exclude_regions List of regions to remove as part of analysis. This allows
         avoidance of slow and potentially misleading regions. This is a list of the
         following options, combined with the other region settings in the sketch after
         this list:

            · polyx Avoid calling variants in regions of single nucleotide stretches greater than
              50.  These  can  contribute  to  long variant calling runtimes when errors in polyX
              stretches align in high depth to these regions and take a lot of work  to  resolve.
              Since  we don't expect decent resolution through these types of repeats, this helps
              avoid extra calculations for  assessing  the  noise.  This  is  an  alternative  to
              trimming polyX from the 3' ends for reads with trim_reads and adapters. Requires an
              organism with a defined polyx file in  genome  resources.  For  structural  variant
              calling,  adding  polyx  avoids  calling  small  indels  for Manta, where these can
              contribute to long runtimes.

            · lcr Avoid calling variants in low complexity regions  (LCRs).   Heng  Li's  variant
              artifacts  paper  provides  these  regions,  which  cover  ~2%  of  the  genome but
              contribute to a large fraction of  problematic  calls  due  to  the  difficulty  of
              resolving  variants  in repetitive regions. Removal can help facilitate comparisons
              between methods and reduce false positives if you don't need calls in LCRs for your
              biological  analysis.  Requires  an  organism  with  a  defined  lcr file in genome
              resources.

            · highdepth Remove high depth regions during variant calling, identified by collapsed
              repeats  around  centromeres  in  hg19  and  GRCh37  as characterized in the ENCODE
              blacklist.  This is on by default for VarDict and FreeBayes whole genome calling to
              help  with  slow runtimes in these regions, and also on for whole genome structural
              variant calling to avoid false positives from high depth repeats.

            · altcontigs Skip calling entirely in alternative and unplaced contigs.  This  limits
              analysis  to standard chromosomes -- chr1-22,X,Y,MT for human -- to avoid slowdowns
              on the additional contigs.
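
       Putting these together, a sketch of region configuration for an exome analysis (the
       BED path is a placeholder):

          algorithm:
            variant_regions: /path/to/capture_targets.bed
            sv_regions: transcripts1000
            exclude_regions: [lcr, polyx]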

   Variant calling
       · variantcaller Variant calling algorithm. Can be a list of multiple options or  false  to
         skip  [false, freebayes, gatk-haplotype, haplotyper, platypus, mutect, mutect2, scalpel,
         tnhaplotyper, tnscope, vardict, varscan, samtools, gatk]

            · Paired (typically somatic, tumor-normal) variant calling is currently supported  by
              vardict,  freebayes, mutect2, mutect (see disclaimer below), scalpel (indels only),
              tnhaplotyper  (Sentieon),  tnscope  (Sentieon)  and  varscan.  See   the   pipeline
              documentation on cancer-calling for details on pairing tumor and normal samples.

            · You  can  generate  both somatic and germline calls for paired tumor-normal samples
              using  different  sets  of  callers.  The   pipeline   documentation   on   calling
              somatic-w-germline-variants details how to do this.

            · mutect,  a SNP-only caller, can be combined with indels from scalpel or sid. Mutect
              operates in both tumor-normal and tumor-only modes.  In tumor-only mode the  indels
              from scalpel will reflect all indels in the sample, as there is currently no way of
              separating the germline from somatic indels in tumor-only mode.

       · indelcaller For the MuTect SNP-only variant caller it is possible to add calls
         from an indelcaller such as scalpel, pindel and somatic indel detector (for
         Appistry MuTect users only). Currently an experimental option that adds these
         indel calls to MuTect's SNP-only output. Only one caller supported. Omit to
         ignore. [scalpel, pindel, sid, false]

       · jointcaller  Joint  calling  algorithm,  combining  variants  called  with the specified
         variantcaller. Can be a list of multiple options but needs  to  match  with  appropriate
         variantcaller.  Joint  calling  is  only  needed  for  larger  input  sample sizes (>100
         samples), otherwise use standard pooled population-calling:

             · gatk-haplotype-joint GATK incremental joint discovery with HaplotypeCaller.
               Takes individual gVCFs called by gatk-haplotype and performs combined
               genotyping.

            · freebayes-joint Combine freebayes calls using bcbio.variation.recall with recalling
              at all positions found  in  each  individual  sample.  Requires  freebayes  variant
              calling.

            · platypus-joint  Combine  platypus  calls using bcbio.variation.recall with squaring
              off at all positions found in each individual  sample.  Requires  platypus  variant
              calling.

            · samtools-joint  Combine  samtools  calls using bcbio.variation.recall with squaring
              off at all positions found in each individual  sample.  Requires  samtools  variant
              calling.

       · joint_group_size  Specify the maximum number of gVCF samples to feed into joint calling.
         Currently applies to GATK  HaplotypeCaller  joint  calling  and  defaults  to  the  GATK
         recommendation  of  200.  Larger  numbers  of  samples  will first get combined prior to
         genotyping.

       · ploidy Ploidy of called reads. Defaults to 2 (diploid). You  can  also  tweak  specialty
         ploidy like mitochondrial calling by setting ploidy as a dictionary. The defaults are:

            ploidy:
              default: 2
              mitochondrial: 1
              female: 2
              male: 1

       · background Provide pre-calculated files to use as backgrounds for different
         processes. Organized as a dictionary with individual keys for different components
         of the pipeline. You can enter as many or few as needed; see the sketch after this
         list:

            · variant  A  VCF  file with variants to use as a background reference during variant
              calling. For tumor/normal paired calling use this  to  supply  a  panel  of  normal
              individuals.

            · cnv_reference Background reference file for copy number calling.
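
       A hedged example for tumor/normal calling with a panel of normals (the caller
       choices and path are illustrative):

          algorithm:
            variantcaller: [vardict, mutect2]
            background:
              variant: /path/to/panel_of_normals.vcf.gz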

   Somatic variant calling
       · min_allele_fraction  Minimum  allele  fraction to detect variants in heterogeneous tumor
         samples, set as the float or integer percentage to resolve (i.e. 10 = alleles in 10%  of
         the sample). Defaults to 10. Specify this in the tumor sample of a tumor/normal pair.

   Variant annotation
       · effects Method used to calculate expected variant effects; defaults to snpEff.
         Ensembl variant effect predictor (VEP) is also available with support for dbNSFP
         and dbscSNV annotation, when downloaded using datatarget-install. [snpeff, vep,
         false]

       · effects_transcripts Define the transcripts to  use  for  effect  prediction  annotation.
         Options  all:  Standard  Ensembl transcript list (the default); canonical: Report single
         canonical transcripts (-canon in  snpEff,  -pick  in  VEP);  canonical_cancer  Canonical
         transcripts with hand curated changes for more common cancer transcripts (effects snpEff
         only).

       · vcfanno Configuration files for vcfanno, allowing the application of additional
         annotations to variant calls. By default, bcbio will try to apply:

            · gemini  --  External population level annotations from GEMINI. This is only run for
              human samples with gemini data installed (datatarget-install).

             · somatic -- Somatic annotations from COSMIC, ClinVar and friends. COSMIC
               needs a custom installation within bcbio (datatarget-install). Only added
               for tumor or tumor/normal somatic calling.

            · rnaedit -- RNA editing sites for RNA-seq variant calling runs.

         bcbio installs pre-prepared configuration files in genomes/build/config/vcfanno,
         or you can specify the full path to a /path/your/anns.conf and optionally an
         equivalently named /path/your/anns.lua file. This value can be a list for multiple
         inputs, as in the sketch below.
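
       For example (the custom configuration path is a placeholder), combining a
       pre-prepared annotation with a local one:

          algorithm:
            vcfanno: [somatic, /path/your/anns.conf]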

   Structural variant calling
       · svcaller -- List of structural variant callers to use. [lumpy, manta, cnvkit,
         seq2c, delly, battenberg]. LUMPY and Manta require paired end reads. See the
         combined sketch at the end of this section.

       · svprioritize  --  Produce a tab separated summary file of structural variants in regions
         of interest. This complements  the  full  VCF  files  of  structural  variant  calls  to
         highlight  changes in known genes. See the paper on cancer genome prioritization for the
         full details. This can be either the path to a BED file (with chrom start end gene_name,
         see  Input  file  preparation)  or  the  name of one of the pre-installed prioritization
         files:

            · cancer/civic (hg19, GRCh37, hg38) -- Known cancer associated genes from CIViC.

            · cancer/az300 (hg19, GRCh37, hg38) -- 300 cancer  associated  genes  contributed  by
              AstraZeneca oncology.

            · cancer/az-cancer-panel  (hg19,  GRCh37,  hg38)  --  A  text  file  of  genes in the
              AstraZeneca cancer panel. This is only usable for svprioritize  which  can  take  a
              list of gene names instead of a BED file.

             · actionable/ACMG56 -- Medically actionable genes from the American College of
               Medical Genetics and Genomics.

            · coding/ccds (hg38) -- Consensus CDS (CCDS) regions  with  2bps  added  to  internal
              introns  to capture canonical splice acceptor/donor sites, and multiple transcripts
              from a single gene merged into a single all inclusive gene entry.

       · fusion_mode Enable fusion detection in RNA-seq when using STAR (recommended)  or  Tophat
         (not  recommended)  as  the  aligner.  OncoFuse  is  used  to  summarise the fusions but
         currently only supports hg19  and  GRCh37.  For  explant  samples  disambiguate  enables
         disambiguation  of  STAR  output  [false,  true].  This option is deprecated in favor of
         fusion_caller.

       · fusion_caller Specify a standalone fusion caller for fusion mode. Supports
         oncofuse for STAR/tophat runs, pizzly and ericscript for all runs. If a standalone
         caller is specified (i.e. pizzly or ericscript), fusion detection will not be
         performed with the aligner. oncofuse only supports human genome builds GRCh37 and
         hg19. ericscript supports human genome builds GRCh37, hg19 and hg38 after
         installing the associated fusion databases (datatarget-install).
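
       A sketch combining multiple structural variant callers with gene prioritization
       (the caller choices are illustrative):

          algorithm:
            svcaller: [lumpy, manta, cnvkit]
            svprioritize: cancer/civic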

   HLA typing
       · hlacaller -- Perform identification of highly polymorphic HLAs with human build 38
         (hg38). The recommended option is optitype, using the OptiType caller. Also
         supports using the bwa HLA typing implementation with bwakit.

   Validation
       bcbio  pre-installs  standard truth sets for performing validation, and also allows use of
       custom local files for assessing reliability of your runs:

       · validate A VCF file of expected variant calls to perform validation and grading of small
         variants  (SNPs  and  indels)  from  the  pipeline.  This provides a mechanism to ensure
         consistency of calls  against  a  known  set  of  variants,  supporting  comparisons  to
         genotyping array data or reference materials.

       · validate_regions  A BED file of regions to evaluate small variant calls in. This defines
         specific regions covered by the validate VCF  file.

       · svvalidate -- Dictionary of call types and pointer to BED file  of  known  regions.  For
         example:  DEL: known_deletions.bed does deletion based validation of outputs against the
         BED file.

       Each option can be either the path to a local file, or a partial path to  a  file  in  the
       pre-installed truth sets. For instance, to validate an NA12878 run against the Genome in a
       Bottle truth set:

          validate: giab-NA12878/truth_small_variants.vcf.gz
          validate_regions: giab-NA12878/truth_regions.bed
          svvalidate:
            DEL: giab-NA12878/truth_DEL.bed

       All materials follow the same naming schemes for small variants, regions and
       different structural variant types. bcbio has the following validation materials for
       germline validations:

       · giab-NA12878  --   Genome  in  a  Bottle  for  NA12878, a Caucasian sample.  Truth sets:
         small_variants, regions, DEL; Builds: GRCh37, hg19, hg38

       · giab-NA24385 --  Genome in a Bottle for NA24385, an  Ashkenazic  Jewish  sample.   Truth
         sets: small_variants, regions; Builds: GRCh37, hg19, hg38

       · giab-NA24631  --   Genome  in  a  Bottle  for  NA24631,  a  Chinese sample.  Truth sets:
         small_variants, regions; Builds: GRCh37, hg19, hg38

       · giab-NA12878-crossmap --  Genome  in  a  Bottle  for  NA12878  converted  to  hg38  with
         CrossMap. Truth sets: small_variants, regions, DEL; Builds: hg38

       · giab-NA12878-remap  --   Genome  in  a  Bottle for NA12878 converted to hg38 with Remap.
         Truth sets: small_variants, regions, DEL; Builds: hg38

       · platinum-genome-NA12878  --  Illumina  Platinum  Genome   for   NA12878.   Truth   sets:
         small_variants, regions; Builds: hg19, hg38

       and for cancer validations:

       · giab-NA12878-NA24385-somatic -- A sequenced NA12878/NA24385 mixture providing a
         somatic-like truth set for detecting low frequency events. Truth sets:
         small_variants, regions; Builds: GRCh37, hg38

       · dream-syn3  --  Synthetic dataset 3 from the ICGC-TCGA DREAM mutation calling challenge.
         Truth sets: small_variants, regions, DEL, DUP, INV, INS. Builds: GRCh37.

       · dream-syn4 -- Synthetic dataset 4 from the ICGC-TCGA DREAM mutation  calling  challenge.
         Truth sets: small_variants, regions, DEL, DUP, INV. Builds: GRCh37.

       · dream-syn3-crossmap  --  Synthetic  dataset  3 from the ICGC-TCGA DREAM mutation calling
         challenge  converted  to  human  build  38  coordinates  with  CrossMap.   Truth   sets:
         small_variants, regions, DEL, DUP, INV, INS. Builds: hg38.

       · dream-syn4-crossmap  --  Synthetic  dataset  4 from the ICGC-TCGA DREAM mutation calling
         challenge  converted  to  human  build  38  coordinates  with  CrossMap.   Truth   sets:
         small_variants, regions, DEL, DUP, INV. Builds: hg38.

       For more information on the hg38 truth set preparation see the work on validation on build
       38 and conversion of human build 37 truth sets  to  build  38.  The  installation  recipes
       contain provenance details about the origins of the installed files.

   UMIs
       Unique molecular identifiers (UMIs) are short random barcodes used to tag individual
       molecules and avoid amplification biases. Both single cell RNA-seq and variant
       calling support UMIs. For variant calling, fgbio collapses sequencing duplicates for
       each UMI into a single consensus read prior to running re-alignment and variant
       calling. This requires mark_duplicates: true (the default) since it uses position
       based duplicates and UMI tags for collapsing duplicate reads into consensus
       sequences.
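
       For variant calling this corresponds to a sample configuration along the lines of
       (a sketch; the umi_type options are described below):

          algorithm:
            umi_type: fastq_name
            mark_duplicates: true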

       To   help   with   preparing   fastq   files   with   UMIs   bcbio   provides   a   script
       bcbio_fastq_umi_prep.py. This handles two kinds of UMI barcodes:

       · Separate UMIs: it converts reads output by an Illumina run as 3 files (read 1,
         read 2, and UMIs).

       · Duplex barcodes with tags incorporated at the 5' end of read 1 and read 2.

       In both cases, these get converted into  paired  reads  with  UMIs  in  the  fastq  names,
       allowing  specification  of  umi_type:  fastq_name  in  your bcbio YAML configuration. The
       script runs on a single set of files or autopairs an entire directory of fastq  files.  To
       convert a directory with separate UMI files:

          bcbio_fastq_umi_prep.py autopair -c <cores_to_use> <list> <of> <fastq> <files>

       To convert duplex barcodes present on the ends of read 1 and read 2:

          bcbio_fastq_umi_prep.py autopair -c <cores_to_use> --tag1 5 --tag2 5 <list> <of> <fastq> <files>

       Configuration options for UMIs:

       · umi_type  The  UMI/cellular  barcode  scheme  used  for  your  data. For single cell RNA
         sequencing, supports [harvard-indrop, harvard-indrop-v2, 10x_v2, icell8, surecell].  For
         variant  analysis with UMI based consensus calling, supports either fastq_name with UMIs
         in read names or the path to a fastq file with UMIs for each aligned read.
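
       For example, a minimal sketch of a variant calling sample using UMIs embedded in read
       names (sample name and caller choice are illustrative):

          details:
            - description: umi_sample
              analysis: variant2
              genome_build: hg38
              algorithm:
                umi_type: fastq_name
                mark_duplicates: true
                variantcaller: vardict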

       You can adjust the fgbio default options by adjusting Resources. The most common
       change is modifying the minimum number of reads as input to consensus sequences. This
       defaults to 1 to avoid losing reads, but you can set it to larger values for high
       depth panels:

          resources:
            fgbio:
              options: [--min-reads, 2]

   RNA sequencing
       · transcript_assembler If set, will assemble novel genes and  transcripts  and  merge  the
         results  into  the  known  annotation.  Can have multiple values set in a list. Supports
         ['cufflinks', 'stringtie'].

       · transcriptome_align If set to True, will also align reads to just the transcriptome, for
         use with EBSeq and others.

       · expression_caller   A  list  of  optional  expression  callers  to  turn  on.   Supports
         ['cufflinks', 'express', 'stringtie',  'sailfish',  'dexseq',  'kallisto'].  Salmon  and
         count based expression estimation are run by default.

       · fusion_caller A list of optional fusion callers to turn on. Supports [oncofuse, pizzly].

       · variantcaller  Variant  calling  algorithm  to  call  variants on RNA-seq data. Supports
         [gatk-haplotype] or [vardict].

       · spikein_fasta A FASTA file of spike in sequences to quantitate.

       · bcbiornaseq A dictionary of key-value pairs to be passed as options to bcbioRNAseq.
         Currently supports organism as a key, taking the Latin name of the genome used (mus
         musculus, homo sapiens, etc), and interesting_groups, which will be used to color
         quality control plots:

            bcbiornaseq:
              organism: homo sapiens
              interesting_groups: [treatment, genotype, etc, etc]

       You will also need to enable bcbioRNASeq via tools_on: [bcbiornaseq].
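
       A minimal sketch combining both settings in the algorithm section:

          algorithm:
            tools_on: [bcbiornaseq]
            bcbiornaseq:
              organism: homo sapiens
              interesting_groups: [treatment]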

   Fast RNA-seq
       · transcriptome_fasta  An  optional  FASTA  file  of transcriptome sequences to quantitate
         rather than using bcbio installed transcriptome sequences.

   Single-cell RNA sequencing
       · minimum_barcode_depth Cellular barcodes with less than this many reads assigned to  them
         are discarded (default 10,000).

       · cellular_barcodes  A single file or a list of one or two files which have valid cellular
         barcodes. Provide one file if there is only one barcode and two files  if  the  barcodes
         are   split.   If   no   file   is   provided,   all   cellular   barcodes  passing  the
         minimum_barcode_depth filter are kept.

       · transcriptome_fasta An optional FASTA file  of  transcriptome  sequences  to  quantitate
         rather than the bcbio installed version.

       · transcriptome_gtf  An  optional GTF file of the transcriptome to quantitate, rather than
         the  bcbio  installed  version.  This  is  recommended  for  single-cell  RNA-sequencing
         experiments.

       · singlecell_quantifier Quantifier to use for single-cell RNA-sequencing.  Supports rapmap
         or kallisto.

       · cellular_barcode_correction Number of errors to correct in identified cellular barcodes.
         Requires  a  set  of  known  barcodes  to  be  passed with the cellular_barcodes option.
         Defaults to 1. Set to 0 to turn off error correction.

       · sample_barcodes A text file with one barcode per line of expected sample barcodes.
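
       As an illustrative sketch, a single-cell sample might combine these options in its
       algorithm section (file names and values are examples, not recommendations):

          algorithm:
            umi_type: harvard-indrop-v2
            minimum_barcode_depth: 1000
            cellular_barcodes: [barcodes1.txt, barcodes2.txt]
            singlecell_quantifier: rapmap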

   smallRNA sequencing
       · adapters The 3' end adapter that needs to be removed. For the NextFlex protocol you
         can add adapters: ["4N", "$3PRIME_ADAPTER"]. For any other options you can use
         resources: atropos:options:["-u 4", "-u -4"].

       · species The 3-letter code indicating the species in the miRBase classification
         (e.g. hsa for human).

       · aligner Currently STAR is the only one tested although bowtie can be used as well.

       · expression_caller A list of expression callers to turn on: trna, seqcluster,
         mirdeep2, mirge (read the smallRNA-seq documentation to learn how to set up bcbio
         to run mirge).

       · transcriptome_gtf An optional GTF file of the transcriptome to use for seqcluster.

       · spikein_fasta A FASTA file of spike in sequences to quantitate.

       · umi_type: 'qiagen_smallRNA_umi' Enables support for the Qiagen UMI small RNA-seq
         protocol.
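
       An illustrative algorithm section pulling these options together, reusing the
       NextFlex-style adapters placeholder from above (values are examples, not
       recommendations):

          algorithm:
            adapters: ["4N", "$3PRIME_ADAPTER"]
            species: hsa
            aligner: star
            expression_caller: [trna, seqcluster, mirdeep2]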

   ChIP sequencing
       · peakcaller bcbio only accepts [macs2]

       · aligner Currently bowtie2 is the only one tested

       · The phenotype and batch tags need to be set under metadata in the config YAML file.
         The phenotype tag specifies the chip (phenotype: chip) and input samples
         (phenotype: input). The batch tag specifies the input-chip pairs of samples, for
         example batch: pair1. The same input can be used for different chip samples by
         giving a list of distinct values: batch: [sample1, sample2]. See the sketch after
         this list.

       · chip_method: currently supports standard ChIP-seq (TF or broad regions, using chip)
         or ATAC-seq (atac). Parameters will change depending on the option to get the best
         possible results. Only macs2 is supported for now.
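
       A minimal sketch of ChIP-seq sample entries pairing one input with one chip sample
       (sample and batch names are illustrative):

          - description: chip_rep1
            analysis: chip-seq
            metadata:
              phenotype: chip
              batch: pair1
          - description: input_rep1
            analysis: chip-seq
            metadata:
              phenotype: input
              batch: pair1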

       You can pass different parameters for macs2 by adding to Resources:

          resources:
            macs2:
              options: ["--broad"]

   Quality control
       · mixup_check   Detect   potential   sample   mixups.   Currently   supports   qSignature.
         qsignature_full  runs  a  larger  analysis  while  qsignature  runs  a smaller subset on
         chromosome 22.  [False, qsignature, qsignature_full]

       · kraken Turn on the kraken algorithm to detect possible contamination. You can add
         kraken: minikraken and it will use a minimal database to detect possible
         contaminants. You can also point to a custom database directory and kraken will use
         it. You will find the results in the qc directory. You need to use --datatarget
         kraken during installation to make the minikraken database available.

       · preseq Accepts lc_extrap or c_curve, and runs Preseq
         (http://smithlabresearch.org/software/preseq), a tool that predicts the yield of
         future experiments. By default, it runs 300 steps of estimation using a segment
         length of 100000. The default extrapolation limit for lc_extrap is 3x the number of
         reads. You can override the parameters seg_len, steps, extrap_fraction using the
         Resources: section:

            resources:
              preseq:
                extrap_fraction: 5
                steps: 500
                seg_len: 5000

         And  you  can also set extrap and step parameters directly, as well as provide any other
         command line option via options:

            resources:
              preseq:
                extrap: 10000000
                step: 30000
                options: ["-D"]

       · bcbio uses MultiQC to combine QC output for all samples into a single  report  file.  If
         you need to tweak configuration settings from bcbio defaults, you can use Resources. For
         instance to display read counts with full numbers instead of the default millions:

            resources:
              multiqc:
                options: ["--cl_config", "'read_count_multiplier: 1'"]

         or as thousands:

            resources:
              multiqc:
                options: ["--cl_config", "'{read_count_multiplier: 0.001, read_count_prefix: K}'"]

   Post-processing
       · archive Specify targets for long term archival. cram removes fastq names and does  8-bin
         compression  of  BAM files into CRAM format.  cram-lossless generates CRAM files without
         changes to quality scores or fastq name. Default: [] -- no archiving.
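
       For example, to archive lossy 8-bin CRAMs, set this in the algorithm section:

          algorithm:
            archive: [cram]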

   Changing bcbio defaults
       bcbio provides some hints to change default behavior by either turning specific
       defaults on or off, with tools_on and tools_off. Both can be lists with multiple
       options (a combined example follows the lists below):

       · tools_off  Specify  third  party  tools  to  skip  as part of analysis pipeline. Enables
         turning off specific components of pipelines if not needed:

         · gatk4 Use older GATK versions (3.x) for GATK commands like BQSR,  HaplotypeCaller  and
           VQSR. By default bcbio includes GATK4 and uses it.

         · vqsr turns off variant quality score recalibration for all samples.

          · bwa-mem forces use of the original bwa aln alignment. Without this, we use bwa
            mem for 70bp or longer reads.

          · fastqc turns off quality control FastQC usage.

         · lumpy-genotype skip genotyping for Lumpy samples, which can be slow  in  the  case  of
           many structural variants.

         · seqcluster turns off use of seqcluster tool in srnaseq pipeline.

         · tumoronly-prioritization  turns  off attempted removal of germline variants from tumor
           only calls using external population data sources like ExAC and 1000 genomes.

          · vardict_somatic_filter disables running a post calling filter for VarDict to
            remove variants found in normal samples. Without vardict_somatic_filter, in
            paired analyses no soft filtering of germline variants is performed, but all
            high quality variants pass.

         · upload_alignment turns off final upload of large alignment files.

         · pbgzip turns off use of bgzip with multiple threads.

         · coverage_qc turns off calculation  of  coverage  statistics  with  samtools-stats  and
           picard.

       · tools_on Specify functionality to enable that is off by default:

          · qualimap runs Qualimap (Qualimap uses downsampled BAM files; the numbers
            reported are estimates based on 1e7 reads).

         · qualimap_full runs Qualimap with full bam files but it may be slow.

         · damage_filter annotates low frequency somatic calls in INFO/DKFZBias  for  DNA  damage
           artifacts using DKFZBiasFilter.

          · tumoronly_germline_filter applies a LowPriority filter to tumor-only calls that
            match population germline databases. The default is to just apply the tag EPR
            (external prioritization) that flags variants present in external databases.
            Anything without a pass here is likely germline.

         · vqsr makes GATK try quality score  recalibration  for  variant  filtration,  even  for
           smaller sample sizes.

         · svplots  adds  additional  coverage and summary plots for CNVkit and detected ensemble
           variants.

         · bwa-mem forces use of bwa mem even for samples with less than 70bp reads.

         · gvcf forces gVCF output for callers that support it (GATK HaplotypeCaller,  FreeBayes,
           Platypus).

         · vep_splicesite_annotations  enables  the use of the MaxEntScan and SpliceRegion plugin
           for VEP. Both optional plugins add extra splice site annotations.

         · gemini Create a GEMINI database of variants for downstream query using the new vcfanno
           and vcf2db approach.

         · gemini_orig  Create  a GEMINI database of variants using the older GEMINI loader. Only
           works for GRCh37 and hg19.

         · gemini_allvariants enables all variants to go into GEMINI, not only  those  that  pass
           filters.

         · vcf2db_expand  decompresses  and  expands the genotype columns in the vcfanno prepared
           GEMINI databases, enabling standard SQL queries on genotypes and depths.

         · bnd-genotype enables genotyping of breakends in Lumpy calls, which  improves  accuracy
           but can be slow.

         · lumpy_usecnv uses input calls from CNVkit as prior evidence to Lumpy calling.

         · coverage_perbase calculates per-base coverage depth for analyzed variant regions.

          · bcbiornaseq creates a bcbioRNASeq object for downstream analysis with the
            bcbioRNASeq package.
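
       As the combined example, an illustrative algorithm section turning functionality off
       and on:

          algorithm:
            tools_off: [vqsr, upload_alignment]
            tools_on: [gvcf, qualimap]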

    Parallelization
       · nomap_split_size  Unmapped  base  pair  regions  required to split analysis into blocks.
         Creates islands of mapped reads surrounded by unmapped (or  N)  regions,  allowing  each
         mapped region to run in parallel. (default: 250)

       · nomap_split_targets Number of target intervals to attempt to split processing into. This
         picks unmapped regions evenly spaced across the genome to process concurrently. Limiting
         targets  prevents  a  large number of small targets. (default: 200 for standard runs, 20
         for CWL runs)

   Ensemble variant calling
       In addition to single method variant calling, we support  calling  with  multiple  calling
       methods and consolidating into a final Ensemble callset.

       The  recommended  method  to  do this uses a simple majority rule ensemble classifier that
       builds a final callset based on the intersection of calls. It selects variants represented
       in at least a specified number of callers:

          variantcaller: [mutect2, varscan, freebayes, vardict]
          ensemble:
            numpass: 2
            use_filtered: false

       This example selects variants present in 2 out of the 4 callers and does not use
       filtered calls (the default behavior). Because of the difficulties of producing a
       unified FORMAT/genotype field across callers, the ensemble output contains a mix of
       outputs from the different callers. It picks a representative sample in the order of
       the specified callers, so the example above would have a MuTect2 call if present,
       otherwise a VarScan call if present, otherwise a FreeBayes call. This may require
       custom normalization scripts during post-processing when using these calls.
       bcbio.variation.recall implements this approach, which handles speed and file sorting
       limitations in the bcbio.variation approach.

       An older approach uses the bcbio.variation toolkit to perform the consolidation. An
       example configuration in the algorithm section is:

          variantcaller: [gatk, freebayes, samtools, gatk-haplotype, varscan]
          ensemble:
            format-filters: [DP < 4]
            classifier-params:
              type: svm
            classifiers:
              balance: [AD, FS, Entropy]
              calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
            trusted-pct: 0.65

       The ensemble set of parameters configure how to combine calls from the multiple methods:

       · format-filters  A  set  of  filters  to  apply to variants before combining. The example
         removes all calls with a depth of less than 4.

       · classifier-params Parameters to  configure  the  machine  learning  approaches  used  to
         consolidate calls. The example defines an SVM classifier.

       · classifiers Groups of classifiers to use for training and evaluating during machine
         learning. The example defines two sets of criteria for distinguishing reads with
         allele balance issues and those with low calling support.

       · trusted-pct Defines the threshold of variants to include in the final callset. In
         the example, variants called by more than 65% of the approaches (4 or more callers)
         pass without requiring SVM filtering.

RESOURCES

       The resources section allows customization of locations of programs and memory and compute
       resources to devote to them:

          resources:
            bwa:
              cores: 12
              cmd: /an/alternative/path/to/bwa
            samtools:
              cores: 16
              memory: 2G
            gatk:
              jvm_opts: ["-Xms2g", "-Xmx4g"]

       · cmd Location of an executable. By default, we assume executables are on the path.

       · cores Cores to use for multi-processor enabled software. This is how many cores
         will be allocated per job. For example, if you are running 10 samples and passed
         -n 40 to bcbio-nextgen and the step you are running has cores: 8 set, a maximum of
         five samples will run in parallel, each using 8 cores.

       · jvm_opts  Specific  memory usage options for Java software. For memory usage on programs
         like  GATK,  specify  the  maximum  usage  per  core.  On  multicore  machines,   that's
         machine-memory  divided  by cores.  This avoids memory errors when running multiple jobs
         simultaneously, while the framework will adjust memory up when running multicore jobs.

       · memory Specify the memory per core used by a process. For programs where memory  control
         is  available,  like samtools sort, this limits memory usage. For other programs this is
         an estimate of usage, used by memory-management to avoid over-scheduling memory.  Always
         specify  this  as  the  memory usage for a single core, and the pipeline handles scaling
         this when a process uses multiple cores.

       · keyfile Specify the location of a program specific key file or license server,  obtained
         from a third party software tool. Supports licenses for novoalign and Sentieon. For more
         complex Sentieon setups this can also be a dictionary of environmental variables:

            resources:
              sentieon:
                keyfile:
                  SENTIEON_LICENSE_SERVER: 100.100.100.100:8888
                  SENTIEON_AUTH_MECH: XXX
                  SENTIEON_AUTH_DATA: signature

   Temporary directory
       You can also use the resources section to specify system specific parameters like
       global temporary directories:

          resources:
            tmp:
              dir: /scratch

       This  is  useful on cluster systems with large attached local storage, where you can avoid
       some shared filesystem IO by writing temporary files to the local disk. When setting  this
       keep   in  mind  that  the  global  temporary  disk  must  have  enough  space  to  handle
       intermediates. The space differs between steps but generally you'd need to  have  2  times
       the  largest  input  file  per  sample  and  account for samples running simultaneously on
       multiple core machines.

       To handle clusters that specify local scratch space with an environmental variable,  bcbio
       will resolve environmental variables like:

          resources:
            tmp:
              dir: $YOUR_SCRATCH_LOCATION

   Sample or run specific resources
       To  override  any of the global resource settings in a sample specific manner, you write a
       resource section within your sample YAML configuration. For example, to  create  a  sample
       specific  temporary  directory and pass a command line option to novoalign, write a sample
       resource specification like:

          - description: Example
            analysis: variant2
            resources:
              novoalign:
                options: ["-o", "FullNW", "--rOQ"]
              tmp:
                dir: tmp/sampletmpdir

       To adjust resources for an entire run, you can add this resources specification at the top
       level of your sample YAML:

          details:
            - description: Example
          resources:
            default:
              cores: 16

   Logging directory
       By  default, bcbio writes the logging-output directory to log in the main directory of the
       run. You can configure this to a different location in your bcbio-system.yaml with:

          log_dir: /path/to/logs

INPUT FILE PREPARATION

       Input files for supplementing analysis, like variant_regions, need to match the
       specified reference genome. A common cause of confusion is the two chromosome naming
       schemes for human genome build 37: UCSC-style in hg19 (chr1, chr2) and Ensembl/NCBI
       style in GRCh37 (1, 2). To help avoid some of this confusion, in build 38 we only
       support the commonly agreed on chr1, chr2 style.

       It's important to ensure that the chromosome naming in your input files match those in the
       reference genome selected. bcbio will try to detect this and provide helpful errors if you
       miss it.

       To convert chromosome names, you can use Devon Ryan's collection of chromosome mappings as
       an input to sed.  For instance, to convert hg19 chr-style coordinates to GRCh37:

          wget --no-check-certificate -qO- http://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh37_UCSC2ensembl.txt \
             | awk '{if($1!=$2) print "s/^"$1"/"$2"/g"}' > remap.sed
          sed -f remap.sed original.bed > final.bed

GENOME CONFIGURATION FILES

       Each  genome  build  has  an  associated buildname-resources.yaml configuration file which
       contains organism specific naming and resource files.  bcbio-nextgen  expects  a  resource
       file  present  next  to  the  genome  FASTA  file.  Example genome configuration files are
       available, and automatically installed for natively supported  genomes.  Create  these  by
       hand to support additional organisms or builds.

       The major sections of the file are:

       · aliases  --  Names  for  third-party programs used as part of the analysis, since naming
         expectations can differ between software programs.

       · variation -- Supporting data files for variant analysis. For human analyses,  the  dbSNP
         and  training  files  are  from  the  GATK  resource  bundle.  These are inputs into the
         training models  for  recalibration.  The  automated  CloudBioLinux  data  scripts  will
         download and install these in the variation subdirectory relative to the genome files.

       · rnaseq  --  Supporting  data  files  for  RNA-seq  analysis. The automated installer and
         updater handles retrieval and installation  of  these  resources  for  supported  genome
         builds.

       · srnaseq  --  Supporting  data  files  for  smallRNA-seq analysis. Same as in rnaseq, the
         automated installer and updater handle this for supported genome builds.

       By default, we place the buildname-resources.yaml files next to the genome FASTA
       files in the reference directory. For custom setups, you specify an alternative
       directory in the config-resources section of your bcbio_system.yaml file:

          resources:
            genome:
              dir: /path/to/resources/files

REFERENCE GENOME FILES

       The pipeline requires access to reference genomes, including the raw  FASTA  sequence  and
       pre-built  indexes  for aligners. The automated installer will install reference files and
       indexes for commonly used genomes (see the upgrade-install documentation for command  line
       options).

       For  human  genomes,  we  recommend  using  build  38  (hg38). This is fully supported and
       validated in bcbio, and corrects a lot of issues in the previous build 37. We use the 1000
       genomes  distribution  which includes HLAs and decoy sequences. For human build 37, GRCh37
       and hg19, we use the 1000 genome references provided in the GATK  resource  bundle.  These
       differ in chromosome naming: hg19 uses chr1, chr2, chr3 style contigs while GRCh37 uses 1,
       2, 3. They also differ slightly in content: GRCh37 has masked Pseudoautosomal  regions  on
       chromosome Y allowing alignment to these regions on chromosome X.

       You  can  use  pre-existing  data and reference indexes by pointing bcbio-nextgen at these
       resources. We use the Galaxy .loc  files  approach  to  describing  the  location  of  the
       sequence and index data, as described in data-requirements. This does not require a Galaxy
       installation since the installer sets up a minimal set of .loc files. It  finds  these  by
       identifying  the root galaxy directory, in which it expects a tool-data sub-directory with
       the .loc files. It can do this in two ways:

       · Using the directory of your bcbio-system.yaml. This is the default  mechanism  setup  by
         the automated installer and requires no additional work.

       · From  the path specified by the galaxy_config option in your bcbio-system.yaml. If you'd
         like to move your system YAML file, add the full path to  your  galaxy  directory  here.
         This is useful if you have a pre-existing Galaxy installation with reference data.

       To  manually  make genomes available to bcbio-nextgen, edit the individual .loc files with
       locations to your reference and index genomes. You  need  to  edit  sam_fa_indices.loc  to
       point  at the FASTA files and then any genome indexes corresponding to aligners you'd like
       to use (for example: bwa_index.loc for  bwa  and  bowtie2_indices.loc  for  bowtie2).  The
       database key names used (like GRCh37 and mm10) should match those used in the genome_build
       of your sample input configuration file.
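
       As a sketch of the format (the entries are tab separated; the path is illustrative),
       a sam_fa_indices.loc entry looks like:

          index   GRCh37  /data/genomes/GRCh37/seq/GRCh37.fa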

ADDING CUSTOM GENOMES

       bcbio_setup_genome.py will help you to install a custom genome and apply all changes
       needed to the configuration files. It needs the genome in FASTA format and the
       annotation file in GTF or GFF3 format. It can create indexes for all aligners used by
       bcbio. Moreover, it will create the folder rnaseq to allow you to run the RNA-seq
       pipeline without further configuration.

          bcbio_setup_genome.py -f genome.fa -g annotation.gtf -i bowtie2 star seq -n Celegans -b WBcel135

       If you want to add smallRNA-seq data files, you will need to add the 3-letter miRBase
       code for your genome (e.g. hsa for human) and the GTF file for the annotation of
       smallRNA data. You can use the same file as the transcriptome GTF if no other is
       available.

          bcbio_setup_genome.py -f genome.fa -g annotation.gtf -i bowtie2 star seq -n Celegans -b WBcel135 --species cel --srna_gtf another_annotation.gtf

       To use that genome, you just need to configure your YAML files as:

          genome_build: WBcel135

   Effects prediction
       To  perform  variant calling and predict effects in a custom genome you'd have to manually
       download and link this into your installation. First find the snpEff genome build:

          $ snpEff databases | grep Lactobacillus | grep pentosus
          Lactobacillus_pentosus_dsm_20314                                Lactobacillus_pentosus_dsm_20314                                              ENSEMBL_BFMPP_32_179            http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
          Lactobacillus_pentosus_kca1                                     Lactobacillus_pentosus_kca1                                                   ENSEMBL_BFMPP_32_179            http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip

       then download to the appropriate location:

          $ cd /path/to/bcbio/genomes/Lacto/Lactobacillus_pentosus
          $ mkdir snpEff
          $ cd snpEff
          $ wget http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
          $ unzip snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
          $ find . -name "Lactobacillus_pentosus_dsm_20314"
           ./home/pcingola/snpEff/data/Lactobacillus_pentosus_dsm_20314
          $ mv ./home/pcingola/snpEff/data/Lactobacillus_pentosus_dsm_20314 .

       finally add to your genome configuration file (seq/Lactobacillus_pentosus-resources.yaml):

          aliases:
            snpeff: Lactobacillus_pentosus_dsm_20314

       For adding an organism not present in snpEff, please see this mailing list discussion.

PARALLEL EXECUTION

       The pipeline runs in parallel in two different ways:

       · multiple cores -- Analyses will run  in  parallel  using  multiple  cores  on  a  single
         machine. This requires only the multiprocessing Python library, included by default with
         most Python installations.

       · parallel messaging -- This allows  scaling  beyond  the  cores  available  on  a  single
         machine,  and  requires multiple machines with a shared filesystem like standard cluster
         environments.  Machine to machine communication occurs via messaging, using the  IPython
         parallel framework.

TUNING CORE AND MEMORY USAGE

       bcbio  has  two  ways  to  specify  core  usage, helping provide options for parallelizing
       different types of processes:

       · Total available cores: specified with -n on the commandline, this tells bcbio  how  many
         total cores to use. This applies either to a local multicore run or a distributed job.

       · Maximum  cores  to  use for multicore processing of individual jobs. You specify this in
         the  resource  section  of   either   a   sample   YAML   file   (sample-resources)   or
         bcbio_system.yaml.  Ideally  you  specify this in the default section (along with memory
         usage). For example, this would specify that processes using multiple cores can  get  up
         to 16 cores with 2G of memory per core:

            resources:
              default:
                memory: 2G
                cores: 16
                jvm_opts: ["-Xms750m", "-Xmx2000m"]

       bcbio uses these settings, along with memory requests, to determine how to partition jobs.
       For example, if you had -n 32 and cores: 16 for a run on a single 32  core  machine,  this
       would run two simultaneous bwa mapping jobs using 16 cores each.

       Memory  specifications  (both  in  memory  and jvm_opts) are per-core. bcbio takes care of
       adjusting this memory to match the cores used. In the example above, if bcbio was  running
       a  16 core java process, it would use 32Gb of memory for the JVM, adjusting Xmx and Xms to
       match cores used. Internally bcbio looks at the memory and CPU  usage  on  a  machine  and
       matches  your  configuration options to the available system resources. It will scale down
       core requests if memory is limiting, avoiding over-scheduling resources  during  the  run.
       You  ideally  want to set both memory and jvm_opts to match the average memory per core on
       the run machine and adjust upwards if  this  does  not  provide  enough  memory  for  some
       processes during the run.

       For  single  machine  runs with a small number of samples, you generally want to set cores
       close to or equal the number of total cores you're allocating to the  job  with  -n.  This
       will  allow  individual  samples  to  process  as  fast  as possible and take advantage of
       multicore software.

       For distributed jobs, you want to set cores to match the available cores on a single  node
       in your cluster, then use -n as a multiple of this to determine how many nodes to spin up.
       For example, cores: 16 and -n 64 would try to make four 16  core  machines  available  for
       analysis.

MULTIPLE CORES

       Running using multiple cores only requires setting the -n command line flag:

          bcbio_nextgen.py bcbio_sample.yaml -t local -n 12

IPYTHON PARALLEL

       IPython  parallel  provides a distributed framework for performing parallel computation in
       standard cluster environments. The bcbio-nextgen setup script installs  both  IPython  and
       pyzmq,  which  provides  Python  bindings  for  the  ZeroMQ  messaging  library.  The only
       additional requirement is that the work directory where you run the analysis is accessible
       to  all  processing  nodes.  This is typically accomplished with a distributed file system
       like NFS, Gluster or Lustre.

       Run an analysis using ipython for parallel execution:

          bcbio_nextgen.py bcbio_sample.yaml -t ipython -n 12 -s lsf -q queue

       The -s flag specifies a type of scheduler to use (lsf, sge, torque, slurm, pbspro).

       The -q flag specifies the queue to submit jobs to.

       The -n flag defines the total number of cores to use on the cluster during processing. The
       framework  will  select  the  appropriate number of cores and type of cluster (single core
       versus multi-core) to use based on the pipeline stage (see the internals-parallel  section
       in  the  internals documentation for more details). For multiple core steps, the number of
       cores to use for programs like bwa, novoalign and gatk  comes  from  the  config-resources
       section  of  the configuration.  Ensure the cores specification matches the physical cores
       available on machines in your cluster, and  the  pipeline  will  divide  the  total  cores
       specified by -n into the appropriate number of multicore jobs to run.

       The pipeline default parameters assume a system with minimal time to obtain
       processing cores and consistent file system accessibility. These defaults allow the
       system to fail fast in the case of cluster issues which need diagnosis. For running
       on shared systems with high resource usage and potential failures due to intermittent
       cluster issues, there are tuning parameters that increase resiliency. The --timeout
       flag specifies the number of minutes to wait for a cluster to start up before timing
       out. This defaults to 15 minutes. The --retries flag specifies the number of times to
       retry a job on failure. In systems with transient distributed file system hiccups
       like lock errors or disk availability, this will provide recoverability at the cost
       of resubmitting jobs that may have failed for reproducible reasons.

       Finally, the -r resources flag specifies resource options to pass along to the
       underlying queue scheduler. This currently supports SGE's -l parameter, Torque's -l
       parameter and LSF and SLURM native flags. This allows specification of resources to
       the scheduler (see the qsub man page). You may specify multiple resources, so
       -r mem=4g -r ct=01:40:00 translates to -l mem=4g -l ct=01:40:00 when passed to qsub,
       or -r "account=a2010002" -r "timelimit=04:00:00" when using SLURM, for instance.
       SLURM and Torque support specification of an account parameter with
       -r account=your_name, which IPython transfers into -A.

       SGE supports special parameters passed using resources to help handle the heterogeneity of
       possible setups.

       Specify an SGE parallel environment that supports using multiple cores on a single
       node with -r pename=your_pe. Since this setup is system specific, it is hard to write
       general code to find a suitable environment. Specifically, when there are multiple
       usable parallel environments, bcbio will select the first one, which may not be
       correct. Manually specifying it with a pename= flag to resources will ensure
       selection of the right environment. If you're administering a grid engine cluster and
       are not sure how to set this up, you'd typically want an smp queue using
       allocation_rule: $pe_slots like in this example pename configuration or smp template.

       SGE has other specific flags you may want to tune, depending on your setup. To
       specify an advanced reservation with the -ar flag, use -r ar=ar_id. To specify an
       alternative memory management model instead of mem_free, use -r memtype=approach. It
       is further recommended to configure mem_free (or any other chosen memory management
       model) as a consumable, requestable resource in SGE to prevent overfilling hosts that
       do not have sufficient memory per slot. This can be done in two steps. First, launch
       qmon as an admin, select Complex Configuration in qmon, click on mem_free, under the
       Consumable dialog select JOB (instead of YES or NO) and finally click Modify for the
       changes to take effect. Secondly, for each host in the queue, configure mem_free as a
       complex value. If a host called myngshost has 128GB of RAM, the corresponding command
       would be qconf -mattr exechost complex_values mem_free=128G myngshost.

       There are also special -r resources parameters to support pipeline configuration:

       · -r  conmem=4  --  Specify  the  memory for the controller process, in Gb. This currently
         applies to SLURM processing and defaults to 4Gb.

       · -r minconcores=2 -- The minimum number of cores to use for the controller process.
         The controller works on a single core, but this can help in queues where you can
         only specify multicore jobs.

       · -r mincores=16 -- Specify the minimum number of cores to  batch  together  for  parallel
         single  core  processes  like variant calling. This will run multiple processes together
         under a single submission to allow sharing of resources like memory,  which  is  helpful
         when  a  small  percentage  of the time a process like variant calling will use a lot of
         memory. By default, bcbio will calculate mincores based on specifications for  multicore
         calling so this doesn't normally require a user to set.
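
       As an illustrative combined invocation on a SLURM cluster (queue name and values are
       examples only):

          bcbio_nextgen.py bcbio_sample.yaml -t ipython -n 64 -s slurm -q general \
            -r conmem=8 -r mincores=16 -r "timelimit=04:00:00"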

TROUBLESHOOTING

   Diagnosing job failures
       Parallel jobs can often terminate with rather generic failures like any of the following:

       · joblib/parallel.py", ... TypeError: __init__() takes at least 3 arguments (2 given)

       · Multiprocessing exception:

       · CalledProcessError: Command '<command line that failed>

       These  errors  unfortunately  don't  help  diagnose the problem, and you'll likely see the
       actual error triggering this generic exception earlier in the run. This error can often be
       hard to find due to parallelization.

       If  you  run  into  a  confusing  failure like this, the best approach is to re-run with a
       single core:

          bcbio_nextgen.py your_input.yaml -n 1

       which should produce a more helpful debug message right above the failure.

       It's also worth re-trying the failed command line outside of bcbio to look for errors. You
       can  find the failing command by cross-referencing the error message with command lines in
       log/bcbio-nextgen-commands.log. You may have to change temporary directories (tx/tmp**) in
       some  of  the  job outputs. Reproducing the error outside of bcbio is a good first step to
       diagnosing and fixing the underlying issue.

   No parallelization where expected
       This may occur if the current execution is a re-run of a previous project:

       · Files in checkpoints_parallel/*.done tell bcbio  not  to  parallelize  already  executed
         pipeline tasks. This makes restarts faster by avoiding re-starting a cluster (when using
         distributed runs) for finished stages. If that behaviour is  not  desired  for  a  task,
         removing the checkpoint file will get things parallelizing again.

       · If the processing of a task is nearly finished, the last jobs of this task will
         still be running, and bcbio will wait for those to finish.

   IPython parallelization problems
       Networking problems on clusters can prevent the  IPython  parallelization  framework  from
       working properly. Be sure that the compute nodes on your cluster are aware of IP addresses
       that they can use to  communicate  with  each  other  (usually  these  will  be  local  IP
       addresses). Running:

           python -c 'import socket; print(socket.gethostbyname(socket.gethostname()))'

       should return such an IP address (as opposed to localhost). This can be fixed by
       adding an entry to the hosts file.

       The line:

          host-ip hostname

       where host-ip is replaced by the actual IP address of the machine and hostname by the
       machine's own hostname, should be added to /etc/hosts on each compute node. This will
       probably involve contacting your local cluster administrator.

MEMORY MANAGEMENT

       The memory information specified in  the  system  configuration  config-resources  enables
       scheduling  of  memory  intensive processes. The values are specified on a memory-per-core
       basis and thus bcbio-nextgen handles memory scheduling by:

       · Determining available cores and memory per machine

       · Calculating the memory  and  core  usage.   The  system  configuration  config-resources
         contains the expected core and memory usage of external programs.

       · Adjusting  the  specified  number  of  total cores to avoid over-scheduling memory. This
         allows running programs with more than the available memory per core without getting out
         of memory system errors.

       · Passing  total  memory  usage  along  to  schedulers.  The SLURM, SGE, Torque and PBSPro
         schedulers use this information to allocate memory to processes,  avoiding  issues  with
         other scheduled programs using available memory on a shared machine.

       As  a  result  of  these  calculations,  the  cores used during processing will not always
       correspond to the maximum cores provided in the input -n parameter. The goal is rather  to
       intelligently  maximize  cores and memory while staying within system resources. Note that
       memory specifications are for a single core, and the pipeline takes care of adjusting this
       to actual cores used during processing.
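
       As a worked example of this adjustment: with memory: 3G per core configured and a 16
       core machine with 32Gb of total memory, a nominal 16 core job would need 48Gb. bcbio
       instead schedules the job with 10 cores (10 x 3G = 30Gb) so that memory, rather than
       the requested core count, is the binding constraint.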

DETERMINING AVAILABLE CORES AND MEMORY PER MACHINE

       bcbio  automatically  tries  to determine the total available memory and cores per machine
       for balancing resource usage. For multicore runs,  it  retrieves  total  memory  from  the
       current  machine.  For parallel runs, it spawns a job on the queue and extracts the system
       information from that machine. This expects a homogeneous set of machines within a cluster
       queue.     You    can    see    the    determined    cores    and    total    memory    in
       provenance/system-ipython-queue.yaml.

       For heterogeneous clusters  or  other  cases  where  bcbio  does  not  correctly  identify
       available system resources, you can manually set the machine cores and total memory in the
       resource section of either a sample YAML file (sample-resources) or bcbio_system.yaml:

          resources:
            machine:
              memory: 48.0
              cores: 16

       The memory usage is total  available  on  the  machine  in  Gb,  so  this  specifies  that
       individual machines have 48Gb of total memory and 16 cores.

TUNING SYSTEMS FOR SCALE

       bcbio-nextgen  scales  out on clusters including hundreds of cores and is stress tested on
       systems with 1000 simultaneous processes. Scaling up often requires system specific tuning
       to  handle  simultaneous  processes.  This  section  collects  useful  tips and tricks for
       managing scaling issues.

   Open file handles
       A common failure mode is having too many open file handles. This error report can
       come from the IPython infrastructure logs as ZeroMQ attempts to open sockets, or from
       the processing logs as third party software opens file handles. You can check your
       available file handles with ulimit -a | grep open. Setting open file handle limits is
       operating system and cluster specific; below are tips for specific setups.

       In addition to open file handle limits (ulimit -n), large processes may also run into
       issues with available max user processes (ulimit -u). Some systems set a low soft
       limit (ulimit -Su) like 1024 but a higher hard limit (ulimit -Hu), allowing
       adjustment without root privileges. The IPython controllers and engines do this
       automatically, but the main bcbio_nextgen.py driver process cannot. If the scheduler
       puts this process on the same node as worker processes, you may run into open file
       handle limits due to work happening on the workers. To fix this, manually set
       ulimit -u a_high_number as part of the submission process for the main process.

       For an Ubuntu system, edit /etc/security/limits.conf to set the soft and hard nofile
       descriptors, and edit /etc/pam.d/common-session to add pam_limits.so. See this blog
       post for more details.
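
       As a sketch, the edits might look like (the limit values are illustrative):

          # /etc/security/limits.conf
          *    soft    nofile    65536
          *    hard    nofile    65536

          # /etc/pam.d/common-session
          session required pam_limits.so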

       For       CentOS/RedHat       systems,       edit       /etc/security/limits.conf      and
       /etc/security/limits.d/90-nproc.conf to increase maximum open files and user limits.

       SGE needs configuration at the qmaster level. Invoke qconf -mconf from a host  with  admin
       privileges, and edit execd_params:

          execd_params                 S_DESCRIPTORS=20000

   IO and Network File Systems
       bcbio-nextgen  makes use of distributed network file systems to manage sharing large files
       between compute nodes. While we strive to minimize disk-based processing by making use  of
       pipes,  the  pipeline  still  has  a  major  IO  component.  To help manage IO and network
       bottlenecks, this section  contains  pointers  on  deployments  and  benchmarking.  Please
       contribute your tips and thoughts.

       · Harvard  and  Dell:  See  the  'Distributed File Systems' section of our post on scaling
         bcbio-nextgen for details about the setup within  Harvard  FAS  Research  Computing  and
         thoughts  on scaling and hardware. We also collaborate with Dell to test the pipeline on
         Dell's Active Infrastructure for Life Sciences.  We found  the  biggest  initial  factor
         limiting scaling was network bandwidth between compute and storage nodes.

COMMON WORKFLOW LANGUAGE (CWL)

       bcbio runs with Common Workflow Language (CWL) compatible parallelization software.
       bcbio generates a CWL workflow from a standard bcbio sample YAML description file,
       and any tool that supports CWL input can run the workflow. CWL-based tools do the
       work of managing files and workflows, and bcbio performs the biological analysis
       using either a Docker container or a local installation.

CURRENT STATUS

       bcbio  creates  CWL  for  alignment,  small  variant  calls  (SNPs  and  indels), coverage
       assessment, HLA typing, quality control and structural variant calling. It generates a CWL
       v1.0.2  compatible  workflow.  The actual biological code execution during runs works with
       either a bcbio docker container or a local installation of bcbio.

       The implementation includes bcbio's approaches to splitting and batching analyses. At
       the top level workflow, we parallelize by samples. Using sub-workflows, we split
       fastq inputs into sections for parallel alignment over multiple machines, followed by
       merging. We also use sub-workflows, along with CWL records, to batch multiple samples
       and run in parallel. This enables pooled and tumor/normal cancer calling with
       parallelization by chromosome regions based on coverage calculations.
         [image: Variant calling overview] [image]

       bcbio supports these CWL-compatible tools:

       · Cromwell  --  multicore  local  runs  and  distributed  runs  on HPC systems with shared
         filesystems and schedulers like SLURM, SGE and PBSPro.

       · Arvados -- a hosted platform that runs on top of parallel cloud environments. We include
         an example below of running on the public Curoverse instance running on Microsoft Azure.

       · DNANexus  --  a  hosted platform running distributed jobs on cloud environments, working
         with both AWS and Azure.

       · rabix bunny -- multicore local runs.

       · Toil -- parallel local and distributed cluster runs on schedulers like  SLURM,  SGE  and
         PBSPro.

       · Seven  Bridges -- parallel distributed analyses on the Seven Bridges platform and Cancer
         Genomics Cloud.

       · cwltool -- a single core analysis engine, primarily used for testing.

       We plan to continue to expand CWL support to include more components of  bcbio,  and  also
       need  to  evaluate  the  workflow  on larger, real life analyses. This includes supporting
       additional CWL runners. We're working on evaluating Galaxy/Planemo  for  integration  with
       the Galaxy community.

INSTALLATION

       bcbio-vm installs all dependencies required to generate CWL and run bcbio, along with
       supported CWL runners. There are two install choices, depending on your usage of
       bcbio: running CWL with an existing local bcbio install, or running with containers.

   Install bcbio-vm with a local bcbio
       To run bcbio without using containers, first install bcbio and make it available in
       your path. You'll need both the bcbio code and tools. To only run the tests and bcbio
       validations, you don't need a full data installation, so you can install with
       --nodata.

       To then install bcbio-vm, add the --cwl flag to the install:

          bcbio_nextgen.py upgrade --cwl

       Adding this to any future upgrades will also update the bcbio-vm wrapper code and tools.

       When  you  begin  running  your own analysis and need the data available, pre-prepare your
       bcbio data directory with bcbio_nextgen.py upgrade --data --cwl.

   Install bcbio-vm with containers
       If you don't have an existing local bcbio installation and want to run with CWL using
       the tools and data embedded in containers, you can do a standalone install of just
       bcbio-vm. To install using Miniconda and bioconda packages on Linux:

          wget http://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
          export TARGETDIR=~/install/bcbio-vm/anaconda
          export BINDIR=/usr/local
          bash Miniconda2-latest-Linux-x86_64.sh -b -p $TARGETDIR
          $TARGETDIR/bin/conda install --yes -c conda-forge -c bioconda bcbio-nextgen
          $TARGETDIR/bin/conda install --yes -c conda-forge -c bioconda bcbio-nextgen-vm
          ln -s $TARGETDIR/bin/bcbio_vm.py $BINDIR/bcbio_vm.py
          ln -s $TARGETDIR/bin/conda $BINDIR/bcbiovm_conda
          ln -s $TARGETDIR/bin/python $BINDIR/bcbiovm_python

       In the above commands, the bcbio-vm install goes in $TARGETDIR. The example uses your
       home directory, but you can set it to anywhere you have space. Also, as an
       alternative to symbolically linking into a $BINDIR, you can add the install bin
       directory to your PATH:

          export PATH=$TARGETDIR/bin:$PATH

       This install includes bcbio-nextgen libraries, used in generating CWL and
       orchestrating runs, but is not a full bcbio installation. It requires Docker present
       on your system; this is all you need to get started running examples, since the CWL
       runners will pull in Docker containers with the bcbio tools.

GETTING STARTED

       To  make  it  easy  to get started, we have pre-built CWL descriptions that use test data.
       These run in under 5 minutes on a local machine and don't require a bcbio installation  if
       you have Docker available on your machine:

       1. Download and unpack the test repository:

             wget -O test_bcbio_cwl.tar.gz https://github.com/bcbio/test_bcbio_cwl/archive/master.tar.gz
             tar -xzvpf test_bcbio_cwl.tar.gz
             cd test_bcbio_cwl-master/somatic

       2. Run the analysis using either Cromwell, Rabix bunny or Toil. If you have Docker
          available on your machine, the runner will download the correct bcbio container
          and you don't need to install anything else to get started. If you have an old
          version of the container, you can update to the latest with docker pull
          quay.io/bcbio/bcbio-vc. There are shell scripts that provide the command lines for
          running:

             bash run_cromwell.sh
             bash run_bunny.sh
             bash run_toil.sh

          Or you can run directly using the bcbio_vm.py wrappers:

             bcbio_vm.py cwlrun cromwell somatic-workflow
             bcbio_vm.py cwlrun toil somatic-workflow
             bcbio_vm.py cwlrun bunny somatic-workflow

          These wrappers automatically handle temporary directories, permissions, logging
          and re-starts. If running without Docker, use a local installation of bcbio and
          add --no-container to the commands in the shell scripts.

GENERATING CWL FOR INPUT TO A TOOL

       The first step in running your analysis project in bcbio is to generate CWL. If
       you're already familiar with bcbio, the process of preparing information about your
       sample inputs and analysis is almost identical:

       · A  standard  bcbio  sample configuration file defining the samples. This can either be a
         full prepared YAML file or a template file and CSV with sample data.

       · A bcbio_system.yaml file defining the system environment for running the program.
         This includes the resource specification with cores and memory per core for your
         machines. For choosing cores and memory per core, you generally want to set this to
         match the parameters of a single machine either for a local run or on a cluster.

         In  addition  to resources specifications, the bcbio system file now also includes paths
         to the reference biodata and optionally input file directories  if  you  want  to  avoid
         specifying  full  paths  to your inputs in the bcbio_vm.py template command.  bcbio will
         recursively look up file locations within those inputs, and this has  the  advantage  of
         working  identically  for  non-local  file  locations.  Here is an example for a 16 core
         machine with 3.5Gb of memory per core:

            local:
              ref: /path/to/bcbio/genomes/Hsapiens
              inputs:
                - /path/to/input/files
            resources:
              default:
                cores: 16
                memory: 3500M
                jvm_opts: [-Xms1g, -Xmx3500m]

       Generate CWL with:

          bcbio_vm.py template --systemconfig bcbio_system.yaml template.yaml samples.csv [optional list of fastq or BAM inputs]
          bcbio_vm.py cwl --systemconfig bcbio_system.yaml samples/config/samples.yaml

       producing a sample-workflow output directory with the CWL.

       On a first CWL generation run with a new genome, this process will run for a  longer  time
       as  it  needs  to  make  your reference compatible with CWL. This includes creating single
       tar.gz files from some reference directories so they can get passed  to  CWL  steps  where
       they'll  get unpacked. This process only happens a single time and keeps unpacked versions
       so your reference setup is compatible with both old bcbio IPython and new CWL runs.

       You can now run this with any CWL compatible runner and the  bcbio_vm.py  cwlrun  wrappers
       standardize running across multiple tools in different environments.

RUNNING WITH CROMWELL (LOCAL, HPC, CLOUD)

       The  Cromwell  workflow management system runs bcbio either locally on a single machine or
       distributed on a cluster using a scheduler like SLURM, SGE or PBSPro.

       To run a bcbio CWL workflow locally using Docker:

          bcbio_vm.py cwlrun cromwell sample-workflow

       If you want to run from a locally installed bcbio add --no-container to the commandline.

       To run distributed on a SLURM cluster:

          bcbio_vm.py cwlrun cromwell sample-workflow --no-container -q your_queue -s slurm -r timelimit=0-12:00

       Tweak scheduler parameters using the same options as the older bcbio IPython approach.

       To control the resources Cromwell uses, set --joblimit to the number of jobs allowed to
       run concurrently. This isn't total cores used, but rather the number of jobs scheduled
       concurrently, either locally or remotely. Since CWL steps are heterogeneous and use only
       the cores necessary for that job, the total cores used will max out at joblimit times
       the maximum cores for an individual process. Setting this helps avoid over-committing
       jobs to a shared scheduler during highly parallel processes like variant calling.
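
       For example, a minimal invocation capping concurrent scheduler submissions (the limit
       value here is illustrative; pick one matching your cluster policy):

          bcbio_vm.py cwlrun cromwell sample-workflow --joblimit 10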

       Cromwell can also run directly on cloud resources: docs-cloud-gcp.

RUNNING WITH TOIL (LOCAL, HPC)

       The Toil pipeline management system runs CWL workflows in parallel on a local machine,
       on a cluster or on AWS.

       To run a bcbio CWL workflow locally with Toil using Docker:

          bcbio_vm.py cwlrun toil sample-workflow

       If you want to run from a locally installed bcbio add --no-container to the commandline.

       To run distributed on a SLURM cluster:

          bcbio_vm.py cwlrun toil sample-workflow -- --batchSystem slurm

RUNNING ON ARVADOS (HOSTED CLOUD)

       bcbio generated CWL workflows run on Arvados and these instructions detail how to run on
       the Arvados public instance. The Arvados cwl-runner comes pre-installed with bcbio-vm.
       We have a publicly accessible project, called bcbio_resources, that contains the latest
       Docker images, test data and genome references you can use for runs.

       Retrieve API keys from the Arvados public instance. Log in, then go to 'User Icon ->
       Personal Token'. Copy and paste the commands given there into your shell. You'll
       specifically need to set ARVADOS_API_HOST and ARVADOS_API_TOKEN.
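
       The exported values take this form (both values are placeholders; use the host and token
       shown on your Personal Token page):

          export ARVADOS_API_HOST=qr1hi.arvadosapi.com
          export ARVADOS_API_TOKEN=your_arvados_token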

       To run an analysis:

       1. Create  a  new project from the web interface (Projects -> Add a new project). Note the
          project   ID    from    the    URL    of    the    project    (an    identifier    like
          qr1hi-j7d0g-7t73h4hrau3l063).

       2. Upload  reference  data  to Arvados Keep. Note the genome collection UUID. You can also
          use the existing genomes pre-installed in the  bcbio_resources  project  if  using  the
          public Arvados playground:

             arv-put --name testdata_genomes --project-uuid $PROJECT_ID testdata/genomes/hg19

       3. Upload input data to Arvados Keep. Note the collection UUID:

             arv-put --name testdata_inputs --project-uuid $PROJECT_ID testdata/100326_FC6107FAAXX testdata/automated testdata/reference_material

        4. Create an arvados section in a bcbio_system.yaml file specifying locations to look
           for reference and input data. input can be one or more collections containing files
           or associated files referenced in the original sample YAML:

             arvados:
               reference: qr1hi-4zz18-kuz1izsj3wkfisq
               input: [qr1hi-j7d0g-h691y6104tlg8b4]
             resources:
               default: {cores: 4, memory: 2G, jvm_opts: [-Xms750m, -Xmx2500m]}

        5. Generate the CWL to run your samples. If you're using multiple input files with a
           CSV metadata file and template, start with creation of a configuration file:

             bcbio_vm.py template --systemconfig bcbio_system_arvados.yaml testcwl_template.yaml testcwl.csv

          To generate the CWL from the system and sample configuration files:

             bcbio_vm.py cwl --systemconfig bcbio_system_arvados.yaml testcwl/config/testcwl.yaml

        6. In most cases, Arvados should directly pick up the Docker images you need from the
           public bcbio_resources project in your instance. If you need to manually add them to
           your project, you can copy the latest bcbio Docker image into your project from
           bcbio_resources using arv-copy. You'll need to find the UUIDs of
           quay.io/bcbio/bcbio-vc and arvados/jobs:

             arv-copy $JOBS_ID --project-uuid $PROJECT_ID --src qr1hi --dst qr1hi
             arv-copy $BCBIO_VC_ID --project-uuid $PROJECT_ID --src qr1hi --dst qr1hi

          or import local Docker images to your Arvados project:

             docker pull arvados/jobs:1.0.20180216164101
             arv-keepdocker --project $PROJECT_ID -- arvados/jobs 1.0.20180216164101
             docker pull quay.io/bcbio/bcbio-vc
             arv-keepdocker --project $PROJECT_ID -- quay.io/bcbio/bcbio-vc latest

       7. Run the CWL on the Arvados public cloud using the Arvados cwl-runner:

             bcbio_vm.py cwlrun arvados arvados_testcwl-workflow -- --project-uuid $PROJECT_ID

RUNNING ON DNANEXUS (HOSTED CLOUD)

       bcbio runs on the DNAnexus platform by converting bcbio generated CWL into DNAnexus
       workflows and apps using dx-cwl. This describes the process using the bcbio workflow app
       (bcbio-run-workflow) and bcbio workflow applet
       (bcbio_resources:/applets/bcbio-run-workflow) in the public bcbio_resources project.
       Both are regularly updated and maintained on the DNAnexus platform. Secondarily, we also
       show how to install and create workflows locally for additional control and debugging.

       0.  Set some useful environmental variables:

           · $PNAME  --  The  name  of the project you're analyzing. For convenience here we keep
             this the same for your local files and remote DNAnexus project, although  that  does
             not have to be true.

           · $DX_AUTH_TOKEN  --  The  DNAnexus  authorization  token  for access, used for the dx
             command line tool and bcbio scripts.

           · $DX_PROJECT_ID --  The  DNAnexus  GUID  identifier  for  your  project  (similar  to
             project-F8Q7fJj0XFJJ3XbBPQYXP4B9).   You   can   get   this   from   dx   env  after
             creating/selecting a project in steps 1 and 2.
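
            For example, with placeholder values (the project ID shown is the illustrative one
            from above):

               export PNAME=myproject
               export DX_AUTH_TOKEN=your_dnanexus_token
               export DX_PROJECT_ID=project-F8Q7fJj0XFJJ3XbBPQYXP4B9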

       1.  Create an analysis project:

              dx new project $PNAME

       2.  Upload sample data to the project:

              dx select $PNAME
              dx upload -p --path /data/input *.bam

       3.  Create a bcbio system YAML file with projects, locations of files and desired core and
           memory  usage  for  jobs.  bcbio  uses the core and memory specifications to determine
           machine instance types to use:

              dnanexus:
                project: PNAME
                ref:
                  project: bcbio_resources
                  folder: /reference_genomes
                inputs:
                  - /data/input
                  - /data/input/regions
              resources:
                default: {cores: 8, memory: 3000M, jvm_opts: [-Xms1g, -Xmx3000m]}

       4.  Create a bcbio sample CSV file referencing samples to run. The files can  be  relative
           to  the  inputs directory specified above; bcbio will search recursively for files, so
           you don't need to specify full paths if your file  names  are  unique.  Start  with  a
           sample specification:

              samplename,description,batch,phenotype
              file1.bam,sample1,b1,tumor
              file2.bam,sample2,b1,normal
              file3.bam,sample3,b2,tumor
              file4.bam,sample4,b2,normal

        5.  Pick a template file that describes the bcbio configuration variables. You can
            define parameters either globally (in the template file) or by sample (in the CSV)
            using the standard bcbio templating. An example template for GATK4 germline variant
            calling is:

              details:
               - algorithm:
                   aligner: bwa
                   variantcaller: gatk-haplotype
                 analysis: variant2
                 genome_build: hg38

        6.  Supply the three inputs (bcbio_system.yaml, project.csv and template.yaml) to
            either the bcbio-run-workflow app or applet. This example uses a specific version
            of the bcbio app for full reproducibility; any future re-runs will always use the
            exact same versioned tools and workflows. You can do this using the web interface
            or via the command line with a small script like:

              TEMPLATE=germline
              APP_VERSION=0.0.2
              FOLDER=/bcbio/$PNAME
               dx select "$PNAME"
              dx mkdir -p $FOLDER
              for F in $TEMPLATE-template.yaml $PNAME.csv bcbio_system-dnanexus.yaml
              do
                      dx rm -a /$FOLDER/$F || true
                      dx upload --path /$FOLDER/ $F
              done
              dx ls $FOLDER
              dx rm -a -r /$FOLDER/dx-cwl-run || true
              dx run bcbio-run-workflow/$APP_VERSION -iyaml_template=/$FOLDER/$TEMPLATE-template.yaml -isample_spec=/$FOLDER/$PNAME.csv -isystem_configuration=/$FOLDER/bcbio_system-dnanexus.yaml -ioutput_folder=/$FOLDER/dx-cwl-run

            Alternatively, if you want the latest bcbio code, change the final command to use
            the applet. Everything else in the script is identical:

              dx run bcbio_resources:/applets/bcbio-run-workflow -iyaml_template=/$FOLDER/$TEMPLATE-template.yaml -isample_spec=/$FOLDER/$PNAME.csv -isystem_configuration=/$FOLDER/bcbio_system-dnanexus.yaml -ioutput_folder=/$FOLDER/dx-cwl-run

       The app will look up all files, prepare a bcbio CWL workflow, convert it into a DNAnexus
       workflow, and submit it to the platform. The workflow runs as a standard DNAnexus
       workflow and you can monitor it through the command line (with dx find executions --root
       job-YOURJOBID and dx watch) or the web interface (Monitor tab).

       If you prefer not to use the DNAnexus app, you can also submit jobs locally by
       installing bcbio-vm on your local machine. This can also be useful for testing CWL
       generation and manually confirming identification of all your samples and associated
       files on the DNAnexus platform.

       1. Follow the automated-sample-config workflow  to  generate  a  full  configuration,  and
          generate a CWL description of the workflow:

             TEMPLATE=germline
             rm -rf $PNAME $PNAME-workflow
             bcbio_vm.py template --systemconfig bcbio_system-dnanexus.yaml $TEMPLATE-template.yaml $PNAME.csv
             bcbio_vm.py cwl --systemconfig bcbio_system-dnanexus.yaml $PNAME/config/$PNAME.yaml

       2. Determine project information and login credentials. You'll want to note the Auth token
          used and Current workspace project ID:

             dx env

       3. Compile the CWL workflow into a DNAnexus workflow:

              FOLDER=/bcbio/$PNAME
              dx-cwl compile-workflow $PNAME-workflow/main-$PNAME.cwl \
                 --project $DX_PROJECT_ID --token $DX_AUTH_TOKEN \
                 --rootdir $FOLDER/dx-cwl-run

       4. Upload sample information from generated CWL and run workflow:

              dx mkdir -p $DX_PROJECT_ID:$FOLDER/$PNAME-workflow
              dx upload -p --path $DX_PROJECT_ID:$FOLDER/$PNAME-workflow $PNAME-workflow/main-$PNAME-samples.json
              dx-cwl run-workflow $FOLDER/dx-cwl-run/main-$PNAME/main-$PNAME \
                     $FOLDER/$PNAME-workflow/main-$PNAME-samples.json \
                     --project $DX_PROJECT_ID --token $DX_AUTH_TOKEN \
                     --rootdir $FOLDER/dx-cwl-run

DEVELOPMENT NOTES

       bcbio generates a Common Workflow Language (CWL) description. Internally, bcbio
       represents the files and information related to processing as a comprehensive
       dictionary. This world object describes the state of a run and associated files, and
       new processing steps update or add information to it. The world object is roughly
       equivalent to CWL's JSON-based input object, but CWL enforces additional annotations to
       identify files and models new inputs/outputs at each step. The work in bcbio is to move
       from our laissez-faire approach to the more structured CWL model.

       The generated CWL workflow is in run_info-cwl-workflow:

       · main-*.cwl -- the top level CWL file describing the workflow steps

       · main*-samples.json -- the flattened bcbio world structure represented as CWL inputs

       · wf-*.cwl -- CWL sub-workflows, describing sample level parallel processing of a  section
         of the workflow, with potential internal parallelization.

       · steps/*.cwl -- CWL descriptions of sections of code run inside bcbio. Each of these is
         a potential parallelization point, and together they make up the nodes in the workflow.

       To help with defining the outputs at each step, there is a WorldWatcher object that can
       output changed files and world dictionary objects between steps in the pipeline when
       running bcbio in the standard way. The variant pipeline has examples using it. This is
       useful when preparing the CWL definitions of inputs and outputs for new steps in the
       bcbio CWL step definitions.

TODO

       · Support the full variant calling workflow with additional steps like  ensemble  calling,
         heterogeneity detection and disambiguation.

       · Port RNA-seq and small RNA workflows to CWL.

       bcbio has two approaches to running on cloud providers like Amazon Web Services (AWS),
       Google Cloud (GCP) and Microsoft Azure. For smaller projects we use a simplified
       Ansible-based approach which automates spinning up single multicore machines for running
       either traditional or docs-cwl bcbio runs.

       For larger distributed projects, we're actively working on  using  docs-cwl  support  with
       runners  like  Cromwell  that directly interface and run on cloud services. We'll document
       these approaches here as they're tested and available.

       For getting started, the CWL docs-cwl-installation documentation describes how to  install
       bcbio-vm,  which  provides  a  wrapper  around bcbio that automates interaction with cloud
       providers and Docker. bcbio_vm.py also cleans up the command line usage to  make  it  more
       intuitive and provides a superset of functionality available in bcbio_nextgen.py.

GOOGLE CLOUD (GCP)

       Cromwell runs bcbio CWL pipelines on Google Cloud using the Google Pipelines API.

   Setup
       To set up a Google Compute environment, you'll make use of the web-based console and
       gcloud and gsutil from the Google Cloud SDK, which provide command line interfaces to
       manage data in Google Storage and Google Compute instances. You can install them with:

          bcbio_conda install -c conda-forge -c bioconda google-cloud-sdk

       For authentication, you want to set up a Google Cloud Platform service account. The
       environmental variable GOOGLE_APPLICATION_CREDENTIALS identifies a JSON file of
       credentials, which bcbio passes to Cromwell for authentication:

          gcloud auth login
          gcloud projects create your-project
          gcloud iam service-accounts create your-service-account
          gcloud projects add-iam-policy-binding your-project --member \
            "serviceAccount:your-service-account@your-project.iam.gserviceaccount.com" --role "roles/owner"
           gcloud iam service-accounts keys create ~/.config/gcloud/your-service-account.json \
            --iam-account your-service-account@your-project.iam.gserviceaccount.com
          export GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/your-service-account.json

       You'll  need  a  project for your run along with a Google Storage bucket for your data and
       run intermediates:

          gcloud config set project your-project
          gsutil mb gs://your-project

       Additional documentation for Cromwell: Google Pipelines API and Google authentication.

   Data preparation
       Cromwell can localize data present in Google Storage buckets as part of the run process,
       and bcbio will translate the data present in these storage buckets into references for
       the CWL run inputs.

       Upload your data with gsutil:

          gsutil cp your_data.bam gs://your-project/inputs/

       Create a bcbio_system-gcp.yaml input file for docs-cwl-generate:

          gs:
            ref: gs://bcbiodata/collections
            inputs:
              - gs://your-project/inputs
          resources:
            default: {cores: 2, memory: 3G, jvm_opts: [-Xms750m, -Xmx3000m]}

       Generate a Common Workflow Language representation:

          bcbio_vm.py template --systemconfig bcbio_system-gcp.yaml ${TEMPLATE}-template.yaml $PNAME.csv
          bcbio_vm.py cwl --systemconfig bcbio_system-gcp.yaml $PNAME/config/$PNAME.yaml

   Running
       Run the CWL using Cromwell by specifying the project and root Google  Storage  bucket  for
       intermediates:

          bcbio_vm.py cwlrun cromwell $PNAME --cloud-project your-project \
              --cloud-root gs://your-project/work_cromwell

AMAZON WEB SERVICES (OLD)

       Amazon  Web  Services  (AWS)  provides  a  flexible  cloud  based  environment for running
       analyses. Cloud approaches offer  the  ability  to  perform  analyses  at  scale  with  no
       investment  in  local  hardware.  They  also  offer  full  programmatic  control  over the
       environment, allowing bcbio to automate the entire setup, run and teardown process.

       bcbio-vm uses Elasticluster to build a cluster on AWS with an encrypted NFS mounted
       drive and an optional Lustre shared filesystem. We're phasing out this approach to cloud
       support in bcbio and are actively moving to Common Workflow Language based approaches.

   Data preparation
       The easiest way to organize AWS projects is using an analysis folder inside an S3  bucket.
       Create  a  bucket  and  folder  for your analysis and upload fastq, BAM and, optionally, a
       region BED file. Bucket names should include only lowercase letters, numbers and hyphens
       (-) to conform to S3 bucket naming restrictions and avoid issues with resolution of SSL
       keys. You can create buckets and upload files using the AWS S3 web console, the AWS CLI
       client or specialized tools like gof3r.

       You will also need a template file describing the type of run to do and a CSV file mapping
       samples in the bucket to names and any other  metadata.  See  the  automated-sample-config
       docs for more details about these files. Also upload both of these files to S3.
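
       For example, with the AWS CLI (file names here match the template command below):

          aws s3 cp template.yaml s3://your-project/your-analysis/
          aws s3 cp name.csv s3://your-project/your-analysis/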

       With that in place, prepare and upload the final configuration to S3 with:

          bcbio_vm.py template s3://your-project/your-analysis/template.yaml s3://your-project/your-analysis/name.csv

       This will find the input files in the s3://your-project/your-analysis bucket, associate
       fastq and BAM files with the right samples, and add any found BED files as
       variant_regions in the configuration. It will then upload the final configuration back
       to S3 as s3://your-project/your-analysis/name.yaml, which you can run directly from a
       bcbio cluster on AWS. By default, bcbio will use the us-east S3 region, but you can
       specify a different region in the s3 path to the metadata file:
       s3://your-project@eu-central-1/your-analysis/name.csv

       We currently support human analysis with both the GRCh37 and hg19 genomes. We can also add
       additional genomes as needed by the community and generally welcome feedback and  comments
       on reference data support.

   Extra software
       We're not able to automatically install some useful tools in pre-built Docker containers
       due to licensing restrictions. Variant calling with GATK requires a manual download from
       the GATK download site for academic users. Commercial users need a license for GATK and
       for somatic calling with muTect. To make these jars available, upload them to the S3
       bucket in a jars directory. bcbio will automatically include the correct GATK and muTect
       directives during your run. Alternatively, you can manually specify the path to the jars
       using a global resources section of your input sample YAML file:

          resources:
            gatk:
              jar: s3://bcbio-syn3-eval/jars/GenomeAnalysisTK.jar

       As  with  sample  YAML  scripts,  specify a different region with an @ in the bucket name:
       s3://your-project@us-west-2/jars/GenomeAnalysisTK.jar
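
       For example, uploading the GATK jar with the AWS CLI (bucket name illustrative, matching
       the examples above):

          aws s3 cp GenomeAnalysisTK.jar s3://your-project/jars/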

   AWS setup
       The first time running bcbio on AWS you'll need to set up permissions, VPCs and local
       configuration files. We provide commands to automate all these steps and, once finished,
       they can be re-used for subsequent runs. To start you'll need to have an account at
       Amazon and your Access Key ID and Secret Key ID from the AWS security credentials page.
       These can be IAM credentials instead of root credentials as long as they have
       administrator privileges. Make them available to bcbio using the standard environmental
       variables:

          export AWS_ACCESS_KEY_ID=your_access_key
          export AWS_SECRET_ACCESS_KEY=your_secret_key

       With this in place, two commands set up your elasticluster and AWS environment to run a
       bcbio cluster. The first creates public/private keys, a bcbio IAM user, and sets up an
       elasticluster config in ~/.bcbio/elasticluster/config:

          bcbio_vm.py aws iam --region=us-east-1

       The second configures a VPC to host bcbio:

          bcbio_vm.py aws vpc --region=us-east-1

       The aws vpc command is idempotent and can run multiple times if you change or remove parts
       of the infrastructure. You can also rerun the aws  iam  command,  but  if  you'd  like  to
       generate  a  new  elasticluster configuration file (~/.bcbio/elasticluster/config) add the
       recreate flag: bcbio_vm.py aws iam --recreate. This generates a new set of IAM credentials
       and  public/private  keys.  These are only stored in the ~/.bcbio directory so you need to
       fully recreate them if you delete the old ones.

   Running a cluster
       Following this setup, you're ready to run a bcbio cluster on AWS. We start from a standard
       Ubuntu AMI, installing all software for bcbio and the cluster as part of the boot process.

       To configure your cluster run:

          bcbio_vm.py aws config edit

       This dialog allows you to define the cluster size and machine resources you'd like to use.
       The defaults only include small instances to prevent accidentally starting an expensive
       run. If you're planning a run with fewer than 32 cores, do not use a cluster; instead
       run directly on a single machine using one of the large r3 or c3 instances.

       This script also sets the size of the encrypted NFS-mounted drive, which you  can  use  to
       store processing data when running across a distributed cluster. At scale, you can replace
       this with a Lustre shared filesystem. See below for details on launching and  attaching  a
       Lustre filesystem to a cluster.

       To ensure everything is correctly configured, run:

          bcbio_vm.py aws info

       When happy with your setup, start the cluster with:

          bcbio_vm.py aws cluster start

       The  cluster  will  take five to ten minutes to start and be provisioned. If you encounter
       any intermittent failures, you can rerun the cluster configuration step  with  bcbio_vm.py
       aws  cluster  setup  or  the  bcbio-specific  installation  with  bcbio_vm.py  aws cluster
       bootstrap.

   Running Lustre
       Elasticluster mounts the /encrypted directory as an NFS share available across all of
       the worker machines. You can use this as a processing directory for smaller runs, but
       larger runs may need a scalable distributed file system. bcbio supports using Intel
       Cloud Edition for Lustre (ICEL) to set up a Lustre scratch filesystem on AWS.

       · Subscribe to ICEL in the Amazon Marketplace.

       · By default, the Lustre filesystem will be 2TB and will be accessible to all hosts in the
         VPC. Creation takes about ten minutes and can happen  in  parallel  while  elasticluster
         sets up the cluster. Start the stack:

            bcbio_vm.py aws icel create

          If you encounter intermittent failures when installing the collectl plugin, the
          Lustre server was created successfully; you can rerun the Lustre configuration step
          with bcbio_vm.py aws icel create --setup. If a failure occurred before the collectl
          plugin installation, stop the stack and try again.

       · Once the ICEL stack and elasticluster cluster are both running, mount the filesystem  on
         the cluster:

            bcbio_vm.py aws icel mount

       · The cluster instances will reboot with the Lustre filesystem mounted.

   Running an analysis
       To run the analysis, connect to the head node with:

          bcbio_vm.py aws cluster ssh

       Create your project directory and link the global bcbio configuration file in there with:

       · NFS file system (no Lustre):

            mkdir /encrypted/your-project
            cd !$ && mkdir work && cd work

       · Lustre file system:

            sudo mkdir /scratch/cancer-dream-syn3-exome
            sudo chown ubuntu !$
            cd !$ && mkdir work && cd work

       If you started a single machine, run with:

          bcbio_vm.py run -n 8 s3://your-project/your-analysis/name.yaml

       Where the -n argument should be the number of cores on the machine.

       To run on a full cluster:

          bcbio_vm.py ipythonprep s3://your-project/your-analysis/name.yaml slurm cloud -n 60
          sbatch bcbio_submit.sh

       Where 60 is the total number of cores to use across all the worker nodes. Of your total
       machine cores, allocate 2 for the base bcbio_vm script and IPython controller instances.
       The SLURM workload manager distributes jobs across your cluster on a queue called cloud.
       A slurm-PID.out file in the work directory contains the current status of the job, and
       sacct_std provides the status of jobs on the cluster. If you are new to SLURM, here is a
       summary of useful SLURM commands.
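
       For reference, a few standard SLURM commands you're likely to use (these are generic
       SLURM tools, not bcbio-specific):

          squeue -u $USER      # list your queued and running jobs
          sacct -j JOBID       # accounting details for a job
          scancel JOBID        # cancel a job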

       On successful completion, bcbio uploads the results of the analysis back into your S3
       bucket and folder as s3://your-project/your-analysis/final. You can now clean up the
       cluster and Lustre filesystem.

   Graphing resource usage
       AWS runs include automatic monitoring of  resource  usage  with  collectl.  bcbio_vm  uses
       collectl statistics to plot CPU, memory, disk and network usage during each step of a run.
       To  prepare  resource  usage  plots  after  finishing  an   analysis,   first   copy   the
       bcbio-nextgen.log  file  to your local computer. Either use bcbio_vm.py elasticluster sftp
       bcbio      to      copy       from       the       work       directory       on       AWS
       (/encrypted/your-project/work/log/bcbio-nextgen.log)  or  transfer  it  from the output S3
       bucket (your-project/your-analysis/final/DATE_your-project/bcbio-nextgen.log).

       If your run worked cleanly you can use the log input file directly. If  you  had  failures
       and  restarts, or would only like to graph part of the run, you can edit the timing steps.
       Run grep Timing bcbio-nextgen.log > your-run.txt to get the timing steps only,  then  edit
       as desired.

       Retrieve  the  collectl  statistics  from  the  AWS cluster and prepare the resource usage
       graphs with:

          bcbio_vm.py graph bcbio-nextgen.log

       By  default  the  collectl  stats  will   be   in   monitoring/collectl   and   plots   in
       monitoring/graphs  based  on  the  above  log timeframe. If you need to re-run plots later
       after shutting the cluster down, you can use the none cluster flag by running  bcbio_vm.py
       graph bcbio-nextgen.log --cluster none.

       If  you'd  like to run graphing from a local non-AWS run, such as a local HPC cluster, run
       bcbio_vm.py graph bcbio-nextgen.log --cluster local instead.

       For convenience, there's a "serialize" flag (-s) that saves the dataframe used for
       plotting. To explore the data, extract specific datapoints or zoom in, you can
       deserialize the output like a standard Python pickle file:

          import gzip
          import cPickle as pickle

          with gzip.open("./monitoring/collectl_info.pickle.gz", "rb") as decomp:
              collectl_info = pickle.load(decomp)
          data, hardware, steps = (collectl_info[1][0], collectl_info[1][1],
                                   collectl_info[1][2])

       You can then plot, slice and zoom it in a jupyter notebook using matplotlib or
       highcharts (https://github.com/arnoutaertgeerts/python-highcharts).

       In  addition  to plots, the summarize_timing.py utility script prepares a summary table of
       run times per step.

   Shutting down
       The bcbio Elasticluster and Lustre integration can spin up a lot of AWS resources.  You'll
       be  paying for these by the hour so you want to clean them up when you finish running your
       analysis. To stop the cluster:

          bcbio_vm.py aws cluster stop

       To remove the Lustre stack:

          bcbio_vm.py aws icel stop

       Double check that all instances have been properly stopped by looking in the AWS console.

   Manual configuration
       Experienced elasticluster users can edit the configuration files themselves. bcbio
       provides a small wrapper that automatically reads and writes these configurations to
       avoid users needing to understand elasticluster internals, but all functionality is
       fully available. Edit your ~/.bcbio/elasticluster/config file to change parameters. You
       can also see the latest example configuration in the bcbio-vm GitHub repository for more
       details on the other available options.

       bcbio-nextgen runs in a temporary work directory which contains a number of processing
       intermediates. Pipeline completion extracts the final useful output files into a
       separate directory, specified by the upload-configuration. This configuration allows
       upload to local directories, Galaxy, or Amazon S3. Once you have extracted and confirmed
       the output files, you can delete the temporary directory to save space.
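
       For example, a minimal upload section in your project YAML that extracts final outputs
       into a local directory (adjust the path to your layout):

          upload:
            dir: ../final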

COMMON FILES

       The  output  directory  contains sample specific output files labeled by sample name and a
       more general project directory. The sample directories contain all of the sample  specific
       output  files, while the project directory contains global files like project summaries or
       batched population level variant calls. See the teaching documentation for a full
       variant calling example with additional details about configuration settings and
       resulting output files.

   Project directory
       · project-summary.yaml -- Top level YAML format  summary  file  with  statistics  on  read
         alignments and duplications as well as analysis specific metrics.

       · programs.txt  --  Program  versions  for bcbio-nextgen and software run in the pipeline.
         This enables reproduction of analyses.

       · multiqc -- Run MultiQC to gather QC metrics from different tools, such as cutadapt,
         featureCounts, samtools and STAR, into a single HTML report.

   Sample directories
       · SAMPLE/qc -- Directory of quality control runs for the sample.  These include charts and
         metrics for assessing quality of sequencing and analysis.

       · SAMPLE-ready.bam -- A prepared BAM file of the aligned reads.  Depending on the analysis
         used, this may include trimmed, recalibrated and realigned reads following alignment.

VARIANT CALLING

   Project directory
       · grading-summary.csv  --  Grading  details  comparing  each  sample to a reference set of
         calls. This will only have information when providing a validation callset.

       · BATCH-caller.vcf -- Variants called for a population/batch of samples  by  a  particular
         caller.

       · BATCH-caller.db  --  A  GEMINI database associating variant calls with a wide variety of
         third party annotations. This provides  a  queryable  framework  for  assessing  variant
         quality statistics.

   Sample directories
       · SAMPLE-caller.vcf -- Variant calls for an individual sample.

RNA-SEQ

   Project directory
       · annotated_combined.counts  --  featureCounts  counts matrix with gene symbol as an extra
         column.

       · combined.counts -- featureCounts counts matrix.

       · combined.dexseq -- DEXseq counts matrix with exonID as first column.

       · combined.gene.sf.tpm -- Sailfish gene count matrix normalized to TPM.

       · combined.isoform.sf.tpm -- Sailfish transcript count matrix normalized to TPM.

       · combined.sf -- Sailfish raw output, all sample files are pasted one after another.

       · tx2gene.csv -- Annotation file needed for DESeq2 to use Sailfish output.

   Sample directories
       · SAMPLE-transcriptome.bam -- BAM file aligned to transcriptome.

       · SAMPLE-ready.counts -- featureCounts gene counts output.

       · sailfish -- Sailfish output.

SMALL RNA-SEQ

   Project directory
       · counts_mirna.tsv -- miRBase miRNA count matrix.

       · counts.tsv -- miRBase isomiRs count matrix. The ID is made of 5 tags: miRNA name,
         SNPs, additions, trimming at the 5' end and trimming at the 3' end. A detailed
         explanation of the naming is available in the online documentation.

       · counts_mirna_novel.tsv -- miRDeep2 miRNA count matrix.

       · counts_novel.tsv -- miRDeep2 isomiRs count matrix. See the counts.tsv explanation for
         more detail.

       · seqcluster -- output of the seqcluster tool. Inside this folder, counts.tsv has the
         count matrix for all clusters found over the genome.

       · seqclusterViz     --      input      file      for      interactive      browser      at
         https://github.com/lpantano/seqclusterViz

       · report  --  Rmd  template to help with downstream analysis like QC metrics, differential
         expression, and clustering.

   Sample directories
       · SAMPLE-mirbase-ready.counts -- counts for miRBase miRNAs.

       · SAMPLE-novel-ready -- counts for miRDeep2 novel miRNAs.

       · tRNA -- output for tdrmapper.

DOWNSTREAM ANALYSIS

       This section collects useful scripts and tools to do downstream analysis of  bcbio-nextgen
       outputs. If you have pointers to useful tools, please add them to the documentation.

       · Calculate and plot coverage with matplotlib, from Luca Beltrame.

       · Another way to visualize coverage for targeted NGS (exome) experiments with bedtools
         and R, from Stephen Turner.

       · Assess the efficiency of targeted enrichment sequencing with ngscat.

       This section provides useful concepts for  getting  started  digging  into  the  code  and
       contributing  new functionality. We welcome contributors and hope these notes help make it
       easier to get started.

DEVELOPMENT GOALS

       During development we seek  to  maximize  functionality  and  usefulness,  while  avoiding
       complexity.  Since  these  goals  are sometimes in conflict, it's useful to understand the
       design approaches:

       · Support high level  configurability  but  avoid  exposing  all  program  options.  Since
         pipelines  support  a wide variety of tools, each with a large number of options, we try
         to define configuration variables at high level based  on  biological  intent  and  then
         translate these into best-practice options for each tool. The goal is to avoid having an
         overwhelming number of input configuration options.

       · Provide best-practice pipelines that make recommended decisions for processing.
         Coupled with the goal of minimizing configuration parameters, this requires trust and
         discussion around algorithm choices. An example is bwa alignment, which uses bwa aln
         for reads shorter than 75bp and bwa mem for longer reads, based on recommendations
         from Heng Li. Our general goal is to encourage discussion and development of
         best-practices to make it easy to do the right thing.

       · Support extensive debugging output. In complex distributed systems, programs fail in
         unexpected ways even during production runs. We try to maximize logging to help
         identify and diagnose these types of unexpected problems.

       · Avoid  making mistakes. This results in being conservative about decisions like deleting
         file intermediates. Coupled with extensive logging, we trade off disk usage  for  making
         it  maximally  easy  to  restart and debug problems. If you'd like to delete work or log
         directories automatically, we recommend  doing  this  as  part  of  your  batch  scripts
         wrapping bcbio-nextgen.

       · Strive for a clean, readable code base. We aim to make the code a secondary source of
         information after hand-written docs. Practically, this means maximizing information
         content in source files while using in-line documentation to clarify as needed.

       · Focus  on  a  functional  coding  style with minimal use of global mutable objects. This
         approach works well with distributed code and isolates debugging to individual functions
         rather than globally mutable state.

       · Make sure your changes integrate correctly by running the test suite before submitting
         a pull request. The pipeline is automatically tested in Travis-CI, and a red label
         will appear in the pull request if your changes cause any issues.

OVERVIEW

       The most useful modules inside bcbio, ordered by likely interest:

       · pipeline  -- Top level functionality that drives the analysis pipeline. main.py contains
         top level definitions of pipelines like variant calling and  RNAseq,  and  is  the  best
         place to start understanding the overall organization of the code.

       · ngsalign  --  Integration  with aligners for high-throughput sequencing data. We support
         individual aligners with their own separate modules.

       · variation -- Tools for  variant  calling.  Individual  variant  calling  and  processing
         approaches each have their own submodules.

       · rnaseq -- Run RNA-seq pipelines, currently supporting TopHat/Cufflinks.

       · provenance  --  Track  third  party  software  versions, command lines and program flow.
         Handle writing of debugging details.

       · distributed -- Handle distribution of programs across multiple cores, or across multiple
         machines using IPython.

       · workflow  --  Provide  high  level  tools  to  run  customized  analyses.  They tie into
         specialized analyses or visual front ends  to  make  running  bcbio-nextgen  easier  for
         specific common tasks.

       · broad  --  Code  to  handle  calling  Broad tools like GATK and Picard, as well as other
         Java-based programs.

DEVELOPMENT INFRASTRUCTURE

       bcbio-nextgen uses GitHub for code development, and we welcome pull requests. GitHub
       makes it easy to establish custom forks of the code and contribute those back. The
       Biopython documentation has great information on using git and GitHub for a community
       developed project. In short, make a fork of the bcbio code by clicking the Fork button
       in the upper right corner of the GitHub page, commit your changes to this custom fork
       and keep it up to date with the main bcbio repository as you develop. The GitHub help
       pages have detailed information on keeping your fork updated with the main GitHub
       repository (e.g. https://help.github.com/articles/syncing-a-fork/). After committing
       changes, click New Pull Request from your fork when you'd like to submit your changes
       for integration in bcbio.

       For  developing and testing changes locally, you can install directly into a bcbio-nextgen
       installation. The automated bcbio-nextgen installer creates an isolated Python environment
       using   Anaconda.   This   will   be  a  subdirectory  of  your  installation  root,  like
       /usr/local/share/bcbio-nextgen/anaconda.  The  installer  also  includes  a   bcbio_python
       executable  target  which is the python in this isolated anaconda directory. You generally
       will want to make changes to your local copy of the bcbio-nextgen code  and  then  install
       these into the code directory.  To install from your bcbio-nextgen source tree for testing
       do:

          bcbio_python setup.py install

       One tricky part that we don't yet know how  to  work  around  is  that  pip  and  standard
       setup.py  install  have  different  ideas about how to write Python eggs. setup.py install
       will create an isolated python egg directory  like  bcbio_nextgen-0.7.5a-py2.7.egg,  while
       pip creates an egg pointing to a top level bcbio directory. Where this gets tricky is that
       the top level bcbio directory takes precedence. The best way to work around  this  problem
       is   to   manually   remove   the   current  pip  installed  bcbio-nextgen  code  (rm  -rf
       /path/to/anaconda/lib/python2.7/site-packages/bcbio*) before  managing  it  manually  with
       bcbio_python   setup.py  install.  We'd  welcome  tips  about  ways  to  force  consistent
       installation across methods.
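
       In practice the workaround looks like this (adjust the anaconda path to match your
       installation root):

          rm -rf /path/to/anaconda/lib/python2.7/site-packages/bcbio*
          bcbio_python setup.py install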

       If you want to test with bcbio_nextgen code in  a  separate  environment  from  your  work
       directory, we recommend using the installer to install only the bcbio code into a separate
       directory:

          python bcbio_nextgen_install.py /path/to/testbcbio --nodata --isolate

       Then add this directory to your PATH before your bcbio installation with the tools: export
       PATH=/path/to/testbcbio/anaconda/bin:$PATH,   or   directly   calling  the  testing  bcbio
       /path/to/testbcbio/anaconda/bin/bcbio_nextgen.py.

BUILDING THE DOCUMENTATION LOCALLY

       If you have added or modified this documentation and want to see how it looks, you can
       build it locally by running:

          cd docs
          make html

       The documentation will be built under docs/_build/html; open index.html with your
       browser to load your local build.

       If you want to use the same theme that Read the Docs uses, install sphinx_rtd_theme via
       pip. You will also need to add this to the docs/conf.py file to use the theme only
       locally:

          html_theme = 'default'
          on_rtd = os.environ.get('READTHEDOCS', False)
          if not on_rtd:  # only import and set the theme if we're building docs locally
              import sphinx_rtd_theme
              html_theme = 'sphinx_rtd_theme'
              html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]

ADDING TOOLS

   Aligner
       Write new aligners within their own submodule inside the ngsalign directory. bwa.py  is  a
       good  example  to  follow along with. There are two functions to implement, based on which
       type of alignment you'd like to allow:

       · align_bam -- Performs alignment given an input BAM file.  Expected to  return  a  sorted
         BAM output file.

       · align -- Performs alignment given FASTQ inputs (gzipped or not). This is generally
         expected to implement a unix-pipe based approach that minimizes intermediates and disk
         IO, returning a sorted BAM output file. For back-compatibility this can also return a
         text based SAM file.

       See the names section for more details on arguments.

       Other required implementation details include:

       · galaxy_loc_file -- Provides the name of the Galaxy loc file used to  identify  locations
         of  indexes  for  this  aligner.  The  automated  installer  sets  up  these  loc  files
         automatically.

       · remap_index_fn -- A function that remaps an index from the Galaxy location file into
         the exact one for this aligner. This is useful for tools which aren't supported by a
         Galaxy .loc file but can be located relative to another index.

       Once implemented, plug the aligner into  the  pipeline  by  defining  it  as  a  _tool  in
       bcbio/pipeline/alignment.py.  You  can then use it as normal by specifying the name of the
       aligner in the aligner section of your configuration input.
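
       As a rough sketch, a new module might define these hooks. The names and signatures below
       follow the conventions in bwa.py but are illustrative assumptions; check an existing
       module for the authoritative interface:

          import os

          galaxy_loc_file = "myaligner_indices.loc"  # hypothetical loc file name

          def remap_index_fn(ref_file):
              """Remap a Galaxy .loc index path to this aligner's own index."""
              return os.path.splitext(ref_file)[0] + ".myidx"

          def align(fastq_file, pair_file, ref_file, names, align_dir, data):
              """Align FASTQ inputs (single or paired), returning a sorted BAM file."""
              out_file = os.path.join(align_dir, "%s-sort.bam" % names["lane"])
              # ... run a piped alignment command that writes sorted BAM to out_file ...
              return out_file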

   Variant caller
       New variant calling approaches live within their own module  inside  bcbio/variation.  The
       freebayes.py  implementation  is  a  good example to follow for providing your own variant
       caller. Implement a function to run variant calling on multiple BAMs in  an  input  region
       that takes the following inputs:

       · align_bams -- A list of BAM files to call simultaneously.

       · items  --  List  of data dictionaries associated with each of the samples in align_bams.
         Enables customization of variant calling  based  on  sample  configuration  inputs.  See
         documentation  on  the  data dictionary for all of the information contained inside each
         data item. Having  multiple  configurations  allows  customization  of  sample  specific
         variant calls using parameters supplied to sample-configuration.

       · ref_file -- Fasta reference genome file.

       · assoc_files  -- Useful associated files for variant calling. This includes the DbSNP VCF
         file.  It's  a  named  tuple  mapping  to  files   specified   in   the   configuration.
         bcbio/pipeline/shared.py has the available inputs.

       · region -- A tuple of (chromosome, start, end) specifying the region to call in.

       · out_file -- The output file to write to. This should contain calls for all input
         samples in the supplied region.
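
       Putting those arguments together, a new caller's entry point might look like this
       minimal sketch (the function name is hypothetical; freebayes.py shows the real pattern):

          def run_mycaller(align_bams, items, ref_file, assoc_files,
                           region=None, out_file=None):
              """Call variants across align_bams in region, writing a VCF to out_file."""
              # ... construct and run the caller's command line here ...
              return out_file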

       Once implemented, add the variant caller into the pipeline by updating caller_fns  in  the
       variantcall_sample  function  in bcbio/variation/genotype.py. You can use it by specifying
       it in the variantcaller parameter of your sample configuration.

ADDING NEW ORGANISMS

       While bcbio-nextgen and supporting tools receive the most testing and development on human
       or  human-like  diploid organisms, the algorithms are generic and we strive to support the
       wide diversity of organisms used in your research. We welcome contributors  interested  in
       setting  up  and  maintaining  support  for  their  particular research organism, and this
       section defines the steps in integrating a new genome. We  also  welcome  suggestions  and
       implementations that improve this process.

       Setup CloudBioLinux to automatically download and prepare the genome:

       · Add  the  genome  database  key  and organism name to list of supported organisms in the
         CloudBioLinux configuration (config/biodata.yaml).

       · Add  download  details  to  specify  where  to   get   the   fasta   genome   files   (‐
         cloudbio/biodata/genomes.py).  CloudBioLinux  supports common genome providers like UCSC
         and Ensembl directly.

       Add the organism to the supported installs within bcbio:

       · This happens in two places: for the initial installer (scripts/bcbio_nextgen_install.py)
         and the updater (bcbio/install.py).

       Test  installation  of genomes by pointing to your local cloudbiolinux edits during a data
       installation:

          mkdir -p tmpbcbio-install
          ln -s ~/bio/cloudbiolinux tmpbcbio-install
          bcbio_nextgen.py upgrade --data --genomes DBKEY

       Add     configuration     information     to     bcbio-nextgen     by      creating      a
       config/genomes/DBKEY-resources.yaml  file.  Copy an existing minimal template like canFam3
       and edit with pointers to snpEff and other genome resources. The  VEP  database  directory
       has Ensembl names. SnpEff has a command to list available databases:

          snpEff databases

       Finally,  send  pull  requests  for  CloudBioLinux  and  bcbio-nextgen  and  we'll happily
       integrate the new genome.

       This will provide basic integration with bcbio and allow running a minimal  pipeline  with
       alignment  and quality control. We also have utility scripts in CloudBioLinux to help with
       preparing dbSNP (utils/prepare_dbsnp.py) and RNA-seq  (utils/prepare_tx_gff.py)  resources
       for some genomes. For instance, to prepare RNA-seq transcripts for mm9:

          bcbio_python prepare_tx_gff.py --genome-dir /path/to/bcbio/genomes Mmusculus mm9

       We  are  still  working  on  ways  to best include these as part of the standard build and
       install since they either require additional tools to run locally,  or  require  preparing
       copies in S3 buckets.

STANDARD FUNCTION ARGUMENTS

   names
       This dictionary provides lane and other BAM run group naming information used to correctly
       build BAM files. We use the rg attribute as the ID within a BAM file:

          {'lane': '7_100326_FC6107FAAXX',
           'pl': 'illumina',
           'pu': '7_100326_FC6107FAAXX',
           'rg': '7',
           'sample': 'Test1'}

   data
       The data dictionary is a large dictionary representing processing, configuration and
       files associated with a sample. The standard workflow is to pass this dictionary between
       functions, updating it with associated files from the additional processing. Populating
       this dictionary only with standard types allows serialization to JSON for distributed
       processing.

       The dictionary is dynamic throughout the workflow depending on the step, but some  of  the
       most useful key/values available throughout are:

       · config  --  Input  configuration variables about how to process in the algorithm section
         and locations of programs in the resources section.

       · dirs -- Useful directories for building output files or retrieving inputs.

       · metadata -- Top level metadata associated  with  a  sample,  specified  in  the  initial
         configuration.

       · genome_resources -- Naming aliases and files associated with the current genome build.
         Retrieved from organism specific configuration files (buildname-resources.yaml), this
         specifies the location of supplemental organism specific files like support files for
         variation and RNA-seq analysis.

       It also contains information on the genome build, sample name and reference genome file
       throughout. Here's an example of these inputs:

          {'config': {'algorithm': {'aligner': 'bwa',
                                    'callable_regions': 'analysis_blocks.bed',
                                    'coverage_depth': 'low',
                                    'coverage_interval': 'regional',
                                    'mark_duplicates': 'samtools',
                                    'nomap_split_size': 50,
                                    'nomap_split_targets': 20,
                                    'num_cores': 1,
                                    'platform': 'illumina',
                                    'quality_format': 'Standard',
                                    'realign': 'gkno',
                                    'recalibrate': 'gatk',
                                    'save_diskspace': True,
                                    'upload_fastq': False,
                                    'validate': '../reference_material/7_100326_FC6107FAAXX-grade.vcf',
                                    'variant_regions': '../data/automated/variant_regions-bam.bed',
                                    'variantcaller': 'freebayes'},
                      'resources': {'bcbio_variation': {'dir': '/usr/share/java/bcbio_variation'},
                                    'bowtie': {'cores': None},
                                    'bwa': {'cores': 4},
                                    'cortex': {'dir': '~/install/CORTEX_release_v1.0.5.14'},
                                    'cram': {'dir': '/usr/share/java/cram'},
                                    'gatk': {'cores': 2,
                                             'dir': '/usr/share/java/gatk',
                                             'jvm_opts': ['-Xms750m', '-Xmx2000m'],
                                             'version': '2.4-9-g532efad'},
                                    'gemini': {'cores': 4},
                                    'novoalign': {'cores': 4,
                                                  'memory': '4G',
                                                  'options': ['-o', 'FullNW']},
                                    'picard': {'cores': 1,
                                               'dir': '/usr/share/java/picard'},
                                    'snpEff': {'dir': '/usr/share/java/snpeff',
                                               'jvm_opts': ['-Xms750m', '-Xmx3g']},
                                    'stampy': {'dir': '~/install/stampy-1.0.18'},
                                    'tophat': {'cores': None},
                                    'varscan': {'dir': '/usr/share/java/varscan'},
                                    'vcftools': {'dir': '~/install/vcftools_0.1.9'}}},
          'genome_resources': {'aliases': {'ensembl': 'human',
                                            'human': True,
                                            'snpeff': 'hg19'},
                                'rnaseq': {'transcripts': '/path/to/rnaseq/ref-transcripts.gtf',
                                           'transcripts_mask': '/path/to/rnaseq/ref-transcripts-mask.gtf'},
                                'variation': {'dbsnp': '/path/to/variation/dbsnp_132.vcf',
                                              'train_1000g_omni': '/path/to/variation/1000G_omni2.5.vcf',
                                              'train_hapmap': '/path/to/hg19/variation/hapmap_3.3.vcf',
                                              'train_indels': '/path/to/variation/Mills_Devine_2hit.indels.vcf'},
                                'version': 1},
           'dirs': {'fastq': 'input fastq directory',
                        'galaxy': 'directory with galaxy loc and other files',
                        'work': 'base work directory'},
           'metadata': {'batch': 'TestBatch1'},
           'genome_build': 'hg19',
           'name': ('', 'Test1'),
           'sam_ref': '/path/to/hg19.fa'}

       Processing  also  injects  other  useful  key/value pairs. Here's an example of additional
       information supplied during a variant calling workflow:

          {'prep_recal': 'Test1/7_100326_FC6107FAAXX-sort.grp',
           'summary': {'metrics': [('Reference organism', 'hg19', ''),
                                   ('Total', '39,172', '76bp paired'),
                                   ('Aligned', '39,161', '(100.0\\%)'),
                                   ('Pairs aligned', '39,150', '(99.9\\%)'),
                                   ('Pair duplicates', '0', '(0.0\\%)'),
                                   ('Insert size', '152.2', '+/- 31.4')],
                       'pdf': '7_100326_FC6107FAAXX-sort-prep-summary.pdf',
                       'project': 'project-summary.yaml'},
           'validate': {'concordant': 'Test1-ref-eval-concordance.vcf',
                        'discordant': 'Test1-eval-ref-discordance-annotate.vcf',
                        'grading': 'validate-grading.yaml',
                        'summary': 'validate-summary.csv'},
           'variants': [{'population': {'db': 'gemini/TestBatch1-freebayes.db',
                                        'vcf': None},
                         'validate': None,
                         'variantcaller': 'freebayes',
                         'vrn_file': '7_100326_FC6107FAAXX-sort-variants-gatkann-filter-effects.vcf'}],
           'vrn_file': '7_100326_FC6107FAAXX-sort-variants-gatkann-filter-effects.vcf',
           'work_bam': '7_100326_FC6107FAAXX-sort-prep.bam'}

PARALLELIZATION FRAMEWORK

       bcbio-nextgen supports parallel runs on local machines using multiple cores and
       distributed runs on a cluster using IPython, via a general framework.

       The first parallelization step starts up a set of resources for processing. On a cluster
       this spawns an IPython parallel controller and a set of engines for processing. The prun
       (parallel run) start function is the entry point to spawning the cluster; its main
       argument is a parallel dictionary containing arguments for the engine processing command.
       Here is an example input from an IPython parallel run:

          {'cores': 12,
           'type': 'ipython',
           'progs': ['aligner', 'gatk'],
           'ensure_mem': {'star': 30, 'tophat': 8, 'tophat2': 8},
           'module': 'bcbio.distributed',
           'queue': 'batch',
           'scheduler': 'torque',
           'resources': [],
           'retries': 0,
           'tag': '',
           'timeout': 15}

       The cores and type arguments must be present, identifying the total cores to use and the
       type of processing, respectively. The arguments that follow help identify the resources
       to use. progs lists the programs used: here the aligner, which bcbio looks up from the
       input sample file, and gatk. ensure_mem is an optional argument specifying minimum memory
       requirements for programs used in the workflow. The remaining arguments are specific to
       IPython, helping it spin up engines on the appropriate computing cluster.

       A shared component of all processing runs is identification of the programs used, taken
       from the progs argument. The run creation process looks up the required memory and CPU
       resources for each program in the config-resources section of your bcbio_system.yaml
       file. It combines these into overall memory and core requirements using the logic
       described in the memory-management section of the parallel documentation. Passing these
       requirements to the cluster creation process ensures the available machines match the
       program requirements.
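
       As a sketch, config-resources entries in bcbio_system.yaml associate each program with
       cores, memory and Java options (the program names and values below are illustrative;
       consult your own bcbio_system.yaml for real settings):

          resources:
            default:
              cores: 16
              memory: 3G                           # per-core memory
              jvm_opts: ["-Xms750m", "-Xmx3500m"]
            bwa:
              cores: 16
            gatk:
              jvm_opts: ["-Xms500m", "-Xmx3500m"]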

       bcbio-nextgen's pipeline.main code contains examples of starting and using a set of
       available processing engines. This example starts up machines that use samtools, gatk and
       cufflinks, then runs an RNA-seq expression analysis:

          with prun.start(_wprogs(parallel, ["samtools", "gatk", "cufflinks"]),
                          samples, config, dirs, "rnaseqcount") as run_parallel:
              samples = rnaseq.estimate_expression(samples, run_parallel)

       The  pipelines  often reuse a single set of machines for multiple distributed functions to
       avoid the overhead of starting up and tearing down machines and clusters.

       The run_parallel function returned from prun.start enables running jobs in parallel on
       the created machines. The ipython wrapper code contains examples of implementing this. It
       is a simple function that takes two arguments: the name of the function to run and a set
       of multiple arguments to pass to that function:

          def run(fn_name, items):

       The items argument must contain only strings, lists and dictionaries so it can be
       serialized to JSON. The internals of the run function take care of running all of the
       code in parallel and returning the results to the caller.
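
       As a rough multicore-style sketch of that contract (the function registry below is
       hypothetical; the real lookups happen in bcbio.distributed, and the multicore and ipython
       wrappers are the actual implementations):

          import json
          from multiprocessing import Pool

          # Hypothetical registry mapping function names to module-level
          # functions; the real code resolves these in bcbio.distributed.
          FUNCTIONS = {}

          def run(fn_name, items):
              """Run fn_name over each argument set in items, in parallel."""
              fn = FUNCTIONS[fn_name]
              # A JSON round trip mirrors what a distributed backend does when
              # shipping arguments to remote engines, and enforces the
              # strings/lists/dictionaries restriction described above.
              items = json.loads(json.dumps(items))
              with Pool() as pool:
                  return pool.starmap(fn, items)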

       In this setup, the main processing code is fully independent of the parallel method used,
       so running on a single multicore machine or in parallel on a cluster returns identical
       results and requires no changes to the logical code defining the pipeline.

       During re-runs, we avoid the expense of spinning up processing clusters for completed
       tasks using simple checkpoint files in the checkpoints_parallel directory. The prun.start
       wrapper writes these on completion of processing for a group of tasks with the same
       parallel architecture, and on subsequent runs goes through these on the local machine
       instead of parallelizing. The processing code supports these quick re-runs by checking
       for existing output files and skipping tasks that already completed.
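
       The mechanism amounts to marker files; a minimal sketch of the idea, with invented file
       naming:

          import contextlib
          import os

          @contextlib.contextmanager
          def checkpoint(checkpoint_dir, name):
              """Yield True when this group of tasks already completed."""
              os.makedirs(checkpoint_dir, exist_ok=True)
              marker = os.path.join(checkpoint_dir, name + ".done")
              if os.path.exists(marker):
                  yield True                  # skip cluster startup on re-runs
              else:
                  yield False
                  open(marker, "w").close()   # record completion for next time

       Callers wrap a group of tasks with this context manager and only start a cluster when
       the yielded value is False.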

       Plugging new parallelization approaches into this framework involves writing interface
       code that handles two steps. First, create a cluster of ready-to-run machines given the
       parallel function with expected core and memory utilization:

       · num_jobs -- Total number of machines to start.

       · cores_per_job -- Number of cores available on each machine.

       · mem -- Expected memory needed for each machine.  Divide  by  cores_per_job  to  get  the
         memory usage per core on a machine.

       Second, implement a run_parallel function that handles using these resources to distribute
       jobs and return results. The multicore wrapper and ipython  wrapper  are  useful  starting
       points for understanding the current implementations.
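
       Schematically, a new backend supplies these two pieces (class and method names are
       illustrative; the multicore and ipython wrappers show the real interfaces):

          class Backend:
              def start(self, num_jobs, cores_per_job, mem):
                  """Provision num_jobs machines, each with cores_per_job cores
                  and mem GB of memory (mem / cores_per_job GB per core)."""
                  raise NotImplementedError

              def run_parallel(self, fn_name, items):
                  """Distribute fn_name over the argument sets in items on the
                  provisioned machines and return the collected results."""
                  raise NotImplementedError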

OVERVIEW

         [image: Variant calling overview] [image]

PARALLEL

       bcbio calculates callable regions following alignment using goleft depth. These regions
       determine breakpoints for analysis, allowing parallelization by genomic regions during
       variant calling. Defining non-callable regions lets bcbio select breakpoints for
       parallelization within chromosomes where splitting won't accidentally impact small
       variant detection. The callable regions also supplement the variant calls, defining
       positions where uncalled bases are homozygous reference as opposed to true no-calls with
       no evidence. Callable regions plus variant calls are an alternative to gVCF output, which
       explicitly enumerates reference calls in the output variant file.
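
       Schematically, the callable regions arrive as a BED file (coordinates invented) whose
       non-CALLABLE gaps provide safe within-chromosome split points:

          chr6    0         1003580   NO_COVERAGE
          chr6    1003580   30024360  CALLABLE
          chr6    30024360  30024780  LOW_COVERAGE
          chr6    30024780  58000000  CALLABLE
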
         [image: Parallel approaches] [image] Overview of cluster types during parallel
         execution.

PRESENTATIONS

       · Variant calling and bcbio training for the Harvard Chan Bioinformatics Core In Depth
         NGS Data Analysis Course (10 October 2018): slides

       · Building a diverse set of validations; lightning talk at the GCCBOSC2018  Bioinformatics
         Community Conference: slides

       · bcbio training at the GCCBOSC2018 Bioinformatics Community Conference, focusing on
         bcbio CWL integration with examples of variant calling analyses on Personal Genome
         Project data (26 June 2018): slides; video

       · Description  of  bcbio  and  Common Workflow integration with a focus on parallelization
         strategies. From a bcbio discussion with Peter Park's lab at Harvard Medical School  (26
         January 2018): slides

       · In  depth  description  of  bcbio  and  Common  Workflow Language integration, including
         motivation and practical examples of running on  clusters,  DNAnexus,  SevenBridges  and
         Arvados.  From  the  Boston  Bioinformatics  Interest  Group  meeting (2 November 2017):
         slides; video

       · bcbio practical interoperability with the Common Workflow Language at BOSC 2017 (22 July
         2017): slides; video

       · Teaching variant calling, bcbio and GATK4 validation at the Summer 2017 NGS Data
         Analysis Course at Harvard Chan School (6 July 2017): slides

       · Training course for the Cancer Genomics Cloud, describing how bcbio uses the Common
         Workflow Language to run in multiple infrastructures (1 May 2017): slides

       · MIT   Bioinformatics   Interest   Group  about  how  Common  Workflow  Language  enables
         interoperability with multiple workflow engines (3 November 2016): slides and video

       · Broad Institute software engineering seminar about bcbio validation and integration with
         Common Workflow Language and Workflow Definition Language (28 September 2016): slides

       · Materials  from  teaching  at  the  Summer 2016 NGS Data Analysis Course at Harvard Chan
         School (11 August 2016): slides

       · Bioinformatics Open Source Conference (BOSC) 2016 lightning talk  on  bcbio  and  common
         workflow language (8 July 2016): slides and video.

       · Materials  from  teaching  from the Spring 2016 NGS Data Analysis Course at Harvard Chan
         School (28 April 2016): slides

       · Statistical Genetics and  Network  Science  Meeting  at  Channing  Division  of  Network
         Medicine (23 March 2016): slides

       · Presentation  at  Curoverse  Brown Bag Seminar on bcbio and in progress integration work
         with Common Workflow Language and Arvados (11 January 2016): slides

       · Materials from a teaching-oriented example at Cold Spring Harbor Laboratory's Advanced
         Sequencing Technology and Applications course (18 November 2015): slides

       · Supporting the Common Workflow Language and Docker in bcbio, at the Bio in Docker
         symposium (9 November 2015): slides

       · Validation on human build 38, HLA typing, low frequency cancer  calling  and  structural
         variation  for  Boston  Bioinformatics  Interest  Group (BIG) meeting (5 November 2015):
         slides

       · Presentation on Research Scientist Careers for  Iowa  State  Bioinformatics  Course  (23
         September 2015): slides

       · Prioritization of structural variants based on known biological information at BOSC 2015
         (10 July 2015): slides; video

       · Overview of variant calling for NGS Data Analysis Course at Harvard Medical  School  (19
         May 2015): slides

       · NGS Glasgow (23 April 2015): slides

       · Boston Computational Biology and Bioinformatics meetup (1 April 2015): slides

       · Program  in Genetic Epidemiology and Statistical Genetics seminar series at Harvard Chan
         School (6 February 2015): slides

       · Talk at Good Start Genetics (23 January 2015): slides

       · Boston area Bioinformatics Interest Group (15 October 2014): slides

       · University of Georgia Institute of Bioinformatics (12 September 2014): slides

       · Intel Life Sciences discussion (7 August 2014): slides

       · Bioinformatics Open Source Conference (BOSC) 2014: slides, conference website

       · Galaxy Community Conference 2014: slides, conference website

       · bcbio hackathon at Biogen (3 June 2014)

       · Harvard ABCD group slides (17 April 2014)

       · BIG meeting (February 2014)

       · Novartis slides (21 January 2014)

       · Mt Sinai: Strategies for accelerating the genomic sequencing pipeline: Mt Sinai workshop
         slides, Mt Sinai workshop website

       · Genome Informatics 2013: GI 2013 presentation slides

       · Bioinformatics Open Source Conference 2013: BOSC 2013 Slides, BOSC 2013 Video, BOSC 2013
         Conference website

       · Arvados Summit 2013: Arvados Summit Slides, Arvados Summit website

       · Scientific Python 2013: SciPy 2013 Video, SciPy 2013 Conference website

       Feel free to reuse any images or text from these talks. The slides are on GitHub.

ABSTRACT

       Community Development of Validated Variant Calling Pipelines

       Brad Chapman, Rory Kirchner, Oliver Hofmann and Winston Hide. Harvard School of Public
       Health, Bioinformatics Core, Boston, MA 02115

       Translational  research  relies  on  accurate identification of genomic variants. However,
       rapidly changing best practice approaches in alignment and variant calling,  coupled  with
       large  data  sizes, make it a challenge to create reliable and reproducible variant calls.
       Coordinated community development can help overcome these challenges  by  sharing  testing
       and   updates   across   multiple   groups.   We  describe  bcbio-nextgen,  a  distributed
       multi-architecture pipeline that automates variant calling, validation and organization of
       results   for  query  and  visualization.  It  creates  an  easily  installable,  reliable
       infrastructure from best-practice open source tools with the following goals:

       · Quantifiable: Validates variant calls against known reference materials developed by the
         Genome  in  a  Bottle  consortium.  The  bcbio.variation  toolkit  automates scoring and
         assessment of calls  to  identify  regressions  in  variant  identification  as  calling
         pipelines evolve. Incorporation of multiple variant calling approaches from Broad's GATK
         best practices and the Marth lab's gkno software enables  informed  comparisons  between
         current and future algorithms.

       · Scalable:  bcbio-nextgen  handles large population studies with hundreds of whole genome
         samples by parallelizing on a wide variety of schedulers and multicore machines, setting
         up  different  ad  hoc  cluster  configurations for each workflow step. Work in progress
         includes integration with  virtual  environments,  including  Amazon  Web  Services  and
         OpenStack.

       · Accessible:  Results  automatically  feed  into  tools  for  query  and investigation of
         variants. The GEMINI framework provides a queryable database associating variants with a
         wide variety of genome annotations. The o8 web-based tool visualizes the work of variant
         prioritization and assessment.

       · Community developed: bcbio-nextgen is widely used in  multiple  sequencing  centers  and
         research  laboratories.  We actively encourage contributors to the code base and make it
         easy to get started with a fully automated installer and updater that prepares all third
         party software and reference genomes.

LINKS FROM THE PRESENTATION

       · HugeSeq

       · Genome Comparison & Analytic Testing at Bioplanet

       · Peter Block’s “Community” book

       · CloudBioLinux   and  Homebrew  Science  as  installation  frameworks;  Conda  as  Python
         environment

       · bcbio documentation at ReadTheDocs

       · Arvados framework for meta data tracking, NGS processing and data provenance

       · Notes on improved scaling for NGS workflows

       · Genomic Reference Materials from Genome in a Bottle

       · Comparison of aligners and callers using NIST reference materials

       · Callers and minimal BAM preparation workflows

       · Coverage assessment

TEACHING EXAMPLE: CANCER TUMOR/NORMAL

       This is a teaching-oriented example of using bcbio, from Cold Spring Harbor Laboratory's
       Advanced Sequencing Technology and Applications course. It uses cancer tumor/normal data
       from the ICGC-TCGA DREAM synthetic 3 challenge, subset to exomes on chromosome 6 to
       reduce runtimes. It demonstrates:

       · Running a cancer tumor/normal workflow through bcbio.

       · Analysis with human genome build 38.

       · SNP and indel detection, with 3 variant callers and an ensemble method.

       · Structural variant calling, with 2 callers.

       · Prioritization of structural variants for cancer-associated genes in CIViC.

       · HLA typing.

       · Validation of both small and structural variants against truth sets.

LOADING PRE-RUN ANALYSIS

       To save downloading the genome data and running the analysis yourself, we provide a
       pre-prepared AMI with the data and the analysis already run. Use the AWS Console to
       launch the pre-built AMI -- search Community AMIs for ami-5e84fe34. Any small instance
       type is fine for exploring the configuration, run directory and output files. Make sure
       you associate a public IP and a security group that allows remote ssh.
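
       Equivalently, from the AWS command line (the keypair and security group values are
       placeholders to fill in with your own):

          aws ec2 run-instances --image-id ami-5e84fe34 --instance-type t2.medium \
              --key-name your-keypair --security-group-ids sg-xxxxxxxx \
              --associate-public-ip-address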

       Once launched, ssh into the remote machine with ssh -i your-keypair
       ubuntu@public.ip.address to explore the inputs and outputs. The default PATH contains
       bcbio and third party programs in /usr/local/bin, with the biological data installed in
       /usr/local/share/bcbio. The run is in ~/run/cancer-syn3-chr6.

INPUT CONFIGURATION FILE

       To run bcbio, you prepare a small configuration file describing your analysis. You can
       prepare it manually or use an automated configuration method. The example has a
       pre-written configuration file with tumor/normal data located in the config directory;
       this section walks through the settings.
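
       For reference, the automated route uses bcbio's template workflow, along the lines of
       the following (template name and input files illustrative):

          bcbio_nextgen.py -w template tumor-paired project.csv \
              sample-tumor_1.fq.gz sample-tumor_2.fq.gz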

       You define the type of analysis (variant calling) along with the input  files  and  genome
       build:

          analysis: variant2
          files: [../input/cancer-syn3-chr6-tumor-1.fq.gz, ../input/cancer-syn3-chr6-tumor-2.fq.gz]
          genome_build: hg38

       Sample  description  and  assignment  as  a  tumor  sample, called together with a matched
       normal:

          description: syn3-tumor
          metadata:
            batch: syn3
            phenotype: tumor
            sex: female

       Next it defines parameters for running the analysis. First we pick our aligner (bwa mem):

          algorithm:
            aligner: bwa

       Post-alignment, we mark duplicates but do not perform recalibration and realignment:

          mark_duplicates: true
          recalibrate: false
          realign: false

       We call variants in exome regions on chromosome 6 using a BED file input, call variants
       down to a 2% allele fraction in the tumor sample, and use 3 variant callers along with an
       ensemble method that keeps calls found by at least 2 of the 3:

          variant_regions: ../input/NGv3-chr6-hg38.bed
          min_allele_fraction: 2
          variantcaller: [vardict, freebayes, varscan]
          ensemble:
            numpass: 2

       For structural variant calling, we use two callers and prioritize variants to those  found
       in the CIViC database:

          svcaller: [lumpy, manta]
          svprioritize: cancer/civic

       Call HLA types with OptiType:

          hlacaller: optitype

       Finally, we validate both the small variants and structural variants. These use
       pre-installed validation sets that come with bcbio. We limit validation regions to avoid
       low complexity regions, which cause bias in validating indels
       (http://bcb.io/2014/05/12/wgs-trio-variant-evaluation/):

          exclude_regions: [lcr]
          validate: dream-syn3-crossmap/truth_small_variants.vcf.gz
          validate_regions: dream-syn3-crossmap/truth_regions.bed
          svvalidate:
            DEL: dream-syn3-crossmap/truth_DEL.bed
            DUP: dream-syn3-crossmap/truth_DUP.bed
            INV: dream-syn3-crossmap/truth_INV.bed

OUTPUT FILES

       Output  files  are in ~/run/cancer-syn3-chr6/final, extracted from the full work directory
       in ~/run/cancer-syn3-chr6/work.

       The directories with sample information are in syn3-tumor/. Aligned BAMs include a
       -ready.bam file with all of the original reads (including split and discordant reads)
       and separate files with only the split (-sr.bam) and discordant (-disc.bam) reads:

          syn3-tumor-ready.bam
          syn3-tumor-ready.bam.bai
          syn3-tumor-sr.bam
          syn3-tumor-sr.bam.bai
          syn3-tumor-disc.bam
          syn3-tumor-disc.bam.bai

       SNP and indel calls for 3 callers, plus combined ensemble calls:

          syn3-tumor-ensemble.vcf.gz
          syn3-tumor-ensemble.vcf.gz.tbi
          syn3-tumor-freebayes.vcf.gz
          syn3-tumor-freebayes.vcf.gz.tbi
          syn3-tumor-varscan.vcf.gz
          syn3-tumor-varscan.vcf.gz.tbi
          syn3-tumor-vardict.vcf.gz
          syn3-tumor-vardict.vcf.gz.tbi

       Structural variant calls for 2 callers, plus a simplified list of structural  variants  in
       cancer genes of interest:

          syn3-tumor-sv-prioritize.tsv
          syn3-tumor-lumpy.vcf.gz
          syn3-tumor-lumpy.vcf.gz.tbi
          syn3-tumor-manta.vcf.gz
          syn3-tumor-manta.vcf.gz.tbi

       HLA typing results:

          syn3-tumor-hla-optitype.csv

       Validation results from comparisons against the truth sets, including plots:

          syn3-tumor-sv-validate.csv
          syn3-tumor-sv-validate-DEL.png
          syn3-tumor-sv-validate-df.csv
          syn3-tumor-sv-validate-DUP.png
          syn3-tumor-sv-validate-INV.png
          syn3-tumor-validate.png

       The top level directory for the project, 2015-11-18_syn3-cshl/, has files relevant to
       the entire run. There is a consolidated quality control report:

          multiqc/multiqc_report.html

       Provenance information, with log files of all commands run and program versions used:

          bcbio-nextgen.log
          bcbio-nextgen-commands.log
          programs.txt
          data_versions.csv

       A top level summary of metrics for alignment, variant calling and coverage that is  useful
       downstream:

          project-summary.yaml

PREPARING AND RUNNING

       These are the steps to prepare an AMI from a bare machine and run the analysis. They are
       pre-done on the teaching AMI to save time:

       1. Use the AWS Console to launch an Ubuntu Server 14.04 (ami-d05e75b8). Start an
          m4.4xlarge instance with a 100GB SSD. Make sure you associate a public IP and can
          ssh in externally.

       2. SSH to your instance:

             ssh -i ~/.ec2/your-key.pem ubuntu@public-ip

       3. Install bcbio with hg38 data:

             sudo apt-get update
             sudo apt-get install -y build-essential zlib1g-dev wget curl python-setuptools git \
                                     openjdk-7-jdk openjdk-7-jre ruby libncurses5-dev libcurl4-openssl-dev \
                                     libbz2-dev unzip pigz bsdmainutils
             wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
             python bcbio_nextgen_install.py /usr/local/share/bcbio --tooldir /usr/local \
                    --genomes hg38 --aligners bwa --sudo --isolate -u development

       4. Install the analysis data:

             mkdir -p run
             cd run
             wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/teaching/cancer-syn3-chr6-prep.sh
             bash cancer-syn3-chr6-prep.sh

       5. Run the analysis:

             cd cancer-syn3-chr6/work
             bcbio_nextgen.py ../config/cancer-syn3-chr6.yaml -n 16


SMALL RNA-SEQ

       Data was analyzed with bcbio-nextgen (https://github.com/bcbio/bcbio-nextgen) using
       DNApi to detect the adapter, cutadapt to remove it, STAR/bowtie to align against the
       genome and seqcluster to detect small RNA transcripts. miRNAs were detected using the
       miraligner tool with miRBase as the reference miRNA database. tRNA profiles were
       detected using the tdrmapper tool. mirdeep2 was used for discovery of novel miRNAs.
       FastQC was used for QC metrics and multiqc for reporting.

       Download BIB format:
       https://github.com/bcbio/bcbio-nextgen/tree/master/docs/contents/misc/bcbio-smallrna.bib

   Tools
       · Tsuji J, Weng Z. (2016) DNApi: A De Novo Adapter Prediction Algorithm for Small RNA
         Sequencing Data. PLoS ONE, 11(10):e0164228.
         http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164228

       · Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data.
         Babraham Bioinformatics.

       · Didion,  J.  P.,  Martin, M., & Collins, F. S. (2017). Atropos: specific, sensitive, and
         speedy trimming of sequencing reads. http://doi.org/10.7287/peerj.preprints.2452v4

       · Dale, R. K., Pedersen, B. S., & Quinlan, A. R. (2011).  Pybedtools:  A  flexible  Python
         library  for  manipulating  genomic  datasets  and  annotations. Bioinformatics, 27(24),
         3423–3424. doi:10.1093/bioinformatics/btr539

       · Quinlan, A. R., & Hall, I. M. (2010).  BEDTools:  A  flexible  suite  of  utilities  for
         comparing       genomic       features.       Bioinformatics,       26(6),      841–842.
         doi:10.1093/bioinformatics/btq033

       · Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J., & Prins,  P.  (2015).  Sambamba:
         Fast   processing   of   NGS   alignment  formats.  Bioinformatics,  31(12),  2032–2034.
         doi:10.1093/bioinformatics/btv098

       · Heger,       A.       (2009).       Pysam.       github.com.       Retrieved        from
         https://github.com/pysam-developers/pysam

       · Li,  H. (2011). A statistical framework for SNP calling, mutation discovery, association
         mapping  and  population  genetical   parameter   estimation   from   sequencing   data.
         Bioinformatics, 27(21), 2987–2993. doi:10.1093/bioinformatics/btr509

       · Li,  H.,  Handsaker,  B.,  Wysoker,  A.,  Fennell, T., Ruan, J., Homer, N., … Durbin, R.
         (2009).  The  Sequence  Alignment/Map  format  and  SAMtools.  Bioinformatics,   25(16),
         2078–2079. doi:10.1093/bioinformatics/btp352

       · Pantano,  L.,  Estivill, X., & Martí, E. (2010). SeqBuster, a bioinformatic tool for the
         processing and analysis of small RNAs datasets, reveals ubiquitous  miRNA  modifications
         in   human   embryonic  cells.  Nucleic  Acids  Research,  38(5),  e34.  Retrieved  from
         http://www.ncbi.nlm.nih.gov/pubmed/20008100

       · Pantano, L., Friedlander, M. R.,  Escaramis,  G.,  Lizano,  E.,  Pallares-Albanell,  J.,
         Ferrer,  I.,  …  Marti,  E.  (2015).  Specific  small-RNA  signatures in the amygdala at
         premotor and motor stages of Parkinson’s disease revealed by deep  sequencing  analysis.
         Bioinformatics (Oxford, England). doi:10.1093/bioinformatics/btv632

       For the alignment step, add the citation for the aligner you used:

       · Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., … Gingeras,
         T. R. (2013). STAR: Ultrafast universal RNA-seq aligner. Bioinformatics,  29(1),  15–21.
         doi:10.1093/bioinformatics/bts635

       · Langmead,  B.,  Trapnell,  C.,  Pop,  M.,  &  Salzberg,  S.  L.  (2009).  Ultrafast  and
         memory-efficient alignment of short DNA sequences to the human genome.  Genome  Biology,
         10, R25. doi:10.1186/gb-2009-10-3-r25

       · Kim, D., Langmead, B. & Salzberg, SL. (2015). HISAT: a fast spliced aligner with low
         memory requirements. Nature Methods, 12(4): 357–360. doi: 10.1038/nmeth.3317

       If you used TopHat2 for alignment:

       · Kim, D., Pertea, G., Trapnell, C., Pimentel, H.,  Kelley,  R.  &  Salzberg  SL.  (2013).
         TopHat2:  accurate  alignment of transcriptomes in the presence of insertions, deletions
         and gene fusions. Genome Biology, 14(4): R36. doi: 10.1186/gb-2013-14-4-r36

       · Brueffer, C. &  Saal,  LH.  (2016).  TopHat-Recondition:  A  post-processor  for  TopHat
         unmapped reads. BMC Bioinformatics, 17(1):199. doi: 10.1186/s12859-016-1058-x

       If your output includes novel miRNA discovery, add:

       · Friedlander,  M. R., MacKowiak, S. D., Li, N., Chen, W., & Rajewsky, N. (2012). MiRDeep2
         accurately identifies known and hundreds of novel microRNA genes in seven animal clades.
         Nucleic Acids Research, 40(1), 37–52. doi:10.1093/nar/gkr688

       If you have tRNA mapping output, add:

       · Selitsky,  S.  R.,  &  Sethupathy,  P.  (2015).  tDRmapper:  challenges and solutions to
         mapping, naming, and quantifying tRNA-derived RNAs from human small RNA-sequencing data.
         BMC Bioinformatics, 16(1), 354. doi:10.1186/s12859-015-0800-0

       If you have miRge activated, add:

       · Yin   Lu,   Alexander   S.  Baras,  Marc  K  Halushka.  miRge2.0:  An  updated  tool  to
         comprehensively analyze microRNA sequencing data. bioRxiv.org.

       If you have MINTmap activated, add:

       · Loher, P, Telonis, AG, Rigoutsos, I. MINTmap: fast and exhaustive profiling  of  nuclear
         and  mitochondrial  tRNA fragments from short RNA-seq data. Sci Rep. 2017;7 :41184. doi:
         10.1038/srep41184. PubMed PMID:28220888 PubMed Central PMC5318995.

   Data
       · Griffiths-Jones, S. (2004). The microRNA Registry. Nucleic Acids  Research,  32(Database
         issue), D109–11. doi:10.1093/nar/gkh023

       · Griffiths-Jones,  S.  (2006).  miRBase:  the  microRNA  sequence  database.  Methods  in
         Molecular Biology (Clifton, N.J.), 342, 129–38. doi:10.1385/1-59745-123-1:129

       · Griffiths-Jones, S., Saini, H. K., Van Dongen, S., & Enright,  A.  J.  (2008).  miRBase:
         Tools    for    microRNA    genomics.    Nucleic    Acids    Research,   36(SUPPL.   1).
         doi:10.1093/nar/gkm952

       · Kozomara, A., & Griffiths-Jones, S. (2011). MiRBase: Integrating microRNA annotation and
         deep-sequencing data. Nucleic Acids Research, 39(SUPPL. 1). doi:10.1093/nar/gkq1027

       · Kozomara,  A.,  &  Griffiths-Jones,  S.  (2014).  MiRBase:  Annotating  high  confidence
         microRNAs   using   deep   sequencing   data.   Nucleic    Acids    Research,    42(D1).
         doi:10.1093/nar/gkt1181


AUTHOR

       Brad Chapman

COPYRIGHT

       2019, bcbio-nextgen contributors