Provided by: bcbio_1.1.7-1_all

NAME

       bcbio_nextgen - bcbio_nextgen Documentation

       A Python toolkit providing best-practice pipelines for fully automated high throughput
       sequencing analysis. You write a high level configuration file specifying your inputs and
       analysis parameters. This input drives a parallel pipeline that handles distributed
       execution, idempotent processing restarts and safe transactional steps. The goal is to
       provide a shared community resource that handles the data processing component of
       sequencing analysis, giving researchers more time to focus on the downstream biology.

       bcbio-nextgen  provides  best-practice pipelines for automated analysis of high throughput
       sequencing data with the goal of being:

       · Quantifiable: Doing good science requires being able to accurately assess the quality of
         results and re-verify approaches as new algorithms and software become available.

       · Analyzable: Results feed into tools to make it easy to query and visualize the results.

       · Scalable:  Handle  large  datasets  and  sample populations on distributed heterogeneous
         compute environments.

       · Reproducible: Track configuration, versions, provenance  and  command  lines  to  enable
         debugging, extension and reproducibility of results.

       · Community developed: The development process is fully open and sustained by contributors
         from multiple institutions. By working together on a shared framework, we  can  overcome
         the  challenges associated with maintaining complex pipelines in a rapidly changing area
         of research.

       · Accessible: Bioinformaticians, biologists and the general public should be able  to  run
         these  tools  on  inputs ranging from research materials to clinical samples to personal
         genomes.

USERS

       A sample of institutions using  bcbio-nextgen  for  solving  biological  problems.  Please
       submit your story if you're using the pipeline in your own research.

       · Harvard School of Public Health: We use bcbio-nextgen within the bioinformatics core for
         variant calling on large population studies related to human health like breast cancer
         and Alzheimer's disease. Increasing scalability of the pipeline has been essential for
         handling study sizes of more than 1400 whole genomes.

       · Massachusetts General Hospital: The Department of Molecular Biology uses the pipeline to
         automatically process samples coming off Illumina HiSeq instruments. Automated pipelines
         perform alignment and sample-specific analysis, with results directly  uploaded  into  a
         local Galaxy instance.

       · Science for Life Laboratory: The genomics core platform in the Swedish National
         Genomics Infrastructure (NGI) has crunched over 16TBp (terabasepairs) and processed
         more than 7000 samples from the beginning of 2013 until the end of July. UPPMAX, our
         cluster located in Uppsala, has run the pipeline in production since 2010.

       · Institute of Human Genetics, UCSF: The Genomics Core Facility utilizes bcbio-nextgen in
         processing more than 2000 whole genome, exome, RNA-seq and ChIP-seq samples across
         various projects. This pipeline tremendously lowers the barrier to accessing next
         generation sequencing technology. The community engaged here is also very helpful in
         providing best-practice advice and up-to-date solutions to ease scientific discovery.

       · IRCCS "Mario Negri" Institute for Pharmacological Research: The  Translational  Genomics
         Unit  in  the Department of Oncology uses bcbio-nextgen for targeted resequencing (using
         an Illumina MiSeq) to  identify  mutations  and  other  variants  in  tumor  samples  to
         investigate  their  link to tumor progression, patient survival and drug sensitivity and
         resistance. A poster from the 2014 European Society of Human Genetics  meeting  provides
         more  details  on  usage in ovarian cancer. A paper on the study of longitudinal ovarian
         cancer biopsies, which makes extensive use of bcbio-nextgen, was published  in  2015  in
         Annals of Oncology.

       · The Translational Genomics Research Institute (TGen): Members of the Huentelman lab at
         TGen apply bcbio-nextgen to a wide variety of studies, with a major focus on the
         neurobiology of aging and neurodegeneration, in collaboration with the Arizona
         Alzheimer's Consortium (AAC) and the McKnight Brain Research Foundation. We also use
         bcbio in studies of rare diseases in children through TGen's Center for Rare Childhood
         Disorders (C4RCD), and other rare diseases such as Multiple System Atrophy (MSA).
         bcbio-nextgen has also been instrumental in projects for TGen's Program for Canine
         Health & Performance (PCHP) and numerous RNA-seq projects using rodent models. Our work
         with bcbio started with a partnership with Dell and The Neuroblastoma and
         Medulloblastoma Translational Research Consortium (NMTRC), and TGen as part of a Phase I
         clinical trial in these rare childhood cancers.

       · Computer  Science  and  Artificial Intelligence Laboratory (CSAIL), MIT: The Gifford lab
         uses the bcbio-nextgen pipeline to analyze a variety of sequencing  datasets  for  their
         research in genetics and regulatory genomics (including the SysCode and stemcell.mit.edu
         projects).  The pipeline applies collaboratively-developed best practices  for  analysis
         as  well as computation, which enables the lab to run the pipeline on local clusters and
         Amazon EC2.

       · Sheffield Bioinformatics Core, The University of Sheffield: The Sheffield Bioinformatics
         Core is a relatively new core facility at The University of Sheffield, and bcbio has
         been instrumental in setting up a best-practice bioinformatics analysis service. We
         employ bcbio to automate the analyses of RNA-seq, small RNA and ChIP-seq datasets for
         researchers at The University of Sheffield and the NIHR Biomedical Research Centre. In
         conjunction with the bcbioRNASeq Bioconductor package, we deliver publication-quality
         reports to our researchers based on reproducible analyses.

AUTOMATED

       We provide an automated script that installs third party analysis tools,  required  genome
       data  and  python  library  dependencies  for  running human variant and RNA-seq analysis,
       bundled into an isolated directory or virtual environment:

          wget https://raw.github.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
          python bcbio_nextgen_install.py /usr/local/share/bcbio --tooldir=/usr/local \
            --genomes GRCh37 --aligners bwa --aligners bowtie2

       bcbio should install cleanly on Linux systems. For Mac OSX, we suggest trying bcbio-vm,
       which runs bcbio on docs-cloud or isolates all the third party tools inside a Docker
       container. bcbio-vm is still a work in progress, but we recommend it because not all of
       the dependencies bcbio uses install cleanly on OSX.

       With   the   command   line   above,   indexes   and   associated   data   files   go   in
       /usr/local/share/bcbio-nextgen and tools are  in  /usr/local.  If  you  don't  have  write
       permissions to install into the /usr/local directories you can install in a user directory
       like ~/local or use sudo chmod to give your standard user permissions. Please don't run
       the installer with sudo or as the root user. Do not use directories with : in the name;
       they are not POSIX compliant and will cause installation failures.
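
       For example, to install without root access into directories under your home (the exact
       paths here are only an illustration):

          wget https://raw.github.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
          python bcbio_nextgen_install.py $HOME/local/share/bcbio --tooldir=$HOME/local \
            --genomes GRCh37 --aligners bwa

       Afterwards, make sure $HOME/local/bin is on your PATH so bcbio_nextgen.py can be found.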

       The installation is highly customizable, and you can install additional software and  data
       later  using  bcbio_nextgen.py  upgrade.   Run  python  bcbio_nextgen_install.py  with  no
       arguments to see options for configuring the installation process. Some  useful  arguments
       are:

       · --isolate Avoid updating the user's ~/.bashrc if installing in a non-standard PATH. This
         facilitates creation of isolated modules without  disrupting  the  user's  environmental
         setup. Manually edit your ~/.bashrc to allow bcbio runs with:

            export PATH=/path_to_bcbio/bin:$PATH

       · --nodata Do not install genome data.

       The machine will need to have some basic requirements for installing and running bcbio:

       · Python 2.7, Python 3.x, or Python 2.6 plus the argparse dependency.

       · Basic system setup for unpacking files: tar, gzip, unzip, bzip2, xz-utils.

       · The git version control system (http://git-scm.com/)

       · wget for file retrieval (https://www.gnu.org/software/wget/)
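
       On Ubuntu/Debian systems these basic requirements can typically be satisfied with the
       system package manager; package names may differ on other distributions:

          sudo apt-get update
          sudo apt-get install tar gzip unzip bzip2 xz-utils git wget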

       Optional tool specific requirements:

       · Java  1.7, needed when running GATK < 3.6 or MuTect. This must be available in your path
         so typing java -version resolves a 1.7 version. bcbio distributes Java 8 as part of  the
         anaconda installation for recent versions of GATK and MuTect2. You can override the Java
         8 installed with bcbio by setting BCBIO_JAVA_HOME=/path/to/your/javadir if you have  the
         java you want in /path/to/your/javadir/bin/java.

       · An  OpenGL  library,  like  Mesa  (On  Ubuntu/deb  systems:  libglu1-mesa, On RedHat/rpm
         systems: mesa-libGLU-devel). This is only required  for  cancer  heterogeneity  analysis
         with BubbleTree.

       · The Pisces tumor-only variant caller requires the Microsoft .NET runtime.

       The  bcbio-nextgen  Dockerfile  contains  the  packages  needed  to install on bare Ubuntu
       systems.

       The automated installer creates a fully integrated environment that allows simultaneous
       updates of the framework, third party tools and biological data. Compared with manual
       installation, this makes it possible to manage and maintain a consistent analysis
       environment as algorithms continue to evolve and improve. Installing this way is as
       isolated and self-contained as possible without virtual machines or lightweight system
       containers like Docker. The Upgrade section documents how to include additional genome
       data for supported bcbio genomes. For genome builds not included in the defaults, see the
       documentation on config-custom. Following installation, you should edit the pre-created
       system configuration file in /usr/local/share/bcbio-nextgen/galaxy/bcbio_system.yaml to
       match your local system or cluster configuration (see tuning-cores).

UPGRADE

       We use the same automated installation process for performing upgrades of tools,  software
       and  data  in  place.  Since  there  are  multiple  targets and we want to avoid upgrading
       anything unexpectedly, we have specific arguments  for  each.  Generally,  you'd  want  to
       upgrade the code, tools and data together with:

          bcbio_nextgen.py upgrade -u stable --tools --data

       Tune the upgrade with these options:

       · -u  Type  of  upgrade to do for bcbio-nextgen code. stable gets the most recent released
         version and development retrieves the latest code from GitHub.

       · --datatarget Customize installed data or download additional files not included by
         default: see Customizing data installation.

       · --toolplus  Specify  additional  tools to include. See the section on Extra software for
         more details.

       · --genomes and --aligners options add additional aligner indexes to download and prepare.
         bcbio_nextgen.py upgrade -h lists available genomes and aligners. If you want to install
         multiple genomes or aligners at once, specify --genomes or  --aligners  multiple  times,
         like this: --genomes GRCh37 --genomes mm10 --aligners bwa --aligners bowtie2

       · Leave  out  the  --tools option if you don't want to upgrade third party tools. If using
         --tools, it will use the same directory as  specified  during  installation.  If  you're
         using  an  older  version  that  has  not  yet  gone  through  a  successful  upgrade or
         installation and saved the tool directory, you should manually specify --tooldir for the
         first upgrade. You can also pass --tooldir to install to a different directory.

       · Leave  out  the --data option if you don't want to get any upgrades of associated genome
         data.

       · Some aligners such as STAR don't have pre-built indices due to their large file sizes.
         You can set the number of cores to use for indexing with --cores 8.
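
       Putting these options together, a sketch of an upgrade that adds STAR indexes for GRCh37,
       built locally with 8 cores (assuming -u skip leaves the bcbio code itself untouched):

          bcbio_nextgen.py upgrade -u skip --data --genomes GRCh37 --aligners star --cores 8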

CUSTOMIZING DATA INSTALLATION

       bcbio installs associated data files for sequence processing, and you're able to customize
       this  to  install  larger  files  or  change  the  defaults.  Use  the  --datatarget  flag
       (potentially multiple times) to customize or add new targets.

       By  default,  bcbio will install data files for variation, rnaseq and smallrna but you can
       sub-select a single one of these if  you  don't  require  other  analyses.  The  available
       targets are:

       · variation  --  Data  files  required  for  variant  calling: SNPs, indels and structural
         variants. These include files for annotation like dbSNP, associated  files  for  variant
         filtering, coverage and annotation files.

       · rnaseq  --  Transcripts  and  indices for running RNA-seq. The transcript files are also
         used for annotating and prioritizing structural variants.

       · smallrna -- Data files for doing small RNA analysis.

       · gemini -- The GEMINI  framework  associates  publicly  available  metadata  with  called
         variants,  and  provides  utilities  for  query  and  analysis. This target installs the
         required GEMINI data files, including ExAC.

       · gnomad -- gnomAD is a large scale collection of genome variants, expanding  on  ExAC  to
         include whole genome and more exome inputs. This is a large 25Gb download, available for
         human genome builds GRCh37, hg19 and hg38.

       · vep -- Data files for the Variant Effects Predictor (VEP). To use VEP as an  alternative
         to the default installed snpEff, set vep in the variant-config configuration.

       · dbnsfp -- Like CADD, dbNSFP provides integrated and generalized metrics from multiple
         sources to help with prioritizing variations for follow up. The files are large: dbNSFP
         is 10Gb, expanding to 100Gb during preparation.

       · dbscsnv -- dbscSNV includes all potential human SNVs within splicing consensus regions
         (−3 to +8 at the 5' splice site and −12 to +2 at the 3' splice site), i.e. scSNVs,
         related functional annotations and two ensemble prediction scores for predicting their
         potential to alter splicing.

       · battenberg -- Data files for Battenberg, which detects subclonality and copy number
         changes in whole genome cancer samples.

       · kraken -- Database for Kraken, optionally used for contamination detection.

       · ericscript -- Database for EricScript-based gene fusion detection. Supports hg38, hg19
         and GRCh37.
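
       For example, to add the VEP and gnomAD data targets to an existing installation (the
       --datatarget flag can be repeated):

          bcbio_nextgen.py upgrade --datatarget vep --datatarget gnomad

       Then, to use VEP instead of the default snpEff for a sample, a minimal sketch of the
       variant-config setting (assuming the algorithm effects key selects the annotator):

          - description: Sample1
            algorithm:
              effects: vep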

       For somatic analyses, bcbio includes COSMIC v68 for hg19 and GRCh37 only. Due  to  license
       restrictions, we cannot include updated versions of this dataset and hg38 support with the
       installer. To prepare these datasets yourself you can use a utility  script  shipped  with
       cloudbiolinux  that  downloads,  sorts  and  merges  the VCFs, then copies into your bcbio
       installation:

          export COSMIC_USER="your@registered.email.edu"
          export COSMIC_PASS="cosmic_password"
          bcbio_python prepare_cosmic.py 89 /path/to/bcbio

       /path/to/bcbio/ here is the directory one up from the genomes directory.

EXTRA SOFTWARE

       We're not able to automatically install some useful tools due to  licensing  restrictions,
       so  we  provide  a mechanism to manually download and add these to bcbio-nextgen during an
       upgrade with the --toolplus command line.

   GATK and MuTect/MuTect2
       bcbio includes an installation of GATK4, which is freely available for all uses. This is
       the default runner for HaplotypeCaller or MuTect2. If you want to use an older version of
       GATK, it requires manual installation. Older GATK is freely available for academic users,
       but requires a license for commercial use. It is not freely redistributable, so it
       requires a manual download from the GATK download site. You also need to include
       tools_off: [gatk4] in your configuration for runs: see config-changing-defaults.

       To install GATK3, register with the pre-installed gatk bioconda wrapper:

          gatk3-register /path/to/GenomeAnalysisTK.tar.bz2

       If  you're  not  using the most recent post-3.6 version of GATK, or using a nightly build,
       you can add --noversioncheck to the command line to skip comparisons to the GATK version.

       MuTect2 is distributed with GATK in versions 3.5 and later.

       To install versions of GATK < 3.6, download and unzip the latest  version  from  the  GATK
       distribution. Then make this jar available to bcbio-nextgen with:

          bcbio_nextgen.py upgrade --tools --toolplus gatk=/path/to/gatk/GenomeAnalysisTK.jar

       This will copy the jar and update your bcbio_system.yaml and manifest files to reflect the
       new version.

       MuTect also has similar licensing terms and requires a license for commercial  use.  After
       downloading the MuTect jar, make it available to bcbio:

          bcbio_nextgen.py upgrade --tools --toolplus mutect=/path/to/mutect/mutect-1.1.7.jar

       Note that MuTect does not provide an easy way to query for the current version, so your
       input jar needs to include the version in the name.

SYSTEM REQUIREMENTS

       bcbio-nextgen provides a wrapper around external tools and data, so the actual tools  used
       drive  the  system  requirements. For small projects, it should install on workstations or
       laptops with a couple Gb of memory, and then scale as  needed  on  clusters  or  multicore
       machines.

       Disk space requirements for the tools, including all system packages, are under 4Gb.
       Biological data requirements will depend on the genomes and aligner indices used, but a
       suggested install with GRCh37 and bowtie2/bwa indexes uses approximately 35Gb of storage
       during preparation and ~25Gb after:

          $ du -shc genomes/Hsapiens/GRCh37/*
          3.8G  bowtie2
          5.1G  bwa
          3.0G  rnaseq-2014-05-02
          3.0G  seq
          340M  snpeff
          4.2G  variation
          4.4G  vep
          23.5G total

TROUBLESHOOTING

   Proxy or firewall problems
       Some steps retrieve third party tools from GitHub, which can run  into  issues  if  you're
       behind  a  proxy  or  block git ports. To instruct git to use https:// globally instead of
       git://:

          $ git config --global url.https://github.com/.insteadOf git://github.com/

   GATK or Java Errors
       Most software tools used by bcbio require Java 1.8.  bcbio  distributes  an  OpenJDK  Java
       build  and  uses  it so you don't need to install anything. Older versions of GATK (< 3.6)
       and MuTect require a locally installed Java 1.7. If you  have  version  incompatibilities,
       you'll see errors like:

          Unsupported major.minor version 51.0

       Fixing  this  requires  either  installing  Java  1.7  for old GATK and MuTect or avoiding
       pointing to an incorrect java (unset JAVA_HOME). You can  also  tweak  the  java  used  by
       bcbio, described in the Automated installation section.
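
       Following the Automated installation section, the override looks like this, where
       /path/to/your/javadir/bin/java is the java you want bcbio to use:

          export BCBIO_JAVA_HOME=/path/to/your/javadir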

   ImportErrors
       Import   errors   with  tracebacks  containing  Python  libraries  outside  of  the  bcbio
       distribution  (/path/to/bcbio/anaconda)  are  often  due  to  other   conflicting   Python
       installations.  bcbio  tries  to isolate itself as much as possible but external libraries
       can get included during installation due to the  PYTHONHOME  or  PYTHONPATH  environmental
       variables or local site libraries. These commands will temporarily unset those to get
       bcbio installed, after which it should ignore them automatically:

          $ unset PYTHONHOME
          $ unset PYTHONPATH
          $ export PYTHONNOUSERSITE=1

       Finally, having a .pydistutils.cfg file in your home directory can  mess  with  where  the
       libraries  get  installed.  If  you  have  this  file  in your home directory, temporarily
       renaming it to something else may fix your installation issue.
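
       For example (the backup name is arbitrary):

          mv ~/.pydistutils.cfg ~/.pydistutils.cfg.disabled
          # re-run the installer or upgrade, then restore the file if you still need it
          mv ~/.pydistutils.cfg.disabled ~/.pydistutils.cfg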

MANUAL PROCESS

       The manual process does not allow the in-place updates and management of third party tools
       that  the  automated installer makes possible. It's a more error-prone and labor intensive
       process. If you find you can't use the installer we'd love to hear why  to  make  it  more
       amenable  to  your  system. If you'd like to develop against a bcbio installation, see the
       documentation on setting up a code-devel-infrastructure.

   Tool Requirements
       The code drives a number of next-generation sequencing analysis tools  that  you  need  to
       install  on  any  machines  involved in the processing. The CloudBioLinux toolkit provides
       automated scripts to help with installation for both software and associated data files:

          fab -f cloudbiolinux/fabfile.py -H localhost install_biolinux:flavor=ngs_pipeline_minimal

       You can also install them manually, adjusting locations in the resources section  of  your
       bcbio_system.yaml  configuration file as needed. The CloudBioLinux infrastructure provides
       a full list of third party software installed with bcbio-nextgen in packages-conda.yaml,
       which lists all third party tools installed through Bioconda.

   Data requirements
       In addition to existing bioinformatics software  the  pipeline  requires  associated  data
       files  for  reference genomes, including pre-built indexes for aligners. The CloudBioLinux
       toolkit again provides an automated way to download and prepare these reference genomes:

          fab -f data_fabfile.py -H localhost -c your_fabricrc.txt install_data_s3:your_biodata.yaml

       The  biodata.yaml  file  contains  information  about  what  genomes  to   download.   The
       fabricrc.txt  describes where to install the genomes by adjusting the data_files variable.
       This creates a tree structure that includes  a  set  of  Galaxy-style  location  files  to
       describe locations of indexes:

          ├── galaxy
          │   ├── tool-data
          │   │   ├── alignseq.loc
          │   │   ├── bowtie_indices.loc
          │   │   ├── bwa_index.loc
          │   │   ├── sam_fa_indices.loc
          │   │   └── twobit.loc
          │   └── tool_data_table_conf.xml
          ├── genomes
          │   ├── Hsapiens
          │   │   ├── GRCh37
          │   │   └── hg19
          │   └── phiX174
          │       └── phix
          └── liftOver

       Individual  genome  directories contain indexes for aligners in individual sub-directories
       prefixed by the aligner name. This structured scheme helps manage aligners that don't have
       native  Galaxy  .loc  files.  The  automated  installer  will  download  and  set  this up
       automatically:

          `-- phix
              |-- bowtie
              |   |-- phix.1.ebwt
              |   |-- phix.2.ebwt
              |   |-- phix.3.ebwt
              |   |-- phix.4.ebwt
              |   |-- phix.rev.1.ebwt
              |   `-- phix.rev.2.ebwt
              |-- bowtie2
              |   |-- phix.1.bt2
              |   |-- phix.2.bt2
              |   |-- phix.3.bt2
              |   |-- phix.4.bt2
              |   |-- phix.rev.1.bt2
              |   `-- phix.rev.2.bt2
              |-- bwa
              |   |-- phix.fa.amb
              |   |-- phix.fa.ann
              |   |-- phix.fa.bwt
              |   |-- phix.fa.pac
              |   |-- phix.fa.rbwt
              |   |-- phix.fa.rpac
              |   |-- phix.fa.rsa
              |   `-- phix.fa.sa
              |-- novoalign
              |   `-- phix
              |-- seq
              |   |-- phix.dict
              |   |-- phix.fa
              |   `-- phix.fa.fai
              `-- ucsc
                  `-- phix.2bit

GERMLINE VARIANT CALLING

       bcbio implements configurable SNP, indel  and  structural  variant  calling  for  germline
       populations.  We  include  whole genome and exome evaluations against reference calls from
       the Genome in  a  Bottle  consortium  and  Illumina  Platinum  Genomes  project,  enabling
       continuous assessment of new alignment and variant calling algorithms. We regularly report
       on these comparisons and continue to improve approaches as the community makes  new  tools
       available. Here is some of the research that contributes to the current implementation:

       · An  introduction  to the variant evaluation framework. This includes a comparison of the
         bwa mem and novoalign aligners. We also compared the FreeBayes, GATK HaplotypeCaller and
         GATK UnifiedGenotyper variant callers.

       · An  in-depth  evaluation  of  FreeBayes and BAM post-alignment processing. We found that
         FreeBayes quality  was  equal  to  GATK  HaplotypeCaller.  Additionally,  a  lightweight
         post-alignment  preparation  method  using  only de-duplication was equivalent to GATK's
         recommended Base Quality Score Recalibration (BQSR) and realignment around indels,  when
         using good quality input datasets and callers that do local realignment.

       · Additional work to improve variant filtering, providing methods to remove low complexity
         regions (LCRs) that can bias indel results. We also tuned GATK's Variant  Quality  Score
         Recalibrator  (VQSR)  and  compared it with cutoff-based soft filtering. VQSR requires a
         large number of variants and we use it in bcbio  with  GATK  HaplotypeCaller  when  your
         algorithm-config  contains  high  depth  samples (coverage_depth is not low) and you are
         calling on the whole genome (coverage_interval is genome) or have more than 50  regional
         or exome samples called concurrently.

       · An  evaluation  of  joint  calling  with  GATK  HaplotypeCaller, FreeBayes, Platypus and
         samtools. This validates the joint calling implementation,  allowing  scaling  of  large
         population  germline  experiments.  It  also  demonstrates  improved  performance of new
         callers: samtools 1.0 and Platypus.

       · Support for build 38 of the human genome, improving precision of detection thanks to the
         improved genome representation.

       bcbio  automates  post-variant  calling  annotation  to  make  the  outputs easier to feed
       directly into your biological analysis.  We  annotate  variant  effects  using  snpEff  or
       Variant  Effect  Predictor  (VEP),  and prepare a GEMINI database that associates variants
       with multiple external annotations in a SQL-based query interface. GEMINI  databases  have
       the  most associated external information for human samples (GRCh37/hg19 and hg38) but are
       available for any organism with the database populated  using  the  VCF  INFO  column  and
       predicted effects.

   Basic germline calling
       The  best  approach  to  build  a  bcbio  docs-config  for  germline calling is to use the
       automated-sample-config with one of the default templates:

       · FreeBayes template -- Call variants using FreeBayes with a minimal preparation pipeline.
         This   is  a  freely  available  unrestricted  pipeline  fully  included  in  the  bcbio
         installation.

       · GATK HaplotypeCaller template -- Run GATK best practices, including Base  Quality  Score
         Recalibration,  realignment and HaplotypeCaller variant calling. This requires a license
         from Broad for commercial use. You need to manually install GATK along with bcbio  using
         downloads from the GATK Broad site or Appistry (see toolplus-install).

       You may also want to enable Structural variant calling for detection of larger events,
       which works with either caller. Another good source of inspiration is the configuration
       files from the example-pipelines, which may help identify other configuration variables
       of interest. A more complex setup with multiple callers and resolution of ensemble calls
       is generally only useful with a small population where you are especially concerned about
       sensitivity. Single caller detection with FreeBayes or GATK HaplotypeCaller provides good
       resolution of events.

   Population calling
       When  calling  multiple  samples,  we  recommend  calling  together  to  provide  improved
       sensitivity and a fully squared off final callset. To  associate  samples  together  in  a
       population add a metadata batch to the sample-configuration:

          - description: Sample1
            metadata:
              batch: Batch1
          - description: Sample2
            metadata:
              batch: Batch1

       Batching  samples results in output VCFs and GEMINI databases containing all merged sample
       calls. bcbio has two methods to call samples together:

       · Batch or pooled calling -- This calls all samples simultaneously by feeding them to  the
         variant  caller.  This  works  for  smaller  batch  sizes  (<  100  samples)  as  memory
         requirements become limiting in larger pools. This is the default  approach  taken  when
         you specify a variantcaller in the variant-config configuration.

       · Joint  calling  --  This calls samples independently, then combines them together into a
         single callset by integrating the individual calls. This  scales  to  larger  population
         sizes  by  avoiding  the computational bottlenecks of pooled calling. We recommend joint
         calling with HaplotypeCaller if you have a license for  GATK  usage,  but  also  support
         joint  calling  with  FreeBayes  using a custom implementation. Specifying a jointcaller
         along with the appropriate variantcaller in  the  variant-config  configuration  enables
         this:

            - description: Sample1
              algorithm:
                variantcaller: gatk-haplotype
                jointcaller: gatk-haplotype-joint
              metadata:
                batch: Batch1
            - description: Sample2
              algorithm:
                variantcaller: gatk-haplotype
                jointcaller: gatk-haplotype-joint
              metadata:
                batch: Batch1

CANCER VARIANT CALLING

       bcbio supports somatic cancer calling with tumor and optionally matched normal pairs using
       multiple SNP, indel and structural variant callers. A full evaluation  of  cancer  calling
       validates  callers  against synthetic dataset 3 from the ICGC-TCGA DREAM challenge.  bcbio
       uses a majority voting ensemble approach to combining calls from multiple  SNP  and  indel
       callers, and also flattens structural variant calls into a combined representation.

       The  example  configuration for the example-cancer validation is a good starting point for
       setting up a tumor/normal run on your own dataset. The configuration  works  similarly  to
       population  based  calling. Supply a consistent batch for tumor/normal pairs and mark them
       with the phenotype:

          - description: your-tumor
            algorithm:
              variantcaller: [vardict, strelka2, mutect2]
            metadata:
              batch: batch1
              phenotype: tumor
          - description: your-normal
            algorithm:
              variantcaller: [vardict, strelka2, mutect2]
            metadata:
              batch: batch1
              phenotype: normal

       Other config-cancer configuration options allow tweaking of the processing parameters. For
       pairs  you want to analyze together, specify a consistent set of variantcaller options for
       both samples.

       Cancer calling handles both tumor-normal paired calls and tumor-only calling. To specify a
       tumor-only  sample,  provide  a single sample labeled with phenotype: tumor. Otherwise the
       configuration and setup is the same as with paired analyses. For tumor-only samples, bcbio
       will try to remove likely germline variants that are present in public databases like 1000
       Genomes and ExAC but not in COSMIC. This runs as long as you have a local GEMINI data
       installation (--datatarget gemini) and marks likely germline variants with a LowPriority
       filter. This post has more details on the approach and validation.
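
       As a sketch, a tumor-only batch looks like the paired tumor/normal example above with the
       normal sample removed:

          - description: your-tumor
            algorithm:
              variantcaller: [vardict, strelka2, mutect2]
            metadata:
              batch: batch1
              phenotype: tumor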

       The standard variant outputs (sample-caller.vcf.gz) for tumor calling emphasize somatic
       differences: variants likely unique to the cancer. If you have a tumor-only sample
       and GEMINI data installed, it will also output sample-caller-germline.vcf.gz, which  tries
       to  identify  germline  background mutations based on presence in public databases. If you
       have tumor/normal data and would like to also  call  likely  germline  mutations  see  the
       documentation on specifying a germline caller: Somatic with germline variants.

       We're  actively  working  on improving calling to better account for the heterogeneity and
       structural variability that define cancer genomes.

SOMATIC WITH GERMLINE VARIANTS

       For tumor/normal somatic  samples,  bcbio  can  call  both  somatic  (tumor-specific)  and
       germline (pre-existing) variants. The typical outputs of Cancer variant calling are likely
       somatic variants acquired by the cancer, but pre-existing germline risk variants are often
       also diagnostic.

       For  tumor-only  cases  we  suggest  running  standard Cancer variant calling.  Tumor-only
       inputs mix somatic and germline variants, making it  difficult  to  separate  events.  For
       small  variants  (SNPs  and indels) bcbio will attempt to distinguish somatic and germline
       mutations using the presence of variants in population databases.

       To obtain both somatic and germline calls for your tumor/normal inputs, specify which
       callers to use for each step in the variant-config configuration:

          description: your-normal
          variantcaller:
             somatic: vardict
             germline: freebayes

       bcbio  does  a  single alignment for the normal sample, then splits at the variant calling
       steps using this normal sample to do germline calling. In this example, the  output  files
       are:

       · your-tumor/your-tumor-vardict.vcf.gz  --  Somatic calls from the tumor samples using the
         normal as background to subtract existing calls.

       · your-normal/your-normal-freebayes.vcf.gz -- Germline calls on the normal sample.

       Germline calling supports multiple callers, and other configuration options like ensemble
       and structural variant calling carry over from the standard configuration. For example, to
       use 3 callers for somatic and germline calling, create ensemble calls for both and include
       germline and somatic events from two structural variant callers:

          variantcaller:
             somatic: [vardict, strelka2, mutect2]
             germline: [freebayes, gatk-haplotype, strelka2]
          ensemble:
             numpass: 2
          svcaller: [manta, cnvkit]

       In  addition  to  the somatic and germline outputs attached to the tumor and normal sample
       outputs as described above, you'll get:

       · your-tumor/your-tumor-manta.vcf.gz  --  Somatic  structural  variant  calls   for   each
         specified  svcaller.  These  will  have genotypes for both the tumor and normal samples,
         with somatic calls labeled as PASS variants.

       · your-normal/your-normal-manta.vcf.gz --  Germline  structural  variant  calls  for  each
         specified svcaller. We expect these to be noisier than the somatic calls due to the lack
         of a reference sample to help remove technical noise.

SOMATIC TUMOR ONLY CNVS

       Copy number variant (CNV) detection in tumor-only samples requires accurately representing
       the non-somatic capture and sequencing background in the absence of a matched normal.
       Capture- or sequencing-specific coverage differences can trigger false positives or
       negatives. Without a matched normal to remove these artifacts, you can use a
       process-matched set of unrelated samples to build a Panel of Normals (PoN) for background
       correction.

       To create these, collect all the samples you plan to use for the panel of normals and run
       them through an automated-sample-config as a single batch, with the background samples
       marked as control in the svclass column and any non-background samples left unmarked. An
       example sample CSV:

          samplename,description,svclass,batch
          background1.bam,background1,control,pon_build
          background2_R1.fq.gz;background2_R2.fq.gz,background2,control,pon_build
           testsample.bam,testsample,,pon_build

       and template YAML:

          details:
            - analysis: variant2
              genome_build: hg38
              algorithm:
                svcaller: [gatk-cnv, seq2c]
                variant_regions: your_regions.bed

       After running, collect the panel of normal files from each calling method:

       · gatk-cnv: work/structural/testsample/bins/background1-pon_build-pon.hdf5

       · seq2c: This doesn't have a default panel of normals file format, so we create a
         bcbio-specific one as a concatenation of the read mapping file
         (final/date_project/seq2c-read_mapping.txt) and coverage file
         (final/date_project/seq2c-coverage.tsv) outputs for the background samples. When fed to
         future bcbio runs, it will correctly extract and re-use this file as background.

       · CNVkit: final/testsample/testsample-cnvkit-background.cnn

       CNVkit  and  gatk-cnv cannot be run together, because they require different, incompatible
       normalization schemes.

       Once you have the panel of normals, use them as background in any tumor-only project with
       the same sequencing and capture process in your variant-config configuration:

          svcaller: [gatk-cnv, seq2c]
          variant_regions: your_regions.bed
          background:
            cnv_reference:
              cnvkit: ../pon/your_regions-cnvkit-pon.cnn
              gatk-cnv: ../pon/your_regions-gatk-cnv-pon.hdf5
               seq2c: ../pon/your_regions-seq2c-pon.txt

STRUCTURAL VARIANT CALLING

       bcbio  can  detect  larger  structural variants like deletions, insertions, inversions and
       copy number changes for both germline population and  cancer  variant  calling,  based  on
       validation against existing truth sets:

       · Validation  of  germline  structural variant detection using multiple calling methods to
         validate against deletions in NA12878. This implements a pipeline that works  in  tandem
         with  SNP  and  indel  calling  to  detect  larger structural variations like deletions,
         duplications, inversions and copy number variants (CNVs).

       · Validation of tumor/normal calling  using  the  synthetic  DREAM  validation  set.  This
         includes   validation   of  additional  callers  against  duplications,  insertions  and
         inversions.

       To enable structural variant calling, specify svcaller options in the algorithm section of
       your configuration:

          - description: Sample
            algorithm:
              svcaller: [lumpy, manta, cnvkit]

       The  best  supported  callers  are Lumpy and Manta, for paired end and split read calling,
       CNVkit for read-depth based CNV calling, and WHAM for association testing. We also support
       DELLY,  another  excellent  paired end and split read caller, although it is slow on large
       whole genome datasets.

RNA-SEQ

       bcbio can also be used to analyze RNA-seq data. It includes  steps  for  quality  control,
       adapter   trimming,   alignment,   variant   calling,   transcriptome  reconstruction  and
       post-alignment quantitation at the level of the gene and isoform.

       We recommend using the STAR aligner for all genomes where there are no  alt  alleles.  For
       genomes  such  as hg38 that have alt alleles, hisat2 should be used as it handles the alts
       correctly and STAR does not yet. Use Tophat2 only if you do not have enough RAM  available
       to run STAR (about 30 GB).
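
       For example, the aligner choice goes in the algorithm section of each sample. An
       illustrative fragment following the recommendations above (see the Configuration section
       for complete sample entries):

          - description: Sample_grch37
            genome_build: GRCh37
            analysis: RNA-seq
            algorithm:
              aligner: star
          - description: Sample_hg38
            genome_build: hg38
            analysis: RNA-seq
            algorithm:
              aligner: hisat2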

       Our  current  recommendation is to run adapter trimming only if using the Tophat2 aligner.
       Adapter trimming is very slow, and aligners that soft clip the ends of reads such as  STAR
       and  hisat2,  or  algorithms  using  pseudoalignments  like  Sailfish  handle  contaminant
       sequences at the ends properly. This makes trimming unnecessary. Tophat2 does not  perform
       soft clipping so if using Tophat2, trimming must still be done.

       Salmon, which is an extremely fast alignment-free method of quantitation, is run for all
       experiments. Salmon can accurately quantitate the expression of genes, even ones which are
       hard to quantitate with other methods (see this paper, for example, on Sailfish, which
       performs similarly to Salmon). Salmon can also quantitate at the transcript level, which
       can help gene-level analyses (see this paper for example). We recommend using the Salmon
       quantitation rather than the counts from featureCounts to perform downstream
       quantification.

       Although  we  do  not  recommend  using the featureCounts based counts, the alignments are
       still useful because they give you many more quality  metrics  than  the  quasi-alignments
       from Salmon.

       After a bcbio RNA-seq run, the upload directory will contain a directory for each sample
       with a BAM file of the aligned and unaligned reads, a sailfish directory with the output
       of Salmon, including TPM values, and a qc directory with plots from FastQC and qualimap.

       In addition to the directories for each sample, the upload directory contains a project
       directory with a YAML file describing summary statistics for each sample and some
       provenance data about the bcbio run. That directory also holds a combined.counts file
       with the featureCounts-derived counts for each sample.

FAST RNA-SEQ

       This mode of bcbio-nextgen quantitates transcript expression using Salmon and does nothing
       else. It is an order of magnitude faster or more than running the full  RNA-seq  analysis.
       The  cost  of  the  increased speed is that you will have much less information about your
       samples at the end of the run, which  can  make  troubleshooting  trickier.   Invoke  with
       analysis: fastrna-seq.
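
       A minimal sample entry for this mode, as a sketch; the file paths and genome build are
       placeholders and only the analysis value is specific to this pipeline:

          - files: [/full/path/to/sample_1.fastq, /full/path/to/sample_2.fastq]
            description: Sample1
            genome_build: GRCh37
            analysis: fastrna-seq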

SINGLE-CELL RNA-SEQ

       bcbio-nextgen supports unique molecular identifier (UMI) based single-cell RNA-seq
       analyses. If your single-cell prep does not use UMIs, you can most likely just run the
       standard RNA-seq pipeline and use the results from that. The UMIs are used to discard
       reads that are likely PCR duplicates, which is very helpful for removing some of the PCR
       duplicate noise that can dominate single-cell experiments.

       Unlike  the  standard  RNA-seq  pipeline, the single-cell pipeline expects the FASTQ input
       files to not be separated by cellular barcode, so each file is a mix of  cells  identified
       by  a cellular barcode (CB), and unique reads from a transcript are identified with a UMI.
       bcbio-nextgen inspects each read, identifies the cellular barcode and UMI and puts them in
       the read name. Then the reads are aligned to the transcriptome with RapMap and the number
       of reads aligning to each transcript is counted for each cellular barcode. The output is a
       table of counts with transcripts as the rows and cellular barcodes as the columns for
       each input FASTQ file.

       Optionally the reads can be quantitated with kallisto to output  transcript  compatibility
       counts rather than counts per gene (TCC paper).

       To  extract the UMI and cellular barcodes from the read, bcbio-nextgen needs to know where
       the UMI and the cellular barcode are expected to  be  in  the  read.  Currently  there  is
       support  for two schemes, the inDrop system from the Harvard single-cell core facility and
       CEL-seq. If bcbio-nextgen does not support your UMI and barcoding scheme, please  open  up
       an issue and we will help implement support for it.

       Most  of  the  heavy  lifting  for  this  part of bcbio-nextgen is implemented in the umis
       repository.
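
       A sketch of a sample entry for this pipeline; the analysis name and umi_type value shown
       here are assumptions, so check the configuration documentation for the exact identifiers
       matching your barcoding scheme:

          - files: [/full/path/to/mixed_cells.fastq]
            description: scSample1
            genome_build: GRCh37
            analysis: scRNA-seq
            algorithm:
              umi_type: harvard-indrop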

SMALLRNA-SEQ

       bcbio-nextgen also implements a  configurable  best-practices  pipeline  for  smallRNA-seq
       quality  controls,  adapter  trimming,  miRNA/isomiR  quantification  and  other small RNA
       detection.

       · Adapter trimming:

         · atropos

         · dnapi for adapter de-novo detection

       · Sequence alignment:

         · STAR for genome alignment

         · bowtie, bowtie2 and hisat2 as alternative genome aligners

       · Specific small RNAs quantification (miRNA/tRNAs...):

         · seqbuster for miRNA annotation

         · MINTmap for tRNA fragments annotation

         · miRge2 for alternative small RNA quantification. To set up this tool, you need to
           install miRge2.0 manually and download the library data for your species. Read how to
           install and download the data here. If you have the human folder at /mnt/data/human,
           the option to pass to resources will be /mnt/data. Then set up resources:

              resources:
                mirge:
                  options: ["-lib $PATH_TO_PARENT_SPECIES_LIB"]

       · Quality control:

         · FastQC

       · Other small RNAs quantification:

         · seqcluster

         · mirDeep2 for miRNA prediction

       The pipeline generates an RMD template file inside the report folder that can be rendered
       with knitr. An example of the report is here. The count table (counts_mirna.tsv) from
       mirbase miRNAs will be inside the mirbase or final project folder. Input files for the
       isomiRs package for isomiR analysis will be inside each sample in the mirbase folder. If
       mirdeep2 can run, a count table (counts_mirna_novel.tsv) for novel miRNAs will be inside
       the mirdeep2 or final project folder. tdrmapper results will be inside each sample inside
       the tdrmapper or final project folder.

CHIP/ATAC-SEQ

       The  bcbio-nextgen  implementation  of  ChIP-seq aligns, removes multimapping reads, calls
       peaks with a paired input file using MACS2 and outputs  a  set  of  greylist  regions  for
       filtering possible false peaks in regions of high depth in the input file.

       · Adapter trimming:

         · atropos

       · Sequence alignment:

         · bowtie2, bwa mem

       · Peak calling:

         · macs2

       · Greylisting:

         · chipseq-greylist

       · Quality control:

         · FastQC
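
       A paired ChIP/input setup might look like the following sketch; the analysis name,
       peakcaller field and phenotype labels are assumptions based on the tools listed above, so
       consult the configuration documentation for exact values:

          - description: your-chip
            analysis: chip-seq
            algorithm:
              aligner: bowtie2
              peakcaller: [macs2]
            metadata:
              batch: batch1
              phenotype: chip
          - description: your-input
            analysis: chip-seq
            algorithm:
              aligner: bowtie2
              peakcaller: [macs2]
            metadata:
              batch: batch1
              phenotype: input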

METHYLATION

       Whole  genome  bisulfite  sequencing  is supported using the bismark2 pipeline.  It can be
       turned on by setting analysis to wgbs-seq.
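
       A minimal sample entry, as a sketch; only the analysis value comes from the text above,
       and the other fields follow the configuration examples elsewhere in this document:

          - files: [/full/path/to/wgbs_1.fastq, /full/path/to/wgbs_2.fastq]
            description: Sample1
            genome_build: hg38
            analysis: wgbs-seq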

STANDARD

       This pipeline implements alignment and QC tools. Furthermore, it will run qsignature to
       detect possible duplicated or mislabeled samples. It uses SNP signatures to create a
       distance matrix that makes it easy to group samples. The project yaml file will show the
       number of total samples analyzed, the number of very similar samples, and samples that
       could be duplicated.

   Configuration
       We will assume that you installed bcbio-nextgen with the automated installer, and so  your
       default  bcbio_system.yaml  file is configured correctly with all of the tools pointing to
       the right places. If that is the case, to run bcbio-nextgen on a set of samples  you  just
       need  to  set  up a YAML file that describes your samples and what you would like to do to
       them. Let's say that you have a single paired-end control lane, prepared with the Illumina
       TruSeq  Kit  from  a  human.  Here is what a well-formed sample YAML file for that RNA-seq
       experiment would look like:

          fc_date: '070113'
          fc_name: control_experiment
          upload:
            dir: final
          details:
            - files: [/full/path/to/control_1.fastq, /full/path/to/control_2.fastq]
              description: 'Control_rep1'
              genome_build: GRCh37
              analysis: RNA-seq
              algorithm:
                   aligner: tophat2
                   quality_format: Standard
                   trim_reads: read_through
                   adapters: [truseq, polya]
                   strandedness: unstranded

       fc_date and fc_name will be combined to form a prefix to name intermediate files, and can
       be set to whatever you like. upload is explained pretty well in the configuration
       documentation and the above will direct bcbio-nextgen to put the output files from the
       pipeline into the final directory. Under details is a list of sections, each describing a
       sample to process. You can set many parameters under each section but most of the time
       just setting a few like the above is all that is necessary. analysis tells bcbio-nextgen
       to run the best-practice RNA-seq pipeline on this sample.

       In the above, since there are two files, control_1.fastq and control_2.fastq, the sample
       will automatically be run as paired-end data. If you have single-end data you can just
       supply one file and it will run as single-end. The description field will eventually be
       used to rename the files, so make it evocative since you will be looking at it a lot
       later. genome_build is self-explanatory.

       Sometimes you need a little bit more flexibility  than  the  standard  pipeline,  and  the
       algorithm   section  has  many  options  to  fine-tune  the  behavior  of  the  algorithm.
       quality_format tells bcbio-nextgen what quality format your FASTQ inputs are using; if
       your samples were sequenced any time past 2009 or so, you probably want to set it to
       Standard. Adapter read-through is a problem in RNA-seq libraries, so we want to trim off
       possible adapter sequences on the ends of reads; trim_reads is therefore set to
       read_through, which will also trim off poor quality ends. Since your library is an
       RNA-seq library prepared with the TruSeq kit, the adapters to trim off are the TruSeq
       adapters and possible polyA tails, so adapters is set to both of those. strandedness can
       be set if your library was prepared in a strand-specific manner to firststrand,
       secondstrand or unstranded (the default).

   Multiple samples
       Let's say you have a set of mouse samples to analyze, where each sample is a single lane
       of single-end RNA-seq reads prepared using the NextEra kit. There are two case and two
       control samples. Here is a sample configuration file for that analysis:

          fc_date: '070113'
          fc_name: mouse_analysis
          upload:
            dir: final
          details:
            - files: [/full/path/to/control_rep1.fastq]
              description: 'Control_rep1'
              genome_build: mm10
              analysis: RNA-seq
              algorithm:
                   aligner: tophat2
                   quality_format: Standard
                   trim_reads: read_through
                   adapters: [nextera, polya]
            - files: [/full/path/to/control_rep2.fastq]
              description: 'Control_rep2'
              genome_build: mm10
              analysis: RNA-seq
              algorithm:
                   aligner: tophat2
                   quality_format: Standard
                   trim_reads: read_through
                   adapters: [nextera, polya]
            - files: [/full/path/to/case_rep1.fastq]
              description: 'Case_rep1'
              genome_build: mm10
              analysis: RNA-seq
              algorithm:
                   aligner: tophat2
                   quality_format: Standard
                   trim_reads: read_through
                   adapters: [nextera, polya]
            - files: [/full/path/to/case_rep2.fastq]
              description: 'Case_rep2'
              genome_build: mm10
              analysis: RNA-seq
              algorithm:
                   aligner: tophat2
                   quality_format: Standard
                   trim_reads: read_through
                   adapters: [nextera, polya]

       More samples are added just by adding more entries under the details section. This is
       tedious and error-prone to do by hand, so there is an automated template system for
       common experiments. You could set up the previous experiment by making a mouse version of
       the illumina-rnaseq template file and saving it to a local file such as
       illumina-mouse-rnaseq.yaml. Then you can set up the sample file using the templating
       system:

          bcbio_nextgen.py -w template illumina-mouse-rnaseq.yaml mouse_analysis
          /full/path/to/control_rep1.fastq /full/path/to/control_rep2.fastq
          /full/path/to/case_rep1.fastq /full/path/to/case_rep2.fastq

       If you had paired-end samples instead of single-end samples, you can still use the
       template system as long as the forward and reverse read filenames are identical apart
       from a _1 and _2. For example, control_1.fastq and control_2.fastq will be detected as a
       pair and combined in the YAML file output by the templating system.

OVERVIEW

       1. Create a sample configuration file for your project (substitute  the  example  BAM  and
          fastq names below with the full path to your sample files):

             bcbio_nextgen.py -w template gatk-variant project1 sample1.bam sample2_1.fq sample2_2.fq

          This uses a standard template (GATK best practice variant calling) to automate creation
          of a full configuration for all samples. See automated-sample-config for  more  details
          on  running  the  script,  and  manually edit the base template or final output file to
          incorporate project specific  configuration.  The  example  pipelines  provide  a  good
          starting point and the sample-configuration documentation has full details on available
          options.

       2. Run analysis, distributed across 8 local cores:

             bcbio_nextgen.py bcbio_sample.yaml -n 8

       3. Read the docs-config documentation for full details on adjusting both  the  sample  and
          system configuration files to match your experiment and computational setup.

PROJECT DIRECTORY

       bcbio encourages a project structure like:

          my-project/
          ├── config
          ├── final
          └── work

       with  the  input configuration in the config directory, the outputs of the pipeline in the
       final  directory,  and  the  actual  processing  done  in  the  work  directory.  Run  the
       bcbio_nextgen.py  script  from  inside the work directory to keep all intermediates there.
       The final directory, relative to the parent  directory  of  the  work  directory,  is  the
       default  location  specified  in  the  example configuration files and gets created during
       processing. The final directory has all of the finished outputs and  you  can  remove  the
       work  intermediates  to  cleanup  disk  space  after  confirming the results. All of these
       locations are configurable and this project structure is only a recommendation.
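
       The layout above can be created in one step; a sketch using the recommended directory
       names (my-project is a placeholder):

```shell
# Create the recommended project skeleton
mkdir -p my-project/config my-project/final my-project/work
# Then run the pipeline from inside work/ so intermediates stay there:
#   cd my-project/work
#   bcbio_nextgen.py ../config/my-project.yaml -n 8
```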

LOGGING

       There are 3 logging files in the log directory within your working folder:

       · bcbio-nextgen.log High level logging information about the analysis.  This  provides  an
         overview of major processing steps and useful checkpoints for assessing run times.

       · bcbio-nextgen-debug.log  Detailed  information  about  processes including stdout/stderr
         from third party software and error traces for  failures.  Look  here  to  identify  the
         status of running pipelines or to debug errors. It labels each line with the hostname of
         the machine it ran on to ease debugging in distributed cluster environments.

       · bcbio-nextgen-commands.log Full command lines for all third party software tools run.

EXAMPLE PIPELINES

       We supply example input configuration files for validation and to  help  in  understanding
       the pipeline.

   Whole genome trio (50x)
       This input configuration runs whole genome variant calling using bwa, GATK HaplotypeCaller
       and FreeBayes. It uses a father/mother/child trio from the CEPH NA12878  family:  NA12891,
       NA12892,  NA12878.  Illumina's Platinum genomes project has 50X whole genome sequencing of
       the three members. The analysis compares results against a reference NA12878 callset  from
       NIST's Genome in a Bottle initiative.

       To run the analysis do:

          mkdir -p NA12878-trio-eval/config NA12878-trio-eval/input NA12878-trio-eval/work
          cd NA12878-trio-eval/config
          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-validate.yaml
          cd ../input
          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-validate-getdata.sh
          bash NA12878-trio-wgs-validate-getdata.sh
          cd ../work
          bcbio_nextgen.py ../config/NA12878-trio-wgs-validate.yaml -n 16

       This is a large whole genome analysis meant to test both pipeline scaling and validation
       across the entire genome. It can take multiple days to run depending on available cores.
       It requires 300GB for the input files and 1.3TB for the work directory. Smaller examples
       below exercise the pipeline with less disk and computational requirements.

       We  also  have  a  more  extensive  evaluation that includes 2 additional variant callers,
       Platypus and samtools, and 3 different methods of calling variants: single sample, pooled,
       and  incremental  joint  calling.  This  uses the same input data as above but a different
       input configuration file:

          mkdir -p NA12878-trio-eval/work_joint
          cd NA12878-trio-eval/config
          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-joint.yaml
          cd ../work_joint
          bcbio_nextgen.py ../config/NA12878-trio-wgs-joint.yaml -n 16

   Exome with validation against reference materials
       This example calls variants on NA12878 exomes from EdgeBio's clinical sequencing pipeline,
       and  compares  them against reference materials from NIST's Genome in a Bottle initiative.
       This supplies a full regression pipeline to ensure consistency of calling between releases
       and  updates  of  third  party  software. The pipeline performs alignment with bwa mem and
       variant calling with FreeBayes, GATK UnifiedGenotyper and GATK HaplotypeCaller. Finally it
       integrates all 3 variant calling approaches into a combined ensemble callset.

       This is a large full exome example with multiple variant callers, so can take more than 24
       hours on machines using multiple cores.

       First get the input configuration file, fastq  reads,  reference  materials  and  analysis
       regions:

          mkdir -p NA12878-exome-eval
          cd NA12878-exome-eval
          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-exome-methodcmp-getdata.sh
          bash NA12878-exome-methodcmp-getdata.sh

       Finally run the analysis, distributed on 8 local cores, with:

          cd work
          bcbio_nextgen.py ../config/NA12878-exome-methodcmp.yaml -n 8

       The grading-summary.csv contains detailed comparisons of the results to the NIST reference
       materials, enabling rapid comparisons of methods.

   Cancer tumor normal
       This example calls variants using multiple approaches  in  a  paired  tumor/normal  cancer
       sample  from the ICGC-TCGA DREAM challenge. It uses synthetic dataset 3 which has multiple
       subclones, enabling detection of lower frequency variants. Since  the  dataset  is  freely
       available and has a truth set, this allows us to do a full evaluation of variant callers.

       To get the data:

          mkdir -p cancer-dream-syn3/config cancer-dream-syn3/input cancer-dream-syn3/work
          cd cancer-dream-syn3/config
          wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/cancer-dream-syn3.yaml
          cd ../input
          wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/cancer-dream-syn3-getdata.sh
          bash cancer-dream-syn3-getdata.sh

       Run with:

          cd ../work
          bcbio_nextgen.py ../config/cancer-dream-syn3.yaml -n 8

       The configuration and data retrieval script include downloads for both exome-only and
       whole genome analyses. Exome is enabled by default, but you can use the larger whole
       genome evaluation by uncommenting the relevant parts of the configuration and retrieval
       script.

   Cancer-like mixture with Genome in a Bottle samples
       This  example  simulates  somatic cancer calling using a mixture of two Genome in a Bottle
       samples, NA12878 as the "tumor" mixed with NA24385 as the background.  The Hartwig Medical
       Foundation  and  Utrecht  Medical  Center  generated  this "tumor/normal" pair by physical
       mixing of samples prior to sequencing. The GiaB FTP directory  has  more  details  on  the
       design  and  truth sets.  The sample has variants at 15% and 30%, providing the ability to
       look at lower frequency mutations.

       To get the data:

          wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/cancer-giab-na12878-na24385-getdata.sh
          bash cancer-giab-na12878-na24385-getdata.sh

       Then run the analysis with:

          cd work
          bcbio_nextgen.py ../config/cancer-giab-na12878-na24385.yaml -n 16

   Structural variant calling -- whole genome NA12878 (50x)
       This example runs structural variant calling  with  multiple  callers  (Lumpy,  Manta  and
       CNVkit),  providing  a combined output summary file and validation metrics against NA12878
       deletions. It uses the same NA12878 input as the whole genome trio example.

       To run the analysis do:

          mkdir -p NA12878-sv-eval
          cd NA12878-sv-eval
          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-sv-getdata.sh
          bash NA12878-sv-getdata.sh
          cd work
          bcbio_nextgen.py ../config/NA12878-sv.yaml -n 16

       This is a large whole genome analysis; the timing and disk space requirements for the
       NA12878 trio analysis above apply here as well.

   RNAseq example
       This  example  aligns  and  creates  count  files for use with downstream analyses using a
       subset of the SEQC data from the FDA's Sequencing Quality Control project.

       Get the setup script and run it; this will download six samples from the SEQC project,
       three from the HBRR panel and three from the UHRR panel, requiring about 100GB of disk
       space for the input files. It will also set up a configuration file for the run using
       the templating system:

          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/rnaseq-seqc-getdata.sh
          bash rnaseq-seqc-getdata.sh

       Now go into the work directory and run the analysis:

          cd seqc/work
          bcbio_nextgen.py ../config/seqc.yaml -n 8

       This  will run a full scale RNAseq experiment using Tophat2 as the aligner and will take a
       long time to finish on a single machine. At the  end  it  will  output  counts,  Cufflinks
       quantitation  and  a  set  of  QC  results  about each lane. If you have a cluster you can
       parallelize it to speed it up considerably.

       A nice looking  standalone  report  of  the  bcbio-nextgen  run  can  be  generated  using
       bcbio.rnaseq. Check that repository for details.

   Human genome build 38
       Validate  variant  calling  on human genome build 38, using two different builds (with and
       without alternative alleles) and three different validation datasets (Genome in  a  Bottle
       prepared with two methods and Illumina platinum genomes).  To run:

          mkdir -p NA12878-hg38-val
          cd NA12878-hg38-val
          wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-hg38-validate-getdata.sh
          bash NA12878-hg38-validate-getdata.sh
          mkdir -p work
          cd work
          bcbio_nextgen.py ../config/NA12878-hg38-validate.yaml -n 16

   Whole genome (10x)
       An input configuration for running whole genome variant calling with bwa and GATK, using
       Illumina's Platinum genomes project (NA12878-illumina.yaml). See this blog post  on  whole
       genome  scaling for expected run times and more information about the pipeline. To run the
       analysis:

       · Create an input directory structure like:

            ├── config
            │   └── NA12878-illumina.yaml
            ├── input
            └── work

       · Retrieve inputs and comparison calls:

            cd input
            wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR091/ERR091571/ERR091571_1.fastq.gz
            wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR091/ERR091571/ERR091571_2.fastq.gz

       · Retrieve configuration input file:

            cd config
            wget https://raw.github.com/bcbio/bcbio-nextgen/master/config/examples/NA12878-illumina.yaml

       · Run analysis on 16 core machine:

            cd work
            bcbio_nextgen.py ../config/NA12878-illumina.yaml -n 16

       · Examine  summary  of  concordance  and  discordance  to  comparison   calls   from   the
         grading-summary.csv file in the work directory.

TEST SUITE

       The test suite exercises the scripts driving the analysis, so it is a good starting
       point to ensure correct installation. Tests use the pytest framework and are available
       in the bcbio source code:

          $ git clone https://github.com/bcbio/bcbio-nextgen.git

       There is a small wrapper script that finds py.test and the other dependencies
       pre-installed with bcbio; you can use it to run tests:

          $ cd tests
          $ ./run_tests.sh

       You can use this to run specific test targets:

          $ ./run_tests.sh cancer
          $ ./run_tests.sh rnaseq
          $ ./run_tests.sh devel
          $ ./run_tests.sh docker

       Optionally, you can run pytest directly from the bcbio install to tweak more  options.  It
       will  be in /path/to/bcbio/anaconda/bin/py.test. Pass -s to py.test to see the stdout log,
       and -v to make py.test output more verbose. The tests are marked with labels which you can
       use to run a specific subset of the tests using the -m argument:

          $ py.test -m rnaseq

       To run unit tests:

          $ py.test tests/unit

       To run integration pipeline tests:

          $ py.test tests/integration

       To run tests which use bcbio_vm:

          $ py.test tests/bcbio_vm

       To see the test coverage, add the --cov=bcbio argument to py.test.

       By  default the test suite will use your installed system configuration for running tests,
       substituting the test genome information instead of using full  genomes.  If  you  need  a
       specific   testing   environment,  copy  tests/data/automated/post_process-sample.yaml  to
       tests/data/automated/post_process.yaml to provide a test-only configuration.

       Two configuration files, in easy to write YAML format, specify details about  your  system
       and samples to run:

       · bcbio_sample.yaml  Details  about a set of samples to process, including input files and
         analysis options. You configure these for each set of samples to process. This  will  be
         the  main  file  prepared  for  each  sample  run  and  the  documentation below details
         techniques to help prepare them.

       · bcbio_system.yaml High level  information  about  the  system,  including  locations  of
         installed  programs like GATK and cores and memory usage (see tuning-cores). These apply
         across multiple runs. The automated installer creates a ready to go system configuration
         file  that  can  be  manually  edited  to  match the system. Find the file in the galaxy
         sub-directory within your installation data location (i.e.
         /usr/local/share/bcbio-nextgen/galaxy). To modify system parameters for a specific
         run, supply sample- or run-specific resources in your bcbio_sample.yaml file.

       Commented system and sample example files are  available  in  the  config  directory.  The
       example-pipelines section contains additional examples of ready to run sample files.

AUTOMATED SAMPLE CONFIGURATION

       bcbio-nextgen  provides a utility to create configuration files for multiple sample inputs
       using a base template. Start with one of the best-practice templates, or define your  own,
       then apply to multiple samples using the template workflow command:

          bcbio_nextgen.py -w template freebayes-variant project1.csv sample1.bam sample2_1.fq sample2_2.fq

       · freebayes-variant  is  the  name of the standard freebayes-variant.yaml input, which the
         script fetches from GitHub. This argument can also be a path  to  a  locally  customized
         YAML  configuration.  In  both  cases,  the script replicates the single sample template
         configuration to all input samples.

       · project1.csv is a comma separated value file containing  sample  metadata,  descriptions
         and algorithm tweaks:

            samplename,description,batch,phenotype,sex,variant_regions
            sample1,ERR256785,batch1,normal,female,/path/to/regions.bed
            sample2,ERR256786,batch1,tumor,,/path/to/regions.bed

         The first column links the metadata to a specific input file. The template command tries
         to identify the samplename from read group information in a BAM file, or uses  the  base
         filename  if  no  read  group  information  is present. For BAM files, this would be the
         filename without the extension and path (/path/to/yourfile.bam => yourfile).  For  FASTQ
         files, the template functionality will identify pairs using standard conventions (_1 and
         _2, including Illumina extensions like _R1), so use  the  base  filename  without  these
         (/path/to/yourfile_R1.fastq => yourfile). Note that paired-end samples sequentially
         numbered without leading zeros (e.g., sample_1_1.fastq, sample_1_2.fastq,
         sample_2_1.fastq, sample_2_2.fastq) will likely not be parsed correctly (see #1919
         for more info). In addition, . characters can be problematic, so it is best to avoid
         them except as the separator before the file extension.

         For  docs-cwl  inputs, the first samplename column should contain the base filename. For
         BAM     files,     this     is     your_file.bam.      For      fastqs      this      is
         your_file_R1.fastq.gz;your_file_R2.fastq.gz,   separating   individual   files   with  a
         semicolon.  By  putting  paths  to  the  actual  locations  of  the   inputs   in   your
         bcbio_system.yaml  input  when  generating  CWL,  you  can  easily move projects between
         different filesystems.
          The remaining columns can contain:

            · description Changes the sample description, originally supplied by the file name or
              BAM  read  group,  to  this value. You can also set the lane, although this is less
              often done as the default sequential numbering works here.  See  the  documentation
              for Samples from GEO or SRA on how these map to BAM read groups.

            · Algorithm  parameters  specific  for  this  sample.  If  the column name matches an
              available Algorithm  parameters,  then  this  value  substitutes  into  the  sample
              algorithm,  replacing  the  defaults  from the template.  You can also change other
              information in the BAM read group through the algorithm parameters.  See  Alignment
              configuration documentation for details on how these map to read group information.

            · Samples  from GEO or SRA metadata key/value pairs. Any columns not falling into the
              above cases will go into the metadata section. A ped specification will allow bcbio
              to  read  family, gender and phenotype information from a PED input file and use it
              for batch, sex and phenotype, respectively. The PED inputs  supplement  information
               from the standard template file, so if you specify a value in the template CSV
               the PED information will not overwrite it. Alternatively, ped fields can be
               specified directly in the metadata as columns. If family_id is specified it
               will be used as the family_id for that sample; otherwise batch will be used.
               The description column is used as the individual_id column and the phenotype
               column is used as the affected column in the PED format:

                  samplename,description,phenotype,batch,sex,ethnicity,maternal_id,paternal_id,family_id
                  NA12878.bam,NA12878,-9,CEPH,female,-9,NA12892,NA12891,NA12878FAM

          Individual column items can contain booleans (true or false), integers, or lists
          (separated by semicolons). These get converted into the expected type in the output
          YAML file. For instance, to specify a sample that should go into multiple batches:

            samplename,description,phenotype,batch
            normal.bam,two_normal,normal,Batch1;Batch2

         For dictionary inputs like somatic-w-germline-variants setups, you can separate items in
         a dictionary with colons and double colons, and also use semicolons for lists:

            samplename,description,phenotype,variantcaller
            tumor.bam,sample1,tumor,germline:freebayes;gatk-haplotype::somatic:vardict;freebayes

         The name of the metadata file, minus the .csv extension, is a short name identifying the
         current  project.  The  script  creates  a  project1  directory  containing  the  sample
         configuration in project1/config/project1.yaml.

       · The  remaining  arguments  are  input  BAM  or FASTQ files. The script pairs FASTQ files
         (identified by _1 and _2) and extracts sample names  from  input  BAMs,  populating  the
         files  and  description  field in the final configuration file. Specify the full path to
         sample files on your current machine.
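
       As an illustrative sketch (this helper is hypothetical, not part of bcbio), deriving
       the samplename key from a FASTQ path following the conventions above might look like:

```shell
# Hypothetical helper (not part of bcbio): derive the CSV samplename key
# from a FASTQ path by stripping the directory, the extension and a
# trailing _1/_2/_R1/_R2 pair marker, per the conventions described above.
samplename_from_fastq() {
    local base
    base=$(basename "$1")
    base=${base%.gz}
    base=${base%.fastq}
    base=${base%.fq}
    # drop a trailing read-pair marker such as _1, _2, _R1 or _R2
    echo "$base" | sed -E 's/_R?[12]$//'
}

samplename_from_fastq /path/to/yourfile_R1.fastq   # yourfile
samplename_from_fastq /data/sample2_1.fq.gz        # sample2
```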

       To make it easier to define your own project specific template, an optional first step  is
       to download and edit a local template. First retrieve a standard template:

          bcbio_nextgen.py -w template freebayes-variant project1

       This  pulls  the  current  GATK  best  practice variant calling template into your project
       directory in project1/config/project1-template.yaml. Manually edit  this  file  to  define
       your  options,  then  run  the  full  template creation for your samples, pointing to this
       custom configuration file:

          bcbio_nextgen.py -w template project1/config/project1-template.yaml project1.csv folder/*

       If your sample folder contains additional BAM or FASTQ files you do not wish to include in
       the  sample YAML configuration, you can restrict the output to only include samples in the
       metadata CSV with --only-metadata. The  output  will  print  warnings  about  samples  not
       present in the metadata file, then leave these out of the final output YAML:

          bcbio_nextgen.py -w template --only-metadata project1/config/project1-template.yaml project1.csv folder/*

MULTIPLE FILES PER SAMPLE

       In   case   you   have   multiple  FASTQ  or  BAM  files  for  each  sample  you  can  use
       bcbio_prepare_samples.py.  The main parameters are:

       · --out: the folder where the merged files will be

       · --csv: the CSV file, exactly the same as described previously, but with as many
         duplicate lines for each sample as there are files to merge:

            samplename,description,batch,phenotype,sex,variant_regions
            file1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
            file2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
            file1.fastq,sample2,batch1,tumor,,/path/to/regions.bed

       An example of usage is:

          bcbio_prepare_samples.py --out merged --csv project1.csv

       The script will create sample1.fastq and sample2.fastq in the merged folder, along with
       a new CSV file, project1-merged.csv, in the same folder as the input CSV. This merged
       CSV can then be used with bcbio:

          bcbio_nextgen.py -w template project1/config/project1-template.yaml project1-merged.csv merged/*fastq

       The new CSV file will look like:

          samplename,description,batch,phenotype,sex,variant_regions
          sample1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
          sample2.fastq,sample2,batch1,tumor,,/path/to/regions.bed

       It supports parallelization the same way bcbio_nextgen.py does:

          python $BCBIO_PATH/scripts/utils/bcbio_prepare_samples.py --out merged --csv project1.csv -t ipython -q queue_name -s lsf -n 1

       See more examples at parallelize pipeline.
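
       Conceptually, the merge step behaves like this minimal sketch (a hypothetical
       stand-in, not the actual bcbio_prepare_samples.py logic):

```shell
# Hypothetical sketch of the merge behavior: concatenate each sample's
# input files, in CSV order, into <outdir>/<samplename>.fastq.
merge_by_csv() {  # usage: merge_by_csv project.csv outdir
    local csv=$1 out=$2
    mkdir -p "$out"
    # skip the CSV header, then append each listed file to its sample's output
    tail -n +2 "$csv" | while IFS=, read -r file sample _; do
        cat "$file" >> "$out/${sample}.fastq"
    done
}
```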

       In case of paired reads, the CSV file should contain all files:

          samplename,description,batch,phenotype,sex,variant_regions
          file1_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
          file2_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
           file1_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
          file2_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed

       The script will try to guess the paired files the same way that bcbio_nextgen.py -w
       template does. It detects paired files when the only difference between two filenames
       is _R1/_R2, -1/-2, _1/_2 or .1/.2.

       The output CSV, which is compatible with bcbio, will look like:

          samplename,description,batch,phenotype,sex,variant_regions
          sample1,sample1,batch1,normal,female,/path/to/regions.bed
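
       The pair detection described above can be sketched as follows (a hypothetical
       illustration, not bcbio's actual implementation):

```shell
# Hypothetical sketch: two files are treated as a pair when their names
# differ only by a read marker (_R1/_R2, -1/-2, _1/_2 or .1/.2).
normalize() {
    # replace the first read-pair marker with a placeholder
    echo "$1" | sed -E 's/([_.-])R?[12](\.[^.]+)/\1RX\2/'
}
is_pair() {
    [ "$1" != "$2" ] && [ "$(normalize "$1")" = "$(normalize "$2")" ]
}

is_pair file1_R1.fastq file1_R2.fastq && echo paired      # paired
is_pair file1_R1.fastq file2_R1.fastq || echo not-paired  # not-paired
```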

SAMPLES FROM GEO OR SRA

       In  case  you  want  to  download  samples  from  GEO  or  SRA  repositories,  you can use
       bcbio_prepare_samples.py as well.

       You need to create your project.csv file like this:

          samplename,description
          GSMNNNNN,sample1
          GSMNNNNN,sample2
          SRRNNNNN,sample3

       The script will download all the files related to each sample and merge them  in  case  of
       multiple files.

SAMPLE INFORMATION

       The sample configuration file defines details of each sample to process:

          details:
            - analysis: variant2
              lane: 1
              description: Example1
              files: [in_pair_1.fq, in_pair_2.fq]
              genome_build: hg19
              algorithm:
                platform: illumina
              metadata:
                batch: Batch1
                sex: female
                platform_unit: flowcell-barcode.lane
                library: library_type

       · analysis Analysis method to use [variant2, RNA-seq, smallRNA-seq]

       · lane A unique number within the project. Corresponds to the ID parameter in the BAM read
         group.

       · description Unique name for this sample, corresponding to the SM parameter  in  the  BAM
         read group. Required.

       · files  A  list  of  files to process. This currently supports either a single end or two
         paired-end FASTQ files, or a single BAM file. It does not yet handle merging  BAM  files
         or more complicated inputs.

       · genome_build  Genome  build  to align to, which references a genome keyword in Galaxy to
         find location build files.

       · algorithm Parameters to configure algorithm inputs. Options  described  in  more  detail
         below:

         · platform Sequencing platform used. Corresponds to the PL parameter in BAM read groups.
           Optional, defaults to illumina.

       · metadata Additional descriptive metadata about the sample:

            · batch defines a group that the sample falls in.  We  perform  multi-sample  variant
              calling  on all samples with the same batch name. This can also be a list, allowing
              specification of a single normal sample to pair  with  multiple  tumor  samples  in
              paired cancer variant calling (batch: [MatchWithTumor1, MatchWithTumor2]).

            · sex specifies the sample gender used to correctly prepare X/Y chromosomes. Use male
              and female or PED style inputs (1=male, 2=female).

            · phenotype stratifies cancer samples into tumor and  normal  or  case/controls  into
              affected  and  unaffected.  Also  accepts  PED  style specifications (1=unaffected,
              2=affected). CNVkit uses case/control status to determine  how  to  set  background
              samples for CNV calling.

            · disease  identifies  a  specific  disease  name  for  the  sample.  Used along with
              svprioritize to help identify gene  regions  for  reporting  during  analysis  with
              heterogeneity  callers  like  PureCN  and  TitanCNA.  This  is primarily for cancer
              studies and you can narrow genes by disease  using  inputs  like  lung,  breast  or
              pancreatic for different cancer types.

            · prep_method  A  free  text  description  of the method used in sample prep. Used to
              group together samples during CNV calling for background.  This is not required and
              when not present bcbio assumes all samples in an analysis use the same method.

            · svclass  defines  a  classification for a sample for use in SV case/control setups.
              When set as  control  will  put  samples  into  the  background  samples  used  for
              normalization.

            · ped   provides  a  PED  phenotype  file  containing  sample  phenotype  and  family
              information. Template creation uses this to supplement  batch,  sex  and  phenotype
              information  provided  in  the  template CSV. GEMINI database creation uses the PED
              file as input.

            · platform_unit -- Unique identifier for sample. Optional, defaults to  lane  if  not
              specified.

            · library -- Name of library preparation used. Optional, empty if not present.

            · validate_batch  --  Specify  a  batch  name to group samples together for preparing
              validation plots. This is useful  if  you  want  to  process  samples  in  specific
              batches, but include multiple batches into the same validation plot.

            · validate_combine  --  Specify  a  batch  name  to  combine multiple samples into an
              additional validation summary. Useful  for  larger  numbers  of  small  samples  to
              evaluate together.
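
       Pulling several of these options together, a matched tumor/normal pair (hypothetical
       sample names; a sketch, not a complete configuration) could be described as:

```yaml
# Sketch with hypothetical sample names: a tumor/normal pair grouped
# by batch for paired somatic calling.
details:
  - description: patient1-normal
    metadata:
      batch: patient1
      phenotype: normal
      sex: female
  - description: patient1-tumor
    metadata:
      batch: patient1
      phenotype: tumor
      sex: female
```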

UPLOAD

       The  upload  section  of  the  sample  configuration file describes where to put the final
       output files of the pipeline. At its simplest, you can configure bcbio-nextgen  to  upload
       results  to  a  local  directory,  for  example a folder shared amongst collaborators or a
       Dropbox account. You can also configure it to upload results  automatically  to  a  Galaxy
       instance,  to  Amazon  S3  or to iRODS. Here is the simplest configuration, uploading to a
       local directory:

          upload:
            dir: /local/filesystem/directory

       General parameters, always required:

       · method Upload method to employ. Defaults to local filesystem.  [filesystem, galaxy,  s3,
         irods]

       · dir Local filesystem directory to copy to.

       Galaxy parameters:

       · galaxy_url  URL  of  the  Galaxy  instance  to upload to. Upload assumes you are able to
         access a shared directory also present on the Galaxy machine.

       · galaxy_api_key User API key to access Galaxy: see the Galaxy API documentation.

       · galaxy_library Name of the Galaxy Data Library  to  upload  to.  You  can  specify  this
         globally  for  a  project  in  upload  or  for  individual samples in the sample details
         section.

       · galaxy_role Specific Galaxy access roles to assign to the  uploaded  datasets.  This  is
         optional  and will default to the access of the parent data library if not supplied. You
         can specify this globally for a project in upload  or  for  individual  samples  in  the
         sample details section. The Galaxy Admin documentation has more details about roles.

       Here is an example configuration for uploading to a Galaxy instance. This assumes you have
       a shared mounted filesystem that your Galaxy instance can also access:

          upload:
            method: galaxy
            dir: /path/to/shared/galaxy/filesystem/folder
            galaxy_url: http://url-to-galaxy-instance
            galaxy_api_key: YOURAPIKEY
            galaxy_library: data_library_to_upload_to

       Your Galaxy universe_wsgi.ini configuration needs to have allow_library_path_paste =  True
       set to enable uploads.

       S3 parameters:

       · bucket AWS bucket to direct output.

       · folder A folder path within the AWS bucket to prefix the output.

       · region AWS region name to use. Defaults to us-east-1

       · reduced_redundancy Flag to determine if we should store S3 data with reduced redundancy:
         cheaper but less reliable [false, true]

       For S3 access credentials, set the standard  environmental  variables,  AWS_ACCESS_KEY_ID,
       AWS_SECRET_ACCESS_KEY,  and  AWS_DEFAULT_REGION  or  use IAM access roles with an instance
       profile on EC2 to give your instances permission to create temporary S3 access.
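
       For example, the credential environment variables can be exported before launching
       bcbio (placeholder values shown; substitute your own keys):

```shell
# Placeholder credentials -- replace with your own AWS keys.
export AWS_ACCESS_KEY_ID=AKIAEXAMPLE
export AWS_SECRET_ACCESS_KEY=examplesecretkey
export AWS_DEFAULT_REGION=us-east-1
```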

       iRODS parameters:

       · folder Full directory name within iRODS to prefix the output.

       · resource (optional) iRODS resource name, if other than default.

       Example configuration:

          upload:
            method: irods
            dir: ../final
            folder: /irodsZone/your/path/
            resource: yourResourceName

       Uploads  to iRODS depend on a valid installation of the iCommands CLI, and a preconfigured
       connection through the iinit command.

GLOBALS

       You can define files used multiple times in the algorithm section of your configuration in
       a  top  level  globals dictionary. This saves copying and pasting across the configuration
       and makes it easier to manually adjust the configuration if inputs change:

          globals:
            my_custom_locations: /path/to/file.bed
          details:
            - description: sample1
              algorithm:
                variant_regions: my_custom_locations
            - description: sample2
              algorithm:
                variant_regions: my_custom_locations

ALGORITHM PARAMETERS

       The YAML configuration file provides a number of hooks to customize analysis in the sample
       configuration file. Place these under the algorithm keyword.

   Alignment
       · platform  Sequencing  platform used. Corresponds to the PL parameter in BAM read groups.
         Default 'Illumina'.

       · aligner Aligner to use: [bwa, bowtie, bowtie2, hisat2, minimap2, novoalign, snap,  star,
         tophat2,  false]  To  use pre-aligned BAM files as inputs to the pipeline, set to false,
         which will also skip duplicate marking by default.  Using  pre-aligned  inputs  requires
         proper  assignment  of  BAM  read  groups  and sorting. The bam_clean argument can often
         resolve issues with problematic input BAMs.

       · bam_clean Clean an input BAM when skipping the alignment step. This handles adding
         read groups, sorting to a reference genome and filtering out records that cause
         problems with GATK. Options:

            · remove_extracontigs -- Remove non-standard chromosomes (for human, anything that is
              not  chr1-22,X,Y)  from  the  BAM  file.  This  allows  compatibility  when the BAM
              reference genome has different contigs  from  the  reference  file  but  consistent
              ordering  for  standard chromosomes.  Also fixes the read groups in the BAM file as
              in fixrg. This is faster than the full picard cleaning option.

            · fixrg -- only  adjust  read  groups,  assuming  everything  else  in  BAM  file  is
              compatible.

            · picard  --  Picard/GATK  based  cleaning.  Includes  read  group changes, fixing of
              problematic reads and re-ordering chromosome order to match the  reference  genome.
              To  fix  misencoded  input  BAMs  with  non-standard  scores, set quality_format to
              illumina.

       · bam_sort Allow sorting of input BAMs when skipping the alignment step (aligner set to
         false). Options are coordinate or queryname. Additional processing through standard
         pipelines requires coordinate sorted inputs. The default is to not do additional
         sorting and assume pre-sorted BAMs.

       · disambiguate For mixed or explant samples, provide a list of genome_build identifiers to
         check and remove from alignment. Currently supports  cleaning  a  single  organism.  For
         example,  with  genome_build:  hg19  and disambiguate: [mm10], it will align to hg19 and
         mm10, run disambiguation and discard reads confidently aligned to  mm10  and  not  hg19.
         Affects  fusion  detection  when star is chosen as the aligner. Aligner must be set to a
         non false value for this to run.

       · align_split_size: Increase parallelization of alignment. As of 0.9.8, bcbio will try  to
         determine  a  useful  parameter and you don't need to set this.  If you manually set it,
         bcbio will respect your specification. Set to false to avoid splitting entirely. If set,
         this  defines  the  number  of  records to feed into each independent parallel step (for
         example, 5000000 = 5 million reads per chunk). It  converts  the  original  inputs  into
         bgzip  grabix  indexed  FASTQ  files,  and then retrieves chunks for parallel alignment.
         Following alignment, it combines all chunks back into the final merged  alignment  file.
         This  allows  parallelization  at  the  cost  of additional work of preparing inputs and
         combining split outputs. The tradeoff makes sense when you have large files and lots  of
         distributed compute. When you have fewer large multicore machines this parameter may not
         help speed up processing.

       · quality_format Quality format of FASTQ or BAM inputs [standard, illumina]

       · strandedness For RNA-seq  libraries,  if  your  library  is  strand  specific,  set  the
         appropriate  flag from [unstranded, firststrand, secondstrand].  Defaults to unstranded.
         For dUTP marked libraries, firststrand is correct;  for  Scriptseq  prepared  libraries,
         secondstrand is correct.

       · save_diskspace  Remove  align prepped bgzip and split BAM files after merging into final
         BAMs.  Helps  reduce  space  on   limited   filesystems   during   a   run.   tools_off:
         [upload_alignment] may also be useful in conjunction with this. [false, true]
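
       As a sketch, an algorithm section that feeds a pre-aligned BAM through the pipeline,
       fixing read groups only rather than realigning, might look like:

```yaml
# Sketch: use a pre-aligned BAM as input, adjusting read groups only.
algorithm:
  aligner: false
  bam_clean: fixrg
```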

   Read trimming
       · trim_reads Trims low quality or adapter sequences at the ends of reads using atropos.
         adapters and custom_trim specify the sequences to trim. For RNA-seq, it's recommended
         to leave this as False unless running Tophat2. For variant calling, we recommend
         trimming only in special cases where standard soft-clipping does not resolve false
         positive problems. Supports trimming with atropos
         (https://github.com/jdidion/atropos) or fastp. fastp is currently not compatible with
         alignment splitting in variant calling and requires align_split_size: false. The old
         parameter read_through defaults to using atropos trimming. [False, atropos, fastp].
         Defaults to False.

       · adapters If trimming adapter read through, trim a set of stock adapter sequences. Allows
         specification of multiple items in a list, for example [truseq, polya]  will  trim  both
         TruSeq  adapter  sequences  and  polyA  tails.  polyg  trimming  removes  high quality G
         stretches present in NovaSeq and NextSeq data. In the small RNA pipeline, bcbio will try
         to  detect  the  adapter  using DNApi. If you set up this parameter, then bcbio will use
         this value instead.  Choices: [truseq, illumina, nextera, polya, polyx, polyg, nextera2,
         truseq2].

         · nextera2: Illumina NEXTera DNA prep kit from NEB

         · truseq2: SMARTer Universal Low Input RNA Kit

       · custom_trim  A  list of sequences to trim from the end of reads, for example: [AAAATTTT,
         GGGGCCCC]

       · min_read_length Minimum read length  to  maintain  when  read_through  trimming  set  in
         trim_reads. Defaults to 25.

       · trim_ends Specify values for trimming at ends of reads, using a fast approach built into
         fastq preparation. This does not do quality or adapter trimming but will quickly cut off
         a  defined  set  of  values  from either the 5' or 3' end of the first and second reads.
         Expects a list of 4 values: [5' trim read1, 3'  trim  read1,  5'  trim  read2,  3'  trim
         read2]. Set values to 0 if you don't need trimming (i.e. [6, 0, 12, 0] will trim 6bp
         from the start of read 1 and 12bp from the start of read 2). Only implemented for
         variant calling pipelines.

   Alignment postprocessing
       · mark_duplicates  Mark  duplicated  reads [true, false].  If true, will perform streaming
         duplicate marking with biobambam's bammarkduplicates or bamsormadup. Uses samblaster
         as an alternative if you have paired reads and specify lumpy as an svcaller. Defaults to
         true for variant calling and false for RNA-seq and small RNA analyses. Also defaults  to
         false if you're not doing alignment (aligner: false).

       · recalibrate  Perform base quality score recalibration on the aligned BAM file, adjusting
         quality scores to reflect alignments and known variants. Supports both GATK and Sentieon
         recalibration.  Defaults to false, no recalibration. [false, gatk, sentieon]

       · realign Perform GATK's realignment around indels on the aligned BAM file. Defaults to no
         realignment since realigning callers like FreeBayes and GATK HaplotypeCaller handle this
         as part of the calling process. [false, gatk]
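         Taken together, a typical alignment postprocessing block for germline variant calling
         uses the defaults described above:

            mark_duplicates: true
            recalibrate: false
            realign: false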

   Coverage information
       · coverage_interval  Regions  covered  by  sequencing. bcbio calculates this automatically
         from alignment coverage information, so you  only  need  to  specify  it  in  the  input
         configuration if you have specific needs or bcbio does not determine coverage correctly.
         genome specifies full genome sequencing, regional identifies  partial-genome  pull  down
         sequencing  like  exome  analyses,  and  amplicon  is partial-genome sequencing from PCR
         amplicon sequencing. This influences GATK options for filtering: we use Variant  Quality
         Score  Recalibration  when  set to genome, otherwise we apply cutoff-based soft filters.
         Also affects copy number calling with CNVkit, structural variant calling and deep  panel
         calling  in  cancer  samples,  where  we  tune  regional/amplicon  analyses  to maximize
         sensitivity.  [genome, regional, amplicon]

       · maxcov_downsample bcbio downsamples whole genome runs with >10x average  coverage  to  a
         maximum  coverage, avoiding slow runtimes in collapsed repeats and poly-A/T/G/C regions.
          This parameter specifies the multiplier of average coverage to downsample at. For
          example, 200 downsamples at 6000x coverage for a 30x whole genome. Set to false or 0 to
          disable downsampling. Currently defaults to false pending runtime improvements.

       · coverage_depth_min Minimum depth of coverage. When calculating regions to call in, bcbio
         may  exclude regions with less than this many reads. It is not a hard filter for variant
         calling, but rather a guideline for determining callable regions. It's primarily  useful
         when  trying to call on very low depth samples. Defaults to 4. Setting lower than 4 will
         trigger low-depth calling options for GATK.

   Analysis regions
       These  BED  files  define  the  regions  of  the  genome  to  analyze   and   report   on.
       variant_regions  adjusts  regions  for  small variant (SNP and indel) calling.  sv_regions
       defines regions for structural variant calling  if  different  than  variant_regions.  For
       coverage-based  quality  control  metrics,  we  first  use  coverage  if  specified,  then
        sv_regions if specified, then variant_regions. See the section on Input file preparation
        for tips on ensuring chromosome naming in these files matches your reference genome. bcbio
       pre-installs some standard BED files for human analyses. Reference these using the  naming
       schemes described in the reference data repository.

       · variant_regions BED file of regions to call variants in.

       · sv_regions -- A specification of regions to target during structural variant calling. By
         default, bcbio  uses  regions  specified  in  variant_regions  but  this  allows  custom
         specification  for  structural  variant  calling. This can be a pointer to a BED file or
         special inputs: exons for only exon regions, transcripts for transcript regions (the min
         start  and  max  end  of exons) or transcriptsXXXX for transcripts plus a window of XXXX
         size  around  it.  The  size  can  be  an  integer  (transcripts1000)   or   exponential
         (transcripts1e5). This applies to CNVkit and heterogeneity analysis.

       · coverage  A  BED file of regions to check for coverage and completeness in QC reporting.
         This can also be a shorthand for a BED file installed by bcbio (see  Structural  variant
         calling for options).
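        A sketch of an analysis regions block combining these options (the BED file paths are
        placeholders):

           variant_regions: /path/to/target_regions.bed
           sv_regions: transcripts1000
           coverage: /path/to/coverage_regions.bed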

       · exclude_regions  List of regions to remove as part of analysis. This allows avoidance of
         slow and potentially misleading regions. This is a list of the following options:

            · polyx Avoid calling variants in regions of single nucleotide stretches greater than
              50.  These  can  contribute  to  long variant calling runtimes when errors in polyX
              stretches align in high depth to these regions and take a lot of work  to  resolve.
              Since  we don't expect decent resolution through these types of repeats, this helps
              avoid extra calculations for  assessing  the  noise.  This  is  an  alternative  to
              trimming polyX from the 3' ends for reads with trim_reads and adapters. Requires an
              organism with a defined polyx file in  genome  resources.  For  structural  variant
              calling,  adding  polyx  avoids  calling  small  indels  for Manta, where these can
              contribute to long runtimes.

            · lcr Avoid calling variants in low complexity regions  (LCRs).   Heng  Li's  variant
              artifacts  paper  provides  these  regions,  which  cover  ~2%  of  the  genome but
              contribute to a large fraction of  problematic  calls  due  to  the  difficulty  of
              resolving  variants  in repetitive regions. Removal can help facilitate comparisons
              between methods and reduce false positives if you don't need calls in LCRs for your
              biological  analysis.  Requires  an  organism  with  a  defined  lcr file in genome
              resources.

            · highdepth Remove high depth regions during variant calling, identified by collapsed
              repeats  around  centromeres  in  hg19  and  GRCh37  as characterized in the ENCODE
              blacklist.  This is on by default for VarDict and FreeBayes whole genome calling to
              help  with  slow runtimes in these regions, and also on for whole genome structural
              variant calling to avoid false positives from high depth repeats.

            · altcontigs Skip calling entirely in alternative and unplaced contigs.  This  limits
              analysis  to standard chromosomes -- chr1-22,X,Y,MT for human -- to avoid slowdowns
              on the additional contigs.
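          For example, to skip low complexity regions and polyX stretches during calling:

             exclude_regions: [polyx, lcr]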

   Variant calling
       · variantcaller Variant calling algorithm. Can be a list of multiple options or  false  to
         skip  [false, freebayes, gatk-haplotype, haplotyper, platypus, mutect, mutect2, scalpel,
         tnhaplotyper, tnscope, vardict, varscan, samtools, gatk]

            · Paired (typically somatic, tumor-normal) variant calling is currently supported  by
              vardict,  freebayes, mutect2, mutect (see disclaimer below), scalpel (indels only),
              tnhaplotyper  (Sentieon),  tnscope  (Sentieon)  and  varscan.  See   the   pipeline
              documentation on cancer-calling for details on pairing tumor and normal samples.

            · You  can  generate  both somatic and germline calls for paired tumor-normal samples
              using  different  sets  of  callers.  The   pipeline   documentation   on   calling
              somatic-w-germline-variants details how to do this.

            · mutect,  a SNP-only caller, can be combined with indels from scalpel or sid. Mutect
              operates in both tumor-normal and tumor-only modes.  In tumor-only mode the  indels
              from scalpel will reflect all indels in the sample, as there is currently no way of
              separating the germline from somatic indels in tumor-only mode.

       · indelcaller For the MuTect SNP-only variant caller it is possible to add calls from an
         indel caller such as scalpel, pindel or somatic indel detector (for Appistry MuTect
         users only). Currently an experimental option that adds these indel calls to MuTect's
         SNP-only output. Only one caller supported. Omit to ignore. [scalpel, pindel, sid,
         false]
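         For example, pairing MuTect SNP calls with scalpel indel calls:

            variantcaller: mutect
            indelcaller: scalpel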

       · jointcaller  Joint  calling  algorithm,  combining  variants  called  with the specified
         variantcaller. Can be a list of multiple options but needs  to  match  with  appropriate
         variantcaller.  Joint  calling  is  only  needed  for  larger  input  sample sizes (>100
         samples), otherwise use standard pooled population-calling:

             · gatk-haplotype-joint GATK incremental joint discovery with HaplotypeCaller. Takes
               individual gVCFs called by gatk-haplotype and performs combined genotyping.

            · freebayes-joint Combine freebayes calls using bcbio.variation.recall with recalling
              at all positions found  in  each  individual  sample.  Requires  freebayes  variant
              calling.

            · platypus-joint  Combine  platypus  calls using bcbio.variation.recall with squaring
              off at all positions found in each individual  sample.  Requires  platypus  variant
              calling.

            · samtools-joint  Combine  samtools  calls using bcbio.variation.recall with squaring
              off at all positions found in each individual  sample.  Requires  samtools  variant
              calling.

       · joint_group_size  Specify the maximum number of gVCF samples to feed into joint calling.
         Currently applies to GATK  HaplotypeCaller  joint  calling  and  defaults  to  the  GATK
         recommendation  of  200.  Larger  numbers  of  samples  will first get combined prior to
         genotyping.
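        For example, to run GATK joint calling on a large population, pair the joint caller
        with its matching variant caller (joint_group_size shown at its default):

           variantcaller: gatk-haplotype
           jointcaller: gatk-haplotype-joint
           joint_group_size: 200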

       · ploidy Ploidy of called reads. Defaults to 2 (diploid). You  can  also  tweak  specialty
         ploidy like mitochondrial calling by setting ploidy as a dictionary. The defaults are:

            ploidy:
              default: 2
              mitochondrial: 1
              female: 2
              male: 1

       · background  Provide  pre-calculated files to use as backgrounds for different processes.
         Organized as a dictionary with individual keys for different components of the pipeline.
         You can enter as many or few as needed:

            · variant  A  VCF  file with variants to use as a background reference during variant
              calling. For tumor/normal paired calling use this  to  supply  a  panel  of  normal
              individuals.

             · cnv_reference Background reference file for copy number calling. This can be
               either a single file for one CNV method or a dictionary for multiple methods.
               Supports CNVkit cnn inputs, GATK4 HDF5 panels of normals
               (https://software.broadinstitute.org/gatk/documentation/article?id=11682) and
               seq2c combined mapping plus coverage files:

                  background:
                    cnv_reference:
                      cnvkit: /path/to/background.cnn
                      gatk-cnv: /path/to/background_pon.hdf5
                      seq2c: /path/to/background.tsv

   Somatic variant calling
       · min_allele_fraction Minimum allele fraction to detect variants in heterogeneous tumor
         samples, set as the float or integer percentage to resolve (i.e. 10 = alleles in 10% of
         the sample). Defaults to 10. Specify this in the tumor sample of a tumor/normal pair.
         Note this is a percentage, not a ratio; it is divided by 100.0 when passed to VarDict.
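         For example, to resolve variants down to 5% allele frequency in a deeply sequenced
         tumor sample:

            min_allele_fraction: 5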

   Variant annotation
       · effects Method used to calculate expected variant effects; defaults to  snpEff.  Ensembl
         variant   effect   predictor   (VEP)   is   also   available   when   downloaded   using
         datatarget-install. [snpeff, vep, false]

       · effects_transcripts Define the transcripts to  use  for  effect  prediction  annotation.
         Options are all: Standard Ensembl transcript list (the default); canonical: Report
         single canonical transcripts (-canon in snpEff, -pick in VEP); canonical_cancer:
         Canonical transcripts with hand curated changes for more common cancer transcripts
         (snpEff only).

       · vcfanno Configuration files for vcfanno, allowing the application of additional
         annotations to variant calls. By default, bcbio will try to apply:

            · gemini  --  External population level annotations from GEMINI. This is only run for
              human samples with gemini data installed (datatarget-install).

             · somatic -- Somatic annotations from COSMIC, ClinVar and friends. COSMIC needs a
               custom installation within bcbio (datatarget-install). Only added for tumor or
              tumor/normal somatic calling.

            · rnaedit -- RNA editing sites for RNA-seq variant calling runs.

         bcbio installs pre-prepared configuration files in genomes/build/config/vcfanno  or  you
         can specify the full path to a /path/your/anns.conf and optionally an equivalently named
         /path/your/anns.lua file. This value can be a list for multiple inputs.
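          For example, to combine a built-in configuration with a custom annotation file:

             vcfanno: [somatic, /path/your/anns.conf]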

   Structural variant calling
       · svcaller -- List of structural variant callers to use. [lumpy, manta, cnvkit,  gatk-cnv,
         seq2c,  purecn,  titancna, delly, battenberg]. LUMPY and Manta require paired end reads.
         cnvkit and gatk-cnv  should  not  be  used  on  the  same  sample  due  to  incompatible
         normalization approaches, please pick one or the other for CNV calling.
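         For example, to run both structural variant and copy number detection on a sample:

            svcaller: [manta, cnvkit]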

       · svprioritize  --  Produce a tab separated summary file of structural variants in regions
         of interest. This complements  the  full  VCF  files  of  structural  variant  calls  to
         highlight  changes in known genes. See the paper on cancer genome prioritization for the
         full details. This can be either the path to a BED file (with chrom start end gene_name,
         see  Input  file  preparation)  or  the  name of one of the pre-installed prioritization
         files:

            · cancer/civic (hg19, GRCh37, hg38) -- Known cancer associated genes from CIViC.

            · cancer/az300 (hg19, GRCh37, hg38) -- 300 cancer  associated  genes  contributed  by
              AstraZeneca oncology.

            · cancer/az-cancer-panel  (hg19,  GRCh37,  hg38)  --  A  text  file  of  genes in the
              AstraZeneca cancer panel. This is only usable for svprioritize  which  can  take  a
              list of gene names instead of a BED file.

            · actionable/ACMG56  --  Medically  actionable genes from the The American College of
              Medical Genetics and Genomics

            · coding/ccds (hg38) -- Consensus CDS (CCDS) regions  with  2bps  added  to  internal
              introns  to capture canonical splice acceptor/donor sites, and multiple transcripts
              from a single gene merged into a single all inclusive gene entry.
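        For example, to highlight structural variants in known cancer associated genes:

           svprioritize: cancer/civic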

       · fusion_mode Enable fusion detection in RNA-seq when using STAR (recommended)  or  Tophat
         (not  recommended)  as  the  aligner.  OncoFuse  is  used  to  summarise the fusions but
         currently only supports hg19  and  GRCh37.  For  explant  samples  disambiguate  enables
         disambiguation  of  STAR  output  [false,  true].  This option is deprecated in favor of
         fusion_caller.

       · fusion_caller Specify a standalone fusion caller for fusion mode. Supports oncofuse  for
          STAR/tophat runs, pizzly and ericscript for all runs. If a standalone caller is
          specified (i.e. pizzly or ericscript), fusion detection will not be performed with the
          aligner. oncofuse only supports human genome builds GRCh37 and hg19. ericscript supports
         human genome builds GRCh37,  hg19  and  hg38  after  installing  the  associated  fusion
         databases (datatarget-install).

   HLA typing
       · hlacaller  --  Perform  identification  of  highly  polymorphic HLAs with human build 38
          (hg38). The recommended option is optitype, using the OptiType caller. Also supports
          using the bwa HLA typing implementation with bwakit.

   Validation
       bcbio  pre-installs  standard truth sets for performing validation, and also allows use of
       custom local files for assessing reliability of your runs:

       · validate A VCF file of expected variant calls to perform validation and grading of small
         variants  (SNPs  and  indels)  from  the  pipeline.  This provides a mechanism to ensure
         consistency of calls  against  a  known  set  of  variants,  supporting  comparisons  to
         genotyping array data or reference materials.

       · validate_regions  A BED file of regions to evaluate small variant calls in. This defines
         specific regions covered by the validate VCF  file.

       · svvalidate -- Dictionary of call types and pointer to BED file  of  known  regions.  For
         example:  DEL: known_deletions.bed does deletion based validation of outputs against the
         BED file.

       Each option can be either the path to a local file, or a partial path to  a  file  in  the
       pre-installed truth sets. For instance, to validate an NA12878 run against the Genome in a
       Bottle truth set:

          validate: giab-NA12878/truth_small_variants.vcf.gz
          validate_regions: giab-NA12878/truth_regions.bed
          svvalidate:
            DEL: giab-NA12878/truth_DEL.bed

        Follow the same naming schemes for small variants, regions and different structural
        variant types. bcbio has the following validation materials for germline validations:

       · giab-NA12878  --   Genome  in  a  Bottle  for  NA12878, a Caucasian sample.  Truth sets:
         small_variants, regions, DEL; Builds: GRCh37, hg19, hg38

       · giab-NA24385 --  Genome in a Bottle for NA24385, an  Ashkenazic  Jewish  sample.   Truth
         sets: small_variants, regions; Builds: GRCh37, hg19, hg38

       · giab-NA24631  --   Genome  in  a  Bottle  for  NA24631,  a  Chinese sample.  Truth sets:
         small_variants, regions; Builds: GRCh37, hg19, hg38

       · giab-NA12878-crossmap --  Genome  in  a  Bottle  for  NA12878  converted  to  hg38  with
         CrossMap. Truth sets: small_variants, regions, DEL; Builds: hg38

       · giab-NA12878-remap  --   Genome  in  a  Bottle for NA12878 converted to hg38 with Remap.
         Truth sets: small_variants, regions, DEL; Builds: hg38

       · platinum-genome-NA12878  --  Illumina  Platinum  Genome   for   NA12878.   Truth   sets:
         small_variants, regions; Builds: hg19, hg38

       and for cancer validations:

       · giab-NA12878-NA24385-somatic   --   A  sequenced  NA12878/NA24385  mixture  providing  a
          somatic-like truth set for detecting low frequency events. Truth sets:
          small_variants, regions. Builds: GRCh37, hg38

       · dream-syn3  --  Synthetic dataset 3 from the ICGC-TCGA DREAM mutation calling challenge.
         Truth sets: small_variants, regions, DEL, DUP, INV, INS. Builds: GRCh37.

       · dream-syn4 -- Synthetic dataset 4 from the ICGC-TCGA DREAM mutation  calling  challenge.
         Truth sets: small_variants, regions, DEL, DUP, INV. Builds: GRCh37.

       · dream-syn3-crossmap  --  Synthetic  dataset  3 from the ICGC-TCGA DREAM mutation calling
         challenge  converted  to  human  build  38  coordinates  with  CrossMap.   Truth   sets:
         small_variants, regions, DEL, DUP, INV, INS. Builds: hg38.

       · dream-syn4-crossmap  --  Synthetic  dataset  4 from the ICGC-TCGA DREAM mutation calling
         challenge  converted  to  human  build  38  coordinates  with  CrossMap.   Truth   sets:
         small_variants, regions, DEL, DUP, INV. Builds: hg38.

       For more information on the hg38 truth set preparation see the work on validation on build
       38 and conversion of human build 37 truth sets  to  build  38.  The  installation  recipes
       contain provenance details about the origins of the installed files.

   UMIs
       Unique  molecular  identifiers  (UMIs)  are  short  random barcodes used to tag individual
        molecules and avoid amplification bias. Both single cell RNA-seq and variant calling
       support UMIs. For variant calling, fgbio collapses sequencing duplicates for each UMI into
       a single consensus read prior to running re-alignment and variant calling.  This  requires
       mark_duplicates:  true  (the default) since it uses position based duplicates and UMI tags
       for collapsing duplicate reads into consensus sequences.

       To   help   with   preparing   fastq   files   with   UMIs   bcbio   provides   a   script
       bcbio_fastq_umi_prep.py. This handles two kinds of UMI barcodes:

        · Separate UMIs: converts reads output by an Illumina sequencer as 3 files (read 1,
          read 2, and UMIs).

       · Duplex barcodes with tags incorporated at the 5' end of read 1 and read 2

       In both cases, these get converted into  paired  reads  with  UMIs  in  the  fastq  names,
       allowing  specification  of  umi_type:  fastq_name  in  your bcbio YAML configuration. The
       script runs on a single set of files or autopairs an entire directory of fastq  files.  To
       convert a directory with separate UMI files:

          bcbio_fastq_umi_prep.py autopair -c <cores_to_use> <list> <of> <fastq> <files>

       To convert duplex barcodes present on the ends of read 1 and read 2:

          bcbio_fastq_umi_prep.py autopair -c <cores_to_use> --tag1 5 --tag2 5 <list> <of> <fastq> <files>

       Configuration options for UMIs:

       · umi_type  The  UMI/cellular  barcode  scheme  used  for  your  data. For single cell RNA
         sequencing, supports [harvard-indrop, harvard-indrop-v2, 10x_v2, icell8, surecell].  For
         variant  analysis with UMI based consensus calling, supports either fastq_name with UMIs
         in read names or the path to a fastq file with UMIs for each aligned read.
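          For example, after preparing reads with bcbio_fastq_umi_prep.py:

             umi_type: fastq_name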

        · correct_umis: [path/to/whitelist_umi.txt]. For a restricted set of UMIs, specify a
          text file (one UMI per line). UMIs will be corrected with fgbio CorrectUmis
          (http://fulcrumgenomics.github.io/fgbio/tools/latest/CorrectUmis.html).

       You can adjust the fgbio default options by adjusting Resources. The most common change is
        modifying the minimum number of reads used as input to consensus sequences. This
        defaults to 1 to avoid losing reads but you can set it to larger values for high depth
        panels:

          resources:
            fgbio:
              options: [--min-reads, 2]

   RNA sequencing
       · transcript_assembler If set, will assemble novel genes and  transcripts  and  merge  the
         results  into  the  known  annotation.  Can have multiple values set in a list. Supports
         ['cufflinks', 'stringtie'].

       · transcriptome_align If set to True, will also align reads to just the transcriptome, for
         use with EBSeq and others.

       · expression_caller   A  list  of  optional  expression  callers  to  turn  on.   Supports
         ['cufflinks', 'express', 'stringtie',  'sailfish',  'dexseq',  'kallisto'].  Salmon  and
         count based expression estimation are run by default.

       · fusion_caller A list of optional fusion callers to turn on. Supports [oncofuse, pizzly].

       · variantcaller  Variant  calling  algorithm  to  call  variants on RNA-seq data. Supports
         [gatk-haplotype] or [vardict].

       · spikein_fasta A FASTA file of spike in sequences to quantitate.

       · quantify_genome_alignments If set to True, run Salmon quantification  using  the  genome
         alignments from STAR, when available. If STAR alignments are not available, use Salmon's
         SA mode with decoys.

       · bcbiornaseq A dictionary of key-value pairs to be  passed  as  options  to  bcbioRNAseq.
         Currently  supports  organism  as a key and takes the latin name of the genome used (mus
          musculus, homo sapiens, etc) and interesting_groups which will be used to color quality
          control plots:

            bcbiornaseq:
              organism: homo sapiens
              interesting_groups: [treatment, genotype, etc, etc]

        You will also need to enable bcbioRNAseq via tools_on: [bcbiornaseq].

   Fast RNA-seq
       · transcriptome_fasta  An  optional  FASTA  file  of transcriptome sequences to quantitate
         rather than using bcbio installed transcriptome sequences.

   Single-cell RNA sequencing
       · minimum_barcode_depth Cellular barcodes with less than this many reads assigned to  them
         are discarded (default 10,000).

       · cellular_barcodes  A single file or a list of one or two files which have valid cellular
         barcodes. Provide one file if there is only one barcode and two files  if  the  barcodes
         are   split.   If   no   file   is   provided,   all   cellular   barcodes  passing  the
         minimum_barcode_depth filter are kept.

       · transcriptome_fasta An optional FASTA file  of  transcriptome  sequences  to  quantitate
         rather than the bcbio installed version.

       · transcriptome_gtf  An  optional GTF file of the transcriptome to quantitate, rather than
         the  bcbio  installed  version.  This  is  recommended  for  single-cell  RNA-sequencing
         experiments.

       · singlecell_quantifier Quantifier to use for single-cell RNA-sequencing.  Supports rapmap
         or kallisto.

       · cellular_barcode_correction Number of errors to correct in identified cellular barcodes.
         Requires  a  set  of  known  barcodes  to  be  passed with the cellular_barcodes option.
         Defaults to 1. Set to 0 to turn off error correction.

       · sample_barcodes A text file with one barcode per line of expected  sample  barcodes.  If
          the file contains a sample name for each barcode, it will be used to create a
          tagcounts.mtx.metadata file that matches each cell with the sample name associated
          with the barcode. The actual barcodes may be reverse complements of the sequences
          provided with the samples; it is worth checking before running bcbio. For inDrops
          protocol samples, barcodes are in the fastq file for read 3. This is an example of the
          sample_barcodes file:

            AATTCCGG,sample1
            CCTTGGAA,sample2

       · demultiplexed If set to True, each file will be treated as a cell  or  well  and  not  a
         collection  of  cells.  Use  this  if your data has already been broken up into cells or
         wells.

   smallRNA sequencing
       · adapters The 3' end adapter that needs to be removed. For NextFlex protocol you can add
         adapters:  ["4N",  "$3PRIME_ADAPTER"].  For  any  other  options  you can use resources:
         atropos:options:["-u 4", "-u -4"].

       · species 3 letters code to indicate the species in mirbase classification (i.e.  hsa  for
         human).

       · aligner Currently STAR is the only one tested although bowtie can be used as well.

       · expression_caller  A  list of expression callers to turn on: trna, seqcluster, mirdeep2,
         mirge (read smallRNA-seq to learn how to set up bcbio to run mirge)

       · transcriptome_gtf An optional GTF file of the transcriptome to use for seqcluster.

       · spikein_fasta A FASTA file of spike in sequences to quantitate.

       · umi_type: 'qiagen_smallRNA_umi' Support of Qiagen UMI small RNAseq protocol.

   ChIP/ATAC sequencing
       · peakcaller bcbio only accepts [macs2]

       · aligner Currently bowtie2 is the only one tested. bwa is also available.

       · The phenotype and batch tags need to be set under metadata in the config YAML file.  The
         phenotype  tag  will  specify  the  chip (phenotype: chip) and input samples (phenotype:
         input). The batch tag will specify the input-chip pairs of samples for  example,  batch:
         pair1.  Same  input  can  be  used  for different chip samples giving a list of distinct
         values: batch: [sample1, sample2].
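         For example, the metadata section for a ChIP sample paired with its input:

            metadata:
              phenotype: chip
              batch: pair1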

       · chip_method Currently supports standard ChIP-seq (TF or broad regions, using chip) or
         ATAC-seq (atac). Parameters will change depending on the option to get the best
         possible results. Only macs2 supported for now.

       You can pass different parameters for macs2 adding to Resources:

          resources:
            macs2:
              options: ["--broad"]

   Methylation
       · aligner supports bismark

   Quality control
       · qc Allows you to specifically assign quality control modules to run.  Generally you want
         to  leave  this unset and allow bcbio to run the correct QC metrics for your experiment,
         or remove specific QC steps you don't want using tools_off  (Changing  bcbio  defaults).
         However,  this can allow turning off most of the QC by specifying a single quick running
         step like picard. Available tools are fastqc, samtools, coverage, picard,  contamination
         (VerifyBamID), peddy, viral, damage, umi, small-rna, atropos, chipqc.

       · mixup_check   Detect   potential   sample   mixups.   Currently   supports   qSignature.
         qsignature_full runs a larger  analysis  while  qsignature  runs  a  smaller  subset  on
         chromosome 22.  [False, qsignature, qsignature_full]

       · kraken  Turn  on  kraken algorithm to detect possible contamination. You can add kraken:
         minikraken and it will use a minimal database to detect possible contaminants. As  well,
         you  can  point to a custom database directory and kraken will use it. You will find the
         results in the qc directory. You need to use --datatarget kraken during installation  to
         make the minikraken database available.

       · preseq Accepts lc_extrap or c_curve, and runs Preseq
         (http://smithlabresearch.org/software/preseq), a tool that predicts the yield for
         future experiments. By default, it runs 300 steps of estimation using a segment length
         of 100000. The default extrapolation limit for lc_extrap is 3x the number of reads. You
         can  override  the  parameters  seg_len,  steps,  extrap_fraction  using  the Resources:
         section:

            resources:
              preseq:
                extrap_fraction: 5
                steps: 500
                seg_len: 5000

         And you can also set extrap and step parameters directly, as well as provide  any  other
         command line option via options:

            resources:
              preseq:
                extrap: 10000000
                step: 30000
                options: ["-D"]

       · bcbio  uses  MultiQC  to combine QC output for all samples into a single report file. If
         you need to tweak configuration settings from bcbio defaults, you can use Resources. For
         instance to display read counts with full numbers instead of the default millions:

            resources:
              multiqc:
                options: ["--cl_config", "'read_count_multiplier: 1'"]

         or as thousands:

            resources:
              multiqc:
                options: ["--cl_config", "'{read_count_multiplier: 0.001, read_count_prefix: K}'"]

   Post-processing
       · archive Specify targets for long term archival. cram removes fastq names and does
         8-bin compression of BAM files into CRAM format. cram-lossless generates CRAM files
         without changes to quality scores or fastq names. Default: [] -- no archiving. Lossy
         cram has some issues, lossless cram provides good compression relative to BAM, and
         many machines output binned quality values now, so cram-lossless is what we recommend.
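
       For example, a minimal sketch enabling the recommended lossless CRAM archiving in the
       algorithm section:

          algorithm:
            archive: [cram-lossless]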

   Changing bcbio defaults
       bcbio provides hints to change default behavior by turning specific defaults on or
       off, with tools_on and tools_off. Both can be lists with multiple options:

       · tools_off Specify third party tools to  skip  as  part  of  analysis  pipeline.  Enables
         turning off specific components of pipelines if not needed:

         · gatk4  Use  older GATK versions (3.x) for GATK commands like BQSR, HaplotypeCaller and
           VQSR. By default bcbio includes GATK4 and uses it.

         · vqsr turns off variant quality score recalibration for all samples.

         · bwa-mem forces use of original bwa aln alignment. Without this, we use  bwa  mem  with
           70bp or longer reads.

         · lumpy-genotype  skip  genotyping  for  Lumpy samples, which can be slow in the case of
           many structural variants.

         · seqcluster turns off use of seqcluster tool in srnaseq pipeline.

         · tumoronly-prioritization turns off attempted removal of germline variants  from  tumor
           only calls using external population data sources like ExAC and 1000 genomes.

         · vardict_somatic_filter  disables  running  a post calling filter for VarDict to remove
           variants found in normal samples. Without vardict_somatic_filter in paired analyses no
           soft filtering of germline variants is performed but all high quality variants pass.

         · upload_alignment turns off final upload of large alignment files.

         · pbgzip turns off use of bgzip with multiple threads.

          · For quality control, you can turn off any specific tool by adding it to tools_off.
            For example, fastqc turns off FastQC quality control, and coverage_qc turns off
            calculation of coverage statistics with samtools-stats and picard. See the
            Methylation docs for details on tools.

       · tools_on Specify functionality to enable that is off by default:

          · qualimap runs Qualimap (Qualimap uses downsampled files; numbers here are an
            estimate based on 1e7 reads).

         · qualimap_full runs Qualimap with full bam files but it may be slow.

         · damage_filter  annotates  low  frequency somatic calls in INFO/DKFZBias for DNA damage
           artifacts using DKFZBiasFilter.

         · tumoronly_germline_filter applies a LowPriority filter to tumor-only calls that  match
           population  germline  databases.  The  default  is  to  just apply a tag EPR (external
           prioritization) that flags variants present in external databases. Anything missing  a
           pass here is a likely germline.

         · vqsr  makes  GATK  try  quality  score  recalibration for variant filtration, even for
           smaller sample sizes.

         · svplots adds additional coverage and summary plots for CNVkit  and  detected  ensemble
           variants.

         · bwa-mem forces use of bwa mem even for samples with less than 70bp reads.

         · gvcf  forces gVCF output for callers that support it (GATK HaplotypeCaller, FreeBayes,
           Platypus). For joint calling using a population of  samples,  please  use  jointcaller
           (population-calling).

         · vep_splicesite_annotations  enables  the use of the MaxEntScan and SpliceRegion plugin
           for VEP. Both optional plugins add extra splice site annotations.

         · gemini Create a GEMINI database of variants for downstream query using the new vcfanno
           and vcf2db approach.

         · gemini_orig  Create  a GEMINI database of variants using the older GEMINI loader. Only
           works for GRCh37 and hg19.

         · gemini_allvariants enables all variants to go into GEMINI, not only  those  that  pass
           filters.

         · vcf2db_expand  decompresses  and  expands the genotype columns in the vcfanno prepared
           GEMINI databases, enabling standard SQL queries on genotypes and depths.

         · bnd-genotype enables genotyping of breakends in Lumpy calls, which  improves  accuracy
           but can be slow.

         · lumpy_usecnv uses input calls from CNVkit as prior evidence to Lumpy calling.

         · coverage_perbase calculates per-base coverage depth for analyzed variant regions.

         · bcbiornaseq loads a bcbioRNASeq object for use with bcbioRNASeq.
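
       Putting the two keys together, a sketch of an algorithm section using options from the
       lists above:

          algorithm:
            tools_off: [vqsr, vardict_somatic_filter]
            tools_on: [qualimap, gvcf]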

   Parallelization
       · nomap_split_size  Unmapped  base  pair  regions  required to split analysis into blocks.
         Creates islands of mapped reads surrounded by unmapped (or  N)  regions,  allowing  each
         mapped region to run in parallel. (default: 250)

       · nomap_split_targets Number of target intervals to attempt to split processing into. This
         picks unmapped regions evenly spaced across the genome to process concurrently. Limiting
         targets  prevents  a  large number of small targets. (default: 200 for standard runs, 20
         for CWL runs)
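
       A sketch of these keys in the algorithm section, showing the documented defaults for
       standard runs:

          algorithm:
            nomap_split_size: 250
            nomap_split_targets: 200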

   Ensemble variant calling
       In addition to single method variant calling, we support  calling  with  multiple  calling
       methods and consolidating into a final Ensemble callset.

       The  recommended  method  to  do this uses a simple majority rule ensemble classifier that
       builds a final callset based on the intersection of calls. It selects variants represented
       in at least a specified number of callers:

          variantcaller: [mutect2, varscan, freebayes, vardict]
          ensemble:
            numpass: 2
            use_filtered: false

       This example selects variants present in 2 out of the 4 callers and does not use
       filtered calls (the default behavior). Because of the difficulties of producing a
       unified FORMAT/genotype field across callers, the ensemble output contains a mix of
       outputs from the different callers. It picks a representative call in the order of the
       specified callers, so the example above would have a MuTect2 call if present, otherwise
       a VarScan call if present, otherwise a FreeBayes call. This may require custom
       normalization scripts during post-processing when using these calls.
       bcbio.variation.recall implements this approach, which handles speed and file sorting
       limitations in the bcbio.variation approach.
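
       The majority-rule idea behind numpass can be sketched as follows (an illustration
       only, not bcbio's actual implementation; the caller names and variant tuple format are
       hypothetical):

```python
# Sketch of the majority-rule ensemble idea: keep a variant if at least
# `numpass` callers report it. Illustrative only, not bcbio's code.
from collections import Counter

def ensemble_pass(calls_by_caller, numpass):
    """calls_by_caller maps caller name -> set of (chrom, pos, ref, alt) tuples."""
    counts = Counter()
    for variants in calls_by_caller.values():
        counts.update(variants)
    return {v for v, n in counts.items() if n >= numpass}

calls = {
    "mutect2": {("1", 100, "A", "T"), ("1", 200, "G", "C")},
    "varscan": {("1", 100, "A", "T")},
    "freebayes": {("1", 300, "C", "G")},
    "vardict": {("1", 100, "A", "T"), ("1", 200, "G", "C")},
}
# Variants seen by at least 2 of the 4 callers pass
print(sorted(ensemble_pass(calls, numpass=2)))
```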

       This older approach uses the bcbio.variation toolkit  to  perform  the  consolidation.  An
       example configuration in the algorithm section is:

          variantcaller: [gatk, freebayes, samtools, gatk-haplotype, varscan]
          ensemble:
            format-filters: [DP < 4]
            classifier-params:
              type: svm
            classifiers:
              balance: [AD, FS, Entropy]
              calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
            trusted-pct: 0.65

       The ensemble set of parameters configure how to combine calls from the multiple methods:

       · format-filters  A  set  of  filters  to  apply to variants before combining. The example
         removes all calls with a depth of less than 4.

       · classifier-params Parameters to  configure  the  machine  learning  approaches  used  to
         consolidate calls. The example defines an SVM classifier.

       · classifiers  Groups  of  classifiers  to  use for training and evaluating during machine
         learning. The example defines two set of criteria for distinguishing reads  with  allele
         balance issues and those with low calling support.

       · trusted-pct Defines the threshold of variants to include in the final callset. In the
         example, variants called by more than 65% of the approaches (4 or more callers) pass
         without requiring SVM filtering.

RESOURCES

       The resources section allows customization of locations of programs and memory and compute
       resources to devote to them:

          resources:
            bwa:
              cores: 12
              cmd: /an/alternative/path/to/bwa
            samtools:
              cores: 16
              memory: 2G
            gatk:
              jvm_opts: ["-Xms2g", "-Xmx4g"]
            mutect2_filter:
              options: ["--max-events-in-region", "2"]

       · cmd Location of an executable. By default, we assume executables are on the path.

       · cores Cores to use for multi-processor enabled software. This is how many cores will
         be allocated per job. For example, if you are running 10 samples and passed -n 40 to
         bcbio-nextgen and the step you are running has cores: 8 set, a maximum of five
         samples will run in parallel, each using 8 cores.

       · jvm_opts  Specific  memory usage options for Java software. For memory usage on programs
         like  GATK,  specify  the  maximum  usage  per  core.  On  multicore  machines,   that's
         machine-memory  divided  by cores.  This avoids memory errors when running multiple jobs
         simultaneously, while the framework will adjust memory up when running multicore jobs.

       · memory Specify the memory per core used by a process. For programs where memory  control
         is  available,  like samtools sort, this limits memory usage. For other programs this is
         an estimate of usage, used by memory-management to avoid over-scheduling memory.  Always
         specify  this  as  the  memory usage for a single core, and the pipeline handles scaling
         this when a process uses multiple cores.

       · keyfile Specify the location of a program specific key file or license server,  obtained
         from a third party software tool. Supports licenses for novoalign and Sentieon. For more
         complex Sentieon setups this can also be a dictionary of environmental variables:

            resources:
              sentieon:
                keyfile:
                  SENTIEON_LICENSE_SERVER: 100.100.100.100:8888
                  SENTIEON_AUTH_MECH: XXX
                  SENTIEON_AUTH_DATA: signature

       · options Adjust specific command line options for a program. This can be hard to  support
         for  many  tools  due to conflicts with other existing options but is available for some
         tools:

            · mutect2, mutect2_filter: Adjust for Mutect2 calls and filtering.

   Temporary directory
       You also use the resource section  to  specify  system  specific  parameters  like  global
       temporary directories:

          resources:
            tmp:
              dir: /scratch

       This  is  useful on cluster systems with large attached local storage, where you can avoid
       some shared filesystem IO by writing temporary files to the local disk. When setting  this
       keep   in  mind  that  the  global  temporary  disk  must  have  enough  space  to  handle
       intermediates. The space differs between steps but generally you'd need to  have  2  times
       the  largest  input  file  per  sample  and  account for samples running simultaneously on
       multiple core machines.

       To handle clusters that specify local scratch space with an environmental variable,  bcbio
       will resolve environmental variables like:

          resources:
            tmp:
              dir: $YOUR_SCRATCH_LOCATION
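
       The resolution behaves like Python's os.path.expandvars (an assumption about the
       mechanism for illustration, not bcbio's exact code):

```python
# Sketch: resolving $YOUR_SCRATCH_LOCATION in the tmp dir setting, assuming
# behavior equivalent to os.path.expandvars (illustrative, not bcbio's code).
import os

os.environ["YOUR_SCRATCH_LOCATION"] = "/scratch/local"
print(os.path.expandvars("$YOUR_SCRATCH_LOCATION"))  # -> /scratch/local
```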

   Sample or run specific resources
       To  override  any of the global resource settings in a sample specific manner, you write a
       resource section within your sample YAML configuration. For example, to  create  a  sample
       specific  temporary  directory and pass a command line option to novoalign, write a sample
       resource specification like:

          - description: Example
            analysis: variant2
            resources:
              novoalign:
                options: ["-o", "FullNW", "--rOQ"]
              tmp:
                dir: tmp/sampletmpdir

       To adjust resources for an entire run, you can add this resources specification at the top
       level of your sample YAML:

          details:
            - description: Example
          resources:
            default:
              cores: 16

   Logging directory
       By  default, bcbio writes the logging-output directory to log in the main directory of the
       run. You can configure this to a different location in your bcbio-system.yaml with:

          log_dir: /path/to/logs

INPUT FILE PREPARATION

       Input files for supplementing analysis, like variant_regions need to match  the  specified
       reference  genome.  A  common  cause of confusion is the two chromosome naming schemes for
       human genome build 37: UCSC-style in hg19 (chr1, chr2) and Ensembl/NCBI  style  in  GRCh37
       (1,  2).  To  help  avoid some of this confusion, in build 38 we only support the commonly
       agreed on chr1, chr2 style.

       It's important to ensure that the chromosome naming in your input files match those in the
       reference genome selected. bcbio will try to detect this and provide helpful errors if you
       miss it.

       To convert chromosome names, you can use Devon Ryan's collection of chromosome mappings as
       an input to sed.  For instance, to convert hg19 chr-style coordinates to GRCh37:

          wget --no-check-certificate -qO- http://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh37_UCSC2ensembl.txt \
             | awk '{if($1!=$2) print "s/^"$1"/"$2"/g"}' > remap.sed
          sed -f remap.sed original.bed > final.bed

GENOME CONFIGURATION FILES

       Each  genome  build  has  an  associated buildname-resources.yaml configuration file which
       contains organism specific naming and resource files.  bcbio-nextgen  expects  a  resource
       file  present  next  to  the  genome  FASTA  file.  Example genome configuration files are
       available, and automatically installed for natively supported  genomes.  Create  these  by
       hand to support additional organisms or builds.

       The major sections of the file are:

       · aliases  --  Names  for  third-party programs used as part of the analysis, since naming
         expectations can differ between software programs.

       · variation -- Supporting data files for variant analysis. For human analyses,  the  dbSNP
         and  training  files  are  from  the  GATK  resource  bundle.  These are inputs into the
         training models  for  recalibration.  The  automated  CloudBioLinux  data  scripts  will
         download and install these in the variation subdirectory relative to the genome files.

       · rnaseq  --  Supporting  data  files  for  RNA-seq  analysis. The automated installer and
         updater handles retrieval and installation  of  these  resources  for  supported  genome
         builds.

       · srnaseq  --  Supporting  data  files  for  smallRNA-seq analysis. Same as in rnaseq, the
         automated installer and updater handle this for supported genome builds.

       By default, we place the buildname-resources.yaml files next to the genome FASTA files  in
       the  reference  directory.  For custom setups, you specify an alternative directory in the
       ref:config-resources section of your bcbio_system.yaml file:

          resources:
            genome:
              dir: /path/to/resources/files

REFERENCE GENOME FILES

       The pipeline requires access to reference genomes, including the raw  FASTA  sequence  and
       pre-built  indexes  for aligners. The automated installer will install reference files and
       indexes for commonly used genomes (see the upgrade-install documentation for command  line
       options).

       For  human  genomes,  we  recommend  using  build  38  (hg38). This is fully supported and
       validated in bcbio, and corrects a lot of issues in the previous build 37. We use the 1000
       genomes  distribution  which includes HLAs and decoy sequences. For human build 37, GRCh37
       and hg19, we use the 1000 genome references provided in the GATK  resource  bundle.  These
       differ in chromosome naming: hg19 uses chr1, chr2, chr3 style contigs while GRCh37 uses 1,
       2, 3. They also differ slightly in content: GRCh37 has masked Pseudoautosomal  regions  on
       chromosome Y allowing alignment to these regions on chromosome X.

       You  can  use  pre-existing  data and reference indexes by pointing bcbio-nextgen at these
       resources. We use the Galaxy .loc  files  approach  to  describing  the  location  of  the
       sequence and index data, as described in data-requirements. This does not require a Galaxy
       installation since the installer sets up a minimal set of .loc files. It  finds  these  by
       identifying  the root galaxy directory, in which it expects a tool-data sub-directory with
       the .loc files. It can do this in two ways:

       · Using the directory of your bcbio-system.yaml. This is the default  mechanism  setup  by
         the automated installer and requires no additional work.

       · From  the path specified by the galaxy_config option in your bcbio-system.yaml. If you'd
         like to move your system YAML file, add the full path to  your  galaxy  directory  here.
         This is useful if you have a pre-existing Galaxy installation with reference data.

       To  manually  make genomes available to bcbio-nextgen, edit the individual .loc files with
       locations to your reference and index genomes. You  need  to  edit  sam_fa_indices.loc  to
       point  at the FASTA files and then any genome indexes corresponding to aligners you'd like
       to use (for example: bwa_index.loc for  bwa  and  bowtie2_indices.loc  for  bowtie2).  The
       database key names used (like GRCh37 and mm10) should match those used in the genome_build
       of your sample input configuration file.
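
       As a hedged sketch (the exact column layout comes from the .loc files installed with
       your setup and may differ; the path here is hypothetical), a sam_fa_indices.loc entry
       is tab-separated along the lines of:

          index	GRCh37	/path/to/genomes/Hsapiens/GRCh37/seq/GRCh37.fa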

ADDING CUSTOM GENOMES

       bcbio_setup_genome.py will help you install a custom genome and apply all changes
       needed to the configuration files. It needs the genome in FASTA format and the
       annotation file in GTF or GFF3 format. It can create indexes for all aligners used by
       bcbio. Moreover, it will create the rnaseq folder to allow you to run the RNA-seq
       pipeline without further configuration.

          bcbio_setup_genome.py -f genome.fa -g annotation.gtf -i bowtie2 star seq -n Celegans -b WBcel135

       If you want to add smallRNA-seq data files, you will need to add the three-letter
       miRBase code for your genome (e.g. hsa for human) and the GTF file for the annotation
       of smallRNA data. Here you can use the same file as the transcriptome annotation if no
       other is available.

          bcbio_setup_genome.py -f genome.fa -g annotation.gtf -i bowtie2 star seq -n Celegans -b WBcel135 --species cel --srna_gtf another_annotation.gtf

       To use that genome, you just need to configure your YAML files as:

          genome_build: WBcel135

   Effects prediction
       To  perform  variant calling and predict effects in a custom genome you'd have to manually
       download and link this into your installation. First find the snpEff genome build:

          $ snpEff databases | grep Lactobacillus | grep pentosus
          Lactobacillus_pentosus_dsm_20314                                Lactobacillus_pentosus_dsm_20314                                              ENSEMBL_BFMPP_32_179            http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
          Lactobacillus_pentosus_kca1                                     Lactobacillus_pentosus_kca1                                                   ENSEMBL_BFMPP_32_179            http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip

       then download to the appropriate location:

          $ cd /path/to/bcbio/genomes/Lacto/Lactobacillus_pentosus
          $ mkdir snpEff
          $ cd snpEff
          $ wget http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
          $ unzip snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
          $ find . -name "Lactobacillus_pentosus_dsm_20314"
           ./home/pcingola/snpEff/data/Lactobacillus_pentosus_dsm_20314
          $ mv ./home/pcingola/snpEff/data/Lactobacillus_pentosus_dsm_20314 .

       finally add to your genome configuration file (seq/Lactobacillus_pentosus-resources.yaml):

          aliases:
            snpeff: Lactobacillus_pentosus_dsm_20314

       For adding an organism not present in snpEff, please see this mailing list discussion.

PARALLEL EXECUTION

       The pipeline runs in parallel in two different ways:

       · multiple cores -- Analyses will run  in  parallel  using  multiple  cores  on  a  single
         machine. This requires only the multiprocessing Python library, included by default with
         most Python installations.

       · parallel messaging -- This allows  scaling  beyond  the  cores  available  on  a  single
         machine,  and  requires multiple machines with a shared filesystem like standard cluster
         environments.  Machine to machine communication occurs via messaging, using the  IPython
         parallel framework.

TUNING CORE AND MEMORY USAGE

       bcbio  has  two  ways  to  specify  core  usage, helping provide options for parallelizing
       different types of processes:

       · Total available cores: specified with -n on the commandline, this tells bcbio  how  many
         total cores to use. This applies either to a local multicore run or a distributed job.

       · Maximum  cores  to  use for multicore processing of individual jobs. You specify this in
         the  resource  section  of   either   a   sample   YAML   file   (sample-resources)   or
         bcbio_system.yaml.  Ideally  you  specify this in the default section (along with memory
         usage). For example, this would specify that processes using multiple cores can  get  up
         to 16 cores with 2G of memory per core:

            resources:
              default:
                memory: 2G
                cores: 16
                jvm_opts: ["-Xms750m", "-Xmx2000m"]

       bcbio uses these settings, along with memory requests, to determine how to partition jobs.
       For example, if you had -n 32 and cores: 16 for a run on a single 32  core  machine,  this
       would run two simultaneous bwa mapping jobs using 16 cores each.

       Memory  specifications  (both  in  memory  and jvm_opts) are per-core. bcbio takes care of
       adjusting this memory to match the cores used. In the example above, if bcbio was  running
       a  16 core java process, it would use 32Gb of memory for the JVM, adjusting Xmx and Xms to
       match cores used. Internally bcbio looks at the memory and CPU  usage  on  a  machine  and
       matches  your  configuration options to the available system resources. It will scale down
       core requests if memory is limiting, avoiding over-scheduling resources  during  the  run.
       You  ideally  want to set both memory and jvm_opts to match the average memory per core on
       the run machine and adjust upwards if  this  does  not  provide  enough  memory  for  some
       processes during the run.

       For  single  machine  runs with a small number of samples, you generally want to set cores
       close to or equal the number of total cores you're allocating to the  job  with  -n.  This
       will  allow  individual  samples  to  process  as  fast  as possible and take advantage of
       multicore software.

       For distributed jobs, you want to set cores to match the available cores on a single  node
       in your cluster, then use -n as a multiple of this to determine how many nodes to spin up.
       For example, cores: 16 and -n 64 would try to make four 16  core  machines  available  for
       analysis.
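
       The partitioning arithmetic described above can be sketched as simple integer division
       (an illustration of the examples in this section, not bcbio's actual scheduler code):

```python
# Sketch of the core/memory partitioning arithmetic from this section
# (illustrative only, not bcbio's scheduler).

def simultaneous_jobs(total_cores, cores_per_job):
    """How many multicore jobs run at once: -n divided by the resources cores limit."""
    return total_cores // cores_per_job

def jvm_total_memory_gb(per_core_gb, cores):
    """Memory specifications are per-core; total JVM memory scales with cores used."""
    return per_core_gb * cores

print(simultaneous_jobs(32, 16))    # -n 32 with cores: 16 -> 2 bwa mapping jobs
print(jvm_total_memory_gb(2, 16))   # 2G per core on a 16 core job -> 32 for the JVM
print(simultaneous_jobs(64, 16))    # distributed: -n 64 with cores: 16 -> 4 nodes
```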

MULTIPLE CORES

       Running using multiple cores only requires setting the -n command line flag:

          bcbio_nextgen.py bcbio_sample.yaml -t local -n 12

IPYTHON PARALLEL

       IPython  parallel  provides a distributed framework for performing parallel computation in
       standard cluster environments. The bcbio-nextgen setup script installs  both  IPython  and
       pyzmq,  which  provides  Python  bindings  for  the  ZeroMQ  messaging  library.  The only
       additional requirement is that the work directory where you run the analysis is accessible
       to  all  processing  nodes.  This is typically accomplished with a distributed file system
       like NFS, Gluster or Lustre.

       Run an analysis using ipython for parallel execution:

          bcbio_nextgen.py bcbio_sample.yaml -t ipython -n 12 -s lsf -q queue

       The -s flag specifies a type of scheduler to use (lsf, sge, torque, slurm, pbspro).

       The -q flag specifies the queue to submit jobs to.

       The -n flag defines the total number of cores to use on the cluster during processing. The
       framework  will  select  the  appropriate number of cores and type of cluster (single core
       versus multi-core) to use based on the pipeline stage (see the internals-parallel  section
       in  the  internals documentation for more details). For multiple core steps, the number of
       cores to use for programs like bwa, novoalign and gatk  comes  from  the  config-resources
       section  of  the configuration.  Ensure the cores specification matches the physical cores
       available on machines in your cluster, and  the  pipeline  will  divide  the  total  cores
       specified by -n into the appropriate number of multicore jobs to run.

       The pipeline default parameters assume a system with minimal time to obtain processing
       cores and consistent file system accessibility. These defaults allow the system to
       fail fast in the case of cluster issues which need diagnosis. For running on shared
       systems with high resource usage and potential failures due to intermittent cluster
       issues, there are tuning parameters that increase resiliency. The --timeout flag
       specifies the number of minutes to wait for a cluster to start up before timing out;
       it defaults to 15 minutes. The --retries flag specifies the number of times to retry a
       job on failure. On systems with transient distributed file system hiccups like lock
       errors or disk availability, this provides recoverability at the cost of resubmitting
       jobs that may have failed for reproducible reasons.

       Finally, the -r resources flag specifies resource options to pass along to the
       underlying queue scheduler. This currently supports SGE's -l parameter, Torque's -l
       parameter and LSF and SLURM native flags. This allows specification of resources to
       the scheduler (see the qsub man page). You may specify multiple resources, so -r
       mem=4g -r ct=01:40:00 translates to -l mem=4g -l ct=01:40:00 when passed to qsub, or
       -r "account=a2010002" -r "timelimit=04:00:00" when using SLURM, for instance. SLURM
       and Torque support specification of an account parameter with -r account=your_name,
       which IPython transfers into -A.

       SGE supports special parameters passed using resources to help handle the heterogeneity of
       possible setups.

       Specify an SGE parallel environment that supports using multiple cores on a single
       node with -r pename=your_pe. Since this setup is system specific, it is hard to write
       general code to find a suitable environment. Specifically, when there are multiple
       usable parallel environments, bcbio will select the first one, which may not be
       correct. Manually specifying it with a pename= flag to resources will ensure selection
       of the right environment. If you're administering a grid engine cluster and are not
       sure how to set this up, you'd typically want an smp queue using allocation_rule:
       $pe_slots as in this example pename configuration or smp template.

       SGE has other specific flags you may want to tune, depending on your setup. To specify
       an advanced reservation with the -ar flag, use -r ar=ar_id. To specify an alternative
       memory management model instead of mem_free, use -r memtype=approach. It is further
       recommended to configure mem_free (or any other chosen memory management model) as a
       consumable, requestable resource in SGE to prevent overfilling hosts that do not have
       sufficient memory per slot. This can be done in two steps. First, launch qmon as an
       admin, select Complex Configuration in qmon, click on mem_free, under the Consumable
       dialog select JOB (instead of YES or NO) and finally click Modify for the changes to
       take effect. Second, for each host in the queue, configure mem_free as a complex
       value. If a host called myngshost has 128GB of RAM, the corresponding command would
       be:

          qconf -mattr exechost complex_values mem_free=128G myngshost

       There are also special -r resources parameters to support pipeline configuration:

       · -r  conmem=4  --  Specify  the  memory for the controller process, in Gb. This currently
         applies to SLURM processing and defaults to 4Gb.

       · -r minconcores=2 -- The minimum number of cores to use for the controller process.
         The controller works on a single core, but this can help in queues where you can
         only specify multicore jobs.

       · -r mincores=16 -- Specify the minimum number of cores to  batch  together  for  parallel
         single  core  processes  like variant calling. This will run multiple processes together
         under a single submission to allow sharing of resources like memory,  which  is  helpful
         when  a  small  percentage  of the time a process like variant calling will use a lot of
         memory. By default, bcbio will calculate mincores based on specifications for  multicore
         calling so this doesn't normally require a user to set.

TROUBLESHOOTING

   Diagnosing job failures
       Parallel jobs can often terminate with rather generic failures like any of the following:

       · joblib/parallel.py", ... TypeError: __init__() takes at least 3 arguments (2 given)

       · Multiprocessing exception:

       · CalledProcessError: Command '<command line that failed>

       These  errors  unfortunately  don't  help  diagnose the problem, and you'll likely see the
       actual error triggering this generic exception earlier in the run. This error can often be
       hard to find due to parallelization.

       If  you  run  into  a  confusing  failure like this, the best approach is to re-run with a
       single core:

          bcbio_nextgen.py your_input.yaml -n 1

       which should produce a more helpful debug message right above the failure.

       It's also worth re-trying the failed command line outside of bcbio to look for errors. You
       can  find the failing command by cross-referencing the error message with command lines in
       log/bcbio-nextgen-commands.log. You may have to change temporary directories (tx/tmp**) in
       some  of  the  job outputs. Reproducing the error outside of bcbio is a good first step to
       diagnosing and fixing the underlying issue.
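       The cross-referencing step can be sketched as follows. The log contents here are mocked
       up for illustration; in a real project you would grep the log/bcbio-nextgen-commands.log
       file inside your work directory:

```shell
# Create a mock work directory with a commands log (illustration only;
# the tool names and arguments below are fabricated).
mkdir -p demo/log
cat > demo/log/bcbio-nextgen-commands.log <<'EOF'
[2020-01-01] bwa mem -t 16 ref.fa sample_1.fq.gz sample_2.fq.gz
[2020-01-01] gatk HaplotypeCaller -I sample.bam -O sample.vcf.gz
EOF
# Search for the tool named in the generic error message (here, gatk)
# to recover the full command line for re-running outside of bcbio:
grep gatk demo/log/bcbio-nextgen-commands.log
```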

   No parallelization where expected
       This may occur if the current execution is a re-run of a previous project:

       · Files in checkpoints_parallel/*.done tell bcbio  not  to  parallelize  already  executed
         pipeline tasks. This makes restarts faster by avoiding re-starting a cluster (when using
         distributed runs) for finished stages. If that behaviour is  not  desired  for  a  task,
         removing the checkpoint file will get things parallelizing again.

       · If  the  processing  of  a  task  is  nearly finished the last jobs of this task will be
         running and bcbio will wait for those to finish.
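       For example, clearing a checkpoint looks like the following sketch; the work directory
       layout and checkpoint file name are mocked up for illustration:

```shell
# Mock work directory standing in for a real bcbio project.
mkdir -p work/checkpoints_parallel
touch work/checkpoints_parallel/multicore.done   # hypothetical checkpoint name
# List completed stages, then remove one so that stage parallelizes again
# on the next restart:
ls work/checkpoints_parallel/
rm work/checkpoints_parallel/multicore.done
```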

   IPython parallelization problems
       Networking problems on clusters can prevent the  IPython  parallelization  framework  from
       working properly. Be sure that the compute nodes on your cluster are aware of IP addresses
       that they can use to  communicate  with  each  other  (usually  these  will  be  local  IP
       addresses). Running:

           python -c 'import socket; print(socket.gethostbyname(socket.gethostname()))'

       should return such an IP address (as opposed to localhost). If it returns localhost,
       this can be fixed by adding an entry to the hosts file.

       The line:

          host-ip hostname

       where host-ip is replaced by the actual IP address of the machine and hostname by the
       machine's own hostname, should be added to /etc/hosts on each compute node. This will
       probably involve contacting your local cluster administrator.
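       Concretely, the fix on each compute node looks like this sketch, where the IP address
       and hostname are hypothetical placeholders for your machine's actual values:

```shell
# Hypothetical values; substitute the node's actual local IP and hostname.
HOST_IP=10.1.2.3
HOST_NAME=compute-node-01
echo "$HOST_IP $HOST_NAME" | sudo tee -a /etc/hosts
```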

MEMORY MANAGEMENT

       The memory information specified in  the  system  configuration  config-resources  enables
       scheduling  of  memory  intensive processes. The values are specified on a memory-per-core
       basis and thus bcbio-nextgen handles memory scheduling by:

       · Determining available cores and memory per machine

       · Calculating the memory  and  core  usage.   The  system  configuration  config-resources
         contains the expected core and memory usage of external programs.

       · Adjusting  the  specified  number  of  total cores to avoid over-scheduling memory. This
         allows running programs with more than the available memory per core without getting out
         of memory system errors.

       · Passing  total  memory  usage  along  to  schedulers.  The SLURM, SGE, Torque and PBSPro
         schedulers use this information to allocate memory to processes,  avoiding  issues  with
         other scheduled programs using available memory on a shared machine.

       As  a  result  of  these  calculations,  the  cores used during processing will not always
       correspond to the maximum cores provided in the input -n parameter. The goal is rather  to
       intelligently  maximize  cores and memory while staying within system resources. Note that
       memory specifications are for a single core, and the pipeline takes care of adjusting this
       to actual cores used during processing.
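       As a worked sketch with hypothetical numbers: on a 16 core machine with 48Gb of total
       memory, a program configured for 4Gb per core can only run 48 / 4 = 12 concurrent
       processes without over-committing memory, so bcbio schedules 12 cores rather than the
       full 16:

```shell
machine_cores=16          # hypothetical machine specification
machine_mem_gb=48
prog_mem_per_core_gb=4    # per-core memory from config-resources
# Cap cores so that cores * memory-per-core fits within machine memory.
sched_cores=$(( machine_mem_gb / prog_mem_per_core_gb ))
if [ "$sched_cores" -gt "$machine_cores" ]; then
    sched_cores=$machine_cores
fi
echo "cores used: $sched_cores"
```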

DETERMINING AVAILABLE CORES AND MEMORY PER MACHINE

       bcbio  automatically  tries  to determine the total available memory and cores per machine
       for balancing resource usage. For multicore runs,  it  retrieves  total  memory  from  the
       current  machine.  For parallel runs, it spawns a job on the queue and extracts the system
       information from that machine. This expects a homogeneous set of machines within a cluster
       queue.     You    can    see    the    determined    cores    and    total    memory    in
       provenance/system-ipython-queue.yaml.

       For heterogeneous clusters  or  other  cases  where  bcbio  does  not  correctly  identify
       available system resources, you can manually set the machine cores and total memory in the
       resource section of either a sample YAML file (sample-resources) or bcbio_system.yaml:

          resources:
            machine:
              memory: 48.0
              cores: 16

       The memory usage is total  available  on  the  machine  in  Gb,  so  this  specifies  that
       individual machines have 48Gb of total memory and 16 cores.

TUNING SYSTEMS FOR SCALE

       bcbio-nextgen  scales  out on clusters including hundreds of cores and is stress tested on
       systems with 1000 simultaneous processes. Scaling up often requires system specific tuning
       to  handle  simultaneous  processes.  This  section  collects  useful  tips and tricks for
       managing scaling issues.

   Open file handles
       A common failure mode is having too many open file handles. The error report can come
       from the IPython infrastructure logs as ZeroMQ attempts to open sockets, or from the
       processing logs as third party software opens file handles. You can check your available
       file handles with ulimit -a | grep open. Setting open file handle limits is system and
       cluster specific; below are tips for specific setups.

       In addition to open file handle limits (ulimit -n)  large  processes  may  also  run  into
       issues  with  available  max user processes (ulimit -u). Some systems set a low soft limit
       (ulimit -Su) like 1024 but a higher hard limit (ulimit -Hu), allowing  adjustment  without
       root privileges. The IPython controllers and engines do this automatically, but the main
       bcbio_nextgen.py driver process cannot. If the scheduler puts this process on the same
       node as worker processes, you may run into these limits due to work happening on the
       workers. To fix this, manually set ulimit -u a_high_number as part of the submission
       process for the main process.
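       A minimal sketch of such a submission script, assuming SLURM (the queue name, input
       file and core count are placeholders):

```shell
#!/bin/bash
#SBATCH --partition=your_queue
#SBATCH --cpus-per-task=1
# Raise the soft process limit to the hard limit before starting the main
# driver; this does not require root as long as the hard limit permits it.
ulimit -Su "$(ulimit -Hu)"
bcbio_nextgen.py your_input.yaml -t ipython -s slurm -q your_queue -n 64
```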

       For an Ubuntu system, edit /etc/security/limits.conf to set the soft and hard nofile
       descriptors, and edit /etc/pam.d/common-session to add pam_limits.so. See this blog post
       for more details.

       For       CentOS/RedHat       systems,       edit       /etc/security/limits.conf      and
       /etc/security/limits.d/90-nproc.conf to increase maximum open files and user limits.

       SGE needs configuration at the qmaster level. Invoke qconf -mconf from a host  with  admin
       privileges, and edit execd_params:

          execd_params                 S_DESCRIPTORS=20000

   IO and Network File Systems
       bcbio-nextgen  makes use of distributed network file systems to manage sharing large files
       between compute nodes. While we strive to minimize disk-based processing by making use  of
       pipes,  the  pipeline  still  has  a  major  IO  component.  To help manage IO and network
       bottlenecks, this section  contains  pointers  on  deployments  and  benchmarking.  Please
       contribute your tips and thoughts.

       · Harvard  and  Dell:  See  the  'Distributed File Systems' section of our post on scaling
         bcbio-nextgen for details about the setup within  Harvard  FAS  Research  Computing  and
         thoughts  on scaling and hardware. We also collaborate with Dell to test the pipeline on
         Dell's Active Infrastructure for Life Sciences.  We found  the  biggest  initial  factor
         limiting scaling was network bandwidth between compute and storage nodes.

   Spark
       Some  GATK tools like recalibration use Apache Spark for parallelization. By default bcbio
       runs these with multicore parallelization on a single node, to fit in standard cluster and
       local  compute environments. If you have a custom Spark cluster on your system you can use
       that for GATK by setting up the appropriate configuration in your sample-resources:

          resources:
              gatk-spark:
                  options: [--spark-master, 'spark://your-spark-cluster:6311']

       bcbio runs with Common Workflow Language (CWL) compatible parallelization software.  bcbio
       generates  a  CWL workflow from a standard bcbio sample YAML description file and any tool
       that supports CWL input can run the workflow. CWL-based tools  do  the  work  of  managing
       files  and  workflows,  and  bcbio  performs the biological analysis using either a Docker
       container or a local installation.

CURRENT STATUS

       bcbio creates  CWL  for  alignment,  small  variant  calls  (SNPs  and  indels),  coverage
       assessment, HLA typing, quality control and structural variant calling. It generates a CWL
       v1.0.2 compatible workflow.  The actual biological code execution during runs  works  with
       either a bcbio docker container or a local installation of bcbio.

       The  implementation includes bcbio's approaches to splitting and batching analyses. At the
       top level workflow, we parallelize by samples. Using sub-workflows, we split fastq  inputs
       into sections for parallel alignment over multiple machines, followed by merging. We also
       use sub-workflows, along with CWL records, to batch multiple samples and run in  parallel.
       This  enables  pooled  and  tumor/normal cancer calling with parallelization by chromosome
       regions based on coverage calculations.
         [image: Variant calling overview] [image]

       bcbio supports these CWL-compatible tools:

       · Cromwell -- multicore local runs  and  distributed  runs  on  HPC  systems  with  shared
         filesystems and schedulers like SLURM, SGE and PBSPro.

       · Arvados -- a hosted platform that runs on top of parallel cloud environments. We include
         an example below of running on the public Curoverse instance running on Microsoft Azure.

       · DNANexus -- a hosted platform running distributed jobs on  cloud  environments,  working
         with both AWS and Azure.

       · Seven  Bridges -- parallel distributed analyses on the Seven Bridges platform and Cancer
         Genomics Cloud.

       · Toil -- parallel local and distributed cluster runs on schedulers like  SLURM,  SGE  and
         PBSPro.

       · rabix bunny -- multicore local runs.

       · cwltool -- a single core analysis engine, primarily used for testing.

       We  plan  to  continue to expand CWL support to include more components of bcbio, and also
       need to evaluate the workflow on larger, real  life  analyses.  This  includes  supporting
       additional  CWL  runners.  We're working on evaluating Galaxy/Planemo for integration with
       the Galaxy community.

INSTALLATION

       bcbio-vm installs all dependencies required to generate CWL  and  run  bcbio,  along  with
       supported CWL runners. There are two install choices, depending on your usage of bcbio:
       running CWL with an existing local bcbio install, or running with containers.

   Install bcbio-vm with a local bcbio
       To run bcbio without using containers, first install bcbio and make it available  in  your
       path. You'll need both the bcbio code and tools. To only run the tests and bcbio
       validations, you don't need a full data installation, so you can install with --nodata.

       To then install bcbio-vm, add the --cwl flag to the install:

          bcbio_nextgen.py upgrade --cwl

       Adding this to any future upgrades will also update the bcbio-vm wrapper code and tools.

       When you begin running your own analysis and need the  data  available,  pre-prepare  your
       bcbio data directory with bcbio_nextgen.py upgrade --data --cwl.

   Install bcbio-vm with containers
       If you don't have an existing local bcbio installation and want to run with CWL using the
       tools and data embedded in containers, you can do a standalone install of just bcbio-vm.
       To install using Miniconda and bioconda packages on Linux:

          export TARGETDIR=~/install/bcbio-vm/anaconda
          export BINDIR=/usr/local/bin
          wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
          bash Miniconda3-latest-Linux-x86_64.sh -b -p $TARGETDIR
          $TARGETDIR/bin/conda install --yes -c conda-forge -c bioconda python=3 bcbio-nextgen
          $TARGETDIR/bin/conda install --yes -c conda-forge -c bioconda python=3 bcbio-nextgen-vm
          mkdir -p $BINDIR
          ln -s $TARGETDIR/bin/bcbio_vm.py $BINDIR/bcbio_vm.py
          ln -s $TARGETDIR/bin/conda $BINDIR/bcbiovm_conda
          ln -s $TARGETDIR/bin/python $BINDIR/bcbiovm_python

       In the above commands, the bcbio-vm install goes in $TARGETDIR. The example uses your
       home directory, but set it to anywhere you have space. Also, as an alternative to symbolic
       linking to a $BINDIR, you can add the install bin directory to your PATH:

          export PATH=$TARGETDIR/bin:$PATH

       This install includes bcbio-nextgen libraries, used in generating CWL and orchestrating
       runs, but is not a full bcbio installation. It requires Docker to be present on your
       system; that is all you need to get started running examples, since the CWL runners will
       pull in Docker containers with the bcbio tools.

GETTING STARTED

       To make it easy to get started, we have pre-built CWL descriptions  that  use  test  data.
       These  run in under 5 minutes on a local machine and don't require a bcbio installation if
       you have Docker available on your machine:

       1. Download and unpack the test repository:

             wget -O test_bcbio_cwl.tar.gz https://github.com/bcbio/test_bcbio_cwl/archive/master.tar.gz
             tar -xzvpf test_bcbio_cwl.tar.gz
             cd test_bcbio_cwl-master/somatic

       2. Run the analysis using either Cromwell,  Rabix  bunny  or  Toil.  If  you  have  Docker
          available on your machine, the runner will download the correct bcbio container and you
           don't need to install anything else to get started. If you have an old version of the
           container, you can update to the latest with docker pull quay.io/bcbio/bcbio-vc.
           There are shell scripts that provide the command lines for running:

             bash run_cromwell.sh
             bash run_bunny.sh
             bash run_toil.sh

          Or you can run directly using the bcbio_vm.py wrappers:

             bcbio_vm.py cwlrun cromwell somatic-workflow
             bcbio_vm.py cwlrun toil somatic-workflow
             bcbio_vm.py cwlrun bunny somatic-workflow

           These wrappers automatically handle temporary directories, permissions, logging and
           re-starts. If running without Docker, use a local installation of bcbio and add
           --no-container to the commands in the shell scripts.

GENERATING CWL FOR INPUT TO A TOOL

       The first step in running your analysis project in bcbio is to  generate  CWL.  If  you're
       already familiar with bcbio, the process of preparing information about your sample
       inputs and analysis is almost identical:

       · A standard bcbio sample configuration file defining the samples. This can  either  be  a
         full prepared YAML file or a template file and CSV with sample data.

       · A  bcbio_system.yaml  file defining the system environment for running the program. This
         includes the resource specification with cores and memory per core  for  your  machines.
          For choosing cores and memory per core, you generally want to set this to match the
         parameters of a single machine either for a local run or on a cluster.

          In addition to resource specifications, the bcbio system file now also includes paths
         to  the  reference  biodata  and  optionally input file directories if you want to avoid
         specifying full paths to your inputs in the bcbio_vm.py template  command.   bcbio  will
         recursively  look  up  file locations within those inputs, and this has the advantage of
         working identically for non-local file locations.  Here is an  example  for  a  16  core
         machine with 3.5Gb of memory per core:

            local:
              ref: /path/to/bcbio/genomes/Hsapiens
              inputs:
                - /path/to/input/files
            resources:
              default:
                cores: 16
                memory: 3500M
                jvm_opts: [-Xms1g, -Xmx3500m]

       Generate CWL with:

          bcbio_vm.py template --systemconfig bcbio_system.yaml template.yaml samples.csv [optional list of fastq or BAM inputs]
          bcbio_vm.py cwl --systemconfig bcbio_system.yaml samples/config/samples.yaml

       producing a sample-workflow output directory with the CWL.

       On  a  first CWL generation run with a new genome, this process will run for a longer time
       as it needs to make your reference compatible with  CWL.  This  includes  creating  single
       tar.gz  files  from  some  reference directories so they can get passed to CWL steps where
       they'll get unpacked. This process only happens a single time and keeps unpacked  versions
       so your reference setup is compatible with both old bcbio IPython and new CWL runs.

       You  can  now  run this with any CWL compatible runner and the bcbio_vm.py cwlrun wrappers
       standardize running across multiple tools in different environments.

RUNNING WITH CROMWELL (LOCAL, HPC, CLOUD)

       The Cromwell workflow management system runs bcbio either locally on a single  machine  or
       distributed on a cluster using a scheduler like SLURM, SGE or PBSPro.

       To run a bcbio CWL workflow locally using Docker:

          bcbio_vm.py cwlrun cromwell sample-workflow

       If you want to run from a locally installed bcbio add --no-container to the commandline.

       To run distributed on a SLURM cluster:

          bcbio_vm.py cwlrun cromwell sample-workflow --no-container -q your_queue -s slurm -r timelimit=0-12:00

       Tweak scheduler parameters using the same options as the older bcbio IPython approach.

       To control the resources used by Cromwell, set --joblimit to the number of jobs allocated
       concurrently. This isn't total cores used, but rather the number of jobs either locally or
       remotely  scheduled  concurrently.  Since  CWL  steps are heterogeneous and use only cores
       necessary for that job, the total cores used will max out at joblimit times maximum  cores
       for  an  individual  process.  Setting  this  helps avoid over-committing jobs to a shared
       scheduler during highly parallel processes like variant calling.
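       For example, to cap Cromwell at 10 concurrently scheduled jobs on a SLURM run (the
       limit, queue and time values here are arbitrary illustrations):

```shell
bcbio_vm.py cwlrun cromwell sample-workflow --no-container -q your_queue \
    -s slurm -r timelimit=0-12:00 --joblimit 10
```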

       Cromwell can also run directly on cloud resources: docs-cloud-gcp.

RUNNING WITH TOIL (LOCAL, HPC)

       The Toil pipeline management system runs CWL workflows in parallel on a local machine,  on
       a cluster or at AWS.

       To run a bcbio CWL workflow locally with Toil using Docker:

          bcbio_vm.py cwlrun toil sample-workflow

       If you want to run from a locally installed bcbio add --no-container to the commandline.

       To run distributed on a Slurm cluster:

          bcbio_vm.py cwlrun toil sample-workflow -- --batchSystem slurm

RUNNING ON ARVADOS (HOSTED CLOUD)

       bcbio generated CWL workflows run on Arvados, and these instructions detail how to run on
       the Arvados public instance. Arvados cwl-runner comes pre-installed with bcbio-vm. We have
       a  publicly  accessible  project,  called  bcbio_resources that contains the latest Docker
       images, test data and genome references you can use for runs.

       Retrieve API keys from the Arvados public instance.  Login,  then  go  to  'User  Icon  ->
       Personal  Token'.   Copy  and  paste  the  commands  given  there  into your shell. You'll
       specifically need to set ARVADOS_API_HOST and ARVADOS_API_TOKEN.
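       The pasted commands will look something like this sketch, with hypothetical placeholder
       values in place of your instance's actual host and token:

```shell
export ARVADOS_API_HOST=qr1hi.arvadosapi.com   # hypothetical host
export ARVADOS_API_TOKEN=your_token_here        # placeholder token
```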

       To run an analysis:

       1. Create a new project from the web interface (Projects -> Add a new project).  Note  the
          project    ID    from    the    URL    of    the    project    (an    identifier   like
          qr1hi-j7d0g-7t73h4hrau3l063).

       2. Upload reference data to Arvados Keep. Note the genome collection UUID.  You  can  also
          use  the  existing  genomes  pre-installed  in the bcbio_resources project if using the
          public Arvados playground:

             arv-put --name testdata_genomes --project-uuid $PROJECT_ID testdata/genomes/hg19

       3. Upload input data to Arvados Keep. Note the collection UUID:

             arv-put --name testdata_inputs --project-uuid $PROJECT_ID testdata/100326_FC6107FAAXX testdata/automated testdata/reference_material

        4. Create an Arvados section in a bcbio_system.yaml file specifying locations to look  for
           reference and input data. input can be one or more collections containing files or
           associated files referenced in the original sample YAML:

             arvados:
               reference: qr1hi-4zz18-kuz1izsj3wkfisq
               input: [qr1hi-j7d0g-h691y6104tlg8b4]
             resources:
               default: {cores: 4, memory: 2G, jvm_opts: [-Xms750m, -Xmx2500m]}

       5. Generate the CWL to run your samples. If you're using multiple input files with  a  CSV
          metadata file and template start with creation of a configuration file:

             bcbio_vm.py template --systemconfig bcbio_system_arvados.yaml testcwl_template.yaml testcwl.csv

          To generate the CWL from the system and sample configuration files:

             bcbio_vm.py cwl --systemconfig bcbio_system_arvados.yaml testcwl/config/testcwl.yaml

       6. In  most  cases,  Arvados  should  directly pick up the Docker images you need from the
          public bcbio_resources project in your instance. If you need to manually  add  to  your
           project, you can copy the latest bcbio Docker image into your project from bcbio_resources
          using  arv-copy.  You'll  need  to  find  the  UUID   of   quay.io/bcbio/bcbio-vc   and
          arvados/jobs:

             arv-copy $JOBS_ID --project-uuid $PROJECT_ID --src qr1hi --dst qr1hi
             arv-copy $BCBIO_VC_ID --project-uuid $PROJECT_ID --src qr1hi --dst qr1hi

          or import local Docker images to your Arvados project:

             docker pull arvados/jobs:1.0.20180216164101
             arv-keepdocker --project $PROJECT_ID -- arvados/jobs 1.0.20180216164101
             docker pull quay.io/bcbio/bcbio-vc
             arv-keepdocker --project $PROJECT_ID -- quay.io/bcbio/bcbio-vc latest

       7. Run the CWL on the Arvados public cloud using the Arvados cwl-runner:

             bcbio_vm.py cwlrun arvados arvados_testcwl-workflow -- --project-uuid $PROJECT_ID

RUNNING ON DNANEXUS (HOSTED CLOUD)

       bcbio  runs  on  the  DNAnexus  platform  by  converting bcbio generated CWL into DNAnexus
       workflows and apps using dx-cwl. This describes the process using the bcbio  workflow  app
       (bcbio-run-workflow)            and            bcbio            workflow            applet
       (bcbio_resources:/applets/bcbio-run-workflow) in the public bcbio_resources project. Both
       are  regularly  updated and maintained on the DNAnexus platform. Secondarily, we also show
       how to install and create workflows locally for additional control and debugging.

       0.  Set some useful environmental variables:

           · $PNAME -- The name of the project you're analyzing. For  convenience  here  we  keep
             this  the  same for your local files and remote DNAnexus project, although that does
             not have to be true.

           · $DX_AUTH_TOKEN -- The DNAnexus authorization token  for  access,  used  for  the  dx
             command line tool and bcbio scripts.

           · $DX_PROJECT_ID  --  The  DNAnexus  GUID  identifier  for  your  project  (similar to
             project-F8Q7fJj0XFJJ3XbBPQYXP4B9).  You   can   get   this   from   dx   env   after
             creating/selecting a project in steps 1 and 2.

       1.  Create an analysis project:

              dx new project $PNAME

       2.  Upload sample data to the project:

              dx select $PNAME
              dx upload -p --path /data/input *.bam

       3.  Create a bcbio system YAML file with projects, locations of files and desired core and
           memory usage for jobs. bcbio uses the core  and  memory  specifications  to  determine
           machine instance types to use:

              dnanexus:
                project: PNAME
                ref:
                  project: bcbio_resources
                  folder: /reference_genomes
                inputs:
                  - /data/input
                  - /data/input/regions
              resources:
                default: {cores: 8, memory: 3000M, jvm_opts: [-Xms1g, -Xmx3000m]}

       4.  Create  a  bcbio sample CSV file referencing samples to run. The files can be relative
           to the inputs directory specified above; bcbio will search recursively for  files,  so
           you  don't  need  to  specify  full  paths if your file names are unique. Start with a
           sample specification:

              samplename,description,batch,phenotype
              file1.bam,sample1,b1,tumor
              file2.bam,sample2,b1,normal
              file3.bam,sample3,b2,tumor
              file4.bam,sample4,b2,normal

        5.  Pick a template file that describes the bcbio configuration variables. You can define
            parameters either globally (in the template file) or by sample (in the CSV) using the
            standard bcbio templating. An example template for GATK4 germline variant calling is:

              details:
               - algorithm:
                   aligner: bwa
                   variantcaller: gatk-haplotype
                 analysis: variant2
                 genome_build: hg38

        6.  Supply the three inputs (bcbio_system.yaml, project.csv and template.yaml) to either
            the bcbio-run-workflow app or applet. This example uses a specific version of
           the bcbio app for full reproducibility; any future re-runs will always use  the  exact
           same versioned tools and workflows. You can do this using the web interface or via the
           command line with a small script like:

              TEMPLATE=germline
              APP_VERSION=0.0.2
              FOLDER=/bcbio/$PNAME
               dx select "$PNAME"
              dx mkdir -p $FOLDER
              for F in $TEMPLATE-template.yaml $PNAME.csv bcbio_system-dnanexus.yaml
              do
                      dx rm -a /$FOLDER/$F || true
                      dx upload --path /$FOLDER/ $F
              done
              dx ls $FOLDER
              dx rm -a -r /$FOLDER/dx-cwl-run || true
              dx run bcbio-run-workflow/$APP_VERSION -iyaml_template=/$FOLDER/$TEMPLATE-template.yaml -isample_spec=/$FOLDER/$PNAME.csv -isystem_configuration=/$FOLDER/bcbio_system-dnanexus.yaml -ioutput_folder=/$FOLDER/dx-cwl-run

           Alternatively if you want the latest bcbio code, change the final command to  use  the
           applet. Everything else in the script is identical:

              dx run bcbio_resources:/applets/bcbio-run-workflow -iyaml_template=/$FOLDER/$TEMPLATE-template.yaml -isample_spec=/$FOLDER/$PNAME.csv -isystem_configuration=/$FOLDER/bcbio_system-dnanexus.yaml -ioutput_folder=/$FOLDER/dx-cwl-run

       The app will look up all files, prepare a bcbio CWL workflow, convert it into a DNAnexus
       workflow, and submit it to the platform. The workflow runs as a standard DNAnexus workflow
       and you can monitor through the command line (with dx find executions --root job-YOURJOBID
       and dx watch) or the web interface (Monitor tab).

       If you prefer not to use the DNAnexus app, you can also submit jobs locally by  installing
       bcbio-vm  on  your  local  machine.  This can also be useful to test generation of CWL and
       manually ensure identification of all your samples and associated files  on  the  DNAnexus
       platform.

       1. Follow  the  automated-sample-config  workflow  to  generate  a full configuration, and
          generate a CWL description of the workflow:

             TEMPLATE=germline
             rm -rf $PNAME $PNAME-workflow
             bcbio_vm.py template --systemconfig bcbio_system-dnanexus.yaml $TEMPLATE-template.yaml $PNAME.csv
             bcbio_vm.py cwl --systemconfig bcbio_system-dnanexus.yaml $PNAME/config/$PNAME.yaml

       2. Determine project information and login credentials. You'll want to note the Auth token
          used and Current workspace project ID:

             dx env

       3. Compile the CWL workflow into a DNAnexus workflow:

             dx-cwl compile-workflow $PNAME-workflow/main-$PNAME.cwl \
                --project PROJECT_ID --token $DX_AUTH_TOKEN \
                --rootdir $FOLDER/dx-cwl-run

       4. Upload sample information from generated CWL and run workflow:

             FOLDER=/bcbio/$PNAME
             dx mkdir -p $DX_PROJECT_ID:$FOLDER/$PNAME-workflow
             dx upload -p --path $DX_PROJECT_ID:$FOLDER/$PNAME-workflow $PNAME-workflow/main-$PNAME-samples.json
             dx-cwl run-workflow $FOLDER/dx-cwl-run/main-$PNAME/main-$PNAME \
                    $FOLDER/$PNAME-workflow/main-$PNAME-samples.json \
                    --project PROJECT_ID --token $DX_AUTH_TOKEN \
                    --rootdir $FOLDER/dx-cwl-run

RUNNING ON SEVEN BRIDGES (HOSTED CLOUD)

       bcbio runs on Seven Bridges, including the main platform and specialized data sources
       like the Cancer Genomics Cloud and Cavatica. Seven Bridges uses generated CWL directly and
       bcbio  has  utilities to query your remote data on the platform and prepare CWL for direct
       submission.

       1. Since Seven Bridges is available on multiple  platforms  and  data  access  points,  we
          authenticate   with   a  configuration  file  in  $HOME/.sevenbridges/credentials  with
          potentially multiple profiles defining API access URLs  and  authentication  keys.   We
          reference  the  specified  credentials  when setting up a bcbio_system-sbg.yaml file to
          ensure correct authentication.
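           A credentials file with two profiles looks like this sketch; the endpoint URLs
           follow Seven Bridges' documented defaults and the tokens are placeholders:

```
[default]
api_endpoint = https://api.sbgenomics.com/v2
auth_token = your_platform_token

[cgc]
api_endpoint = https://cgc-api.sbgenomics.com/v2
auth_token = your_cgc_token
```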

       2. Upload your inputs and bcbio reference  data  using  the  Seven  Bridges  command  line
          uploader.  We  plan  to  host  standard bcbio reference data in a public project so you
          should only need to upload your project specific data:

             sbg-uploader.sh -p chapmanb/bcbio-test --folder inputs --preserve-folder fastq_files regions

        3. Create a bcbio_system-sbg.yaml file defining locations of inputs:

             sbgenomics:
               profile: default
               project: chapmanb/bcbio-test
               inputs:
                 - /testdata/100326_FC6107FAAXX
                 - /testdata/automated
                 - /testdata/genomes
                 - /testdata/reference_material
             resources:
               default:
                 cores: 2
                 memory: 3G
                 jvm_opts: [-Xms750m, -Xmx3000m]

       4. Follow the automated-sample-config workflow  to  generate  a  full  configuration,  and
          generate a CWL description of the workflow:

             PNAME=somatic
             bcbio_vm.py template --systemconfig=bcbio_system-sbg.yaml ${PNAME}_template.yaml $PNAME.csv
             bcbio_vm.py cwl --systemconfig=bcbio_system-sbg.yaml $PNAME/config/$PNAME.yaml

       5. Run the job on the Seven Bridges platform:

             PNAME=somatic
             SBG_PROJECT=bcbio-test
             bcbio_vm.py cwlrun sbg ${PNAME}-workflow -- --project ${SBG_PROJECT}

DEVELOPMENT NOTES

       bcbio  generates  a common workflow language description. Internally, bcbio represents the
       files and information related to processing as a  comprehensive  dictionary.   This  world
       object  describes the state of a run and associated files, and new processing steps update
       or add information to it. The world object is roughly equivalent to CWL's JSON-based input
       object,  but  CWL  enforces  additional  annotations  to  identify  files  and  models new
       inputs/outputs at each step. The work in bcbio is to move from our laissez-faire  approach
       to the more structured CWL model.
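       As a toy illustration of that difference (the keys shown are invented for the
       example, not bcbio's actual schema), the world object refers to files by bare path
       strings, while CWL input objects wrap each file in an explicitly typed record:

```python
# Hypothetical sketch: bcbio's world dict references files loosely by path,
# while a CWL input object annotates each file with an explicit "class".
world = {"description": "sample1", "align_bam": "/data/sample1.bam"}

cwl_input = {
    "description": "sample1",
    # CWL models files as records carrying a "class": "File" annotation
    "align_bam": {"class": "File", "path": "/data/sample1.bam"},
}
```

       The extra structure is what lets CWL runners identify, stage and track files at
       each step, rather than treating them as opaque strings.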

       The generated CWL workflow is in run_info-cwl-workflow:

       · main-*.cwl -- the top level CWL file describing the workflow steps

       · main*-samples.json -- the flattened bcbio world structure represented as CWL inputs

       · wf-*.cwl  -- CWL sub-workflows, describing sample level parallel processing of a section
         of the workflow, with potential internal parallelization.

       · steps/*.cwl -- CWL descriptions of sections of code run inside bcbio. Each of these  are
         potential parallelization points and make up the nodes in the workflow.

       To  help  with  defining the outputs at each step, there is a WorldWatcher object that can
       output changed files and world dictionary objects  between  steps  in  the  pipeline  when
       running bcbio in the standard way. The variant pipeline has examples using it. This is
       useful when preparing the CWL definitions of inputs and outputs for new steps in the bcbio
       CWL step definitions.
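       The idea behind such a watcher can be sketched as follows. This is an illustrative
       re-implementation, not bcbio's actual WorldWatcher code: snapshot the world
       dictionary before a step, then report which top-level keys changed afterwards:

```python
def changed_keys(before, after):
    """Report top-level keys added or modified between two world snapshots."""
    return sorted(k for k in after
                  if k not in before or before[k] != after[k])

# Hypothetical world snapshots around a variant calling step
before = {"description": "sample1", "align_bam": "/data/sample1.bam"}
after = dict(before, vrn_file="/data/sample1.vcf.gz")
print(changed_keys(before, after))  # ['vrn_file']
```

       The changed keys between steps are exactly the candidates for a step's CWL outputs.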

TODO

       · Support  the  full variant calling workflow with additional steps like ensemble calling,
         heterogeneity detection and disambiguation.

       · Port RNA-seq and small RNA workflows to CWL.

       bcbio has two approaches to running on cloud providers like Amazon Web Services (AWS),
       Google Cloud (GCP) and Microsoft Azure. For smaller projects we use a simplified
       Ansible-based approach which automates spinning up single multicore machines for running
       either traditional or docs-cwl bcbio runs.

       For  larger  distributed  projects,  we're actively working on using docs-cwl support with
       runners like Cromwell that directly interface and run on cloud  services.  We'll  document
       these approaches here as they're tested and available.

       For  getting started, the CWL docs-cwl-installation documentation describes how to install
       bcbio-vm, which provides a wrapper around bcbio  that  automates  interaction  with  cloud
       providers  and  Docker.  bcbio_vm.py also cleans up the command line usage to make it more
       intuitive and provides a superset of functionality available in bcbio_nextgen.py.

GOOGLE CLOUD (GCP)

       Cromwell runs bcbio CWL pipelines on Google Cloud using the Google Pipelines API.

   Setup
       To set up a Google Compute environment, you'll make use of the web-based console and the
       gcloud and gsutil tools from the Google Cloud SDK, which provide command line interfaces
       to manage data in Google Storage and Google Compute instances. You can install with:

          bcbio_conda install -c conda-forge -c bioconda google-cloud-sdk

       For authentication, you want to set up a Google Cloud Platform service account. The
       environmental variable GOOGLE_APPLICATION_CREDENTIALS identifies a JSON file of
       credentials, which bcbio passes to Cromwell for authentication:

          gcloud auth login
          gcloud projects create your-project
          gcloud iam service-accounts create your-service-account
          gcloud projects add-iam-policy-binding your-project --member \
            "serviceAccount:your-service-account@your-project.iam.gserviceaccount.com" --role "roles/owner"
          gcloud iam service-accounts keys create ~/.config/gcloud/your-service-account.json \
            --iam-account your-service-account@your-project.iam.gserviceaccount.com
          export GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/your-service-account.json

       You'll need a project for your run, along with the Google Genomics API enabled, and a
       Google Storage bucket for your data and run intermediates:

          gcloud config set project your-project
          gcloud services enable genomics.googleapis.com
          gsutil mb gs://your-project

       Additional documentation for Cromwell: Google Pipelines API and Google authentication.

   Data preparation
       Cromwell can localize data present in Google Storage buckets as part of the run process
       and bcbio will translate the data present in these storage buckets into references for
       the CWL run inputs.

       Upload your data with gsutil:

          gsutil cp your_data.bam gs://your-project/inputs/

       Create a bcbio_system-gcp.yaml input file for docs-cwl-generate:

          gs:
            ref: gs://bcbiodata/collections
            inputs:
              - gs://your-project/inputs
          resources:
            default: {cores: 8, memory: 3G, jvm_opts: [-Xms750m, -Xmx3000m]}

       Then  create  a  sample  input CSV and template YAML file for automated-sample-config. The
       first column of the CSV file should contain references to your input files  (your_file.bam
       or  your_file_R1.fastq.gz;your_file_R2.fastq.gz),  which  avoids  needing  to  specify the
       inputs on the command line.
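       For example, a minimal sample CSV might look like the following sketch. The metadata
       columns after the first are illustrative; use whatever metadata your project needs:

       ```
       samplename,description,batch,phenotype
       your_file.bam,sample1,batch1,tumor
       your_file_R1.fastq.gz;your_file_R2.fastq.gz,sample2,batch1,normal
       ```

       bcbio matches these file references against the contents of the inputs folders
       listed in bcbio_system-gcp.yaml.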

       Generate a Common Workflow Language representation:

          bcbio_vm.py template --systemconfig bcbio_system-gcp.yaml ${TEMPLATE}-template.yaml $PNAME.csv
          bcbio_vm.py cwl --systemconfig bcbio_system-gcp.yaml $PNAME/config/$PNAME.yaml

   Running
       Run the CWL using Cromwell by specifying the project and root Google  Storage  bucket  for
       intermediates:

          bcbio_vm.py cwlrun cromwell $PNAME-workflow --cloud-project your-project \
              --cloud-root gs://your-project/work_cromwell

AMAZON WEB SERVICES (AWS BATCH)

       We're working to support Amazon Web Services (AWS) using AWS Batch and Cromwell, following
       the AWS for Genomics documentation. This documents the current work in progress; it is not
       yet fully running and needs additional Cromwell development for AWS CWL support.

   Setup
       0.  Optionally,  create a bcbio IAM user and bcbio keypair for creating AWS Batch specific
           resources. bcbio-vm can automate this process, although they can also be pre-existing.
           If you'd like to use bcbio-vm automation, you'll need to have an account at Amazon and
           your Access Key ID and Secret Key ID from the AWS security credentials page. These can
           be  IAM  credentials  instead  of  root credentials as long as they have administrator
           privileges. Make them available to bcbio using the standard environmental variables:

              export AWS_ACCESS_KEY_ID=your_access_key
              export AWS_SECRET_ACCESS_KEY=your_secret_key

            With this in place, create public/private keys and a bcbio IAM user with:

              bcbio_vm.py aws iam --region=us-east-1

       1.  Using either existing credentials or those created by bcbio, set up AWS credentials
           for accessing AWS resources from your machine by editing ~/.aws/credentials:

              [default]
              aws_access_key_id = YOURACCESSID
              aws_secret_access_key = yoursecretkey
              region = us-east-1

       2.  Automate creation of resources for AWS Batch. This includes creating a custom Amazon
           Machine Image (AMI) for AWS Batch, which allows automatic allocation of additional
           disk space during workflow runs. It also sets up an AWS Batch environment, VPC and IAM
           for running workflows. A single bcbio-vm command runs both CloudFormation scripts:

              bcbio_vm.py aws cromwell --keypair bcbio --bucket bcbio-batch-cromwell-test

           This will output the S3 bucket and job queue for running Cromwell:

              AMI: ami-00bd75374ccaa1fc6
              Region: us-east-1
              S3 bucket: s3://your-project
              Job Queue (Spot instances): arn:aws:batch:us-east-1:678711657553:job-queue/GenomicsDefaultQueue-358a1deb9f4536b
              High priority Job Queue: arn:aws:batch:us-east-1:678711657553:job-queue/GenomicsHighPriorityQue-3bff21e3c4f44d4

   Data preparation
       The easiest way to organize AWS projects is using an analysis folder inside an S3 bucket.
       Create a bucket and folder for your analysis and upload input files (fastq or BAM) and
       other associated files. Bucket names should include only lowercase letters, numbers and
       hyphens (-) to conform to S3 bucket naming restrictions and avoid issues with resolution
       of SSL keys. You can create buckets and upload files using the AWS CLI client or the AWS
       S3 web console:

          aws s3 sync /local/inputs s3://your-bucket/inputs

       Create a bcbio_system-aws.yaml input file for docs-cwl-generate:

          s3:
            ref: s3://bcbiodata/collections
            inputs:
              - s3://your-bucket/inputs
          resources:
            default: {cores: 8, memory: 3G, jvm_opts: [-Xms750m, -Xmx3000m]}

       Generate a Common Workflow Language representation:

          CLOUD=aws
          bcbio_vm.py template --systemconfig bcbio_system-$CLOUD.yaml ${TEMPLATE}-template.yaml $PNAME.csv
          bcbio_vm.py cwl --systemconfig bcbio_system-$CLOUD.yaml $PNAME/config/$PNAME.yaml

   Running
       Run  the  CWL  using Cromwell by specifying the batch job queue Amazon Resource Name (ARN)
       and bucket from the setup process:

           bcbio_vm.py cwlrun cromwell $PNAME-workflow \
             --cloud-project arn:aws:batch:us-east-1:678711657553:job-queue/GenomicsDefaultQueue-358a1deb9f4536b \
             --cloud-root s3://your-project

AMAZON WEB SERVICES (OLD)

       We're phasing out this approach to AWS support in bcbio and are actively moving to  Common
       Workflow Language based approaches. This documents the old Elasticluster approach to build
       a cluster on AWS with an encrypted  NFS  mounted  drive  and  an  optional  Lustre  shared
       filesystem.

   Data preparation
       You  need  a template file describing the type of run to do and a CSV file mapping samples
       in the bucket to names and any other metadata. See the  automated-sample-config  docs  for
       more details about these files. Also upload both of these files to S3.

       With that in place, prepare and upload the final configuration to S3 with:

          bcbio_vm.py template s3://your-project/your-analysis/template.yaml s3://your-project/your-analysis/name.csv

       This will find the input files in the s3://your-project/your-analysis bucket, associate
       fastq and BAM files with the right samples, and add found BED files as variant_regions
       in the configuration. It will then upload the final configuration back to S3 as
       s3://your-project/your-analysis/name.yaml, which you can run directly from a bcbio
       cluster on AWS. By default, bcbio uses the us-east S3 region, but you can specify a
       different region in the s3 path to the metadata file:
       s3://your-project@eu-central-1/your-analysis/name.csv
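       The bucket@region convention can be expressed as a short parsing sketch. This is an
       illustration of the path format, not bcbio's internal code, and the us-east-1 default
       mirrors the behavior described above:

```python
import re

def parse_s3_path(path):
    """Split s3://bucket[@region]/key into parts, defaulting to us-east-1."""
    m = re.match(r"s3://([^@/]+)(?:@([^/]+))?/(.+)", path)
    if m is None:
        raise ValueError("not an s3:// path: %s" % path)
    bucket, region, key = m.groups()
    return bucket, region or "us-east-1", key

print(parse_s3_path("s3://your-project@eu-central-1/your-analysis/name.csv"))
# ('your-project', 'eu-central-1', 'your-analysis/name.csv')
```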

       We currently support human analysis with both the GRCh37 and hg19 genomes. We can also add
       additional genomes as needed by the community and generally welcome feedback and  comments
       on reference data support.

   AWS setup
       The  first  time  running  bcbio  on  AWS you'll need to setup permissions, VPCs and local
       configuration files. We provide commands to automate all these steps  and  once  finished,
       they can be re-used for subsequent runs. To start you'll need to have an account at Amazon
       and your Access Key ID and Secret Key ID from the AWS security credentials page. These can
       be  IAM  credentials  instead  of  root  credentials  as  long  as they have administrator
       privileges. Make them available to bcbio using the standard environmental variables:

          export AWS_ACCESS_KEY_ID=your_access_key
          export AWS_SECRET_ACCESS_KEY=your_secret_key

       With this in place, two commands setup your elasticluster and AWS  environment  to  run  a
       bcbio  cluster.  The  first  creates public/private keys, a bcbio IAM user, and sets up an
       elasticluster config in ~/.bcbio/elasticluster/config:

          bcbio_vm.py aws iam --region=us-east-1

       The second configures a VPC to host bcbio:

          bcbio_vm.py aws vpc --region=us-east-1

       The aws vpc command is idempotent and can run multiple times if you change or remove parts
       of  the  infrastructure.  You  can  also  rerun  the aws iam command, but if you'd like to
       generate a new elasticluster configuration file  (~/.bcbio/elasticluster/config)  add  the
       recreate flag: bcbio_vm.py aws iam --recreate. This generates a new set of IAM credentials
       and public/private keys. These are only stored in the ~/.bcbio directory so  you  need  to
       fully recreate them if you delete the old ones.

   Running a cluster
       Following this setup, you're ready to run a bcbio cluster on AWS. We start from a standard
       Ubuntu AMI, installing all software for bcbio and the cluster as part of the boot process.

       To configure your cluster run:

          bcbio_vm.py aws config edit

       This dialog allows you to define the cluster size and machine resources you'd like to use.
       The  defaults only have small instances to prevent accidentally starting an expensive run.
       If you're planning a run with less than 32 cores, do not use a  cluster  and  instead  run
       directly on a single machine using one of the large r3 or c3 instances.

       This  script  also  sets the size of the encrypted NFS-mounted drive, which you can use to
       store processing data when running across a distributed cluster. At scale, you can replace
       this  with  a Lustre shared filesystem. See below for details on launching and attaching a
       Lustre filesystem to a cluster.

       To ensure everything is correctly configured, run:

          bcbio_vm.py aws info

       When happy with your setup, start the cluster with:

          bcbio_vm.py aws cluster start

       The cluster will take five to ten minutes to start and be provisioned.  If  you  encounter
       any  intermittent  failures, you can rerun the cluster configuration step with bcbio_vm.py
       aws cluster  setup  or  the  bcbio-specific  installation  with  bcbio_vm.py  aws  cluster
       bootstrap.

   Running Lustre
       Elasticluster mounts the /encrypted directory as an NFS share available across all of the
       worker machines. You can use this as a processing directory for smaller runs, but larger
       runs may need a scalable distributed file system. bcbio supports using Intel Cloud
       Edition for Lustre (ICEL) to set up a Lustre scratch filesystem on AWS.

       · Subscribe to ICEL in the Amazon Marketplace.

       · By default, the Lustre filesystem will be 2TB and will be accessible to all hosts in the
         VPC.  Creation  takes  about  ten minutes and can happen in parallel while elasticluster
         sets up the cluster. Start the stack:

            bcbio_vm.py aws icel create

          If you encounter intermittent failures when installing the collectl plugin, the Lustre
          server was created successfully and you can rerun the configuration step with
          bcbio_vm.py aws icel create --setup. If a failure occurred before the collectl plugin
          installation, you should stop the stack and try again.

       · Once  the ICEL stack and elasticluster cluster are both running, mount the filesystem on
         the cluster:

            bcbio_vm.py aws icel mount

       · The cluster instances will reboot with the Lustre filesystem mounted.

   Running an analysis
       To run the analysis, connect to the head node with:

          bcbio_vm.py aws cluster ssh

       Create your project directory and link the global bcbio configuration file in there with:

       · NFS file system (no Lustre):

            mkdir /encrypted/your-project
            cd !$ && mkdir work && cd work

       · Lustre file system:

            sudo mkdir /scratch/cancer-dream-syn3-exome
            sudo chown ubuntu !$
            cd !$ && mkdir work && cd work

       If you started a single machine, run with:

          bcbio_vm.py run -n 8 s3://your-project/your-analysis/name.yaml

       Where the -n argument should be the number of cores on the machine.

       To run on a full cluster:

          bcbio_vm.py ipythonprep s3://your-project/your-analysis/name.yaml slurm cloud -n 60
          sbatch bcbio_submit.sh

       Where 60 is the total number of cores to use across all the worker nodes.  Of  your  total
       machine  cores,  allocate 2 for the base bcbio_vm script and IPython controller instances.
       The SLURM workload manager distributes jobs across your cluster on a queue  called  cloud.
       A  slurm-PID.out  file  in  the work directory contains the current status of the job, and
       sacct_std provides the status of jobs on the cluster. If you are new to SLURM, here  is  a
       summary of useful SLURM commands.

       On  successful  completion,  bcbio  uploads  the results of the analysis back into your s3
       bucket and folder  as  s3://your-project/your-analysis/final.  You  can  now  cleanup  the
       cluster and Lustre filesystem.

   Graphing resource usage
       AWS  runs  include  automatic  monitoring  of  resource usage with collectl. bcbio_vm uses
       collectl statistics to plot CPU, memory, disk and network usage during each step of a run.
       To   prepare   resource   usage   plots  after  finishing  an  analysis,  first  copy  the
       bcbio-nextgen.log file to your local computer. Either use bcbio_vm.py  elasticluster  sftp
       bcbio       to       copy       from       the       work       directory      on      AWS
       (/encrypted/your-project/work/log/bcbio-nextgen.log) or transfer it  from  the  output  S3
       bucket (your-project/your-analysis/final/DATE_your-project/bcbio-nextgen.log).

       If  your  run  worked cleanly you can use the log input file directly. If you had failures
       and restarts, or would only like to graph part of the run, you can edit the timing  steps.
       Run  grep  Timing bcbio-nextgen.log > your-run.txt to get the timing steps only, then edit
       as desired.

       Retrieve the collectl statistics from the AWS  cluster  and  prepare  the  resource  usage
       graphs with:

          bcbio_vm.py graph bcbio-nextgen.log

       By default the collectl stats will be in monitoring/collectl and plots in
       monitoring/graphs based on the above log timeframe. If you need to re-run plots later
       after shutting the cluster down, use the --cluster none flag: bcbio_vm.py graph
       bcbio-nextgen.log --cluster none.

       If you'd like to run graphing from a local non-AWS run, such as a local HPC  cluster,  run
       bcbio_vm.py graph bcbio-nextgen.log --cluster local instead.

       For convenience, there's a "serialize" flag (-s) that saves the dataframe used for
       plotting. To explore the data, extract specific datapoints or zoom in, deserialize
       the output like a standard Python pickle file:

          import gzip
          import cPickle as pickle

          with gzip.open("./monitoring/collectl_info.pickle.gz", "rb") as decomp:
              collectl_info = pickle.load(decomp)
          data, hardware, steps = collectl_info[1][0], collectl_info[1][1], collectl_info[1][2]

       You can then plot, slice and zoom the data in a Jupyter notebook using matplotlib or
       highcharts (https://github.com/arnoutaertgeerts/python-highcharts).

       In addition to plots, the summarize_timing.py utility script prepares a summary  table  of
       run times per step.

   Shutting down
       The  bcbio Elasticluster and Lustre integration can spin up a lot of AWS resources. You'll
       be paying for these by the hour so you want to clean them up when you finish running  your
       analysis. To stop the cluster:

          bcbio_vm.py aws cluster stop

       To remove the Lustre stack:

          bcbio_vm.py aws icel stop

       Double check that all instances have been properly stopped by looking in the AWS console.

   Manual configuration
       Experienced  elasticluster  users  can  edit  the  configuration  files  themselves. bcbio
       provides a small wrapper that automatically reads and writes these configurations to avoid
       users  needing  to  understand  elasticluster  internals,  but  all functionality is fully
       available.  Edit your ~/.bcbio/elasticluster/config file to  change  parameters.  You  can
       also  see  the  latest  example configuration.  in the bcbio-vm GitHub repository for more
       details on the other available options.

       bcbio-nextgen runs in a temporary work directory which contains a number of processing
       intermediates. Pipeline completion extracts the final useful output files into a separate
       directory, specified by the upload-configuration. This configuration allows upload to
       local directories, Galaxy, or Amazon S3. Once you have extracted and confirmed the output
       files, you can delete the temporary directory to save space.

COMMON FILES

       The output directory contains sample specific output files labeled by sample  name  and  a
       more  general project directory. The sample directories contain all of the sample specific
       output files, while the project directory contains global files like project summaries  or
       batched  population level variant calls. See the teaching documentation for a full variant
       calling example with additional details about configuration setting and  resulting  output
       files.

   Project directory
       · project-summary.yaml  --  Top  level  YAML  format  summary file with statistics on read
         alignments and duplications as well as analysis specific metrics.

       · programs.txt -- Program versions for bcbio-nextgen and software  run  in  the  pipeline.
         This enables reproduction of analyses.

       · multiqc -- Directory of MultiQC output, gathering QC metrics from the different tools
         run in the pipeline (cutadapt, featureCounts, samtools, STAR and others) into a single
         HTML report.

       · metadata.csv -- CSV with the metadata in the YAML file.

   Sample directories
       · SAMPLE/qc -- Directory of quality control runs for the sample.  These include charts and
         metrics for assessing quality of sequencing and analysis.

       · SAMPLE-ready.bam -- A prepared BAM file of the aligned reads.  Depending on the analysis
         used, this may include trimmed, recalibrated and realigned reads following alignment.

VARIANT CALLING

   Project directory
       · grading-summary.csv -- Grading details comparing each  sample  to  a  reference  set  of
         calls. This will only have information when providing a validation callset.

       · BATCH-caller.vcf  --  Variants  called for a population/batch of samples by a particular
         caller.

       · BATCH-caller.db -- A GEMINI database associating variant calls with a  wide  variety  of
         third  party  annotations.  This  provides  a  queryable framework for assessing variant
         quality statistics.

   Sample directories
       · SAMPLE-caller.vcf -- Variant calls for an individual sample.

       · SAMPLE-gdc-viral-completeness.txt -- Optional viral contamination estimates. The file
         has the format depth, 1x, 5x, 25x: depth is the number of reads aligning to the virus,
         and 1x, 5x and 25x are the percentages of the viral sequence covered by reads at 1x, 5x
         and 25x depth. Real viral contamination will have broad coverage across the entire
         genome, so high values for these percentages, depending on sequencing depth. High depth
         with low viral sequence coverage indicates a likely false positive.
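       A record in that format could be parsed along these lines. This is a sketch assuming
       whitespace-separated values and percentage coverage fields; the exact layout of the
       file may differ:

```python
def parse_viral_completeness(line):
    """Parse one 'depth 1x 5x 25x' record; coverage fields as percentages."""
    depth, cov1x, cov5x, cov25x = line.split()
    return {"depth": int(depth), "1x": float(cov1x),
            "5x": float(cov5x), "25x": float(cov25x)}

# Hypothetical record: many aligned reads with broad coverage of the virus
rec = parse_viral_completeness("12503 98.2 95.0 80.4")
```

       In this invented example, high depth together with broad 1x/5x/25x coverage would
       point to real contamination rather than a false positive.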

RNA-SEQ

   Project directory
       · annotated_combined.counts  --  featureCounts  counts matrix with gene symbol as an extra
         column.

       · combined.counts -- featureCounts counts matrix.

       · combined.dexseq -- DEXseq counts matrix with exonID as first column.

       · combined.gene.sf.tpm -- Sailfish gene count matrix normalized to TPM.

       · combined.isoform.sf.tpm -- Sailfish transcript count matrix normalized to TPM.

       · combined.sf -- Sailfish raw output, all samples files are pasted one after another.

       · tx2gene.csv -- Annotation file needed for DESeq2 to use Sailfish output.

   Sample directories
       · SAMPLE-transcriptome.bam -- BAM file aligned to transcriptome.

       · SAMPLE-ready.counts -- featureCounts gene counts output.

       · salmon -- Salmon output.

SINGLE CELL RNA-SEQ

   Project directory
       · tagcounts.mtx -- count matrix compatible with dgCMatrix type in R.

       · tagcounts-dupes.mtx -- count matrix compatible with dgCMatrix type in  R  but  with  the
         duplicated reads counted.

       · tagcounts.mtx.colnames -- cell names that would be the columns for the matrix.

       · tagcounts.mtx.rownames -- gene names that would be the rows for the matrix.

       · tagcounts.mtx.metadata -- metadata matching the colnames of the matrix, taken from the
         barcode.csv file and the metadata given in the YAML config file.

       · cb-histogram.txt -- total number of deduplicated reads assigned to each cell. Comparing
         colSums(tagcounts.mtx) to these numbers can tell you how many reads mapped to genes.
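       If you'd rather inspect the matrix in Python than R, tagcounts.mtx is a MatrixMarket
       file: with scipy installed, scipy.io.mmread("tagcounts.mtx") loads it as a sparse
       matrix. Without scipy, a minimal hand-rolled reader looks like this sketch, assuming
       a coordinate-format matrix with a %-prefixed header:

```python
def read_mtx(path):
    """Minimal MatrixMarket coordinate reader: returns (nrows, ncols, entries)."""
    with open(path) as fh:
        lines = (line for line in fh if not line.startswith("%"))  # skip header/comments
        nrows, ncols, nnz = map(int, next(lines).split())
        entries = {}
        for _ in range(nnz):
            i, j, v = next(lines).split()
            entries[(int(i) - 1, int(j) - 1)] = float(v)  # 1-based -> 0-based indices
    return nrows, ncols, entries
```

       The .rownames and .colnames files then supply the gene and cell labels for the row
       and column indices.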

       To create a Seurat object:

       in bash:

          mkdir data
          cd data
          result_dir=bcbio_project/final/project_dir
          cp $result_dir/tagcounts.mtx matrix.mtx
          cp $result_dir/tagcounts.mtx.colnames barcodes.tsv
          cp $result_dir/tagcounts.mtx.rownames features.tsv
          for f in *;do gzip $f;done;
          cd ..

       in R:

          library(Seurat)
          counts <- Read10X(data.dir = "data", gene.column = 1)
          seurat_object <- CreateSeuratObject(counts = counts, min.features = 100)
          saveRDS(seurat_object, "seurat.bcbio.RDS")

   Sample directories
       · SAMPLE-transcriptome.bam -- BAM file aligned to transcriptome.

       · SAMPLE-mtx.* -- gene counts as explained in the project directory.

SMALL RNA-SEQ

   Project directory
       · counts_mirna.tsv -- miRBase miRNA count matrix.

       · counts.tsv -- miRBase isomiRs count matrix. Each ID is made of 5 tags: miRNA name,
         SNPs, additions, trimming at 5' and trimming at 3'.

       · counts_mirna_novel.tsv -- miRDeep2 miRNA count matrix.

       · counts_novel.tsv -- miRDeep2 isomiRs count matrix. See the counts.tsv explanation for
         more detail.

       · seqcluster -- output of the seqcluster tool. Inside this folder, counts.tsv has the
         count matrix for all clusters found over the genome.

       · seqclusterViz     --      input      file      for      interactive      browser      at
         https://github.com/lpantano/seqclusterViz

       · report  --  Rmd  template to help with downstream analysis like QC metrics, differential
         expression, and clustering.

   Sample directories
       · SAMPLE-mirbase-ready.counts -- counts for miRBase miRNAs.

       · SAMPLE-novel-ready -- counts for miRDeep2 novel miRNAs.

       · tRNA -- output for tdrmapper.

DOWNSTREAM ANALYSIS

       This section collects useful scripts and tools to do downstream analysis of  bcbio-nextgen
       outputs. If you have pointers to useful tools, please add them to the documentation.

       · Calculate and plot coverage with matplotlib, from Luca Beltrame.

       · Another way to visualize coverage for targeted NGS (exome) experiments with bedtools
         and R, from Stephen Turner.

       · Assess the efficiency of targeted enrichment sequencing with ngscat.

       This section provides useful concepts for  getting  started  digging  into  the  code  and
       contributing  new functionality. We welcome contributors and hope these notes help make it
       easier to get started.

DEVELOPMENT GOALS

       During development we seek  to  maximize  functionality  and  usefulness,  while  avoiding
       complexity.  Since  these  goals  are sometimes in conflict, it's useful to understand the
       design approaches:

       · Support high level  configurability  but  avoid  exposing  all  program  options.  Since
         pipelines  support  a wide variety of tools, each with a large number of options, we try
         to define configuration variables at high level based  on  biological  intent  and  then
         translate these into best-practice options for each tool. The goal is to avoid having an
         overwhelming number of input configuration options.

       · Provide best-practice pipelines that make recommended decisions for processing.  Coupled
         with  goal  of  minimizing  configuration parameters, this requires trust and discussion
         around algorithm choices. An example is bwa alignment, which  uses  bwa  aln  for  reads
         shorter  than  75bp and bwa mem for longer reads, based on recommendations from Heng Li.
         Our general goal is to encourage discussion and development of best-practices to make it
         easy to do the right thing.

       · Support extensive debugging output. In complex distributed systems, programs fail in
         unexpected ways even during production runs. We try to maximize logging to help
         identify and diagnose these types of unexpected problems.

       · Avoid  making mistakes. This results in being conservative about decisions like deleting
         file intermediates. Coupled with extensive logging, we trade off disk usage  for  making
         it  maximally  easy  to  restart and debug problems. If you'd like to delete work or log
         directories automatically, we recommend  doing  this  as  part  of  your  batch  scripts
         wrapping bcbio-nextgen.

       · Strive for a clean, readable code base. We strive to make the code a secondary source of
         information after hand written docs.  Practically,  this  means  maximizing  information
         content in source files while using in-line documentation to clarify as needed.

       · Focus  on  a  functional  coding  style with minimal use of global mutable objects. This
         approach works well with distributed code and isolates debugging to individual functions
         rather than globally mutable state.

       · Make sure your changes integrate correctly by running the test suite before submitting
         a pull request. The pipeline is automatically tested in Travis-CI, and a red label
         will appear on the pull request if your changes cause any test failures.

OVERVIEW

       The most useful modules inside bcbio, ordered by likely interest:

       · pipeline  -- Top level functionality that drives the analysis pipeline. main.py contains
         top level definitions of pipelines like variant calling and  RNAseq,  and  is  the  best
         place to start understanding the overall organization of the code.

       · ngsalign  --  Integration  with aligners for high-throughput sequencing data. We support
         individual aligners with their own separate modules.

       · variation -- Tools for  variant  calling.  Individual  variant  calling  and  processing
         approaches each have their own submodules.

       · rnaseq -- Run RNA-seq pipelines, currently supporting TopHat/Cufflinks.

       · provenance  --  Track  third  party  software  versions, command lines and program flow.
         Handle writing of debugging details.

       · distributed -- Handle distribution of programs across multiple cores, or across multiple
         machines using IPython.

       · workflow  --  Provide  high  level  tools  to  run  customized  analyses.  They tie into
         specialized analyses or visual front ends  to  make  running  bcbio-nextgen  easier  for
         specific common tasks.

       · broad  --  Code  to  handle  calling  Broad tools like GATK and Picard, as well as other
         Java-based programs.

DEVELOPMENT INFRASTRUCTURE

   GitHub
       bcbio-nextgen uses GitHub for code development, and we welcome pull requests. GitHub makes
       it  easy  to  establish  custom forks of the code and contribute those back. The Biopython
       documentation has great information on using git and  GitHub  for  a  community  developed
       project.  In short, make a fork of the bcbio code by clicking the Fork button in the upper
       right corner of the GitHub page, commit your changes to this custom fork and keep it up to
       date  with  the  main bcbio repository as you develop. The github help pages have detailed
       information  on  keeping  your  fork  updated  with  the  main  github  repository   (e.g.
       https://help.github.com/articles/syncing-a-fork/). After committing changes, click New
       Pull Request from your fork when you'd like to submit  your  changes  for  integration  in
       bcbio.

   Creating a separate bcbio installation
       When developing, you want to avoid breaking your production bcbio instance. Use the
       installer to create a separate bcbio instance without downloading any data. Before
       installing the second bcbio instance, check your PATH and PYTHONPATH variables; it is
       better to avoid mixing bcbio instances in the PATH. Also check
       ~/.conda/environments.txt.

       To  install  in ${HOME}/local/share/bcbio (your location might be different, make sure you
       have ~30G of disk quota there):

          wget https://raw.github.com/chapmanb/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
          python bcbio_nextgen_install.py ${HOME}/local/share/bcbio --tooldir=${HOME}/local --nodata --isolate

       Make soft links to the data from your production bcbio instance  (your  installation  path
       could be different from /n/app/bcbio):

          ln -s /n/app/bcbio/biodata/genomes/ ${HOME}/local/share/genomes
          ln -s /n/app/bcbio/biodata/galaxy/tool-data ${HOME}/local/share/bcbio/galaxy/tool-data

       Add this directory to your PATH (note that it is best to remove the production bcbio
       instance and its tools from your PATH):

          echo $PATH
          # keep everything you need except the production bcbio paths
          export PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:
          export PATH=${HOME}/local/share/bcbio/anaconda/bin:${HOME}/local/bin:$PATH

       Or            directly            call            the            testing            bcbio:
       ${HOME}/local/share/bcbio/anaconda/bin/bcbio_nextgen.py.

   Injecting bcbio code into bcbio installation
       To install from your bcbio-nextgen source tree for testing do:

          # make sure you are using the development bcbio instance
          which bcbio_python
          # local git folder
          cd ~/code/bcbio-nextgen
          bcbio_python setup.py install

       One  tricky  part  that  we  don't  yet  know  how to work around is that pip and standard
       setup.py install have different ideas about how to write  Python  eggs.  setup.py  install
       will create an isolated python egg directory like bcbio_nextgen-1.1.5-py3.6.egg, while pip
       creates an egg pointing to a top level bcbio directory. Where this gets tricky is that the
       top level bcbio directory takes precedence. The best way to work around this problem is to
       manually   remove   the   current   pip   installed    bcbio-nextgen    code    (rm    -rf
       /path/to/anaconda/lib/python3.6/site-packages/bcbio*)  before  managing  it  manually with
       bcbio_python  setup.py  install.  We'd  welcome  tips  about  ways  to  force   consistent
       installation across methods.

BUILDING THE DOCUMENTATION LOCALLY

       If you have added or modified this documentation, you can build it locally and preview
       the result by running:

          cd docs
          make html

       The documentation will be built under docs/_build/html, open index.html with your  browser
       to load your local build.

       If you want to use the same theme that Read The Docs uses, install sphinx_rtd_theme via
       pip. You will also need to add the following to docs/conf.py so the theme is only used
       for local builds:

          html_theme = 'default'
          on_rtd = os.environ.get('READTHEDOCS', False)
          if not on_rtd:  # only import and set the theme if we're building docs locally
              import sphinx_rtd_theme
              html_theme = 'sphinx_rtd_theme'
              html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]

ADDING TOOLS

   Aligner
       Write  new  aligners within their own submodule inside the ngsalign directory. bwa.py is a
       good example to follow along with. There are two functions to implement,  based  on  which
       type of alignment you'd like to allow:

       · align_bam  --  Performs  alignment given an input BAM file.  Expected to return a sorted
         BAM output file.

       · align -- Performs alignment given FASTQ inputs (gzipped or not). This is generally
         expected to use a Unix pipe-based approach that minimizes intermediates and disk IO,
         returning a sorted BAM output file. For backwards compatibility this can also return a
         text-based SAM file.

       See the names section for more details on arguments.

       Other required implementation details include:

       · galaxy_loc_file  --  Provides the name of the Galaxy loc file used to identify locations
         of  indexes  for  this  aligner.  The  automated  installer  sets  up  these  loc  files
         automatically.

       · remap_index_fn -- A function that remaps an index from the Galaxy location file into the
         exact one for this aligner. This is useful for tools which aren't supported by a  Galaxy
         .loc file but you can locate them relative to another index.

       Once  implemented,  plug  the  aligner  into  the  pipeline  by  defining it as a _tool in
       bcbio/pipeline/alignment.py. You can then use it as normal by specifying the name  of  the
       aligner in the aligner section of your configuration input.
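
       As a concrete sketch of this interface, a minimal aligner submodule could look like the
       following. This is illustrative only: the myaligner command, its flags and the loc file
       name are hypothetical, and where the documented contract executes the command and returns
       the sorted BAM file, this sketch simply builds and returns the command line for clarity.

```python
# Hypothetical ngsalign submodule sketching the documented aligner interface.
# "myaligner" and its flags are placeholders, not a real tool.
import os

galaxy_loc_file = "myaligner_indices.loc"  # loc file listing index locations

def remap_index_fn(ref_index):
    # Derive this aligner's index from another index recorded in the Galaxy
    # loc file, for tools without a dedicated .loc file.
    return os.path.splitext(ref_index)[0] + ".myidx"

def align(fastq_file, pair_file, ref_file, names, align_dir, data):
    # Pipe aligner output straight into a coordinate-sorted BAM, avoiding
    # intermediate SAM files on disk. See the names section for the fields
    # used to build the read group.
    out_file = os.path.join(align_dir, "%s-sort.bam" % names["lane"])
    rg_info = r"@RG\tID:%s\tPL:%s\tPU:%s\tSM:%s" % (
        names["rg"], names["pl"], names["pu"], names["sample"])
    cmd = ("myaligner -R '{rg}' {ref} {fq1} {fq2} "
           "| samtools sort -o {out} -").format(
               rg=rg_info, ref=ref_file, fq1=fastq_file,
               fq2=pair_file or "", out=out_file)
    return cmd, out_file
```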

   Variant caller
       New  variant  calling  approaches live within their own module inside bcbio/variation. The
       freebayes.py implementation is a good example to follow for  providing  your  own  variant
       caller.  Implement  a  function to run variant calling on multiple BAMs in an input region
       that takes the following inputs:

       · align_bams -- A list of BAM files to call simultaneously.

       · items -- List of data dictionaries associated with each of the  samples  in  align_bams.
         Enables  customization  of  variant  calling  based  on sample configuration inputs. See
         documentation on the data dictionary for all of the information  contained  inside  each
         data  item.  Having  multiple  configurations  allows  customization  of sample specific
         variant calls using parameters supplied to sample-configuration.

       · ref_file -- Fasta reference genome file.

       · assoc_files -- Useful associated files for variant calling. This includes the DbSNP  VCF
         file.   It's   a   named   tuple  mapping  to  files  specified  in  the  configuration.
         bcbio/pipeline/shared.py has the available inputs.

       · region -- A tuple of (chromosome, start, end) specifying the region to call in.

       · out_file -- The output file to write to. This should contain calls for all input samples
         in the supplied region.

       Once  implemented,  add the variant caller into the pipeline by updating caller_fns in the
       variantcall_sample function in bcbio/variation/genotype.py. You can use it  by  specifying
       it in the variantcaller parameter of your sample configuration.
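
       A hedged sketch of such a function, using an invented mycaller command: the
       min_allele_fraction lookup illustrates per-sample customization from items, and a real
       implementation would also skip work when out_file already exists and run the command
       through bcbio's helpers rather than returning it.

```python
# Hypothetical variant caller following the documented interface. The
# "mycaller" command and its flags are placeholders.
def run_mycaller(align_bams, items, ref_file, assoc_files, region, out_file):
    chrom, start, end = region
    # Customize behavior from each sample's configuration in items.
    min_af = min(item.get("config", {}).get("algorithm", {})
                 .get("min_allele_fraction", 10) for item in items)
    cmd = ("mycaller --ref {ref} --region {chrom}:{start}-{end} "
           "--min-af {af} {bams} > {out}").format(
               ref=ref_file, chrom=chrom, start=start, end=end,
               af=min_af, bams=" ".join(align_bams), out=out_file)
    return cmd
```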

ADDING NEW ORGANISMS

       While bcbio-nextgen and supporting tools receive the most testing and development on human
       or human-like diploid organisms, the algorithms are generic and we strive to  support  the
       wide  diversity  of organisms used in your research. We welcome contributors interested in
       setting up and maintaining support  for  their  particular  research  organism,  and  this
       section  defines  the  steps  in integrating a new genome. We also welcome suggestions and
       implementations that improve this process.

       Setup CloudBioLinux to automatically download and prepare the genome:

       · Add the genome database key and organism name to list  of  supported  organisms  in  the
         CloudBioLinux configuration (config/biodata.yaml).

       · Add download details to specify where to get the fasta genome files
         (cloudbio/biodata/genomes.py). CloudBioLinux supports common genome providers like UCSC
         and Ensembl directly.

       Add the organism to the supported installs within bcbio:

       · This happens in two places: for the initial installer (scripts/bcbio_nextgen_install.py)
         and the updater (bcbio/install.py).

       Test installation of genomes by pointing to your local cloudbiolinux edits during  a  data
       installation:

          mkdir -p tmpbcbio-install
          ln -s ~/bio/cloudbiolinux tmpbcbio-install
          bcbio_nextgen.py upgrade --data --genomes DBKEY

       Add      configuration      information     to     bcbio-nextgen     by     creating     a
       config/genomes/DBKEY-resources.yaml file. Copy an existing minimal template  like  canFam3
       and  edit  with  pointers to snpEff and other genome resources. The VEP database directory
       has Ensembl names. SnpEff has a command to list available databases:

          snpEff databases

       Finally, send  pull  requests  for  CloudBioLinux  and  bcbio-nextgen  and  we'll  happily
       integrate the new genome.

       This  will  provide basic integration with bcbio and allow running a minimal pipeline with
       alignment and quality control. We also have utility scripts in CloudBioLinux to help  with
       preparing  dbSNP  (utils/prepare_dbsnp.py) and RNA-seq (utils/prepare_tx_gff.py) resources
       for some genomes. For instance, to prepare RNA-seq transcripts for mm9:

          bcbio_python prepare_tx_gff.py --genome-dir /path/to/bcbio/genomes Mmusculus mm9

       We are still working on ways to best include these as  part  of  the  standard  build  and
       install  since  they  either require additional tools to run locally, or require preparing
       copies in S3 buckets.

STANDARD FUNCTION ARGUMENTS

   names
       This dictionary provides lane and other BAM run group naming information used to correctly
       build BAM files. We use the rg attribute as the ID within a BAM file:

          {'lane': '7_100326_FC6107FAAXX',
           'pl': 'illumina',
           'pu': '7_100326_FC6107FAAXX',
           'rg': '7',
           'sample': 'Test1'}

   data
       The data dictionary is a large dictionary representing processing, configuration and files
       associated with a sample. The standard work  flow  is  to  pass  this  dictionary  between
       functions,  updating with associated files from the additional processing. Populating this
       dictionary  only  with  standard  types  allows  serialization  to  JSON  for  distributed
       processing.

       The  dictionary  is dynamic throughout the workflow depending on the step, but some of the
       most useful key/values available throughout are:

       · config -- Input configuration variables about how to process in  the  algorithm  section
         and locations of programs in the resources section.

       · dirs -- Useful directories for building output files or retrieving inputs.

       · metadata  --  Top  level  metadata  associated  with  a sample, specified in the initial
         configuration.

       · genome_resources -- Naming aliases and associated  files  associated  with  the  current
         genome     build.    Retrieved    from    organism    specific    configuration    files
         (buildname-resources.yaml) this specifies the location of supplemental organism specific
         files like support files for variation and RNA-seq analysis.

       It also contains information about the genome build, sample name and reference genome
       file throughout. Here's an example of these inputs:

          {'config': {'algorithm': {'aligner': 'bwa',
                                    'callable_regions': 'analysis_blocks.bed',
                                    'coverage_depth': 'low',
                                    'coverage_interval': 'regional',
                                    'mark_duplicates': 'samtools',
                                    'nomap_split_size': 50,
                                    'nomap_split_targets': 20,
                                    'num_cores': 1,
                                    'platform': 'illumina',
                                    'quality_format': 'Standard',
                                    'realign': 'gkno',
                                    'recalibrate': 'gatk',
                                    'save_diskspace': True,
                                    'upload_fastq': False,
                                    'validate': '../reference_material/7_100326_FC6107FAAXX-grade.vcf',
                                    'variant_regions': '../data/automated/variant_regions-bam.bed',
                                    'variantcaller': 'freebayes'},
                      'resources': {'bcbio_variation': {'dir': '/usr/share/java/bcbio_variation'},
                                    'bowtie': {'cores': None},
                                    'bwa': {'cores': 4},
                                    'cortex': {'dir': '~/install/CORTEX_release_v1.0.5.14'},
                                    'cram': {'dir': '/usr/share/java/cram'},
                                    'gatk': {'cores': 2,
                                             'dir': '/usr/share/java/gatk',
                                             'jvm_opts': ['-Xms750m', '-Xmx2000m'],
                                             'version': '2.4-9-g532efad'},
                                    'gemini': {'cores': 4},
                                    'novoalign': {'cores': 4,
                                                  'memory': '4G',
                                                  'options': ['-o', 'FullNW']},
                                    'picard': {'cores': 1,
                                               'dir': '/usr/share/java/picard'},
                                    'snpEff': {'dir': '/usr/share/java/snpeff',
                                               'jvm_opts': ['-Xms750m', '-Xmx3g']},
                                    'stampy': {'dir': '~/install/stampy-1.0.18'},
                                    'tophat': {'cores': None},
                                    'varscan': {'dir': '/usr/share/java/varscan'},
                                    'vcftools': {'dir': '~/install/vcftools_0.1.9'}}},
          'genome_resources': {'aliases': {'ensembl': 'human',
                                            'human': True,
                                            'snpeff': 'hg19'},
                                'rnaseq': {'transcripts': '/path/to/rnaseq/ref-transcripts.gtf',
                                           'transcripts_mask': '/path/to/rnaseq/ref-transcripts-mask.gtf'},
                                'variation': {'dbsnp': '/path/to/variation/dbsnp_132.vcf',
                                              'train_1000g_omni': '/path/to/variation/1000G_omni2.5.vcf',
                                              'train_hapmap': '/path/to/hg19/variation/hapmap_3.3.vcf',
                                              'train_indels': '/path/to/variation/Mills_Devine_2hit.indels.vcf'},
                                'version': 1},
           'dirs': {'fastq': 'input fastq directory',
                        'galaxy': 'directory with galaxy loc and other files',
                        'work': 'base work directory'},
           'metadata': {'batch': 'TestBatch1'},
           'genome_build': 'hg19',
           'name': ('', 'Test1'),
           'sam_ref': '/path/to/hg19.fa'}

       Processing also injects other useful key/value pairs.  Here's  an  example  of  additional
       information supplied during a variant calling workflow:

          {'prep_recal': 'Test1/7_100326_FC6107FAAXX-sort.grp',
           'summary': {'metrics': [('Reference organism', 'hg19', ''),
                                   ('Total', '39,172', '76bp paired'),
                                   ('Aligned', '39,161', '(100.0\\%)'),
                                   ('Pairs aligned', '39,150', '(99.9\\%)'),
                                   ('Pair duplicates', '0', '(0.0\\%)'),
                                   ('Insert size', '152.2', '+/- 31.4')],
                       'pdf': '7_100326_FC6107FAAXX-sort-prep-summary.pdf',
                       'project': 'project-summary.yaml'},
           'validate': {'concordant': 'Test1-ref-eval-concordance.vcf',
                        'discordant': 'Test1-eval-ref-discordance-annotate.vcf',
                        'grading': 'validate-grading.yaml',
                        'summary': 'validate-summary.csv'},
           'variants': [{'population': {'db': 'gemini/TestBatch1-freebayes.db',
                                        'vcf': None},
                         'validate': None,
                         'variantcaller': 'freebayes',
                         'vrn_file': '7_100326_FC6107FAAXX-sort-variants-gatkann-filter-effects.vcf'}],
           'vrn_file': '7_100326_FC6107FAAXX-sort-variants-gatkann-filter-effects.vcf',
           'work_bam': '7_100326_FC6107FAAXX-sort-prep.bam'}
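
       Because the dictionary is deeply nested, code typically reads values through a safe nested
       lookup rather than chained indexing. The following sketch re-implements that access
       pattern with a small helper (bcbio itself uses utilities like toolz's get_in); the keys
       mirror the example above.

```python
def get_in(data, keys, default=None):
    # Safe nested lookup: walk the key path, returning default if any
    # level is missing. Mirrors the toolz.get_in access pattern.
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data

data = {"config": {"algorithm": {"aligner": "bwa", "coverage_depth": "low"}},
        "genome_build": "hg19"}
aligner = get_in(data, ["config", "algorithm", "aligner"])
recal = get_in(data, ["config", "algorithm", "recalibrate"], False)
```

       Populating the dictionary only with standard types, as noted above, keeps lookups like
       these JSON-serializable for distributed processing.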

PARALLELIZATION FRAMEWORK

       bcbio-nextgen supports parallel runs on local machines using multiple cores and
       distributed runs on clusters using IPython, within a general framework.

       The first parallelization step starts up a set of resources for processing. On a cluster
       this spawns an IPython parallel controller and a set of engines for processing. The prun
       (parallel run) start function is the entry point to spawning the cluster, and its main
       argument is a parallel dictionary which contains arguments to the engine processing
       command. Here is an example input from an IPython parallel run:

          {'cores': 12,
            'type': 'ipython',
           'progs': ['aligner', 'gatk'],
           'ensure_mem': {'star': 30, 'tophat': 8, 'tophat2': 8},
           'module': 'bcbio.distributed',
           'queue': 'batch',
           'scheduler': 'torque',
           'resources': [],
           'retries': 0,
           'tag': '',
           'timeout': 15}

       The cores and type arguments must be present, identifying the total cores to use and  type
       of  processing,  respectively. Following that are arguments to help identify the resources
       to use. progs specifies the programs used, here the aligner, which bcbio looks up from the
       input  sample  file,  and  gatk. ensure_mem is an optional argument that specifies minimum
       memory requirements to programs if used in the workflow. The remaining arguments  are  all
       specific to IPython to help it spin up engines on the appropriate computing cluster.

       A  shared component of all processing runs is the identification of used programs from the
       progs argument. The run creation process looks up required memory and  CPU  resources  for
       each program from the config-resources section of your bcbio_system.yaml file. It combines
       these resources  into  required  memory  and  cores  using  the  logic  described  in  the
       memory-management section of the parallel documentation. Passing these requirements to the
       cluster creation process ensures the available machines match program requirements.

       bcbio-nextgen's pipeline.main code contains examples of starting and using a set of
       available processing engines. This example starts up machines that use samtools, gatk and
       cufflinks then runs an RNA-seq expression analysis:

          with prun.start(_wprogs(parallel, ["samtools", "gatk", "cufflinks"]),
                          samples, config, dirs, "rnaseqcount") as run_parallel:
              samples = rnaseq.estimate_expression(samples, run_parallel)

       The pipelines often reuse a single set of machines for multiple distributed  functions  to
       avoid the overhead of starting up and tearing down machines and clusters.

       The run_parallel function returned from the prun.start function enables running jobs in
       parallel on the created machines. The ipython wrapper code contains examples of
       implementing this. It is a simple function that takes two arguments: the name of the
       function to run and a set of multiple arguments to pass to that function:

          def run(fn_name, items):

       The items argument must contain only strings, lists and dictionaries to allow
       serialization to JSON format. The internals of the run function take care of running all
       of the code in parallel and returning the results back to the caller function.

       In this setup, the main processing code is fully independent of the parallel method used,
       so running on a single multicore machine or in parallel on a cluster returns identical
       results and requires no changes to the logical code defining the pipeline.

       During re-runs, we avoid the expense of spinning  up  processing  clusters  for  completed
       tasks  using simple checkpoint files in the checkpoints_parallel directory. The prun.start
       wrapper writes these on completion of processing for  a  group  of  tasks  with  the  same
       parallel  architecture,  and on subsequent runs will go through these on the local machine
       instead of parallelizing. The processing code supports these quick re-runs by checking for
       and avoiding re-running of tasks when it finds output files.

       Plugging new parallelization approaches into this framework involves writing interface
       code that handles two steps. First, create a cluster of ready-to-run machines given the
       parallel function with expected core and memory utilization:

       · num_jobs -- Total number of machines to start.

       · cores_per_job -- Number of cores available on each machine.

       · mem  --  Expected  memory  needed  for  each machine. Divide by cores_per_job to get the
         memory usage per core on a machine.
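
       The derivation of these three values from per-program requirements can be sketched as
       follows. The combination logic here is simplified relative to the memory-management
       handling described above, and the dictionary shapes are assumptions for illustration.

```python
def calculate_machine_resources(total_cores, prog_resources):
    # prog_resources: per-program {"cores": n, "memory": gb_per_core}
    # entries, as pulled from the config-resources section of
    # bcbio_system.yaml. Take the largest requirement across programs so
    # every machine can run any step scheduled on it.
    cores_per_job = max(r.get("cores", 1) for r in prog_resources)
    mem_per_core = max(r.get("memory", 1.0) for r in prog_resources)
    num_jobs = max(1, total_cores // cores_per_job)
    return {"num_jobs": num_jobs,
            "cores_per_job": cores_per_job,
            "mem": mem_per_core * cores_per_job}
```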

       Second, implement a run_parallel function that handles using these resources to distribute
       jobs  and  return  results.  The multicore wrapper and ipython wrapper are useful starting
       points for understanding the current implementations.
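
       A minimal run_parallel implementation matching this interface might look like the
       following thread-based sketch; the real multicore and ipython wrappers dispatch to
       process pools or IPython engines instead, and the fn_registry lookup here stands in for
       bcbio's mapping of function names to distributed tasks.

```python
from multiprocessing.dummy import Pool  # thread pool keeps the sketch simple

def make_run_parallel(fn_registry, cores):
    # Build a run_parallel function matching the documented interface:
    # look up the function by name, fan the argument sets out across
    # workers, and gather the results in order.
    def run_parallel(fn_name, items):
        fn = fn_registry[fn_name]
        with Pool(cores) as pool:
            return pool.map(lambda args: fn(*args), items)
    return run_parallel
```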

OVERVIEW

         [image: Variant calling overview] [image]

PARALLEL

       bcbio calculates callable regions following alignment using goleft  depth.  These  regions
       determine  breakpoints  for  analysis,  allowing parallelization by genomic regions during
       variant calling. Defining non-callable regions allows  bcbio  to  select  breakpoints  for
       parallelization  within  chromosomes  where  we  won't  accidentally  impact small variant
       detection. The callable regions also supplement the  variant  calls  to  define  positions
       where uncalled bases are homozygous reference, as opposed to true no-calls with no
       evidence. The callable regions plus variant calls are an alternative to gVCF output,
       which explicitly enumerates reference calls in the output variant file.
          [image: Parallel approaches] [image] Overview of cluster types during parallel
          execution.

PRESENTATIONS

       · Variant calling and bcbio training for the Harvard Chan Bioinformatics Core In Depth NGS
         Data Analysis Course (10 October 2018): slides

       · Building  a diverse set of validations; lightning talk at the GCCBOSC2018 Bioinformatics
         Community Conference: slides

       · bcbio training at the GCCBOSC2018 Bioinformatics Community Conference, focusing on bcbio
         CWL  integration  with  examples  of variant calling analyses on Personal Genome Project
          examples (26 June 2018): slides; video

       · Description of bcbio and Common Workflow integration with  a  focus  on  parallelization
         strategies.  From a bcbio discussion with Peter Park's lab at Harvard Medical School (26
         January 2018): slides

       · In depth description of  bcbio  and  Common  Workflow  Language  integration,  including
         motivation  and  practical  examples  of running on clusters, DNAnexus, SevenBridges and
         Arvados. From the Boston  Bioinformatics  Interest  Group  meeting  (2  November  2017):
         slides; video

       · bcbio practical interoperability with the Common Workflow Language at BOSC 2017 (22 July
         2017): slides; video

       · Teaching variant calling, bcbio and GATK4 validation at the Summer 2017 NGS Data
         Analysis Course at Harvard Chan School (6 July 2017): slides

       · Training  course  for  the  Cancer  Genomics Cloud, describing how bcbio uses the Common
         Workflow Language to run in multiple infrastructures (1 May 2017): slides

       · MIT  Bioinformatics  Interest  Group  about  how  Common   Workflow   Language   enables
         interoperability with multiple workflow engines (3 November 2016): slides and video

       · Broad Institute software engineering seminar about bcbio validation and integration with
         Common Workflow Language and Workflow Definition Language (28 September 2016): slides

       · Materials from teaching at the Summer 2016 NGS Data  Analysis  Course  at  Harvard  Chan
         School (11 August 2016): slides

       · Bioinformatics  Open  Source  Conference  (BOSC) 2016 lightning talk on bcbio and common
         workflow language (8 July 2016): slides and video.

       · Materials from teaching from the Spring 2016 NGS Data Analysis Course  at  Harvard  Chan
         School (28 April 2016): slides

       · Statistical  Genetics  and  Network  Science  Meeting  at  Channing  Division of Network
         Medicine (23 March 2016): slides

       · Presentation at Curoverse Brown Bag Seminar on bcbio and in  progress  integration  work
         with Common Workflow Language and Arvados (11 January 2016): slides

       · Materials  from  teaching  oriented  example at Cold Spring Harbor Laboratory's Advanced
         Sequencing Technology and Applications course.  (18 November 2015): slides

       · Supporting the Common Workflow Language and Docker in bcbio, at the Bio in Docker
         symposium (9 November 2015): slides

       · Validation  on  human  build 38, HLA typing, low frequency cancer calling and structural
         variation for Boston Bioinformatics Interest Group  (BIG)  meeting  (5  November  2015):
         slides

       · Presentation  on  Research  Scientist  Careers  for Iowa State Bioinformatics Course (23
         September 2015): slides

       · Prioritization of structural variants based on known biological information at BOSC 2015
         (10 July 2015): slides; video

       · Overview  of  variant calling for NGS Data Analysis Course at Harvard Medical School (19
         May 2015): slides

       · NGS Glasgow (23 April 2015): slides

       · Boston Computational Biology and Bioinformatics meetup (1 April 2015): slides

       · Program in Genetic Epidemiology and Statistical Genetics seminar series at Harvard  Chan
         School (6 February 2015): slides

       · Talk at Good Start Genetics (23 January 2015): slides

       · Boston area Bioinformatics Interest Group (15 October 2014): slides

       · University of Georgia Institute of Bioinformatics (12 September 2014): slides

       · Intel Life Sciences discussion (7 August 2014): slides

       · Bioinformatics Open Source Conference (BOSC) 2014: slides, conference website

       · Galaxy Community Conference 2014: slides, conference website

       · bcbio hackathon at Biogen (3 June 2014)

       · Harvard ABCD group slides (17 April 2014)

       · BIG meeting (February 2014)

       · Novartis slides (21 January 2014)

       · Mt Sinai: Strategies for accelerating the genomic sequencing pipeline: Mt Sinai workshop
         slides, Mt Sinai workshop website

       · Genome Informatics 2013 GI 2013 Presentation slides

       · Bioinformatics Open Source Conference 2013: BOSC 2013 Slides, BOSC 2013 Video, BOSC 2013
         Conference website

       · Arvados Summit 2013: Arvados Summit Slides, Arvados Summit website

       · Scientific Python 2013: SciPy 2013 Video, SciPy 2013 Conference website

       Feel free to reuse any images or text from these talks. The slides are on GitHub.

ABSTRACT

       Community Development of Validated Variant Calling Pipelines

       Brad  Chapman,  Rory  Kirchner,  Oliver  Hofmann and Winston Hide Harvard School of Public
       Health, Bioinformatics Core, Boston, MA, 02115

       Translational research relies on accurate identification  of  genomic  variants.  However,
       rapidly  changing  best practice approaches in alignment and variant calling, coupled with
       large data sizes, make it a challenge to create reliable and reproducible  variant  calls.
       Coordinated  community  development  can help overcome these challenges by sharing testing
       and  updates  across  multiple  groups.   We   describe   bcbio-nextgen,   a   distributed
       multi-architecture pipeline that automates variant calling, validation and organization of
       results  for  query  and  visualization.  It  creates  an  easily  installable,   reliable
       infrastructure from best-practice open source tools with the following goals:

       · Quantifiable: Validates variant calls against known reference materials developed by the
         Genome in a  Bottle  consortium.  The  bcbio.variation  toolkit  automates  scoring  and
         assessment  of  calls  to  identify  regressions  in  variant  identification as calling
         pipelines evolve. Incorporation of multiple variant calling approaches from Broad's GATK
         best  practices  and  the Marth lab's gkno software enables informed comparisons between
         current and future algorithms.

       · Scalable: bcbio-nextgen handles large population studies with hundreds of  whole  genome
         samples by parallelizing on a wide variety of schedulers and multicore machines, setting
         up different ad hoc cluster configurations for each  workflow  step.  Work  in  progress
         includes  integration  with  virtual  environments,  including  Amazon  Web Services and
         OpenStack.

       · Accessible: Results automatically  feed  into  tools  for  query  and  investigation  of
         variants. The GEMINI framework provides a queryable database associating variants with a
         wide variety of genome annotations. The o8 web-based tool visualizes the work of variant
         prioritization and assessment.

       · Community  developed:  bcbio-nextgen  is  widely used in multiple sequencing centers and
         research laboratories. We actively encourage contributors to the code base and  make  it
         easy to get started with a fully automated installer and updater that prepares all third
         party software and reference genomes.

LINKS FROM THE PRESENTATION

       · HugeSeq

       · Genome Comparison & Analytic Testing at Bioplanet

       · Peter Block’s “Community” book

       · CloudBioLinux  and  Homebrew  Science  as  installation  frameworks;  Conda  as   Python
         environment

       · bcbio documentation at ReadTheDocs

       · Arvados framework for meta data tracking, NGS processing and data provenance

       · Notes on improved scaling for NGS workflows

       · Genomic Reference Materials from Genome in a Bottle

       · Comparison of aligners and callers using NIST reference materials

       · Callers and minimal BAM preparation workflows

       · Coverage assessment

SINGLE CELL RNA-SEQ ANALYSIS

       The Setting up bcbio single cell RNA-seq analysis tutorial outlines the steps needed to
       run bcbio for that use case.

CANCER TUMOR-NORMAL VARIANT CALLING

       This is a teaching-oriented example of using bcbio from the Cold Spring Harbor
       Laboratory's Advanced Sequencing Technology and Applications course. It uses cancer
       tumor/normal data from the ICGC-TCGA DREAM synthetic 3 challenge, subset to exomes on
       chromosome 6 to reduce runtimes. It demonstrates:

       · Running a cancer tumor/normal workflow through bcbio.

       · Analysis with human genome build 38.

       · SNP and indel detection, with 3 variant callers and an ensemble method.

       · Structural variant calling, with 2 callers.

       · Prioritization of structural variants for cancer associated genes in CIViC.

       · HLA typing.

       · Validation of both small and structural variants against truth sets.

   Loading pre-run analysis
       To save downloading the genome data and running the analysis, we provide a pre-built AMI
       with the data and the analysis already run. Use the AWS Console to launch it -- search
       Community AMIs for ami-5e84fe34. Any small instance type is fine for exploring the
       configuration, run directory and output files. Make sure you associate a public IP and a
       security group that allows remote ssh.

       Once    launched,    ssh    into   the   remote   machine   with   ssh   -i   your-keypair
       ubuntu@public.ip.address to explore the inputs and outputs.   The  default  PATH  contains
       bcbio  and  third  party programs in /usr/local/bin, with the biological data installed in
       /usr/local/share/bcbio. The run directory is ~/run/cancer-syn3-chr6.

   Input configuration file
       To run bcbio, you prepare a small configuration file describing your  analysis.   You  can
       prepare  it  manually  or  use  an  automated  configuration  method.   The  example has a
       pre-written configuration file with tumor/normal data located in the config directory  and
       this section walks through the settings.

       You  define  the  type of analysis (variant calling) along with the input files and genome
       build:

          analysis: variant2
          files: [../input/cancer-syn3-chr6-tumor-1.fq.gz, ../input/cancer-syn3-chr6-tumor-2.fq.gz]
          genome_build: hg38

       The sample description assigns this as a tumor sample, to be called together with its
       matched normal:

          description: syn3-tumor
          metadata:
            batch: syn3
            phenotype: tumor
            sex: female

       Next, the configuration defines parameters for running the analysis. First we pick our
       aligner (bwa mem):

          algorithm:
            aligner: bwa

       Post-alignment, we mark duplicates but do not perform recalibration and realignment:

          mark_duplicates: true
          recalibrate: false
          realign: false

       We call variants in exome regions on chromosome 6 using a BED file input, call variants at
       allele frequencies as low as 2% in the tumor sample, and use 3 variant callers along with
       an ensemble method that retains variants found by at least 2 of the 3:

          variant_regions: ../input/NGv3-chr6-hg38.bed
          min_allele_fraction: 2
          variantcaller: [vardict, freebayes, varscan]
          ensemble:
            numpass: 2
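
       The numpass idea can be sketched in a few lines: a variant is kept in the ensemble set
       when at least numpass of the callers report it. The following is an illustrative sketch
       only, not bcbio's actual ensemble implementation, which operates on VCF records:

```python
from collections import Counter

def ensemble(calls_by_caller, numpass=2):
    """Keep variants reported by at least `numpass` callers.

    calls_by_caller: dict mapping caller name -> set of variant keys,
    e.g. (chrom, pos, ref, alt) tuples.
    """
    counts = Counter()
    for variants in calls_by_caller.values():
        counts.update(variants)
    return {v for v, n in counts.items() if n >= numpass}

# Toy example: one variant seen by all 3 callers, two singletons.
calls = {
    "vardict":   {("chr6", 100, "A", "T"), ("chr6", 200, "G", "C")},
    "freebayes": {("chr6", 100, "A", "T")},
    "varscan":   {("chr6", 100, "A", "T"), ("chr6", 300, "T", "G")},
}
print(ensemble(calls, numpass=2))  # -> {('chr6', 100, 'A', 'T')}
```

       With numpass: 2, only calls supported by a second caller survive, which trades a little
       sensitivity for fewer caller-specific false positives.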

       For  structural variant calling, we use two callers and prioritize variants to those found
       in the CIViC database:

          svcaller: [lumpy, manta]
          svprioritize: cancer/civic

       Call HLA types with OptiType:

          hlacaller: optitype

       Finally, we validate both the small variants and the structural variants. These use
       validation sets that come pre-installed with bcbio. We limit validation regions to avoid
       low complexity regions, which bias indel validation:

          exclude_regions: [lcr]
          validate: dream-syn3-crossmap/truth_small_variants.vcf.gz
          validate_regions: dream-syn3-crossmap/truth_regions.bed
          svvalidate:
            DEL: dream-syn3-crossmap/truth_DEL.bed
            DUP: dream-syn3-crossmap/truth_DUP.bed
            INV: dream-syn3-crossmap/truth_INV.bed

   Output files
       Output files are in ~/run/cancer-syn3-chr6/final, extracted from the full  work  directory
       in ~/run/cancer-syn3-chr6/work.

       Sample-specific outputs are in syn3-tumor/. Aligned BAMs include a -ready.bam file with
       all of the original reads (including split and discordant reads) and separate files with
       only the split (-sr.bam) and discordant (-disc.bam) reads:

          syn3-tumor-ready.bam
          syn3-tumor-ready.bam.bai
          syn3-tumor-sr.bam
          syn3-tumor-sr.bam.bai
          syn3-tumor-disc.bam
          syn3-tumor-disc.bam.bai

       SNP and indel calls for 3 callers, plus combined ensemble calls:

          syn3-tumor-ensemble.vcf.gz
          syn3-tumor-ensemble.vcf.gz.tbi
          syn3-tumor-freebayes.vcf.gz
          syn3-tumor-freebayes.vcf.gz.tbi
          syn3-tumor-varscan.vcf.gz
          syn3-tumor-varscan.vcf.gz.tbi
          syn3-tumor-vardict.vcf.gz
          syn3-tumor-vardict.vcf.gz.tbi

       Structural  variant  calls for 2 callers, plus a simplified list of structural variants in
       cancer genes of interest:

          syn3-tumor-sv-prioritize.tsv
          syn3-tumor-lumpy.vcf.gz
          syn3-tumor-lumpy.vcf.gz.tbi
          syn3-tumor-manta.vcf.gz
          syn3-tumor-manta.vcf.gz.tbi

       HLA typing results:

          syn3-tumor-hla-optitype.csv

       Validation results from comparisons against the truth sets, including plots:

          syn3-tumor-sv-validate.csv
          syn3-tumor-sv-validate-DEL.png
          syn3-tumor-sv-validate-df.csv
          syn3-tumor-sv-validate-DUP.png
          syn3-tumor-sv-validate-INV.png
          syn3-tumor-validate.png

       The top level directory for the project, 2015-11-18_syn3-cshl/, has files relevant to the
       entire run. There is a consolidated quality control report:

          multiqc/multiqc_report.html

       Provenance information, with log files of all commands run and program versions used:

          bcbio-nextgen.log
          bcbio-nextgen-commands.log
          programs.txt
          data_versions.csv

       A  top level summary of metrics for alignment, variant calling and coverage that is useful
       downstream:

          project-summary.yaml

   Preparing and Running
       These are the steps to prepare an AMI from a bare machine and run the analysis. They are
       already done on the teaching AMI to save time:

       1. Use the AWS Console to launch an Ubuntu Server 14.04 instance (ami-d05e75b8). Start an
          m4.4xlarge instance with a 100GB SSD. Make sure you associate a public IP and can ssh
          in externally.

       2. SSH to your instance:

             ssh -i ~/.ec2/your-key.pem ubuntu@public-ip

       3. Install bcbio with hg38 data:

             sudo apt-get update
             sudo apt-get install -y build-essential zlib1g-dev wget curl python-setuptools git \
                                     openjdk-7-jdk openjdk-7-jre ruby libncurses5-dev libcurl4-openssl-dev \
                                     libbz2-dev unzip pigz bsdmainutils
             wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
             python bcbio_nextgen_install.py /usr/local/share/bcbio --tooldir /usr/local \
                    --genomes hg38 --aligners bwa --sudo --isolate -u development

       4. Install the analysis data:

             mkdir -p run
             cd run
             wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/teaching/cancer-syn3-chr6-prep.sh
             bash cancer-syn3-chr6-prep.sh

       5. Run the analysis:

             cd cancer-syn3-chr6/work
             bcbio_nextgen.py ../config/cancer-syn3-chr6.yaml -n 16

       https://github.com/bcbio/bcbio-nextgen

SMALL RNA-SEQ

       Data was analyzed with bcbio-nextgen (https://github.com/bcbio/bcbio-nextgen) using DNApi
       to detect the adapter, cutadapt to remove it, STAR/bowtie to align against the genome and
       seqcluster to detect small RNA transcripts. miRNAs were detected using the miraligner
       tool with miRBase as the reference miRNA database. tRNA profiles were detected using the
       tDRmapper tool. miRDeep2 was used for discovery of novel miRNAs. FastQC was used for QC
       metrics and MultiQC for reporting.

       Download BIB format:
       https://github.com/bcbio/bcbio-nextgen/tree/master/docs/contents/misc/bcbio-smallrna.bib

   Tools
       · Tsuji J, Weng Z. (2016) DNApi: A De Novo Adapter Prediction Algorithm for Small RNA
         Sequencing Data. PLoS ONE, 11(10): e0164228.
         http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164228

       · Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data.
         Babraham Bioinformatics.

       · Didion, J. P., Martin, M., & Collins, F. S. (2017). Atropos:  specific,  sensitive,  and
         speedy trimming of sequencing reads. http://doi.org/10.7287/peerj.preprints.2452v4

       · Dale,  R.  K.,  Pedersen,  B. S., & Quinlan, A. R. (2011). Pybedtools: A flexible Python
         library for manipulating  genomic  datasets  and  annotations.  Bioinformatics,  27(24),
         3423–3424. doi:10.1093/bioinformatics/btr539

       · Quinlan,  A.  R.,  &  Hall,  I.  M.  (2010). BEDTools: A flexible suite of utilities for
         comparing      genomic       features.       Bioinformatics,       26(6),       841–842.
         doi:10.1093/bioinformatics/btq033

       · Tarasov,  A.,  Vilella,  A. J., Cuppen, E., Nijman, I. J., & Prins, P. (2015). Sambamba:
         Fast  processing  of  NGS  alignment   formats.   Bioinformatics,   31(12),   2032–2034.
         doi:10.1093/bioinformatics/btv098

       · Heger,        A.       (2009).       Pysam.       github.com.       Retrieved       from
         https://github.com/pysam-developers/pysam

       · Li, H. (2011). A statistical framework for SNP calling, mutation discovery,  association
         mapping   and   population   genetical   parameter   estimation  from  sequencing  data.
         Bioinformatics, 27(21), 2987–2993. doi:10.1093/bioinformatics/btr509

       · Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan,  J.,  Homer,  N.,  …  Durbin,  R.
         (2009).   The  Sequence  Alignment/Map  format  and  SAMtools.  Bioinformatics,  25(16),
         2078–2079. doi:10.1093/bioinformatics/btp352

       · Pantano, L., Estivill, X., & Martí, E. (2010). SeqBuster, a bioinformatic tool  for  the
         processing  and  analysis of small RNAs datasets, reveals ubiquitous miRNA modifications
         in  human  embryonic  cells.  Nucleic  Acids  Research,  38(5),  e34.   Retrieved   from
         http://www.ncbi.nlm.nih.gov/pubmed/20008100

       · Pantano,  L.,  Friedlander,  M.  R.,  Escaramis,  G., Lizano, E., Pallares-Albanell, J.,
         Ferrer, I., … Marti, E.  (2015).  Specific  small-RNA  signatures  in  the  amygdala  at
         premotor  and  motor stages of Parkinson’s disease revealed by deep sequencing analysis.
         Bioinformatics (Oxford, England). doi:10.1093/bioinformatics/btv632

       For alignment, cite the aligner you used:

       · Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., … Gingeras,
         T.  R.  (2013). STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21.
         doi:10.1093/bioinformatics/bts635

       · Langmead,  B.,  Trapnell,  C.,  Pop,  M.,  &  Salzberg,  S.  L.  (2009).  Ultrafast  and
         memory-efficient  alignment  of short DNA sequences to the human genome. Genome Biology,
         10, R25. doi:10.1186/gb-2009-10-3-r25

       · Kim, D., Langmead, B. & Salzberg, SL. (2015). HISAT: a fast spliced aligner with low
         memory requirements. Nature Methods, 12(4): 357–360. doi: 10.1038/nmeth.3317

       If you used TopHat2 for alignment:

       · Kim,  D.,  Pertea,  G.,  Trapnell,  C.,  Pimentel, H., Kelley, R. & Salzberg SL. (2013).
         TopHat2: accurate alignment of transcriptomes in the presence of  insertions,  deletions
         and gene fusions. Genome Biology, 14(4): R36. doi: 10.1186/gb-2013-14-4-r36

       · Brueffer,  C.  &  Saal,  LH.  (2016).  TopHat-Recondition:  A  post-processor for TopHat
         unmapped reads. BMC Bioinformatics, 17(1):199. doi: 10.1186/s12859-016-1058-x

       If your output includes novel miRNA discovery, add:

       · Friedlander, M. R., MacKowiak, S. D., Li, N., Chen, W., & Rajewsky, N. (2012).  MiRDeep2
         accurately identifies known and hundreds of novel microRNA genes in seven animal clades.
         Nucleic Acids Research, 40(1), 37–52. doi:10.1093/nar/gkr688

       If you have tRNA mapping output, add:

       · Selitsky, S. R., &  Sethupathy,  P.  (2015).  tDRmapper:  challenges  and  solutions  to
         mapping, naming, and quantifying tRNA-derived RNAs from human small RNA-sequencing data.
         BMC Bioinformatics, 16(1), 354. doi:10.1186/s12859-015-0800-0

       If you ran miRge, add:

       · Lu, Y., Baras, A. S., & Halushka, M. K. miRge 2.0: An updated tool to comprehensively
         analyze microRNA sequencing data. bioRxiv.

       If you ran MINTmap, add:

       · Loher, P, Telonis, AG, Rigoutsos, I. MINTmap: fast and exhaustive profiling of nuclear
         and mitochondrial tRNA fragments from short RNA-seq data. Sci Rep. 2017;7:41184. doi:
         10.1038/srep41184. PubMed PMID: 28220888; PubMed Central PMC5318995.

   Data
       · Griffiths-Jones,  S.  (2004). The microRNA Registry. Nucleic Acids Research, 32(Database
         issue), D109–11. doi:10.1093/nar/gkh023

       · Griffiths-Jones,  S.  (2006).  miRBase:  the  microRNA  sequence  database.  Methods  in
         Molecular Biology (Clifton, N.J.), 342, 129–38. doi:10.1385/1-59745-123-1:129

       · Griffiths-Jones,  S.,  Saini,  H.  K., Van Dongen, S., & Enright, A. J. (2008). miRBase:
         Tools   for   microRNA    genomics.    Nucleic    Acids    Research,    36(SUPPL.    1).
         doi:10.1093/nar/gkm952

       · Kozomara, A., & Griffiths-Jones, S. (2011). MiRBase: Integrating microRNA annotation and
         deep-sequencing data. Nucleic Acids Research, 39(SUPPL. 1). doi:10.1093/nar/gkq1027

       · Kozomara,  A.,  &  Griffiths-Jones,  S.  (2014).  MiRBase:  Annotating  high  confidence
         microRNAs    using    deep    sequencing   data.   Nucleic   Acids   Research,   42(D1).
         doi:10.1093/nar/gkt1181


AUTHOR

       Brad Chapman

COPYRIGHT

       2019, bcbio-nextgen contributors