lunar (1) igdiscover.1.gz

Provided by: igdiscover_0.11-3_all bug

NAME

       igdiscover - IgDiscover Documentation

       IgDiscover  analyzes  antibody  repertoires and discovers new V genes from high-throughput
       sequencing reads.  Heavy chains, kappa and lambda light chains are supported (to  discover
       VH, VK and VL genes).

       IgDiscover is the result of a collaboration between the Gunilla Karlsson Hedestam group at
       the Department of Microbiology, Tumor and Cell Biology at  Karolinska  Institutet,  Sweden
       and  the  Bioinformatics  Long-Term  Support  facility  at  Science  for  Life  Laboratory
       (SciLifeLab), Sweden.

       If you use IgDiscover, please cite:
          Corcoran, Martin M. and Phad, Ganesh E. and Bernat, Néstor Vázquez and Stahl-Hennig,
          Christiane and Sumida, Noriyuki and Persson, Mats A.A. and Martin, Marcel and
          Karlsson Hedestam, Gunilla B.
          Production of individualized V gene databases reveals high levels of immunoglobulin genetic
          diversity.
          Nature Communications 7:13642 (2016)
          https://dx.doi.org/10.1038/ncomms13642

DocumentationSource codeReport an issueProject page on PyPI (Python package index)

         [image]

INSTALLATION

       IgDiscover is written in Python 3 and is developed on Linux. The tool also runs on  macOS,
       but is not as well tested on that platform.

       For  installation  on  either system, we recommend that you follow the instructions below,
       which will first explain how to install the Conda package manager. IgDiscover is available
       as a Conda-package from the bioconda channel.  Using Conda will make the installation easy
       because all dependencies are also available as Conda packages and can  thus  be  installed
       automatically along with IgDiscover.

       There are also non-Conda installation instructions if you cannot use Conda.

   Installing IgDiscover with Conda
       1. Install  Conda by following the conda installation instructions as appropriate for your
          system. You will need to choose between a “Miniconda” and “Anaconda”  installation.  We
          recommend  Miniconda  as  the  download  is  smaller.  If you are in a hurry, these two
          commands are usually sufficient to install Miniconda on Linux (read the linked document
          for macOS instructions):

             wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
             bash Miniconda3-latest-Linux-x86_64.sh

          When the installer asks you about modifying the PATH in your .bashrc file, answer yes.

       2. Close  the  terminal  window  and  open a new one. Then test whether conda is installed
          correctly by running

             conda --version

          If you see the conda version number, it worked.

       3. Set up Conda so that it  can  access  the  bioconda  channel.   For  that,  follow  the
          instructions on the bioconda website or simply run these commands:

             conda config --add channels defaults
             conda config --add channels bioconda
             conda config --add channels conda-forge

       4. Install IgDiscover with this command:

             conda create -n igdiscover igdiscover

          This  will  create  a  new  so-called “environment” for IgDiscover (retry if it fails).
          Whenever you want to run IgDiscover, you will need to  activate  the  environment  with
          this command:

             source activate igdiscover

       5. Make  sure you have activated the igdiscover environment.  Then test whether IgDiscover
          is correctly installed with this command:

             igdiscover --version

          If you see the version number of IgDiscover, it worked! If  an  error  message  appears
          that  says  "The  'networkx'  distribution was not found and is required by snakemake",
          install networkx manually with:

             pip install networkx==2.1

          Then retry to check the igdiscover version.

       6. You can now run IgDiscover on the test data set to familiarize  yourself  with  how  it
          works.

   Troubleshooting on Linux
       If  you use zsh instead of bash (applies to Bio-Linux, for example), the $PATH environment
       variable will not be setup correctly by the Conda installer. The miniconda installer  adds
       a  line  export PATH=... to the to the end of your /home/your-user-name/.bashrc file. Copy
       that line from the file and add it to the  end  of  the  file  /home/your-user-name/.zshrc
       instead.

       Alternatively, change your default shell to bash by running chsh -s /bin/bash.

       If you use conda and see an error that includes something like this:

          ImportError: .../.local/lib/python3.5/site-packages/sqt/_helpers.cpython-35m-x86_64-linux-gnu.so: undefined symbol: PyFPE_jbuf

       Or  you  see  any error that mentions a .local/ directory, then a previous installation of
       IgDiscover is interfering with the conda installation.

       The easiest way to solve this problem is to delete the  directory  .local/  in  your  home
       directory, see also how to remove IgDiscover from a Linux system.

   Troubleshooting on macOS
       If you get the error

          ValueError: unknown locale: UTF-8

       Then follow these instructions.

   Development version
       To  install  IgDiscover  directly  from  the  most  recent source code, read the developer
       installation instructions.

MANUAL INSTALLATION

       IgDiscover requires quite a few other software tools that are not included in  most  Linux
       distributions (or mac OS) and which are also not available from the Python packaging index
       (PyPI) because they are not Python tools.  If  you  do  not  use  the  recommended  simple
       installation  instructions  via  Conda,  you need to install those non-Python dependencies
       manually. Regular Python dependencies are automatically pulled in when  IgDiscover  itself
       is  installed  in  the  last step with the pip install command. The instructions below are
       written for Linux and require modifications if you want to try this on OS X.

       NOTE:
          We recommend the much simpler installation via Conda instead of using the  instructions
          in this section.

   Install non-Python dependencies
       The dependencies are: MUSCLE, IgBLAST, PEAR, and -- optionally -- flash.

       1. Install Python 3.5 or newer. It most likely is already installed on your system, but in
          Debian/Ubuntu, you can get it with

             sudo apt-get install python3

       2. Create the directory where binaries will be installed. We assume $HOME/.local/bin here,
          but this can be anywhere as long as they are in your $PATH.

             mkdir -p ~/.local/bin

          Add this line to the end of your ~/.bashrc file:

             export PATH=$HOME/.local/bin:$PATH

          Then either start a new shell or run source ~/.bashrc to get the changes.

       3. Install MUSCLE. This is available as a package in Ubuntu:

             sudo apt-get install muscle

          If  your distribution does not have a 'muscle' package or if you are not allowed to run
          sudo:

             wget -O - http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux64.tar.gz | tar xz
             mv muscle3.8.31_i86linux64 ~/.local/bin/

       4. Install PEAR:

             wget http://sco.h-its.org/exelixis/web/software/pear/files/pear-0.9.6-bin-64.tar.gz
             tar xvf pear-0.9.6-bin-64.tar.gz
             mv pear-0.9.6-bin-64/pear-0.9.6-bin-64 ~/.local/bin/pear

       5. Install IgBLAST:

             wget ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/1.4.0/ncbi-igblast-1.4.0-x64-linux.tar.gz
             tar xvf ncbi-igblast-1.4.0-x64-linux.tar.gz
             mv ncbi-igblast-1.4.0/bin/igblast? ~/.local/bin/

          IgBLAST requires some data files that must  be  downloaded  separately.  The  following
          commands put the files into ~/.local/igdata:

             mkdir ~/.local/igdata
             cd ~/.local/igdata
             wget -r -nH --cut-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/internal_data
             wget -r -nH --cut-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/database/
             wget -r -nH --cut-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/optional_file/

          Also, you must set the $IGDATA environment variable to point to the directory with data
          files. Add this line to your ~/.bashrc:

             export IGDATA=$HOME/.local/igdata

          Then run source ~/.bashrc to get the changes.

       7. Optionally, install flash:

             wget -O FLASH-1.2.11.tar.gz http://sourceforge.net/projects/flashpage/files/FLASH-1.2.11.tar.gz/download
             tar xf FLASH-1.2.11.tar.gz
             cd FLASH-1.2.11
             make
             mv flash ~/.local/bin/

   Install IgDiscover
       Install IgDiscover with the Python package manager pip, which will  download  and  install
       IgDiscover and its dependencies:

          pip3 install --user igdiscover

       Both  commands  also  install all remaining dependencies. The --user option instructs both
       commands to install everything into $HOME/.local.

       Finally, check the installation with

          igdiscover --version

       and you should see the version number of IgDiscover.

       You should now run IgDiscover on the test data set.

TEST DATA SET

       After installing IgDiscover, you should run it once on a small test data that we  provide,
       both to test your installation and to familiarize yourself with running the program.

       1. Download  und unpack the test data set (version 0.5). To do this from the command-line,
          use these commands:

             wget https://bitbucket.org/igdiscover/testdata/downloads/igdiscover-testdata-0.5.tar.gz
             tar xvf igdiscover-testdata-0.5.tar.gz
          The test data set contains some paired-end reads from human  IgM  heavy  chain  dataset
          ERR1760498  and  a database of IGHV, IGHD, IGHJ sequences based on Ensembl annotations.
          You should use a database of higher quality for your own experiments.

       2. Initialize the IgDiscover pipeline directory:

             igdiscover init --db igdiscover-testdata/database/ --reads igdiscover-testdata/reads.1.fastq.gz discovertest

          The name discovertest is the name of the pipeline directory that will be created.  Note
          that  only the path to the first reads file needs to be given. The second file is found
          automatically. There may be a couple of messages “Skipping 'x' because it contains  the
          same sequence as 'y'”, which you can ignore.

          The  command  will  have  printed a message telling you that the pipeline directory has
          been initialized, that you should edit the configuration file, and how to actually  run
          IgDiscover after that.

       3. The  generated  igdiscover.yaml  configuration file does not actually need to be edited
          for the test dataset, but you may still want to have a read through it as you will need
          to do so for you own data. You may want to do this while the pipeline is running in the
          next step. The configuration is in YAML format. When editing the file, just follow  the
          way it is already structured.

       4. Run the analysis. To do so, change into the pipeline directory and run this command:

             cd discovertest && igdiscover run

          On this small dataset, running the pipeline should take not more than about 5 minutes.

       5. Finally,  inspect  the  results  in the discovertest/iteration-01 or discovertest/final
          directories.  The  discovered  V  genes   and   extra   information   are   listed   in
          discovertest/iteration-01/new_V_germline.tab.     Discovered    J    genes    are    in
          discovertest/iteration-01/new_J.tab. There are also corresponding .fasta files with the
          sequences only.

          See the explanation of final result files.

   Other test data sets
       ENA project PRJEB15295 contains the data for our Nature Communications paper from 2016, in
       particular ERR1760498, which is the data for the human “H1”  sample  (multiplex  PCR,  IgM
       heavy chain).

       Data used for testing TCR detection (human, RACE): SRR2905677 and SRR2905710.

USER GUIDE

   Overview
       IgDiscover  works  on  a single library at a time. It works within an “analysis directory”
       for the library, which contains all intermediate and result files.

       To start an analysis, you need:

       1. A FASTA or FASTQ file with single-end reads or two FASTQ files  with  paired-end  reads
          (also, the files must be gzip-compressed)

       2. A database of V/D/J genes (three FASTA files named V.fasta, D.fasta, J.fasta)

       3. A configuration file that describes the library

       If  you  do  not have a V/D/J database, yet, you may want to read the section about how to
       obtain V/D/J sequences.

       To run an analysis, proceed as follows.

       NOTE:
          If you are on  macOS,  it  may  be  necessary  to  run  export  SHELL=/bin/bash  before
          continuing.

       1. Create and initialize the analysis directory.

          First,  pick a name for your analysis. We will use myexperiment in the following.  Then
          run igdiscover init:

             igdiscover init myexperiment

          A dialog will appear and ask for the file with the first (forward)  reads.   Find  your
          compressed  FASTQ  file  that  contains  them and select it.  Typical file names may be
          Library1_S1_L001_R1_001.fastq.gz or mylibrary.1.fastq.gz.  You do not  need  to  choose
          the second read file!  It is found automatically.

          Next,  choose  the  directory with your database.  The directory must contain the three
          files V.fasta, D.fasta, J.fasta.  These files contain  the  V,  D,  J  gene  sequences,
          respectively.   Even  if have have only light chains in your data, a D.fasta file needs
          to be provided, just use one with the heavy chain D gene sequences.

          If you do not want a graphical user interface, use the two command-line parameters --db
          and --reads1 to provide this information instead:

             igdiscover init --db path/to/my/database/ --reads1 mylibrary.1.fastq.gz myexperiment

          Again,  the  second reads file will be found automatically.  Use --single-reads instead
          of --reads1 if you have single-end reads or a dataset with already merged  reads.   For
          --single-reads,  a  FASTA  file  (not  only  FASTQ)  is  also allowed.  In any case, an
          analysis directory named myexperiment will have been created.

       2. Adjust the configuration file

          The previous step created  a  configuration  file  named  myexperiment/igdiscover.yaml,
          which you may need to adjust. In particular, the number of discovery rounds is set to 3
          by default, which takes a long time. Reducing this to 2 or even 1 often works  just  as
          well.

       3. Run the analysis

          Change into the newly created analysis directory and run the analysis:

             igdiscover run

          Depending  on  the  size  of your library, your computer, and the number of iterations,
          this will now take from a few hours to a day. See the running  IgDiscover  section  for
          more  fine-grained  control over what to run and how to resume the process if something
          failed.

   Obtaining a V/D/J database
       We use the term “database” to refer to three FASTA files that contain  the  sequences  for
       the  V,  D  and  J  genes.   IMGT provides sequences for download.  For discovering new VH
       genes, for example, you need to get the IGHV, IGHD and IGHJ files  of  your  species.   As
       IgDiscover uses this only as a starting point, using a similar species will also work.

       When using an IMGT database, it is very important to change the long IMGT sequence headers
       to short headers as IgBLAST does not accept the  long  headers.  We  recommend  using  the
       program  edit_imgt_file.pl.  If you installed IgDiscover from Conda, the script is already
       installed and you can run it by typing the name. It is also available on the  IgBlast  FTP
       site.

       Run  it  for  all  three downloaded files, and then rename files appropritely to make sure
       that they named V.fasta, D.fasta and J.fasta.

       You always need a file with D genes even if you analyze light chains.

       In case you have used  IgBLAST  previously,  note  that  there  is  no  need  to  run  the
       makeblastdb tool as IgDiscover will do that for you.

   Input data requirements
   Paired-end or single-end data
       IgDiscover can process input data of three different types:

       • Paired-end reads in gzipped FASTQ format,

       • Single-end reads in gzipped FASTQ format,

       • Single-end reads in gzipped FASTA format.

       IgDiscover was tested mainly on paired-end Illumina MiSeq reads (2x300bp), but it can also
       handle 454 and Ion Torrent data.

       Depending on the input file type, use a variant  of  one  of  the  following  commands  to
       initialize the analysis directory:

          igdiscover init --single-reads=file.fasta.gz  --database=my-database-dir/ myexperiment
          igdiscover init --reads1=file.1.fasta.gz  --database=my-database-dir/ myexperiment
          igdiscover init --reads1=file.1.fastq.gz  --database=my-database-dir/ myexperiment

   Read layout
       Paired-end  reads are first merged and then processed in the same way as single-end reads.
       Reads that could not be merged are discarded. Single-end reads and merged paired-end reads
       are expected to follow this structure (from 5' to 3'):

       • The forward primer sequence. This is optional.

       • A  random barcode (molecular identifier). This is optional. Set the configuration option
         barcode_length_5p to 0 if you don’t have random  barcodes  or  if  you  don’t  want  the
         program to use them.

       • Optionally,  a  run  of  G  nucleotides. This is an artifact of the RACE protocol (Rapid
         amplification of cDNA ends). If you have this, set race_g to true in  the  configuration
         file.

       • 5' UTR

       • Leader

       • Re-arranged V, D and J gene sequences for heavy chains; only V and J for light chains

       • An optional random barcode. Set the configuration option barcode_length_3p to the length
         of this barcode. You can currently not have both a 5' and a 3' barcode.

       • The reverse primer. This is optional.

       We use IgBLAST to detect the location of the V, D, J genes through the igdiscover  igblast
       subcommand.  The  G  nucleotides  after  the  barcode  are  split off if the configuration
       specifies race_g: true. The leader sequence is detected by looking for a start codon  near
       60 bp upstream of the start of the V gene match.

   Configuration
       The  igdiscover  init command creates a configuration file igdiscover.yaml in the analysis
       directory. To configure your analysis, change that file with a text editor before  running
       the analysis with igdiscover run.

       The  syntax  should  be mostly self-explanatory.  The file is in YAML format, but you will
       not need to learn that.  Just follow the examples given in the file.  A few rules that may
       be good to know are the following ones:

       1. Lines starting with the # symbol are comments (they are ignored)

       2. A  configuration  option that is meant to be switched on or off will say something like
          stranded: false if it is off.  Change this to stranded: true to switch  the  option  on
          (and vice versa).

       3. The  primer  sequences  are given as a list, and must be written in a certain way - one
          sequence per line, and a - (dash) in front, like so:

             forward_primers:
             - ACGTACGTACGT
             - AACCGGTTAACC

          Even if you have only one primer sequence, you still need to use this syntax.

       To find  out  what  the  configuration  options  achieve,  see  the  explanations  in  the
       configuration file itself.

       The main parameters parameters that may require adjusting are the following.

       The  iterations  option  sets  the  number  of  rounds  of  V  gene discovery that will be
       performed. By default, three iterations are run. Even with a very  restricted  starting  V
       database  (for  example with only a single V gene sequence), this is usually sufficient to
       identify most novel germline sequences.

       When the starting database is more complete, for  example,  when  analyzing  a  human  IgM
       library  with  the current IMGT heavy chain database, a single iteration may be sufficient
       to produce an individualized database.

       If you do not want to discover any new genes  and  only  want  to  produce  an  expression
       profile, for example, then use iterations: 0.

       The  ignore_j  option should be set to true when producing a V gene database for a species
       where J sequences are unknown:

          ignore_j: true

       Setting the parameters stranded, forward_primers and reverse_primers to the correct values
       can  be  used  to remove 5' and 3' primers from the sequences.  Doing this is not strictly
       necessary for IgDiscover. It is simplest if you do not specify any primer sequences.

   Pregermline and germline filter criteria
       This provides IgDiscover with stringency requirements for V gene discovery that enable the
       program to filter out false positives. Usually the ”pregermline filter” can be used in the
       default mode since  all  these  sequences  will  be  subsequently  passed  to  the  higher
       stringency  ”germline  filter”  where the criteria are set to maximize stringency. Here is
       how it looks in the configuration file:

          pre_germline_filter:
            unique_cdr3s: 2      # Minimum number of unique CDR3s (within exact matches)
            unique_js: 2         # Minimum number of unique J genes (within exact matches)
            check_motifs: false  # Check whether 5' end starts with known motif
            whitelist: true      # Add database sequences to the whitelist
            cluster_size: 0      # Minimum number of sequences assigned to cluster
            differences: 0       # Merge sequences if they have at most this number of differences
            allow_stop: true     # Whether to allow non-productive sequences containing stop codons
            cross_mapping_ratio: 0.02  # Threshold for removal of cross-mapping artifacts (set to 0 to disable)
            allele_ratio: 0.1    # Required minimum ratio between alleles of a single gene

          # Filtering criteria applied to candidate sequences in the last iteration.
          # These should be more strict than the pre_germline_filter criteria.
          #
          germline_filter:
            unique_cdr3s: 5      # Minimum number of unique CDR3s (within exact matches)
            unique_js: 3         # Minimum number of unique J genes (within exact matches)
            check_motifs: false  # Check whether 5' end starts with known motif
            whitelist: true      # Add database sequences to the whitelist
            cluster_size: 100    # Minimum number of sequences assigned to cluster
            differences: 0       # Merge sequences if they have at most this number of differences
            allow_stop: false    # Whether to allow non-productive sequences containing stop codons
            cross_mapping_ratio: 0.02  # Threshold for removal of cross-mapping artifacts (set to 0 to disable)
            allele_ratio: 0.1    # Required minimum ratio between alleles of a single gene

       Factors that affect germline discovery include library source (IgM vs  IgK,  IgL  or  IgG)
       library  size,  sequence error rate and individual genomic factors (for example the number
       of J segments present in an individual).

       In general, setting a higher cutoff of unique_cdr3s and unique_js will minimize the number
       of false positives in the output. Example:

          unique_cdr3s: 10      # Minimum number of unique CDR3s (within exact matches)
          unique_js: 4          # Minimum number of unique J genes (within exact matches)

       If the differences parameter is set to a value higher than 0, the germline filter inspects
       clusters of sequences that are closely related (when the edit distance between them is  at
       most  differences)  and retains only the most common sequence of each cluster. Previously,
       we believed this would removes some false positives due  to  accumulated  random  sequence
       errors of highly expressed alleles that otherwise would pass the cutoff criteria. However,
       we found out that we miss true positives, in particular if there are two  alleles  in  the
       sample  that differ in only a single nucleotide. We have now implemented other measures to
       avoid false positives and recommend against setting the  differences  to  something  other
       than 0.

       Read  also  about  the cross mapping, for which germline filtering corrects, and about the
       germline filters.

       Changed in version The: default for the differences setting was changed from 1 to 0.

   Running IgDiscover
   Resuming failed runs
       The command igdiscover run, which is used to start the  pipeline,  can  also  be  used  to
       resume  execution  if  there  was  an  interruption  (a  transient  failure).  Reasons for
       interruptions might be:

       • Ctrl+C was pressed on the keyboard

       • A full harddisk

       • If running on a cluster, the program may have been terminated because  it  exceeded  its
         allocated time

       • Too little RAM

       • Power loss

       To resume execution after you have fixed the problem, go to the analysis directory and run
       igdiscover run again. It will skip the steps  that  have  already  finished  successfully.
       This  capability  comes from the workflow management system snakemake, on which igdiscover
       run is based.  Snakemake will determine automatically which steps need  to  be  re-run  in
       order to get to a full result and then run only those.

       Alterations  to the configuration file after an interruption are possible, but affect only
       steps that have not already finished successfully. For example, assume you  interrupted  a
       run  with  Ctrl+C  after  it is already past the step in which barcodes are removed. Then,
       even if you change the barcode length in the configuration, the barcode removal step  will
       not  be  re-run when you resume the pipeline and the previous barcode length is in effect.
       See also the next section.

   Changing parameters and re-running parts of the pipeline
       When you experiment  with  parameters  in  the  igdiscover.yaml  file,  such  as  germline
       filtering  criteria, you do not need to re-run the entire pipeline from the beginning, but
       can re-use the results that already exist. This can save a  lot  of  processing  time,  in
       particular when you avoid re-running IgBLAST in this way.

       As described in the previous section, igdiscover run automatically figures out which files
       need to be re-created if a run was interrupted. Unfortunately, this mechanism is currently
       not  smart  enough to also look for changes in the igdiscover.yaml file. Thus, if the full
       pipeline has finished successfully, then re-running igdiscover run  will  just  print  the
       message Nothing to be done.  even after you have changed the configuration file.

       You  will  therefore need to know yourself which file you want to regenerate.  Then follow
       the following steps. Note that these will remove parts of the existing results, and if you
       need to keep them, make a copy of your analysis directory first.

       1. Change the configuration setting.

       2. Delete the file that needs to be re-generated. Assume it is filename

       3. Run  igdiscover run filename to re-create the file. Only that file will be created, not
          the ones that usually would be created afterwards.

       4. Optionally, run igdiscover run (without a file name this time) to update the  remaining
          files (those that depend on the file that was just updated).

       For example, assume you want to modify some germline filtering setting and then re-run the
       pipeline. Change the setting in your igdiscover.yaml, then run these commands:

          rm iteration-01/new_V_germline.tab
          igdiscover run iteration-01/new_V_germline.tab

       The above will have regenerated the  iteration-01/new_V_germline.tab  file  and  also  the
       iteration-01/new_V_germline.fasta file since they are generated by the same script. If you
       want to update any other files, then also run

          igdiscover run

   The analysis directory
       IgDiscover writes all intermediate files, the final V gene database, statistics and  plots
       into the analysis directory that was created with igdiscover init.  Inside that directory,
       there is a final/ subdirectory that contains the analysis results.

       These are the files and subdirectories that  can  be  found  in  the  analysis  directory.
       Subdirectories are described in detail below.

       igdiscover.yaml
              The configuration file.  Make sure to adjust this to your needs as described above.

       reads.1.fastq.gz, reads.2.fastq.gz
              Symbolic links to the raw paired-end reads.

       database/
              The  input V/D/J database (as three FASTA files).  The files are a copy of the ones
              you selected when running igdiscover init.

       reads/ Processed reads (merged, de-duplicated etc.)

       iteration-xx/
              Iteration-specific analysis directory, where “xx” is a  number  starting  from  01.
              Each  iteration  is  run  in  one  of  these  directories.  The first iteration (in
              iteration-01) uses the  original  input  database,  which  is  also  found  in  the
              database/  directory.   The database is updated and then used as input for the next
              iteration.

       final/ After the last iteration, IgBLAST is run again on the input  sequences,  but  using
              the  final  database  (the one created in the very last iteration).  This directory
              contains all the results, such as plots of the repertoire profiles.  If you set the
              number of iterations to 0 in the configuration file, this directory is the only one
              that is created.

   Final results
       Final results are found in the final/ subdirectory of the analysis directory.

       final/database/(V,D,J).fasta
              These three files represent the  final,  individualized  V/D/J  database  found  by
              IgDiscover.   The  D and J files are copies of the original starting database; they
              are not updated by IgDiscover.

       final/dendrogram_(V,D,J).pdf
              These three PDF files contain dendrograms of the  V,  D  and  J  sequences  in  the
              individualized database.

       final/assigned.tab.gz
              V/D/J  gene  assignments  and  other  information  for  each sequence.  The file is
              created by parsing the IgBLAST output in the igblast.txt.gz file.  This is a  table
              that  contains  one  row  for  each  input  sequence.   See  below  for  a detailed
              description of the columns.

       final/filtered.tab.gz
              Filtered V/D/J gene  assignments.  This  is  the  same  as  the  assigned.tab  file
              mentioned  above,  but  with  low-quality assignments filtered out.  Run igdiscover
              filter --help to see the filtering criteria.

       final/expressed_(V,D,J).tab, final/expressed_(V,D,J).pdf
              The V, D and J gene expression counts. Some assignments are filtered out to  reduce
              artifacts.  In  particular,  an allele-ratio filter of 10% is applied. For D genes,
              only those with an E-value of at most 1E-4 and a  coverage  of  at  least  70%  are
              counted.  See  also  the help for the igdiscover count subcommand, which is used to
              create these files.

              The .tab file contains the counts as a table, while the PDF file contains a plot of
              the same values.

              These  tables  also exist in the iteration-specific directories (iteration-xx). For
              those, note that the numbers do not include the genes that were discovered in  that
              iteration.  For  example, iteration-01/expressed_V.tab shows only expression counts
              of the V genes in the starting database.

       final/errorhistograms.pdf
              A PDF with one page per  V  gene/allele.   Each  page  shows  a  histogram  of  the
              percentage differences for that gene.

       final/clusterplots/
              This  is  a  directory  that contains one PNG file for each discovered gene/allele.
              Each image shows a clusterplot of all the sequences assigned to  that  gene.   Note
              that  the  shown clusterplots are by default restricted to showing only at most 300
              sequences, while the actual clustering used by IgDiscover uses 1000 sequences.

       If you are interested in the results of each iteration, you can inspect the  iteration-xx/
       directories.   They are structured in the same way as the final/ subdirectory, except that
       the results are based on the intermediate databases of that iteration.  They also  contain
       the following additional files.

       iteration-xx/candidates.tab
              A  table  with  candidate  novel V alleles (or genes).  This is a list of sequences
              found through the windowing strategy or linkage cluster analysis, as  discussed  in
              our paper. See the full description of candidates.tab.

       iteration-xx/read_names_map.tab
              For  each candidate novel V allele listed in candidates.tab, this file contains one
              row that lists which sequences went into generating this candidate. Only the  exact
              matches are listed, that is, the number of listed sequence names should be equal to
              the value in the exact column.  Each  line  in  this  file  contains  tab-separated
              values.  The  first  is  name  of  the  candidate,  the others are the names of the
              sequences. Some of these sequences may be consensus sequences if  barcode  grouping
              was enabled, so in that case, this will not be a read name.

       iteration-xx/new_V_germline.fasta, iteration-xx/new_V_pregermline.fasta
              The  discovered  list  of V genes for this iteration.  The file is created from the
              candidates.tab file by applying either the germline or  pre-germline  filter.   The
              file  resulting  from  application  of  the  germline  filter  is  used in the last
              iteration only.  The file resulting from application of the pre-germline filter  is
              used in earlier iterations.

       iteration-xx/annotated_V_germline.tab, iteration-xx/annotated_V_pregermline.tab
              A  version  of  the  candidates.tab  file that is annotated with extra columns that
              describe why a candidate was filtered out. See the description of this file.

   Other files
       For completeness, here is a description of the files in the reads/ and stats/ directories.
       They are created during pre-processing and are not iteration specific.

       reads/1-limited.1.fastq.gz, reads/1-limited.1.fastq.gz
              Input  reads  file  limited to the first N entries. This is just a symbolic link to
              the input file if the limit configuration option is not set.

       reads/2-merged.fastq.gz
              Reads merged with PEAR or FLASH

       reads/3-forward-primer-trimmed.fastq.gz
              Merged reads with 5' primer sequences removed. (This file is automatically  removed
              when it is not needed anymore.)

       reads/4-trimmed.fastq.gz
              Merged reads with 5' and 3' primer sequences removed.

       reads/5-filtered.fasta
              Merged,  primer-trimmed  sequences  converted  to  FASTA,  and  too short sequences
              removed.  (This file is automatically removed when it is not needed anymore.)

       reads/sequences.fasta.gz
              Fully pre-processed sequences.  That  is,  filtered  sequences  without  duplicates
              (using VSEARCH)

       stats/reads.txt
              Statistics of pre-processed sequences.

       stats/readlengths.txt, stats/readlengths.pdf
              Histogram    of    the   lengths   of   pre-processed   sequences   (created   from
              reads/sequences.fasta)

   Format of output files
   assigned.tab.gz
       This file is a gzip-compressed table with tab-separated  values.  It  is  created  by  the
       igdiscover  igblast  subcommand  and is the result of parsing raw output from IgBLAST.  It
       contains a few additional columns that do not come directly from IgBLAST.  In  particular,
       the  CDR3 sequence is detected, the sequence before the V gene match is split into UTR and
       leader, and the RACE-specific run of G nucleotides is also detected.  The first row  is  a
       header  row  with  column  names.  Each subsequent row describes the IgBLAST results for a
       single pre-processed input sequence.

       Note: This file is typically quite large.  LibreOffice can open the  file  directly  (even
       though it is compressed), but make sure you have enough RAM.

       Columns:

       count  How  many  copies of input sequence this query sequence represents. Copied from the
              ;size=3;  entry  in  the  FASTA   header   field   that   is   added   by   VSEARCH
              -derep_fulllength.

       V_gene, D_gene, J_gene
              V/D/J gene match for the query sequence

       stop   whether the sequence contains a stop codon (either “yes” or “no”)

       productive

       V_covered, D_covered, J_covered
              percentage of bases of the reference gene that is covered by the bases of the query
              sequence

       V_evalue, D_evalue, J_evalue
              E-value of V/D/J hit

       FR1_SHM, CDR1_SHM, FR2_SHM, CDR2_SHM, FR3_SHM, V_SHM, J_SHM
              rate of somatic hypermutation (actually, an error rate)

       V_errors, J_errors
              Absolute number of errors (differences) in the V and J gene match

       UTR    Sequence of the 5' UTR (the part before the V gene match up to, but not  including,
              the start codon)

       leader Leader sequence (the part between UTR and the V gene match)

       CDR1_nt, CDR1_aa, CDR2_nt, CDR2_aa, CDR3_nt, CDR3_aa
              nucleotide and amino acid sequence of CDR1/2/3

       V_nt, V_aa
              Nucleotide and amino acid sequence of V gene match

       V_CDR3_start
              Start  coordinate  of  CDR3  within  V_nt.  Set  to  zero  if no CDR3 was detected.
              Comparisons involving the V gene ignore those V bases that are part of the CDR3.

       V_end, VD_junction, D_region, DJ_junction, J_start
              nucleotide sequences for various match regions

       name, barcode, race_G, genomic_sequence
              see the following explanation

       The UTR, leader, barcode, race_G and genomic_sequence columns are filled in the  following
       way.

       1. Split  the  5'  end  barcode from the sequence (if barcode length is zero, this will be
          empty), put it in the barcode column.

       2. Remove the initial run of G bases from the remaining sequence, put that in  the  race_G
          column.

       3. The remainder is put into the genomic_sequence column.

       4. If  there  is  a  V  gene  match,  take  the  sequence before it and split it up in the
          following way. Search for the start codon and write the part before  it  into  the  UTR
          column. Write the part starting with the start column into the leader column.

   filtered.tab.gz
       This  table  is  the  same  as  the  assigned.tab.gz  table,  except  that rows containing
       low-quality matches have been filtered out.  Rows fulfilling any of the following criteria
       are filtered:

       • The J gene was not assigned

       • A stop was codon found

       • The V gene coverage is less than 90%

       • The J gene coverage is less than 60%

       • The V gene E-value is greater than 10-3

   candidates.tab
       This table contains the candidates for novel V genes found by the discover subcommand.  As
       the other files, it is a text file in tab-separated values  format,  with  the  first  row
       containing the column headings.  It can be opened directly in LibreOffice, for example.

       Candidates  are  found  by  inspecting  all the sequences assigned to a database gene, and
       clustering them in multiple ways.  The  candidate  sequences  are  found  by  computing  a
       consensus from each found cluster.

       Each  row  describes  a  single  candidate,  but possibly multiple clusters.  If there are
       multiple clusters from a single gene that lead to the same consensus sequence,  then  they
       get  only  one  row.  The cluster column lists the source clusters for the given sequence.
       Duplicate sequences can still occur when two different genes lead to  identical  consensus
       sequences.  (These duplicated sequences are merged by the germline filters.)

       Below,  we  use  the term cluster set to refer to all the sequences that are in any of the
       listed clusters.

       Some clusters lead to ambiguous consensus sequences (those that include N  bases).   These
       have already been filtered out.

       name   The name of the candidate gene. See novel gene names.

       source The  original  database  gene  to which the sequences from this row were originally
              assigned.  All candidates coming from the same source gene are grouped together.

       chain  Chain type: VH for heavy, VK for light chain lambda, VL for light chain kappa

       cluster
              From which type of cluster or clusters the consensus was computed.   If  there  are
              multiple  clusters  that  give  rise  to  the same consensus sequence, they are all
              listed here, separated by  semicolon.   A  cluster  name  such  as  2-4  is  for  a
              percentage  difference window: Such a cluster consists of all sequences assigned to
              the source gene that have a percentage difference to it between 2 and 4 percent.

              A cluster name such as cl3 describes a cluster generated  through  linkage  cluster
              analysis.   The clusters are simply named cl1, cl2, cl3 etc.  If any cluster number
              seems to be missing (such as when cl1 and cl3 occur, but not cl2), then this  means
              that the cluster led to an ambiguous consensus sequence that has been filtered out.
              Since the cl clusters are created from a random subsample of the data (in order  to
              keep  computation  time down), they are never larger than the size of the subsample
              (currently 1000).

              The name db represents a cluster that is identical to the database sequence.  If no
              actual  cluster  corresponding  to the database sequence is found, but the database
              sequence is expressed, a db cluster is inserted artificially in order to make  sure
              that  the  sequence  is  not  lost.  The cluster name all represents the set of all
              sequences assigned to the source gene.  This means that  an  unambiguous  consensus
              could  be  computed  from  all the sequences.  Typically, this happens during later
              iterations when there are no more novel sequences among the sequences  assigned  to
              the database gene.

       cluster_size
              The  number  of sequences from which the consensus was computed.  Equivalently, the
              size of the cluster set (all clusters described in this row).  Sequences  that  are
              in multiple clusters at the same time are counted only once.

       Js     The number of unique J genes associated with the sequences in the cluster set.

              Consensus  sequences  are  computed  only  from  V  gene sequences, but each V gene
              sequence is part of a full V/D/J sequence.  We therefore know for each  V  sequence
              which  J  gene it was found with.  This number says how many different J genes were
              found for all sequences that the consensus in this row was computed from.

       CDR3s  The number of unique CDR3 sequences associated with the sequences  in  the  cluster
              set.   See  also  the  description  for  the  Js column.  This number says how many
              different CDR3 sequences were found for all sequences that the  consensus  in  this
              row was computed from.

       exact  The  number  of  exact  occurrences  of  the consensus sequence among all sequences
              assigned to the source gene, ignoring the 3' junction region.

              To clarify: While the  consensus  sequence  is  computed  only  from  a  subset  of
              sequences  assigned to a source gene, all sequences assigned to the source gene are
              searched for exact occurrences of that consensus sequence.

              When comparing sequences, they are first truncated at the 3' end by removing  those
              (typically 8) bases that correspond to the CDR3 region.

       barcodes_exact
              How  many  unique  barcode sequences were used by the sequences in the set of exact
              sequences (described above).

       Ds_exact
              How many unique D genes were used by the sequences in the set  of  exact  sequences
              (described  above).  Only  those  D gene assignments are included in this count for
              which the number of errors is zero, the E-value is at most a given  threshold,  and
              for which the number of covered bases is at least a given percentage.

       Js_exact
              How  many  unique  J genes were used by the sequences in the set of exact sequences
              (described above).

       CDR3s_exact
              How many unique CDR3 sequences were used by the  sequences  in  the  set  of  exact
              sequences (described above).

       clonotypes
              The  estimated  number  of  clonotypes  within the set of exact sequences (which is
              described above).  The value is computed by clustering the  unique  CDR3  sequences
              associated  with all exact occurrences, allowing up to six differences (mismatches,
              insertions, deletions) and then counting the number of resulting clusters.

       database_diff
              The number of differences between the consensus sequence and the  sequence  of  the
              source  gene.  (Given as edit distance, that is insertion, deletion, mismatch count
              as one difference each.)

       has_stop
              Indicates whether the consensus sequence contains a stop codon.

       looks_like_V
              Whether the consensus sequence “looks like” a true V gene (1  if  yes,  0  if  no).
              Currently,  this  checks  whether the 5' end of the sequence matches a known V gene
              motif.

       CDR3_start
              Where the CDR3 starts within the discovered V gene sequence.  This  uses  the  most
              common  CDR3  start  location  among  the  sequences  from  which this consensus is
              derived.

       consensus
              The consensus sequence itself.

       The igdiscover discover command can also be run by hand with other  parameters,  in  which
       case additional columns may appear.

       N_bases
              Number of N bases in the consensus

   annotated_V_*.tab
       The  two  files annotated_V_germline.tab and annotated_V_pregermline.tab are copies of the
       candidates.tab file with two extra columns that show why a candidate was filtered  in  the
       germline and pre-germline filtering steps. The two columns are:

          • is_filtered  –  This  is  a number that indicates how many filtering criteria exclude
            this candidate apply.

          • why_filtered – This is a semicolon-separated list of filtering reasons.

       The following values can occur in the why_filtered column:

       too_low_dbdiff
              The number of differences between this candidate and the database is lower than the
              required number.

       too_many_N_bases
              The candidate contains too many N nucleotide wildcard characters.

       too_low_CDR3s_exact
              The CDR3s_exact value for this candidate is lower than required.

       too_high_CDR3_shared_ratio
              The CDR3_shared_ratio is higher than the configured threshold.

       too_low_Js_exact
              The Js_exact value is lower than the configured threshold.

       has_stop
              The  filter  configuration disallows stop codons, but this candidate has one and is
              not whitelisted.

       too_low_cluster_size
              The cluster_size of this candidate is lower than the configured threshold, and  the
              candidate is not whitelisted.

       is_duplicate
              A  filtering  criterion not listed above applies to this candidate. This covers all
              the filters that need to compare candidates to  each  other:  cross-mapping  ratio,
              clonotype allele ratio, exact ratio, Ds_exact ratio.

   Names for discovered genes
       Each  gene  discovered  by  IgDiscover  gets  a  unique  name such as “VH4.11_S1234”.  The
       “VH4.11” is the name of the database  gene  to  which  the  novel  V  gene  was  initially
       assigned.  The number 1234 is derived from the nucleotide sequence of the novel gene. That
       is, if you discover the same sequence in two different runs of the IgDiscover, or just  in
       different  iterations, the number will be the same. This may help when manually inspecting
       results.

       Be aware that you still need to check the sequence itself since even  different  sequences
       can sometimes lead to the same number (a “hash collision”).

       The _S1234 suffixes do not accumulate.  Before IgDiscover adds the suffix in an iteration,
       it removes the suffix if it already exists.

   Subcommands
       The igdiscover program has multiple subcommands.  You should already be familiar with  the
       two commands init and run.  Each subcommand comes with its own help page that shows how to
       use that subcommand.  Run the command with the --help option to see the help. For example,

          igdiscover run --help

       shows the help for the run subcommand.

       The following additional subcommands may be useful for further analysis.

       commonv
              Find common V genes between two different antibody libraries

       upstream
              Cluster upstream sequences (UTR and leader) for each gene

       dendrogram
              Draw a dendrogram of sequences in a FASTA file.

       rename Rename sequences in a target FASTA file using a template FASTA file

       union  Compute union of sequences in multiple FASTA files

       The following subcommands are used internally, and listed here for completeness.

       filter Filter a table with IgBLAST results

       count  Count and plot V, D, J gene usage

       group  Group sequences by barcode and V/J assignment  and  print  each  group’s  consensus
              (unused in IgDiscover)

       germlinefilter
              Create  new  V  gene  database  from  V  gene  candidates  using  the  germline and
              pre-germline filter criteria.

       discover
              Discover candidate new V genes within a single antibody library

       clusterplot
              For each V gene, plot a clustermap of the sequences assigned to it

       errorplot
              Plot histograms of differences to reference V gene

   Germline and pre-germline filtering
       V gene sequences found by the clustering step of the program (the discover subcommand) are
       stored in the candidates.tab file. The entries are “candidates” because many of these will
       be PCR or other artifacts and therefore do not represent true novel V genes. The  germline
       and  pre-germline  filters  take  care  of removing artifacts. They germline filter is the
       “real” filter and used only in the last iteration  in  order  to  obtain  the  final  gene
       database. The pre-germline filter is less strict and used in all the earlier iterations.

       The  germline  filters  are  implemented  in  the igdiscover germlinefilter subcommand. It
       performs the following filtering and processing steps:

       • Discard sequences with N bases

       • Discard sequences that come from a consensus over too few source sequences

       • Discard sequences with too few unique CDR3s (CDR3s_exact column)

       • Discard sequences with too few unique Js (Js_exact column)

       • Discard sequences identical to one of the database sequences (if DB given)

       • Discard sequences that do not match a set of known good motifs

       • Discard sequences that contain a stop codon (has_stop column)

       • Discard near-duplicate sequences

       • Discard cross-mapping artifacts

       • Discard sequences whose “allele ratio” is too low.

       If a whitelist of sequences is provided (by default, this is the input V  gene  database),
       then the candidates that appear on it

       • are not checked for the cluster size criterion,

       • do not need to match a set of known good motifs,

       • are never considered near-duplicates (but they are checked for cross-mapping and for the
         allele ratio),

       • are allowed to contain a stop codon.

       Whitelisting allows IgDiscover to identify known germline sequences that are expressed  at
       low  levels in a library. If enabled with whitelist: true (the default) in the pregermline
       and germline filter sections of the configuration  file,  the  sequences  present  in  the
       starting database are treated as validated germline sequences and will not be discarded if
       due  to  too  small  cluster  size  as  long  as  they  fulfill  the  remaining   criteria
       (unique_cdr3s, unique_js etc.).

       You can see why a candidate was filtered by inspecting the annotated_V_*.tab files

   Cross-mapping artifacts
       If  two  very  similar  sequences  appear in the database used by IgBLAST, then sequencing
       errors may lead to  one  sequence  incorrectly  being  assigned  to  the  other.  This  is
       particularly  problematic  if  one of the sequences is highly expressed while the other is
       not expressed at all. The not expressed sequence is even included in the list  of  V  gene
       candidates  because  it is in the input database and therefore whitelisted. We call this a
       “cross-mapping artifact”.

       The germline filtering step  of  IgDiscover  therefore  aims  to  eliminate  cross-mapping
       artifacts by checking all pairs of sequences for the following:

       • The two sequences have a distance of 1,

       • they are both in the database for that particular iteration (only then can cross-mapping
         occur)

       • the ratio between the expression levels of the two  sequences  (using  the  cluster_size
         field  in the candidates.tab file) is less than the value cross_mapping_ratio defined in
         the configuration file (0.02 by default).

       If all that is the case, then the sequence with the lower expression is discarded.

   Allele-ratio filtering
       When multiple alleles of the same gene appear in the list of V gene  candidates,  such  as
       IGHV1-2*02  and  IGHV1-2*04,  the  germline filter computes the ratio of the values in the
       exact and the clonotypes columns between them.  If  the  ratio  is  under  the  configured
       threshold,  the  candidate  with  the  lower  count  is discarded. See the exact_ratio and
       clonotype_ratio settings in the germline_filter and  pregermline_filter  sections  of  the
       configuration file.

       New in version 0.7.0.

   Data from the Sequence Read Archive (SRA)
       To  work  with  datasets  from  the  Sequence  Read  Archive, you may want to use the tool
       fastq-dump, which can download the reads in the format required by  IgDiscover.  You  just
       need  to  know  the  accession  number,  such as “SRR2905710” and then run this command to
       download the files to the current directory:

          fastq-dump --split-files --gzip SRR2905710

       The --split-files option ensures that the paired-end reads  are  stored  in  two  separate
       files,  one  for  the  forward and one for the reverse read, respectively.  (If you do not
       provide it, you will get an interleaved FASTQ  file  that  currently  cannot  be  read  by
       IgDiscover).  The  --gzip option creates compressed output.  The command creates two files
       in the current directory. In the above example, they would be named  SRR2905710_1.fastq.gz
       and SRR2905710_2.fastq.gz.

       The  program fastq-dump is part of the SRA toolkit. On Debian-derived Linux distributions,
       you can typically install it with sudo apt-get install sra-toolkit. On Conda,  install  it
       with conda install -c bioconda sra-tools.

   Does random subsampling influence results?
       Random  subsampling  indeed  influences  somewhat which sequences are found by the cluster
       analysis, particularly in the beginning. However, the probability is large that all highly
       expressed  sequences  are  represented  in  the  random  sample. Also, due to the database
       growing with subsequent iterations, the set of sequences assigned  to  a  single  database
       gene  becomes  smaller  and  more homogeneous. This makes it increasingly likely that also
       sequences expressed at lower levels result in a cluster since they now make  up  a  larger
       fraction of each subsample.

       Also,  many  of  the clusters which are captured in one subsample but not in the other are
       artifacts that are then filtered out anyway by the pre-germline or germline filter.

       On human data with a nearly complete starting database, the subsampling seems to  have  no
       influence  at  all, as we determined experimentally. We repeated a run of the program four
       times on the same human dataset, using identical parameters  each  time  except  that  the
       subsampling  was done in a different way. Although intermediate results differed, all four
       personalized databases that the program produced were exactly identical.

       Concordance is lower, though, when the input database is not as complete as the human one.

       The way in which random subsampling is done is modified by the seed configuration setting,
       which  is  set  to  1  by  default. If its value is the same for two different runs of the
       program with otherwise identical  settings,  the  numbers  chosen  by  the  random  number
       generator  will  be  the  same and therefore also subsampling will be done in an identical
       way. This makes runs of the program reproducible. In order to test how results differ when
       subsampling is done in a different way, change the seed to a different value.

   Logging the program’s output to a file
       When you report a bug or unusual behavior to us, we might ask you to send us the output of
       igdiscover run. You can send its output to a file by running the program like this:

          igdiscover run >& logfile.txt

       And here is how to send the logging output to a file and  also  see  the  output  in  your
       terminal at the same time (but you lose the colors):

          igdiscover run |& tee logfile.txt

   Caching of IgBLAST results and of merged reads
       Sometimes  you  may  want  to  re-analyze  a  dataset multiple times with different filter
       settings.  To speed this up,  IgDiscover  can  cache  the  results  of  two  of  the  most
       time-consuming steps, read-merging with PEAR and running IgBLAST.

       The  cache  is  disabled  by  default as it uses a lot of disk space. To enable the cache,
       create a file named ~/.config/igdiscover.conf with the following contents:

          use_cache: true

       If you do so, a directory named ~/.cache/igdiscover/ is created  the  next  time  you  run
       IgDiscover  and all IgBLAST results as well as merged reads from PEAR are stored there. On
       subsequent runs, the existing result is  used  directly  without  calling  the  respective
       program, which speeds up the pipeline considerably.

       The  cache  is only used when we are certain that the results will indeed be the same. For
       example, if the IgBLAST program version or th V/D/J database changes, the cached result is
       not used.

       The  files  in  the cache are compressed, but the cache may still get large over time. You
       can delete the cache with rm -r ~/.cache/igdiscover to free the space.

       You should also delete the cache when updating to a  newer  IgBLAST  version  as  the  old
       results will not be used anymore.

   Terms
       Analysis directory
              The  directory that was created with igdiscover init. Separate ones are created for
              each experiment. When you used igdiscover init myexperiment, the analysis directory
              would be myexperiment/.

       Starting database
              The  initial  list of V/D/J genes. These are expected to be in FASTA format and are
              copied into the database/ directory within each analysis directory.

QUESTIONS AND ANSWERS

   How many sequences are needed to discover germline V gene sequences?
       Library sizes of several hundred thousand sequences are required  for  V  gene  discovery,
       with  even higher numbers necessary for full database production. For example, IgM library
       sizes of 750,000 to 1,000,000 sequences for heavy chain databases and  1.5  to  2  million
       sequences for light chain databases.

   Can IgDiscover analyze IgG libraries?
       IgDiscover  has  been developed to identify germline databases from libraries that contain
       substantial fractions of unswitched antibody sequences. We  recommend  IgM  libraries  for
       heavy  chain  V  gene  identification  and  IgKappa and IgLambda libraries for light chain
       identification.  IgDiscover  can  identify  a  proportion  of  gemline  sequences  in  IgG
       libraries but the process is much more efficient with IgM libraries, enabling the full set
       of germline sequences to be discovered.

   Can IgDiscover analyze a previously sequenced library?
       Yes, IgDiscover accepts both unpaired FASTQ files and paired FASTA files but  the  program
       should be made aware which is being used, see input requirements.

   Do the positions of the PCR primers make a difference to the output?
       Yes.  For  accurate  V gene discovery, all primer sequences must be external to the V gene
       sequences.  For example, forward multiplex amplification primers should be present in  the
       leader  sequence  or  5'  UTR,  and reverse amplification primers should be located in the
       constant region, preferably close to the 5' border of the constant  region.  Primers  that
       are  present  in  framework  1  region  or  J  segments  are  not  recommended for library
       production.

   What are the advantages to 5'-RACE compared to multiplex PCR for IgDiscover analysis?
       Both 5'-RACE and multiplex PCR have their own advantages.

       5'-RACE will enable library production from species where the upstream V gene sequence  is
       unknown.   The output of the upstream subcommand in IgDiscovery enables the identification
       of consensus leader and 5'-UTR sequences for each of the identified germline V genes, that
       can  subsequenctly  be  used  for primer design for either multiplex PCR or for monoclonal
       antibody amplification sets.

       Multiplex  PCR  is  recommended  for  species  where  the  upstream  sequences  are   well
       characterized.   Multiplex  amplification  products  are shorter than 5'-RACE products and
       therefore will be easier to pair and will have less length associated sequence errors.

   What is meant by 'starting database'?
       The starting database refers to the folder that contains the three FASTA  files  necessary
       for  the  process  of  iterative V gene discovery to begin. IgDiscover uses the standalone
       IgBLAST program for comparative assignment of sequences to the starting database.  Because
       IgBlast  requires  three  files (for example V.fasta, D.fasta, J.fasta), three FASTA files
       should be included in the database folder for each analysis to proceed.

       In the case of light chains (that do not contain D  segments),  a  dummy  D  segment  file
       should  be  included  as  IgBLAST  will  not proceed if it does not see three files in the
       database folder. It is sufficient to save the following  sequence  as  a  fasta  file  and
       rename  it  D.fasta,  for  example, for it to function as the dummy D.fasta file for human
       light chain analysis:

          >D_ummy
          GGGGGGGGGG

   How can I use the IMGT database as a starting database?
       Since we do not have permission to distribute IMGT database  files  with  IgDiscover,  you
       need  to  download  them  directly  from  IMGT.   See  the section about obtaining a V/D/J
       database.

   How do I change the parameters of the program?
       By editing the configuration file.

   Where do I find the individualized database produced by IgDiscover?
       The final germline database  in  FASTA  format  is  in  your  analysis  directory  in  the
       subdirectory  final/database/.  The  V.fasta  file  contains  the new list of V genes. The
       D.fasta and J.fasta files are unchanged from the starting database.

       A phylogenetic tree of the V sequences can be found in final/dendrogram_V.pdf.

       For more details of how that database was created, you need to inspect the  files  created
       in  the  last iteration of the discovery process, located in iteration-xx, where xx is the
       number of iterations configured in the igdiscover.yaml configuration file. For example, if
       three iterations were used, look into iteration-03/.

       Most interesting in that folder are likely

       • the linkage cluster analysis plots in iteration-03/clusterplots/,

       • the  error  histograms  in  iteration-03/errorhistograms.pdf, which contain the windowed
         cluster analysis figures.

       • Details about the individualized database in new_V_germline.tab  in  tab-separated-value
         format

       The new_V_germline.fasta file is identical to the one in final/database/V.fasta

   What does the _S1234 at the end of same gene names mean?
       Please see the Section on gene names.

ADVANCED TOPICS

       IgDiscover  itself  does not (yet) come with all imaginable analysis facilities built into
       it.  However, it creates many files (mostly with tables)  that  can  be  used  for  custom
       analysis.   For example, all .tab files (in particular assigned.tab.gz and candidates.tab)
       can be opened and inspected in a spreadsheet application such as LibreOffice. From  there,
       you can do basic tasks such as sorting from the menu of that application.

       Often,  these  facilities  are  not  enough,  however, and some basic understanding of the
       command-line is helpful. Clearly, this is not as convenient as working in a graphical user
       interface (GUI), but we do not currently have the resources to provide one for IgDiscover.
       To alleviate this somewhat, we provide here instructions for a few  things  that  you  may
       want to do with the IgDiscover result files.

   Extract all sequences that match any database gene exactly
       The candidates.tab file tells you for each discovered sequence how often an exact match of
       that sequence was found in your input reads. A high number of  exact  matches  is  a  good
       indication  that  the  candidate  is  actually  a new gene or allele. In order to find the
       original reads that correspond to those matches, you can look at the filtered.tab.gz  file
       and extract all rows where the V_errors column is zero.

       First, run this on the filtered.tab.gz file:

          zcat filtered.tab.gz | head -n 1 | tr '\t' '\n' | nl

       This  will  enumerate  the columns in the file. Take a note of the index that the V_errors
       column has. In newer pipeline versions, the index is 21. Then extract all rows of the file
       where that field is equal to zero:
          zcat filtered.tab.gz | awk -vFS="t" '$21 == 0 || NR == 1' > exact.tab

       If  the column wasn’t 21, then replace the $21 appropriately. The part where it says NR ==
       1 ensures that the column headings are also printed.

   Extra configuration settings
       Some configuration settings are not documented in the default igdiscover.yaml  file  since
       they rarely need to be changed.

          # Leave empty or choose a species name supported by IgBLAST:
          # human, mouse, rabbit, rat, rhesus_monkey
          # This setting is not used anywhere except that it is passed
          # to IgBLAST using the -organism option. Since we provide IgBLAST
          # with our own gene databases, it seems this has no effect.
          species:

          # Which program to use for computing multiple alignments. This is used for
          # computing consens sequences.
          # Choose 'mafft', 'clustalo', 'muscle' or 'muscle-fast'.
          # 'muscle-fast' runs muscle with parameters "-maxiters 1 -diags".
          #
          #multialign_program: muscle-fast

DEVELOPMENT

Source codeReport an issue

   Installing the development version
       To use the most recent IgDiscover version from Git, follow these steps.

       1. If  you  haven’t  done  so,  install  miniconda.  See  the  first  steps of the regular
          installation instructions. Do not install IgDiscover, yet!

       2. Clone the IgDiscover repository:

             git clone https://github.com/NBISweden/IgDiscover.git

          (Use the git@ URL instead if you are a developer.)

       4. Create a new Conda environment using the environment.yml file in the repository:

             cd IgDiscover
             conda env create -n igdiscover -f environment.yml

          You can choose a  different  environment  name  by  changing  the  name  after  the  -n
          parameter.  This  may  be  necessary,  when  you already have a regular (non-developer)
          IgDiscover installation in an igdiscover environment that you don’t want to overwrite.

       5. Activate the environment:

             source activate igdiscover

          (Or use whichever name you chose above.)

       6. Install IgDiscover in “editable” mode:

             python3 -m pip install -e .

       Whenever you want to update the software:

          cd IgDiscover
          git pull

       It may also be necessary to repeat the python3 -m pip install -e . step.

   IgBLAST result cache
       For development, in particular when  running  tests  repeatedly,  you  should  enable  the
       IgBLAST  result  cache. The cache stores IgBLAST output. If the same dataset with the same
       dataset is run a second time, the result is retrieved from the cache and  IgBLAST  is  not
       re-run.  This saves a lot of time when re-running datasets, but may also fill up the cache
       directory ~/.cache/igdiscover/.  Also, in production, datasets are usually not re-run with
       the same settings, which is why caching is disabled by default.

       To enable the cache, create a file ~/.config/igdiscover.conf with the following content:

          use_cache: true

       The file is in YAML format, but at the moment, no other settings are supported.

   Building the documentation
       Go to the doc/ directory in the repository, then run:

          make

       to  build  the documentation locally. Open _build/html/index.html in a browser. The layout
       is different from the version shown on Read the  Docs,  but  allows  you  to  preview  any
       changes you may have made.

   Making a release
       We  use versioneer to manage version numbers. It extracts the version number from the most
       recent tag in Git. Thus, to increment the version number, create a Git tag:

          git tag v0.5

       The v prefix is mandatory.

       Then:

       • tests/run.shpython3 setup.py sdisttwine upload sdist/igdiscover-0.10.tar.gz

       • Update bioconda recipe

   Removing IgDiscover from a Linux system
       If you have been playing around with different  installation  methods  (pip,  conda,  git,
       python3  setup.py  install etc.) you may have multiple copies of IgDiscover on your system
       and you will likely run into problems on updates. Here is a list you can follow  in  order
       to  get rid of the installations as preparation for a clean re-install. Do not add sudo to
       the commands below if you get permission problems, unless explicitly told to do so! If one
       of the steps does not work, that is fine, just continue.

       1. Delete  miniconda:  Run  the  command  which  conda.  The output will be something like
          /home/myusername/miniconda3/bin/conda. The  part  before  bin/conda  is  the  miniconda
          installation  directory.  Delete  that  folder.  In this case, you would need to delete
          miniconda3 in /home/myusername.

       2. Run pip3 uninstall igdiscover. If this runs successfully and prints some messages about
          removing  files,  then repeat the same command! Do this until you get a message telling
          you that the package cannot be uninstalled because it is not installed.

       3. Repeat the previous step, but with pip3 uninstall sqt.

       4. If you have a directory named .local within your home directory, you may want to rename
          it:  mv  .local dot-local-backup You can also delete it, but there is a small risk that
          other software (not IgDiscover) uses that directory. The  directory  is  hidden,  so  a
          normal ls will not show it.  Use ls -la while in your home directory to see it.

       5. If  you  have  ever  used  sudo  to install IgDiscover, you may have an installation in
          /usr/local/. You can try to remove it with sudo pip3 uninstall igdiscover.

       6. Delete the cloned Git repository if you have one. This is the directory  in  which  you
          run git pull.

       Finally,  you  can  follow  the  normal  installation  instructions and then the developer
       installation instructions.

CHANGES

   v0.11 (2018-11-27)
       • The IgBLAST cache is now disabled by default. We assume that, in  most  cases,  datasets
         will  not  be re-run with the exact same parameters, and then it only fills up the disk.
         Delete your cache with rm -r ~/.cache/igdiscover to reclaim the  space.  To  enable  the
         cache, create a file ~/.config/igdiscover.conf with the contents use_cache: true.

       • If  you  choose to enable the cache, results from the PEAR merging step will now also be
         cached. See also the caching documentation.

       • Added detection of chimeras to the (pre-)germline filters. Any novel allele that can  be
         explained   as  a  chimera  of  two  unmodified  reference  alleles  is  marked  in  the
         new_V_germline.tab file. This is a bit sensitive, so  the  candidate  is  currently  not
         discarded.

       • Two   additional  files  annotated_V_germline.tab  and  annotated_V_pregermline.tab  are
         created in each iteration during the germline filtering step. These are identical to the
         candidates.tab file, except that they contain a why_filtered column that describes why a
         sequence was filtered. See the documentation for this feature.

       • A more realistic test dataset (v0.5), now based on human instead  of  rhesus  data,  was
         prepared. The testing instructions have been updated accordingly.

       • J discovery has been tuned to give fewer truncated sequences.

       • Statistics are written to stats/stats.json.

       • V   SHM   distribution   plots   are   created  automatically  and  written  written  to
         v-shm-distributions.pdf in each iteration folder.

       • An igdiscover dbdiff subcommand was added that can compare two FASTA files.

   v0.10 (2018-05-11)
       • When computing a consensus sequence, allow some sequences to be truncated in the 3' end.
         Many  of  the  discovered novel V alleles were truncated by one nucleotide in the 3' end
         because IgBLAST does not always extend the alignment to the end of the  V  sequence.  If
         these slightly too short V sequences were in the majority, their consensus would lead to
         a truncated sequence as well. The new consensus algorithm allows for this effect at  the
         3' end and can therefore more often than previously find the full sequence.  Example:

            TACTGTGCGAGAGA (seq 1)
            TACTGTGCGAGAGA (seq 2)
            TACTGTGCGAGAG- (seq 3)
            TACTGTGCGAG--- (seq 4)
            TACTGTGCGAG--- (seq 5)

            TACTGTGCGAGAG  (previous consensus)
            TACTGTGCGAGAGA (new consensus)

       • Add  a  column  database_changes  to  the new_V_germline.tab file that describes how the
         novel sequence differs from the database sequence. Example: 93C>T; 114A>G

       • Allow filtering by CDR3_shared_ratio and do so by default (needs documentation)

       • Cache the edit distance when computing the  distance  matrix.  Speeds  up  the  discover
         command slightly.

       • discover: Use more than six CPU cores if available

       • igblast: Print progress every minute

   v0.9 (2018-03-22)
       • Implemented allele ratio filtering for J gene discovery

       • J  genes  are  discovered  as  part  of  the pipeline (previously, one needed to run the
         discoverj script manually)

       • In each iteration, dendrograms are now created not only for V genes, but also for D  and
         J genes. The file names are dendrogram_D.pdf, dendrogram_J.pdf

       • The  V  dendrograms  are now in dendrogram_V.pdf (no longer V_dendrogram.pdf). This puts
         all the dendrograms together when looking at the files in the iteration directory.

       • The V_usage.tab and V_usage.pdf files are no longer created.   Instead,  expressed_V.tab
         and  expressed_V.pdf are created. These contain similar information, but an allele-ratio
         filter is used to filter out artifacts.

       • Similarly, expressed_D.tab and expressed_J.tab and their .pdf counterparts  are  created
         in each iteration.

       • Removed parse subcommand (functionality is in the igblast subcommand)

       • New  CDR3  detection method (only heavy chain sequences): CDR3 start/end coordinates are
         pre-computed using the database V and J  sequences.  Increases  detection  rate  to  99%
         (previously less than 90%).

       • Remove  the ability to check discovered genes for required motifs. This has never worked
         well.

       • Add a column clonotypes to the candidates.tab that tries to count  how  many  clonotypes
         are associated with a single candidate (using only exact occurrences).  This is intended
         to replace the CDR3s_exact column.

       • Add an exact_ratio to the germline filtering options. This checks the ratio between  the
         exact V occurrence counts (exact column) between alleles.

       • Germline filtering option allele_ratio was renamed to clonotypes_ratio

       • Implement  a  cache  for IgBLAST results. When the same dataset is re-analyzed, possibly
         with different parameters, the cached results are used instead  of  re-running  IgBLAST,
         which  saves  a  lot  of time. If the V/D/J database or the IgBLAST version has changed,
         results are not re-used.

   v0.8.0 (2017-06-20)
       • Add a barcodes_exact column to the candidates table.  It  gives  the  number  of  unique
         barcode  sequences  that were used by the sequences in the set of exact sequences. Also,
         add a configuration setting barcode_consensus that can  turn  off  consensus  taking  of
         barcode groups, which needs to be set to false for barcodes_exact to work.

       • Add a Ds_exact column to candidates table.

       • Add a D_coverage configuration option.

       • The  pre-processing  filtering  step  no  longer  reads  in  the  full  table of IgBLAST
         assignments, but filters the table piece by piece. Memory usage for this step  therefore
         does not depend anymore on the dataset size and should always be below 1 GB.

       • The  functionality  of  the  parse  subcommand  has  been  integrated  into  the igblast
         subcommand. This means that igdiscover igblast  now  directly  outputs  a  result  table
         (assigned.tab). This makes it easier to use that subcommand directly instead of only via
         the workflow.

       • The igblast subcommand now always runs makeblastdb  by  itself  and  deletes  the  BLAST
         database afterwards. This reduces clutter and ensures the database is always up to date.

       • Remove  the  library_name configuration setting. Instead, the library_name is now always
         the same as the name of analysis directory.

   v0.7.0 (2017-05-04)
       • Add an “allele ratio” criterion to the germline filter to further reduce the  number  of
         false  positives.  The  filter is activated by default and can be configured through the
         allele_ratio setting in the configuration file. See the documentation for how it works.

       • Ignore the CDR3-encoding bases whenever comparing two V gene sequences.

       • Avoid finding 5'-truncated V genes by extending found hits towards the 5' end.

       • By default, candidate sequences are no longer merged if they are nearly identical.  That
         is, the differences setting within the two germline filter configuration sections is now
         set to zero by default.  Previously, we believed the merging  would  remove  some  false
         positives,  but  it  turns  out we also miss true positives. It also seems that with the
         other changes in this version we also no longer get the particular false  positives  the
         setting was supposed to catch.

       • Implement an experimental discoverj script for J gene discovery.  It is curently not run
         automatically as part of igdiscover run. See igdiscover discoverj --help for how to  run
         it manually.

       • Add  a  config  subcommand,  which can be used to change the configuration file from the
         command-line.

       • Add a V_CDR3_start column to the assigned.tab/filtered.tab tables.  It  describes  where
         the CDR3 starts within the V sequence.

       • Similarly,  add  a CDR3_start column to the new_V_germline.tab file describing where the
         CDR3 starts within a discovered V sequence.  It is computed by  using  the  most  common
         CDR3 start of the sequences within the cluster.

       • Rename the compose subcommand to germlinefilter.

       • The  init  subcommand  automatically  fixes  certain  problems  in  the  input  database
         (duplicate sequences, empty records, duplicate sequence  names).  Previously,  it  would
         complain, but the user would have to fix the problems themselves.

       • Move source code to GitHub

       • Set up automatic code testing (continuous integration) via Travis

       • Many documentation improvements

   v0.6.0 (2016-12-07)
       • The  FASTA files of the input V/D/J gene lists now need to be named V.fasta, D.fasta and
         J.fasta. The species name is no longer  part  of  the  file  name.  This  should  reduce
         confusion when working with species not supported by IgBLAST.

       • The  species:  configuration  setting  in the configuration can (and should) now be left
         empty. Its only use was that it is passed to  IgBLAST,  but  since  IgDiscover  provides
         IgBLAST with its own V/D/J sequences anyway, it does not seem to make a difference.

       • A  “cross-mapping”  detection  has  been  added, which should reduce the number of false
         positives.  See the documentation for an explanation.

       • Novel sequences identical to a database sequence no longer get the _S1234 suffix.

       • No longer trim trim the initial G run in sequences (due to RACE) by default. It is now a
         configuration setting.

       • Add  cdr3_location  configuration  setting:  It  allows  to set whether to use a CDR3 in
         addition to the barcode for grouping sequences.

       • Create a groups.tab.gz file by default (describing the de-barcoded groups)

       • The pre-processing filter is now configurable. See the preprocessing_filter  section  in
         the configuration file.

       • Many improvements to the documentation

       • Extended and fixed unit tests. These are now run via a CI system.

       • Statistics in JSON format are written to stats/stats.json.

       • IgBLAST 1.5.0 output can now be parsed. Parsing is also faster by 25%.

       • More helpful warning message when no sequences were discovered in an iteration.

       • Drop support for Python 3.3.

   v0.5 (2016-09-01)
       • V  sequences  of  the input database are now whitelisted by default.  The meaning of the
         whitelist configuration option has changed: If set to  false,  those  sequences  are  no
         longer whitelisted.  To whitelist additional sequences, create a whitelist.fasta file as
         before.

       • Sequences with stop codons are now filtered out by default.

       • Use more stringent germline filtering parameters by default.

   v0.4 (2016-08-24)
       • It is now possible to install and run IgDiscover on OS X. Appropriate Conda packages are
         available on bioconda.

       • Add  column  has_stop  to candidates.tab, which indicates whether the candidate sequence
         contains a stop codon.

       • Add a configuration option that makes it possible to  disable  the  5'  motif  check  by
         setting check_motifs: false (the looks_like_V column is ignored in this case).

       • Make it possible to whitelist known sequences: If a found gene candidate appears in that
         list, the sequence is included in the list of discovered sequences even  when  it  would
         otherwise  not  pass filtering criteria. To enable this, just add a whitelist.fasta file
         to the project directory before starting the analysis.

       • The criteria for germline filter and  pre-germline  filter  are  now  configurable:  See
         germline_filter and pre_germline_filter sections in the configuration file.

       • Different  runs  of IgDiscover with the same parameters on the same input files will now
         give the same results. See the seed parameter in the configuration, also on how  to  get
         non-reproducible results as before.

       • Both the germline and pre-germline filter are now applied in each iteration.  Instead of
         the   new_V_database.fasta   file,   two   files    named    new_V_germline.fasta    and
         new_V_pregermline.fasta are created.

       • The  compose  subcommand  now  outputs  a filtered version of the candidates.tab file in
         addition to a FASTA file. The table contains columns  closest_whitelist,  which  is  the
         name  of  the  closest  whitelist  sequence,  and whitelist_diff, which is the number of
         differences to that whitelist sequence.

   v0.3 (2016-08-08)
       • Optionally, sequences are not  renamed  in  the  assigned.tab  file,  but  retain  their
         original name as in the FASTA or FASTQ file. Set rename: false in the configuration file
         to get this behavior.

       • Started an “advanced” section in the manual.

   v0.2
       • IgDiscover can now also detect kappa and lambda light chain V genes (VK, VL)

AUTHOR

       Marcel Martin

       2015-2019, Marcel Martin