Ubuntu Manpage: trim_galore - automate quality and adapter trimming for DNA sequencing

NAME

       trim_galore - automate quality and adapter trimming for DNA sequencing

DESCRIPTION

              USAGE:

       trim_galore [options] <filename(s)>

       -h/--help               Print this help message and exit.

       -v/--version            Print the version information and exit.

       -q/--quality <INT>      Trim low-quality ends from reads in addition to adapter removal. For
              RRBS samples, quality trimming is performed first, and adapter trimming is carried out in a
              second round. Other files are quality and adapter trimmed in a single pass. The algorithm is
              the same as the one used by BWA (subtract INT from all qualities; compute partial sums from
              all indices to the end of the sequence; cut the sequence at the index at which the sum is
              minimal). Default Phred score: 20.
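
              The BWA-style rule described above can be sketched in a few lines of Python. This is only
              an illustration of the description in this help text, not the code Cutadapt runs; the
              function names are made up for the example, and the real implementation may differ in
              details such as tie handling or early termination.

```python
def quality_trim_index(qualities, cutoff=20):
    """Return how many bases to keep, counted from the 5' end.

    Follows the description in the help text: subtract the cutoff from every
    quality, form the partial sums from each index to the end of the read,
    and cut at the index where that sum is smallest.
    """
    running = 0           # partial sum from the current index to the 3' end
    lowest = 0            # an empty suffix sums to 0, i.e. "trim nothing"
    keep = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < lowest:
            lowest = running
            keep = i
    return keep


def quality_trim(seq, qual, cutoff=20, offset=33):
    """Trim the 3' end of one read (offset 33 for --phred33, 64 for --phred64)."""
    scores = [ord(ch) - offset for ch in qual]
    keep = quality_trim_index(scores, cutoff)
    return seq[:keep], qual[:keep]
```

              With the default cutoff of 20, a read whose last four bases have Phred scores of 10, 10, 2
              and 2 after four bases of score 30 would lose exactly those last four bases.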

       --phred33               Instructs Cutadapt to use ASCII+33 quality scores as Phred scores
              (Sanger/Illumina 1.9+ encoding) for quality trimming. Default: ON.

       --phred64               Instructs Cutadapt to use ASCII+64 quality scores as Phred scores
              (Illumina 1.5 encoding) for quality trimming.

       --fastqc                Run FastQC in the default mode on the FastQ file once trimming is complete.

       --fastqc_args "<ARGS>"  Passes extra arguments to FastQC. If more than one argument is to be passed
              to FastQC, they must be in the form "arg1 arg2 etc.". An example would be:
              --fastqc_args "--nogroup --outdir /home/". Passing extra arguments will automatically invoke
              FastQC, so --fastqc does not have to be specified separately.

       -a/--adapter <STRING>   Adapter sequence to be trimmed. If not specified explicitly, Trim Galore will
              try to auto-detect whether the Illumina universal, Nextera transposase or Illumina small RNA
              adapter sequence was used. Also see '--illumina', '--nextera' and '--small_rna'. If no
              adapter can be detected within the first 1 million sequences of the first file specified, or
              if there is a tie between several adapter sequences, Trim Galore defaults to '--illumina' (as
              long as the Illumina adapter was one of the options, else '--nextera' is the default). A
              single base may also be given as e.g. -a A{10}, to be expanded to -a AAAAAAAAAA.
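
              The single-base shorthand can be illustrated with a small Python sketch. The function
              name and the set of accepted base letters are assumptions made for this example; only
              the A{10} to AAAAAAAAAA expansion itself is taken from the text above.

```python
import re

# Assumption for this sketch: one base letter followed by a count in braces.
_SINGLE_BASE = re.compile(r"^([ACGTN])\{(\d+)\}$", re.IGNORECASE)


def expand_adapter(adapter):
    """Expand shorthand such as 'A{10}'; return other adapters unchanged."""
    match = _SINGLE_BASE.match(adapter)
    if match is None:
        return adapter
    base, count = match.group(1), int(match.group(2))
    return base * count
```

              Ordinary adapter sequences such as AGATCGGAAGAGC pass through this helper untouched.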

       -a2/--adapter2 <STRING> Optional adapter sequence to be trimmed off read 2 of paired-end files. This
              option requires '--paired' to be specified as well. If the libraries to be trimmed are
              smallRNA then a2 will be set to the Illumina small RNA 5' adapter automatically
              (GATCGTCGGACT). A single base may also be given as e.g. -a2 A{10}, to be expanded to
              -a2 AAAAAAAAAA.

       --illumina              Adapter sequence to be trimmed is the first 13bp of the Illumina universal
              adapter 'AGATCGGAAGAGC' instead of the default auto-detection of adapter sequence.

       --nextera               Adapter sequence to be trimmed is the first 12bp of the Nextera adapter
              'CTGTCTCTTATA' instead of the default auto-detection of adapter sequence.

       --small_rna             Adapter sequence to be trimmed is the first 12bp of the Illumina Small RNA 3'
              Adapter 'TGGAATTCTCGG' instead of the default auto-detection of adapter sequence. Selecting
              to trim smallRNA adapters will also lower the --length value to 18bp. If the smallRNA
              libraries are paired-end then a2 will be set to the Illumina small RNA 5' adapter
              automatically (GATCGTCGGACT) unless -a2 has been defined explicitly.

       --consider_already_trimmed <INT>     During adapter auto-detection, the limit set by <INT> allows the
              user to set a threshold up to which the file is considered already adapter-trimmed. If no
              adapter sequence exceeds this threshold, no additional adapter trimming will be performed
              (technically, the adapter is set to '-a X'). Quality trimming is still performed as usual.
              Default: NOT SELECTED (i.e. normal auto-detection precedence rules apply).

       --max_length <INT>      Discard reads that are longer than <INT> bp after trimming. This is only
              advised for smallRNA sequencing to remove non-small RNA sequences.

       --stringency <INT>      Overlap with adapter sequence required to trim a sequence. Defaults to a
              very stringent setting of 1, i.e. even a single bp of overlapping sequence will be trimmed
              off from the 3' end of any read.

       -e <ERROR RATE>         Maximum allowed error rate (no. of errors divided by the length of the matching
              region) (default: 0.1)

       --gzip                  Compress the output file with GZIP. If the input files are GZIP-compressed
              the output files will automatically be GZIP compressed as well. As of v0.2.8 the
              compression will take place on the fly.

       --dont_gzip             Output files won't be compressed with GZIP. This option overrides --gzip.

       --length <INT>          Discard reads that became shorter than length INT because of either
              quality or adapter trimming. A value of '0' effectively disables this behaviour.
              Default: 20 bp.

              For paired-end files, both reads of a read-pair need to be longer than <INT> bp to be
              printed out to validated paired-end files (see option --paired). If only one read became
              too short there is the possibility of keeping such unpaired single-end reads (see
              --retain_unpaired). Default pair-cutoff: 20 bp.

       --max_n COUNT           The total number of Ns (as integer) a read may contain before it will be
              removed altogether. In a paired-end setting, either read exceeding this limit will result
              in the entire pair being removed from the trimmed output files.

       --trim-n                Removes Ns from either side of the read. This option does not currently
              work in RRBS mode.

       -o/--output_dir <DIR>   If specified all output will be written to this directory instead of the current
              directory. If the directory doesn't exist it will be created for you.

       --no_report_file        If specified no report file will be generated.

       --suppress_warn         If specified any output to STDOUT or STDERR will be suppressed.

       --clip_R1 <int>         Instructs Trim Galore to remove <int> bp from the 5' end of read 1 (or single-end
              reads). This may be useful if the qualities were very poor, or if there is some sort of
              unwanted bias at the 5' end. Default: OFF.

       --clip_R2 <int>         Instructs Trim Galore to remove <int> bp from the 5' end of read 2 (paired-end
              reads only). This may be useful if the qualities were very poor, or if there is some sort
              of unwanted bias at the 5' end. For paired-end BS-Seq, it is recommended to remove the
              first few bp because the end-repair reaction may introduce a bias towards low methylation.
              Please refer to the M-bias plot section in the Bismark User Guide for some examples.
              Default: OFF.

       --three_prime_clip_R1 <int>     Instructs Trim Galore to remove <int> bp from the 3' end of read 1 (or
              single-end reads) AFTER adapter/quality trimming has been performed. This may remove some
              unwanted bias from the 3' end that is not directly related to adapter sequence or basecall
              quality. Default: OFF.

       --three_prime_clip_R2 <int>     Instructs Trim Galore to remove <int> bp from the 3' end of read 2 AFTER
              adapter/quality trimming has been performed. This may remove some unwanted bias from the
              3' end that is not directly related to adapter sequence or basecall quality. Default: OFF.

       --2colour/--nextseq INT This enables the option '--nextseq-trim=3'CUTOFF' within Cutadapt, which will
              set a quality cutoff (that is normally given with -q instead), but qualities of G bases are
              ignored. This kind of trimming is common for the NextSeq and NovaSeq platforms, where
              basecalls without any signal are called as high-quality G bases. This is mutually exclusive
              with '-q INT'.

       --path_to_cutadapt </path/to/cutadapt>     You may use this option to specify a path to the Cutadapt
              executable, e.g. /my/home/cutadapt-1.7.1/bin/cutadapt. Else it is assumed that Cutadapt is
              in the PATH.

       --basename <PREFERRED_NAME>     Use PREFERRED_NAME as the basename for output files, instead of
              deriving the filenames from the input files. Single-end data would be called
              PREFERRED_NAME_trimmed.fq(.gz), or PREFERRED_NAME_val_1.fq(.gz) and
              PREFERRED_NAME_val_2.fq(.gz) for paired-end data. --basename only works when 1 file
              (single-end) or 2 files (paired-end) are specified, but not for longer lists.

       -j/--cores INT          Number of cores to be used for trimming [default: 1]. For Cutadapt to work with
              multiple cores, it requires Python 3 as well as parallel gzip (pigz) installed on the
              system. The version of Python used is detected from the shebang line of the Cutadapt
              executable (either 'cutadapt', or a specified path). If Python 2 is detected, --cores is set
              to 1. If pigz cannot be detected on your system, Trim Galore reverts to using gzip
              compression. Please note that gzip compression will slow down multi-core processes so much
              that it is hardly worthwhile (please see
              https://github.com/FelixKrueger/TrimGalore/issues/16#issuecomment-458557103 for more info).

              Actual core usage: the actual number of cores used is a little convoluted. Assuming that
              Python 3 is used and pigz is installed, --cores 2 would use 2 cores to read the input
              (probably not at a high usage though), 2 cores to write to the output (at moderately high
              usage), and 2 cores for Cutadapt itself + 2 additional cores for Cutadapt (not sure what
              they are used for) + 1 core for Trim Galore itself. So this can be up to 9 cores, even
              though most of them won't be used at 100% for most of the time. Paired-end processing uses
              twice as many cores for the validation (= writing out) step. --cores 4 would then be:
              4 (read) + 4 (write) + 4 (Cutadapt) + 2 (extra Cutadapt) + 1 (Trim Galore) = 15.

              It seems that --cores 4 could be a sweet spot, anything above has diminishing returns.
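
              The core-count arithmetic above can be written out as a tiny Python helper. It only
              restates the single-end figures quoted in this section (Python 3 and pigz assumed); the
              function name is hypothetical and the doubled write step for paired-end data is not
              modelled.

```python
def estimated_core_usage(cores):
    """Upper bound on busy cores for a single-end run, per the figures above."""
    read_input = cores      # decompressing/reading the input
    write_output = cores    # compressing/writing the output
    cutadapt = cores        # Cutadapt worker processes
    cutadapt_extra = 2      # additional Cutadapt processes mentioned above
    trim_galore = 1         # Trim Galore itself
    return read_input + write_output + cutadapt + cutadapt_extra + trim_galore
```

              For example, estimated_core_usage(2) gives 9 and estimated_core_usage(4) gives 15,
              matching the totals quoted above.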

       SPECIFIC TRIMMING - without adapter/quality trimming

       --hardtrim5 <int>       Instead of performing adapter-/quality trimming, this option will simply
              hard-trim sequences to <int> bp at the 5'-end. Once hard-trimming of files is complete,
              Trim Galore will exit. Hard-trimmed output files will end in .<int>_5prime.fq(.gz). Here is
              an example:

              before:         CCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAAT
              --hardtrim5 20: CCTAAGGAAACAAGTACACT

       --hardtrim3 <int>       Instead of performing adapter-/quality trimming, this option will simply
              hard-trim sequences to <int> bp at the 3'-end. Once hard-trimming of files is complete,
              Trim Galore will exit. Hard-trimmed output files will end in .<int>_3prime.fq(.gz). Here is
              an example:

              before:         CCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAAT
              --hardtrim3 20:                                                   TTTTTAAGAAAATGGAAAAT
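
              In Python terms, the two hard-trimming modes amount to plain slicing. The helper names
              below are illustrative only; a real run would apply the same slice to the quality string
              of each read.

```python
def hardtrim5(seq, length):
    """Keep the first `length` bases (the 5' end) of a read."""
    return seq[:length]


def hardtrim3(seq, length):
    """Keep the last `length` bases (the 3' end) of a read."""
    return seq[max(len(seq) - length, 0):]


read = "CCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAAT"
print(hardtrim5(read, 20))
print(hardtrim3(read, 20))
```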

       --clock                 In this mode, reads are trimmed in a specific way that is currently used for
              the Mouse Epigenetic Clock (see here: Multi-tissue DNA methylation age predictor in mouse,
              Stubbs et al., Genome Biology, 2017 18:68 https://doi.org/10.1186/s13059-017-1203-5).
              Following this, Trim Galore will exit.

              In its current implementation, the dual-UMI RRBS reads come in the following format:

              Read 1  5' UUUUUUUU CAGTA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF TACTG UUUUUUUU 3'
              Read 2  3' UUUUUUUU GTCAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF ATGAC UUUUUUUU 5'

              Where UUUUUUUU is a random 8-mer unique molecular identifier (UMI), CAGTA is a constant
              region, and FFFFFFF... is the actual RRBS-Fragment to be sequenced. The UMIs for Read 1 (R1)
              and Read 2 (R2), as well as the fixed sequences (F1 or F2), are written into the read ID and
              removed from the actual sequence. Here is an example:

              Before --clock trimming:

              R1: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT
                  ATCTAGTTCAGTACGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
              R2: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT
                  CAATTTTGCAGTACAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA

              After --clock trimming:

              R1: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT:R1:ATCTAGTT:R2:CAATTTTG:F1:CAGT:F2:CAGT
                  CGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
              R2: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT:R1:ATCTAGTT:R2:CAATTTTG:F1:CAGT:F2:CAGT
                  CAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA

              Following clock trimming, the resulting files (.clock_UMI.R1.fq(.gz) and
              .clock_UMI.R2.fq(.gz)) should be adapter- and quality trimmed with Trim Galore as usual. In
              addition, reads need to be trimmed by 15bp from their 3' end to get rid of potential UMI and
              fixed sequences. The command is:

              trim_galore --paired --three_prime_clip_R1 15 --three_prime_clip_R2 15 *.clock_UMI.R1.fq.gz *.clock_UMI.R2.fq.gz

              Following this, reads should be aligned with Bismark and deduplicated with UmiBam in
              '--dual_index' mode (see here: https://github.com/FelixKrueger/Umi-Grinder). UmiBam
              recognises the UMIs within this pattern: R1:(ATCTAGTT):R2:(CAATTTTG): as (UMI R1) and
              (UMI R2).
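
              The read-ID rewriting shown in the --clock example can be approximated with the Python
              sketch below. The offsets are inferred from that example only (8 bp UMI, 4 bp of the
              constant region recorded, 13 bp removed in total) and are assumptions of this sketch
              rather than a statement about Trim Galore's source code; quality strings are left out
              for brevity.

```python
UMI_LEN = 8        # length of the UMI at the start of each read
FIXED_TAG_LEN = 4  # constant-region bases written to the ID in the example
REMOVED_LEN = 13   # UMI plus 5 bp constant region removed from the read


def clock_trim(id1, seq1, id2, seq2):
    """Move UMIs and fixed sequences of a read pair into both read IDs."""
    umi1, umi2 = seq1[:UMI_LEN], seq2[:UMI_LEN]
    fix1 = seq1[UMI_LEN:UMI_LEN + FIXED_TAG_LEN]
    fix2 = seq2[UMI_LEN:UMI_LEN + FIXED_TAG_LEN]
    tag = f":R1:{umi1}:R2:{umi2}:F1:{fix1}:F2:{fix2}"
    return (id1 + tag, seq1[REMOVED_LEN:]), (id2 + tag, seq2[REMOVED_LEN:])
```

              Applied to the example pair above, this sketch yields the IDs and shortened sequences
              listed in that example.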

       --polyA                 This is a new, still experimental, trimming mode to identify and remove poly-A
              tails from sequences.

              When --polyA is selected, Trim Galore attempts to identify from the first supplied sample
              whether sequences more often contain a stretch of either 'AAAAAAAAAA' or 'TTTTTTTTTT'. This
              determines if Read 1 of a paired-end file, or single-end files, are trimmed for PolyA or
              PolyT. In case of paired-end sequencing, Read2 is trimmed for the complementary base from
              the start of the reads. The auto-detection uses a default of A{20} for Read1 (3'-end
              trimming) and T{150} for Read2 (5'-end trimming). These values may be changed manually
              using the options -a and -a2.

              In addition to trimming the sequences, white spaces are replaced with _ and it records in
              the read ID how many bases were trimmed so it can later be used to identify PolyA trimmed
              sequences. This is currently done by writing tags to both the start ("32:A:") and end
              ("_PolyA:32") of the read IDs, as in the following example:

       @READ-ID:1:1102:22039:36996 1:N:0:CCTAATCC
              GCCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAATAAAAACTTTATAAACACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

       @32:A:READ-ID:1:1102:22039:36996_1:N:0:CCTAATCC_PolyA:32
              GCCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAATAAAAACTTTATAAACACC

              PLEASE NOTE: The poly-A trimming mode expects that sequences were both adapter and quality
              trimmed before looking for Poly-A tails, and it is the user's responsibility to carry out an
              initial round of trimming. The following sequence:

              1) trim_galore file.fastq.gz
              2) trim_galore --polyA file_trimmed.fq.gz
              3) zcat file_trimmed_trimmed.fq.gz | grep -A 3 PolyA | grep -v ^-- > PolyA_trimmed.fastq

              will 1) trim qualities and Illumina adapter contamination, and 2) find and remove PolyA
              contamination. Finally, if desired, 3) will specifically write PolyA trimmed sequences to a
              FastQ file of your choice.
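
              The read-ID tagging used by --polyA can be mimicked in Python as follows. Note the
              substitution: Trim Galore delegates the actual trimming to Cutadapt with an A{20}
              adapter, whereas this sketch simply strips an exact run of trailing A bases. Leaving
              untrimmed reads unchanged is likewise an assumption of the sketch, and the function name
              is hypothetical.

```python
def tag_polya(read_id, seq):
    """Strip a trailing run of A bases and record its length in the read ID."""
    trimmed = seq.rstrip("A")
    removed = len(seq) - len(trimmed)
    if removed == 0:
        # Assumption: reads without a poly-A tail are returned as they are.
        return read_id, seq
    # Replace white space with "_" and add the start and end tags.
    name = "_".join(read_id[1:].split())
    return f"@{removed}:A:{name}_PolyA:{removed}", trimmed


body = ("GCCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGG"
        "AAAATAAAAACTTTATAAACACC")
new_id, new_seq = tag_polya("@READ-ID:1:1102:22039:36996 1:N:0:CCTAATCC",
                            body + "A" * 32)
print(new_id)
```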

       --implicon              This is a special mode of operation for paired-end data, such as required for
              the IMPLICON method, where a UMI sequence is transferred from the start of Read 2 to the
              readID of both reads. Following this, Trim Galore will exit.

              In its current implementation, the UMI carrying reads come in the following format:

              Read 1  5' FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF 3'
              Read 2  3' UUUUUUUUFFFFFFFFFFFFFFFFFFFFFFFFFFFF 5'

              Where UUUUUUUU is a random 8-mer unique molecular identifier (UMI) and FFFFFFF... is the
              actual fragment to be sequenced. The UMI of Read 2 (R2) is written into the read ID of both
              reads and removed from the actual sequence. Here is an example:

              R1: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT
                  ATCTAGTTCAGTACGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
              R2: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT
                  CAATTTTGCAGTACAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA

              After --implicon trimming:

              R1: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT:CAATTTTG
                  ATCTAGTTCAGTACGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
              R2: @HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT:CAATTTTG
                  CAGTACAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA
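
              A minimal Python sketch of the UMI transfer shown in this example follows. The function
              name and the umi_len parameter are illustrative; quality strings, which would be
              shortened in the same way as Read 2, are omitted.

```python
def implicon_transfer(id1, seq1, id2, seq2, umi_len=8):
    """Append the UMI from the start of Read 2 to both IDs; trim it off Read 2."""
    umi = seq2[:umi_len]
    return (f"{id1}:{umi}", seq1), (f"{id2}:{umi}", seq2[umi_len:])
```

              Read 1 keeps its full sequence; only its ID changes.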

       RRBS-specific options (MspI digested material):

       --rrbs                  Specifies that the input file was an MspI digested RRBS sample (recognition
              site: CCGG). Single-end or Read 1 sequences (paired-end) which were adapter-trimmed will
              have a further 2 bp removed from their 3' end. Sequences which were merely trimmed because
              of poor quality will not be shortened further. Read 2 of paired-end libraries will in
              addition have the first 2 bp removed from the 5' end (by setting '--clip_R2 2'). This is to
              avoid using artificial methylation calls from the filled-in cytosine positions close to the
              3' MspI site in sequenced fragments. This option is not recommended for users of the NuGEN
              Ovation RRBS System 1-16 kit (see below).

       --non_directional       Selecting this option for non-directional RRBS libraries will screen
              quality-trimmed sequences for 'CAA' or 'CGA' at the start of the read and, if found, remove
              the first two basepairs. As with the option '--rrbs', this avoids using cytosine positions
              that were filled in during the end-repair step. '--non_directional' requires '--rrbs' to be
              specified as well. Note that this option does not set '--clip_R2 2' in paired-end mode.
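
              The screening step for non-directional libraries can be pictured with this short Python
              sketch. The helper name is hypothetical, and the quality string would need the same
              2 bp removed.

```python
def clip_non_directional(seq):
    """Drop the first two bases if the read starts with 'CAA' or 'CGA'."""
    if seq.startswith(("CAA", "CGA")):
        return seq[2:]
    return seq
```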

       --keep                  Keep the quality trimmed intermediate file. Default: off, which means
              the temporary file is deleted after adapter trimming. Only has an effect for RRBS samples
              since other FastQ files are not trimmed for poor qualities separately.

       Note for RRBS using the NuGEN Ovation RRBS System 1-16 kit:

       Owing to the fact that the NuGEN Ovation kit attaches a varying number of nucleotides (0-3) after each
       MspI site, Trim Galore should be run WITHOUT the option --rrbs. This trimming is accomplished in a
       subsequent diversity trimming step (see their manual).

       Note for RRBS using MseI:

       If your DNA material was digested with MseI (recognition motif: TTAA) instead of MspI it is NOT necessary
       to specify --rrbs or --non_directional since virtually all reads should start with the sequence 'TAA',
       and this holds true for both directional and non-directional libraries. As the end-repair of 'TAA'
       restricted sites does not involve any cytosines it does not need to be treated specially. Instead,
       simply run Trim Galore! in the standard (i.e. non-RRBS) mode.

       Paired-end specific options:

       --paired                This option performs length trimming of quality/adapter/RRBS trimmed reads for
              paired-end files. To pass the validation test, both sequences of a sequence pair are
              required to have a certain minimum length which is governed by the option --length (see
              above). If only one read passes this length threshold, that read can be rescued (see
              option --retain_unpaired). Using this option lets you discard too short read pairs without
              disturbing the sequence-by-sequence order of FastQ files which is required by many aligners.

              Trim Galore! expects paired-end files to be supplied in a pairwise fashion, e.g.
              file1_1.fq file1_2.fq SRR2_1.fq.gz SRR2_2.fq.gz ... .

       -t/--trim1              Trims 1 bp off every read from its 3' end. This may be needed for FastQ files
              that are to be aligned as paired-end data with Bowtie. This is because Bowtie (1) regards
              alignments like this:

              R1 --------------------------->     or this:    ----------------------->  R1
              R2 <---------------------------                       <-----------------  R2

              as invalid (whenever a start/end coordinate is contained within the other read).
              NOTE: If you are planning to use Bowtie2, BWA etc. you don't need to specify this option.

       --retain_unpaired       If only one of the two paired-end reads became too short, the longer
              read will be written to either '.unpaired_1.fq' or '.unpaired_2.fq' output files. The
              length cutoff for unpaired single-end reads is governed by the parameters -r1/--length_1
              and -r2/--length_2. Default: OFF.

       -r1/--length_1 <INT>    Unpaired single-end read length cutoff needed for read 1 to be written to
              '.unpaired_1.fq' output file. These reads may be mapped in single-end mode. Default: 35 bp.

       -r2/--length_2 <INT>    Unpaired single-end read length cutoff needed for read 2 to be written to
              '.unpaired_2.fq' output file. These reads may be mapped in single-end mode. Default: 35 bp.
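
              How --length, --retain_unpaired, -r1 and -r2 interact can be summarised in a short Python
              sketch. The function name and the returned labels are made up for illustration, and the
              exact boundary handling (a read counts as long enough when its length is at least the
              cutoff) is an assumption of the sketch.

```python
def route_pair(len1, len2, length=20, retain_unpaired=False,
               length_1=35, length_2=35):
    """Return the output files a trimmed read pair would be written to."""
    # Both reads long enough: the pair passes validation.
    if len1 >= length and len2 >= length:
        return ["val_1", "val_2"]
    # Pair failed validation; without --retain_unpaired it is dropped.
    if not retain_unpaired:
        return []
    kept = []
    if len1 >= length_1:
        kept.append("unpaired_1")
    if len2 >= length_2:
        kept.append("unpaired_2")
    return kept
```

              With the defaults, a pair with lengths 50 and 10 is dropped entirely; with
              retain_unpaired=True the 50 bp read goes to the unpaired_1 output.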

       Last modified on 07 October 2020.

AUTHOR

        This manpage was written by Nilesh Patra for the Debian distribution and
        can be used for any other usage of the program.

trim_galore 0.6.7                                   July 2021                                     TRIM_GALORE(1)