Ubuntu Manpage: gsnap - Genomic Short-read Nucleotide Alignment Program

Provided by: gmap_2021-12-17+ds-1_amd64

NAME

       gsnap - Genomic Short-read Nucleotide Alignment Program

SYNOPSIS

       gsnap [OPTIONS...] <FASTA file>, or cat <FASTA file> | gmap [OPTIONS...]

OPTIONS

   Input options (must include -d)
       -D, --dir=directory
              Genome directory.  Default (as specified by --with-gmapdb to the configure program)
              is /var/cache/gmap

       -d, --db=STRING
              Genome database

       --use-localdb=INT
              Whether to use the local hash tables, which help with  finding  extensions  to  the
              ends  of alignments in the presence of splicing or indels (0=no, 1=yes if available
              (default))

       Transcriptome-guided options (optional)

       -C, --transcriptdir=directory
              Transcriptome directory.  Default is the value for --dir above

       -c, --transcriptdb=STRING
              Transcriptome database

       --use-transcriptome-only
              Use only the transcriptome index and not the genome index

       Computation options

       -k, --kmer=INT
              kmer size to use in genome database (allowed values: 16 or less) If not  specified,
              the program will find the highest available kmer size in the genome database

       --sampling=INT
              Sampling  to  use  in genome database.  If not specified, the program will find the
              smallest available sampling value in the genome database within selected k-mer size

       -q, --part=INT/INT
              Process only the i-th out of every n sequences e.g., 0/100 or  99/100  (useful  for
              distributing jobs to a computer farm).

       --input-buffer-size=INT
              Size  of  input buffer (program reads this many sequences at a time for efficiency)
              (default 10000)

       --barcode-length=INT
              Amount of barcode to remove from start of every read before alignment (default 0)

       --endtrim-length=INT
              Amount of trim to remove from the end of every read before alignment (default 0)

       --orientation=STRING
              Orientation of paired-end reads Allowed values: FR (fwd-rev, or  typical  Illumina;
              default),  RF (rev-fwd, for circularized inserts), or FF (fwd-fwd, same strand), or
              10X (single-cell where read 1 has barcode information; read 2 is rev)

       --10x-whitelist=FILE
              Whitelist of 10X Genomics GEM  bead  barcodes,  needed  to  perform  correction  of
              cellular      barcodes.       This      file      can      be      obtained      at
              cellranger-x.y.z/lib/python/cellranger/barcodes (for  Cell  Ranger  version  >=  4)
              cellranger-x.y.z/lib/cellranger-cs/x.y.z/lib/python/cellranger/barcodes (<= 3)

       --10x-well-position=INT
              Position  of  well information in the accession, when separated by colons If set to
              0, then no well information will be printed in the CB field (default: 4)

       --fastq-id-start=INT
              Starting position of identifier in FASTQ header, space-delimited (>= 1)

       --fastq-id-end=INT
              Ending position of identifier in FASTQ header, space-delimited (>= 1)

       Examples:

       @HWUSI-EAS100R:6:73:941:1973#0/1
              start=1, end=1 (default) => identifier is HWUSI-EAS100R:6:73:941:1973#0

       @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
              start=1, end=1  => identifier is  SRR001666.1  start=2,  end=2   =>  identifier  is
              071112_SLXA-EAS1_s_7:5:1:817:345  start=1,  end=2   =>  identifier  is  SRR001666.1
              071112_SLXA-EAS1_s_7:5:1:817:345

       --force-single-end
              When multiple FASTQ files are provided on the command line, GSNAP assumes they  are
              matching paired-end files.  This flag treats each file as single-end.

       --filter-chastity=STRING
              Skips  reads marked by the Illumina chastity program.  Expecting a string after the
              accession having a 'Y' after the first colon, like this:

       @accession 1:Y:0:CTTGTA
              where the 'Y' signifies filtering by  chastity.   Values:  off  (default),  either,
              both.   For  'either',  a  'Y' on either end of a paired-end read will be filtered.
              For 'both', a 'Y' is required on both ends of a paired-end read (or on the only end
              of a single-end read).

       --allow-pe-name-mismatch
              Allows accession names of reads to mismatch in paired-end files

       --interleaved
              Input is in interleaved format (one read per line, tab-delimited

       --gunzip
              Uncompress gzipped input files

       --bunzip2
              Uncompress bzip2-compressed input files

       Computation options

       -B, --batch=INT
              Batch mode (default = 2)

                       Mode      Hash offsets  Hash positions  Genome          Local hash offsets
              Local hash positions

       0      allocate      mmap            mmap            allocate            mmap

       1      allocate      mmap & preload  mmap            allocate            mmap & preload

       2      allocate      mmap & preload  mmap & preload  allocate            mmap & preload

       3      allocate      allocate        mmap & preload  allocate            allocate

       (default)
                         4        allocate         allocate           allocate           allocate
              allocate

       Note: For a single sequence, all data structures use mmap
              A batch level of 5 means the same as 4, and is kept only for backward compatibility

       --use-shared-memory=INT
              If  1,  then  allocated  memory  is  shared  among  all processes on this node If 0
              (default), then each process has private allocated memory

       --preload-shared-memory
              Load files indicated by --batch mode into shared memory for use by other GMAP/GSNAP
              processes on this node, and then exit.  Ignore any input files.

       --unload-shared-memory
              Unload  files  indicated  by  --batch  mode into shared memory, or allow them to be
              unloaded when existing GMAP/GSNAP processes on this node are  finished  with  them.
              Ignore any input files.

       -m, --max-mismatches=FLOAT
              Maximum  number  of  mismatches allowed (if not specified, then GSNAP tries to find
              the best possible match in the genome) If  specified  between  0.0  and  1.0,  then
              treated  as  a  fraction  of  each  read length.  Otherwise, treated as an integral
              number of mismatches (including indel and splicing penalties).  Default is 0.3

       --max-ref-mismatches=FLOAT
              If GSNAP is run under SNP-tolerant or masked genome mode, then the --max-mismatches
              parameter  above  is  for  mismatches  against  reference  and SNPs, or against the
              unmasked genome,  respectively.   The  --max-ref-mismatches  parameter  is  against
              mismatches against the reference genome or masked genome.

       --min-coverage=FLOAT
              Minimum coverage required for an alignment.  If specified between 0.0 and 1.0, then
              treated as a fraction of each read  length.   Otherwise,  treated  as  an  integral
              number of base pairs.  Default value is 0.5.

       --filter-within-trims=INT
              Whether  to count mismatches in trimmed part of alignment (1, yes) or mismatches to
              the  ends  of  the  read  (0,  no),  when   applying   the   --max-mismatches   and
              --max-ref-mismatches  parameters.   Default  for RNA-Seq is 1 (yes) so we can allow
              for reads that align past the ends of an exon.  Default for DNA-Seq is 0 (no).

       For RNA-Seq, trimmed ends should be ignored, because trimming
              is performed at probable splice sites, to allow for reads that align past the  ends
              of an exon.

       --query-unk-mismatch=INT
              Whether to count unknown (N) characters in the query as a mismatch (0=no (default),
              1=yes)

       --genome-unk-mismatch=INT
              Whether to count unknown (N) characters in the genome as a mismatch (0=no,  1=yes).
              If --use-mask is specified, default is no, otherwise yes.

       -i, --indel-penalty=INT
              Penalty  for  an  indel  (default  2).  Counts against mismatches allowed.  To find
              indels, make indel-penalty less than or equal to max-mismatches.  A value <  2  can
              lead to false positives at read ends

       --indel-endlength=INT
              Minimum length at end required for indel alignments (default 4)

       -y, --max-middle-insertions=FLOAT
              Maximum  number  of middle insertions allowed (default is 0.2) If specified between
              0.0 and 1.0, then treated as a fraction of each read length.  Otherwise, treated as
              an integral number of base pairs

       -z, --max-middle-deletions=FLOAT
              Maximum  number  of middle deletions allowed (default 0.2) If specified between 0.0
              and 1.0, then treated as a fraction of each read length.  Otherwise, treated as  an
              integral number of base pairs

       -Y, --max-end-insertions=INT
              Maximum number of end insertions allowed (default 3)

       -Z, --max-end-deletions=INT
              Maximum number of end deletions allowed (default 3)

       -M, --suboptimal-levels=INT
              Report  suboptimal  hits  beyond best hit (default 0) All hits with best score plus
              suboptimal-levels are reported

       -a, --adapter-strip=STRING
              Method for removing adapters from reads.  Currently allowed  values:  off,  paired.
              Default  is  "off".   To  turn  on,  specify  "paired", which removes adapters from
              paired-end reads if they appear to be present.

       --trim-indel-score=INT
              Score to use for indels when trimming at ends.  To turn off  trimming,  specify  0.
              Default  is  -2  for  both  RNA-Seq  and DNA-Seq.  Warning: Turning trimming off in
              RNA-Seq can give false positive indels at the ends of reads

       -e, --use-mask=STRING
              Use genome containing masks (e.g. for non-exons) for scoring preference

       -V, --snpsdir=STRING
              Directory for SNPs index files (created using snpindex)  (default  is  location  of
              genome index files specified using -D and -d)

       -v, --use-snps=STRING
              Use  database  containing  known  SNPs  (in  <STRING>.iit,  built  previously using
              snpindex) for tolerance to SNPs

       --cmetdir=STRING
              Directory for methylcytosine index files  (created  using  cmetindex)  (default  is
              location of genome index files specified using -D, -V, and -d)

       --atoidir=STRING
              Directory  for A-to-I RNA editing index files (created using atoiindex) (default is
              location of genome index files specified using -D, -V, and -d)

       --mode=STRING
              Alignment mode: standard (default), cmet-stranded, cmet-nonstranded, atoi-stranded,
              atoi-nonstranded,  ttoc-stranded, or ttoc-nonstranded.  Non-standard modes requires
              you to have previously run the cmetindex or atoiindex programs  (which  also  cover
              the ttoc modes) on the genome

       -t, --nthreads=INT
              Number of worker threads

       --max-anchors=INT
              Controls  number  of  candidate  segments  returned  by  the complete set algorithm
              Default is 10.  Can be increased to higher values to solve alignments  with  evenly
              spaced  mismatches  at close distances.  However, higher values will cause GSNAP to
              run more slowly.  A value of 1000, for example, slows down the program by a  factor
              of 10 or so.  Therefore, change this value only if absolutely necessary.

       Splicing options for DNA-Seq

       --find-dna-chimeras=INT
              Look  for  distant  splicing  involving  poor  splice  sites  (0=no,  1=yes) If not
              specified, then default  is  to  be  on  unless  only  known  splicing  is  desired
              (--use-splicing is specified and --novelsplicing is off)

       Splicing options for RNA-Seq

       -N, --novelsplicing=INT
              Look for novel splicing (0=no (default), 1=yes)

       --splicingdir=STRING
              Directory  for splicing involving known sites or known introns, as specified by the
              -s or --use-splicing flag (default is directory computed from  -D  and  -d  flags).
              Note: can just give full pathname to the -s flag instead.

       -s, --use-splicing=STRING
              Look  for  splicing  involving  known  sites or known introns (in <STRING>.iit), at
              short or long distances See README instructions for the distinction  between  known
              sites and known introns

       --ambig-splice-noclip
              For  ambiguous  known splicing at ends of the read, do not clip at the splice site,
              but extend instead into the intron.  This flag makes sense only if you provide  the
              --use-splicing  flag,  and  you  are  trying  to  eliminate  all soft clipping with
              --trim-mismatch-score=0

       -w, --localsplicedist=INT
              Definition of local novel splicing event (default 200000)

       --novelend-splicedist=INT
              Distance to look for novel splices at the ends of reads (default 80000)

       --local-splice-penalty=INT
              Penalty for a local splice (default 0).  Counts against mismatches allowed

       --fusion-sensitivity=INT
              Sensitivity for finding fusions

       --distant-splice-penalty=INT
              Penalty for a distant splice (default 1).  A distant splice is one where the intron
              length exceeds the value of -w, or --localsplicedist, or is an inversion, scramble,
              or translocation  between  two  different  chromosomes  Counts  against  mismatches
              allowed

       --distant-splice-endlength=INT
              Minimum  length  at  end  required  for distant spliced alignments (default 20, min
              allowed is the value of -k, or kmer size)

       --shortend-splice-endlength=INT
              Minimum length at end required for short-end spliced  alignments  (default  2,  but
              unless  known  splice sites are provided with the -s flag, GSNAP may still need the
              end length to be the value of -k, or kmer size to find a given splice

       --distant-splice-identity=FLOAT
              Minimum identity at end required for distant spliced alignments (default 0.95)

       --antistranded-penalty=INT
              (Not  currently  implemented,  since  it  leads  to  poor  results)   Penalty   for
              antistranded  splicing  when  using  stranded RNA-Seq protocols.  A positive value,
              such as 1, expects antisense on the first  read  and  sense  on  the  second  read.
              Default is 0, which treats sense and antisense equally well

       --merge-distant-samechr
              Report  distant  splices  on  the  same chromosome as a single splice, if possible.
              Will produce a single SAM line instead of two SAM lines, which  is  also  done  for
              translocations, inversions, and scramble events

       Options for paired-end reads

       --pairmax-dna=INT
              Max  total genomic length for DNA-Seq paired reads, or other reads without splicing
              (default 2000).  Used if -N or -s is not specified.  This value is  also  used  for
              circular chromosomes when splicing in linear chromosomes is allowed

       --pairmax-rna=INT
              Max total genomic length for RNA-Seq paired reads, or other reads that could have a
              splice (default 200000).  Used if -N or -s is specified.  Should probably match the
              value for -w, --localsplicedist.

       --pairexpect=INT
              Expected  paired-end  length, used for calling splices in medial part of paired-end
              reads (default 500).  Was turned off in previous versions, but reinstated.

       --pairdev=INT
              Allowable deviation from expected paired-end length, used for  calling  splices  in
              medial  part  of  paired-end  reads  (default  100).   Was  turned  off in previous
              versions, but reinstated.

       Options for quality scores

       --quality-protocol=STRING
              Protocol for  input  quality  scores.   Allowed  values:  illumina  (ASCII  64-126)
              (equivalent to -J 64 -j -31) sanger   (ASCII 33-126) (equivalent to -J 33 -j 0)

       Default is sanger (no quality print shift)
              SAM output files should have quality scores in sanger protocol

              Or you can customize this behavior with these flags:

       -J, --quality-zero-score=INT
              FASTQ  quality  scores  are  zero  at  this  ASCII  value (default is 33 for sanger
              protocol; for Illumina, select 64)

       -j, --quality-print-shift=INT
              Shift FASTQ quality scores by this amount  in  output  (default  is  0  for  sanger
              protocol; to change Illumina input to Sanger output, select -31)

       Output options

       -n, --npaths=INT
              Maximum number of paths to print (default 100).

       -Q, --quiet-if-excessive
              If more than maximum number of paths are found, then nothing is printed.

       -O, --ordered
              Print output in same order as input (relevant only if there is more than one worker
              thread)

       --show-refdiff
              For GSNAP output in SNP-tolerant alignment, shows all differences relative  to  the
              reference  genome  as  lower  case (otherwise, it shows all differences relative to
              both the reference and alternate genome)

       --clip-overlap
              For paired-end reads whose alignments overlap, clip the overlapping region.

       --merge-overlap
              For paired-end reads whose alignments overlap, merge the two ends into a single end
              (beta implementation)

       --print-snps
              Print  detailed  information  about  SNPs in reads (works only if -v also selected)
              (not fully implemented yet)

       --failsonly
              Print only failed alignments, those with no results

       --nofails
              Exclude printing of failed alignments

       --only-concordant
              Print    only    concordant    alignments    (concordant_uniq,     concordant_mult,
              concordant_circular)

       --omit-concordant-uniq
              Do not print any concordant_uniq alignments

       --omit-concordant-mult
              Do not print any concordant_mult alignments

       --omit-softclipped
              Do not allow any alignments with soft clips

       -A, --format=STRING
              Another  format  type,  other  than default.  Currently implemented: sam, m8 (BLAST
              tabular format)

       --split-output=STRING
              Basename for multiple-file  output,  separately  for  nomapping,  halfmapping_uniq,
              halfmapping_mult,    unpaired_uniq,    unpaired_mult,   paired_uniq,   paired_mult,
              concordant_uniq, and concordant_mult results

       -o, --output-file=STRING
              File name for a single stream of output results.

       --failed-input=STRING
              Print completely failed alignments as input FASTA or FASTQ  format,  to  the  given
              file,  appending .1 or .2, for paired-end data.  If the --split-output flag is also
              given, this file is generated in addition to the output in the .nomapping file.

       --append-output
              When --split-output or --failed-input is given, this flag will append output to the
              existing files.  Otherwise, the default is to create new files.

       --order-among-best=STRING
              Among  alignments  tied  with the best score, order those alignments in this order.
              Allowed values: genomic, random (default)

       --output-buffer-size=INT
              Buffer size, in queries, for output thread (default  1000).   When  the  number  of
              results  to  be printed exceeds this size, worker threads wait until the backlog is
              cleared

       Options for SAM output

       --no-sam-headers
              Do not print headers beginning with '@'

       --add-paired-nomappers
              Add nomapper lines as needed to make all paired-end results alternate between first
              end and second end

       --paired-flag-means-concordant=INT
              Whether  the  paired  bit in the SAM flags means concordant only (1) or paired plus
              concordant (0, default)

       --sam-headers-batch=INT
              Print headers only for this batch, as specified by -q

       --sam-hardclip-use-S
              Use S instead of H for hardclips

       --sam-use-0M
              Insert 0M in CIGAR between  adjacent  insertions,  deletions,  and  introns  Picard
              disallows 0M, other tools may require it

       --sam-extended-cigar
              Use  extended CIGAR format (using X and = symbols instead of M, to indicate matches
              and mismatches, respectively

       --sam-multiple-primaries
              Allows multiple alignments to be marked  as  primary  if  they  have  equally  good
              mapping scores

       --sam-sparse-secondaries
              For  secondary alignments (in multiple mappings), uses '*' for SEQ and QUAL fields,
              to give smaller file sizes.  However, the output will give warnings  in  Picard  to
              give warnings and may not work with downstream tools

       --force-xs-dir
              For  RNA-Seq  alignments, disallows XS:A:? when the sense direction is unclear, and
              replaces this value arbitrarily with XS:A:+.  May be useful for some programs, such
              as  Cufflinks,  that  cannot  handle  XS:A:?.   However,  if you use this flag, the
              reported value of XS:A:+ in these cases will not be meaningful.

       --md-lowercase-snp
              In MD string, when  known  SNPs  are  given  by  the  -v  flag,  prints  difference
              nucleotides  as  lower-case  when  they,  differ  from  reference but match a known
              alternate allele

       --extend-soft-clips
              Extends alignments through soft clipped regions.  CIGAR string and coordinates will
              be revised, but currently the MD string will reflect the clipped CIGAR

       --action-if-cigar-error
              Action  to take if there is a disagreement between CIGAR length and sequence length
              Allowed values: ignore, warning (default), noprint, abort  Note  that  the  noprint
              option does not print the CIGAR string at all if there is an error, so it may break
              a SAM parser

       --read-group-id=STRING
              Value to put into read-group id (RG-ID) field

       --read-group-name=STRING
              Value to put into read-group name (RG-SM) field

       --read-group-library=STRING
              Value to put into read-group library (RG-LB) field

       --read-group-platform=STRING
              Value to put into read-group library (RG-PL) field

       Help options

       --check
              Check compiler assumptions

       --version
              Show version

       --help Show this help message

       Other tools of GMAP suite are located in /usr/lib/gmap