Ubuntu Manpage: gsnap - Genomic Short-read Nucleotide Alignment Program

Provided by: gmap_2024-11-20+ds-1_amd64

NAME

       gsnap - Genomic Short-read Nucleotide Alignment Program

SYNOPSIS

       gsnap [OPTIONS...] <FASTA file>, or cat <FASTA file> | gsnap [OPTIONS...]

OPTIONS

   Input options (must include -d)
       -D, --dir=directory
              Genome directory.  Default (as specified by --with-gmapdb to the configure program)
              is /var/cache/gmap

       -d, --db=STRING
              Genome database

       --two-pass
              Two-pass mode, in which the sequences are processed first to identify splice  sites
              and introns, and then aligned using this splicing information

       --use-localdb=INT
              Whether  to  use the local suffix arrays, which help with finding extensions to the
              ends of alignments in the presence of splicing or indels (0=no, 1=yes if  available
              (default))

       Transcriptome-guided options (optional)

       -C, --transcriptdir=directory
              Transcriptome directory.  Default is the value for --dir above

       -c, --transcriptdb=STRING
              Transcriptome database

       --transcriptome-mode=STRING
              Options:  assist,  only,  annotate  (default).   The  option  assist  means  to try
              transcriptome alignment first, but then use genomic alignment if nothing is  found.
              The  option  only  means  to try transcriptome alignment only.  The option annotate
              means to try only genomic alignment, to use the transcriptome only for  annotation;
              this is the fastest option.  In the other two options, annotation is also performed

       Computation options

       -k, --kmer=INT
              k-mer size to use in genome database (allowed values: 16 or less).  If not specified,
              the program will find the highest available k-mer size in the genome database

       --sampling=INT
              Sampling to use in genome database.  If not specified, the program  will  find  the
              smallest available sampling value in the genome database within selected k-mer size

       --align-fraction=FLOAT
              Process only the given fraction of reads, selected at random.  If --align-fraction
              and --part are both given, --align-fraction takes precedence

       -q, --part=INT/INT
              Process only the i-th out of every n sequences, e.g., 0/100 or 99/100 (useful for
              distributing jobs to a computer farm).
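
              As a minimal sketch of how -q could be used to spread one input file over several
              jobs, the following Python helper builds one argument list per shard.  The function
              name and the placeholder values "hg38" and "reads.fastq" are illustrative
              assumptions; only the -d and -q flags come from this page.

```python
def shard_commands(genome_db, reads_path, n_shards):
    """Return one gsnap argument list per shard, each using -q i/n."""
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    commands = []
    for i in range(n_shards):
        # -q i/n asks gsnap to process only the i-th out of every n sequences;
        # the examples on this page (0/100, 99/100) suggest i runs from 0 to n-1.
        commands.append(
            ["gsnap", "-d", genome_db, "-q", f"{i}/{n_shards}", reads_path]
        )
    return commands


if __name__ == "__main__":
    # "hg38" and "reads.fastq" are placeholder names, not defaults of gsnap.
    for argv in shard_commands("hg38", "reads.fastq", 4):
        print(" ".join(argv))
```

              Each printed line can then be submitted as a separate job; the lists are built but
              not executed here.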

       --input-buffer-size=INT
              Size  of  input buffer (program reads this many sequences at a time for efficiency)
              (default 10000)

       --barcode-length=INT
              Amount of barcode to remove from start of every read before alignment (default 0)

       --endtrim-length=INT
              Amount of trim to remove from the end of every read before alignment (default 0)

       --orientation=STRING
              Orientation of paired-end reads.  Allowed values: FR (fwd-rev, or typical Illumina;
              default), RF (rev-fwd, for circularized inserts), FF (fwd-fwd, same strand), or
              10X (single-cell where read 1 has barcode information; read 2 is rev)

       --10x-whitelist=FILE
              Whitelist of 10X Genomics GEM bead barcodes, needed to perform correction of
              cellular barcodes.  This file can be obtained at
              cellranger-x.y.z/lib/python/cellranger/barcodes (for Cell Ranger version >= 4) or
              cellranger-x.y.z/lib/cellranger-cs/x.y.z/lib/python/cellranger/barcodes (<= 3)

       --10x-well-position=INT
              Position of well information in the accession, when separated by colons.  If set to
              0, then no well information will be printed in the CB field (default: 4)

       --fastq-id-start=INT
              Starting position of identifier in FASTQ header, space-delimited (>= 1)

       --fastq-id-end=INT
              Ending position of identifier in FASTQ header, space-delimited (>= 1)

       Examples:

       @HWUSI-EAS100R:6:73:941:1973#0/1
              start=1, end=1 (default) => identifier is HWUSI-EAS100R:6:73:941:1973#0

       @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
              start=1, end=1 => identifier is SRR001666.1
              start=2, end=2 => identifier is 071112_SLXA-EAS1_s_7:5:1:817:345
              start=1, end=2 => identifier is SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
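
              The start/end selection in these examples can be sketched in Python as follows.
              This is an illustration of the documented behavior, not GSNAP's own code: the
              function name is hypothetical, and dropping a trailing /1 or /2 is an assumption
              inferred from the first example.

```python
import re


def fastq_identifier(header, start=1, end=1):
    """Join the space-delimited header fields start..end (1-based, inclusive)."""
    if start < 1 or end < start:
        raise ValueError("require 1 <= start <= end")
    # Drop the leading '@' of the FASTQ header line, then split on whitespace.
    fields = header.lstrip("@").split()
    identifier = " ".join(fields[start - 1:end])
    # Assumption: a trailing /1 or /2 read-end suffix is removed,
    # as the first example on this page suggests.
    return re.sub(r"/[12]$", "", identifier)


if __name__ == "__main__":
    print(fastq_identifier("@HWUSI-EAS100R:6:73:941:1973#0/1"))
    print(fastq_identifier(
        "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36", 2, 2))
```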

       --force-single-end
              When multiple FASTQ files are provided on the command line, GSNAP assumes they  are
              matching paired-end files.  This flag treats each file as single-end.

       --filter-chastity=STRING
              Skips  reads marked by the Illumina chastity program.  Expecting a string after the
              accession having a 'Y' after the first colon, like this:

       @accession 1:Y:0:CTTGTA
              where the 'Y' signifies filtering by  chastity.   Values:  off  (default),  either,
              both.   For  'either',  a  'Y' on either end of a paired-end read will be filtered.
              For 'both', a 'Y' is required on both ends of a paired-end read (or on the only end
              of a single-end read).
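
              A minimal Python sketch of the either/both rule described above.  The function
              names are hypothetical, and the header parsing assumes exactly the layout shown
              (accession, a space, then colon-separated fields with the flag in second position).

```python
def chastity_flag(header):
    """Return True if the string after the accession has 'Y' after the first colon."""
    parts = header.split(None, 1)
    if len(parts) < 2:
        return False  # no string after the accession
    fields = parts[1].split(":")
    return len(fields) > 1 and fields[1] == "Y"


def skip_read(mode, header1, header2=None):
    """Decide whether a read (or read pair) is skipped under the given mode."""
    flags = [chastity_flag(header1)]
    if header2 is not None:
        flags.append(chastity_flag(header2))
    if mode == "off":
        return False
    if mode == "either":
        return any(flags)   # a 'Y' on either end filters the read
    if mode == "both":
        return all(flags)   # a 'Y' is required on every end present
    raise ValueError("mode must be off, either, or both")


if __name__ == "__main__":
    print(skip_read("either", "@r1 1:Y:0:CTTGTA", "@r1 2:N:0:CTTGTA"))
    print(skip_read("both", "@r1 1:Y:0:CTTGTA", "@r1 2:N:0:CTTGTA"))
```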

       --allow-pe-name-mismatch
              Allows accession names of reads to mismatch in paired-end files

       --interleaved
              Input is in interleaved format (one read per line, tab-delimited)

       --gunzip
              Uncompress gzipped input files

       --bunzip2
              Uncompress bzip2-compressed input files

       Computation options

       -B, --batch=INT
              Batch mode (default = 5)

              Mode  Hash offsets  Hash positions  Genome          Local hash offsets  Local hash positions  Localdb
              0     allocate      mmap            mmap            allocate            mmap                  mmap
              1     allocate      mmap & preload  mmap            allocate            mmap & preload        mmap
              2     allocate      mmap & preload  mmap & preload  allocate            mmap & preload        mmap
              3     allocate      allocate        mmap & preload  allocate            allocate              mmap
              4     allocate      allocate        allocate        allocate            allocate              mmap
              5     allocate      allocate        allocate        allocate            allocate              allocate  (default)

              Note: For a single sequence, all data structures use mmap.  A batch level of 5 means
              the same as 4, and is kept only for backward compatibility

       --use-shared-memory=INT
              If 1, then allocated memory is shared among all processes on this node.  If 0
              (default), then each process has private allocated memory

       --preload-shared-memory
              Load files indicated by --batch mode into shared memory for use by other GMAP/GSNAP
              processes on this node, and then exit.  Ignore any input files.

       --unload-shared-memory
              Unload files indicated by --batch mode from shared memory, or allow them to be
              unloaded when existing GMAP/GSNAP processes on this node are finished with them.
              Ignore any input files.

       -m, --max-mismatches=FLOAT
              Maximum number of mismatches allowed (if not specified, then GSNAP tries to find
              the best possible match in the genome).  If specified between 0.0 and 1.0, then
              treated as a fraction of each read length.  Otherwise, treated as an integral
              number of mismatches (including indel and splicing penalties).  Default is 0.3
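
              The two interpretations of the -m value can be sketched in Python as follows.  The
              function name is hypothetical, and two details are assumptions not stated on this
              page: values strictly below 1.0 are treated as fractions, and the fractional result
              is truncated to an integer.

```python
def mismatch_limit(value, read_length):
    """Translate a -m style value into a mismatch count for one read."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 1.0:
        # Fraction of the read length (assumption: truncated, not rounded).
        return int(value * read_length)
    # Otherwise an integral number of mismatches, independent of read length.
    return int(value)


if __name__ == "__main__":
    print(mismatch_limit(0.25, 76))  # fractional form
    print(mismatch_limit(5, 76))     # integral form
```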

       --query-unk-mismatch=INT
              Whether to count unknown (N) characters in the query as a mismatch (0=no (default),
              1=yes)

       --genome-unk-mismatch=INT
              Whether to count unknown (N) characters in the genome as a mismatch (0=no,  1=yes).
              If --use-mask is specified, default is no, otherwise yes.

       --maxsearch=INT
              Maximum  number  of  alignments  to  find  (default  1000).   Should be larger than
              --npaths, which is the number to report.  Keeping this number large will allow  for
              random  selection among multiple alignments.  Reducing this number can speed up the
              program.

       --indel-endlength=INT
              Minimum length at end required for indel alignments (default 4)

       --max-insertions=INT
              Maximum number of insertions allowed (default 9)

       --max-deletions=INT
              Maximum number of deletions allowed (default 15)

       -M, --suboptimal-levels=INT
              Report suboptimal hits beyond best hit (default 0).  All hits with best score plus
              suboptimal-levels are reported (Note: Not currently implemented)

       -a, --adapter-strip=STRING
              Method  for  removing  adapters from reads.  Currently allowed values: off, paired.
              Default is "off".  To turn  on,  specify  "paired",  which  removes  adapters  from
              paired-end reads if they appear to be present.

       -e, --use-mask=STRING
              Use genome containing masks (e.g. for non-exons) for scoring preference

       -V, --snpsdir=STRING
              Directory  for  SNPs  index  files (created using snpindex) (default is location of
              genome index files specified using -D and -d)

       -v, --use-snps=STRING
              Use database  containing  known  SNPs  (in  <STRING>.iit,  built  previously  using
              snpindex) for tolerance to SNPs

       --cmetdir=STRING
              Directory  for  methylcytosine  index  files  (created using cmetindex) (default is
              location of genome index files specified using -D, -V, and -d)

       --atoidir=STRING
              Directory for A-to-I RNA editing index files (created using atoiindex) (default  is
              location of genome index files specified using -D, -V, and -d)

       --mode=STRING
              Alignment mode: standard (default), cmet-stranded, cmet-nonstranded, atoi-stranded,
              atoi-nonstranded, ttoc-stranded, or ttoc-nonstranded.  Non-standard modes require
              you to have previously run the cmetindex or atoiindex programs (which also cover
              the ttoc modes) on the genome

       -t, --nthreads=INT
              Number of worker threads

       Splicing options for DNA-Seq

       --find-dna-chimeras=INT
              Look for distant splicing involving poor splice sites (0=no, 1=yes).  If not
              specified, then default is to be on unless only known splicing is desired
              (--use-splicing is specified and --novelsplicing is off)

       Splicing options for RNA-Seq

       -N, --novelsplicing=INT
              Look for novel splicing (0=no (default), 1=yes)

       --splicingdir=STRING
              Directory for splicing involving known sites or known introns, as specified by  the
              -s  or  --use-splicing  flag  (default is directory computed from -D and -d flags).
              Note: can just give full pathname to the -s flag instead.

       -s, --use-splicing=STRING
              Look for splicing involving known sites or known introns (in <STRING>.iit), at
              short or long distances.  See README instructions for the distinction between known
              sites and known introns

       --splices-noeval
              Do not evaluate splices for probability  or  intron  length,  but  depend  only  on
              sequence alignment

       --splices-dump=FILE
              Write  splice  junction  information  to  FILE, in the same format as for STAR plus
              MaxEnt probabilities for the two intron positions.  Note that in  this  dump  file,
              the annotation column is reserved strictly for known introns, and not novel introns
              that passed some criterion from a first pass.

       --splices-include-knownp
              In the file for --splices-dump, include all known introns

       --splices-read=FILE
              Read allowable splices from FILE, in the same format as for STAR.  This  is  useful
              if some external program can evaluate and filter the results from --splices-dump in
              a first alignment pass, and then GSNAP can use the filtered  splices  in  a  second
              alignment pass
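
              The dump, filter, and re-read workflow described above could be driven roughly as
              in this Python sketch.  The function name and all file names are hypothetical; the
              caller supplies every other gsnap option in common_args, because this page does not
              state which splicing flags each pass needs.  The external filtering step between
              the two passes is not shown.

```python
def two_pass_commands(common_args, reads_path, dump_file, filtered_file):
    """Build gsnap argument lists for a dump pass and a re-read pass."""
    base = ["gsnap", *common_args]
    # Pass 1: align and write splice junction information to dump_file.
    first = base + [f"--splices-dump={dump_file}", reads_path]
    # Pass 2: align again, reading the externally filtered splices.
    second = base + [f"--splices-read={filtered_file}", reads_path]
    return first, second


if __name__ == "__main__":
    # All names below are placeholders chosen for illustration.
    first, second = two_pass_commands(
        ["-d", "hg38", "-N", "1"], "reads.fastq",
        "pass1.splices", "pass1.filtered.splices")
    print(" ".join(first))
    print(" ".join(second))
```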

       -w, --localsplicedist=INT
              Definition of local novel splicing event (default 200000)

       --merge-distant-samechr
              Report  distant  splices  on  the  same chromosome as a single splice, if possible.
              Will produce a single SAM line instead of two SAM lines, which  is  also  done  for
              translocations, inversions, and scramble events

       Options for paired-end reads

       --pairmax-dna=INT
              Max total genomic length for DNA-Seq paired reads, or other reads without splicing
              (default 2000).  Used if neither -N nor -s is specified.  This value is also used
              for circular chromosomes when splicing in linear chromosomes is allowed

       --pairmax-rna=INT
              Max total genomic length for RNA-Seq paired reads, or other reads that could have a
              splice (default 200000).  Used if -N or -s is specified.  Should probably match the
              value for -w, --localsplicedist.

       --resolve-inner=INT
              Whether to resolve soft-clipping on the insides of paired-end reads (default 1)

       --pairexpect=INT
              Expected  paired-end  length,  used  for  resolving soft-clipping on the insides of
              paired-end reads, and for pairing DNA-seq reads (default 200)

       --pairdev=INT
              Allowable  deviation  from  expected  paired-end   length,   used   for   resolving
              soft-clipping on the insides of paired-end reads (default 100).

       --pass1-min-support=INT
              Threshold  read  support  for  learning  an intron during pass 1 of --two-pass mode
              (default 20)

       Options for quality scores

       --quality-protocol=STRING
              Protocol for input quality scores.  Allowed values:

              illumina (ASCII 64-126) (equivalent to -J 64 -j -31)
              sanger   (ASCII 33-126) (equivalent to -J 33 -j 0)

              Default is sanger (no quality print shift).  SAM output files should have quality
              scores in sanger protocol.

              Or you can customize this behavior with these flags:

       -J, --quality-zero-score=INT
              FASTQ  quality  scores  are  zero  at  this  ASCII  value (default is 33 for sanger
              protocol; for Illumina, select 64)

       -j, --quality-print-shift=INT
              Shift FASTQ quality scores by this amount  in  output  (default  is  0  for  sanger
              protocol; to change Illumina input to Sanger output, select -31)
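
              The effect of -J 64 -j -31 on a quality string can be illustrated with a short
              Python sketch.  The function name is hypothetical; the parameters zero_score and
              print_shift mirror the -J and -j flags, and the code only demonstrates the
              arithmetic rather than GSNAP's implementation.

```python
def shift_quality(quality, zero_score=64, print_shift=-31):
    """Shift each FASTQ quality character by print_shift ASCII positions."""
    shifted = []
    for ch in quality:
        if ord(ch) < zero_score:
            # A character below the zero point would mean a negative score.
            raise ValueError(f"character {ch!r} is below ASCII {zero_score}")
        shifted.append(chr(ord(ch) + print_shift))
    return "".join(shifted)


if __name__ == "__main__":
    # Illumina '@' (ASCII 64, score 0) becomes Sanger '!' (ASCII 33, score 0).
    print(shift_quality("@h"))
```

              With the Illumina settings, a score of 40 is 'h' (ASCII 104) on input and 'I'
              (ASCII 73) on output, so the numeric score is unchanged.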

       Output options

       -n, --npaths=INT
              Maximum number of paths to print (default 100).

       -Q, --quiet-if-excessive
              If more than maximum number of paths are found, then nothing is printed.

       -O, --ordered
              Print output in same order as input (relevant only if there is more than one worker
              thread)

       --show-refdiff
              For GSNAP output in SNP-tolerant alignment, shows all differences relative  to  the
              reference  genome  as  lower  case (otherwise, it shows all differences relative to
              both the reference and alternate genome)

       --clip-overlap
              For paired-end reads whose alignments overlap, clip the overlapping region.

       --merge-overlap
              For paired-end reads whose alignments overlap, merge the two ends into a single end
              (beta implementation)

       --print-snps
              Print  detailed  information  about  SNPs in reads (works only if -v also selected)
              (not fully implemented yet)

       --failsonly
              Print only failed alignments, those with no results

       --nofails
              Exclude printing of failed alignments

       --only-concordant
              Print    only    concordant    alignments    (concordant_uniq,     concordant_mult,
              concordant_circular)

       --omit-concordant-uniq
              Do not print any concordant_uniq alignments

       --omit-concordant-mult
              Do not print any concordant_mult alignments

       --omit-softclipped
              Do not allow any alignments with soft clips

       --only-tr-consistent
              Print  only  alignments with consistent transcripts (XX field present, identical if
              paired-end)

       -A, --format=STRING
              Another format type, other than default.  Currently  implemented:  sam,  m8  (BLAST
              tabular format)

       --split-output=STRING
              Basename  for  multiple-file  output,  separately  for nomapping, halfmapping_uniq,
              halfmapping_mult,   unpaired_uniq,   unpaired_mult,    paired_uniq,    paired_mult,
              concordant_uniq, and concordant_mult results
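
              As a rough sketch, the set of files to expect from --split-output can be listed in
              Python.  The naming pattern <basename>.<category> is an assumption based on the
              ".nomapping file" mentioned under --failed-input; the category list is the one
              given above, and the function name is hypothetical.

```python
CATEGORIES = [
    "nomapping",
    "halfmapping_uniq", "halfmapping_mult",
    "unpaired_uniq", "unpaired_mult",
    "paired_uniq", "paired_mult",
    "concordant_uniq", "concordant_mult",
]


def split_output_paths(basename):
    """Map each result category to its assumed output file name."""
    # Assumption: each file is named <basename>.<category>.
    return {category: f"{basename}.{category}" for category in CATEGORIES}


if __name__ == "__main__":
    # "sample1" is a placeholder basename.
    for path in split_output_paths("sample1").values():
        print(path)
```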

       -o, --output-file=STRING
              File name for a single stream of output results.

       --failed-input=STRING
              Print  completely  failed  alignments  as input FASTA or FASTQ format, to the given
              file, appending .1 or .2, for paired-end data.  If the --split-output flag is  also
              given, this file is generated in addition to the output in the .nomapping file.

       --append-output
              When --split-output or --failed-input is given, this flag will append output to the
              existing files.  Otherwise, the default is to create new files.

       --order-among-best=STRING
              Among alignments tied with the best score, order those alignments  in  this  order.
              Allowed values: genomic, random (default)

       --output-buffer-size=INT
              Buffer  size,  in  queries,  for  output thread (default 1000).  When the number of
              results to be printed exceeds this size, worker threads wait until the  backlog  is
              cleared

       Options for SAM output

       --no-sam-headers
              Do not print headers beginning with '@'

       --add-paired-nomappers
              Add nomapper lines as needed to make all paired-end results alternate between first
              end and second end

       --paired-flag-means-concordant=INT
              Whether the paired bit in the SAM flags means concordant only (1)  or  paired  plus
              concordant (0, default)

       --sam-headers-batch=INT
              Print headers only for this batch, as specified by -q

       --sam-hardclip-use-S
              Use S instead of H for hardclips

       --sam-use-0M=INT
              If 1 (default), then insert 0M in CIGAR between adjacent indels and introns.  If 0,
              do not allow 0M.  Picard disallows 0M, but other tools may require it

       --sam-extended-cigar
              Use extended CIGAR format (using = and X symbols instead of M, to indicate matches
              and mismatches, respectively)
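
              To illustrate the difference from the standard M operator, this Python sketch
              derives an extended CIGAR for an ungapped read/reference segment.  It is a
              simplified example with a hypothetical function name; indels, introns, and clipping
              are deliberately left out.

```python
from itertools import groupby


def extended_cigar(read, reference):
    """Return an =/X CIGAR string for two ungapped, equal-length sequences."""
    if len(read) != len(reference):
        raise ValueError("sequences must have equal length (no gaps handled)")
    # '=' marks a matching base, 'X' a mismatching base.
    ops = ("=" if r == g else "X"
           for r, g in zip(read.upper(), reference.upper()))
    # Collapse runs of the same operation into <count><op> pairs.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


if __name__ == "__main__":
    # The standard CIGAR for this segment would be 8M.
    print(extended_cigar("ACGTACGT", "ACGAACGT"))
```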

       --sam-multiple-primaries
              Allows  multiple  alignments  to  be  marked  as  primary if they have equally good
              mapping scores

       --sam-sparse-secondaries
              For secondary alignments (in multiple mappings), uses '*' for SEQ and QUAL fields,
              to give smaller file sizes.  However, the output will cause Picard to give warnings
              and may not work with downstream tools

       --force-xs-dir
              For RNA-Seq alignments, disallows XS:A:? when the sense direction is  unclear,  and
              replaces this value arbitrarily with XS:A:+.  May be useful for some programs, such
              as Cufflinks, that cannot handle XS:A:?.   However,  if  you  use  this  flag,  the
              reported value of XS:A:+ in these cases will not be meaningful.

       --md-report-snps
              In  MD  string,  when  known  SNPs  are  given  by  the  -v flag, prints difference
              nucleotides when they differ from reference but match a known alternate allele

       --no-soft-clips
              Does not allow soft clips at ends.  Mismatches will  be  counted  over  the  entire
              query

       --extend-soft-clips
              Extends alignments through soft clipped regions.  CIGAR string and coordinates will
              be revised, but mismatches and the MD string will reflect the clipped CIGAR

       --action-if-cigar-error
              Action to take if there is a disagreement between CIGAR length and sequence length.
              Allowed values: ignore, warning (default), noprint, abort.  Note that the noprint
              option does not print the CIGAR string at all if there is an error, so it may break
              a SAM parser

       --read-group-id=STRING
              Value to put into read-group id (RG-ID) field

       --read-group-name=STRING
              Value to put into read-group name (RG-SM) field

       --read-group-library=STRING
              Value to put into read-group library (RG-LB) field

       --read-group-platform=STRING
              Value to put into read-group platform (RG-PL) field

       Help options

       --check
              Check compiler assumptions

       --version
              Show version

       --help Show this help message

       Other tools of GMAP suite are located in /usr/lib/gmap