Provided by: sumaclust_1.0.35-2_amd64

NAME

       sumaclust - star clustering of genetic sequences

SYNOPSIS

       sumaclust [options] <dataset>

DESCRIPTION

       With  the  development  of  next-generation  sequencing, efficient tools are needed to handle millions of
       sequences in reasonable amounts of time.  Sumaclust is a program developed by the LECA. Sumaclust aims to
       cluster  sequences  in  a way that is fast and exact at the same time. This tool has been developed to be
       adapted to the type of data generated by DNA  metabarcoding,  i.e.  entirely  sequenced,  short  markers.
       Sumaclust clusters sequences using the same clustering algorithm as UCLUST and CD-HIT. This algorithm is
       mainly useful to detect the 'erroneous' sequences created during amplification and sequencing  protocols,
       deriving from 'true' sequences.

OPTIONS

       -h     Print this help.

       -l     Reference sequence length is the shortest.

       -L     Reference sequence length is the largest.

       -a     Reference sequence length is the alignment length (default).

       -n     Score is normalized by reference sequence length (default).

       -r     Raw score, not normalized.

       -d     Score is expressed in distance (default: score is expressed in similarity).

       -t ##.##
              Score threshold for clustering. If the score is normalized and expressed in similarity
              (default), it is an identity, e.g. 0.95 for an identity of 95%. If the score is normalized
              and expressed in distance, it is (1.0 - identity), e.g. 0.05 for an identity of 95%. If the
              score is not normalized and expressed in similarity, it is the length of the Longest Common
              Subsequence (LCS). If the score is not normalized and expressed in distance, it is
              (reference length - LCS length). Only sequences with a similarity above ##.## with the
              center sequence of a cluster are assigned to that cluster. Default: 0.97.
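
              As a rough illustration of how the four interpretations of the -t value relate, the
              following Python sketch computes an LCS-based score for two sequences. It is not
              Sumaclust's implementation: the names lcs_length and score are hypothetical, and the
              reference length is taken as the longest or shortest sequence (as with -L or -l)
              rather than the default alignment length, which this sketch does not compute.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the Longest Common Subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def score(a: str, b: str, reference: str = "longest",
          normalized: bool = True, distance: bool = False) -> float:
    """Illustrative score; 'reference' is 'longest' (-L) or 'shortest' (-l)."""
    lcs = lcs_length(a, b)
    ref_len = max(len(a), len(b)) if reference == "longest" else min(len(a), len(b))
    raw = ref_len - lcs if distance else lcs       # -d switches to distance
    return raw / ref_len if normalized else raw    # -r keeps the raw value


if __name__ == "__main__":
    s1, s2 = "ACGTACGTAC", "ACGTTCGTAC"            # one substitution, length 10
    print(score(s1, s2))                                   # 0.9 (identity)
    print(score(s1, s2, distance=True))                    # 0.1 (1.0 - identity)
    print(score(s1, s2, normalized=False))                 # 9   (LCS length)
    print(score(s1, s2, normalized=False, distance=True))  # 1   (reference length - LCS length)
```

              In this toy case, a threshold of 0.9 in normalized similarity corresponds to 0.1 in
              normalized distance, 9 as a raw similarity and 1 as a raw distance.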

       -e     Exact  option:  A  sequence  is  assigned  to  the cluster with the center sequence presenting the
              highest similarity score > threshold, as opposed to the default 'fast' option where a sequence  is
              assigned to the first cluster found with a center sequence presenting a score > threshold.
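
              The difference between the default 'fast' assignment and the exact one can be sketched
              in Python as follows. This is only an illustration of the rule described above, not
              Sumaclust's code: the function names are hypothetical, the scores are assumed to be
              precomputed similarities listed in the order in which the cluster centers are examined,
              and the strict comparison (score > threshold) simply follows the wording of this page.

```python
from typing import List, Optional, Tuple

# Each entry pairs a cluster center identifier with the similarity score
# between that center and the sequence being assigned.
Scores = List[Tuple[str, float]]


def assign_fast(scores: Scores, threshold: float) -> Optional[str]:
    """Default behaviour: stop at the first center scoring above the threshold."""
    for center, s in scores:
        if s > threshold:
            return center
    return None  # no center qualifies; the sequence would start a new cluster


def assign_exact(scores: Scores, threshold: float) -> Optional[str]:
    """-e behaviour: examine every center and keep the highest score above the threshold."""
    best, best_score = None, threshold
    for center, s in scores:
        if s > best_score:
            best, best_score = center, s
    return best


if __name__ == "__main__":
    scores = [("center1", 0.96), ("center2", 0.975), ("center3", 0.99)]
    print(assign_fast(scores, 0.97))   # center2: first center above 0.97
    print(assign_exact(scores, 0.97))  # center3: best center above 0.97
```

              The fast variant can return early, which is why it is quicker; the exact variant always
              compares the sequence against every center.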

       -R ##  Maximum  ratio between the counts of two sequences so that the less abundant one can be considered
              as a variant of the more abundant one. Default: 1.0.
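
              One way to read the -R criterion is as a test on the ratio of the two counts, sketched
              below in Python. The function name may_be_variant is hypothetical, and whether Sumaclust
              treats the limit as inclusive is an assumption made here for illustration.

```python
def may_be_variant(count_a: int, count_b: int, max_ratio: float = 1.0) -> bool:
    """Return True if the less abundant sequence may be a variant of the other.

    The ratio is (lower count) / (higher count), so it always lies in (0, 1]
    for positive counts. Inclusive comparison is an assumption of this sketch.
    """
    low, high = sorted((count_a, count_b))
    return low / high <= max_ratio


if __name__ == "__main__":
    print(may_be_variant(5, 100, max_ratio=0.1))   # True: 5/100 = 0.05
    print(may_be_variant(50, 100, max_ratio=0.1))  # False: 50/100 = 0.5
    print(may_be_variant(50, 100))                 # True: the default 1.0 excludes nothing
```

              Under this reading, the default of 1.0 never prevents an assignment, while smaller
              values restrict variants to sequences that are clearly rarer than the cluster center.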

       -p ##  Multithreading with ## threads using OpenMP.

       -s ####
              Sorting by ####. Must be 'None' for no sorting, or a key in the FASTA header of each sequence,
              except for the count that can be computed (default: sorting by count).

       -o     Sorting is in ascending order (default: descending).

       -g     n's are replaced with a's (default: sequences with n's are discarded).

       -B ### Output of the OTU table in BIOM format is activated, and written to file ###.

       -O ### Output of the OTU map (observation map) is activated, and written to file ###.

       -F ### Output in FASTA format is written to file ### instead of standard output.

       -f     Output in FASTA format is deactivated.

       Argument: the nucleotide dataset to cluster.

SEE ALSO

       http://metabarcoding.org/sumatra