Provided by: sumaclust_1.0.36+ds-2_amd64

NAME

       sumaclust - star clustering of genetic sequences

SYNOPSIS

       sumaclust [options] <dataset>

DESCRIPTION

       With  the  development of next-generation sequencing, efficient tools are needed to handle
       millions of sequences in reasonable amounts of time.  Sumaclust is a program developed  by
       the  LECA. Sumaclust aims to cluster sequences in a way that is fast and exact at the same
       time. This tool has been developed to be adapted to the type  of  data  generated  by  DNA
       metabarcoding,  i.e. entirely sequenced, short markers. Sumaclust clusters sequences using
       the same clustering algorithm as UCLUST and CD-HIT. This algorithm is  mainly  useful  to
       detect  the  'erroneous'  sequences created during amplification and sequencing protocols,
       deriving from 'true' sequences.

OPTIONS

       -h     [H]elp - print <this> help

       -l     Reference sequence length is the shortest.

       -L     Reference sequence length is the largest.

       -a     Reference sequence length is the alignment length (default).

       -n     Score is normalized by reference sequence length (default).

       -r     Raw score, not normalized.

       -d     Score is expressed in distance (default: score is expressed in similarity).

       -t ##.##
              Score threshold for clustering. If the score is normalized and expressed in
              similarity (default), it is an identity, e.g. 0.95 for an identity of 95%. If the
              score is normalized and expressed in distance, it is (1.0 - identity), e.g. 0.05
              for an identity of 95%. If the score is not normalized and expressed in similarity,
              it is the length of the Longest Common Subsequence (LCS). If the score is not
              normalized and expressed in distance, it is (reference length - LCS length). Only
              sequences with a similarity above ##.## with the center sequence of a cluster are
              assigned to that cluster. Default: 0.97.
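
              The four score conventions described for -t can be illustrated with a minimal
              Python sketch. It is not sumaclust's implementation: the function names are
              hypothetical, and only the "shortest" and "longest" reference lengths are
              modelled (the alignment-length reference used by default is left out).

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def score(a: str, b: str, normalized: bool = True, distance: bool = False,
          reference: str = "longest") -> float:
    """Score of a pair under the conventions listed for -t (hypothetical helper)."""
    lcs = lcs_length(a, b)
    # Assumption: the reference is the shorter or the longer of the two sequences.
    if reference == "shortest":
        ref_len = min(len(a), len(b))
    else:
        ref_len = max(len(a), len(b))
    if normalized:
        identity = lcs / ref_len
        return 1.0 - identity if distance else identity
    # Raw score: LCS length, or (reference length - LCS length) as a distance.
    return ref_len - lcs if distance else lcs
```

              With two 10-nucleotide sequences that differ by one substitution, the LCS
              length is 9, so the normalized similarity is 0.9 and the raw distance is 1.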

       -e     Exact  option:  A  sequence  is  assigned  to  the cluster with the center sequence
              presenting the highest similarity score > threshold,  as  opposed  to  the  default
              'fast' option where a sequence is assigned to the first cluster found with a center
              sequence presenting a score > threshold.
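
              The difference between the 'fast' and 'exact' assignments can be sketched in
              Python as follows. This is an illustration under stated assumptions, not the
              program's code: the function names are hypothetical, the input is assumed to
              be already sorted, and position_identity is a toy stand-in for the real
              similarity score that only handles sequences of equal length.

```python
def position_identity(a: str, b: str) -> float:
    """Toy similarity: fraction of identical positions (equal-length sequences only)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def assign(seq, centers, similarity, threshold=0.97, exact=False):
    """Return the index of the chosen center, or None if seq must start a new cluster."""
    best_index, best_score = None, threshold
    for i, center in enumerate(centers):
        s = similarity(seq, center)
        if s > best_score:
            if not exact:
                return i  # fast: first center scoring above the threshold
            best_index, best_score = i, s  # exact: keep the best center seen so far
    return best_index


def cluster(sequences, similarity, threshold=0.97, exact=False):
    """Greedy star clustering: each sequence joins a center or becomes a new one."""
    centers, members = [], []
    for seq in sequences:
        i = assign(seq, centers, similarity, threshold, exact)
        if i is None:
            centers.append(seq)
            members.append([seq])
        else:
            members[i].append(seq)
    return dict(zip(centers, members))
```

              When a sequence scores above the threshold against several centers, the fast
              mode keeps the first one it meets, while the exact mode compares all centers
              and keeps the most similar one.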

       -R ##  Maximum ratio between the counts of two sequences so that the less abundant one can
              be considered as a variant of the more abundant one. Default: 1.0.
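
              One possible reading of this ratio test is sketched below in Python. The
              function name is hypothetical, and the direction of the ratio (count of the
              less abundant sequence divided by count of the more abundant one) is an
              assumption drawn from the wording above, not from the program's source.

```python
def can_be_variant(count_a: int, count_b: int, max_ratio: float = 1.0) -> bool:
    """True if the less abundant sequence may be a variant of the more abundant one."""
    low, high = sorted((count_a, count_b))
    # Assumption: counts are positive, and the ratio is low count / high count.
    return low / high <= max_ratio
```

              Under this reading, the default of 1.0 never excludes a pair, whereas a value
              such as 0.1 only accepts variants at most ten times less abundant than the
              center.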

       -p ##  Multithreading with ## threads using OpenMP.

       -s ####
              Sorting by ####. Must be 'None' for no sorting, or a key present in the FASTA
              header of each sequence, except for the count, which can be computed (default:
              sorting by count).

       -o     Sorting is in ascending order (default: descending).
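
              The effect of the two sorting options can be sketched in Python as follows.
              The header layout is an assumption (annotations written as "key=value;" after
              the sequence identifier), the function names are hypothetical, and the sketch
              expects the key to be present in every header rather than computing a missing
              count.

```python
import re

# Assumed annotation syntax: "key=value;" pairs after the sequence identifier.
ATTRIBUTE = re.compile(r"(\w+)=([^;]*);")


def header_value(header: str, key: str) -> float:
    """Numeric value of `key` in a header; raises KeyError if the key is absent."""
    attributes = dict(ATTRIBUTE.findall(header))
    return float(attributes[key])


def sort_headers(headers, key="count", ascending=False):
    """Order headers by `key`; 'None' keeps the input order."""
    if key == "None":
        return list(headers)
    return sorted(headers, key=lambda h: header_value(h, key), reverse=not ascending)
```

              With the default settings this places the most abundant sequences first.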

       -g     n's are replaced with a's (default: sequences with n's are discarded).

       -B ### Output of the OTU table in BIOM format is activated, and written to file ###.

       -O ### Output of the OTU map (observation map) is activated, and written to file ###.

       -F ### Output in FASTA format is written to file ### instead of standard output.

       -f     Output in FASTA format is deactivated.

       Argument : the nucleotide dataset to cluster
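
       As a usage illustration, a command line combining some of the options above can be
       assembled from Python. The helper name and the file names are hypothetical; only
       flags documented on this page are used.

```python
def build_command(dataset, threshold=0.97, exact=False, threads=None, fasta_out=None):
    """Assemble a sumaclust argument list from options documented above."""
    cmd = ["sumaclust", "-t", str(threshold)]
    if exact:
        cmd.append("-e")  # assign each sequence to its most similar center
    if threads is not None:
        cmd += ["-p", str(threads)]  # number of threads
    if fasta_out is not None:
        cmd += ["-F", fasta_out]  # write FASTA output to a file
    cmd.append(dataset)  # the dataset is the final argument
    return cmd
```

       The resulting list can be passed to subprocess.run(cmd, check=True) on a system where
       sumaclust is installed.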

SEE ALSO

       http://metabarcoding.org/sumatra