Ubuntu Manpage: psi-cd-hit-2d.pl - runs similar algorithm like CD-HIT but using BLAST to calculate

Provided by: cd-hit_4.6.1-2012-08-27-2_amd64

NAME

       psi-cd-hit-2d.pl - runs an algorithm similar to CD-HIT, but uses BLAST to calculate
       similarities in db1 or db2 format

DESCRIPTION

       Usage: psi-cd-hit-2d [Options]

       Options

       -i     in_dbname, required

       -o     out_dbname, required

       -c     clustering threshold (sequence identity), default 0.3

       -ce clustering threshold (BLAST expect), default -1

              The default of -1 means the expect threshold is not used. With a positive value,
              the program clusters sequences if their similarity meets either the identity
              threshold or the expect threshold.
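
              The rule for -c and -ce can be sketched as follows. This is a minimal
              illustration, not the program's own code: the function name is hypothetical,
              and treating both comparisons as inclusive (>= and <=) is an assumption.

```python
def meets_threshold(identity, expect, c=0.3, ce=-1.0):
    """Return True if a hit satisfies either clustering criterion (-c or -ce)."""
    # Identity criterion: always active (assumed inclusive comparison).
    if identity >= c:
        return True
    # Expect criterion: only active when -ce is positive (the default -1 disables it).
    return ce > 0 and expect <= ce
```

              With the defaults, only identity matters; passing a positive ce lets a
              low-identity hit with a small enough expect value join a cluster as well.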

       -L     coverage of shorter sequence (aligned / full), default 0.0

       -M     coverage of longer sequence (aligned / full), default 0.0

       -R     (1/0) use psi-blast profile? default 0; if set to 1, perform a psi-blast /
              pdb-blast type search

       -G     (1/0) use global identity? default 1; sequence identity is calculated as

              total identical residues of local alignments / length of shorter seq

              If you prefer to use -G 0, it is suggested that you also use -L, such as -L 0.8, to
              prevent very short matches.
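
              As a worked illustration of the global identity formula and of the
              "aligned / full" coverage ratio used by -L and -M, here is a small sketch.
              The function names and the representation of local alignments as a list of
              identical-residue counts are assumptions made for the example.

```python
def global_identity(identical_per_alignment, len_a, len_b):
    """Total identical residues of local alignments / length of the shorter sequence."""
    shorter = min(len_a, len_b)
    return sum(identical_per_alignment) / shorter


def coverage(aligned_length, full_length):
    """Coverage as described for -L and -M: aligned / full."""
    return aligned_length / full_length
```

              For example, two local alignments with 40 and 20 identical residues between
              sequences of length 120 and 200 give a global identity of 60 / 120 = 0.5.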

       -d     length of description line in the .clstr file, default 30; if set to 0, it takes
              the fasta defline and stops at the first space
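
              The effect of -d can be sketched like this. The function name is hypothetical,
              and the defline is assumed to be passed without its leading ">" character; how
              the program itself handles that character is not stated here.

```python
def clstr_description(defline, d=30):
    """Shorten a FASTA defline the way the -d option is described."""
    if d == 0:
        # -d 0: keep everything up to the first space.
        return defline.split(" ", 1)[0]
    # Otherwise keep at most d characters.
    return defline[:d]
```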

       -l     length_of_throw_away_sequences, default 10

       -p     profile search parameters, default

              "-a 2 -d nr80 -j 3 -F F -e 0.001 -b 500 -v 500"

       -bfdb profile database, default nr80

       -s     blast search parameters, default

              "-F F -e 0.000001 -b 100000 -v 100000"

       -be blast expect cutoff, default 0.000001

       -b     filename of a list of hosts on which to run this program in parallel using ssh
              calls; you need to provide the list of hosts

       -pbs number of jobs to send each time through the PBS queuing system

              you cannot use both ssh and pbs at the same time

       -k (1/0) keep blast raw output file, default 1

       -rs number of steps between saves of the restart file and clustering output, default 5000

              each time 5000 sequences have been processed, the program writes a restart file and
              the current clustering information

       -restart restart file, reads in a restart file

              if the program crashes, is stopped, or is terminated, you can restart it by adding
              the option "-restart sth.restart"

       -rf number of steps between reformats of the blast database, default 200,000

              once the program has clustered 200,000 seqs, it removes them from the seq pool and
              reformats the blast db to save time

       -local dir of local blast db

              when run in parallel with ssh (not pbs), the program can copy blast dbs to local
              drives on each node to save blast db reading time, BUT IT MAY NOT BE FASTER

       -J     job, job_file: execute specific jobs, such as parsing blast output only. DON'T use
              it; it is only used by this program itself

       -single file of ids of sequences that you know are singletons

              so the program won't run them as queries

       -i2 second input database

       -blastn run blastn, default 0

       -lo how much longer a seq in db2 can be than the seq in db1 in a cluster, default 0

              the default means that a seq in db2 should be no longer than the seqs in db1 in a
              cluster
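
              The length rule for -lo might be expressed as below. This sketch assumes that
              -lo is an absolute number of residues added to the db1 sequence length; the
              function name is hypothetical.

```python
def length_allowed(len_db2, len_db1, lo=0):
    """True if a db2 sequence is short enough to join a db1 sequence's cluster."""
    # With the default lo=0, the db2 sequence must not be longer than the db1 sequence.
    return len_db2 <= len_db1 + lo
```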

              ==============================
              by Weizhong Li, liwz@sdsc.edu
              ==============================

              If you find cd-hit useful, please kindly cite:

              "Clustering of highly homologous sequences to reduce the size of large protein
              database", Weizhong Li, Lukasz Jaroszewski & Adam Godzik, Bioinformatics, (2001)
              17:282-283

              "Cd-hit: a fast program for clustering and comparing large sets of protein or
              nucleotide sequences", Weizhong Li & Adam Godzik, Bioinformatics, (2006)
              22:1658-1659