MMseqs2 - Many-against-Many sequence searching: fast, parallelized protein sequence searches and clustering of huge protein sequence data sets.
mmseqs <module> args
MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein/nucleotide sequence sets. MMseqs2 is open-source, GPL-licensed software implemented in C++ for Linux, macOS, and (as a beta version, via Cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times the speed of BLAST it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.

The following lists the modules that can be used as <module>.

Easy workflows (for non-experts)

An example for running a command using easy-* modules would be

  mmseqs easy-search <DB> <targetDB>

  easy-search            Search with a query fasta against target fasta (or database) and return a BLAST-compatible result in a single step
  easy-linsearch         Linear time search with a query fasta against target fasta (or database) and return a BLAST-compatible result in a single step
  easy-linclust          Compute clustering of a fasta/fastq database in linear time. The workflow outputs the representative sequences, a cluster tsv and a fasta-like format containing all sequences.
  easy-cluster           Compute clustering of a fasta database. The workflow outputs the representative sequences, a cluster tsv and a fasta-like format containing all sequences.
  easy-taxonomy          Compute taxonomy and lowest common ancestor for each sequence. The workflow outputs a taxonomic classification for sequences and a hierarchical summary report.

Main tools (for non-experts)

  createdb               Convert protein sequence set in a FASTA file to MMseqs sequence DB format
  search                 Search with query sequence or profile DB (iteratively) through target sequence DB
  linsearch              Search with query sequence DB through target sequence DB
  map                    Fast ungapped mapping of query sequences to target sequences.
  cluster                Compute clustering of a sequence DB (quadratic time)
  linclust               Cluster sequences of >30% sequence identity *in linear time*
  createindex            Precompute index table of sequence DB for faster searches
  createlinindex         Precompute index for linsearch
  enrich                 Enrich a query set by searching iteratively through a profile sequence set.
  rbh                    Find reciprocal best hits between query and target
  clusterupdate          Update clustering of old sequence DB to clustering of new sequence DB

Utility tools for format conversions

  createtsv              Create tab-separated flat file from prefilter DB, alignment DB, cluster DB, or taxa DB
  convertalis            Convert alignment DB to BLAST-tab format or specified custom-column output format
  convertprofiledb       Convert ffindex DB of HMM files to profile DB
  convert2fasta          Convert sequence DB to FASTA format
  result2flat            Create a FASTA-like flat file from prefilter DB, alignment DB, or cluster DB
  createseqfiledb        Create DB of unaligned FASTA files (1 per cluster) from sequence DB and cluster DB

Taxonomy tools

  taxonomy               Compute taxonomy and lowest common ancestor for each sequence.
  createtaxdb            Annotate a sequence database with NCBI taxonomy information
  addtaxonomy            Add taxonomy information to result database.
  lca                    Compute the lowest common ancestor from a set of taxa.
  taxonomyreport         Create Kraken-style taxonomy report.
  filtertaxdb            Filter taxonomy database.

Multi-hit search tools

  multihitdb             Create sequence database and associated metadata for multi hit searches
  multihitsearch         Search with a grouped set of sequences against another grouped set
  besthitperset          For each set of sequences compute the best element and update the p-value
  combinepvalperset      For each set compute the combined p-value
  summerizeresultsbyset  For each set compute summary statistics, such as spread-pvalue etc.
  resultsbyset           For each set compute the combined p-value
  mergeresultsbyset      Merge results from multiple ORFs back to their respective contig

Utility tools for clustering

  mergeclusters          Merge multiple cluster DBs into single cluster DB

Core tools (for advanced users)

  prefilter              Search with query sequence / profile DB through target DB (k-mer matching + ungapped alignment)
  ungappedprefilter      Search with query sequence / profile DB through target DB and compute optimal ungapped alignment score
  align                  Compute Smith-Waterman alignments for previous results (e.g. prefilter DB, cluster DB)
  alignall               Compute all-against-all Smith-Waterman alignments for results (e.g. prefilter DB, cluster DB)
  transitivealign        Transfer alignments by transitivity via a center star alignment
  clust                  Cluster sequence DB from alignment DB (e.g. created by searching DB against itself)
  kmermatcher            Find exact k-mer matches between sequences
  kmersearch             Search with query sequence through target DB (k-mer matching)
  kmerindexdb            Find exact k-mer matches between sequences and store them as index
  clusthash              Cluster sequences of same length and >90% sequence identity *in linear time*

Utility tools to manipulate DBs

  compress               Compresses a database.
  decompress             Decompresses a database.
  apply                  Passes each input database entry to stdin of the specified program, executes it and writes its stdout to the output database.
  extractorfs            Extract open reading frames from all six frames from nucleotide sequence DB
  extractframes          Extract reading frames from a nucleotide sequence DB
  orftocontig            Obtain location information of extracted ORFs with respect to their contigs in alignment format
  reverseseq             Reverse each sequence in a DB
  touchdb                Memory map database
  translatenucs          Translate nucleotide sequence DB into protein sequence DB
  translateaa            Translate protein sequence DB into nucleotide sequence DB
  swapresults            Reformat prefilter or alignment DB as if target DB had been searched through query DB
  swapdb                 Create a DB where the key is from the first column of the input result DB
  mergedbs               Merge multiple DBs into a single DB, based on IDs (names) of entries
  splitdb                Split an MMseqs DB into multiple DBs
  splitsequence          Split sequences by length
  subtractdbs            Generate a DB with entries of first DB not occurring in second DB
  filterdb               Filter a DB by conditioning (regex, numerical, ...) on one of its whitespace-separated columns
  createsubdb            Create a subset of a DB from a file of IDs of entries
  view                   Prints entries to console
  rmdb                   Removes the database
  mvdb                   Move the database
  result2profile         Compute profile and consensus DB from a prefilter, alignment or cluster DB
  result2pp              Merge the query profiles with target profiles according to search results and output an enriched profile DB
  result2rbh             Filter a merged result DB to retain only reciprocal best hits
  result2msa             Generate MSAs for queries by locally aligning their matched targets in prefilter/alignment/cluster DB
  convertmsa             Turns an MSA file into an MSA database.
  msa2profile            Turns an MSA database into an MMseqs profile database.
  profile2pssm           Converts a profile database into a human-readable tab-separated PSSM file.
  profile2cs             Converts a profile database into a column state sequence.
  result2stats           Compute statistics for each entry in a sequence, prefilter, alignment or cluster DB
  proteinaln2nucl        Map protein alignment to nucleotide alignment
  tsv2db                 Turns a TSV file into an MMseqs database
  result2repseq          Get representative sequences for a result database

Special-purpose utilities

  rescorediagonal        Compute sequence identity for diagonal
  alignbykmer            Predict sequence identity, score, alignment start and end by k-mer alignment
  diffseqdbs             Find IDs of sequences kept, added and removed between two versions of sequence DB
  concatdbs              Concatenate two DBs, giving new IDs to entries from second input DB
  sortresult             Sort a result database in the same order as prefilter or align would.
  summarizealis          Summarize alignment results into a single line showing unique coverage, coverage and average sequence identity
  summarizeresult        Extract annotations from alignment DB
  summarizetabs          Extract annotations from HHblits BLAST-tab-formatted results
  gff2db                 Turn a gff3 (generic feature format) file into a gff3 DB
  masksequence           Soft-mask sequences using tantan: low-complexity regions in lower case, the rest in upper case
  maskbygff              X out sequence regions in a sequence DB by features in a gff3 file
  prefixid               For each entry in a DB prepend the entry ID to the entry itself
  suffixid               For each entry in a DB append the entry ID to the entry itself
  convertkb              Convert UniProt knowledge base files into MMseqs2 database format for the selected column types
  summarizeheaders       Return a new summarized header DB from the UniProt headers of a cluster DB
  extractalignedregion   Extract aligned sequence region from query
  extractdomains         Extract highest scoring alignment region for each sequence from BLAST-tab file
  convertca3m            Converts a cA3M database into an MMseqs2 result database.
  expandaln              Expands an alignment result based on another.
  countkmer              Simple k-mer counter; prints the numeric and alphanumeric representation and the k-mer count
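
To illustrate how the easy workflows relate to the main tools listed above, the following sketch prints a one-step easy-search call, the same search spelled out with createdb, search and convertalis, and an easy-cluster call. It is a sketch under assumptions, not a reference invocation: the file names (query.fasta, target.fasta, result.m8, tmp), the database names (queryDB, targetDB, resultDB, clusterRes) and the helper functions are hypothetical, and the positional argument order (inputs, then output, then a temporary directory) follows the upstream usage as commonly documented and may differ between versions. Check `mmseqs <module> -h` for your installation. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: all file and database names below are hypothetical placeholders.
QUERY_FASTA="query.fasta"
TARGET_FASTA="target.fasta"
TMP_DIR="tmp"

# Print each command; execute it only when DRY_RUN=0 is set explicitly.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# One-step workflow: FASTA in, BLAST-tab style result file out.
# Assumed argument order: query, target, result file, temporary directory.
easy_search() {
    run mmseqs easy-search "$QUERY_FASTA" "$TARGET_FASTA" result.m8 "$TMP_DIR"
}

# The same search spelled out with the main tools:
# build both sequence DBs, search, then convert the alignment DB to BLAST-tab.
manual_search() {
    run mmseqs createdb "$QUERY_FASTA" queryDB
    run mmseqs createdb "$TARGET_FASTA" targetDB
    run mmseqs search queryDB targetDB resultDB "$TMP_DIR"
    run mmseqs convertalis queryDB targetDB resultDB result.m8
}

# One-step clustering of a FASTA file; clusterRes is an output prefix.
easy_cluster() {
    run mmseqs easy-cluster "$TARGET_FASTA" clusterRes "$TMP_DIR"
}

easy_search
manual_search
easy_cluster
```

Running the script as is lists the six commands prefixed with "+". Running it with DRY_RUN=0 executes them, which requires mmseqs on the PATH and the input FASTA files to exist.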