Ubuntu Manpage: Raptor-layout - A fast and space-efficient pre-filter for querying very large collections

Provided by: seqan-raptor_3.0.1+ds-3build1_amd64

NAME

       Raptor-layout  - A fast and space-efficient pre-filter for querying very large collections
       of nucleotide sequences.

DESCRIPTION

       Computes an HIBF (Hierarchical Interleaved Bloom Filter) layout that tries to minimize
       the disk space consumption of the resulting index. The space is estimated using a k-mer
       count per user bin, which represents the potential density in a technical bin in an
       interleaved Bloom filter. You can pass the resulting layout to raptor
       (https://github.com/seqan/raptor) to build the index and conduct queries.

OPTIONS

Main options:
--input-file (std::filesystem::path)
The input must be a file containing the paths to the sequence files you wish to
process, one file path per line. If the file carries auxiliary information (e.g.
species IDs), its columns must be tab-separated.

Example file:

```
/absolute/path/to/file1.fasta
/absolute/path/to/file2.fa.gz
```
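
A minimal sketch of how such an input file could be generated with Python. The
helper name write_input_list and its aux parameter are hypothetical and not part
of Raptor; the only things taken from the description above are "one file path
per line" and "tab-separated auxiliary information".

```python
from pathlib import Path


def write_input_list(paths, out_file, aux=None):
    """Write one file path per line to out_file.

    aux is an optional dict mapping a path to auxiliary information
    (e.g. a species ID); it is appended as a second, tab-separated column.
    Hypothetical helper for illustration, not part of Raptor.
    """
    aux = aux or {}
    lines = []
    for path in paths:
        fields = [str(path)]
        if str(path) in aux:
            fields.append(str(aux[str(path)]))
        lines.append("\t".join(fields))
    Path(out_file).write_text("\n".join(lines) + "\n")
    return lines
```

The resulting file can then be passed via --input-file.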

--kmer-size (unsigned 8 bit integer)
The k-mer size influences the size estimates of the input. Choosing a k-mer size
that is too small for your data will make files appear more similar than they
really are. Conversely, a k-mer size that is too large might miss certain
similarities. For DNA sequences, a k-mer size in the range [16,32] has proven to
work well. Default: 19.
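
The effect of the k-mer size on apparent similarity can be illustrated with a
small sketch. It compares the exact k-mer sets of two toy sequences using the
Jaccard index; this is an illustration only and makes no claim about how Raptor
computes similarities. The function names are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a, seq_b, k):
    """Jaccard index of the k-mer sets of two sequences."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        return 1.0  # two empty k-mer sets are treated as identical
    return len(a & b) / len(union)


# Two toy sequences that differ by a single substitution.
SEQ_A = "ACGTTGCAAGCT"
SEQ_B = "ACGTTGCTAGCT"

if __name__ == "__main__":
    for k in (2, 4, 6):
        print(k, round(jaccard(SEQ_A, SEQ_B, k), 3))
```

With k = 2 the two sequences share most of their k-mers, whereas with k = 6 the
single substitution already removes most shared k-mers.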

--num-hash-functions (unsigned 64 bit integer)
The number of hash functions to use when building the HIBF from the resulting
layout. This parameter is needed to correctly estimate the index size when
computing the layout. Default: 2.

--false-positive-rate (double)
The false positive rate you aim for when building the HIBF from the resulting
layout. This parameter is needed to correctly estimate the index size when
computing the layout. Default: 0.05.
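
To see why the number of hash functions and the false positive rate affect the
size estimate, consider the textbook Bloom filter approximation
p = (1 - e^(-h*n/m))^h, solved for the number of bits m. The sketch below uses
that formula; whether Raptor computes bin sizes in exactly this way is an
assumption, and the function and parameter names are hypothetical.

```python
import math


def bloom_filter_bits(num_kmers, num_hash_functions=2, false_positive_rate=0.05):
    """Bits needed to store num_kmers elements at the given false positive rate.

    Uses the standard approximation p = (1 - exp(-h * n / m)) ** h,
    rearranged to m = -h * n / ln(1 - p ** (1 / h)).
    """
    if num_kmers < 0 or num_hash_functions < 1:
        raise ValueError("num_kmers must be >= 0 and num_hash_functions >= 1")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be in (0, 1)")
    root = false_positive_rate ** (1.0 / num_hash_functions)
    return math.ceil(-num_hash_functions * num_kmers / math.log(1.0 - root))


if __name__ == "__main__":
    # Defaults from this page: 2 hash functions, false positive rate 0.05.
    print(bloom_filter_bits(1_000_000))
```

Under this approximation, a lower false positive rate requires more bits per
stored k-mer.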

--output-filename (std::filesystem::path)
A file name for the resulting layout. Default: "binning.out".

--threads (unsigned 64 bit integer)
The number of threads to use. Currently, only merging of sketches is parallelized,
so if the flag --disable-rearrangement is set, --threads will have no effect.
Default: 1. Value must be in range [1,18446744073709551615].

HyperLogLog Sketches:
To improve the layout, you can estimate the sequence similarities using HyperLogLog
sketches.

--disable-estimate-union
The sketches are used to estimate the sequence similarity among a set of user bins.
This will improve the layout computation as merging user bins that do not increase
technical bin sizes will be preferred. This may use more RAM and can be disabled in
RAM-critical environments. Attention: Also disables rearrangement which depends on
union estimations.
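
How a union estimate can be obtained from sketches is shown in the following
minimal HyperLogLog sketch. It is an illustration under stated assumptions, not
Raptor's implementation: the class name, the use of SHA-1 as the hash, and the
precision of 12 bits are all choices made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog cardinality sketch (illustrative only)."""

    def __init__(self, precision=12):
        self.p = precision                # number of index bits
        self.m = 1 << precision           # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Return the sketch of the union: the register-wise maximum."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1.0 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw
```

Merging two sketches never requires the original k-mers, so the size of the
union of two user bins can be estimated from their sketches alone. If the
estimated union is barely larger than the bigger of the two bins, merging them
does not noticeably increase the technical bin size.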

--disable-rearrangement
As a preprocessing step, rearranging the order of the given user bins based on
their sequence similarity may lead to favourably small unions and thus a smaller
index. Depending on the number of input samples (user bins), this may be time-
consuming and can thus be disabled if a suboptimal layout is sufficient.
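
The idea of rearranging user bins by similarity can be sketched with a simple
greedy ordering that always places the most similar remaining bin next. This is
not the algorithm used by Raptor; it is a hypothetical illustration that uses
exact sets and the Jaccard index instead of sketches.

```python
def jaccard(a, b):
    """Jaccard index of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def greedy_order(bins):
    """Order bin names so that neighbouring bins share many elements.

    bins maps a user bin name to its set of k-mers. The ordering starts
    with the largest bin and repeatedly appends the remaining bin that is
    most similar to the last one placed (ties broken by name).
    """
    remaining = dict(bins)
    current = max(remaining, key=lambda name: (len(remaining[name]), name))
    order = [current]
    current_set = remaining.pop(current)
    while remaining:
        nxt = max(remaining, key=lambda name: (jaccard(current_set, remaining[name]), name))
        order.append(nxt)
        current_set = remaining.pop(nxt)
    return order
```

Each step compares the current bin with every remaining bin, so the number of
comparisons grows quadratically with the number of user bins. This illustrates
why such a preprocessing step can become time-consuming for large inputs.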

Parameter Tweaking:
Special options

REFERENCES

       [1] Philippe Flajolet, Éric Fusy, Olivier Gandouet, Frédéric Meunier. HyperLogLog: the
       analysis of a near-optimal cardinality estimation algorithm. AofA: Analysis of Algorithms,
       Jun 2007, Juan les Pins, France. pp. 137-156. hal-00406166v2,
       https://doi.org/10.46298/dmtcs.3545

   Common options
       -h, --help
              Prints the help page.

       -hh, --advanced-help
              Prints the help page including advanced options.

       --version
              Prints the version information.

       --copyright
              Prints the copyright/license information.

       --export-help (std::string)
              Export the help page information. Value must be one of [html, man, ctd, cwl].

VERSION

       Last update: Unavailable
       Raptor-layout version: 3.0.1 (commit unavailable)
       Sharg version: 1.1.1
       SeqAn version: 3.3.0-rc.2

URL

       https://github.com/seqan/raptor

LEGAL

       Raptor-layout Copyright: BSD 3-Clause License
       Author: Svenja Mehringer
       Contact: svenja.mehringer@fu-berlin.de
       SeqAn Copyright: 2006-2023 Knut Reinert, FU-Berlin; released under the 3-clause BSDL.
       In your academic works please cite: Raptor: A fast and space-efficient pre-filter for
       querying very large collections of nucleotide sequences; Enrico Seiler, Svenja Mehringer,
       Mitra Darvish, Etienne Turc, and Knut Reinert; iScience 2021 24 (7): 102782. doi:
       https://doi.org/10.1016/j.isci.2021.102782
       For full copyright and/or warranty information see --copyright.