Ubuntu Manpage: obisample - description of obisample

Provided by: obitools_1.2.12+dfsg-2_amd64

NAME

       obisample - description of obisample

       obisample randomly resamples sequence records with or without replacement.

OBISAMPLE SPECIFIC OPTIONS

-s ###, --sample-size ###
Specifies the size of the generated sample.

· without the -a option, sample size is expressed as the exact number of
sequence records to be sampled (default: number of sequence records in the
input file).

· with the -a option, sample size is expressed as a fraction of the number of
sequence records in the input file (a number between 0 and 1).

Example:

> obisample -s 1000 seq1.fasta > seq2.fasta

Randomly samples 1000 sequence records from the seq1.fasta file, with
replacement, and saves them in the seq2.fasta file.
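
The with-replacement behaviour of the example above can be illustrated with a
minimal Python sketch. It is not obisample's implementation: the helper name
sample_with_replacement, the in-memory list of records and the seed parameter
are assumptions made for illustration, and the count attribute is ignored.

```python
import random

def sample_with_replacement(records, size=None, seed=None):
    """Draw `size` records uniformly at random, allowing repeats.

    Hypothetical helper for illustration only; `size` defaults to the
    number of input records, mirroring the documented default of -s.
    """
    rng = random.Random(seed)
    if size is None:
        size = len(records)
    # random.choices samples with replacement, so a record may appear
    # several times and `size` may exceed len(records).
    return rng.choices(records, k=size)

records = ["seq_a", "seq_b", "seq_c"]
resampled = sample_with_replacement(records, size=5, seed=1)
```

Because records are drawn with replacement, the requested size may be larger
than the number of records available, as in this sketch.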

-a, --approx-sampling
Switches the resampling algorithm to an approximate one, useful for large
files.

The default algorithm selects exactly the number of sequence records specified
with the -s option. When the -a option is set, each sequence record is instead
selected with a probability that depends on its count attribute and on the
fraction given with -s.

Example:

> obisample -s 0.5 -a seq1.fastq > seq2.fastq

Randomly samples approximately half of the sequence records of the seq1.fastq
file, without replacement, and saves them in the seq2.fastq file.
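
One way to picture the approximate mode is the following Python sketch. The
selection model is an assumption, not taken from the obisample source: each of
the count copies of a record is kept independently with probability equal to
the -s fraction. The function name approx_sample and the (identifier, count)
tuples are hypothetical.

```python
import random

def approx_sample(records, fraction, seed=None):
    """Yield (identifier, kept_count) pairs in a single pass.

    Assumed model for illustration: each of the `count` copies of a
    record is kept independently with probability `fraction`, and
    records with no surviving copy are dropped.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    for identifier, count in records:
        # One Bernoulli trial per copy of the record.
        kept = sum(rng.random() < fraction for _ in range(count))
        if kept > 0:
            yield identifier, kept

records = [("seq_a", 10), ("seq_b", 1), ("seq_c", 4)]
half = list(approx_sample(records, 0.5, seed=42))
```

Under this model the output size is only approximately the requested fraction
of the input, and the records never need to be held in memory all at once,
which is consistent with the option being described as useful for large files.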

-w, --without-replacement
Asks for sampling without replacement.

Example:

> obisample -s 1000 -w seq1.fasta > seq2.fasta

Randomly samples 1000 sequence records from the seq1.fasta file, without
replacement (the input file must contain at least 1000 sequences), and saves
them in the seq2.fasta file.
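
The constraint on the input size follows from the nature of sampling without
replacement, as this minimal Python sketch shows. The helper name
sample_without_replacement and the in-memory list are assumptions made for
illustration; this is not obisample's code and the count attribute is ignored.

```python
import random

def sample_without_replacement(records, size, seed=None):
    """Draw `size` distinct records uniformly at random.

    Hypothetical helper for illustration only.
    """
    if size > len(records):
        # No record can be drawn twice, so the input must be large enough.
        raise ValueError("sample size exceeds the number of records")
    rng = random.Random(seed)
    # random.sample never returns the same list element twice.
    return rng.sample(records, size)

records = ["seq_a", "seq_b", "seq_c"]
picked = sample_without_replacement(records, 2, seed=0)
```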

OPTIONS TO SPECIFY INPUT FORMAT

   Restrict the analysis to a sub-part of the input file
       --skip <N>
              The first N sequence records of the file are discarded from the analysis and not
              reported to the output file.

       --only <N>
              Only the next N sequence records of the file are analyzed. The following sequences
              in the file are neither analyzed nor reported to the output file. This option
              can be used together with the --skip option.
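
              The combined effect of the two options can be sketched in Python as a window over
              the record stream. This assumes that --skip is applied first and that --only then
              counts from the first record that was not skipped; the function name restrict is
              hypothetical and not part of OBITools.

```python
from itertools import islice

def restrict(records, skip=0, only=None):
    """Return an iterator over the records kept by --skip and --only.

    Hypothetical helper: drops the first `skip` records, then keeps at
    most `only` records (all remaining ones if `only` is None).
    """
    stop = None if only is None else skip + only
    return islice(records, skip, stop)

records = [f"seq_{i}" for i in range(10)]
window = list(restrict(records, skip=2, only=3))
# window == ['seq_2', 'seq_3', 'seq_4']
```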

   Sequence annotated format
       --genbank
              Input file is in genbank format.

       --embl Input file is in embl format.

   fasta related format
       --fasta
              Input file is in fasta format (including OBITools fasta extensions).

   fastq related format
       --sanger
              Input  file  is  in  Sanger  fastq  format  (standard  fastq  used  by  HiSeq/MiSeq
              sequencers).

       --solexa
              Input file is in fastq format produced by Solexa (Ga IIx) sequencers.

   ecoPCR related format
       --ecopcr
              Input file is in ecoPCR format.

       --ecopcrdb
              Input is an ecoPCR database.

   Specifying the sequence type
       --nuc  Input file contains nucleic sequences.

       --prot Input file contains protein sequences.

COMMON OPTIONS

       -h, --help
              Shows this help message and exits.

       --DEBUG
              Sets logging in debug mode.

OBISAMPLE USED SEQUENCE ATTRIBUTE

          · count

AUTHOR

       The OBITools Development Team - LECA

COPYRIGHT

       2015 - 2019, OBITools Development Team

 1.02 12                                   Jan 28, 2019                              OBISAMPLE(1)