Ubuntu Manpage: artfastqgenerator - outputs artificial FASTQ files derived from a reference genome

Provided by: artfastqgenerator_0.0.20150519-2_all

NAME

       artfastqgenerator - outputs artificial FASTQ files derived from a reference genome

SYNOPSIS

       artfastqgenerator -O <outputPath> -R <referenceGenomePath> -S <startSequenceIdentifier>
       -F1 <fastq1ForQualityScores> -F2 <fastq2ForQualityScores>
       -CMGCS <coverageMeanGCcontentSpread> -CMP <coverageMeanPeak>
       -CMPGC <coverageMeanPeakGCcontent> -CSD <coverageSD> -E <endSequenceIdentifier>
       -GCC <GCcontentBasedCoverage> -GCR <GCcontentRegionSize> -L <logRegionStats>
       -N <nucleobaseBufferSize> -OF <outputFormat> -RCNF <readsContainingNfilter>
       -RL <readLength> -SE <simulateErrorInRead> -TLM <templateLengthMean>
       -TLSD <templateLengthSD> -URQS <useRealQualityScores> -X <xStart> -Y <yStart>
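
       The synopsis can be hard to read as one long line, so here is a minimal sketch of
       assembling an invocation from Python. It assumes only what this page states: the
       command name and the -O, -R and -S options, which are marked "must be specified".
       The helper build_command, its extra parameter, the paths and the identifier "chr1"
       are hypothetical placeholders, not part of the program.

```python
import shlex


def build_command(output_path, reference_path, start_id, extra=None):
    """Assemble an artfastqgenerator argument list (hypothetical helper).

    The three positional parameters map to -O, -R and -S, which the
    OPTIONS section marks as "must be specified". `extra` is an
    illustrative dict mapping further flags to their values.
    """
    argv = [
        "artfastqgenerator",
        "-O", output_path,
        "-R", reference_path,
        "-S", start_id,
    ]
    for flag, value in (extra or {}).items():
        argv += [flag, str(value)]
    return argv


# Placeholder paths and sequence identifier; these are not real files.
cmd = build_command(
    "out/sample", "ref/genome.fasta", "chr1",
    extra={"-RL": 100, "-TLM": 250},
)
print(shlex.join(cmd))
```

       The resulting list could be passed to subprocess.run once real paths are
       substituted; building a list rather than a single string avoids shell quoting
       problems.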

DESCRIPTION

       ArtificialFastqGenerator takes a reference genome (in FASTA format) as input and outputs
       artificial FASTQ files in the Sanger format. It can accept Phred base quality scores from
       existing FASTQ files and use them to simulate sequencing errors. Because the artificial
       FASTQs are derived from the reference genome, the reference genome provides a gold
       standard for calling variants: Single Nucleotide Polymorphisms (SNPs) and insertions and
       deletions (indels). This enables evaluation of a Next Generation Sequencing (NGS)
       analysis pipeline that aligns reads to the reference genome and then calls the variants.
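
       To make the link between Phred scores and simulated errors concrete, the sketch below
       decodes a Sanger-format quality string (ASCII offset 33) and converts each score Q to
       an error probability 10^(-Q/10). The function simulate_errors is a hypothetical
       substitution-only model written for illustration; this page does not describe the
       error model the program actually uses.

```python
import random

SANGER_OFFSET = 33  # Sanger FASTQ stores a Phred score Q as chr(Q + 33)


def decode_quality(quality_string):
    """Return the Phred scores encoded in a Sanger-format quality string."""
    return [ord(c) - SANGER_OFFSET for c in quality_string]


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def simulate_errors(read, quality_string, rng):
    """Hypothetical model: substitute each base with its error probability."""
    out = []
    for base, q in zip(read, decode_quality(quality_string)):
        if rng.random() < phred_to_error_prob(q):
            # Replace the base with a different nucleobase chosen uniformly.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


scores = decode_quality("II5+")
print(scores)  # [40, 40, 20, 10]
print(simulate_errors("ACGT", "II5+", random.Random(0)))
```

       A score of 40 corresponds to an error probability of 0.0001, so high-quality
       positions are almost never altered under this model.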

OPTIONS

       -h     Print usage help.

       -O, <outputPath>
              Path for the artificial fastq and log files, including their  base  name  (must  be
              specified).

       -R, <referenceGenomePath>
              Reference genome sequence file in FASTA format (must be specified).

       -S, <startSequenceIdentifier>
              Prefix  of  the  sequence  identifier  in the reference after which read generation
              should begin (must be specified).

       -F1, <fastq1ForQualityScores>
              First fastq file to use for real quality scores (must be specified if
              useRealQualityScores = true).

       -F2, <fastq2ForQualityScores>
              Second fastq file to use for real quality scores (must be specified if
              useRealQualityScores = true).

       -CMGCS, <coverageMeanGCcontentSpread>
              The spread of coverage mean given GC content (default = 0.22).

       -CMP, <coverageMeanPeak>
              The peak coverage mean for a region (default = 37.7).

       -CMPGC, <coverageMeanPeakGCcontent>
              The GC content for regions with peak coverage mean (default = 0.45).

       -CSD, <coverageSD>
              The coverage standard deviation divided by the mean (default = 0.2).

       -E, <endSequenceIdentifier>
              Prefix of the sequence identifier in the reference where read generation should
              stop (default = end of file).

       -GCC, <GCcontentBasedCoverage>
              Whether nucleobase coverage is biased by GC content (default = true).

       -GCR, <GCcontentRegionSize>
              Region size in nucleobases for which to calculate GC content (default = 150).

       -L, <logRegionStats>
              The region size, as a multiple of the nucleobase buffer size (-N), for which
              summary coverage statistics are recorded (default = 2).

       -N, <nucleobaseBufferSize>
              The number of reference sequence nucleobases to buffer in memory (default = 5000).

       -OF, <outputFormat>
              Output format: 'default' gives standard fastq output;
              'debug_nucleobases(_nuc|read_ids)' gives debugging output.

       -RCNF, <readsContainingNfilter>
              Which N-containing reads to filter out: none (0), "all-N" reads (1), or
              "at-least-1-N" reads (2) (default = 0).

       -RL, <readLength>
              The length of each read (default = 76).

       -SE, <simulateErrorInRead>
              Whether to simulate errors in the read based on the quality scores (default =
              false).

       -TLM, <templateLengthMean>
              The mean DNA template length (default = 210).

       -TLSD, <templateLengthSD>
              The standard deviation of the DNA template length (default = 60).

       -URQS, <useRealQualityScores>
              Whether to use real quality scores from existing fastq files or set all scores to
              the maximum (default = false).

       -X, <xStart>
              The first read's X coordinate (default = 1000).

       -Y, <yStart>
              The first read's Y coordinate (default = 1000).
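
       Several options above (-GCC, -GCR, -CMGCS, -CMPGC) depend on the GC content of a
       region. As a rough illustration of that quantity, the sketch below computes the GC
       fraction of consecutive regions of a sequence, using the -GCR default of 150 as the
       region size. The function names are hypothetical, and counting N bases in the
       denominator is an assumption; this page does not say how the program handles them or
       how it maps GC content to coverage.

```python
def gc_content(sequence):
    """Fraction of bases in `sequence` that are G or C (case-insensitive).

    Assumption: every character, including N, counts towards the length.
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_by_region(sequence, region_size=150):
    """GC content of each consecutive, non-overlapping region."""
    return [
        gc_content(sequence[i:i + region_size])
        for i in range(0, len(sequence), region_size)
    ]


# Toy sequence: one all-GC region followed by one all-AT region.
toy = "GC" * 75 + "AT" * 75
print(gc_by_region(toy))  # [1.0, 0.0]
```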

BUGS

       Any bugs should be reported to Matthew.Frampton@icr.ac.uk

AUTHOR

       This  manpage was written by Andreas Tille for the Debian distribution and can be used for
       any other usage of the program.