Ubuntu Manpage: rsem-simulate-reads - Simulate RNA-Seq data ("reads") for a given model and a RSEM

NAME

       rsem-simulate-reads - Simulate RNA-Seq data ("reads") for a given model and an RSEM
       reference transcript collection.

SYNOPSIS

       rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results theta0 N
       output_name [--seed seed] [-q]

DESCRIPTION

       Parameters:

       reference_name: The name of the RSEM reference, which should already have been
       generated by 'rsem-prepare-reference'.

       estimated_model_file: This file describes how the RNA-Seq reads will be sequenced
       given the expression levels. It determines what kind of reads will be simulated
       (single-end or paired-end, with or without quality scores) and includes parameters
       for the fragment length distribution, read start position distribution, sequencing
       error models, etc. Normally, this file should be learned from real data using
       'rsem-calculate-expression'. It can be found in the 'sample_name.stat' folder under
       the name 'sample_name.model'.

       estimated_isoform_results: This file contains expression levels for all isoforms
       recorded in the reference. It can be learned from real data using
       'rsem-calculate-expression'; the corresponding file is
       'sample_name.isoforms.results'. To simulate from a user-designed expression profile,
       start from a learned 'sample_name.isoforms.results' file and modify only the 'TPM'
       column. The simulator reads only the TPM column, but the file format must be kept
       the same. If the RSEM reference was built with allele-specific transcripts, use
       'sample_name.alleles.results' instead.

       theta0: The fraction of reads that come from background "noise" (instead of from a
       transcript). It can also be estimated from real data using
       'rsem-calculate-expression': it is the first value on the third line of the file
       'sample_name.stat/sample_name.theta'.

       N: The total number of reads to be simulated. If 'rsem-calculate-expression' has been
       run on a real data set, the total number of reads is the 4th number on the first
       line of the file 'sample_name.stat/sample_name.cnt'.

       output_name: Prefix for all output files.

       --seed seed: Set the seed for the random number generator used in simulation. The
       seed should be a 32-bit unsigned integer.

       -q: Quiet mode; suppress the output of intermediate information.
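
       As a minimal sketch of building a user-designed expression profile, the shell
       function below replaces the TPM value of one transcript and passes every other field
       through unchanged. It assumes the results file is tab-separated with a header row
       that names the 'transcript_id' and 'TPM' columns. The function name 'set_tpm' and the
       transcript name in the usage comment are hypothetical and not part of RSEM.

```shell
# set_tpm FILE TRANSCRIPT_ID NEW_TPM   (hypothetical helper, not an RSEM tool)
# Print FILE to stdout with the TPM value of one transcript replaced.
# Assumption: FILE is tab-separated and its first line names the columns,
# including 'transcript_id' and 'TPM'.
set_tpm() {
    awk -F '\t' -v OFS='\t' -v id="$2" -v tpm="$3" '
        NR == 1 {
            # Locate the columns by header name instead of hard-coding positions.
            for (i = 1; i <= NF; i++) {
                if ($i == "transcript_id") idcol = i
                if ($i == "TPM") tpmcol = i
            }
            if (!idcol || !tpmcol) {
                print "set_tpm: required column not found" > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        # Compare as strings so that numeric-looking names are not coerced.
        ($idcol "") == (id "") { $tpmcol = tpm "" }
        { print }
    ' "$1"
}

# Usage (file and transcript names are placeholders):
#   set_tpm sample_name.isoforms.results some_transcript 125.50 > custom.isoforms.results
```

       The sketch changes one value only and does not rescale the rest of the TPM column.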

       Outputs:

       output_name.sim.isoforms.results, output_name.sim.genes.results: Expression levels
       estimated by counting where each simulated read comes from.

       output_name.sim.alleles.results: Allele-specific expression levels estimated by
       counting where each simulated read comes from.

       Simulated reads are written to output_name.fa if single-end without quality scores;
       output_name.fq if single-end with quality scores; output_name_1.fa and
       output_name_2.fa if paired-end without quality scores; output_name_1.fq and
       output_name_2.fq if paired-end with quality scores.

       Format  of the header line: Each simulated read's header line encodes where it comes from.
       The header line has the format:

              {>/@}_rid_dir_sid_pos[_insertL]

       {>/@}: Either '>' or '@' must appear. '>' appears if FASTA files are generated and
       '@' appears if FASTQ files are generated.

       rid: The simulated read's index, numbered from 0.

       dir: The direction of the simulated read. 0 refers to the forward strand ('+') and
       1 refers to the reverse strand ('-').

       sid: Indicates which transcript this read is simulated from. It ranges between 0 and
       M, where M is the total number of transcripts. If sid=0, the read is simulated from
       the background noise. Otherwise, the read is simulated from the transcript with
       index sid. Transcript sid's name can be found in the 'transcript_id' column of the
       'sample_name.isoforms.results' file (at line sid + 1; line 1 holds the column names).

       pos: The start position of the simulated read in strand dir of transcript sid,
       numbered from 0.

       insertL: Appears only for paired-end reads. It gives the insert length of the
       simulated read.
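
       The header fields described above can be decoded with a short shell sketch. It
       assumes the fields are separated by underscores exactly as in the format line, and it
       accepts a header with or without an underscore directly after the '>' or '@' marker.
       The lookup helper assumes a tab-separated results file with a 'transcript_id' column
       in its header row. Both function names are hypothetical and not part of RSEM.

```shell
# parse_sim_header HEADER   (hypothetical helper, not an RSEM tool)
# Decode one simulated read header into labelled fields.
parse_sim_header() {
    printf '%s\n' "$1" | awk '
        {
            name = $1                 # ignore anything after the first blank
            sub(/^[>@]/, "", name)    # drop the FASTA/FASTQ marker
            sub(/^_/, "", name)       # tolerate an underscore right after the marker
            n = split(name, f, "_")
            if (n != 4 && n != 5) {
                print "parse_sim_header: unexpected header: " $0 > "/dev/stderr"
                exit 1
            }
            strand = (f[2] == 0) ? "+" : "-"                  # dir: 0 forward, 1 reverse
            origin = (f[3] == 0) ? "noise" : "transcript"     # sid 0 is background noise
            printf "rid=%s strand=%s sid=%s origin=%s pos=%s", f[1], strand, f[3], origin, f[4]
            if (n == 5) printf " insertL=%s", f[5]            # paired-end reads only
            printf "\n"
        }'
}

# transcript_name_for_sid ISOFORMS_RESULTS SID   (hypothetical helper)
# Print the transcript_id found at line SID + 1 (line 1 holds the column names).
# Prints nothing for SID 0, which denotes background noise.
transcript_name_for_sid() {
    awk -F '\t' -v sid="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "transcript_id") col = i
            if (!col) exit 1
            next
        }
        NR == sid + 1 { print $col; exit }
    ' "$1"
}
```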

       Example:

       Suppose we want to simulate 50 million single-end reads with quality scores, using the
       parameters learned in the Example (sample name 'mmliver_single_quals'). In addition,
       we set theta0 to 0.2 and output_name to 'simulated_reads'. The command is:

              rsem-simulate-reads /ref/mouse_125 \
                  mmliver_single_quals.stat/mmliver_single_quals.model \
                  mmliver_single_quals.isoforms.results 0.2 50000000 simulated_reads
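
       If theta0 and N should instead be taken from the real data run, they can be read from
       the files named under Parameters. The sketch below only prints the resulting command
       and does not run it. It assumes the values in the '.theta' and '.cnt' files are
       separated by whitespace and that it is called from the directory containing the
       'sample_name.stat' folder. The function names are hypothetical and not part of RSEM.

```shell
# theta0 is the first value on line 3 of the .theta file.
rsem_theta0() { awk 'NR == 3 { print $1; exit }' "$1"; }

# N is the 4th number on line 1 of the .cnt file.
rsem_read_count() { awk 'NR == 1 { print $4; exit }' "$1"; }

# sim_command REFERENCE SAMPLE OUTPUT   (hypothetical helper, not an RSEM tool)
# Print, without running it, the rsem-simulate-reads command for a finished
# rsem-calculate-expression run named SAMPLE.
sim_command() {
    ref=$1
    sample=$2
    out=$3
    theta0=$(rsem_theta0 "$sample.stat/$sample.theta") || return 1
    n=$(rsem_read_count "$sample.stat/$sample.cnt") || return 1
    # Refuse to build a command from files that are too short.
    [ -n "$theta0" ] && [ -n "$n" ] || return 1
    printf 'rsem-simulate-reads %s %s %s %s %s %s\n' \
        "$ref" "$sample.stat/$sample.model" "$sample.isoforms.results" \
        "$theta0" "$n" "$out"
}

# Usage, with the names from the example above:
#   sim_command /ref/mouse_125 mmliver_single_quals simulated_reads
```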