Provided by: soapdenovo_1.05-5_amd64 bug

NAME

       soapdenovo -  Short-read assembly method that can build a de novo draft assembly

SYNOPSIS

       soapdenovo_31mer soapdenovo_63mer soapdenovo_127mer

Introduction

       SOAPdenovo  is  a novel short-read assembly method that can build a de novo draft assembly for the human-
       sized genomes. The program is specially designed to assemble Illumina GA  short  reads.  It  creates  new
       opportunities  for  building reference sequences and carrying out accurate analyses of unexplored genomes
       in a cost effective way.

       1) Support large kmer up to 127 to utilize long reads. Three version are provided.
           I. The 31mer version support kmer only <=31.
           II. The 63mer version support kmer only <=63 and doubles the memory consumption than  31mer  version,
       even being used with kmer <=31.
           III. The 127mer version support kmer only <=127 and double the memory consumption than 63mer version,
       even being used with kmer <=63.

       Please notice that, with longer kmer, the quantity of nodes would decrease significantly, thus the memory
       consumption is usually smaller than double with shifted version.

       2)  New  parameter  added  in  "pregraph" module. This parameter initiates the memory assumption to avoid
       further reallocation. Unit of the parameter is GB. Without further reallocation, SOAPdenovo  runs  faster
       and  provide  the  potential  to  eat  up  all the memory of the machine. For example, if the workstation
       provides 50g free memory, use -a 50 in pregraph step, then  a  static  amount  of  50g  memory  would  be
       allocated  before processing reads. This can also avoid being interrupted by other users sharing the same
       machine.

       3) Gap filled bases now represented by lowercase characters in 'scafSeq' file.

       4) Introduced SIMD instructions to boost the performance.

Configuration file

       For big genome projects with deep sequencing, the data is usually organized  as  multiple  read  sequence
       files  generated from multiple libraries.  The configuration file tells the assembler where to find these
       files and the relevant information.  “example.config” is an example of such a file.

       The configuration file has a section for global information, and then multiple library  sections.   Right
       now only “max_rd_len” is included in the global information section. Any read longer than max_rd_len will
       be cut to this length.

       The library information and the information of sequencing data  generated  from  the  library  should  be
       organized  in the corresponding library section.  Each library section starts with tag [LIB] and includes
       the following items:

       avg_ins
              This value indicates the average insert size of this library or the peak  value  position  in  the
              insert size distribution figure.

       reverse_seq
              This  option  takes  value  0  or  1.  It  tells  the  assembler  if the read sequences need to be
              complementarily reversed.  Illumima GA produces two types of  paired-end  libraries:  a)  forward-
              reverse, generated from fragmented DNA ends with typical insert size less than 500 bp; b) forward-
              forward, generated from circularizing libraries with typical insert size greater than 2  Kb.   The
              parameter “reverse_seq” should be set to indicate this: 0, forward-reverse; 1, forward-forward.

       asm_flags=3
              This  indicator  decides  in  which  part(s)  the  reads  are  used.  It takes value 1(only contig
              assembly), 2 (only scaffold assembly), 3(both contig  and  scaffold  assembly),  or  4  (only  gap
              closure).

       rd_len_cutoff
              The assembler will cut the reads from the current library to this length.

       rank   It  takes  integer  values  and  decides  in which order the reads are used for scaffold assembly.
              Libraries with the same “rank” are used at the same time during scaffold assembly.

       pair_num_cutoff
              This parameter is the cutoff value of pair number for a reliable connection between two contigs or
              pre-scaffolds.

       map_len
              This  takes  effect  in  the  “map”  step and is the minimum alignment length between a read and a
              contig required for a reliable read location.

       The assembler accepts read file in  two  formats:  FASTA  or  FASTQ.   Mate-pair  relationship  could  be
       indicated  in  two  ways:  two  sequence  files  with reads in the same order belonging to a pair, or two
       adjacent reads in a single file (FASTA only) belonging to a pair.

       In the configuration file single end files are indicated by “f=/path/filename” or  “q=/pah/filename”  for
       fasta  or  fastq formats separately.  Paired reads in two fasta sequence files are indicated by “f1=” and
       “f2=”. While paired reads in two fastq sequences files are indicated by “q1=” and “q2=”. Paired reads  in
       a single fasta sequence file is indicated by “p=” item.

       All  the  above items in each library section are optional. The assembler assigns default values for most
       of them. If you are not sure how to set a parameter, you can remove it from your configuration file.

Get it started

       Once the configuration file is available,  a  typical  way  to  run  the  assembler  is:  ${bin}  all  –s
       config_file –K 63 –R –o graph_prefix

       User  can  also choose to run the assembly process step by step as: ${bin} pregraph \[u2013]s config_file
       \[u2013]K 63 [\[u2013]R -d \[u2013]p -a] \[u2013]o  graph_prefix  ${bin}  contig  \[u2013]g  graph_prefix
       [\[u2013]R  \[u2013]M  1  -D]  ${bin}  map \[u2013]s config_file \[u2013]g graph_prefix [-p] ${bin} scaff
       \[u2013]g graph_prefix [\[u2013]F -u -G -p]

Options

       -a     INT Initiate the memory assumption (GB) to avoid further reallocation

       -s     STR configuration file

       -o     STR output graph file prefix

       -g     STR input graph file prefix

       -K     INT K-mer size [default 23, min 13, max 127]

       -p     INT multithreads, n threads [default 8]

       -R     use reads to solve tiny repeats [default no]

       -d     INT remove low-frequency K-mers with frequency no larger than [default 0]

       -D     INT remove edges with coverage no larger that [default 1]

       -M     INT strength of merging similar sequences during contiging [default 1, min 0, max 3]

       -F     intra-scaffold gap closure [default no]

       -u     un-mask high coverage contigs before scaffolding [default mask]

       -G     INT allowed length difference between estimated and filled gap

       -L     minimum contigs length used for scaffolding

Output files

       These files are output as assembly results:

       a. *.contig

       contig sequences without using mate pair information

       b. *.scafSeq

       scaffold sequences (final contig sequences can be extracted by breaking down scaffold  sequences  at  gap
       regions)

       There  are  some  other  files  that  provide  useful information for advanced users, which are listed in
       Appendix B.

FAQ

   How to set K-mer size?
       The program accepts odd numbers between 13 and 31. Larger K-mers would have higher rate of uniqueness  in
       the genome and would make the graph simpler, but it requires deep sequencing depth and longer read length
       to guarantee the overlap at any genomic location.

   How to set library rank?
       SOAPdenovo will use the pair-end  libraries  with  insert  size  from  smaller  to  larger  to  construct
       scaffolds.  Libraries  with  the same rank would be used at the same time. For example, in a dataset of a
       human genome, we set five ranks for five libraries with insert size 200-bp, 500-bp, 2-Kb, 5-Kb and 10-Kb,
       separately. It is desired that the pairs in each rank provide adequate physical coverage of the genome.

APPENDIX A: an example.config

       #maximal read length
       max_rd_len=50
       [LIB]
       #average insert size
       avg_ins=200
       #if sequence needs to be reversed
       reverse_seq=0
       #in which part(s) the reads are used
       asm_flags=3
       #use only first 50 bps of each read
       rd_len_cutoff=50
       #in which order the reads are used while scaffolding
       rank=1
       # cutoff of pair number for a reliable connection (default 3)
       pair_num_cutoff=3
       #minimum aligned length to contigs for a reliable read location (default 32)
       map_len=32
       #fastq file for read 1
       q1=/path/**LIBNAMEA**/fastq_read_1.fq
       #fastq file for read 2 always follows fastq file for read 1
       q2=/path/**LIBNAMEA**/fastq_read_2.fq
       #fasta file for read 1
       f1=/path/**LIBNAMEA**/fasta_read_1.fa
       #fastq file for read 2 always follows fastq file for read 1
       f2=/path/**LIBNAMEA**/fasta_read_2.fa
       #fastq file for single reads
       q=/path/**LIBNAMEA**/fastq_read_single.fq
       #fasta file for single reads
       f=/path/**LIBNAMEA**/fasta_read_single.fa
       #a single fasta file for paired reads
       p=/path/**LIBNAMEA**/pairs_in_one_file.fa
       [LIB]
       avg_ins=2000
       reverse_seq=1
       asm_flags=2
       rank=2
       # cutoff of pair number for a reliable connection
       #(default 5 for large insert size)
       pair_num_cutoff=5
       #minimum aligned length to contigs for a reliable read location
       #(default 35 for large insert size)
       map_len=35
       q1=/path/**LIBNAMEB**/fastq_read_1.fq
       q2=/path/**LIBNAMEB**/fastq_read_2.fq
       q=/path/**LIBNAMEB**/fastq_read_single.fq
       f=/path/**LIBNAMEB**/fasta_read_single.fa

Appendix B: output files

       1. Output files from the command “pregraph”

       a. *.kmerFreq

       Each row shows the number of Kmers with a frequency equals the row number.

       b. *.edge

       Each  record  gives the information of an edge in the pre-graph: length, Kmers on both ends, average kmer
       coverage, whether it’s reverse-complementarily identical and the sequence.

       c. *.markOnEdge & *.path

       These two files are for using reads to solve small repeats

       e. *.preArc

       Connections between edges which are established by the read paths.

       f. *.vertex

       Kmers at the ends of edges.

       g. *.preGraphBasic

       Some basic information about the pre-graph: number of vertex, K value,  number  of  edges,  maximum  read
       length etc.

              2. Output files from the command “contig”

       a. *.contig

       Contig  information:  corresponding edge index, length, kmer coverage, whether it’s tip and the sequence.
       Either a contig or its reverse complementry counterpart is included. Each  reverse  complementary  contig
       index is indicated in the *.ContigIndex file.

       b. *.Arc

       Arcs coming out of each edge and their corresponding coverage by reads

       c. *.updated.edge

       Some  information  for  each  edge  in  graph:  length,  Kmers at both ends, index difference between the
       reverse-complementary edge and this one.

       d. *.ContigIndex

       Each record gives information about each contig in the *.contig:  it’s  edge  index,  length,  the  index
       difference between its reverse-complementary counterpart and itself.

              3. Output files from the command “map”

       a. *.peGrads

       Information  for each clone library: insert-size, read index upper bound, rank and pair number cutoff for
       a reliable link.

       This file can be revised manually for scaffolding tuning.

       b. *.readOnContig

       Read locations on contigs. Here contigs are referred by their edge index. However about half of them  are
       not listed in the *.contig file for their reverse-complementary counterparts are included already.

       c. *.readInGap

       This  file includes reads that could be located in gaps between contigs. This information will be used to
       close gaps in scaffolds.

              4. Output files from the command “scaff”

       a. *.newContigIndex

       Contigs are sorted according their length before scaffolding. Their new index are listed  in  this  file.
       This is useful if one wants to corresponds contigs in *.contig with those in *.links.

       b. *.links

       Links between contigs which are established by read pairs. New index are used.

       c. *.scaf_gap

       Contigs in gaps found by contig graph outputted by the contiging procedure. Here new index are used.

       d. *.scaf

       Contigs for each scaffold: contig index (concordant to index in *.contig),  approximate start position on
       scaffold, orientation, contig length, and its links to others.

       e. *.gapSeq

       Gap sequences between contigs.

       f. *.scafSeq

       Sequence of each scaffold.

AUTHOR

       Olivier Sallou (olivier.sallou (at) irisa.fr) - Man page and packaging