Provided by: soapdenovo_1.05-5_amd64

NAME

       soapdenovo -  Short-read assembly method that can build a de novo draft assembly

SYNOPSIS

       soapdenovo_31mer soapdenovo_63mer soapdenovo_127mer

Introduction

       SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly
       for human-sized genomes. The program is specially designed to assemble Illumina GA
       short reads. It creates new opportunities for building reference sequences and carrying
       out accurate analyses of unexplored genomes in a cost-effective way.

       1) Support for large kmers, up to 127, to make use of long reads. Three versions are
       provided.
           I. The 31mer version supports only kmer <=31.
           II. The 63mer version supports only kmer <=63 and uses twice as much memory as the
       31mer version, even when used with kmer <=31.
           III. The 127mer version supports only kmer <=127 and uses twice as much memory as the
       63mer version, even when used with kmer <=63.

       Please note that with a longer kmer the number of nodes decreases significantly, so in
       practice the memory consumption after switching to a larger version is usually less than
       double.

       2) New parameter added in the "pregraph" module. This parameter (-a) sets the initial
       memory allocation, in GB, so that no further reallocation is needed. Without
       reallocation, SOAPdenovo runs faster and is able to use all the memory of the machine.
       For example, if the workstation has 50 GB of free memory, use -a 50 in the pregraph
       step; a static 50 GB of memory is then allocated before the reads are processed. This
       also avoids being interrupted by other users sharing the same machine.

       3) Gap-filled bases are now represented by lowercase characters in the 'scafSeq' file.

       4) Introduced SIMD instructions to boost the performance.
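
       As an illustration of item 1, the choice between the three builds can be sketched as a
       small helper. This is only a sketch: the function name pick_binary is hypothetical, the
       binary names are taken from the SYNOPSIS, and the odd-K and minimum-13 checks follow
       the Options and FAQ sections.

```python
def pick_binary(k: int) -> str:
    """Return the smallest build whose k-mer limit covers k (hypothetical helper)."""
    if k < 13 or k > 127:
        raise ValueError("K must be between 13 and 127")
    if k % 2 == 0:
        raise ValueError("K must be an odd number")
    # Each larger build doubles memory use, so prefer the smallest one that fits.
    for limit in (31, 63, 127):
        if k <= limit:
            return f"soapdenovo_{limit}mer"
    raise AssertionError("unreachable: k was range-checked above")


print(pick_binary(45))
```

       Picking the smallest build that fits follows from the memory note above: a larger build
       costs more memory even when run with a small kmer.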

Configuration file

       For big genome projects with deep sequencing, the data is usually  organized  as  multiple
       read  sequence  files generated from multiple libraries.  The configuration file tells the
       assembler where to find these files and the relevant information.  “example.config” is  an
       example of such a file.

       The  configuration  file  has  a section for global information, and then multiple library
       sections.  Right now only “max_rd_len” is included in the global information section.  Any
       read longer than max_rd_len will be cut to this length.

       The  library information and the information of sequencing data generated from the library
       should be organized in the corresponding library section.   Each  library  section  starts
       with tag [LIB] and includes the following items:

       avg_ins
              This  value  indicates  the  average  insert size of this library or the peak value
              position in the insert size distribution figure.

       reverse_seq
              This option takes value 0 or 1. It tells the assembler if the read sequences need
              to be reverse-complemented. Illumina GA produces two types of paired-end
              libraries: a) forward-reverse, generated from fragmented DNA ends with typical
              insert size less than 500 bp; b) forward-forward, generated from circularizing
              libraries with typical insert size greater than 2 Kb. The parameter “reverse_seq”
              should be set to indicate this: 0, forward-reverse; 1, forward-forward.

       asm_flags
              This indicator decides in which part(s) the reads are used. It takes value 1 (only
              contig assembly), 2 (only scaffold assembly), 3 (both contig and scaffold
              assembly), or 4 (only gap closure).

       rd_len_cutoff
              The assembler will cut the reads from the current library to this length.

       rank   It  takes integer values and decides in which order the reads are used for scaffold
              assembly. Libraries with the same “rank” are used at the same time during  scaffold
              assembly.

       pair_num_cutoff
              This parameter is the cutoff value of pair number for a reliable connection between
              two contigs or pre-scaffolds.

       map_len
              This takes effect in the “map” step and is the minimum alignment length  between  a
              read and a contig required for a reliable read location.

       The assembler accepts read files in two formats: FASTA or FASTQ. The mate-pair
       relationship can be indicated in two ways: two sequence files with the reads of each
       pair in the same order, or two adjacent reads in a single file (FASTA only).

       In the configuration file, single-end files are indicated by “f=/path/filename” or
       “q=/path/filename” for fasta or fastq format respectively. Paired reads in two fasta
       sequence files are indicated by “f1=” and “f2=”, while paired reads in two fastq
       sequence files are indicated by “q1=” and “q2=”. Paired reads in a single fasta sequence
       file are indicated by the “p=” item.

       All  the  above  items in each library section are optional. The assembler assigns default
       values for most of them. If you are not sure how to set a parameter,  you  can  remove  it
       from your configuration file.
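
       The layout described above (one global item followed by [LIB] sections) can be generated
       programmatically. The following is a minimal sketch, not part of SOAPdenovo: the
       function render_config and the file paths are hypothetical, and only item names
       documented above are used.

```python
def render_config(max_rd_len, libraries):
    """Render a configuration file: global section first, then one [LIB] per library."""
    lines = [f"max_rd_len={max_rd_len}"]
    for lib in libraries:
        lines.append("[LIB]")
        # Dicts keep insertion order, so q1 is written before q2 as required.
        for key, value in lib.items():
            lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical libraries; the paths are placeholders.
libs = [
    {"avg_ins": 200, "reverse_seq": 0, "asm_flags": 3, "rank": 1,
     "q1": "/path/lib_a/read_1.fq", "q2": "/path/lib_a/read_2.fq"},
    {"avg_ins": 2000, "reverse_seq": 1, "asm_flags": 2, "rank": 2,
     "q1": "/path/lib_b/read_1.fq", "q2": "/path/lib_b/read_2.fq"},
]
config_text = render_config(50, libs)
print(config_text)
```

       Items left out of a library dictionary are simply not written, which matches the rule
       that every item in a library section is optional.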

Get it started

       Once the configuration file is available, a typical way to run the assembler is:

              ${bin} all -s config_file -K 63 -R -o graph_prefix

       Users can also choose to run the assembly process step by step:

              ${bin} pregraph -s config_file -K 63 [-R -d -p -a] -o graph_prefix
              ${bin} contig -g graph_prefix [-R -M 1 -D]
              ${bin} map -s config_file -g graph_prefix [-p]
              ${bin} scaff -g graph_prefix [-F -u -G -p]
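
       The four steps can also be driven from a script. The sketch below only builds and prints
       the command lines; it does not run them. The function assembly_steps is hypothetical,
       the chosen optional flags (-R, -F, -p) are one possible selection from the options
       documented in this page, and the binary, configuration and prefix names are
       placeholders.

```python
import shlex


def assembly_steps(binary, config, prefix, k, threads=8):
    """Return the argv lists for pregraph, contig, map and scaff, in order."""
    p = ["-p", str(threads)]
    return [
        [binary, "pregraph", "-s", config, "-K", str(k), "-R", *p, "-o", prefix],
        [binary, "contig", "-g", prefix, "-R"],
        [binary, "map", "-s", config, "-g", prefix, *p],
        [binary, "scaff", "-g", prefix, "-F", *p],
    ]


steps = assembly_steps("soapdenovo_63mer", "example.config", "graph_prefix", 63)
for argv in steps:
    # Print a shell-safe rendering of each command.
    print(shlex.join(argv))
```

       To actually run the steps, each list could be passed to subprocess.run(argv,
       check=True), so that a failing step stops the pipeline before the next one starts.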

Options

       -a     INT initial memory allocation (GB), reserved up front to avoid further reallocation

       -s     STR configuration file

       -o     STR output graph file prefix

       -g     STR input graph file prefix

       -K     INT K-mer size [default 23, min 13, max 127]

       -p     INT multithreads, n threads [default 8]

       -R     use reads to solve tiny repeats [default no]

       -d     INT remove low-frequency K-mers with frequency no larger than [default 0]

       -D     INT remove edges with coverage no larger than [default 1]

       -M     INT strength of merging similar sequences during contiging [default 1, min  0,  max
              3]

       -F     intra-scaffold gap closure [default no]

       -u     un-mask high coverage contigs before scaffolding [default mask]

       -G     INT allowed length difference between estimated and filled gap

       -L     INT minimum contig length used for scaffolding

Output files

       These files are output as assembly results:

       a. *.contig

       contig sequences without using mate pair information

       b. *.scafSeq

       scaffold  sequences  (final  contig  sequences  can be extracted by breaking down scaffold
       sequences at gap regions)

       There are some other files that provide useful information for advanced users,  which  are
       listed in Appendix B.
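
       Breaking scaffold sequences at gap regions, as mentioned for *.scafSeq, can be sketched
       as follows. This assumes that unfilled gaps appear as runs of uppercase 'N', which this
       page does not state explicitly; the function split_scaffold and its min_gap parameter
       are hypothetical. Lowercase (gap-filled) bases are kept as part of the contigs.

```python
import re


def split_scaffold(sequence, min_gap=1):
    """Split one scaffold sequence into contig pieces at runs of 'N'.

    min_gap is the shortest run of N treated as a gap (assumed convention).
    """
    gap = re.compile(r"N{%d,}" % min_gap)
    # Drop empty pieces produced by leading or trailing gaps.
    return [piece for piece in gap.split(sequence) if piece]


print(split_scaffold("ACGTNNNNacgtNGG"))
```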

FAQ

   How to set K-mer size?
       The program accepts odd numbers between 13 and 127; the upper limit depends on the
       version used (31, 63 or 127). Larger K-mers have a higher rate of uniqueness in the
       genome and make the graph simpler, but they require deeper sequencing and longer reads
       to guarantee overlap at any genomic location.

   How to set library rank?
       SOAPdenovo uses the paired-end libraries in order of insert size, from smaller to
       larger, to construct scaffolds. Libraries with the same rank are used at the same time.
       For example, in a dataset of a human genome, we set five ranks for five libraries with
       insert sizes of 200 bp, 500 bp, 2 Kb, 5 Kb and 10 Kb, respectively. Ideally, the pairs
       in each rank provide adequate physical coverage of the genome.
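
       The ranking rule (smaller insert sizes first) can be sketched as below. The function
       assign_ranks is hypothetical, and giving libraries with equal insert size the same rank
       is an assumption made for illustration; the assembler itself only reads the rank values
       from the configuration file.

```python
def assign_ranks(insert_sizes):
    """Map each library's insert size to a rank, smallest insert size first."""
    distinct = sorted(set(insert_sizes))
    rank_of = {size: position + 1 for position, size in enumerate(distinct)}
    # Preserve the input order so ranks line up with the library list.
    return [rank_of[size] for size in insert_sizes]


# Insert sizes from the human genome example above, in arbitrary order.
print(assign_ranks([500, 200, 10000, 2000, 5000]))
```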

APPENDIX A: an example.config

       #maximal read length
       max_rd_len=50
       [LIB]
       #average insert size
       avg_ins=200
       #if sequence needs to be reversed
       reverse_seq=0
       #in which part(s) the reads are used
       asm_flags=3
       #use only first 50 bps of each read
       rd_len_cutoff=50
       #in which order the reads are used while scaffolding
       rank=1
       # cutoff of pair number for a reliable connection (default 3)
       pair_num_cutoff=3
       #minimum aligned length to contigs for a reliable read location (default 32)
       map_len=32
       #fastq file for read 1
       q1=/path/**LIBNAMEA**/fastq_read_1.fq
       #fastq file for read 2 always follows fastq file for read 1
       q2=/path/**LIBNAMEA**/fastq_read_2.fq
       #fasta file for read 1
       f1=/path/**LIBNAMEA**/fasta_read_1.fa
       #fasta file for read 2 always follows fasta file for read 1
       f2=/path/**LIBNAMEA**/fasta_read_2.fa
       #fastq file for single reads
       q=/path/**LIBNAMEA**/fastq_read_single.fq
       #fasta file for single reads
       f=/path/**LIBNAMEA**/fasta_read_single.fa
       #a single fasta file for paired reads
       p=/path/**LIBNAMEA**/pairs_in_one_file.fa
       [LIB]
       avg_ins=2000
       reverse_seq=1
       asm_flags=2
       rank=2
       # cutoff of pair number for a reliable connection
       #(default 5 for large insert size)
       pair_num_cutoff=5
       #minimum aligned length to contigs for a reliable read location
       #(default 35 for large insert size)
       map_len=35
       q1=/path/**LIBNAMEB**/fastq_read_1.fq
       q2=/path/**LIBNAMEB**/fastq_read_2.fq
       q=/path/**LIBNAMEB**/fastq_read_single.fq
       f=/path/**LIBNAMEB**/fasta_read_single.fa

Appendix B: output files

       1. Output files from the command “pregraph”

       a. *.kmerFreq

       Each row shows the number of Kmers with a frequency equal to the row number.

       b. *.edge

       Each record gives the information of an edge in the pre-graph: length, Kmers on both
       ends, average kmer coverage, whether it is identical to its reverse complement, and the
       sequence.

       c. *.markOnEdge & *.path

       These two files are for using reads to solve small repeats

       e. *.preArc

       Connections between edges which are established by the read paths.

       f. *.vertex

       Kmers at the ends of edges.

       g. *.preGraphBasic

       Some basic information about the pre-graph: number of vertex, K value,  number  of  edges,
       maximum read length etc.
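
       The *.kmerFreq file described under item a can be inspected with a few lines of code.
       This sketch assumes one integer count per row, with row 1 holding the number of Kmers
       seen once. The function names are hypothetical, and looking for the peak after the
       initial drop is a common heuristic for ignoring low-frequency (likely erroneous)
       Kmers, not something the assembler itself documents.

```python
def read_kmer_freq(rows):
    """Map frequency (1-based row number) to the number of Kmers at that frequency."""
    return {number: int(row.split()[0])
            for number, row in enumerate(rows, start=1) if row.strip()}


def main_peak(freq):
    """Return the frequency with the most Kmers, skipping the initial descending slope."""
    keys = sorted(freq)
    start = 0
    # Walk down the low-frequency slope until the counts stop decreasing.
    while start + 1 < len(keys) and freq[keys[start + 1]] < freq[keys[start]]:
        start += 1
    return max(keys[start:], key=lambda f: freq[f])


# Made-up counts for illustration; real input would come from open("prefix.kmerFreq").
rows = ["900", "300", "50", "80", "200", "350", "300", "120", "30"]
freq = read_kmer_freq(rows)
print(main_peak(freq))
```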

       2. Output files from the command “contig”

       a. *.contig

       Contig information: corresponding edge index, length, kmer coverage, whether it is a
       tip, and the sequence. Either a contig or its reverse-complementary counterpart is
       included. The index of each reverse-complementary contig is indicated in the
       *.ContigIndex file.

       b. *.Arc

       Arcs coming out of each edge and their corresponding coverage by reads

       c. *.updated.edge

       Some  information  for  each  edge  in graph: length, Kmers at both ends, index difference
       between the reverse-complementary edge and this one.

       d. *.ContigIndex

       Each record gives information about a contig in *.contig: its edge index, length, and
       the index difference between its reverse-complementary counterpart and itself.

       3. Output files from the command “map”

       a. *.peGrads

       Information  for  each  clone  library: insert-size, read index upper bound, rank and pair
       number cutoff for a reliable link.

       This file can be revised manually for scaffolding tuning.

       b. *.readOnContig

       Read locations on contigs. Contigs are referred to here by their edge index. About half
       of these indices are not listed in the *.contig file, because only one of each
       reverse-complementary pair is included there.

       c. *.readInGap

       This file includes reads that could be located in gaps between contigs.  This  information
       will be used to close gaps in scaffolds.

       4. Output files from the command “scaff”

       a. *.newContigIndex

       Contigs are sorted by length before scaffolding. Their new indices are listed in this
       file. This is useful if one wants to match contigs in *.contig with those in *.links.

       b. *.links

       Links between contigs which are established by read pairs. New indices are used.

       c. *.scaf_gap

       Contigs in gaps, found using the contig graph output by the contiging procedure. New
       indices are used here.

       d. *.scaf

       Contigs for each scaffold: contig index (concordant with the index in *.contig),
       approximate start position on the scaffold, orientation, contig length, and its links
       to others.

       e. *.gapSeq

       Gap sequences between contigs.

       f. *.scafSeq

       Sequence of each scaffold.

AUTHOR

       Olivier Sallou (olivier.sallou (at) irisa.fr) - Man page and packaging