Ubuntu Manpage: crac v. 2.5.2 - Crac is a tool to analyse RNA-Seq data provided by NGS.

Provided by: crac_2.5.2+dfsg-6build1_amd64

NAME

       crac v. 2.5.2 - Crac is a tool to analyse RNA-Seq data provided by NGS.

SYNOPSIS

       crac [ options ] -i <index_file> -r <reads_file1> [reads_file2] -k <int> -o <output_file>

       crac -h|--help
       crac -f|--full-help
       crac -v|--version
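
       As an illustration of the first synopsis form, the following Python sketch assembles the argument
       list for a basic run. It uses only the mandatory options documented below. The function name, the
       file names and the value of k are hypothetical and chosen for the example only.

```python
def build_crac_command(index_prefix, reads, kmer_length, output_sam):
    """Return the argument list for a basic CRAC run (mandatory options only).

    index_prefix: index name without extension, as expected by -i
    reads: one reads file (single) or two reads files (paired)
    """
    if not 1 <= len(reads) <= 2:
        raise ValueError("expected one reads file (single) or two (paired)")
    return ["crac", "-i", index_prefix, "-r", *reads,
            "-k", str(kmer_length), "-o", output_sam]


# Hypothetical file names and k value, for illustration only.
cmd = build_crac_command("my_index", ["reads_1.fastq.gz", "reads_2.fastq.gz"],
                         22, "out.sam")
print(" ".join(cmd))
```

       The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where crac is
       installed.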

DESCRIPTION

       CRAC: an integrated approach to the analysis of RNA-seq reads

       Whatever the biological questions it addresses, each RNA-seq analysis requires a computational prediction
       of  either  small  scale mutations, indels, splice junctions or fusion RNAs. This prediction is currently
       performed using complex pipelines  involving  multiple  tools  for  mapping,  coverage  computation,  and
       prediction  at  distinct  steps.   We  propose  a  novel  way  of analyzing reads that integrates genomic
       locations and local coverage, and delivers all above mentioned predictions in a single step. Our program,
       CRAC, uses a double k-mer profiling approach to detect candidate  mutations,  indels,  splice  or  fusion
       junctions  in  each  single read.  Compared to existing tools, CRAC provides state of the art sensitivity
       and improved precision for all types of predictions, yielding high  rates  of  true  positive  candidates
       (99.5%  for  splice  junctions).  When  applied  to  four  breast cancer libraries, CRAC recovered 74% of
       validated fusion RNAs and predicted recurrent fusion junctions that were overlooked in previous studies.
       Importantly, CRAC improves its predictive performance when supplied with e.g. 200 nt reads and should fit
       future needs of read analyses.

SPECIAL OPTIONS

       Like many programs, CRAC has many optional parameters, but only a few are mandatory. This document is
       intended to help users of CRAC choose the most appropriate parameters for their needs:

       -h, --help
              print the main help page of CRAC

       -f, --full-help
              print the complete help page of CRAC

       -v, --version
              print the version of CRAC

USUAL ARGUMENTS

   Mandatory options
       All these flags must be set.

       -i <index_file>
              is the name of the index previously built with the crac-index binary. Note that crac-index
              constructs the structure <index_file.ssa> together with its configuration <index_file.conf>,
              so only the prefix <index_file> must be specified (without extension) for CRAC to load both
              the structure and the configuration files

       -r <reads_file1> [reads_file2]
              is the source(s) of the FASTA or FASTQ file(s) containing the reads. Note that the number of
              files depends on whether the reads are single or paired. The input files may also be
              compressed using gzip

       -k <int>
              is the length of the k-mers used to map the reads on the reference <index_file>. Note that the
              condition k < m, where m is the read length, is necessary: reads (or both paired reads) are
              ignored if m < k. The value of k must be chosen to ensure (as much as possible) that a k-mer
              has a very high probability of occurring only once in the genome

       -o, --sam <output_file>
              is the output file in SAM format (see the Documentation of SAM format in CRAC for more
              details); use the "-o -" argument to print on STDOUT
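
       The advice on choosing k can be made concrete with a back-of-the-envelope estimate. The sketch
       below assumes a uniformly random genome, which real genomes are not (repeats make k-mers less
       unique than this model suggests), so it gives an optimistic lower bound on k rather than the rule
       CRAC applies. The function names and the 0.01 threshold are hypothetical.

```python
def expected_random_hits(genome_length, k, both_strands=True):
    """Expected number of chance occurrences of one fixed k-mer in a
    uniformly random genome (assumption: independent, equiprobable bases)."""
    positions = max(genome_length - k + 1, 0)
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** k


def smallest_k(genome_length, max_expected_hits=0.01):
    """Smallest k whose expected number of chance occurrences is at most
    max_expected_hits (a hypothetical threshold, not a CRAC parameter)."""
    k = 1
    while expected_random_hits(genome_length, k) > max_expected_hits:
        k += 1
    return k


# Roughly human-sized genome (3 Gb); prints the model's lower bound on k.
print(smallest_k(3_000_000_000))
```

       Under this model, each additional nucleotide in k divides the expected number of chance
       occurrences by four.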

   Optional parameters
       --stranded
              must be specified if reads are produced by a stranded RNA-Seq protocol (not stranded by
              default)

       --fr/--rf/--ff
              set the mates alignment orientation (--rf by default)

       -m, --reads-length <int>
              is the read length. If all reads have the same length, we strongly recommend specifying it
              with this parameter: CRAC will then run much faster. For variable-length or longer reads,
              reads shorter than <int> are ignored and reads longer than <int> are trimmed

       --treat-multiple <int>
              display alignments with multiple locations (with a fixed limit) rather than a single alignment per
              read in the SAM file

       --nb-threads <int>
              is the number of threads used to run crac; the computation time is divided by almost the
              number of threads (one thread by default)

       --max-locs <int>
              corresponds to the maximum number of occurrences retrieved in the index for a given k-mer:
              smaller is faster, but with a small value you may miss some locations that would help CRAC
              detect the right cause

       --no-ambiguity <none>
              discard biological events (splice, snv, indel, chimera) which have several matches on the
              reference index. If crac has identified a biological cause in the read that can match in
              different places of the genome, this cause is classified as an undetermined biological event.
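
       The effect of the read length parameter on reads of varying length can be sketched as follows.
       This is an illustration of the documented behaviour (shorter reads ignored, longer reads
       trimmed), not CRAC's code. Which end of a long read is trimmed is not stated here, so keeping
       the first m bases is an assumption, and the function name is hypothetical.

```python
def apply_read_length(reads, m):
    """Mimic the documented effect of a fixed read length m on a list of
    read sequences: shorter reads are dropped, longer reads are trimmed."""
    kept = []
    for read in reads:
        if len(read) < m:
            continue  # shorter than m: ignored
        # Assumption: trimming keeps the first m bases of the read.
        kept.append(read[:m])
    return kept


print(apply_read_length(["ACGT", "ACGTAC", "ACG"], 4))
```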

   Optional output arguments
       --gz <none>
              all output files specified after this argument are gzipped (including the SAM file if the
              -o/--sam argument is specified after it)

       --bam <none>
              SAM output is encoded in binary format (BAM)

       --summary <output_file>
              save some statistics about mapping and classification

       --show-progressbar <none>
              show a progress bar for the process times on STDERR

       --use-x-in-cigar <none>
              use X cigar operator when CRAC identifies a mismatch

   Optional output homemade file formats
       --all <base_filename>
              set the output base filename for all the following causes. Note that only a base_filename
              must be specified; the appropriate file extension is then added for each cause (SNP,
              chimera, splice, etc.)

       --normal <output_file>
              save reads that do not contain any break

       --almost-normal <output_file>
              save reads that do not contain any break but with a variable support

       --single <output_file>
              save reads located as follows: at least --min-percent-single-loc <float> of the k-mers are
              located exactly once on the reference index

       --duplicate <output_file>
              save reads located as follows: at least --min-percent-duplication-loc <float> of the k-mers
              are located a few times on the reference index (i.e. between --min-duplication <int> and
              --max-duplication <int> locations)

       --multiple <output_file>
              save reads located as follows: at least --min-percent-multiple-loc <float> of the k-mers are
              located many times on the reference index (i.e. more than --max-duplication <int>
              locations)

       --none <output_file>
              save reads which are not located on the reference index

       --snv <output_file>
              save reads that contain at least one SNV

       --indel <output_file>
              save reads that contain at least one biological indel

       --splice <output_file>
              save reads that contain at least one splicing junction

       --weak-splice <output_file>
              save reads that contain at least one low-coverage splicing junction

       --chimera <output_file>
              save reads that contain at least one chimera junction (junction on different chromosomes,
              strands or genes)

       --paired-end-chimera <output_file>
              save paired-end reads that contain a chimera in the non-sequenced part of the original
              fragment

       --biological <output_file>
              save reads that contain a biological cause but for which there is not enough information to
              be more specific. Note that the biological cause is described for each read

       --errors <output_file>
              save reads that contain at least one sequence error

       --repeat <output_file>
              save reads that contain a repeated sequence: at least --min-percent-repetition-loc <float>
              of the k-mers of a given read have at least --min-repetition <int> occurrences on the
              reference index

       --undetermined <output_file>
              save  reads that contain an undetermined error: some k-mers are not located on the genome, but the
              reason for that could not be determined. Note that the error is described for each read

       --nothing <output_file>
              save reads that are unclassified
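
       The location-based outputs (--single, --duplicate, --multiple, --repeat) all depend on the
       proportion of k-mers of a read whose number of locations falls in a given range. The sketch below
       is one reading of those descriptions, using the default values listed under "Additional settings
       for users". It is not CRAC's implementation: how CRAC prioritises a read that satisfies several
       criteria is not described here, so the function simply returns every criterion that holds. The
       function and parameter names are hypothetical.

```python
def location_categories(kmer_locations,
                        min_percent_single=0.15,
                        min_duplication=2, max_duplication=9,
                        min_percent_duplication=0.15,
                        min_percent_multiple=0.50,
                        min_repetition=20,
                        percent_repetition=0.20):
    """Return the set of location criteria met by a read.

    kmer_locations: number of locations on the reference index for each
    k-mer of the read (0 means the k-mer is not located).
    """
    n = len(kmer_locations)
    if n == 0:
        return set()

    def proportion(predicate):
        # Proportion of the read's k-mers whose location count matches.
        return sum(1 for count in kmer_locations if predicate(count)) / n

    categories = set()
    if proportion(lambda c: c == 1) >= min_percent_single:
        categories.add("single")
    if proportion(lambda c: min_duplication <= c <= max_duplication) >= min_percent_duplication:
        categories.add("duplicate")
    if proportion(lambda c: c > max_duplication) >= min_percent_multiple:
        categories.add("multiple")
    if proportion(lambda c: c >= min_repetition) >= percent_repetition:
        categories.add("repeat")
    return categories


# Ten k-mers: four unique, two duplicated, four highly repeated.
print(sorted(location_categories([1, 1, 1, 1, 3, 3, 25, 25, 25, 25])))
```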

   Optional process for specific research
       --deep-snv
              must be specified to increase the sensitivity of SNV detection at the cost of more
              computation (substitutions only, no indels yet). This process searches for SNVs in
              border-case reads, which would otherwise be classified as bioundetermined

       --stringent-chimera
              must be specified to increase the accuracy of chimera junction detection at the cost of
              sensitivity and computation time

   Optional process launcher (only one may be selected)
       --emt  launch an exact matching processing of reads on the index. Either the argument specified with
              -k is equal to 0, which means that the entire read must map perfectly on the genome, or only
              a factor of length k per read is mapped (the first one with a location) and the rest is
              soft-clipped. With this process, reads are not indexed, which provides a low memory
              consumption. Note that this kind of method is very useful for DGE reads mapping.

       --server
              launch a server to query a given read more precisely. This process is useful for debugging.
              Note that the output arguments will not be taken into account. Use --input-name-server
              <string> to set the input fifo name (classify.fifo by default) and --output-name-server
              <string> to set the output fifo name (classify.out.fifo by default). The server can then be
              used through the crac-client client

   Additional settings for users
       --detailed-sam
              more information is added to the SAM output file. See the Documentation of SAM format in
              CRAC for more details

       --min-percent-single-loc <float>
              is the minimum proportion of k-mers that must be uniquely mapped on the index for a read to
              be considered uniquely mapped (0.15 by default)

       --min-duplication <int>
              is the minimum number of locations to consider a k-mer duplicated (2 by default)

       --max-duplication <int>
              is the maximum number of locations to consider a k-mer duplicated (9 by default)

       --min-percent-duplication-loc <float>
              is the minimum proportion of k-mers that must be duplicated on the index for a read to be
              considered duplicated (0.15 by default)

       --min-percent-multiple-loc <float>
              is the minimum proportion of k-mers that must be multiply mapped on the index for a read to
              be considered “multiple” (0.50 by default)

       --min-repetition <int>
              is the minimum number of locations to consider a k-mer repeated (20 by default)

       --max-percent-repetition-loc <float>
              is the minimum proportion of k-mers of a given read that must be repeated on the index for
              the read to be considered a repetition (0.20 by default)

       --max-splice-length <int>
              is the threshold to consider a splice, i.e. a splice is reported if the junction length is
              below --max-splice-length <int>; a chimera is considered otherwise (300 kb by default)

       --max-bio-indel <int>
              is the threshold to consider a biological indel, i.e. an indel is reported if the gap length
              is below --max-bio-indel; a splice is considered otherwise (15 by default)

       --max-bases-retrieved <int>
              is the number of nucleotides to display in the output file in case of insertion (15 by
              default)

       --min-support-no-cover <float>
              is the minimum coverage to be able to report a biological  cause.  Note  that  if  a  single  read
              contains a given substitution, it is difficult (if not impossible) to distinguish a sequence error
              and a biological cause (1.30 by default)
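
       The interplay between --max-bio-indel and --max-splice-length can be summarised by a small
       decision function. This is a sketch of the thresholds as documented above, not CRAC's code: it
       only covers a gap between two locations on the same chromosome and strand, it reads "below" as a
       strict comparison, and it takes the 300 kb default as 300000 nucleotides. All three points are
       assumptions, and the function name is hypothetical.

```python
def classify_gap(gap_length, max_bio_indel=15, max_splice_length=300_000):
    """Classify a gap (in nucleotides) between two collinear locations.

    Assumptions: strict comparisons, and 300 kb taken as 300_000 nt.
    Junctions across chromosomes or strands are out of scope here.
    """
    if gap_length < max_bio_indel:
        return "indel"
    if gap_length < max_splice_length:
        return "splice"
    return "chimera"


for gap in (3, 2_000, 1_000_000):
    print(gap, classify_gap(gap))
```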

   Additional settings for advanced users
       --min-break-length <int>
              is  the  minimal  break  length  (as the percentage of k, the k-mer length) so that a cause can be
              reported. Theoretically, for a given cause, the break length  is  always  >=  (kmer_length  -  1).
              Otherwise,  the  break may be merged with a close enough break, or the break will be considered as
              undetermined. (0.5 by default)

       --max-bases-randomly-matched <int>
              A k-mer overlapping an exon-exon junction, for example, may still  match  on  the  genome  if  the
              overlap  is  at the end of the read (without loss of generality). This is due to the fact that the
              nucleotides starting the second exon may be the same  as  the  nucleotides  starting  the  intron.
              Theoretically,  there is a 0.25 probability that we have the same nucleotide at the first position
              of the intron and the exon. This option specifies how many nucleotides may be matched randomly  at
              most

       --max-extension-length <int>
              is the maximum number of k-mers extended at each side of a read break. For a given break,
              k-mers with false locations can generate false biological causes, so the consistency is
              checked on each side of the break to discard false k-mers and adjust the boundaries of the
              break (10 by default)

       --nb-tags-info-stored <int>
              is the size of the buffer used to store information for each thread during the computing
              phase (1000 by default). This value must be increased if threads work below their real
              capabilities. For example, with --nb-threads 15, CPU usage should be about 1400%

       --reads-index <string>
              is the reads index data structure used by CRAC. Available reads indexes are JELLYFISH and
              GKARRAYS (JELLYFISH by default).

       --nb-nucleotides-snp-comparison <int>
              is the minimum k-mer length tolerated for the deep SNV search (8 by default)

       --max-number-of-merges <int>
              is the maximum number of merges tolerated during the break merge process for the chimera
              detection (4 by default)

       --min-score-chimera-stringent <float>
              is the minimal score to consider a chimera event; otherwise it is classified as a
              bioundetermined event (0.6 by default)

AUTHOR

   About the crac package.
       You  can  contact Nicolas PHILIPPE, Mikael SALSON, Jerome AUDOUX and Alban MANCHERON by sending an e-mail
       to <crac-bugs@lists.gforge.inria.fr>.

       Programming:
               Nicolas PHILIPPE <nphilippe.research@gmail.com>
               Mikaël SALSON    <mikael.salson@lifl.fr>
               Jérome AUDOUX    <jerome.audoux@gmail.com>
       with additional contribution for the packaging of:
               Alban MANCHERON  <alban.mancheron@lirmm.fr>

   About the crac publication.
       You may cite the following papers if you use our tool:

       Gk-arrays: Querying large read collections in main memory: a versatile
       data structure
       Philippe N., Salson M., Lecroq T., Leonard M., Commes T., Rivals E.
       BMC Bioinformatics 2011, 12:242.

       Crac: An integrated RNA-Seq read analysis
       Philippe N., Salson M., Commes T., Rivals E.
       Genome Biology 2013; 14:R30.

                                                   2024-04-01                                            crac(1)
