Provided by: crac_2.5.2+dfsg-5_amd64

NAME

       crac v. 2.5.2 - Crac is a tool to analyse RNA-Seq data provided by NGS.

SYNOPSIS

       crac [ options ] -i <index_file> -r <reads_file1> [reads_file2] -k <int> -o <output_file>

       crac -h|--help
       crac -f|--full-help
       crac -v|--version

DESCRIPTION

       CRAC: an integrated approach to the analysis of RNA-seq reads

       Whatever  the  biological  questions  it  addresses,  each  RNA-seq  analysis  requires  a
       computational prediction of either small scale  mutations,  indels,  splice  junctions  or
       fusion  RNAs.  This  prediction  is  currently performed using complex pipelines involving
       multiple tools for mapping, coverage computation, and prediction at  distinct  steps.   We
       propose  a  novel  way  of  analyzing  reads  that  integrates genomic locations and local
       coverage, and delivers all above mentioned predictions in  a  single  step.  Our  program,
       CRAC, uses a double k-mer profiling approach to detect candidate mutations, indels, splice
       or fusion junctions in each single read.  Compared to existing tools, CRAC provides  state
       of  the art sensitivity and improved precision for all types of predictions, yielding high
       rates of true positive candidates (99.5% for  splice  junctions).  When  applied  to  four
       breast  cancer  libraries,  CRAC  recovered  74%  of  validated  fusion RNAs and predicted
       recurrent fusion junctions that were overlooked in previous studies. Importantly, CRAC
       improves  its  predictive  performance when supplied with e.g. 200 nt reads and should fit
       future needs of read analyses.

SPECIAL OPTIONS

       Like much software, CRAC has many optional parameters, but only three are mandatory.
       This document is intended to help users of CRAC choose the most appropriate parameters
       for their needs:

       -h, --help
              print the main help page of CRAC

       -f, --full-help
              print the complete help page of CRAC

       -v, --version
              print the version of CRAC

USUAL ARGUMENTS

   Mandatory options
       All these flags must be set.

       -i <index_file>
              is the name of the index previously built with the crac-index binary. Note that
              crac-index constructs the structure <index_file.ssa> together with its
              configuration <index_file.conf>, so only the prefix <index_file> must be specified
              (without extension) for CRAC to find both the structure and the configuration files

       -r <reads_file1> [reads_file2]
              is the FASTA or FASTQ file (or files) containing the reads. The number of files
              depends on whether the reads are single or paired. The input files may also be
              compressed with gzip

       -k <int>
              is the length of the k-mers used to map the reads on the reference <index_file>.
              The condition k < m, where m is the read length, is required: reads (or both reads
              of a pair) are ignored if m < k. Choose k so that, as far as possible, a k-mer has
              a very high probability of occurring only once in the genome

       -o, --sam <output_file>
              is the output file in SAM format (see the Documentation of SAM format in CRAC for
              more details); use "-o -" to print to STDOUT
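
       The advice on choosing -k can be explored numerically. The following is a minimal sketch
       and not part of CRAC: it assumes a uniform random genome (independent bases, each with
       probability 0.25), which real genomes are not, so it only gives a lower bound on a
       sensible k. The function names, the 0.01 threshold and the genome size are illustrative
       assumptions.

```python
def expected_random_hits(genome_size: int, k: int, both_strands: bool = True) -> float:
    """Expected occurrences of one fixed k-mer in a uniform random genome (assumed model)."""
    positions = max(genome_size - k + 1, 0)   # number of k-mer start positions
    strands = 2 if both_strands else 1        # optionally count the reverse strand too
    return strands * positions / 4 ** k


def smallest_k(genome_size: int, max_expected: float = 0.01, both_strands: bool = True) -> int:
    """Smallest k whose expected number of chance occurrences is <= max_expected."""
    if max_expected < 0:
        raise ValueError("max_expected must be non-negative")
    k = 1
    while expected_random_hits(genome_size, k, both_strands) > max_expected:
        k += 1
    return k


if __name__ == "__main__":
    # 3.1e9 is a rough, illustrative genome size, not a value taken from CRAC.
    print(smallest_k(3_100_000_000))
```

       Repeats and duplicated regions make real k-mers less unique than this model suggests, so
       a value somewhat above the computed bound is the safer reading of the result.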

   Optional parameters
       --stranded
              must be specified if the reads were produced by a stranded RNA-Seq protocol (not
              stranded by default)

       --fr/--rf/--ff
              set the mates alignment orientation (--rf by default)

       -m, --reads-length <int>
              should be specified for reads of fixed length: we strongly recommend giving the
              read length with -m, because CRAC is then much faster. The option may also be
              specified for reads of variable or greater length, in which case reads shorter
              than <int> are ignored and longer reads are trimmed

       --treat-multiple <int>
              display alignments with multiple locations (with  a  fixed  limit)  rather  than  a
              single alignment per read in the SAM file

       --nb-threads <int>
              is the number of threads used to run crac. Computation time is roughly divided by
              the number of threads (one thread by default)

       --max-locs <int>
              is the maximum number of occurrences retrieved from the index for a given k-mer.
              Smaller values are faster, but with a small value you may miss some locations that
              would help CRAC detect the right cause

       --no-ambiguity <none>
              discard biological events (splice, SNV, indel, chimera) that have several matches
              on the reference index. If crac identifies a biological cause in a read that can
              match in different places of the genome, the cause is classified as an
              undetermined biological event
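
       The read-length rules described for -m and -k can be summarised in a short sketch. This
       is an illustration under stated assumptions, not CRAC's code: the function name is
       hypothetical, and trimming at the end of the read is an assumption, since this page does
       not say which end is trimmed.

```python
from typing import Optional


def prepare_read(seq: str, m: int, k: int) -> Optional[str]:
    """Apply the documented length rules to one read (illustrative only)."""
    if k >= m:
        # The page requires k < m for reads to be usable.
        raise ValueError("k must be smaller than the read length m")
    if len(seq) < m:
        return None      # reads shorter than m are ignored
    return seq[:m]       # longer reads are trimmed to m (assumed: at the end)


if __name__ == "__main__":
    for read in ("ACGTACGT", "ACG"):
        print(read, "->", prepare_read(read, m=6, k=3))
```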

   Optional output arguments
       --gz <none>
              all output files specified after this argument are gzipped (including the SAM
              file, if the -o/--sam argument is specified afterwards)

       --bam <none>
              SAM output is encoded in binary format (BAM)

       --summary <output_file>
              save some statistics about mapping and classification

       --show-progressbar <none>
              show a progress bar on STDERR during processing

       --use-x-in-cigar <none>
              use X cigar operator when CRAC identifies a mismatch
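
       Because --gz only affects the output files named after it, argument order matters. The
       sketch below assembles an argument list that respects this rule using only flags
       documented on this page. The helper name and all file names are hypothetical, and the
       command is only printed, never executed.

```python
import shlex
from typing import List, Optional, Sequence


def build_crac_command(index_prefix: str, reads: Sequence[str], k: int, sam_output: str,
                       *, gzip_outputs: bool = False, threads: Optional[int] = None,
                       summary: Optional[str] = None) -> List[str]:
    """Build a crac argument list from flags documented in this page (illustrative)."""
    if not 1 <= len(reads) <= 2:
        raise ValueError("give one reads file (single) or two (paired)")
    cmd = ["crac", "-i", index_prefix, "-r", *reads, "-k", str(k)]
    if threads is not None:
        cmd += ["--nb-threads", str(threads)]
    if summary is not None:
        # Placed before --gz so that, by the documented ordering rule, it is not gzipped.
        cmd += ["--summary", summary]
    if gzip_outputs:
        cmd.append("--gz")           # must come before the outputs it should compress
    cmd += ["-o", sam_output]
    return cmd


if __name__ == "__main__":
    # Hypothetical index prefix and file names.
    argv = build_crac_command("my_index", ["r1.fastq.gz", "r2.fastq.gz"], 22,
                              "out.sam.gz", gzip_outputs=True, threads=4)
    print(shlex.join(argv))
```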

   Optional output homemade file formats
       --all <base_filename>
              set the output base filename for all the causes that follow. Only a base filename
              must be specified; the appropriate file extension is then added for each cause
              (SNP, chimera, splice, etc.)

       --normal <output_file>
              save reads that do not contain any break

       --almost-normal <output_file>
              save reads that do not contain any break but with a variable support

       --single <output_file>
              save reads for which at least --min-percent-single-loc <float> of the k-mers
              occur exactly once on the reference index

       --duplicate <output_file>
              save reads for which at least --min-percent-duplication-loc <float> of the k-mers
              occur a few times on the reference index (i.e. between --min-duplication <int>
              and --max-duplication <int> locations)

       --multiple <output_file>
              save reads for which at least --min-percent-multiple-loc <float> of the k-mers
              occur many times on the reference index (i.e. more than --max-duplication <int>
              locations)

       --none <output_file>
              save reads which are not located on the reference index

       --snv <output_file>
              save reads that contain at least one SNV

       --indel <output_file>
              save reads that contain at least one biological indel

       --splice <output_file>
              save reads that contain at least one splicing junction

       --weak-splice <output_file>
              save reads that contain at least one low-coverage splicing junction

       --chimera <output_file>
              save reads that contain at least one chimera junction (a junction between
              different chromosomes, strands or genes)

       --paired-end-chimera <output_file>
              save paired-end reads that contain a chimera in the non-sequenced part of the
              original fragment

       --biological <output_file>
              save reads that contain a biological cause for which there is not enough
              information to be more specific. The biological cause is described for each read

       --errors <output_file>
              save reads that contain at least one sequencing error

       --repeat <output_file>
              save reads that contain a repeated sequence: at least a proportion
              --min-percent-repetition-loc <float> of the k-mers of a given read have at least
              --min-repetition <int> occurrences on the reference index

       --undetermined <output_file>
              save  reads  that contain an undetermined error: some k-mers are not located on the
              genome, but the reason for that could not be determined. Note  that  the  error  is
              described for each read

       --nothing <output_file>
              save reads that are unclassified
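
       The --single, --duplicate and --multiple criteria above all compare a proportion of
       k-mers against a threshold. The sketch below applies them to a list of per-k-mer
       location counts, using the default values documented under "Additional settings for
       users". It is an illustration, not CRAC's implementation: the function name is
       hypothetical, thresholds are read as fractions between 0 and 1, and because this page
       does not say how overlapping categories are prioritised, every satisfied label is
       returned.

```python
from typing import Callable, Sequence, Set


def location_categories(loc_counts: Sequence[int], *,
                        min_percent_single_loc: float = 0.15,
                        min_duplication: int = 2,
                        max_duplication: int = 9,
                        min_percent_duplication_loc: float = 0.15,
                        min_percent_multiple_loc: float = 0.50) -> Set[str]:
    """Return every location label whose documented threshold is met (illustrative)."""
    n = len(loc_counts)
    if n == 0:
        raise ValueError("a read must have at least one k-mer")

    def fraction(pred: Callable[[int], bool]) -> float:
        # Proportion of the read's k-mers whose location count satisfies pred.
        return sum(1 for c in loc_counts if pred(c)) / n

    labels: Set[str] = set()
    if fraction(lambda c: c == 1) >= min_percent_single_loc:
        labels.add("single")
    if fraction(lambda c: min_duplication <= c <= max_duplication) >= min_percent_duplication_loc:
        labels.add("duplicate")
    if fraction(lambda c: c > max_duplication) >= min_percent_multiple_loc:
        labels.add("multiple")
    return labels


if __name__ == "__main__":
    # Made-up location counts for the k-mers of one read.
    print(sorted(location_categories([1, 1, 1, 1, 3, 3, 12, 12, 12, 12])))
```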

   Optional process for specific research
       --deep-snv
              must be specified to increase the sensitivity of SNV detection at the cost of more
              computation (substitutions only, no indels yet). This process searches for SNVs in
              borderline reads, which would otherwise be classified as bioundetermined

       --stringent-chimera
              must be specified to increase the accuracy of chimera junction detection at the
              cost of sensitivity and computation time

   Optional process launcher (one must be selected)
       --emt  launch exact matching of reads on the index. Either the argument given with -k
              is 0, which means that the entire read must map perfectly on the genome, or only
              one factor of length k per read is mapped (the first one with a location) and the
              rest is soft-clipped. With this process reads are not indexed, which keeps memory
              consumption low. This method is very useful for mapping DGE reads.

       --server
              launch a server to query a given read more precisely. This process is useful for
              debugging. Note that the output arguments are not taken into account. Use
              --input-name-server <string> to set the input fifo name (classify.fifo by default)
              and --output-name-server <string> to set the output fifo name (classify.out.fifo
              by default). The server can then be used through the crac-client client
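
       The two behaviours described for --emt (whole-read matching when -k is 0, otherwise the
       first factor of length k that has a location, with the rest soft-clipped) can be
       illustrated on a toy sequence. This is a sketch under strong assumptions and not CRAC's
       algorithm: it uses a plain substring search instead of an index, looks at the forward
       strand only, reports the first occurrence, and returns a 0-based offset. The function
       name is hypothetical.

```python
from typing import Optional, Tuple


def exact_match(read: str, genome: str, k: int = 0) -> Optional[Tuple[int, str]]:
    """Return (0-based position, CIGAR string) for an exact match, or None (toy example)."""
    if k == 0:
        # k == 0: the entire read must match exactly.
        pos = genome.find(read)
        return None if pos < 0 else (pos, f"{len(read)}M")
    for start in range(len(read) - k + 1):
        # Take the first factor of length k that occurs in the genome.
        pos = genome.find(read[start:start + k])
        if pos >= 0:
            tail = len(read) - start - k
            cigar = (f"{start}S" if start else "") + f"{k}M" + (f"{tail}S" if tail else "")
            return pos, cigar
    return None


if __name__ == "__main__":
    toy_genome = "ACGTACGTTTGA"   # made-up sequence
    print(exact_match("TTGA", toy_genome))
    print(exact_match("NNGTTTNN", toy_genome, k=4))
```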

   Additional settings for users
       --detailed-sam
              add more information to the SAM output file. See the Documentation of SAM format
              in CRAC for more details

       --min-percent-single-loc <float>
              is the minimum proportion of k-mers that must be uniquely mapped on the index for
              a read to be considered uniquely mapped (0.15 by default)

       --min-duplication <int>
              is the minimum number of locations for a k-mer to be considered duplicated (2 by
              default)

       --max-duplication <int>
              is the maximum number of locations for a k-mer to be considered duplicated (9 by
              default)

       --min-percent-duplication-loc <float>
              is the minimum proportion of k-mers that must be duplicated on the index for a
              read to be considered duplicated (0.15 by default)

       --min-percent-multiple-loc <float>
              is the minimum proportion of k-mers that must be multiply mapped on the index for
              a read to be considered “multiple” (0.50 by default)

       --min-repetition <int>
              is the minimum number of locations for a k-mer to be considered repeated (20 by
              default)

       --max-percent-repetition-loc <float>
              is the minimum proportion of k-mers that must be repeated on the index for a read
              to be considered a repetition (0.20 by default)

       --max-splice-length <int>
              is the threshold for a splice: a splice is reported if the junction length is
              below --max-splice-length <int>, and a chimera is considered otherwise (300 kb by
              default)

       --max-bio-indel <int>
              is the threshold for a biological indel: an indel is reported if the gap length
              is below --max-bio-indel <int>, and a splice is considered otherwise (15 by
              default)

       --max-bases-retrieved <int>
              is the number of nucleotides to display in the output file in case of an insertion
              (15 by default)

       --min-support-no-cover <float>
              is the minimum coverage required to report a biological cause (1.30 by default).
              Note that if only a single read contains a given substitution, it is difficult (if
              not impossible) to distinguish a sequencing error from a biological cause
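
       The --max-bio-indel and --max-splice-length thresholds together define a simple
       decision on the gap length. The sketch below illustrates it with the documented
       defaults. It is not CRAC's code: the function name is hypothetical, "300Kb" is read as
       300,000 nucleotides, "below" is read as a strict comparison, and only gaps on the same
       chromosome and strand are considered (the page also calls junctions between different
       chromosomes, strands or genes chimeras).

```python
def classify_gap(gap_length: int, *, max_bio_indel: int = 15,
                 max_splice_length: int = 300_000) -> str:
    """Classify a collinear gap by its length using the documented thresholds (illustrative)."""
    if gap_length < 0:
        raise ValueError("gap_length must be non-negative")
    if gap_length < max_bio_indel:
        return "indel"       # short gap: biological indel
    if gap_length < max_splice_length:
        return "splice"      # intermediate gap: splice junction
    return "chimera"         # gap too long to be a splice


if __name__ == "__main__":
    for gap in (3, 2_000, 1_000_000):
        print(gap, classify_gap(gap))
```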

   Additional settings for advanced users
       --min-break-length <int>
              is the minimal break length (as a fraction of k, the k-mer length) for a cause to
              be reported. Theoretically, for a given cause, the break length is always >=
              (kmer_length - 1). A shorter break may be merged with a close enough break, or it
              is considered undetermined (0.5 by default)

       --max-bases-randomly-matched <int>
              A  k-mer  overlapping  an  exon-exon  junction, for example, may still match on the
              genome if the overlap is at the end of the read (without loss of generality).  This
              is due to the fact that the nucleotides starting the second exon may be the same as
              the nucleotides starting the intron. Theoretically, there  is  a  0.25  probability
              that  we have the same nucleotide at the first position of the intron and the exon.
              This option specifies how many nucleotides may be matched randomly at most

       --max-extension-length <int>
              is the maximum number of k-mers extended on each side of a read break. For a given
              break, k-mers with false locations can generate false biological causes, so
              consistency is checked on each side of the break to discard false k-mers and
              adjust the boundaries of the break (10 by default)

       --nb-tags-info-stored <int>
              is the size of the buffer that stores information for each thread during the
              computing phase (1000 by default). Increase this value if threads work below
              their real capabilities. With --nb-threads 15, CPU usage should be about 1400%

       --reads-index <string>
              is the reads index data structure used by CRAC. Available reads indexes are
              JELLYFISH and GKARRAYS (JELLYFISH by default)

       --nb-nucleotides-snp-comparison <int>
              is the minimum k-mer length tolerated for the deep SNV search (8 by default)

       --max-number-of-merges <int>
              is the maximum number of merges tolerated during the break merge process for
              chimera detection (4 by default)

       --min-score-chimera-stringent <float>
              is the minimal score for a chimera event to be reported; otherwise the event is
              classified as bioundetermined (0.6 by default)
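
       The reasoning behind --min-break-length and --max-bases-randomly-matched is simple
       arithmetic. The sketch below restates it under the page's own simplification that each
       nucleotide matches by chance with probability 0.25, independently of the others. The
       function names are hypothetical, and treating the threshold as inclusive is an
       assumption.

```python
def random_match_probability(n_bases: int) -> float:
    """Probability that n consecutive bases match by chance (0.25 per base, assumed independent)."""
    if n_bases < 0:
        raise ValueError("n_bases must be non-negative")
    return 0.25 ** n_bases


def theoretical_min_break(k: int) -> int:
    """Theoretical lower bound on the break length stated in this page: k - 1."""
    return k - 1


def break_is_reportable(break_length: int, k: int, min_break_fraction: float = 0.5) -> bool:
    """True if the break is at least min_break_fraction * k long (0.5 is the documented default)."""
    return break_length >= min_break_fraction * k


if __name__ == "__main__":
    k = 22  # illustrative k-mer length
    print(theoretical_min_break(k), break_is_reportable(11, k), random_match_probability(3))
```

       Random matches at the edge of a junction shorten the observed break below k - 1, which
       is why a tolerance expressed as a fraction of k is useful.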

SEE ALSO

       The full documentation for crac is maintained as an org manual. If the info and crac
       programs are properly installed at your site, the command

              info crac

       should give you access to the complete manual.

AUTHOR

   About the crac package.
       You  can  contact  Nicolas  PHILIPPE,  Mikael SALSON, Jerome AUDOUX and Alban MANCHERON by
       sending an e-mail to <crac-bugs@lists.gforge.inria.fr>.

       Programming:
               Nicolas PHILIPPE <nphilippe.research@gmail.com>
               Mikaël SALSON <mikael.salson@lifl.fr>
               Jérome AUDOUX <jerome.audoux@gmail.com>
       with additional contribution for the packaging of:
               Alban MANCHERON <alban.mancheron@lirmm.fr>

   About the crac publication.
       You may cite the following papers if you use our tool:

       Gk-arrays: Querying large read collections in main memory: a versatile
       data structure
       Philippe N., Salson M., Lecroq T., Leonard M., Commes T., Rivals E.
       BMC Bioinformatics 2011, 12:242.

       Crac: An integrated RNA-Seq read analysis
       Philippe N., Salson M., Commes T., Rivals E.
       Genome Biology 2013; 14:R30.

                                            2021-11-07                                    crac(1)