Provided by: mlv-smile_1.47-5_amd64

NAME

       mlv-smile - inference of structured signals in multiple sequences

SYNOPSIS

       mlv-smile <parameter_file>
       mlv-smile [-g number]

DESCRIPTION

       This manual page briefly documents the mlv-smile command. For more details and examples, see the
       documentation files installed with it.

       mlv-smile is a program that was primarily written to extract promoter sequences from DNA sequences.
       Its main interest is that it simultaneously infers several motifs (called boxes) that respect distance
       constraints. The user writes in a parameter_file the list of criteria that the signal must satisfy.
       In a first step (extraction), all signals satisfying these criteria are found. In a second step, they
       are all statistically evaluated in order to detect those that are exceptionally represented in the
       original sequences. Since version 1.4, mlv-smile can extract such signals over any alphabet and in
       any kind of sequence.

OPTIONS

       The program normally expects a parameter file that contains all the required criteria. The only option is:

       -g number
              writes to standard output a generic parameter file for extracting signals made of number boxes.

HOW TO

   How to use mlv-smile?
       The only command you will use is 'mlv-smile'. It takes a single argument: the name of a parameter
       file describing the characteristics of the motifs you want to extract.

   How to start?
       You  first  have to write an alphabet file, which contains the alphabet used to describe the motifs. Then
       you have to write a parameter file, and you're ready to use mlv-smile.

   What should I write in the alphabet file?
       The first line should give the type of the alphabet's elements, one of "Nucleotides", "Proteins" or
       "Others". This allows mlv-smile to replace, for instance, the "A or G" symbol with an R in DNA
       sequences. Then, on each following line, write one element of the motif alphabet.

       Example: if you want to extract simple motifs (A,C,G,T) from clean DNA sequences written with a
       four-letter alphabet (A,C,G,T), then you may write an alphabet file containing:
           Type:Nucleotides
           A
           C
           G
           T
       Let's call this file 'alpha'.
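       The same file can also be produced by a short script. The following Python sketch is only an
       illustration: the helper name alphabet_text is hypothetical, and the trailing newline is an
       assumption, not a documented requirement of mlv-smile.

```python
from pathlib import Path

VALID_TYPES = ("Nucleotides", "Proteins", "Others")


def alphabet_text(symbols, kind="Nucleotides"):
    """Return alphabet-file text: a 'Type:' line, then one motif symbol per line."""
    if kind not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}, got {kind!r}")
    return "\n".join([f"Type:{kind}", *symbols]) + "\n"


if __name__ == "__main__":
    # Writes the 'alpha' file used in the examples of this page.
    Path("alpha").write_text(alphabet_text(["A", "C", "G", "T"]))
```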

   How to write a simple parameter file?
       You first have to write an alphabet file. You also need a sequence file in FASTA format. Then you
       can create a parameter file, using the "mlv-smile -g number_of_boxes" command as a starting point.

       Example: Let's write a parameter file to extract simple motifs. If you  don't  already  have  one,  let's
       first create a small DNA file in FASTA format, containing several sequences:

           > Seq A
           AGGCTAGTCAGGGCATGCGATCAGCAGGCATCAGGCGAGCATCGACAGCA
           > Seq B
           GGAGAGCGCAGAGCGAGCATCATCATGCAGCATCAGAGATCTTTCT
       Let's call this file 'seq'.
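       Before running an extraction, it can be useful to check that the sequence file really holds at
       least two sequences. The Python sketch below parses the FASTA text shown above; read_fasta is a
       hypothetical helper written for this page, not part of mlv-smile.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; its sequence may span several lines.
            records.append([line[1:].strip(), ""])
        elif records:
            records[-1][1] += line
    return [(header, seq) for header, seq in records]


SEQ = """> Seq A
AGGCTAGTCAGGGCATGCGATCAGCAGGCATCAGGCGAGCATCGACAGCA
> Seq B
GGAGAGCGCAGAGCGAGCATCATCATGCAGCATCAGAGATCTTTCT
"""

records = read_fasta(SEQ)
if len(records) < 2:
    raise SystemExit("the sequence file must contain at least two sequences")
```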

       Our goal is now to extract from these sequences all motifs of length 13 that appear at least once
       in 100% of the sequences, allowing one substitution. We may write the following parameter file (with
       the help of the 'mlv-smile -g 1' command):
           FASTA file          seq     // previously created
           Output file         results

           Alphabet file       alpha   //previously created
           Quorum              100
           Total min length    13
           Total max length    13
           Total substitutions 1
           Boxes               1
       Let's call this file 'param'.
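       If you need to generate many parameter files, the criteria can be kept in a script. This Python
       sketch reproduces the layout of the example above; parameter_text is a hypothetical helper, and
       the column alignment is copied from the example rather than known to be required. The
       'mlv-smile -g' command remains the reference for the expected layout.

```python
def parameter_text(settings):
    """Format (name, value) pairs as aligned 'name value' lines."""
    width = max(len(name) for name, _ in settings) + 1
    return "".join(f"{name:<{width}}{value}\n" for name, value in settings)


SIMPLE_CRITERIA = [
    ("FASTA file", "seq"),
    ("Output file", "results"),
    ("Alphabet file", "alpha"),
    ("Quorum", 100),
    ("Total min length", 13),
    ("Total max length", 13),
    ("Total substitutions", 1),
    ("Boxes", 1),
]

if __name__ == "__main__":
    print(parameter_text(SIMPLE_CRITERIA), end="")
```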

   How to extract a simple motif?
       You can launch "mlv-smile" once the alphabet and parameter files have been created.

       Example: With the previous alphabet, sequence and parameter files, you can now launch mlv-smile:
       "mlv-smile param". You will obtain the following motifs in the "results" file:
           GCGAGCATCAACA 2120210310010 2
           Seq     1   Pos    12
           Seq     0   Pos    34
           2
           GCGAGCATCGTCA 2120210312310 2
           Seq     1   Pos    12
           Seq     0   Pos    34
           2
       The first motif found, GCGAGCATCAACA, appears at position 12 in the second sequence and at position 34
       in the first one (all position and sequence numbers start at zero).
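       These occurrences can be checked by hand. The Python sketch below scans the two example sequences
       for windows that differ from a motif by at most one substitution. It is a naive scan written for
       illustration, not the suffix-tree algorithm used by mlv-smile. The encode function reflects an
       observation on this example only: the digit string printed after each motif matches the index of
       each letter in the alphabet file (A=0, C=1, G=2, T=3).

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def occurrences(motif, sequences, max_subst):
    """Return 0-based (sequence index, position) pairs matching motif
    with at most max_subst substitutions."""
    hits = []
    for seq_index, seq in enumerate(sequences):
        for pos in range(len(seq) - len(motif) + 1):
            if hamming(motif, seq[pos:pos + len(motif)]) <= max_subst:
                hits.append((seq_index, pos))
    return hits


def encode(motif, alphabet="ACGT"):
    """Digit string giving the alphabet index of each motif symbol
    (assumed meaning of the second column of the results file)."""
    return "".join(str(alphabet.index(symbol)) for symbol in motif)


SEQUENCES = [
    "AGGCTAGTCAGGGCATGCGATCAGCAGGCATCAGGCGAGCATCGACAGCA",
    "GGAGAGCGCAGAGCGAGCATCATCATGCAGCATCAGAGATCTTTCT",
]

hits = occurrences("GCGAGCATCAACA", SEQUENCES, max_subst=1)
print(hits)
```

       Both (0, 34) and (1, 12) are among the reported pairs, in agreement with the "Seq" and "Pos" lines
       of the results file.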

   How to evaluate the significance of the motifs found?
       You have to add some evaluation lines at the end of the parameter file.

       Example: At the bottom of the previous "param" parameter file, you can add:
           Shufflings  100
           Size k-mer  2
       which means that the original sequences will be shuffled 100 times, conserving dinucleotides. The
       significance of the motifs found previously will be computed from their frequency of occurrence in
       the shuffled sequences. The more shufflings you request, the more stable the results, but the longer
       the computation.

       For this example, you may find results such as these (in the "results.shuffle" file):
           STATISTICS ON THE NUMBER OF SEQUENCES HAVING AT LEAST ONE OCCURRENCE
           Model          %right  #right %shfl. #shfl. Sigma Chi2 Z-score
           ==============================================================
           GCGAGCATCGTCA  100.00%    2   0.50%   0.01  0.10  3.96   19.90
           GCGAGCATCAACA  100.00%    2   1.00%   0.02  0.14  3.92   14.07

           STATISTICS ON THE TOTAL NUMBER OF OCCURRENCES
           Model            #right  #shfl. Sigma   Chi2    Z-score
           =======================================================
           GCGAGCATCGTCA        2   0.01   0.10    1.99    19.90
           GCGAGCATCAACA        2   0.02   0.14    1.96    14.07

       The first block of results shows the statistics on the number of sequences having at least one
       occurrence. For each motif found, you can read its frequency of occurrence in the original and
       shuffled sequences, and the two statistical scores (Chi2 and Z-score) derived from them. The second
       block gives the same statistics for the total number of occurrences. Motifs are sorted by decreasing
       Z-score. A high Z-score means that the motif is surprisingly frequent in the original sequences.
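       The Z-scores in the table above are consistent with the usual definition, (observed - mean) / sigma,
       where mean and sigma are taken over the shuffled sequence sets and sigma is the sample standard
       deviation. The Python sketch below illustrates this under stated assumptions: the per-shuffle
       counts are hypothetical (the output only reports their mean and sigma), and the formula is
       inferred from the numbers shown here, not taken from the mlv-smile source.

```python
from statistics import mean, stdev


def z_score(observed, shuffled_counts):
    """(observed - mean) / sample standard deviation of the shuffled counts."""
    return (observed - mean(shuffled_counts)) / stdev(shuffled_counts)


# Hypothetical outcome of 100 shufflings: the motif shows up in one shuffled
# set (mean 0.01) or in two shuffled sets (mean 0.02), and never otherwise.
one_hit = [1] + [0] * 99
two_hits = [1, 1] + [0] * 98

print(f"{z_score(2, one_hit):.2f}")   # mean 0.01, sigma about 0.10
print(f"{z_score(2, two_hits):.2f}")  # mean 0.02, sigma about 0.14
```

       With these counts the two values round to 19.90 and 14.07, the Z-scores printed for GCGAGCATCGTCA
       and GCGAGCATCAACA.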

   How to extract structured motifs?
       The parameter file should be modified to indicate the characteristics of the structured motifs to  infer.
       You have to write global parameters for the whole motif, and local parameters for each box of it.

       Example: Let's extract from the previous "seq" sequences structured motifs composed of 2 boxes of
       length 5 to 6, where the whole motif must have a length of 11. The two boxes may be separated by 10 to
       15 nucleotides. To allow at most one substitution in each box, and to require at least one occurrence
       of a motif in 100% of the sequences, you may write the following parameter file:
           FASTA file          seq
           Output file         results

           Alphabet file       alpha
           Quorum              100
           Total min length    11
           Total max length    11
           Total substitutions 2
           Boxes               2

           BOX 1 ================
           Min length          5
           Max length          6
           Substitutions       1
           Min spacer length   10
           Max spacer length   15

           BOX 2 ================
           Min length          5
           Max length          6
           Substitutions       1
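       To see how the global and local length criteria interact, the Python sketch below lists the box
       lengths that satisfy both. It is an illustration of the constraints only: the helper name is
       hypothetical, all bounds are assumed finite (the special value 0 for "infinity" is not handled),
       and nothing here reflects how mlv-smile explores these combinations internally.

```python
from itertools import product


def box_length_combinations(box_ranges, total_min, total_max):
    """List the per-box lengths whose sum lies within the total length bounds.

    box_ranges is a list of (min length, max length) pairs, one per box.
    """
    choices = (range(lo, hi + 1) for lo, hi in box_ranges)
    return [lengths for lengths in product(*choices)
            if total_min <= sum(lengths) <= total_max]


combos = box_length_combinations([(5, 6), (5, 6)], total_min=11, total_max=11)
print(combos)

# Spacers are not counted in the total length, so an occurrence covers
# between 11 + 10 and 11 + 15 symbols of the sequence.
spans = (11 + 10, 11 + 15)
print(spans)
```

       Only the combinations (5, 6) and (6, 5) remain: two boxes of length 5, or two of length 6, would
       give a total of 10 or 12 and are excluded by the total length of 11.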

PARAMETER FILE CRITERIA

       FASTA file <filename>
              The name of the file which contains the sequences to use for inference. These sequences must be
              in FASTA format. This file must contain at least two sequences, as motifs common to several
              sequences cannot be detected in a single sequence.

       Output file <filename>
              The name of the file where results of extraction will be written.

       Alphabet file <filename>
              The name of the file that tells mlv-smile on which alphabet it will infer motifs. The first
              line of this file contains "Type:" followed by the type of symbols you use, one of
              "Nucleotides", "Proteins" or "Others". Each following line lists the sequence symbols that may
              be matched by one symbol of the motif alphabet. A line containing "ANR" means that there is a
              symbol in the motif alphabet which matches A, N or R in the sequences. If Type is set to
              Nucleotides, mlv-smile will display this ANR symbol as an A to make it more readable. These
              associations are printed at the beginning of the execution.

       Quorum <number>
              The percentage of sequences in which at least one occurrence of a motif must appear for the
              motif to be valid. 100 means that a motif must have occurrences in every sequence.

       Total min length <number>
              The minimal length of the whole motif, counted as the sum of the lengths of its boxes. Warning:
              the length of the gaps between boxes must not be taken into account. The total minimal length
              may differ from the sum of the boxes' minimal lengths: you can, for instance, infer motifs
              made of two boxes, each with a min length of 4, and a total min length of 10.

       Total max length <number>
              Same explanation as "Total min length", except that a length of 0 means "infinity".

       Total substitutions <number>
              Total maximum number of substitutions for the whole motif. As for the total length,  this  is  not
              necessarily the sum of each box's substitution number.

       Boxes <number>
              The number of boxes that compose the motifs to infer. When inferring simple one-box motifs, it
              is not necessary to use local criteria, as global and local criteria will be the same.

       Composition in <symbol> <number> [OPTIONAL]
              The number of occurrences of a given symbol of the motif alphabet may be limited to a maximum
              by this criterion.

       BOX <number>
              Begin the description of the criteria of a given box of the motif.

       Min length <number>
              Minimum length for the current box.

       Max length <number>
              Same explanation as "Min length", except that a length of 0 means "infinity".

       Substitutions <number>
              Maximum number of substitutions allowed for the current box.

       Composition in <symbol> <number> [OPTIONAL]
              Same as the global composition, but for the current box.

       Min spacer length <number>
              Minimum  number  of  symbols between the end of the current box and the beginning of the next one.
              This parameter mustn't appear in the last box's criteria, which has no next box!

       Max spacer length <number>
              Same explanation as "Min spacer length", but for the maximum.

       Delta <number>  [OPTIONAL]
              This criterion allows one to infer motifs composed of several boxes without knowing precisely
              the distance between these boxes. The min and max spacer lengths are used as a "large"
              interval, and the delta value defines the size of the small intervals inside this large one.
              An inference of two-box motifs with a [10-20] range of distance between the boxes produces
              motifs whose occurrences respect this range. A "Delta" criterion set to 2, for instance,
              performs the same inference in all the possible ranges [i-delta, i+delta] (here: [10-14],
              [11-15], ...). One output file is produced for each range.

       Palindrome of box <number>  [OPTIONAL]
              Indicates that the box concerned must be the biological palindrome of one of the previous boxes.

       Shufflings <number> [OPTIONAL]
              The number of shufflings of the original sequences to perform for the evaluation of the
              statistical significance of the motifs found.

       Size k-mer <number> [OPTIONAL, always with shuffling]
              Length of the words to conserve during shufflings (usually 2).

       Against wrong sequences <filename> [OPTIONAL]
              Another method to evaluate the significance of the motifs (not compatible with the shuffling
              method). If you have a sequence file in which you believe the motifs you are looking for in
              the first sequence set will not appear, you can give it to mlv-smile. The statistical
              evaluation of the motifs found will then be made by computing their frequency in these "wrong
              sequences".
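       The ranges explored by the "Delta" criterion described above can be listed with a few lines of
       Python. This is a sketch of the [i-delta, i+delta] rule as stated in this page; the function name
       is hypothetical, and the assumption that the last small range ends exactly at the max spacer
       length is ours.

```python
def delta_ranges(min_spacer, max_spacer, delta):
    """List the small ranges [i - delta, i + delta] that fit inside
    the large interval [min_spacer, max_spacer]."""
    return [(i - delta, i + delta)
            for i in range(min_spacer + delta, max_spacer - delta + 1)]


ranges = delta_ranges(10, 20, 2)
for low, high in ranges:
    print(f"[{low}-{high}]")
```

       For a [10-20] interval and a delta of 2, this gives seven ranges, from [10-14] to [16-20], so
       seven output files would be expected under this reading.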

WARNING

       mlv-smile is an exact combinatorial algorithm. It is not designed to infer arbitrary motifs. The amount
       of data from which motifs are extracted can be very large, but some criteria (in particular the number
       of substitutions) must be kept to reasonable values: one or two substitutions in a motif of length 10
       is fine, but not 6 or 8. The notion of spacers exists to avoid the use of too many substitutions.
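       A rough way to see why the number of substitutions matters is to count how many distinct words
       lie within a given number of substitutions of one motif. The Python sketch below computes this
       count for a four-letter alphabet. It illustrates the combinatorial growth only and is not a model
       of the actual running time of mlv-smile.

```python
from math import comb


def neighborhood_size(length, max_subst, alphabet_size=4):
    """Number of words of the given length that differ from a fixed word
    in at most max_subst positions."""
    return sum(comb(length, i) * (alphabet_size - 1) ** i
               for i in range(max_subst + 1))


for subst in (1, 2, 6):
    print(subst, neighborhood_size(10, subst), "of", 4 ** 10)
```

       For a motif of length 10, one substitution covers 31 words and two cover 436, whereas six cover
       235012 of the 1048576 possible words, so almost any motif would find occurrences.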

BUGS

       A bug was found in version 1.46 which could generate wrong results in some particular cases. In
       particular, results may be wrong for inconsistent length criteria. There are probably still many bugs
       in mlv-smile. Version 1.47 is quite stable, but do not hesitate to report any bug to <lama AT
       prism.uvsq DOT fr>.

SEE ALSO

       This software has been implemented from an algorithm proposed in

       L. Marsan and M.-F. Sagot, "Algorithms for extracting structured motifs using a suffix tree with
       application to promoter and regulatory site consensus identification", J. of Comput. Biol. 7, 2001,
       345-360

       Refer to this paper for algorithmic details. In short, the extraction step of mlv-smile is exact,
       which means that all motifs satisfying the given criteria are found. Please cite this article if
       you publish results produced by mlv-smile.

       For some examples of applications to biological data (with good results), refer to

       A. Vanet and L. Marsan and M.-F. Sagot, "Promoter sequences and algorithmical methods for identifying
       them", Research in Microbiology 150, 1999, 779-799

       and

       A. Vanet and L. Marsan and A. Labigne and M.-F. Sagot, "Inferring regulatory elements from a whole
       genome. An application to the analysis of genome of Helicobacter Pylori Sigma 80 family of promoter
       signals", J. Mol. Biol. 297, 2000, 335-353

AUTHOR

       This  manual  page  was  written by  Laurent Marsan <lama AT prism.uvsq DOT fr>, for the Debian GNU/Linux
       system (but may be used by others).

                                                                                                        SMILE(1)