Provided by: mlv-smile_1.47-8_amd64

NAME

       mlv-smile - inference of structured signals in multiple sequences

SYNOPSIS

       mlv-smile <parameter_file>
       mlv-smile [-g number]

DESCRIPTION

       This manual page briefly documents the mlv-smile command. For more details and examples,
       see the documentation files installed with it.

       mlv-smile is a program that was primarily designed to extract promoter sequences from DNA
       sequences. Its main interest is that it simultaneously infers several motifs (called
       boxes) that respect distance constraints. The user writes in a parameter_file the list
       of criteria that the signal must satisfy. In a first step (extraction), all signals
       satisfying these criteria are found. In a second step, they are all statistically
       evaluated in order to detect those that are exceptionally well represented in the
       original sequences. Since version 1.4, mlv-smile can extract such signals over any
       alphabet and in any kind of sequence.

OPTIONS

       The program normally expects a parameter file that contains all the criteria needed. The
       only option is:

       -g number
              writes to the standard output a generic parameter file for extracting signals
              made of number boxes.

HOW TO

   How to use mlv-smile?
       The only command you'll use is 'mlv-smile'. You give it a single argument: the name of
       a parameter file that describes the characteristics of the motifs you want to extract.

   How to start?
       You first have to write an alphabet file, which contains the alphabet used to describe the
       motifs. Then you have to write a parameter file, and you're ready to use mlv-smile.

   What should I write in the alphabet file?
       The first line should contain the type of the alphabet's elements, chosen among
       "Nucleotides", "Proteins", or "Others". This allows mlv-smile to change, for
       instance, the "A or G" symbol into an R in DNA sequences. Then, on each following line,
       you write one element of the motifs' alphabet.

       Example: if you want to extract simple motifs (A,C,G,T) from clean DNA sequences written
       with a four-letter alphabet (A,C,G,T), then you may write an alphabet file containing:
           Type:Nucleotides
           A
           C
           G
           T
       Let's call this file 'alpha'.
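
       The layout of the alphabet file described above can be illustrated with a small reader.
       This is only a sketch based on the description in this page, not the parser used by
       mlv-smile; the function name parse_alphabet is hypothetical.

```python
def parse_alphabet(text):
    """Parse alphabet-file text into (type, list of symbol groups).

    Assumption: the first non-empty line is "Type:<kind>" and every other
    non-empty line lists the sequence symbols matched by one motif symbol.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("Type:"):
        raise ValueError("first line must start with 'Type:'")
    kind = lines[0][len("Type:"):].strip()
    if kind not in ("Nucleotides", "Proteins", "Others"):
        raise ValueError(f"unknown alphabet type: {kind!r}")
    groups = [frozenset(line) for line in lines[1:]]
    return kind, groups


# Content of the 'alpha' file from the example above.
ALPHA = "Type:Nucleotides\nA\nC\nG\nT\n"
kind, groups = parse_alphabet(ALPHA)
print(kind, [sorted(group) for group in groups])
```

       A line such as "AG" would yield a single motif symbol matching both A and G.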

   How to write a simple parameter file?
       You first have to write an alphabet file. You also need a sequence file in FASTA
       format. Then you can create a parameter file, using the "mlv-smile -g number_of_boxes"
       command to help you.

       Example: Let's write a parameter file to extract simple motifs. If you don't already  have
       one, let's first create a small DNA file in FASTA format, containing several sequences:

           > Seq A
           AGGCTAGTCAGGGCATGCGATCAGCAGGCATCAGGCGAGCATCGACAGCA
           > Seq B
           GGAGAGCGCAGAGCGAGCATCATCATGCAGCATCAGAGATCTTTCT
       Let's call this file 'seq'.

       Our purpose is now to extract from these sequences all motifs of length 13 that appear at
       least once in 100% of the sequences, allowing one substitution. We may write the
       following parameter file (with the help of the 'mlv-smile -g 1' command):
           FASTA file          seq     // previously created
           Output file         results

           Alphabet file       alpha   //previously created
           Quorum              100
           Total min length    13
           Total max length    13
           Total substitutions 1
           Boxes               1
       Let's call this file 'param'.
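
       The three files of this example can also be created from a script. The sketch below
       writes 'alpha', 'seq' and 'param' with the contents shown above and then runs
       "mlv-smile param" only if the command is installed. The helper names write_inputs and
       run_smile are hypothetical, and the inline "//" comments of the parameter file are
       left out.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

ALPHA = "Type:Nucleotides\nA\nC\nG\nT\n"
SEQ = (
    "> Seq A\n"
    "AGGCTAGTCAGGGCATGCGATCAGCAGGCATCAGGCGAGCATCGACAGCA\n"
    "> Seq B\n"
    "GGAGAGCGCAGAGCGAGCATCATCATGCAGCATCAGAGATCTTTCT\n"
)
PARAM = """\
FASTA file          seq
Output file         results

Alphabet file       alpha
Quorum              100
Total min length    13
Total max length    13
Total substitutions 1
Boxes               1
"""


def write_inputs(workdir):
    """Write the alphabet, sequence and parameter files into workdir."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    for name, content in (("alpha", ALPHA), ("seq", SEQ), ("param", PARAM)):
        (workdir / name).write_text(content)
    return workdir


def run_smile(workdir):
    """Run 'mlv-smile param' in workdir; return None if it is not installed."""
    if shutil.which("mlv-smile") is None:
        return None
    return subprocess.run(["mlv-smile", "param"], cwd=workdir, check=False)


if __name__ == "__main__":
    run_smile(write_inputs(tempfile.mkdtemp()))
```

       The file names are relative, so the command is run from the directory that holds them.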

   How to extract a simple motif?
       Once the alphabet and parameter files have been created, you can run mlv-smile.

       Example:  With  the  previous  alphabet, sequences and parameter files, you can now launch
       mlv-smile: "mlv-smile param". You will obtain the following motifs in the "results" file:
           GCGAGCATCAACA 2120210310010 2
           Seq     1   Pos    12
           Seq     0   Pos    34
           2
           GCGAGCATCGTCA 2120210312310 2
           Seq     1   Pos    12
           Seq     0   Pos    34
           2
       The first motif found, GCGAGCATCAACA, appears at position 12 in the second sequence and
       position 34 in the first one (sequence and position numbering both start at zero).
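
       These occurrences can be checked by hand with a short script that slides the motif
       along each sequence and counts mismatches. This is a naive verification sketch, not
       the suffix-tree algorithm used by mlv-smile. The encode helper reflects an assumption:
       the digit string printed next to each motif appears to be the index of each symbol in
       the alphabet file (A=0, C=1, G=2, T=3).

```python
SEQS = [
    "AGGCTAGTCAGGGCATGCGATCAGCAGGCATCAGGCGAGCATCGACAGCA",  # Seq 0 ("Seq A")
    "GGAGAGCGCAGAGCGAGCATCATCATGCAGCATCAGAGATCTTTCT",      # Seq 1 ("Seq B")
]


def occurrences(motif, sequence, max_subst):
    """Return 0-based start positions where motif matches with at most max_subst mismatches."""
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        mismatches = sum(a != b for a, b in zip(motif, window))
        if mismatches <= max_subst:
            hits.append(start)
    return hits


def encode(motif, alphabet="ACGT"):
    """Assumed meaning of the digit string: index of each symbol in the alphabet."""
    return "".join(str(alphabet.index(symbol)) for symbol in motif)


for motif in ("GCGAGCATCAACA", "GCGAGCATCGTCA"):
    print(motif, encode(motif), [occurrences(motif, seq, 1) for seq in SEQS])
```

       Neither motif occurs exactly in the sequences: each is one substitution away from the
       word found at position 34 of the first sequence and from the word found at position 12
       of the second.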

   How to evaluate the significance of the motifs found?
       You have to add some evaluation lines at the end of the parameter file.

       Example: At the bottom of the previous "param" parameter file, you can add:
           Shufflings  100
           Size k-mer  2
       which means that the original sequences will be shuffled 100 times, conserving
       dinucleotides. The significance of the motifs found previously will be computed from their
       frequency of occurrence in the shuffled sequences. The more shufflings you perform, the
       more stable the results, but the longer the computation.

       For this example, you may find results such as these (in the "results.shuffle" file):
           STATISTICS ON THE NUMBER OF SEQUENCES HAVING AT LEAST ONE OCCURRENCE
           Model          %right  #right %shfl. #shfl. Sigma Chi2 Z-score
           ==============================================================
           GCGAGCATCGTCA  100.00%    2   0.50%   0.01  0.10  3.96   19.90
           GCGAGCATCAACA  100.00%    2   1.00%   0.02  0.14  3.92   14.07

           STATISTICS ON THE TOTAL NUMBER OF OCCURRENCES
           Model            #right  #shfl. Sigma   Chi2    Z-score
           =======================================================
           GCGAGCATCGTCA        2   0.01   0.10    1.99    19.90
           GCGAGCATCAACA        2   0.02   0.14    1.96    14.07

       The first block of results shows the statistics on the number of sequences having at least
       one occurrence; the second block shows the statistics on the total number of occurrences.
       You can read, for each motif found, its frequency of occurrence in the original and
       shuffled sequences, and two statistical scores (Chi2 and Z-score) derived from them.
       Motifs are sorted by decreasing Z-score. A high Z-score means that the motif is
       unexpectedly frequent in the original sequences.
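
       The Z-score column above is consistent with the usual definition (observed count minus
       the mean over shufflings, divided by the sample standard deviation). The sketch below
       illustrates this under stated assumptions: the per-shuffling counts are HYPOTHETICAL
       values chosen to match the tabulated means (0.01 and 0.02 over 100 shufflings), since
       the output does not list them, and the formula is not confirmed to be the one
       mlv-smile uses.

```python
from statistics import mean, stdev


def z_score(observed, shuffled_counts):
    """Standard Z-score of an observed count against counts from shuffled sequences."""
    return (observed - mean(shuffled_counts)) / stdev(shuffled_counts)


# HYPOTHETICAL data: one shuffling (resp. two) out of 100 contains the motif once.
one_hit = [1] + [0] * 99
two_hits = [1, 1] + [0] * 98

print(f"{z_score(2, one_hit):.2f}")   # compare with 19.90 in the table
print(f"{z_score(2, two_hits):.2f}")  # compare with 14.07 in the table
```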

   How to extract structured motifs?
       The parameter file should be modified to indicate the characteristics of the structured
       motifs to infer. You have to write global parameters for the whole motif, and local
       parameters for each of its boxes.

       Example: Let's extract from the previous "seq" sequences structured motifs composed of 2
       boxes of length 5 to 6 each, where the whole motif must have a length of 11. The two
       boxes may be separated by 10 to 15 nucleotides. At most one substitution is allowed in
       each box, and at least one occurrence of a motif must appear in 100% of the sequences.
       You may write the following parameter file:
           FASTA file          seq
           Output file         results

           Alphabet file       alpha
           Quorum              100
           Total min length    11
           Total max length    11
           Total substitutions 2
           Boxes               2

           BOX 1 ================
           Min length          5
           Max length          6
           Substitutions       1
           Min spacer length   10
           Max spacer length   15

           BOX 2 ================
           Min length          5
           Max length          6
           Substitutions       1

PARAMETER FILE CRITERIA

       FASTA File <filename>
              The name of the file which contains the sequences to use for inference. These
              sequences must be in FASTA format. This file must contain at least two
              sequences, as you cannot detect motifs which are common to several sequences in one
              sequence!

       Output file <filename>
              The name of the file where results of extraction will be written.

       Alphabet file <filename>
              The name of the file that tells mlv-smile on which alphabet it will infer
              motifs. The first line of this file contains "Type:" followed by the type of
              symbols you use, chosen among "Nucleotides", "Proteins" or "Others". Then each
              following line lists the symbols of the sequences that may be matched by one
              symbol of a motif. A line containing "ANR" means that there is a
              symbol in the motif's alphabet which matches A, N or R in the sequences. If Type is
              set to Nucleotides, mlv-smile will change this ANR symbol into an A to make
              it more readable. These associations are printed at the beginning of the
              execution.

       Quorum <number>
              The percentage of sequences where at least one occurrence of a motif must appear to
              make it valid. 100 means that a motif must have occurrences in every sequence.

       Total min length <number>
              The minimal length of the whole motif, i.e. the minimal number of symbols summed
              over all its boxes. Warning: the length of the gaps between boxes must not be
              taken into account. The total minimal length may differ from the sum of the
              boxes' minimal lengths: you can, for instance, infer motifs made of two boxes,
              each with a min length of 4, and a total min length of 10.

       Total max length <number>
              Same explanation as "Total min length", except that a length of 0 means "infinity".

       Total substitutions <number>
              Total maximum number of substitutions for the whole motif. As for the total length,
              this is not necessarily the sum of each box's substitution number.

       Boxes <number>
              The  number  of  boxes that compose the motifs to infer.  When inferring simple one
              box motifs, it's not necessary to use local criteria as global and  local  criteria
              will be the same.

       Composition in <symbol> <number> [OPTIONAL]
              The number of occurrences of a given symbol of the motif's alphabet may be limited
              to a maximum with this criterion.

       BOX <number>
              Begins the description of the criteria of a given box of the motif.

       Min length <number>
              Minimum length for the current box.

       Max length <number>
              Same explanation as "Min length", except that a length of 0 means "infinity".

       Substitutions <number>
              Maximum number of substitutions allowed for the current box.

       Composition in <symbol> <number> [OPTIONAL]
              Same as the global composition, but for the current box.

       Min spacer length <number>
              Minimum number of symbols between the end of the current box and the  beginning  of
              the  next  one. This parameter mustn't appear in the last box's criteria, which has
              no next box!

       Max spacer length <number>
              Same explanation as "Min spacer length", but for the maximum number of symbols.

       Delta <number>  [OPTIONAL]
              This criterion allows one to infer motifs composed of several boxes without really
              knowing the distance between these boxes. The min and max spacer lengths are
              used as a "large" interval, and the delta value defines the size of small
              intervals inside this large one. An inference of two-box motifs with a [10-20]
              range of distance between the boxes will produce motifs whose occurrences respect
              this range. A "Delta" criterion set to 2, for instance, will perform the same
              inference in all the possible ranges [i-delta, i+delta] (here: [10-14], [11-15],
              ...). One output file is produced for each range.

       Palindrome of box <number>  [OPTIONAL]
              Indicates that the current box must be the biological palindrome of one of the
              previous boxes.

       Shufflings <number> [OPTIONAL]
              The number of shufflings of the original sequences to perform when evaluating
              the statistical significance of the motifs found.

       Size k-mer <number> [OPTIONAL, always with shuffling]
              Length of the words to conserve during shufflings (usually 2).

       Against wrong sequences <filename> [OPTIONAL]
              Another method to evaluate the significance of the motifs (not compatible with the
              shuffling method). If you have a sequence file in which you believe that the
              motifs you are looking for in the first set of sequences will not appear, you can
              give this file to mlv-smile. The statistical evaluation of the motifs found will
              then be made by computing their frequency in the "wrong sequences".
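
       The small intervals produced by the Delta criterion described above can be enumerated
       as follows. This sketch assumes that every window [i-delta, i+delta] must lie entirely
       inside the large [min spacer, max spacer] interval, which matches the ranges quoted in
       the Delta entry; the function name is hypothetical.

```python
def delta_ranges(min_spacer, max_spacer, delta):
    """Enumerate the [i - delta, i + delta] windows lying inside the large interval."""
    return [(i - delta, i + delta)
            for i in range(min_spacer + delta, max_spacer - delta + 1)]


# Large interval [10-20] with Delta 2, as in the Delta entry.
print(delta_ranges(10, 20, 2))
```

       Under this assumption the [10-20] interval with a delta of 2 yields seven ranges, from
       [10-14] to [16-20].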

WARNING

       mlv-smile is an exact combinatorial algorithm. It is not designed to infer arbitrary kinds
       of motifs. The amount of data on which the extraction is run can be very large, but some
       criteria (in particular the number of substitutions) must be kept to reasonable values:
       one or two substitutions allowed in a motif of length 10 is fine, but not 6 or 8
       substitutions. The notion of spacers exists to avoid the use of too many substitutions.
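
       A rough way to see why the number of substitutions matters is to count how many
       distinct words lie within a given number of substitutions of one motif. This is a
       back-of-the-envelope illustration, not a model of the actual running time of
       mlv-smile; the function name is hypothetical.

```python
from math import comb


def neighborhood_size(length, max_subst, alphabet_size=4):
    """Number of words within max_subst substitutions of a fixed word."""
    return sum(comb(length, k) * (alphabet_size - 1) ** k
               for k in range(max_subst + 1))


# Motif of length 10 on a four-letter alphabet.
for max_subst in (1, 2, 6, 8):
    print(max_subst, neighborhood_size(10, max_subst))
```

       For a motif of length 10, one substitution gives 31 words and two give 436, while
       larger values quickly approach the 4^10 possible words, so the criteria no longer
       select anything specific.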

BUGS

       A bug has been found in version 1.46, which could generate wrong results in some
       particular cases. In particular, results may be wrong for incoherent length criteria.
       There are probably still many bugs in mlv-smile. Version 1.47 is quite stable,
       but do not hesitate to report any bug to <lama AT prism.uvsq DOT fr>.

SEE ALSO

       This software implements an algorithm proposed in

       L. Marsan and M.-F. Sagot, "Algorithms for extracting structured motifs using a suffix
       tree with application to promoter and regulatory site consensus identification", J. of
       Comput. Biol. 7, 2001, 345-360

       Refer to this paper for algorithmic details. In short, the extraction step of mlv-smile
       is exact, which means that all motifs respecting the given criteria are found. Please
       cite this article if you publish results obtained with mlv-smile.

       For some examples of applications we made on biological data (with good results), refer
       to

       A. Vanet and L. Marsan and M.-F. Sagot, "Promoter sequences and algorithmical methods for
       identifying them", Research in Microbiology 150, 1999, 779-799

       and

       A. Vanet and L. Marsan and A. Labigne and M.-F. Sagot, "Inferring regulatory elements from
       a whole genome. An application to the analysis of genome of Helicobacter Pylori Sigma 80
       family of promoter signals", J. Mol. Biol. 297, 2000, 335-353

AUTHOR

       This manual page was written by Laurent Marsan <lama AT prism.uvsq DOT fr>, for the
       Debian GNU/Linux system (but may be used by others).

                                                                                         SMILE(1)