Provided by: spamoracle_1.4-14build3_amd64

NAME

       spamoracle.conf - SpamOracle configuration file format

DESCRIPTION

       The spamoracle.conf file is a configuration file governing the operation of the
       spamoracle(1) e-mail classification tool.  By default, spamoracle(1) looks for the
       configuration file at $HOME/.spamoracle.conf, but an alternate location can be
       specified using the -config flag to spamoracle(1).

       Important note: most of the configuration parameters should not be  modified  lightly,  as
       this  may  result  in  completely  wrong e-mail classification.  Familiarity with Graham's
       filtering algorithm, as described in the paper referenced at the  end  of  this  page,  is
       required to really understand the effect of the parameters.

SYNTAX

       The  spamoracle.conf  file  is  composed  of  lines  of  the form variable = value.  Lines
       starting with a hash sign (#) are treated  as  comments  and  ignored.   Blank  lines  are
       ignored.

       Depending on the type of the variable (see the list of variables below), the value part
       takes one of the following forms:

       string A sequence of characters.  Blanks (spaces, tabs) at the beginning and the end of
              the string are ignored.  Alternatively, the string can be enclosed in double quotes
              ("), in which case blanks are not trimmed.  Inside quoted strings, backslashes (\)
              and double quotes (") must be escaped with a backslash, as in \\ or \".

       boolean
              Either  on, yes, true, or 1 to activate the boolean option, or off, no, false, or 0
              to deactivate it.

       integer
              A decimal integer.

       float  A decimal floating-point number.

       regexp A regular expression in emacs(1) syntax.  The repetition operators are *, +, and ?.
              Alternation  is  written \| and grouping is written \(...\).  Character classes are
              written between brackets [...]  as usual.   A  single  dot  denotes  any  character
              except newline.  Regular expressions are case-insensitive.
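
       The syntax rules above can be illustrated with a small Python sketch.  It is not
       spamoracle's own parser (spamoracle is a separate program with its own
       implementation); the function names, the sample file contents, and the choices to
       match boolean words case-sensitively and to treat only lines whose very first
       character is # as comments are assumptions made for illustration.

```python
def parse_string(raw):
    """Interpret a string value: trim blanks, or unquote and unescape."""
    value = raw.strip(" \t")
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        out, i, body = [], 0, value[1:-1]
        while i < len(body):
            if body[i] == "\\" and i + 1 < len(body):
                i += 1  # skip the backslash, keep the escaped character
            out.append(body[i])
            i += 1
        return "".join(out)
    return value


TRUE_WORDS = {"on", "yes", "true", "1"}
FALSE_WORDS = {"off", "no", "false", "0"}


def parse_boolean(raw):
    """Map the documented boolean spellings to True or False."""
    word = raw.strip(" \t")
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(f"not a boolean: {raw!r}")


def parse_config(text):
    """Return a dict mapping variable names to raw (unconverted) values."""
    settings = {}
    for line in text.splitlines():
        # Blank lines and comment lines are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'variable = value': {line!r}")
        settings[name.strip()] = value
    return settings


# Hypothetical file contents, for illustration only.
SAMPLE = '''\
# Hypothetical sample configuration
spam_header = X-Spam
summarize_attachment = off
num_meaningful_words = 15
database_file = "/home/user/my mail/.spamoracle.db"
'''

settings = parse_config(SAMPLE)
database_file = parse_string(settings["database_file"])
summarize = parse_boolean(settings["summarize_attachment"])
```

       Conversion is kept separate from line splitting because the type of a value depends
       on which variable it is assigned to.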

CONFIGURABLE PARAMETERS

       database_file
              (type string, default value $HOME/.spamoracle.db)
              The  location  of  the  file that contains the database of word frequencies used by
              spamoracle(1).

       html_retain_tags
              (type boolean, default value false)
              In HTML-formatted e-mails and attachments, the names of HTML tags are normally  not
              treated  as  words  and  are  ignored  for  the word frequency calculations. If the
              html_retain_tags parameter is set to true, HTML tags (such  as  img  or  bold)  are
              treated as words and included in the computation of word frequencies.

       html_tag_attributes
              (type regexp, default value
              a/href\|img/src\|img/alt\|frame/src\|font/face\|font/color)
              This  regular  expression matches pairs of HTML tags and HTML attributes written as
              tag/attribute.  When scanning HTML-formatted e-mails and attachments, attributes to
              HTML  tags  are normally ignored, unless the tag/attribute pair matches the regular
              expression html_tag_attributes.  If the tag/attribute pair matches this regexp, the
              value  of the attribute (for instance, the URL for the a/href attribute) is scanned
              for words.
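
              To see which tag/attribute pairs the default value selects, the following
              Python sketch translates the subset of emacs regexp syntax described under
              SYNTAX into Python's re syntax and tests a few pairs.  The helper names are
              hypothetical, and requiring the whole tag/attribute string to match is an
              assumption; spamoracle may anchor its matches differently.

```python
import re


def emacs_to_python(pattern):
    """Translate a simple emacs-style regexp into a compiled Python regex."""
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \| \( \) are operators in emacs syntax and are written bare in Python.
            out.append(nxt if nxt in "|()" else "\\" + nxt)
            i += 2
        elif ch in "|(){}":
            # Bare |, (, ), { and } are literal in emacs syntax; escape them for Python.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    # Regular expressions in spamoracle.conf are case-insensitive.
    return re.compile("".join(out), re.IGNORECASE)


DEFAULT_HTML_TAG_ATTRIBUTES = (
    r"a/href\|img/src\|img/alt\|frame/src\|font/face\|font/color"
)
regex = emacs_to_python(DEFAULT_HTML_TAG_ATTRIBUTES)


def attribute_is_scanned(tag, attribute):
    """True if the value of this attribute would be scanned for words."""
    # Assumption: the whole "tag/attribute" string must match.
    return regex.fullmatch(f"{tag}/{attribute}") is not None


for tag, attribute in [("a", "href"), ("IMG", "SRC"), ("a", "title")]:
    print(tag, attribute, attribute_is_scanned(tag, attribute))
```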

       mail_headers
              (type regexp, default value from:\|subject:)
              A regular expression determining which headers of an e-mail message are scanned for
              words.

       spam_header
              (type string, default value X-Spam)
              The  name of the header that spamoracle mark adds to incoming e-mail messages, with
              the results of the spam/non-spam classification.

       attachments_header
              (type string, default value X-Attachments)
              The name of the header that spamoracle mark adds to incoming e-mail messages,  with
              the one-line summary of attachment types, names and character sets.  The generation
              of this header can be turned off with the summarize_attachment parameter.

       summarize_attachment
              (type boolean, default value true)
              If this parameter is set, spamoracle mark  generates  a  one-line  summary  of  the
              attachments  of  the  incoming  messages,  and  inserts this summary in the message
              headers.  Setting this parameter to false disables the  generation  of  this  extra
              header.

       num_meaningful_words
              (type integer, default value 15)
              Maximal  number  of  "meaningful"  words  that  are retained for computing the spam
              probability.  During mail analysis, spamoracle extracts all words of  the  message,
              and  retains  those whose spam frequency (frequency of occurrence in spam messages)
              is closest to 1 or to 0.  At most num_meaningful_words such "meaningful" words  are
              retained.

       max_repetitions
              (type integer, default value 2)
              Maximum  number  of  times  a given word can occur in the set of "meaningful" words
              retained for computing the spam probability.  The default value of 2 means that  at
              most 2 occurrences of the same word will be retained.
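
              The interaction between num_meaningful_words and max_repetitions can be
              sketched in Python as follows.  This is an illustration of the selection rule
              as described here, not spamoracle's actual code: the function name is
              hypothetical, "closest to 1 or to 0" is expressed as "farthest from 0.5", and
              skipping words that are absent from the database is an assumption.

```python
def select_meaningful(message_words, spam_freq,
                      num_meaningful_words=15, max_repetitions=2):
    """Pick the word occurrences whose spam frequency is farthest from 0.5.

    message_words: every word occurrence of the message, duplicates included.
    spam_freq: dict mapping known words to their spam frequency in [0, 1].
    """
    # Rank each occurrence by its distance from the neutral value 0.5.
    ranked = sorted(
        (word for word in message_words if word in spam_freq),
        key=lambda word: abs(spam_freq[word] - 0.5),
        reverse=True,
    )
    kept, times_kept = [], {}
    for word in ranked:
        if len(kept) >= num_meaningful_words:
            break
        # Retain at most max_repetitions occurrences of the same word.
        if times_kept.get(word, 0) >= max_repetitions:
            continue
        times_kept[word] = times_kept.get(word, 0) + 1
        kept.append(word)
    return kept


# Hypothetical message and frequencies, for illustration only.
words = ["viagra", "viagra", "viagra", "hello", "meeting", "the"]
freqs = {"viagra": 0.99, "hello": 0.4, "meeting": 0.05, "the": 0.5}
print(select_meaningful(words, freqs, num_meaningful_words=3))
```

              Here the third occurrence of the most extreme word is skipped because of the
              repetition cap, which leaves room for a different word.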

       low_freq_limit
              (type float, default value 0.01)

       high_freq_limit
              (type float, default value 0.99)
              The spam frequency of a word is computed as the number of occurrences in spam
              divided by the number of occurrences in all messages.  This ratio is then clipped to
              the  interval [ low_freq_limit, high_freq_limit ], so that words that are extremely
              rare or extremely common in spam do not bias the probability computation too  much.
              The  default values of 0.01 and 0.99 are adequate for a corpus of a few thousand e-
              mails.  For larger corpora (e.g. 10000 e-mails), the values  0.001  and  0.999  may
              give better results.
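
              A minimal Python sketch of the clipping step, following the formula as stated
              in this section.  The function name and the error raised for a word that was
              never seen are assumptions made for illustration.

```python
def spam_frequency(spam_count, total_count,
                   low_freq_limit=0.01, high_freq_limit=0.99):
    """Occurrences in spam divided by occurrences overall, clipped to the limits."""
    if total_count <= 0:
        raise ValueError("word never seen; no frequency to compute")
    ratio = spam_count / total_count
    # Clip the ratio to the interval [low_freq_limit, high_freq_limit].
    return min(max(ratio, low_freq_limit), high_freq_limit)


print(spam_frequency(50, 50))   # seen only in spam: clipped to high_freq_limit
print(spam_frequency(0, 40))    # never seen in spam: clipped to low_freq_limit
print(spam_frequency(3, 4))     # inside the interval: left unchanged
```

              Without clipping, a word with frequency exactly 0 or 1 would by itself force
              the product-based combination in Graham's paper to 0 or 1, regardless of the
              other words.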

       min_meaningful_words
              (type integer, default value 5)
              Minimum  number  of  "meaningful"  words  below  which  spamoracle  mark refuses to
              classify the e-mail and outputs "unknown" status.  This happens with very short  e-
              mails, or e-mails that consist exclusively of links and pictures.

       good_mail_prob
              (type float, default value 0.2)
              Spam probability below which the e-mail is classified as non-spam.

       spam_mail_prob
              (type float, default value 0.8)
              Spam  probability  above  which  the  e-mail is classified as spam.  Messages whose
              probability falls between  good_mail_prob  and  spam_mail_prob  are  classified  as
              "unknown".

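              The way min_meaningful_words, good_mail_prob and spam_mail_prob fit together
              can be sketched in Python.  The combination rule is the one given in Graham's
              paper; that spamoracle combines frequencies in exactly this way is an
              assumption, and the function names and returned labels are hypothetical.

```python
from math import prod


def combined_probability(freqs):
    """Combination rule from Graham's paper: prod(p) / (prod(p) + prod(1 - p))."""
    spam = prod(freqs)
    good = prod(1.0 - p for p in freqs)
    # With frequencies clipped strictly inside (0, 1), the sum is never zero.
    return spam / (spam + good)


def classify(freqs, min_meaningful_words=5,
             good_mail_prob=0.2, spam_mail_prob=0.8):
    """Classify a message from the spam frequencies of its meaningful words."""
    # Too few meaningful words: refuse to classify.
    if len(freqs) < min_meaningful_words:
        return "unknown"
    probability = combined_probability(freqs)
    if probability < good_mail_prob:
        return "non-spam"
    if probability > spam_mail_prob:
        return "spam"
    return "unknown"


# Hypothetical frequency lists, for illustration only.
print(classify([0.99] * 5))
print(classify([0.01] * 5))
print(classify([0.99, 0.99]))
```
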
AUTHOR

       Xavier Leroy <Xavier.Leroy@inria.fr>

SEE ALSO

       spamoracle(1)

       http://www.paulgraham.com/spam.html (Paul Graham's seminal paper)

                                                                               SPAMORACLE.CONF(5)