Provided by: spamoracle_1.6-1build1_amd64

NAME

       spamoracle.conf - SpamOracle configuration file format

DESCRIPTION

       The  spamoracle.conf  file  is  a  configuration  file  governing  the  operation  of  the
       spamoracle(1) e-mail classification tool.  By default, spamoracle(1) looks for the
       configuration file at $HOME/.spamoracle.conf, but an alternate location can be specified
       using the -config flag.

       Important note: most of the configuration parameters should not be  modified  lightly,  as
       this  may  result  in  completely  wrong e-mail classification.  Familiarity with Graham's
       filtering algorithm, as described in the paper referenced at the  end  of  this  page,  is
       recommended to fully understand the effect of the parameters.

SYNTAX

       The  spamoracle.conf  file  is  composed  of  lines  of  the form variable = value.  Lines
       starting with a # sign are treated as comments and ignored.  Blank lines are ignored.

       Depending on the type of the variable (see the list of variables below),  the  value  part
       takes one of the following forms:

       string A  sequence  of  characters.  Blanks (spaces, tabs) at the beginning and the end of
              the string are ignored.  Alternatively, the string can be enclosed in double quotes
              ("), in which case spaces are not trimmed.  Inside quoted strings, backslashes (\)
              and double quotes (") must be escaped with a backslash, as in \\ or \".

       boolean
              Either on, yes, true, or 1 to activate the boolean option, or off, no, false, or  0
              to deactivate it.

       integer
              A decimal integer.

       float  A decimal floating-point number.

       regexp A regular expression in emacs(1) syntax.  The repetition operators are *, +, and ?.
              Alternation is written \| and grouping is written \(...\).  Character  classes  are
              written  between  brackets  [...]   as  usual.   A single dot denotes any character
              except newline.  Regular expressions are case-insensitive.
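
       As an illustration of the syntax rules above, the following Python sketch reads such a
       file into a dictionary of strings.  It is not part of spamoracle(1): the function names
       and the sample settings are hypothetical, and the sketch assumes that boolean keywords
       are case-sensitive and that a # starts a comment only in the first column.

```python
TRUE_WORDS = {"on", "yes", "true", "1"}
FALSE_WORDS = {"off", "no", "false", "0"}


def parse_value(raw):
    """Decode the value part of a `variable = value` line as a string."""
    raw = raw.strip()  # blanks at both ends are ignored
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        out, chars = [], iter(raw[1:-1])
        for ch in chars:
            # A backslash escapes the character that follows it.
            out.append(next(chars, "") if ch == "\\" else ch)
        return "".join(out)
    return raw


def parse_bool(value):
    """Map the documented boolean keywords to True or False."""
    if value in TRUE_WORDS:
        return True
    if value in FALSE_WORDS:
        return False
    raise ValueError(f"not a boolean: {value!r}")


def parse_config(text):
    """Return a dict mapping each variable name to its string value."""
    settings = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comments are ignored
        name, sep, raw = line.partition("=")
        if not sep:
            raise ValueError(f"not of the form variable = value: {line!r}")
        settings[name.strip()] = parse_value(raw)
    return settings


# Hypothetical settings, for illustration only.
SAMPLE = r"""# sample configuration
database_file = /home/alice/.spamoracle.db
spam_header = "X-Spam"

summarize_attachment = off
low_freq_limit = 0.001
"""

config = parse_config(SAMPLE)
print(config["database_file"])
print(parse_bool(config["summarize_attachment"]))
print(float(config["low_freq_limit"]))
```

       Values are kept as strings and converted on demand (parse_bool, int, float), because
       the type of a value depends on which variable it belongs to.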

CONFIGURABLE PARAMETERS

       database_file
              (type string, default value $HOME/.spamoracle.db )
              The location of the file that contains the database of  word  frequencies  used  by
              spamoracle(1).

       html_retain_tags
              (type boolean, default value false)
              In  HTML-formatted e-mails and attachments, the names of HTML tags are normally not
              treated as words and are ignored  for  the  word  frequency  calculations.  If  the
              html_retain_tags  parameter  is  set  to  true, HTML tags (such as img or b) are
              treated as words and included in the computation of word frequencies.

       html_tag_attributes
              (type regexp, default value
              a/href\|img/src\|img/alt\|frame/src\|font/face\|font/color)
              This regular expression matches pairs of HTML tags and HTML attributes  written  as
              tag/attribute.  When scanning HTML-formatted e-mails and attachments, attributes to
              HTML tags are normally ignored, unless the tag/attribute pair matches  the  regular
              expression html_tag_attributes.  If the tag/attribute pair matches this regexp, the
              value of the attribute (for instance, the URL for the a/href attribute) is  scanned
              for words.
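
              The matching rule can be sketched in Python as follows.  This is an
              illustration, not the code used by spamoracle(1): the helper names are
              hypothetical, the translation from emacs(1) syntax covers only the operators
              listed in the SYNTAX section, and the sketch assumes that the regexp must match
              the whole tag/attribute pair.

```python
import re


def emacs_to_python(pattern):
    """Translate the Emacs-style operators \\| \\( \\) to Python's | ( )."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \| \( \) are operators in Emacs syntax; keep other escapes as-is.
            out.append(nxt if nxt in "|()" else ch + nxt)
            i += 2
        elif ch in "|(){}":
            # Literal characters in Emacs syntax, but special in Python.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


DEFAULT_TAG_ATTRIBUTES = (
    r"a/href\|img/src\|img/alt\|frame/src\|font/face\|font/color"
)


def attribute_is_scanned(tag, attribute, pattern=DEFAULT_TAG_ATTRIBUTES):
    """Tell whether the value of tag/attribute would be scanned for words."""
    regex = re.compile(emacs_to_python(pattern), re.IGNORECASE)
    return regex.fullmatch(f"{tag}/{attribute}") is not None


print(attribute_is_scanned("a", "href"))
print(attribute_is_scanned("div", "class"))
```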

       mail_headers
              (type regexp, default value from:\|subject:)
              A regular expression determining which headers of an e-mail message are scanned for
              words.

       alternative_favor_html
              (type boolean, default value true)
              Determines how multipart/alternative messages are treated.  If this parameter is
              set, and one part of the alternative is of type text/html, this part is scanned and
              all other parts are ignored.  In all other cases, all parts of the alternative  are
              scanned.
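
              The rule above can be sketched as a small Python function.  The function name
              and the representation of message parts as (content type, body) pairs are
              assumptions made for this illustration; the sketch also assumes that the first
              text/html part is the one retained when several are present.

```python
def parts_to_scan(parts, alternative_favor_html=True):
    """Select the parts of one multipart/alternative that get scanned.

    `parts` is a list of (content_type, body) pairs.
    """
    if alternative_favor_html:
        for part in parts:
            if part[0].lower() == "text/html":
                return [part]  # the HTML part wins; the others are ignored
    # Parameter unset, or no text/html part: scan every alternative.
    return list(parts)


message = [("text/plain", "Hello"), ("text/html", "<p>Hello</p>")]
print(parts_to_scan(message))
print(parts_to_scan(message, alternative_favor_html=False))
```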

       spam_header
              (type string, default value X-Spam)
              The  name of the header that spamoracle mark adds to incoming e-mail messages, with
              the results of the spam/non-spam classification.

       attachments_header
              (type string, default value X-Attachments)
              The name of the header that spamoracle mark adds to incoming e-mail messages,  with
              the one-line summary of attachment types, names and character sets.  The generation
              of this header can be turned off with the summarize_attachment parameter.

       summarize_attachment
              (type boolean, default value true)
              If this parameter is set, spamoracle mark  generates  a  one-line  summary  of  the
              attachments  of  the  incoming  messages,  and  inserts this summary in the message
              headers.  Setting this parameter to false disables the  generation  of  this  extra
              header.

       num_meaningful_words
              (type integer, default value 15)
              Maximal  number  of  "meaningful"  words  that  are retained for computing the spam
              probability.  During mail analysis, spamoracle extracts all words of  the  message,
              and  retains  those whose spam frequency (frequency of occurrence in spam messages)
              is closest to 1 or to 0.  At most num_meaningful_words such "meaningful" words  are
              retained.

       max_repetitions
              (type integer, default value 2)
              Maximum  number  of  times  a given word can occur in the set of "meaningful" words
              retained for computing the spam probability.  The default value of 2 means that  at
              most 2 occurrences of the same word will be retained.
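
              The combined effect of num_meaningful_words and max_repetitions can be
              sketched in Python as follows.  This is an illustration only: the function
              name and the sample frequencies are hypothetical, "closest to 1 or to 0" is
              read as "farthest from 0.5", and words missing from the frequency table are
              simply skipped, which is an assumption rather than documented behavior.

```python
from collections import Counter


def select_meaningful(words, spam_freq, num_meaningful_words=15,
                      max_repetitions=2):
    """Keep the word occurrences whose spam frequency is most extreme."""
    known = [w for w in words if w in spam_freq]
    # Most extreme frequencies (farthest from 0.5) come first.
    ranked = sorted(known, key=lambda w: abs(spam_freq[w] - 0.5), reverse=True)
    kept, seen = [], Counter()
    for word in ranked:
        if len(kept) == num_meaningful_words:
            break
        if seen[word] < max_repetitions:
            seen[word] += 1
            kept.append(word)
    return kept


# Hypothetical spam frequencies, for illustration only.
freq = {"viagra": 0.99, "meeting": 0.02, "the": 0.45, "hello": 0.5}
message = ["viagra", "viagra", "viagra", "hello", "the", "meeting"]
print(select_meaningful(message, freq, num_meaningful_words=4))
```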

       low_freq_limit
              (type float, default value 0.01)

       high_freq_limit
              (type float, default value 0.99)
              The spam frequency of a word is computed as the number of its occurrences in spam
              divided by the number of its occurrences in all messages.  This ratio is then
              clipped to the interval [low_freq_limit, high_freq_limit], so that words that are
              extremely rare or extremely common in spam do not bias the probability computation
              too much.  The default values of 0.01 and 0.99 are adequate for a corpus of a few
              thousand e-mails.  For larger corpora (e.g. 10000 e-mails), the values 0.001 and
              0.999 may give better results.
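
              The clipping step can be sketched in Python as follows.  The function name is
              hypothetical, and the sketch assumes that the word occurs at least once in the
              corpus, so that the division is defined.

```python
def spam_frequency(spam_count, total_count, low_freq_limit=0.01,
                   high_freq_limit=0.99):
    """Ratio of spam occurrences to all occurrences, clipped to the limits."""
    ratio = spam_count / total_count
    return min(high_freq_limit, max(low_freq_limit, ratio))


print(spam_frequency(0, 40))   # never seen in spam: raised to low_freq_limit
print(spam_frequency(40, 40))  # only seen in spam: lowered to high_freq_limit
print(spam_frequency(3, 4))    # inside the interval: left unchanged
```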

       min_meaningful_words
              (type integer, default value 5)
              Minimum number of "meaningful" words below which spamoracle mark refuses to
              classify the e-mail and outputs the "unknown" status.  This happens with very short
              e-mails, or e-mails that consist exclusively of links and pictures.

       good_mail_prob
              (type float, default value 0.2)
              Spam probability below which the e-mail is classified as non-spam.

       spam_mail_prob
              (type float, default value 0.8)
              Spam  probability  above  which  the  e-mail is classified as spam.  Messages whose
              probability falls between  good_mail_prob  and  spam_mail_prob  are  classified  as
              "unknown".

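              The way min_meaningful_words, good_mail_prob and spam_mail_prob interact can be
              sketched in Python as follows.  The function name and the returned labels are
              chosen for this illustration, and the treatment of a probability exactly equal
              to a threshold (classified here as "unknown") is an assumption.

```python
def classify(spam_probability, meaningful_count, min_meaningful_words=5,
             good_mail_prob=0.2, spam_mail_prob=0.8):
    """Turn a spam probability into a spam / non-spam / unknown verdict."""
    if meaningful_count < min_meaningful_words:
        return "unknown"  # too few meaningful words to decide
    if spam_probability < good_mail_prob:
        return "non-spam"
    if spam_probability > spam_mail_prob:
        return "spam"
    return "unknown"  # between the two thresholds


print(classify(0.95, meaningful_count=15))
print(classify(0.05, meaningful_count=15))
print(classify(0.95, meaningful_count=3))
```
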
AUTHOR

       Xavier Leroy <Xavier.Leroy@inria.fr>

SEE ALSO

       spamoracle(1)

       http://www.paulgraham.com/spam.html (Paul Graham's seminal paper)

                                                                               SPAMORACLE.CONF(5)