Provided by: spamoracle_1.4-15_amd64 

NAME
spamoracle.conf - SpamOracle configuration file format
DESCRIPTION
The spamoracle.conf file is a configuration file governing the operation of the spamoracle(1) e-mail
classification tool. By default, spamoracle(1) looks for the configuration file at
$HOME/.spamoracle.conf, but an alternate location can be specified using the -config flag to spamoracle(1).
Important note: most of the configuration parameters should not be modified lightly, as this may result
in completely wrong e-mail classification. Familiarity with Graham's filtering algorithm, as described
in the paper referenced at the end of this page, is required to really understand the effect of the
parameters.
SYNTAX
The spamoracle.conf file is composed of lines of the form variable = value. Lines starting with a hash
sign (#) are treated as comments and ignored. Blank lines are ignored.
Depending on the type of the variable (see the list of variables below), the value part is of the
following forms:
string A sequence of characters. Blanks (spaces, tabs) at the beginning and the end of the string are
ignored. Alternatively, the string can be enclosed in double quotes ("), in which case spaces are
not trimmed. Inside quoted strings, backslashes (\) and double quotes (") must be escaped with a
backslash, as in \\ or \".
boolean
Either on, yes, true, or 1 to activate the boolean option, or off, no, false, or 0 to deactivate
it.
integer
A decimal integer.
float A decimal floating-point number.
regexp A regular expression in emacs(1) syntax. The repetition operators are *, +, and ?. Alternation
is written \| and grouping is written \(...\). Character classes are written between brackets
[...] as usual. A single dot denotes any character except newline. Regular expressions are
case-insensitive.
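The syntax rules above can be sketched in Python as follows. This is an illustration only, not spamoracle's own parser: the function names, the sample settings, the file path, and the error handling are all assumptions made for the example.

```python
TRUE_WORDS = {"on", "yes", "true", "1"}
FALSE_WORDS = {"off", "no", "false", "0"}


def parse_config(text):
    """Split 'variable = value' lines into a dict of raw (undecoded) values."""
    settings = {}
    for line in text.splitlines():
        # Blank lines and lines starting with '#' are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not of the form variable = value: {line!r}")
        settings[name.strip()] = value.strip()
    return settings


def parse_string(raw):
    """Decode a string value: trimmed bare text, or a double-quoted literal."""
    raw = raw.strip()
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        body, out, i = raw[1:-1], [], 0
        while i < len(body):
            if body[i] == "\\" and i + 1 < len(body):
                i += 1  # drop the backslash, keep the escaped character
            out.append(body[i])
            i += 1
        return "".join(out)
    return raw


def parse_boolean(raw):
    """Map the documented boolean spellings to True or False."""
    word = raw.strip()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(f"not a boolean: {raw!r}")


# Hypothetical configuration text, used only to exercise the sketch.
SAMPLE = "\n".join([
    "# thresholds for a larger corpus",
    "",
    "low_freq_limit = 0.001",
    "html_retain_tags = yes",
    r'database_file = "/tmp/my \"test\" dir/spam.db"',
])

settings = parse_config(SAMPLE)
print(parse_string(settings["database_file"]))
print(parse_boolean(settings["html_retain_tags"]))
print(float(settings["low_freq_limit"]))
```

Values are kept raw in the dictionary and decoded on demand, because the right decoding depends on the type of each variable as listed below.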
CONFIGURABLE PARAMETERS
database_file
(type string, default value $HOME/.spamoracle.db)
The location of the file that contains the database of word frequencies used by spamoracle(1).
html_retain_tags
(type boolean, default value false)
In HTML-formatted e-mails and attachments, the names of HTML tags are normally not treated as
words and are ignored for the word frequency calculations. If the html_retain_tags parameter is
set to true, HTML tags (such as img or b) are treated as words and included in the computation
of word frequencies.
html_tag_attributes
(type regexp, default value
a/href\|img/src\|img/alt\|frame/src\|font/face\|font/color)
This regular expression matches pairs of HTML tags and HTML attributes written as tag/attribute.
When scanning HTML-formatted e-mails and attachments, attributes to HTML tags are normally
ignored, unless the tag/attribute pair matches the regular expression html_tag_attributes. If the
tag/attribute pair matches this regexp, the value of the attribute (for instance, the URL for the
a/href attribute) is scanned for words.
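As a rough illustration of how such a regexp selects tag/attribute pairs, the Python sketch below translates the emacs-style operators into Python's re syntax and tests a few pairs against the default value. The helper names are hypothetical, bracket expressions are passed through without special handling, and requiring the regexp to match the whole tag/attribute string is an assumption, not documented spamoracle behaviour.

```python
import re

DEFAULT_HTML_TAG_ATTRIBUTES = (
    r"a/href\|img/src\|img/alt\|frame/src\|font/face\|font/color"
)


def emacs_to_python(pattern):
    """Translate \\|, \\( and \\) to Python operators; escape bare | ( ) { }."""
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # Backslashed |, ( and ) are operators in emacs syntax;
            # any other escape sequence is copied unchanged.
            out.append(nxt if nxt in "|()" else ch + nxt)
            i += 2
        elif ch in "|(){}":
            # Literal characters in emacs syntax, special in Python.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def attribute_is_scanned(tag, attribute, pattern=DEFAULT_HTML_TAG_ATTRIBUTES):
    """Return True if the tag/attribute pair matches the regexp."""
    compiled = re.compile(emacs_to_python(pattern), re.IGNORECASE)
    # Assumption: the whole "tag/attribute" string must match.
    return compiled.fullmatch(f"{tag}/{attribute}") is not None


print(attribute_is_scanned("a", "href"))
print(attribute_is_scanned("IMG", "SRC"))
print(attribute_is_scanned("a", "title"))
```

The same translation would apply to the mail_headers regexp, since both parameters share the regexp type.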
mail_headers
(type regexp, default value from:\|subject:)
A regular expression determining which headers of an e-mail message are scanned for words.
spam_header
(type string, default value X-Spam)
The name of the header that spamoracle mark adds to incoming e-mail messages, with the results of
the spam/non-spam classification.
attachments_header
(type string, default value X-Attachments)
The name of the header that spamoracle mark adds to incoming e-mail messages, with the one-line
summary of attachment types, names and character sets. The generation of this header can be
turned off with the summarize_attachment parameter.
summarize_attachment
(type boolean, default value true)
If this parameter is set, spamoracle mark generates a one-line summary of the attachments of the
incoming messages, and inserts this summary in the message headers. Setting this parameter to
false disables the generation of this extra header.
num_meaningful_words
(type integer, default value 15)
Maximum number of "meaningful" words that are retained for computing the spam probability. During
mail analysis, spamoracle extracts all words of the message, and retains those whose spam
frequency (frequency of occurrence in spam messages) is closest to 1 or to 0. At most
num_meaningful_words such "meaningful" words are retained.
max_repetitions
(type integer, default value 2)
Maximum number of times a given word can occur in the set of "meaningful" words retained for
computing the spam probability. The default value of 2 means that at most 2 occurrences of the
same word will be retained.
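The interplay of num_meaningful_words and max_repetitions can be sketched as follows. This is a minimal illustration under stated assumptions, not spamoracle's code: the function name is hypothetical, "closest to 1 or to 0" is interpreted as "farthest from 0.5", and the order in which ties are broken is arbitrary.

```python
def select_meaningful(occurrences, num_meaningful_words=15, max_repetitions=2):
    """Pick the most decisive word occurrences of a message.

    occurrences: list of (word, spam_frequency) pairs, one per occurrence
    of a word in the message.
    """
    # Frequencies closest to 0 or 1 are the ones farthest from 0.5.
    ranked = sorted(occurrences, key=lambda wf: abs(wf[1] - 0.5), reverse=True)
    kept, counts = [], {}
    for word, freq in ranked:
        if len(kept) == num_meaningful_words:
            break  # the cap on meaningful words has been reached
        if counts.get(word, 0) >= max_repetitions:
            continue  # this word already appears max_repetitions times
        counts[word] = counts.get(word, 0) + 1
        kept.append((word, freq))
    return kept


# Hypothetical message: one spammy word repeated three times.
message = [("viagra", 0.99)] * 3 + [("meeting", 0.02), ("the", 0.5)]
print(select_meaningful(message, num_meaningful_words=3, max_repetitions=2))
```

With max_repetitions set to 2, the third occurrence of the repeated word is skipped, which leaves room for a different word in the retained set.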
low_freq_limit
(type float, default value 0.01)
high_freq_limit
(type float, default value 0.99)
The spam frequency of a word is computed as the number of occurrences in spam divided by the number of
occurrences in all messages. This ratio is then clipped to the interval [low_freq_limit,
high_freq_limit], so that words that are extremely rare or extremely common in spam do not bias
the probability computation too much. The default values of 0.01 and 0.99 are adequate for a
corpus of a few thousand e-mails. For larger corpora (e.g. 10000 e-mails), the values 0.001 and
0.999 may give better results.
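The clipping described above amounts to the following computation, shown here as a Python sketch. The function name and the treatment of words that never occur in the corpus are assumptions made for the example.

```python
def spam_frequency(spam_count, good_count,
                   low_freq_limit=0.01, high_freq_limit=0.99):
    """Occurrences in spam over occurrences in all messages, then clipped."""
    total = spam_count + good_count
    if total == 0:
        # Assumption: a word absent from the corpus has no defined frequency.
        raise ValueError("word does not occur in the corpus")
    ratio = spam_count / total
    # Clip to the interval [low_freq_limit, high_freq_limit].
    return min(high_freq_limit, max(low_freq_limit, ratio))


print(spam_frequency(50, 0))   # seen only in spam: clipped to high_freq_limit
print(spam_frequency(0, 40))   # seen only in good mail: clipped to low_freq_limit
print(spam_frequency(3, 1))    # inside the interval: left unchanged
```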
min_meaningful_words
(type integer, default value 5)
Minimum number of "meaningful" words below which spamoracle mark refuses to classify the e-mail
and outputs "unknown" status. This happens with very short e-mails, or e-mails that consist
exclusively of links and pictures.
good_mail_prob
(type float, default value 0.2)
Spam probability below which the e-mail is classified as non-spam.
spam_mail_prob
(type float, default value 0.8)
Spam probability above which the e-mail is classified as spam. Messages whose probability falls
between good_mail_prob and spam_mail_prob are classified as "unknown".
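Taken together, min_meaningful_words, good_mail_prob and spam_mail_prob describe a three-way decision, sketched below in Python. The function name and the returned labels are illustrative, and the handling of a probability exactly equal to a threshold (treated here as "unknown") is an assumption, since the text only says "below" and "above".

```python
def classify(spam_probability, num_meaningful,
             good_mail_prob=0.2, spam_mail_prob=0.8, min_meaningful_words=5):
    """Return 'spam', 'non-spam' or 'unknown' for one message."""
    if num_meaningful < min_meaningful_words:
        return "unknown"  # too few meaningful words to decide
    if spam_probability < good_mail_prob:
        return "non-spam"
    if spam_probability > spam_mail_prob:
        return "spam"
    return "unknown"  # probability falls between the two thresholds


print(classify(0.97, num_meaningful=15))
print(classify(0.05, num_meaningful=15))
print(classify(0.50, num_meaningful=15))
print(classify(0.97, num_meaningful=3))
```

Widening the gap between the two thresholds sends more messages to "unknown", while narrowing it forces a decision on borderline messages.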
AUTHOR
Xavier Leroy <Xavier.Leroy@inria.fr>
SEE ALSO
spamoracle(1)
http://www.paulgraham.com/spam.html (Paul Graham's seminal paper)
SPAMORACLE.CONF(5)