Ubuntu Manpage: bmf - efficient Bayesian mail filter

NAME

       bmf - efficient Bayesian mail filter

SYNOPSIS

       bmf [-t] [-n] [-s] [-N] [-S] [-f fmt] [-d db] [-i file] [-k n] [-m type] [-p]
           [-v] [-V] [-h]

DESCRIPTION

       bmf  is a Bayesian mail filter. In its normal mode of operation, it takes an email message
       or other text on standard input, does a statistical check  against  lists  of  "good"  and
       "spam"  words, registers the new data, and returns a status code indicating whether or not
       the message is spam. BMF is written with fast, zero-copy algorithms, coded directly in  C,
       and  tuned  for  speed.  It  aims  to  be faster, smaller, and more versatile than similar
       applications.

       bmf supports both mbox and maildir mail storage formats.  It  will  automatically  process
       multiple messages within an mbox file separately.

OPTIONS

       Without command-line options, bmf processes the input, registers it as either "good" or
       "spam", and returns the corresponding exit status. The wordlist directory and word files
       are created if they do not exist.

       -t
              Test to see if the input is spam. The word lists are not updated. A report is
              written to stdout showing the final score and the tokens that deviate furthest
              from a mean of 0.5.

       -n
              Register the input as non-spam.

       -s
              Register the input as spam.

       -N
              Register the input as non-spam and undo a prior registration as spam.

       -S
              Register the input as spam and undo a prior registration as non-spam.

       -f fmt
              Specify database format. Valid formats are text, db, and mysql. Text is always
              valid. The others are available only if the corresponding option was enabled at
              compile time. The default is db if available, else text.

       -d db
              Specify database or directory for loading and saving word lists. The default is
              ~/.bmf in text mode.

       -i file
              Use file for input instead of stdin.

       -k n
              Specify the number of extrema (keepers) to use in the Bayes calculation. The
              default is 15.

       -m type
              Specify mail storage format. Valid types are mbox and maildir. The default is to
              automatically detect the mail storage format. This option is deprecated.

       -p
              Copy the input to the output (passthrough) and insert spam headers in the style of
              SpamAssassin. An X-Spam-Status header is always inserted with processing details.
              The contents of this header always begin with either "Yes" or "No". If the input
              is judged to be spam, the header "X-Spam-Flag: YES" is also inserted.

       -v
              Be more verbose. This option is not well supported yet.

       -V
              Display version information.

       -h
              Display usage information.
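
       A consumer of passthrough output (-p) only needs the documented leading "Yes" or "No"
       of X-Spam-Status to recover the verdict. The following is a minimal sketch, not part of
       bmf; the function name and the sample message are hypothetical, and everything after
       the leading word in the sample header is a placeholder rather than bmf's real output.

```python
from email import message_from_string


def spam_verdict(raw_message: str) -> bool:
    """Return True if a message already filtered by `bmf -p` was judged spam."""
    msg = message_from_string(raw_message)
    status = (msg.get("X-Spam-Status") or "").lstrip()
    # The header contents are documented to begin with "Yes" or "No".
    if status.startswith("Yes"):
        return True
    if status.startswith("No"):
        return False
    raise ValueError("no usable X-Spam-Status header found")


# Hypothetical sample: only the leading "Yes" and the X-Spam-Flag line
# follow the documented format; the rest is a placeholder.
SAMPLE = (
    "X-Spam-Status: Yes, details-omitted\n"
    "X-Spam-Flag: YES\n"
    "Subject: example\n"
    "\n"
    "body text\n"
)

if __name__ == "__main__":
    print(spam_verdict(SAMPLE))
```

       Checking X-Spam-Status rather than X-Spam-Flag distinguishes "judged non-spam" from
       "never filtered", since X-Spam-Flag is absent in both of those cases.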

THEORY OF OPERATION

       bmf  treats  its  input as a bag of tokens. Each token is checked against "good" and "bad"
       wordlists, which maintain counts of the numbers of times it has occurred in  non-spam  and
       spam  mails.  These  numbers  are used to compute the probability that a mail in which the
       token occurs is spam. After probabilities for all input tokens have been computed, a fixed
       number  of the probabilities that deviate furthest from average are combined using Bayes's
       theorem on conditional probabilities.
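
       The calculation can be sketched as follows. This is an illustration based on the
       formulas in Paul Graham's "A Plan For Spam", not a transcription of bmf's source: the
       doubling of good counts, the minimum-count threshold of 5, the 0.4 value for unknown
       tokens, and the 0.01/0.99 clamps are Graham's choices and are assumed here. Only the
       default of 15 extrema comes from the -k option. All function names are hypothetical.

```python
from math import prod

KEEPERS = 15  # default number of extrema, as documented for -k


def token_probability(good: int, bad: int, ngood: int, nbad: int) -> float:
    """Graham-style estimate that a message containing this token is spam.

    good/bad are the token's counts in the two wordlists; ngood/nbad are the
    numbers of non-spam and spam messages registered so far.
    """
    if 2 * good + bad < 5:
        return 0.4  # too little evidence: Graham's value for unknown tokens
    g = min(1.0, 2 * good / max(ngood, 1))  # good counts doubled to limit false positives
    b = min(1.0, bad / max(nbad, 1))
    return max(0.01, min(0.99, b / (g + b)))


def combine(probs, keepers: int = KEEPERS) -> float:
    """Combine the `keepers` probabilities that lie furthest from 0.5."""
    extrema = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:keepers]
    spam = prod(extrema)
    ham = prod(1.0 - p for p in extrema)
    return spam / (spam + ham)


if __name__ == "__main__":
    probs = [token_probability(0, 10, 10, 10), token_probability(1, 0, 10, 10), 0.5]
    print(combine(probs))
```

       Because every token probability is clamped away from 0 and 1, neither product can reach
       zero, so the final division is always defined.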

       While this method sounds crude compared to the more usual  pattern-matching  approach,  it
       turns   out   to   be   extremely   effective.  Paul  Graham's  paper  A  Plan  For  Spam:
       http://www.paulgraham.com/spam.html is recommended reading.

       bmf improves on Paul's proposal by doing smarter lexical analysis. In particular,
       hostnames and IP addresses are kept as tokens, while certain types of MTA information
       (such as message IDs and dates) are discarded.

       MIME and other attachments are not decoded. Experience from  watching  the  token  streams
       suggests  that  spam  with  enclosures  invariably  gives  itself away through cues in the
       headers and non-enclosure parts. Nonetheless, I would like to add the  ability  to  decode
       quoted-printable and perhaps base64 encodings for textual attachments.

INTEGRATION WITH OTHER TOOLS

       See /usr/share/doc/bmf/README.gz for samples and suggestions.

RETURN VALUES

       In passthrough mode: zero for success, nonzero for failure.

       In non-passthrough mode: 0 for spam; 1 for non-spam; 2 for I/O or other errors.
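
       Note that in non-passthrough mode a zero status means spam, the reverse of the usual
       "zero is success" reading. A calling script might make this explicit as in the sketch
       below, which assumes bmf is on the PATH; the function names are hypothetical, and only
       the -t and -i options and the status values come from this page.

```python
import subprocess

VERDICTS = {0: "spam", 1: "non-spam"}


def interpret_status(returncode: int) -> str:
    """Map a non-passthrough bmf exit status to a label, raising on errors."""
    try:
        return VERDICTS[returncode]
    except KeyError:
        raise RuntimeError(f"bmf failed with exit status {returncode}") from None


def check_message(path: str) -> str:
    """Run `bmf -t -i path` (test only, word lists untouched) and return the verdict."""
    result = subprocess.run(["bmf", "-t", "-i", path], stdout=subprocess.DEVNULL)
    return interpret_status(result.returncode)
```

       The report that -t writes to stdout is discarded here; capture it instead if the score
       and token list are of interest.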

FILES

       ~/.bmf/goodlist.txt
              List of good tokens for text mode.

       ~/.bmf/spamlist.txt
              List of bad tokens for text mode.

       ~/.bmf/goodlist.db
              List of good tokens for libdb mode.

       ~/.bmf/spamlist.db
              List of bad tokens for libdb mode.

BUGS

       Only one instance of bmf(1) can access the database at a time (see options -d and -f).
       In Procmail recipes, ensure sequential access with a lock file:

               :0 fw: bmf.lock
               | bmf -p

       The lexer does not recognize multiline headers.

       The lexer does not recognize MIME attachments.

       Content-Transfer-Encoding is not decoded.
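
       Callers other than Procmail can respect the single-instance restriction noted at the top
       of this section by serializing their own invocations. The sketch below uses an advisory
       flock(2) lock; the lock path and function names are hypothetical. It coordinates only
       with other callers that take the same lock, and it does not interoperate with the
       Procmail lock file shown above.

```python
import fcntl
import os
import subprocess
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path: str):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until any other holder releases
        yield fd
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def filter_message(raw: bytes, lock_path: str = os.path.expanduser("~/.bmf.lock")) -> bytes:
    """Run `bmf -p` on one message while holding the lock; return the annotated message."""
    with exclusive_lock(lock_path):
        done = subprocess.run(["bmf", "-p"], input=raw, stdout=subprocess.PIPE, check=True)
        return done.stdout
```

       With check=True, a nonzero status from bmf in passthrough mode raises
       subprocess.CalledProcessError, so a failed run is never mistaken for filtered output.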

AUTHOR

       Tom Marshall <tommy@tig-grr.com>.

       The Bayes algorithm is from bogofilter by Eric S.  Raymond  <esr@thyrsus.com>.  bogofilter
       can be found at the bogofilter project page: http://bogofilter.sourceforge.net/.

                                                                                           BMF(1)