NAME

       dbacl - a digramic Bayesian classifier for text recognition.

SYNOPSIS

       dbacl  [-01dvnirmwMNDXW]  [-T  type  ]  -l  category  [-h  size] [-H gsize] [-x decim] [-q
              quality] [-w  max_order]  [-e  deftok]  [-o  online]  [-L  measure]  [-g  regex]...
              [FILE]...

       dbacl [-vnimNRX] [-h size] [-T type] -c category [-c category]...  [-f keep]...  [FILE]...

       dbacl -V

OVERVIEW

dbacl is a Bayesian text and email classifier. When using the -l switch, it learns a body
of text and produces a file named category which summarizes the text. When using the -c
switch, it compares an input text stream with any number of category files, and outputs
the name of the closest match, or optionally various numerical scores explained below.

Whereas this manual page is intended as a reference, there are several tutorials and
documents you can read to get specialized information. Specific documentation about the
design of dbacl and the statistical models that it uses can be found in dbacl.ps. For a
basic overview of text classification using dbacl, see tutorial.html. A companion tutorial
geared towards email filtering is email.html. If you have trouble getting dbacl to
classify reliably, read is_it_working.html. The USAGE section of this manual page also
has some examples.

/usr/share/doc/dbacl/dbacl.ps

/usr/share/doc/dbacl/tutorial.html

/usr/share/doc/dbacl/email.html

/usr/share/doc/dbacl/is_it_working.html

dbacl uses a maximum entropy (minimum divergence) language model constructed with respect
to a digramic reference measure (unknown tokens are predicted from digrams, i.e. pairs of
letters). Practically, this means that a category is constructed from tokens in the
training set, while previously unseen tokens can be predicted automatically from their
letters. A token here is either a word (fragment) or a combination of words (fragments),
selected according to various switches. Learning roughly works by tweaking token
probabilities until the training data is least surprising.

EXIT STATUS

       The normal shell exit conventions aren't followed (sorry!).  When  using  the  -l  command
       form,  dbacl  returns zero on success, nonzero if an error occurs. When using the -c form,
       dbacl returns a positive integer corresponding to the category with the highest  posterior
       probability.  In  case  of  a tie, the first most probable category is chosen. If an error
       occurs, dbacl returns zero.
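
       Because the -c exit status doubles as the classification result, a calling script has to
       translate it back into a category name and treat zero as the error case. A minimal
       sketch, assuming the category names are passed in the same order as on the dbacl command
       line (the helper name category_for_status is hypothetical and not part of dbacl):

```shell
# Hypothetical helper: map a dbacl -c exit status to a category name.
# Usage: category_for_status CODE CATEGORY...
category_for_status() {
    code=$1
    shift                                   # remaining arguments are the category names
    # Zero signals an error; anything above the category count is unexpected.
    if [ "$code" -lt 1 ] || [ "$code" -gt "$#" ]; then
        return 1
    fi
    shift $((code - 1))                     # move the wanted category into $1
    printf '%s\n' "$1"
}
```

       For example, right after a "dbacl -c twain -c shake" invocation, the call
       category_for_status $? twain shake prints the winning category, or fails if dbacl
       reported an error.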

DESCRIPTION

When using the -l command form, dbacl learns a category when given one or more FILE names,
which should contain readable ASCII text. If no FILE is given, dbacl learns from STDIN. If
FILE is a directory, it is opened and all its files are read, but not its subdirectories.
The result is saved in the binary file named category, and completely replaces any
previous contents. As a convenience, if the environment variable DBACL_PATH contains a
directory, then that is prepended to the file path, unless category starts with a '/' or a
'.'.
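
The DBACL_PATH rule can be pictured with a small shell function. This is only a sketch of
the rule as described above: the function name resolve_category is hypothetical, and joining
the directory and the name with a single '/' is an assumption.

```shell
# Hypothetical helper: show which file a category argument would refer to.
resolve_category() {
    case $1 in
        /*|.*)
            printf '%s\n' "$1" ;;                      # starts with '/' or '.': used as given
        *)
            if [ -n "${DBACL_PATH:-}" ]; then
                printf '%s/%s\n' "$DBACL_PATH" "$1"    # DBACL_PATH is prepended
            else
                printf '%s\n' "$1"                     # DBACL_PATH unset: used as given
            fi ;;
    esac
}
```

With DBACL_PATH set to $HOME/.dbacl, resolve_category twain prints the path under that
directory, while resolve_category ./twain prints ./twain unchanged.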

The input text for learning is assumed to be unstructured plain text by default. This is
not suitable for learning email, because email contains various transport encodings and
formatting instructions which can reduce classification effectiveness. You must use the -T
switch in that case so that dbacl knows it should perform decoding and filtering of MIME
and HTML as appropriate. Appropriate switch values are "-T email" for RFC2822 email input,
"-T html" for HTML input, "-T xml" for generic XML style input and "-T text" for the
default plain text format. There are other values of the -T switch that also allow fine
tuning of the decoding capabilities.

When using the -c command form, dbacl attempts to classify the text found in FILE, or
STDIN if no FILE is given. Each possible category must be given separately, and should be
the file name of a previously learned text corpus. As a convenience, if the variable
DBACL_PATH contains a directory, it is prepended to each file path which doesn't start
with a '/' or a '.'. The visible output of the classification depends on the combination
of extra switches used. If no switch is used, then no output is shown on STDOUT. However,
dbacl always produces an exit code which can be tested.

To see an output for a classification, you must use at least one of the -v,-U,-n,-N,-D,-d
switches. Sometimes, they can be used in combination to produce a natural variation of
their individual outputs. Sometimes, dbacl also produces warnings on STDERR if applicable.

The -v switch outputs the name of the best category among all the choices given.

The -U switch outputs the name of the best category followed by a confidence percentage.
Normally, this is the switch that you want to use. A percentage of 100% means that dbacl
is sure of its choice, while a percentage of 0% means that some other category is equally
likely. This is not the model probability, but measures how unambiguous the classification
is, and can be used to tag unsure classifications (e.g. if the confidence is 25% or less).
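
Tagging unsure classifications can be scripted on top of the -U output. The sketch below
assumes the "category # NN%" line format shown in the USAGE section; the function name
tag_unsure and the word "unsure" are hypothetical choices.

```shell
# Hypothetical filter: read "category # NN%" lines and print "unsure" instead of
# the category when the confidence is at or below the given threshold.
# Usage: ... | tag_unsure THRESHOLD
tag_unsure() {
    awk -v threshold="$1" '{
        confidence = $3 + 0          # "34%" converts to the number 34
        if (confidence <= threshold) print "unsure"
        else print $1
    }'
}
```

Piping the output of "dbacl -U -c twain -c shake" through tag_unsure 25 would keep the
category name only when the confidence is above 25%.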

The -N switch prints each category name followed by its (posterior) probability, expressed
as a percentage. The percentages always sum to 100%. This is intuitive, but only valuable
if the document being classified contains a handful of tokens (ten or fewer). In the common
case with many more tokens, the probabilities are always extremely close to 100% and 0%.

The -n switch prints each category name followed by the negative logarithm of its
probability. This is equivalent to using the -N switch, but much more useful. The smallest
number gives the best category. A more convenient form is to use both -n and -v which
prints each category name followed by the cross entropy and the number of tokens analyzed.
The cross entropy measures (in bits) the average compression rate which is achievable,
under the given category model, per token of input text. If you use all three of -n,-v,-X
then an extra value is output for each category, representing a kind of p-value for each
category score. This indicates how typical the score is compared to the training
documents, but only works if the -X switch was used during learning, and only for some
types of models (e.g. email). These p-values are uniformly distributed and independent
(if the categories are independent), so can be combined using Fisher's chi squared test to
obtain composite p-values for groupings of categories.
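
Fisher's method itself is short enough to sketch in awk. The function below is not part of
dbacl: the name fisher_combine is hypothetical, it expects one p-value per line as a
fraction between 0 and 1 (percentages must be divided by 100 first), and it relies on the
independence assumption mentioned above.

```shell
# Hypothetical filter: combine independent p-values with Fisher's method.
# Reads one p-value per line (0 < p <= 1) and prints the composite p-value.
fisher_combine() {
    awk '
        {
            if ($1 <= 0 || $1 > 1) { bad = 1; next }
            x -= 2 * log($1)         # Fisher statistic: -2 * sum of ln(p)
            k++
        }
        END {
            if (bad || k == 0) exit 1
            # Upper tail of chi-squared with 2k degrees of freedom:
            # exp(-x/2) * sum over i < k of (x/2)^i / i!
            term = 1
            total = 1
            for (i = 1; i < k; i++) {
                term *= (x / 2) / i
                total += term
            }
            printf "%.4f\n", exp(-x / 2) * total
        }'
}
```

A single p-value is returned unchanged, which is a quick sanity check on the formula.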

The -v and -X switches together print each category name followed by a detailed
decomposition of the category score, factored into ( divergence rate + shannon entropy
rate )* token count @ p-value. Again, this only works in some types of models.

The -v and -U switches print each category name followed by a decomposition of the
category score into ( divergence rate + shannon entropy rate # score variance )* token
count.

The -D switch prints out the input text as modified internally by dbacl prior to
tokenization. For example, if a MIME encoded email document is classified, then this
prints the decoded text that will be actually tokenized and classified. This switch is
mainly useful for debugging.

The -d switch dumps tokens and scores while they are being read. It is useful for
debugging, or if you want to create graphical representations of the classification. A
detailed explanation of the output is beyond the scope of this manual page, but is
straightforward if you've read dbacl.ps. Possible variations include -d together with -n
or -N.

Classification can be done with one or several categories in principle. When two or more
categories are used, the Bayesian posterior probability is used, given the input text,
with a uniform prior distribution on categories. For other choices of prior, see the
companion utility bayesol(1). When a single category is used, classification can be done
by comparing the score with a threshold. In practice however, much better results are
obtained with several categories.

Learning and classifying cannot be mixed in the same command invocation. However, there are
no locking issues and separate dbacl processes can operate simultaneously with obvious
results, because file operations are designed to be atomic.

Finally, note that dbacl does not manage your document corpora or your computed
categories, and in particular it does not allow you to extend an existing category file
with new documents. This is unlike various current spam filters, which can learn new
emails incrementally. This limitation of dbacl is partially due to the nonlinear procedure
used in the learning algorithm, and partially due to a desire for increased flexibility.

You can simulate the effect of incremental learning by saving your training documents into
archives and adding to these archives over time, relearning from scratch periodically.
Learning is actually faster if these archives are compressed and decompressed on the fly
when needed. By keeping control of your archives, you can never lose the information in
your categories, and you can easily experiment with different switches or tokenizations or
sets of training documents if you like.
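
One way to keep such an archive is sketched below. The function name archive_add and the
file names are hypothetical, and the sketch assumes plain text training documents that can
simply be concatenated (email corpora would need a proper mbox layout instead).

```shell
# Hypothetical helper: append training documents to a gzip archive.
# Usage: archive_add ARCHIVE FILE...
archive_add() {
    archive=$1
    shift
    # Concatenated gzip members decompress as one continuous stream.
    cat -- "$@" | gzip -c >> "$archive"
}

# Relearning from scratch then reads the whole archive on standard input, e.g.:
#   gzip -dc twain.txt.gz | dbacl -l twain
```

New documents are added with archive_add as they arrive, and the category is rebuilt
periodically from the decompressed stream.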

SECONDARY SWITCHES

By default, dbacl classifies the input text as a whole. However, when using the -f option,
dbacl can be used to filter each input line separately, printing only those lines which
match one or more models identified by keep (use the category name or number to refer to a
category). This is useful if you want to filter out some lines, but note that if the lines
are short, then the error rate can be high.

The -e,-w,-g,-j switches are used for selecting an appropriate tokenization scheme. A
token is a word or word fragment or combination of words or fragments. The shape of tokens
is important because it forms the basis of the language models used by dbacl. The -e
switch selects a predefined tokenization scheme, which is speedy but limited. The -w
switch specifies composite tokens derived from the -e switch. For example, "-e alnum -w 2"
means that tokens should be alphanumeric word fragments combined into overlapping pairs
(bigrams). By default, all tokens are converted to lowercase, which reduces the number of
possible tokens and therefore memory consumption; the -j switch keeps tokens case sensitive.
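
To make "overlapping pairs" concrete, the following awk sketch prints the bigrams of
alphanumeric fragments found on each input line. It only illustrates the idea and is not
dbacl's tokenizer: the function name bigrams is hypothetical, case is left untouched, and
pairs never cross a line break.

```shell
# Hypothetical filter: print overlapping pairs of alphanumeric fragments per line.
bigrams() {
    awk '{
        n = split($0, raw, /[^A-Za-z0-9]+/)
        m = 0
        for (i = 1; i <= n; i++)             # drop empty pieces at the line edges
            if (raw[i] != "") words[++m] = raw[i]
        for (i = 1; i < m; i++)              # each fragment pairs with its successor
            print words[i], words[i + 1]
    }'
}
```

For the line "to be, or not" this prints the three pairs "to be", "be or" and "or not".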

If the -g switch is used, you can completely specify what the tokens should look like
using a regular expression. Several -g switches can be used to construct complex
tokenization schemes, and parentheses within each expression can be used to select
fragments and combine them into n-grams. The cost of such flexibility is reduced
classification and learning speed. When experimenting with tokenization schemes, try using
the -d or -D switches while learning or classifying, as they will print the tokens
explicitly so you can see what text fragments are picked up or missed out. For regular
expression syntax, see regex(7).

The -h and -H switches regulate how much memory dbacl may use for learning. Text
classification can use a lot of memory, and by default dbacl limits itself even at the
expense of learning accuracy. In many cases if a limit is reached, a warning message will
be printed on STDERR with some advice.

When relearning the same category several times, a significant speedup can be obtained by
using the -1 switch, as this allows the previously learned probabilities to be read from
the category and reused.

Note that classification accuracy depends foremost on the amount and quality of the
training samples, and only then on the amount of tweaking.

OPTIONS

-0 When learning, prevents weight preloading. Normally, dbacl checks if the category
file already exists, and if so, tries to use the existing weights as a starting
point. This can dramatically speed up learning. If the -0 (zero) switch is set,
then dbacl behaves as if no category file already exists. This is mainly useful for
testing. This switch is now enabled by default, to protect against weight drift
which can reduce accuracy over many learning iterations. Use -1 to force
preloading.

-1 Force weight preloading if the category file already exists. See discussion of the
-0 switch.

-a Append scores. Every input line is written to STDOUT and the dbacl scores are
appended. This is useful for postprocessing with bayesol(1). For ease of
processing, every original input line is indented by a single space (to distinguish
them from the appended scores), and the line with the scores (if -n is used) is
prefixed with the string "scores ". If a second copy of dbacl needs to read this
output later, it should be invoked with the -A switch.
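
A postprocessing script can separate the two kinds of lines with sed. The sketch below
relies only on the layout described above (a one-space indent for original lines, the
"scores " prefix for score lines); the function names are hypothetical.

```shell
# Hypothetical filters for text produced with the -a switch.

# Recover the original input lines by dropping the one-space indent.
original_lines() {
    sed -n 's/^ //p'
}

# Keep only the appended score lines, without the "scores " prefix.
score_lines() {
    sed -n 's/^scores //p'
}
```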

-d Dump the model parameters to STDOUT. In conjunction with the -l option, this
produces a human-readable summary of the maximum entropy model. In conjunction with
the -c option, displays the contribution of each token to the final score.
Suppresses all other normal output.

-e Select character class for default (not regex-based) tokenization. By default,
tokens are alphabetic strings only. This corresponds to the case when deftok is
"alpha". Possible values for deftok are "alpha", "alnum", "graph", "char", "cef"
and "adp". The last two are custom tokenizers intended for email messages. See
also isalpha(3). The "char" tokenizer picks up single printable characters rather
than bigger tokens, and is intended for testing only.

-f Filter each line of input separately, passing to STDOUT only lines which match the
category identified as keep. This option should be used repeatedly for each
category which must be kept. keep can be either the category file name, or a
positive integer representing the required category in the same order it appears on
the command line.

Output lines are flushed as soon as they are written. If the input file is a pipe
or character device, then an attempt is made to use line buffering mode, otherwise
the more efficient block buffering is used.

-g Learn only features described by the extended regular expression regex. This
overrides the default feature selection method (see -w option) and learns, for each
line of input, only tokens constructed from the concatenation of strings which
match the tagged subexpressions within the supplied regex. All substrings which
match regex within a suffix of each input line are treated as features, even if
they overlap on the input line.

As an optional convenience, regex can include the suffix ||xyz which indicates
which parenthesized subexpressions should be tagged. In this case, xyz should
consist exclusively of digits 1 to 9, numbering exactly those subexpressions which
should be tagged. Alternatively, if no parentheses exist within regex, then it is
assumed that the whole expression must be captured.

-h Set the size of the hash table to 2^size elements. When using the -l option, this
refers to the total number of features allowed in the maximum entropy model being
learned. When using the -c option together with the -M switch and multinomial type
categories, this refers to the maximum number of features taken into account during
classification. Without the -M switch, this option has no effect.

-i Fully internationalized mode. Forces the use of wide characters internally, which
is necessary in some locales. This incurs a noticeable performance penalty.

-j Make features case sensitive. Normally, all features are converted to lower case
during processing, which reduces storage requirements and improves statistical
estimates for small datasets. With this option, the original capitalization is used
for each feature. This can improve classification accuracy.

-m Aggressively maps categories into memory and locks them into RAM to prevent
swapping, if possible. This is useful when speed is paramount and memory is
plentiful, for example when testing the classifier on large datasets.

Locking may require relaxing user limits with ulimit(1). Ask your system
administrator. Beware when using the -m switch together with the -o switch, as only
one dbacl process must learn or classify at a time to prevent file corruption. If
no learning takes place, then the -m switch for classifying is always safe to use.
See also the discussion for the -o switch.

-n Print scores for each category. Each score is the product of two numbers, the
cross entropy and the complexity of the input text under each model. Multiplied
together, they represent the log probability that the input resembles the model. To
see these numbers separately, use also the -v option. In conjunction with the -f
option, stops filtering but prints each input line prepended with a list of scores
for that line.

-q Select quality of learning, where quality can be 1,2,3,4. Higher values take longer
to learn, and should be slightly more accurate. The default quality is 1 if the
category file doesn't exist or weights cannot be preloaded, and 2 otherwise.

-o When learning, reads/writes partial token counts so they can be reused. Normally,
category files are learned from exactly the input data given, and don't contain
extraneous information. When this option is in effect, some extra information is
saved in the file online, after all input was read. This information can be reread
the next time that learning occurs, to continue where the previous dataset left
off. If online doesn't exist, it is created. If online exists, it is read before
learning, and updated afterwards. The file is approximately 3 times bigger (at
least) than the learned category.

In dbacl, file updates are atomic, but if using the -o switch, two or more
processes should not learn simultaneously, as only one process will write a lasting
category and memory dump. The -m switch can also speed up online learning, but
beware of possible corruption. Only one process should read or write a file. This
option is intended primarily for controlled test runs.

-r Learn the digramic reference model only. Skips the learning of extra features in
the text corpus.

-v Verbose mode. When learning, print out details of the computation, when
classifying, print out the name of the most probable category. In conjunction with
the -n option, prints the scores as an explicit product of the cross entropy and
the complexity.

-w Select default features to be n-grams up to max_order. This is incompatible with
the -g option, which always takes precedence. If no -w or -g options are given,
dbacl assumes -w 1. Note that n-grams for n greater than 1 do not straddle line
breaks by default. The -S switch enables line straddling.

-x Set decimation probability to 1 - 2^(-decim). To reduce memory requirements when
learning, some inputs are randomly skipped, and only a few are added to the model.
Exact behaviour depends on the applicable -T option (default is -T "text"). When
the type is not "email" (e.g. "text"), then individual input features are added with
probability 2^(-decim). When the type is "email", then full input messages are
added with probability 2^(-decim). Within each such message, all features are
used.
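
As a rough guide to choosing decim, the expected number of inputs that survive decimation
is the total multiplied by 2^(-decim). A small sketch of that arithmetic (the function name
expected_kept is hypothetical; "inputs" means features or whole messages depending on the
type, as described above):

```shell
# Hypothetical helper: expected number of inputs kept out of TOTAL for -x DECIM.
# Usage: expected_kept DECIM TOTAL
expected_kept() {
    awk -v decim="$1" -v total="$2" 'BEGIN {
        keep = exp(-decim * log(2))      # 2^(-decim), probability of keeping an input
        printf "%.1f\n", total * keep
    }'
}
```

For instance, expected_kept 3 1000 prints 125.0, i.e. about one input in eight is kept.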

-A Expect indented input and scores. With this switch, dbacl expects input lines to be
indented by a single space character (which is then skipped). Lines starting with
any other character are ignored. This is the counterpart to the -a switch above.
When used together with the -a switch, dbacl outputs the skipped lines as they are,
and reinserts the space at the front of each processed input line.

-D Print debug output. Do not use normally, but can be very useful for displaying the
list of features picked up while learning.

-H Allow hash table to grow up to a maximum of 2^gsize elements during learning.
Initial size is given by -h option.

-L Select the digramic reference measure for character transitions. The measure can be
one of "uniform", "dirichlet" or "maxent". Default is "uniform".

-M Force multinomial calculations. When learning, forces the model features to be
treated multinomially. When classifying, corrects entropy scores to reflect
multinomial probabilities (only applicable to multinomial type models, if present).
Scores will always be lower, because the ordering of features is lost.

-N Print posterior probabilities for each category. This assumes the supplied
categories form an exhaustive list of possibilities. In conjunction with the -f
option, stops filtering but prints each input line prepended with a summary of the
posterior distribution for that line.

-R Include an extra category for purely random text. The category is called "random".
Only makes sense when using the -c option.

-S Enable line straddling. This is useful together with the -w option to allow n-grams
for n > 1 to ignore line breaks, so a complex token can continue past the end of
the line. This is not recommended for email.

-T Specify nonstandard text format. By default, dbacl assumes that the input text is a
purely ASCII text file. This corresponds to the case when type is "text".

There are several types and subtypes which can be used to clean the input text of
extraneous tokens before actual learning or classifying takes place. Each (sub)type
you wish to use must be indicated with a separate -T option on the command line,
and automatically implies the corresponding type.

The "text" type is for unstructured plain text. No cleanup is performed. This is
the default if no types are given on the command line.

The "email" type is for mbox format input files or single RFC822 emails. Headers
are recognized and most are skipped. To include extra RFC822 standard headers
(except for trace headers), use the "email:headers" subtype. To include trace
headers, use the "email:theaders" subtype. To include all headers in the email, use
the "email:xheaders" subtype. To skip all headers, except the subject, use
"email:noheaders". To scan binary attachments for strings, use the "email:atts"
subtype.

When the "email" type is in effect, HTML markup is automatically removed from text
attachments except text/plain attachments. To also remove HTML markup from plain
text attachments, use "email:noplain". To prevent HTML markup removal in all text
attachments, use "email:plain".

The "html" type is for removing HTML markup (between <html> and </html> tags) and
surrounding text. Note that if the "email" type is enabled, then "html" is
automatically enabled for compatible message attachments only.

The "xml" type is like "html", but doesn't honour <html> and </html>, and doesn't
interpret tags (so this should be more properly called "angle markup" removal, and
has nothing to do with actual XML semantics).

When "html" is enabled, most markup attributes are lost (for values of 'most' close
to 'all'). The "html:links" subtype forces link urls to be parsed and learned,
which would otherwise be ignored. The "html:alt" subtype forces parsing of
alternative text in ALT attributes and various other tags. The "html:scripts"
subtype forces parsing of scripts, "html:styles" forces parsing of styles,
"html:forms" forces parsing of form values, while "html:comments" forces parsing of
HTML comments.

-U Print (U)nambiguity. When used in conjunction with the -v switch, prints scores
followed by their empirical standard deviations. When used alone, prints the best
category, followed by an estimated probability that this category choice is
unambiguous. More precisely, the probability measures lack of overlap of CLT
confidence intervals for each category score (If there is overlap, then there is
ambiguity).

This estimated probability can be used as an "unsure" flag, e.g. if the estimated
probability is lower than 50%. Formally, a score of 0% means another category is
equally likely to apply to the input, and a score of 100% means no other category
is likely to apply to the input. Note that this type of confidence is unrelated to
the -X switch. Also, the probability estimate is usually low if the document is
short, or if the message contains many tokens that have never been seen before
(only applies to uniform digramic measure).

-V Print the program version number and exit.

-W Like -w, but prevents features from straddling newlines. See the description of -w.

-X Print the confidence in the score calculated for each category, when used together
with the -n or -N switch. Prepares the model for confidence scores, when used with
the -l switch. The confidence is an estimate of the typicality of the score,
assuming the null hypothesis that the given category is correct. When used with the
-v switch alone, factorizes the score as the empirical divergence plus the shannon
entropy, multiplied by complexity, in that order. The -X switch is not supported in
all possible models, and displays a percentage of "0.0" if it can't be calculated.
Note that for unknown documents, it is quite common to have confidences close to
zero.

USAGE

       To create two category files in the current directory from  two  ASCII  text  files  named
       Mark_Twain.txt and William_Shakespeare.txt respectively, type:

       % dbacl -l twain Mark_Twain.txt
       % dbacl -l shake William_Shakespeare.txt

       Now you can classify input text, for example:

       % echo "howdy" | dbacl -v -c twain -c shake
       twain
       % echo "to be or not to be" | dbacl -v -c twain -c shake
       shake

       Note  that  the  -v option at least is necessary, otherwise dbacl does not print anything.
       The return value is 1 in the first case, 2 in the second.

       % echo "to be or not to be" | dbacl -v -N -c twain -c shake
       twain 22.63% shake 77.37%
       % echo "to be or not to be" | dbacl -v -n -c twain -c shake
       twain  7.04 * 6.0 shake  6.74 * 6.0

       These invocations are equivalent. The numbers  6.74  and  7.04  represent  how  close  the
       average  token  is to each category, and 6.0 is the number of tokens observed. If you want
       to print a simple confidence value together with the best category, replace -v with -U.

       % echo "to be or not to be" | dbacl -U -c twain -c shake
       shake # 34%

       Note that the true probability of category shake versus category twain is 77.37%, but  the
       calculation  is  somewhat  ambiguous,  and  34%  is  the  confidence  out of 100% that the
       calculation is qualitatively correct.
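
       The link between the -n scores and the -N percentages can be checked by hand. The sketch
       below assumes that each score is a negative base-2 logarithm (the cross entropy is quoted
       in bits) and that the prior on categories is uniform; the function name
       scores_to_posterior is hypothetical, and its input is a line of name/score pairs.

```shell
# Hypothetical filter: turn "name score name score ..." (scores in bits, smaller
# is better) into posterior percentages under a uniform prior.
scores_to_posterior() {
    awk '{
        min = $2
        for (i = 4; i <= NF; i += 2)
            if ($i < min) min = $i
        total = 0
        for (i = 2; i <= NF; i += 2) {
            # 2^-(score - min); subtracting min avoids underflow for large scores
            w[i] = exp(-($i - min) * log(2))
            total += w[i]
        }
        line = ""
        for (i = 2; i <= NF; i += 2)
            line = line sprintf("%s%s %.2f%%", (i > 2 ? " " : ""), $(i - 1), 100 * w[i] / total)
        print line
    }'
}
```

       Feeding it the products from the example above (7.04 * 6.0 = 42.24 for twain and
       6.74 * 6.0 = 40.44 for shake) gives roughly 22% and 78%, in line with the -N output; the
       small difference comes from the rounding of the displayed cross entropies.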

       Suppose a file document.txt contains English text lines interspersed with noise lines.  To
       filter  out the noise lines from the English lines, assuming you have an existing category
       shake say, type:

       % dbacl -c shake -f shake -R document.txt > document.txt_eng
       % dbacl -c shake -f random -R document.txt > document.txt_rnd

       Note that the quality of the results will vary depending on how well the categories  shake
       and  random  represent  each  input  line.   It  is  sometimes useful to see the posterior
       probabilities for each line without filtering:

       % dbacl -c shake -f shake -RN document.txt > document.txt_probs

       You can now postprocess the posterior probabilities for each line  of  text  with  another
       script, to replicate an arbitrary Bayesian decision rule of your choice.

       In the special case of exactly two categories, the optimal Bayesian decision procedure can
       be implemented for documents as follows: let p1 be the prior probability  that  the  input
       text  is  classified  as category1.  Consequently, the prior probability of classifying as
       category2 is 1 - p1.  Let u12 be the cost of misclassifying  a  category1  input  text  as
       belonging to category2 and vice versa for u21.  We assume there is no cost for classifying
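       correctly.  Then the following command implements the optimal Bayesian decision, once
       numeric values for p1, u12 and u21 are passed to awk with -v (the values shown are
       placeholders):

       % dbacl -N -c category1 -c category2 | awk -v p1=0.5 -v u12=1 -v u21=1 \
              '{ if ($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1 } else { print $3 } }'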

       dbacl  can  also  be  used  in conjunction with procmail(1) to implement a simple Bayesian
       email classification system. Assume that incoming mail should be  automatically  delivered
       to  one  of  three  mail  folders  located in $MAILDIR and named work, personal, and spam.
       Initially, these must be created and filled with appropriate sample emails.  A  crontab(1)
       file can be used to learn the three categories once a day, e.g.

       CATS=$HOME/.dbacl
       5  0 * * * dbacl -T email -l $CATS/work $MAILDIR/work
       10 0 * * * dbacl -T email -l $CATS/personal $MAILDIR/personal
       15 0 * * * dbacl -T email -l $CATS/spam $MAILDIR/spam

       To  automatically  deliver  each incoming email into the appropriate folder, the following
       procmailrc(5) recipe fragment could be used:

       CATS=$HOME/.dbacl

       # run the spam classifier
       :0 c
       YAY=| dbacl -vT email -c $CATS/work -c $CATS/personal -c $CATS/spam

       # send to the appropriate mailbox
       :0:
       * ? test -n "$YAY"
       $MAILDIR/$YAY

       :0:
       $DEFAULT

       Sometimes, dbacl will send the email to the wrong mailbox. In that case, the misclassified
       message  should  be  removed from its wrong destination and placed in the correct mailbox.
       The error will be corrected the next time your messages are learned.  If it is left in the
       wrong category, dbacl will learn the wrong corpus statistics.

       The  default  text  features  (tokens)  read by dbacl are purely alphabetic strings, which
       minimizes memory requirements but can be unrealistic in some cases.  To  construct  models
       based  on alphanumeric tokens, use the -e switch. The example below also uses the optional
       -D switch, which prints a list of actual tokens found in the document:

       % dbacl -e alnum -D -l twain Mark_Twain.txt | less

       It is also possible to override the default feature selection method  used  to  learn  the
       category  model by means of regular expressions. For example, the following duplicates the
       default feature selection method in the C locale, while being much slower:

       % dbacl -l twain -g '^([[:alpha:]]+)' -g '[^[:alpha:]]([[:alpha:]]+)' Mark_Twain.txt

       The category twain which is obtained depends only on single alphabetic words in  the  text
       file  Mark_Twain.txt  (and  computed  digram  statistics  for  prediction).   For a second
       example, the following command builds a  smoothed  Markovian  (word  bigram)  model  which
       depends  on  pairs of consecutive words within each line (but pairs cannot straddle a line
       break):

       % dbacl -l twain2 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2' -g
              '(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23' Mark_Twain.txt

       More  general, line based, n-gram models of all orders (up to 7) can be built in a similar
       way.  To construct paragraph based models, you should  reformat  the  input  corpora  with
       awk(1)  or  sed(1)  to  obtain  one  paragraph per line. Line size is limited by available
       memory, but note that regex performance will degrade quickly for long lines.
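
       The reformatting step can be done with awk's paragraph mode. A minimal sketch, assuming
       paragraphs are separated by one or more blank lines (the function name
       paragraphs_to_lines is hypothetical):

```shell
# Hypothetical filter: join each blank-line-separated paragraph into one line.
paragraphs_to_lines() {
    # RS = "" puts awk in paragraph mode: each record is one paragraph.
    awk 'BEGIN { RS = "" } { gsub(/\n/, " "); print }'
}
```

       The output of this filter can be piped into dbacl, which reads from STDIN when no FILE
       is given.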

PERFORMANCE

The underlying assumption of statistical learning is that a relatively small number of
training documents can represent a much larger set of input documents. Thus in the long
run, learning can grind to a halt without serious impact on classification accuracy. While
not true in reality, this assumption is surprisingly accurate for problems such as email
filtering. In practice, this means that a well chosen corpus on the order of ten thousand
documents is sufficient for highly accurate results for years. Continual learning after
such a critical mass results in diminishing returns. Of course, when real world input
document patterns change dramatically, the predictive power of the models can be lost. At
the other end, a few hundred documents already give acceptable results in most cases.

dbacl is heavily optimized for the case of frequent classifications but infrequent batch
learning. This is the long run optimum described above. Under ideal conditions, dbacl can
classify a hundred emails per second on low end hardware (500MHz Pentium III). Learning
speed is not very much slower, but takes effectively much longer for large document
collections for various reasons. When using the -m switch, data structures are
aggressively mapped into memory if possible, reducing overheads for both I/O and memory
allocations.

dbacl throws away its input as soon as possible, and has no limits on the input document
size. Both classification and learning speed are directly proportional to the number of
tokens in the input, but learning also needs a nonlinear optimization step which takes
time proportional to the number of unique tokens discovered. At time of writing, dbacl is
one of the fastest open source mail filters given its optimal usage scenario, but uses
more memory for learning than other filters.

MULTIPLE PROCESSES AND DATA CORRUPTION

       When  saving category files, dbacl first writes out a temporary file in the same location,
       and renames it afterwards. If a problem or crash occurs during learning, the old  category
       file  is therefore left untouched. This ensures that categories can never be corrupted, no
       matter how many processes try to simultaneously learn or classify, and  means  that  valid
       categories are available for classification at any time.
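
       The write-then-rename pattern described above can be sketched in shell as follows. This
       only illustrates the general technique and is not dbacl's own code; the function name
       atomic_write is hypothetical.

```shell
# Illustration only: replace a file's contents without ever exposing a
# half-written file to readers.
atomic_write() {
    target=$1
    # Create the temporary file next to the target, so that mv is a rename
    # within one filesystem rather than a copy.
    tmp=$(mktemp "$target.XXXXXX") || return 1
    if cat > "$tmp"; then
        mv -f "$tmp" "$target"
    else
        # Writing failed: discard the partial file, the old target is untouched.
        rm -f "$tmp"
        return 1
    fi
}
```

       A reader that opens the target at any moment sees either the complete old contents or
       the complete new contents.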

       When  using the -m switch, file contents are memory mapped for speedy reading and writing.
       This, together with the -o switch, is intended mainly for testing purposes, when  tens  of
       thousands  of  messages  must  be  learned  and  scored in a laboratory to measure dbacl's
       accuracy. Because no file locking is attempted for performance  reasons,  corruptions  are
       possible, unless you make sure that only one dbacl process reads or writes any file at any
       given time. This is the only case (-m and -o together) when corruption is possible.
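
       One possible way to make sure that only one process runs at a time is to serialise the
       invocations yourself. The sketch below relies on mkdir(1) failing when the directory
       already exists; the helper name with_lock and the lock path are assumptions made for
       this example, and a stale lock directory is left behind if the shell is killed while
       the command runs.

```shell
# Hypothetical wrapper: run a command while holding a directory-based lock.
with_lock() {
    lockdir=$1
    shift
    # mkdir either creates the directory or fails, so only one caller proceeds.
    until mkdir "$lockdir" 2>/dev/null; do
        sleep 1
    done
    "$@"
    status=$?
    rmdir "$lockdir"
    return "$status"
}
```

       For example:

       % with_lock twain.lock dbacl -m -l twain Mark_Twain.txt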

MEMORY USE

When classifying a document, dbacl loads all indicated categories into RAM, so the total
memory needed is approximately the sum of the category file sizes plus a fixed small
overhead. The input document is consumed while being read, so its size doesn't matter,
but very long lines can take up space. When using the -m switch, the categories are read
using mmap(2) as available.

When learning, dbacl keeps a large structure in memory which contains many objects which
won't be saved into the output category. The size of this structure is proportional to the
number of unique tokens read, but not the size of the input documents, since they are
discarded while being read. As a rough guide, this structure is 4x-5x the size of the
final category file that is produced.

To prevent unchecked memory growth, dbacl allocates by default a fixed smallish amount of
memory for tokens. When this space is used up, further tokens are discarded, which skews
the learned category and makes it less usable as more tokens are dropped. A warning is
printed on STDERR in such a case.

The -h switch lets you fix the initial size of the token space in powers of 2, i.e. "-h 17"
means 2^17 = 131072 possible tokens. If you type "dbacl -V", you can see the number of
bytes needed for each token when either learning or classifying. Multiply this number by
the maximum number of possible tokens to estimate the memory needed for learning. The -H
switch lets dbacl grow its tables automatically if and when needed, up to a maximum
specified. So if you type "-H 21", then the initial size will be doubled repeatedly if
necessary, up to approximately two million unique tokens.
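
The estimate described above is a simple multiplication, sketched here in shell. The
function name is made up, and the figure of 20 bytes per token is only a placeholder:
use the value that "dbacl -V" reports on your system.

```shell
# Hypothetical helper: token_table_bytes EXPONENT BYTES_PER_TOKEN
# Prints 2^EXPONENT * BYTES_PER_TOKEN.
token_table_bytes() {
    echo $(( (1 << $1) * $2 ))
}

# Placeholder of 20 bytes per token; substitute the number printed by "dbacl -V".
token_table_bytes 17 20   # table size for -h 17
token_table_bytes 21 20   # upper bound when growing up to -H 21
```

With the placeholder value, the two calls print 2621440 and 41943040 respectively.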

When learning with the -X switch, a handful of input documents are also kept in RAM
throughout.

ENVIRONMENT

       DBACL_PATH
              When this variable is set, its value is prepended to every category filename  which
              doesn't start with a '/' or a '.'.
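
       The rule can be pictured with the following shell sketch. It is not dbacl's code: the
       function name resolve_category is hypothetical, and joining the two parts with a "/"
       is an assumption of this example.

```shell
# Hypothetical illustration of the DBACL_PATH rule.
resolve_category() {
    case $1 in
        /*|.*)
            # Names starting with '/' or '.' are used unchanged.
            printf '%s\n' "$1"
            ;;
        *)
            if [ -n "${DBACL_PATH:-}" ]; then
                # Assumption: the value and the name are joined with a slash.
                printf '%s/%s\n' "$DBACL_PATH" "$1"
            else
                printf '%s\n' "$1"
            fi
            ;;
    esac
}
```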

SIGNALS

       INT    If  this  signal  is  caught, dbacl simply exits without doing any cleanup or other
              operations. This signal can often be sent by pressing Ctrl-C on the  keyboard.  See
              stty(1).

       HUP, QUIT, TERM
              If  one  of  these  signals  is caught, dbacl stops reading input and continues its
              operation as if no more input was available. This is a way of quitting  gracefully,
              but  note  that  in  learning  mode,  a  category file will be written based on the
              incomplete input. The QUIT signal can often be sent by pressing Ctrl-\ on the
              keyboard. See stty(1).

       USR1   If  this  signal  is  caught,  dbacl reloads the current categories at the earliest
              feasible opportunity. This is not normally useful at all, but might be  in  special
              cases,  such as if the -f switch is invoked together with input from a long running
              pipe.
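
       From a script, these signals can be sent with kill(1). The sketch below assumes the
       process was started in the background from the same shell, so that wait can collect
       its exit status; the function name stop_reading is made up for this example.

```shell
# Hypothetical helper: stop_reading PID
# Sends TERM to a background job of this shell and waits for it to finish.
# The function returns the job's exit status.
stop_reading() {
    kill -TERM "$1" && wait "$1"
}
```

       Here PID would be the value of $! recorded right after starting dbacl in the
       background.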

NOTES

dbacl generated category files are in binary format, and may or may not be portable to
systems using a different byte order architecture (this depends on how dbacl was
compiled). The -V switch prints out whether categories are portable, or else you can just
experiment.

dbacl does not recognize functionally equivalent regular expressions, and in this case
duplicate features will be counted several times.

With every learned category, the command line options that were used are saved. When
classifying, make sure that every relevant category was learned with the same set of
options (regexes are allowed to differ), otherwise behaviour is undefined. There is no
need to repeat all the switches when classifying.

If you get many digitization warnings, then you are trying to learn too much data at once,
or your model is too complex. dbacl is compiled to save memory by digitizing final
weights, but you can disable digitization by editing dbacl.h and recompiling.

dbacl offers several built-in tokenizers (see -e switch) with more to come in future
versions, as the author invents them. While the default tokenizer may evolve, no
tokenizer should ever be removed, so that you can always simulate previous dbacl behaviour
subject to bug fixes and architectural changes.

The confidence estimates obtained through the -X switch are underestimates, i.e. they are more
conservative than they should be.

BUGS

       "Ya  know,  some  day  scientists are gonna invent something that will outsmart a rabbit."
       (Robot Rabbit, 1953)

SOURCE

       The source code for the latest version of this  program  is  available  at  the  following
       locations:

       http://www.lbreyer.com/gpl.html
       http://dbacl.sourceforge.net

AUTHOR

       Laird A. Breyer <laird@lbreyer.com>
