NAME

       dbacl - a digramic Bayesian classifier for text recognition.

SYNOPSIS


       dbacl [-01dvnirmwMNDXW] [-T type] -l category [-h size] [-H gsize] [-x decim] [-q quality]
              [-w max_order] [-e deftok] [-o online] [-L measure] [-g regex]...  [FILE]...

       dbacl [-vnimNRX] [-h size] [-T type] -c category [-c category]...  [-f keep]...  [FILE]...

       dbacl -V

OVERVIEW

       dbacl is a Bayesian text and email classifier. When using the -l switch, it learns a body of text and
       produces a file named category which summarizes the text. When using the -c switch, it compares an input
       text stream with any number of category files, and outputs the name of the closest match, or optionally
       various numerical scores explained below.

       While this manual page is intended as a reference, there are several tutorials and documents you can
       read for specialized information. Specific documentation about the design of dbacl and the statistical
       models it uses can be found in dbacl.ps.  For a basic overview of text classification using dbacl, see
       tutorial.html. A companion tutorial geared towards email filtering is email.html. If you have trouble
       getting dbacl to classify reliably, read is_it_working.html.  The USAGE section of this manual page
       also has some examples.

       /usr/share/doc/dbacl/dbacl.ps

       /usr/share/doc/dbacl/tutorial.html

       /usr/share/doc/dbacl/email.html

       /usr/share/doc/dbacl/is_it_working.html

       dbacl  uses  a maximum entropy (minimum divergence) language model constructed with respect to a digramic
       reference measure (unknown tokens are predicted from digrams, i.e. pairs of letters).  Practically,  this
       means  that a category is constructed from tokens in the training set, while previously unseen tokens can
       be predicted automatically from their letters. A token here is either a word (fragment) or a  combination
       of  words  (fragments),  selected according to various switches. Learning roughly works by tweaking token
       probabilities until the training data is least surprising.

DESCRIPTION

       When using the -l command form, dbacl learns a category when given one or more FILE names,  which  should
       contain  readable ASCII text. If no FILE is given, dbacl learns from STDIN. If FILE is a directory, it is
       opened and all its files are read, but not its subdirectories. The result is saved  in  the  binary  file
       named  category,  and  completely  replaces  any  previous contents. As a convenience, if the environment
       variable DBACL_PATH contains a directory, then that is prepended to the file path, unless category starts
       with a '/' or a '.'.
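
       For example (the directory and corpus names here are illustrative), the following stores the learned
       category in $HOME/.dbacl/spam instead of the current directory:

       % mkdir -p $HOME/.dbacl
       % export DBACL_PATH=$HOME/.dbacl
       % dbacl -l spam spam_corpus.txt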

       The input text for learning is assumed to be unstructured plain text by default. This is not suitable for
       learning email, because email contains various transport encodings and formatting instructions which can
       reduce classification effectiveness. You must use the -T switch in that case so that dbacl knows it
       should perform decoding and filtering of MIME and HTML as appropriate. Appropriate switch values are "-T
       email" for RFC2822 email input, "-T html" for HTML input, "-T xml" for generic XML style input, and "-T
       text" for the default plain text format. There are other values of the -T switch that also allow fine
       tuning of the decoding capabilities.
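
       For instance (the mbox file name is an assumption for illustration), the following learns an email
       category with MIME and HTML decoding enabled:

       % dbacl -T email -l spam $HOME/mail/spam.mbox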

       When using the -c command form, dbacl attempts to classify the text found in FILE, or STDIN if no FILE is
       given.  Each  possible  category  must  be  given separately, and should be the file name of a previously
       learned text corpus. As a convenience, if the variable DBACL_PATH contains a directory, it  is  prepended
       to  each  file  path  which  doesn't  start with a '/' or a '.'. The visible output of the classification
       depends on the combination of extra switches used. If no switch is used,  then  no  output  is  shown  on
       STDOUT. However, dbacl always produces an exit code which can be tested.
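
       A minimal sketch of testing the exit code from the shell (the category names are assumed to exist); the
       printed value is 1 if the first category matches best, 2 for the second, and 0 on error:

       % echo "some text" | dbacl -c twain -c shake; echo $?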

       To  see  an  output  for  a  classification, you must use at least one of the -v,-U,-n,-N,-D,-d switches.
       Sometimes, they can be used in combination to produce a natural variation of  their  individual  outputs.
       Sometimes, dbacl also produces warnings on STDERR if applicable.

       The -v switch outputs the name of the best category among all the choices given.

       The  -U  switch outputs the name of the best category followed by a confidence percentage. Normally, this
       is the switch that you want to use. A percentage of 100% means that dbacl is sure of its choice, while  a
       percentage of 0% means that some other category is equally likely. This is not the model probability, but
       measures  how  unambiguous  the classification is, and can be used to tag unsure classifications (e.g. if
       the confidence is 25% or less).
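
       A sketch of such tagging in plain sh (the file and category names are assumptions); the -U output has
       the form "category # NN%", as shown in the USAGE section below:

       % OUT=`dbacl -U -c twain -c shake message.txt`
       % CONF=`echo "$OUT" | sed 's/.*# //; s/%//'`
       % [ "$CONF" -le 25 ] && echo "unsure: $OUT" || echo "$OUT"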

       The -N switch prints each category name followed by its (posterior) probability, expressed as a
       percentage. The percentages always sum to 100%. This is intuitive, but only valuable if the document
       being classified contains a handful of tokens (ten or fewer). In the common case with many more tokens,
       the probabilities are always extremely close to 100% and 0%.

       The  -n  switch  prints each category name followed by the negative logarithm of its probability. This is
       equivalent to using the -N switch, but much more useful. The smallest number gives the best  category.  A
       more  convenient  form  is  to  use  both -n and -v which prints each category name followed by the cross
       entropy and the number of tokens analyzed. The cross entropy measures (in bits) the  average  compression
       rate  which  is achievable, under the given category model, per token of input text. If you use all three
       of -n,-v,-X then an extra value is output for each category, representing a  kind  of  p-value  for  each
       category  score.  This  indicates  how  typical the score is compared to the training documents, but only
       works if the -X switch was used during learning, and only for some types of models (e.g.  email).   These
       p-values  are  uniformly  distributed  and  independent  (if  the  categories are independent), so can be
       combined using Fisher's chi squared test to obtain composite p-values for groupings of categories.
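
       For example (the category names are assumed, and each category must have been learned with -X):

       % dbacl -n -v -X -c ham -c spam message.txt

       To combine the resulting p-values p1,...,pk by Fisher's method, note that -2(ln p1 + ... + ln pk)
       follows a chi squared distribution with 2k degrees of freedom under the null hypothesis.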

       The -v and -X switches together print each category name followed by  a  detailed  decomposition  of  the
       category  score,  factored into ( divergence rate + shannon entropy rate )* token count @ p-value. Again,
       this only works in some types of models.

       The -v and -U switches print each category name followed by a decomposition of the category score into  (
       divergence rate + shannon entropy rate # score variance )* token count.

       The -D switch prints out the input text as modified internally by dbacl prior to tokenization. For
       example, if a MIME encoded email document is classified, then this prints the decoded text that will
       actually be tokenized and classified. This switch is mainly useful for debugging.

       The  -d  switch  dumps tokens and scores while they are being read. It is useful for debugging, or if you
       want to create graphical representations of the classification. A detailed explanation of the  output  is
       beyond  the  scope  of  this  manual  page,  but  is  straightforward  if you've read dbacl.ps.  Possible
       variations include -d together with -n or -N.

       Classification can be done with one or several categories in principle. When two or more categories are
       used, the Bayesian posterior probability is used, given the input text, with a uniform prior distribution
       on categories. For other choices of prior, see the companion utility bayesol(1).  When a single category
       is used, classification can be done by comparing the score with a threshold. In practice however, much
       better results are obtained with several categories.
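
       A minimal sketch of single-category thresholding (the category name is assumed and the cutoff of 7.0
       bits per token is arbitrary); with -n and -v together, the second field is the cross entropy:

       % dbacl -n -v -c shake message.txt | awk '{ if ($2 < 7.0) print "shake"; else print "reject" }'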

       Learning and classifying cannot be mixed in the same command invocation; however, there are no locking
       issues, and separate dbacl processes can operate simultaneously with predictable results, because file
       operations are designed to be atomic.

       Finally, note that dbacl does not manage your document corpora or your computed categories, and in
       particular it does not allow you to extend an existing category file with new documents. This is unlike
       various current spam filters, which can learn new emails incrementally. This limitation of dbacl is
       partly due to the nonlinear procedure used in the learning algorithm, and partly due to a desire for
       increased flexibility.

       You  can  simulate the effect of incremental learning by saving your training documents into archives and
       adding to these archives over time, relearning from scratch periodically. Learning is actually faster  if
       these  archives  are  compressed  and  decompressed  on  the  fly when needed. By keeping control of your
       archives, you can never lose the information in your categories,  and  you  can  easily  experiment  with
       different switches or tokenizations or sets of training documents if you like.
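
       A sketch of this workflow (the file names are illustrative); concatenated gzip streams are valid, so new
       messages can be appended without decompressing the archive:

       % gzip -c new_spam.mbox >> archive/spam.mbox.gz
       % zcat archive/spam.mbox.gz | dbacl -T email -l spam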

SECONDARY SWITCHES

       By  default,  dbacl classifies the input text as a whole. However, when using the -f option, dbacl can be
       used to filter each input line separately, printing only those lines  which  match  one  or  more  models
       identified  by  keep (use the category name or number to refer to a category). This is useful if you want
       to filter out some lines, but note that if the lines are short, then the error rate can be high.

       The -e,-w,-g,-j switches are used for selecting an appropriate tokenization scheme. A token is a word or
       word fragment, or a combination of words or fragments. The shape of tokens is important because it forms
       the basis of the language models used by dbacl.  The -e switch selects a predefined tokenization scheme,
       which is speedy but limited.  The -w switch specifies composite tokens derived from the -e switch. For
       example, "-e alnum -w 2" means that tokens should be alphanumeric word fragments combined into
       overlapping pairs (bigrams). By default, all tokens are converted to lowercase, which reduces the number
       of possible tokens and therefore memory consumption; the -j switch preserves the original capitalization
       (see the -j entry under OPTIONS).
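
       For example (the category and corpus names are assumed), the following learns such overlapping
       alphanumeric bigrams:

       % dbacl -l mycat -e alnum -w 2 corpus.txt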

       If the -g switch is used, you can completely specify what the tokens should look like using a regular
       expression. Several -g switches can be used to construct complex tokenization schemes, and parentheses
       within each expression can be used to select fragments and combine them into n-grams. The cost of such
       flexibility is reduced classification and learning speed. When experimenting with tokenization schemes,
       try using the -d or -D switches while learning or classifying, as they will print the tokens explicitly
       so you can see which text fragments are picked up or missed. For regular expression syntax, see
       regex(7).

       The -h and -H switches regulate how much memory dbacl may use for learning. Text classification can use a
       lot of memory, and by default dbacl limits itself even at the expense of learning accuracy. In many
       cases, if a limit is reached, a warning message is printed on STDERR with some advice.
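
       For example (the category and corpus names are assumed), the following starts with a hash table of 2^18
       elements and lets it grow up to 2^22 elements if needed:

       % dbacl -l big -h 18 -H 22 corpus.txt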

       When  relearning  the  same category several times, a significant speedup can be obtained by using the -1
       switch, as this allows the previously learned probabilities to be read from the category and reused.

       Note that classification accuracy depends foremost on the amount and quality of the training samples, and
       only secondarily on the amount of tweaking.

EXIT STATUS

       The normal shell exit conventions aren't followed (sorry!). When using the -l command form, dbacl returns
       zero on success and nonzero if an error occurs. When using the -c form, dbacl returns a positive integer
       (1,2,3...) corresponding to the category with the highest posterior probability. In case of a tie, the
       first most probable category is chosen. If an error occurs, dbacl returns zero.

OPTIONS

       -0     When  learning,  prevents  weight  preloading. Normally, dbacl checks if the category file already
              exists, and if so, tries to use the existing weights as a starting point.  This  can  dramatically
              speed  up  learning.   If  the  -0 (zero) switch is set, then dbacl behaves as if no category file
              already exists. This is mainly useful for testing.  This switch is  now  enabled  by  default,  to
              protect  against  weight  drift which can reduce accuracy over many learning iterations. Use -1 to
              force preloading.

       -1     Force weight preloading if the category file already exists. See discussion of the -0 switch.

       -a     Append scores. Every input line is written to STDOUT and the dbacl scores are  appended.  This  is
              useful  for  postprocessing with bayesol(1).  For ease of processing, every original input line is
              indented by a single space (to distinguish them from the appended scores), and the line  with  the
              scores  (if  -n is used) is prefixed with the string "scores ". If a second copy of dbacl needs to
              read this output later, it should be invoked with the -A switch.

       -d     Dump the model parameters to STDOUT. In conjunction with the -l option,  this  produces  a  human-
              readable  summary  of  the  maximum entropy model. In conjunction with the -c option, displays the
              contribution of each token to the final score. Suppresses all other normal output.

       -e     Select character class  for  default  (not  regex-based)  tokenization.  By  default,  tokens  are
              alphabetic  strings only. This corresponds to the case when deftok is "alpha". Possible values for
              deftok are "alpha", "alnum", "graph", "char", "cef" and "adp".  The last two are custom tokenizers
              intended for email messages.  See also isalpha(3).  The "char" tokenizer picks up single printable
              characters rather than bigger tokens, and is intended for testing only.

       -f     Filter each line of input separately, passing to  STDOUT  only  lines  which  match  the  category
              identified  as  keep.  This option should be used repeatedly for each category which must be kept.
              keep can be either the category file  name,  or  a  positive  integer  representing  the  required
              category in the same order it appears on the command line.

              Output  lines  are  flushed  as soon as they are written. If the input file is a pipe or character
              device, then an attempt is made to use line buffering mode, otherwise  the  more  efficient  block
              buffering is used.

       -g     Learn  only  features  described  by  the  extended  regular expression regex.  This overrides the
              default feature selection method (see -w option) and learns, for each line of input,  only  tokens
              constructed  from  the  concatenation  of strings which match the tagged subexpressions within the
              supplied regex.  All substrings which match regex within a suffix of each input line  are  treated
              as features, even if they overlap on the input line.

              As an optional convenience, regex can include the suffix ||xyz which indicates which parenthesized
              subexpressions  should  be  tagged. In this case, xyz should consist exclusively of digits 1 to 9,
              numbering exactly those subexpressions which should be tagged. Alternatively,  if  no  parentheses
              exist within regex, then it is assumed that the whole expression must be captured.

       -h     Set  the  size  of the hash table to 2^size elements. When using the -l option, this refers to the
              total number of features allowed in the maximum entropy model being learned.  When  using  the  -c
              option together with the -M switch and multinomial type categories, this refers to the maximum
              number of features taken into account during classification.  Without the -M switch,  this  option
              has no effect.

       -i     Fully  internationalized mode. Forces the use of wide characters internally, which is necessary in
              some locales. This incurs a noticeable performance penalty.

       -j     Make features  case  sensitive.  Normally,  all  features  are  converted  to  lower  case  during
              processing,  which  reduces  storage  requirements  and  improves  statistical estimates for small
              datasets. With this option, the original capitalization is used for each feature. This can improve
              classification accuracy.

       -m     Aggressively maps categories into memory and locks them into RAM to prevent swapping, if possible.
              This is useful when speed is paramount and memory is  plentiful,  for  example  when  testing  the
              classifier on large datasets.

              Locking  may  require  relaxing user limits with ulimit(1).  Ask your system administrator. Beware
              when using the -m switch together with the -o switch, as only one  dbacl  process  must  learn  or
              classify  at a time to prevent file corruption. If no learning takes place, then the -m switch for
              classifying is always safe to use. See also the discussion for the -o switch.

       -n     Print scores for each category.  Each score is the product of two numbers, the cross  entropy  and
              the  complexity  of  the  input text under each model. Multiplied together, they represent the log
              probability that the input resembles the model. To see these numbers separately, use also  the  -v
              option.  In  conjunction  with the -f option, stops filtering but prints each input line prepended
              with a list of scores for that line.

       -q     Select quality of learning, where quality can be 1,2,3,4. Higher values take longer to learn,  and
              should  be  slightly more accurate. The default quality is 1 if the category file doesn't exist or
              weights cannot be preloaded, and 2 otherwise.

       -o     When learning, reads/writes partial token counts so they can be reused. Normally,  category  files
              are learned from exactly the input data given, and don't contain extraneous information. When this
              option is in effect, some extra information is saved in the file online, after all input was read.
              This  information can be reread the next time that learning occurs, to continue where the previous
              dataset left off. If online doesn't exist, it is created. If online  exists,  it  is  read  before
              learning,  and  updated  afterwards.  The file is approximately 3 times bigger (at least) than the
              learned category.

              In dbacl, file updates are atomic, but if using the -o switch, two or more  processes  should  not
              learn  simultaneously,  as  only one process will write a lasting category and memory dump. The -m
              switch can also speed up online learning, but beware of possible  corruption.   Only  one  process
              should read or write a file. This option is intended primarily for controlled test runs.

       -r     Learn the digramic reference model only. Skips the learning of extra features in the text corpus.

       -v     Verbose mode. When learning, print out details of the computation, when classifying, print out the
              name  of  the  most probable category.  In conjunction with the -n option, prints the scores as an
              explicit product of the cross entropy and the complexity.

       -w     Select default features to be n-grams up to max_order.  This is incompatible with the  -g  option,
              which  always takes precedence. If no -w or -g options are given, dbacl assumes -w 1. Note that n-
              grams for n greater than 1 do not straddle line breaks by default.  The  -S  switch  enables  line
              straddling.

       -x     Set  decimation  probability to 1 - 2^(-decim).  To reduce memory requirements when learning, some
              inputs are randomly skipped, and only a few are added to the model.  Exact  behaviour  depends  on
              the applicable -T option (default is -T "text").  When the type is not "email" (e.g. "text"), then
              individual input features are added with probability 2^(-decim). When the type  is  "email",  then
              full input messages are added with probability 2^(-decim).  Within each such message, all features
              are used.
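
               For example (the mbox name is assumed), with "-x 2" each whole email is added to the model
               with probability 2^(-2) = 1/4, i.e. roughly one message in four is learned:

               % dbacl -T email -l spam -x 2 huge_spam.mbox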

       -A     Expect  indented input and scores. With this switch, dbacl expects input lines to be indented by a
              single space character (which is then skipped).  Lines  starting  with  any  other  character  are
              ignored.  This  is the counterpart to the -a switch above.  When used together with the -a switch,
              dbacl outputs the skipped lines as they are,  and  reinserts  the  space  at  the  front  of  each
              processed input line.

       -D     Print debug output. Do not use normally, but can be very useful for displaying the list of
              features picked up while learning.

       -H     Allow hash table to grow up to a maximum of 2^gsize elements  during  learning.  Initial  size  is
              given by -h option.

       -L     Select  the  digramic  reference  measure  for  character  transitions.  The measure can be one of
              "uniform", "dirichlet" or "maxent". Default is "uniform".

       -M     Force  multinomial  calculations.  When  learning,  forces  the  model  features  to  be   treated
              multinomially.  When  classifying,  corrects  entropy  scores to reflect multinomial probabilities
              (only applicable to multinomial type models, if present).  Scores will always  be  lower,  because
              the ordering of features is lost.

       -N     Print  posterior  probabilities  for  each category.  This assumes the supplied categories form an
              exhaustive list of possibilities.  In conjunction with the -f option, stops filtering  but  prints
              each input line prepended with a summary of the posterior distribution for that line.

       -R     Include  an  extra  category  for purely random text. The category is called "random".  Only makes
              sense when using the -c option.

       -S     Enable line straddling. This is useful together with the -w option to allow n-grams for n >  1  to
              ignore  line  breaks,  so  a  complex  token  can  continue  past the end of the line. This is not
              recommended for email.

       -T     Specify nonstandard text format. By default, dbacl assumes that the input text is a  purely  ASCII
              text file. This corresponds to the case when type is "text".

              There  are  several  types  and  subtypes  which can be used to clean the input text of extraneous
              tokens before actual learning or classifying takes place. Each (sub)type you wish to use  must  be
              indicated  with  a  separate  -T  option  on  the  command  line,  and  automatically  implies the
              corresponding type.

              The "text" type is for unstructured plain text. No cleanup is performed. This is the default if no
              types are given on the command line.

              The "email" type is for mbox format input files or single RFC822 emails.  Headers  are  recognized
              and most are skipped. To include extra RFC822 standard headers (except for trace headers), use the
              "email:headers"  subtype.   To include trace headers, use the "email:theaders" subtype. To include
              all headers in the email, use the "email:xheaders"  subtype.  To  skip  all  headers,  except  the
              subject,  use  "email:noheaders".  To  scan  binary  attachments for strings, use the "email:atts"
              subtype.

              When the "email" type is in effect, HTML markup is automatically  removed  from  text  attachments
              except  text/plain  attachments.  To  also  remove  HTML  markup  from plain text attachments, use
              "email:noplain". To prevent HTML markup removal in all text attachments, use "email:plain".

              The "html" type is for removing HTML markup (between <html>  and  </html>  tags)  and  surrounding
              text.  Note  that  if  the  "email"  type  is  enabled,  then  "html" is automatically enabled for
              compatible message attachments only.

              The "xml" type is like "html", but doesn't honour <html> and </html>, and doesn't  interpret  tags
              (so  this should be more properly called "angle markup" removal, and has nothing to do with actual
              XML semantics).

              When "html" is enabled, most markup attributes are lost (for values of  'most'  close  to  'all').
              The  "html:links"  subtype  forces  link  urls  to be parsed and learned, which would otherwise be
              ignored. The "html:alt" subtype forces parsing of alternative text in ALT attributes  and  various
              other  tags. The "html:scripts" subtype forces parsing of scripts, "html:styles" forces parsing of
              styles, "html:forms" forces parsing of form values, while "html:comments" forces parsing  of  HTML
              comments.

       -U     Print (U)nambiguity.  When used in conjunction with the -v switch, prints scores followed by their
              empirical standard deviations. When used alone, prints the best category, followed by an estimated
              probability  that  this  category  choice is unambiguous. More precisely, the probability measures
              lack of overlap of CLT confidence intervals for each category score (If  there  is  overlap,  then
              there is ambiguity).

              This  estimated  probability can be used as an "unsure" flag, e.g. if the estimated probability is
              lower than 50%. Formally, a score of 0% means another category is equally likely to apply  to  the
              input, and a score of 100% means no other category is likely to apply to the input. Note that this
              type of confidence is unrelated to the -X switch. Also, the probability estimate is usually low if
              the  document  is  short,  or if the message contains many tokens that have never been seen before
              (only applies to uniform digramic measure).

       -V     Print the program version number and exit.

       -W     Like -w, but prevents features from straddling newlines. See the description of -w.

       -X     Print the confidence in the score calculated for each category, when used together with the -n  or
              -N switch. Prepares the model for confidence scores, when used with the -l switch.  The confidence
              is  an  estimate  of  the  typicality  of  the  score, assuming the null hypothesis that the given
              category is correct. When used with the -v switch alone, factorizes the  score  as  the  empirical
              divergence plus the shannon entropy, multiplied by complexity, in that order. The -X switch is not
              supported  in  all  possible models, and displays a percentage of "0.0" if it can't be calculated.
              Note that for unknown documents, it is quite common to have confidences close to zero.

USAGE

       To create two category files in the current directory from two ASCII text files named Mark_Twain.txt  and
       William_Shakespeare.txt respectively, type:

       % dbacl -l twain Mark_Twain.txt
       % dbacl -l shake William_Shakespeare.txt

       Now you can classify input text, for example:

       % echo "howdy" | dbacl -v -c twain -c shake
       twain
       % echo "to be or not to be" | dbacl -v -c twain -c shake
       shake

       Note that at least the -v option is necessary, otherwise dbacl does not print anything. The return value
       is 1 in the first case, 2 in the second.

       % echo "to be or not to be" | dbacl -v -N -c twain -c shake
       twain 22.63% shake 77.37%
       % echo "to be or not to be" | dbacl -v -n -c twain -c shake
       twain  7.04 * 6.0 shake  6.74 * 6.0

       These invocations are equivalent. The numbers 6.74 and 7.04 represent how close the average token  is  to
       each  category,  and 6.0 is the number of tokens observed. If you want to print a simple confidence value
       together with the best category, replace -v with -U.

       % echo "to be or not to be" | dbacl -U -c twain -c shake
       shake # 34%

       Note that the true probability of category shake versus category twain is 77.37%, but the calculation  is
       somewhat ambiguous, and 34% is the confidence out of 100% that the calculation is qualitatively correct.

       Suppose a file document.txt contains English text lines interspersed with noise lines. To filter out the
       noise lines from the English lines, assuming you have an existing category such as shake, type:

       % dbacl -c shake -f shake -R document.txt > document.txt_eng
       % dbacl -c shake -f random -R document.txt > document.txt_rnd

       Note that the quality of the results will vary depending on how well  the  categories  shake  and  random
       represent  each  input  line.   It  is  sometimes useful to see the posterior probabilities for each line
       without filtering:

       % dbacl -c shake -f shake -RN document.txt > document.txt_probs

       You can now postprocess the posterior probabilities for  each  line  of  text  with  another  script,  to
       replicate an arbitrary Bayesian decision rule of your choice.

       In the special case of exactly two categories, the optimal Bayesian decision procedure can be implemented
       for documents as follows: let p1 be the prior probability that the input text is classified as category1.
       Consequently,  the  prior  probability  of  classifying  as  category2 is 1 - p1.  Let u12 be the cost of
       misclassifying a category1 input text as belonging to category2 and vice versa for u21.  We assume  there
       is  no  cost  for  classifying  correctly.   Then  the  following command implements the optimal Bayesian
       decision:

       % dbacl -n -c category1 -c category2 | awk '{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1; } else
              { print $3; } }'

       dbacl  can  also  be  used  in  conjunction  with  procmail(1)  to  implement  a  simple  Bayesian  email
       classification  system.  Assume that incoming mail should be automatically delivered to one of three mail
       folders located in $MAILDIR and named work, personal, and spam.  Initially, these  must  be  created  and
       filled  with appropriate sample emails.  A crontab(1) file can be used to learn the three categories once
       a day, e.g.

       CATS=$HOME/.dbacl
       5  0 * * * dbacl -T email -l $CATS/work $MAILDIR/work
       10 0 * * * dbacl -T email -l $CATS/personal $MAILDIR/personal
       15 0 * * * dbacl -T email -l $CATS/spam $MAILDIR/spam

       To automatically deliver each incoming email into the appropriate  folder,  the  following  procmailrc(5)
       recipe fragment could be used:

       CATS=$HOME/.dbacl

       # run the spam classifier
       :0 c
       YAY=| dbacl -vT email -c $CATS/work -c $CATS/personal -c $CATS/spam

       # send to the appropriate mailbox
       :0:
       * ? test -n "$YAY"
       $MAILDIR/$YAY

       :0:
       $DEFAULT

       Sometimes, dbacl will send the email to the wrong mailbox. In that case, the misclassified message should
       be removed from its wrong destination and placed in the correct mailbox.  The error will be corrected the
       next  time  your  messages  are learned.  If it is left in the wrong category, dbacl will learn the wrong
       corpus statistics.

       The default text features (tokens) read by dbacl are purely alphabetic strings,  which  minimizes  memory
       requirements  but can be unrealistic in some cases. To construct models based on alphanumeric tokens, use
       the -e switch. The example below also uses the optional -D switch, which prints a list of  actual  tokens
       found in the document:

       % dbacl -e alnum -D -l twain Mark_Twain.txt | less

       It  is also possible to override the default feature selection method used to learn the category model by
       means of regular expressions. For example, the following duplicates the default feature selection  method
       in the C locale, while being much slower:

       % dbacl -l twain -g '^([[:alpha:]]+)' -g '[^[:alpha:]]([[:alpha:]]+)' Mark_Twain.txt

       The  category  twain  which  is  obtained  depends  only  on  single  alphabetic  words  in the text file
       Mark_Twain.txt (and computed digram statistics for prediction).  For  a  second  example,  the  following
       command  builds  a  smoothed  Markovian  (word  bigram) model which depends on pairs of consecutive words
       within each line (but pairs cannot straddle a line break):

       % dbacl -l twain2 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2' -g
              '(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23' Mark_Twain.txt

       More general line-based n-gram models of all orders (up to 7) can be built in a similar way. To
       construct paragraph based models, you should reformat the input corpora with awk(1) or sed(1) to obtain
       one paragraph per line. Line size is limited by available memory, but note that regex performance will
       degrade quickly for long lines.
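
       A sketch using awk's paragraph mode (RS="" splits records on blank lines; the file and category names
       are assumed):

       % awk 'BEGIN { RS=""; ORS="\n" } { gsub(/\n/," "); print }' Mark_Twain.txt | dbacl -l twain_para -w 2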

PERFORMANCE

       The underlying assumption of statistical learning is that a relatively small number of training documents
       can  represent  a  much larger set of input documents. Thus in the long run, learning can grind to a halt
       without serious impact on classification  accuracy.  While  not  true  in  reality,  this  assumption  is
       surprisingly  accurate  for problems such as email filtering.  In practice, this means that a well chosen
       corpus on the order of ten thousand documents is  sufficient  for  highly  accurate  results  for  years.
       Continual learning after such a critical mass results in diminishing returns.  Of course, when real world
       input document patterns change dramatically, the predictive power of the models can be lost. At the other
       end, a few hundred documents already give acceptable results in most cases.

       dbacl is heavily optimized for the case of frequent classifications but infrequent batch learning. This
       is the long run optimum described above. Under ideal conditions, dbacl can classify a hundred emails per
       second on low end hardware (500MHz Pentium III). Learning speed per token is not very much slower, but
       learning large document collections effectively takes much longer, for various reasons. When using the
       -m switch, data structures are aggressively mapped into memory if possible, reducing overheads for both
       I/O and memory allocations.

       dbacl throws away its input as soon as possible, and has no limits on the input document size. Both
       classification and learning speed are directly proportional to the number of tokens in the input, but
       learning also needs a nonlinear optimization step which takes time proportional to the number of unique
       tokens discovered. At the time of writing, dbacl is one of the fastest open source mail filters given
       its optimal usage scenario, but uses more memory for learning than other filters.

MULTIPLE PROCESSES AND DATA CORRUPTION

       When saving category files, dbacl first writes out a temporary file in the same location, and renames  it
       afterwards.  If  a  problem  or  crash  occurs  during  learning, the old category file is therefore left
       untouched. This ensures that categories can never be corrupted, no  matter  how  many  processes  try  to
       simultaneously learn or classify, and means that valid categories are available for classification at any
       time.

       When  using the -m switch, file contents are memory mapped for speedy reading and writing. This, together
       with the -o switch, is intended mainly for testing purposes, when tens of thousands of messages  must  be
       learned  and scored in a laboratory to measure dbacl's accuracy. Because no file locking is attempted for
       performance reasons, corruptions are possible, unless you make sure that only one dbacl process reads  or
       writes  any  file  at  any  given  time.  This  is  the only case (-m and -o together) when corruption is
       possible.

MEMORY USE

       When classifying a document, dbacl loads all indicated categories into RAM, so the total memory needed is
       approximately the sum of the category file sizes plus a fixed small  overhead.   The  input  document  is
       consumed while being read, so its size doesn't matter, but very long lines can take up space.  When using
       the -m switch, the categories are read using mmap(2) as available.

       When  learning,  dbacl keeps a large structure in memory which contains many objects which won't be saved
       into the output category. The size of this structure is proportional to the number of unique tokens read,
       but not the size of the input documents, since they are discarded while being read.  As  a  rough  guide,
       this structure is 4x-5x the size of the final category file that is produced.

       To prevent unchecked memory growth, dbacl allocates by default a fixed smallish amount of memory for
       tokens. When this space is used up, further tokens are discarded, which has the effect of skewing the
       learned category, making it less usable as more tokens are dropped. A warning is printed on STDERR in
       such a case.

       The -h switch lets you fix the initial size of the token space in powers of 2, i.e. "-h 17" means 2^17 =
       131072 possible tokens. If you type "dbacl -V", you can see the number of bytes needed for each token
       when either learning or classifying. Multiply this number by the maximum number of possible tokens to
       estimate the memory needed for learning. The -H switch lets dbacl grow its tables automatically if and
       when needed, up to a maximum specified. So if you type "-H 21", then the initial size will be doubled
       repeatedly if necessary, up to approximately two million unique tokens.
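
       As a worked example (the 12 bytes per token figure is hypothetical; read the actual value from "dbacl
       -V" on your system): with "-h 17", learning reserves about 2^17 * 12 = 1.5MB for tokens, while "-H 21"
       allows growth up to roughly 2^21 * 12 = 24MB.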

       When learning with the -X switch, a handful of input documents are also kept in RAM throughout.

ENVIRONMENT

       DBACL_PATH
              When this variable is set, its value is prepended to every category filename which  doesn't  start
              with a '/' or a '.'.

SIGNALS

       INT    If  this  signal is caught, dbacl simply exits without doing any cleanup or other operations. This
              signal can often be sent by pressing Ctrl-C on the keyboard. See stty(1).

       HUP, QUIT, TERM
              If one of these signals is caught, dbacl stops reading input and continues its operation as if  no
              more  input was available. This is a way of quitting gracefully, but note that in learning mode, a
               category file will be written based on the incomplete input. The QUIT signal can often be sent by
               pressing Ctrl-\ on the keyboard. See stty(1).

       USR1   If  this  signal  is  caught,  dbacl  reloads  the  current  categories  at  the earliest feasible
              opportunity. This is not normally useful at all, but might be in special cases, such as if the  -f
              switch is invoked together with input from a long running pipe.

NOTES

       dbacl  generated  category  files are in binary format, and may or may not be portable to systems using a
       different byte order architecture (this depends on how dbacl was compiled).  The  -V  switch  prints  out
       whether categories are portable, or else you can just experiment.

       dbacl does not recognize functionally equivalent regular expressions, and in this case duplicate features
       will be counted several times.

       With  every  learned category, the command line options that were used are saved.  When classifying, make
       sure that every relevant category was learned with the same  set  of  options  (regexes  are  allowed  to
       differ), otherwise behaviour is undefined. There is no need to repeat all the switches when classifying.

       If  you get many digitization warnings, then you are trying to learn too much data at once, or your model
       is too complex.  dbacl is compiled to save memory by  digitizing  final  weights,  but  you  can  disable
       digitization by editing dbacl.h and recompiling.

       dbacl  offers  several  built-in  tokenizers (see -e switch) with more to come in future versions, as the
       author invents them.  While the default tokenizer may evolve, no tokenizer should  ever  be  removed,  so
       that you can always simulate previous dbacl behaviour subject to bug fixes and architectural changes.

       The confidence estimates obtained through the -X switch are underestimates, i.e. they are more
       conservative than they should be.

BUGS

       "Ya  know,  some  day  scientists are gonna invent something that will outsmart a rabbit." (Robot Rabbit,
       1953)

SOURCE

       The source code for the latest version of this program is available at the following locations:

       http://www.lbreyer.com/gpl.html
       http://dbacl.sourceforge.net

AUTHOR

       Laird A. Breyer <laird@lbreyer.com>

SEE ALSO

       awk(1), bayesol(1), crontab(1), hmine(1), hypex(1), less(1), mailcross(1),  mailfoot(1),  mailinspect(1),
       mailtoe(1), procmailex(5), regex(7), stty(1), sed(1)

Version 1.12                           Bayesian Text Classification Tools                               DBACL(1)