Provided by: ifile_1.3.9-8_amd64

NAME

       ifile - core executable for the ifile mail filtering system

SYNOPSIS

       ifile [-b file] [-q|-Q] [-g] [-k] [-o] [-v num] [lexing options] file ...
       ifile -c -q|-Q [-T threshold] [-b file] [-g] [-k] [-o] [lexing options] file ...
       ifile [-b file] [-d folder] [-i folder|-u folder] [-g] [-k] [-o] [-v num] [lexing options]
       file ...
       ifile -r [-b file]

DESCRIPTION

       ifile is a mail filter client that uses machine learning to classify e-mail into
       folders/mailboxes.  The algorithm it uses is called Naive Bayes.  Naive Bayes treats
       each document as an unordered collection of words and classifies it by finding the
       folder/mailbox whose word distribution most closely matches the document's word
       distribution.
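
       The idea above can be sketched in a few lines of Python.  This is an illustration
       only, not ifile's actual implementation: the function names, the whitespace
       tokenizer, the add-one smoothing and the absence of a per-folder prior are all
       assumptions made for the example.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Build per-folder word counts from (folder, text) pairs."""
    counts = {}
    for folder, text in labeled_docs:
        # Word order is discarded: each document is just a bag of words.
        counts.setdefault(folder, Counter()).update(text.lower().split())
    return counts

def score(counts, text):
    """Return {folder: log-likelihood of text}, using add-one smoothing (assumed)."""
    vocab = {word for c in counts.values() for word in c}
    scores = {}
    for folder, c in counts.items():
        total = sum(c.values())
        scores[folder] = sum(
            math.log((c[word] + 1) / (total + len(vocab)))
            for word in text.lower().split()
        )
    return scores

# Hypothetical training data, two folders with one message each.
counts = train([
    ("spam", "cheap pills cheap offer"),
    ("friends", "lunch on friday with friends"),
])
scores = score(counts, "cheap offer today")
best = max(scores, key=scores.get)  # folder with the highest (least negative) score
```

       As in the -q output shown under OPTIONS, the scores are negative log values and
       the folder with the highest score wins.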

OPTIONS

       -b, --db-file=file
              Location to read/store ifile database.  Default is ~/.idata

       -c, --concise
              equivalent of "ifile -v 0 | head -1 | cut -f1 -d".  Must be used with -q or -Q.

       -d, --delete=folder
              Delete the statistics for each of the files from the category folder

       -f, --folder-calcs=folder
              Show the word-probability calculations for folder

       -g, --log-file
              Create and store debugging information in ~/.ifile.log

       -i, --insert=folder
              Add the statistics for each of the files to the category folder

       -k, --keep-infrequent
              Leave in the database words that occur infrequently (normally they are tossed)

       -l, --query-loocv=folder
              For  each  of  the  files, temporarily removes file from folder, performs query and
              then reinserts file in folder.  Database is not modified.

       -o, --occur
              Uses document bit-vector representation.  Count each word once per document.

       -q, --query
              Output rating scores for each of the files

       -Q, --query-insert
              For each of the files, output rating scores and add statistics for the folder  with
              the highest score

       -T, --threshold=threshold
              When used with both -c and -q, output the two highest ranking categories if their
              scores differ by at most threshold / 1000, which can be used to detect border
              cases.  When used with -q only and any threshold > 0, output the score difference
              as a percentage.  For example,
                     ifile -T1 -q foo.txt
              might result in
                     spam -15570.48640776
                     non-spam -18728.00272369
                     diff[spam,non-spam](%) 9.21
              If so, then
                     ifile -T93 -q -c foo.txt
              will result in
                     foo.txt spam,non-spam
              whereas
                     ifile -T92 -q -c foo.txt
              will result in
                     foo.txt spam

       -r, --reset-data
              Erases all currently stored information

       -u, --update=folder
              Same as 'insert' except only adds stats if folder already exists

       -v, --verbosity=num
              Amount of output while running: 0=silent, 1=quiet, 2=progress, 3=verbose, 4=debug
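
       The -T example above can be reproduced with a small calculation.  The formula below
       is an assumption inferred from the numbers in that example (it yields 9.21 for the
       two scores shown); it has not been checked against the ifile source.  The function
       names are hypothetical.

```python
def diff_percent(score_a, score_b):
    """Relative score difference in percent (formula inferred from the -T example)."""
    return abs(score_a - score_b) / abs(score_a + score_b) * 100

def concise_labels(scores, threshold):
    """Return the top folder, plus the runner-up if the two are within threshold/1000."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (top, top_score), (runner_up, runner_up_score) = ranked[0], ranked[1]
    if diff_percent(top_score, runner_up_score) / 100 <= threshold / 1000:
        return [top, runner_up]
    return [top]

# Scores copied from the -T example above.
example = {"spam": -15570.48640776, "non-spam": -18728.00272369}
```

       With these scores, a threshold of 93 (0.093) covers the 9.21% difference and both
       folders are reported, while a threshold of 92 (0.092) does not.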

       Lexing options:

       -a, --alpha-lexer
              Lex words as sequences of alphabetic characters (default)

       -A, --alpha-only-lexer
              Only lex  space-separated  character  sequences  which  are  composed  entirely  of
              alphabetic characters

       -h, --strip-header
              Skip all of the header lines except Subject:, From: and To:

       -m, --max-length=char
              Ignore  portion of message after first char characters.  Use entire message if char
              set to 0.  Default is 50,000.

       -p, --print-tokens
              Just tokenize and print, don't do any other processing.  Documents are returned  as
              a list of word, frequency pairs.

       -s, --no-stoplist
              Do not throw out overly frequent (stoplist) words when lexing

       -S, --stemming
              Use 'Porter' stemming algorithm when lexing documents

       -w, --white-lexer
              Lex words as sequences of space separated characters
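
       The difference between the three lexers can be approximated as follows.  This is a
       rough sketch based only on the one-line descriptions above; the function names are
       hypothetical, ASCII letters stand in for "alphabetic characters", and ifile's real
       lexers may differ in details such as case folding.

```python
import re

def alpha_lexer(text):
    """-a: every maximal run of alphabetic characters is a token."""
    return re.findall(r"[A-Za-z]+", text)

def alpha_only_lexer(text):
    """-A: keep only whitespace-separated chunks made entirely of letters."""
    return [chunk for chunk in text.split() if re.fullmatch(r"[A-Za-z]+", chunk)]

def white_lexer(text):
    """-w: every whitespace-separated chunk is a token."""
    return text.split()

sample = "Buy v1agra now, e-mail me"
# alpha_lexer(sample)      -> ['Buy', 'v', 'agra', 'now', 'e', 'mail', 'me']
# alpha_only_lexer(sample) -> ['Buy', 'me']
# white_lexer(sample)      -> ['Buy', 'v1agra', 'now,', 'e-mail', 'me']
```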

       If  no  files  are  specified  on  the  command line, ifile will use standard input as its
       message to process.

       -?, --help
              Give this help list

       --usage
              Give a short usage message

       -V, --version
              Print program version

       Mandatory or optional arguments to long options are also mandatory  or  optional  for  any
       corresponding short options.

FILES

       ~/.idata
              ifile  database  (default  location).   See  FAQ  included  in  ifile  package  for
              description of database format.

AUTHOR

       Jason Rennie <jrennie@csail.mit.edu> and many others.  See  the  ChangeLog  for  the  full
       list.

EXAMPLES

       Before  using ifile, you need to train it.  Let's say that you have three folders, "spam",
       "ifile" and "friends", and the following directory structure:

              /--+--spam----+--1
                 |          +--2
                 |          +--3
                 |
                 +--ifile---+--1
                 |          +--2
                 |          +--3
                 |
                 +--friends-+--1
                            +--2
                            +--3

       The following commands build the ifile database in ~/.idata (use the -b option to specify
       a different location for the database):

              ifile -h -i spam /spam/*
              ifile -h -i ifile /ifile/*
              ifile -h -i friends /friends/*

       The  -h  option  strips off headers besides "Subject:", "From:" and "To:".  I find that -h
       improves ifile's performance, but you may find otherwise for your personal collection.

       Note that we have made the argument to -i the same as the corresponding folder name.  This
       is not necessary.  The argument to -i can be any word you want to use to identify a
       category of e-mails.  The argument to -i must not include whitespace characters (space,
       tab, line feed, etc.).

       At this point, your ~/.idata file should look something like this:

              spam ifile friends
              662 1020 6451
              3 3 3
              jrennie 9 0:3 1:18 2:16
              mindspring 6 1:7 2:5
              make 9 0:5 1:3
              yahoo 9 0:1 1:22 2:2

       The  first  line  is  the  space-separated  list  of  folders.  Their ordering specifies a
       numbering (spam=0, ifile=1, friends=2). The second line is a token count for  each  folder
       (e.g.  662  tokens observed in the three spam messages). The third line is an e-mail count
       for each folder (e.g. 3 e-mails for each of spam, ifile and friends). Each following  line
       specifies statistics for a word. The format of a line is

              word age folder:count [folder:count ...]

       where  folder  is  the folder number determined by the first line ordering. Folders with a
       count of zero are not listed.  So,  the  line  beginning  with  "jrennie"  indicates  that
       "jrennie"  appeared 3 times in "spam" e-mails, 18 times in "ifile" e-mails and 16 times in
       "friends" e-mails. The age is the number of e-mails that have  been  processed  since  the
       word was added to the database. Very infrequent words are pruned from the database to keep
       the database size down.
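
       A short parser makes the layout described above concrete.  This is a sketch that
       handles only the format as shown in this example; parse_idata and the shape of its
       return value are hypothetical, and a real database may contain details not covered
       here (see the FAQ mentioned under FILES).

```python
def parse_idata(text):
    """Parse the ~/.idata layout shown above into plain Python structures."""
    lines = [line.split() for line in text.strip().splitlines()]
    folders = lines[0]                                      # line 1: folder names
    token_counts = dict(zip(folders, map(int, lines[1])))   # line 2: tokens per folder
    email_counts = dict(zip(folders, map(int, lines[2])))   # line 3: e-mails per folder
    words = {}
    for word, age, *pairs in lines[3:]:                     # word age folder:count ...
        per_folder = {}
        for pair in pairs:
            index, count = pair.split(":")
            per_folder[folders[int(index)]] = int(count)    # map folder number to name
        words[word] = {"age": int(age), "counts": per_folder}
    return folders, token_counts, email_counts, words

# The sample database from above.
SAMPLE = """\
spam ifile friends
662 1020 6451
3 3 3
jrennie 9 0:3 1:18 2:16
mindspring 6 1:7 2:5
make 9 0:5 1:3
yahoo 9 0:1 1:22 2:2
"""

folders, token_counts, email_counts, words = parse_idata(SAMPLE)
# words["jrennie"]["counts"] -> {'spam': 3, 'ifile': 18, 'friends': 16}
```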

       Now that you have a database, you might want to filter some  e-mails.  Say  you  have  the
       following incoming e-mails:

              /--inbox--+--1
                        +--2
                        +--3

       To find out what folders ifile thinks these e-mails belong in, run

              ifile -c -q /inbox/1
              ifile -c -q /inbox/2
              ifile -c -q /inbox/3

       Let's say that 1 is about ifile, 2 is spam and 3 is from a friend. Assuming ifile does its
       job correctly, you'll see output like this:

              /inbox/1 ifile
              /inbox/2 spam
              /inbox/3 friends

       With so little training data, ifile is unlikely to get the labels correct, but you
       should get the idea :-)
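
       If you want to act on this output from a script, the concise lines are easy to
       split.  The sketch below assumes the "file folder" layout shown above, plus the
       comma-separated folder pair that -T can produce; parse_concise is a hypothetical
       helper, not part of ifile.

```python
def parse_concise(output):
    """Map each file name in `ifile -c -q` output to its list of suggested folders."""
    result = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Folder names contain no whitespace, so the folder field is the last one.
        path, labels = line.rsplit(None, 1)
        result[path] = labels.split(",")
    return result

suggestions = parse_concise("/inbox/1 ifile\n/inbox/2 spam\n/inbox/3 friends\n")
# suggestions["/inbox/2"] -> ['spam']
```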

       Now,  if you move the e-mails to the folders suggested by ifile, you'll want to update the
       database accordingly. You can do this with the -i option, like before. Or, you can  simply
       use  -Q  in  place  of  -q  above.  This automatically adds the e-mail to the folder ifile
       suggests.

       Now, assume for a moment that e-mail 1 was actually spam.  We've added its statistics to
       the "ifile" category and put it in the ifile folder.  We need to move it to the spam
       folder and update the ifile database accordingly.  We can update the database with the
       following command:

              ifile -d ifile -i spam /inbox/1

       This deletes the e-mail from "ifile" and adds it to "spam".

SEE ALSO

       Examples of how to use ifile together with procmail(1) and metamail(1) can be found in the
       directory /usr/share/doc/ifile/examples.