Provided by: autoclass_3.3.6.dfsg.2-1_amd64

NAME

       autoclass - automatically discover classes in data

SYNOPSIS

       autoclass -search data_file header_file model_file s_param_file
       autoclass -report results_file search_file r_params_file
       autoclass -predict results_file search_file results_file

DESCRIPTION

       AutoClass  solves  the problem of automatic discovery of classes in data (sometimes called
       clustering,  or  unsupervised  learning),  as  distinct  from  the  generation  of   class
       descriptions  from labeled examples (called supervised learning).  It aims to discover the
       "natural" classes in the data.  AutoClass is applicable to observations of things that can
       be  described  by a set of attributes, without referring to other things.  The data values
       corresponding to each attribute are limited to be either numbers  or  the  elements  of  a
       fixed set of symbols.  With numeric data, a measurement error must be provided.

       AutoClass  is  looking  for  the  best  classification(s)  of  the  data  it  can find.  A
       classification is composed of:

       1)     A set of classes, each of which is described by a set of  class  parameters,  which
              specify  how  the  class is distributed along the various attributes.  For example,
              "height normally distributed with mean 4.67 ft and standard deviation .32 ft",

       2)     A set of class weights, describing what percentage of cases are  likely  to  be  in
              each class.

       3)     A  probabilistic  assignment  of cases in the data to these classes.  I.e. for each
              case, the relative probability that it is a member of each class.

       As a strictly Bayesian system (accept no substitutes!), the quality measure AutoClass uses
       is  the  total  probability that, had you known nothing about your data or its domain, you
       would have found this set of data generated by this underlying model.  This  includes  the
       prior  probability  that the "world" would have chosen this number of classes, this set of
       relative class weights, and this set of parameters for each class, and the likelihood that
       such  a  set  of classes would have generated this set of values for the attributes in the
       data cases.

       These probabilities are typically very small, in the range of e^-30000, and so are usually
       expressed in exponential notation.

       When  run with the -search command, AutoClass searches for a classification.  The required
       arguments are the paths to the four input files, which supply the data, the  data  format,
       the desired classification model, and the search parameters, respectively.

       By  default,  AutoClass  writes  intermediate  results in a binary file.  With the -report
       command, AutoClass generates an ASCII report.  The arguments are the full  path  names  of
       the .results, .search, and .r-params files.

       When  run  with  the -predict command, AutoClass predicts the class membership of a "test"
       data set based on classes found in a "training" data set (see "PREDICTIONS" below).

INPUT FILES

       An AutoClass data set resides in two files.  There is a header file (file type "hd2") that
       describes  the specific data format and attribute definitions.  The actual data values are
       in a data file (file type "db2").  We use two files to allow editing of data  descriptions
       without  having  to  deal with the entire data set.  This makes it easy to experiment with
       different descriptions  of  the  database  without  having  to  reproduce  the  data  set.
       Internally,  an  AutoClass  database structure is identified by its header and data files,
       and the number of data loaded.

       For   more   detailed   information   on    the    formats    of    these    files,    see
       /usr/share/doc/autoclass/preparation-c.text.

   DATA FILE
       The data file contains a sequence of data objects (datum or case) terminated by the end of
       the file. The number of values for each data  object  must  be  equal  to  the  number  of
       attributes defined in the header file.  Data objects must be groups of tokens delimited by
       "new-line".  Attributes are typed as REAL, DISCRETE, or DUMMY.  Real attribute values  are
       numbers,  either  integer  or  floating  point.  Discrete attribute values can be strings,
       symbols, or integers.  A dummy attribute value can be any of these types.  Dummies are read
       in  but otherwise ignored -- they will be set to zeros in the internal database.  Thus
       the actual values will not be available for use in report output.  To have these attribute
       values  available,  use  either type REAL or type DISCRETE, and define their model type as
       IGNORE in the .model file.  Missing values for any attribute type may  be  represented  by
       either  "?" or another token specified in the header file.  All are translated to a special
       unique value after being read, so this symbol is effectively reserved for  unknown/missing
       values.

       For example:
             white       38.991306 0.54248405  2 2 1
             red         25.254923 0.5010235   9 2 1
             yellow      32.407973 ?           8 2 1
             all_white   28.953982 0.5267696   0 1 1
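
       As an illustration only (this is not AutoClass code), the following sketch reads
       lines such as those above, checks the number of values per case against the
       attribute count, and maps the unknown token to None.  The function names and the
       handling of comment lines are assumptions made for this example.

```python
def parse_db2_line(line, separator=None, unknown_token="?", comment_char=";"):
    """Split one data-file line into tokens; unknown values become None."""
    stripped = line.strip()
    if not stripped or stripped.startswith(comment_char):
        return None  # blank or comment line (assumed convention): no case here
    # separator=None makes str.split() break on runs of whitespace.
    tokens = [token.strip() for token in stripped.split(separator)]
    return [None if token == unknown_token else token for token in tokens]


def read_cases(lines, n_attributes, **kwargs):
    """Return the parsed cases, checking each against the attribute count."""
    cases = []
    for number, line in enumerate(lines, start=1):
        case = parse_db2_line(line, **kwargs)
        if case is None:
            continue
        if len(case) != n_attributes:
            raise ValueError(
                f"line {number}: expected {n_attributes} values, got {len(case)}"
            )
        cases.append(case)
    return cases


sample = """\
white       38.991306 0.54248405  2 2 1
red         25.254923 0.5010235   9 2 1
yellow      32.407973 ?           8 2 1
all_white   28.953982 0.5267696   0 1 1
"""
cases = read_cases(sample.splitlines(), n_attributes=6)
print(cases[2])  # the third case has None in place of "?"
```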

   HEADER FILE
       The  header  file  specifies  the  data  file  format,  and  the  definitions  of the data
       attributes.  The header file functional specification consists of two parts  --  the  data
       set  format  definition  specifications,  and  the  attribute descriptors. ";" in column 1
       identifies a comment.

       A header file follows this general format:

           ;; num_db2_format_defs value (number of format def lines
           ;; that follow), range of n is 1 -> 4
           num_db2_format_defs n
           ;; number_of_attributes token and value required
           number_of_attributes <as required>
           ;; following are optional - default values are specified
           separator_char  ' '
           comment_char    ';'
           unknown_token   '?'
           separator_char  ','

           ;; attribute descriptors
           ;; <zero-based att#>  <att_type>  <att_sub_type>  <att_description>
           ;; <att_param_pairs>

       Each attribute descriptor is a line of:

             Attribute index (zero based, beginning in column 1)
             Attribute type.  See below.
             Attribute subtype.  See below.
             Attribute description: symbol (no embedded blanks) or
                   string; <= 40 characters
             Specific property and value pairs.
                   Currently available combinations:

                type           subtype         property type(s)
                ----           --------        ---------------
                dummy          none/nil        --
                discrete       nominal         range
                real           location        error
                real           scalar          zero_point rel_error

       The ERROR property should represent your best estimate of the average  error  expected  in
       the  measurement  and  recording  of that real attribute.  Lacking better information, the
       error can be taken as 1/2 the minimum possible difference between measured values.  It can
       be  argued  that real values are often truncated, so that smaller errors may be justified,
       particularly for generated data.  But AutoClass only sees  the  recorded  values.   So  it
       needs the error in the recorded values, rather than the actual measurement error.  Setting
       this error much smaller than the minimum expressible difference implies the possibility of
       values  that cannot be expressed in the data.  Worse, it implies that two identical values
       must represent measurements that were much closer than  they  might  actually  have  been.
       This leads to over-fitting of the classification.
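
       A rough sketch of the "1/2 the minimum possible difference" rule follows.  It
       assumes that the smallest gap between distinct recorded values is a usable stand-in
       for the recording resolution, which only holds when the data contain neighbouring
       values; the function name is hypothetical.

```python
def default_error(values):
    """Half the smallest gap between distinct recorded values."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    min_gap = min(b - a for a, b in zip(distinct, distinct[1:]))
    return min_gap / 2.0


# Values recorded to one decimal place: the estimate is close to 0.05.
recorded = [25.3, 25.1, 25.6, 25.3, 25.2]
print(default_error(recorded))
```

       If the resolution of the recording process is known, using it directly is the safer
       choice, since the observed minimum gap can only overestimate it.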

       The  REL_ERROR  property  is  used  for SCALAR reals when the error is proportional to the
       measured value.  The ERROR property is not supported for SCALAR reals.

       AutoClass uses the error as a lower bound on the width of  the  normal  distribution.   So
       small  error  estimates  tend  to  give  narrower peaks and to increase both the number of
       classes and the classification probability.  Broad  error  estimates  tend  to  limit  the
       number of classes.

       The  scalar  ZERO_POINT  property is the smallest value that the measurement process could
       have produced.  This is often 0.0, or less by some error range.   Similarly,  the  bounded
       real's min and max properties are exclusive bounds on the attribute's generating process.
       For a calculated percentage these would be 0-e and 100+e, where e is an error value.   The
       discrete  attribute's  range  is  the number of possible values the attribute can take on.
       This range must include unknown as a value when such values occur.

       Header File Example:

       !#; AutoClass C header file -- extension .hd2
       !#; the following chars in column 1 make the line a comment:
       !#; '!', '#', ';', ' ', and '\n' (empty line)

       ;#! num_db2_format_defs <num of def lines -- min 1, max 4>
       num_db2_format_defs 2
       ;; required
       number_of_attributes 7
       ;; optional - default values are specified
       ;; separator_char  ' '
       ;; comment_char    ';'
       ;; unknown_token   '?'
       separator_char     ','

       ;; <zero-based att#>  <att_type>  <att_sub_type>  <att_description>
       ;; <att_param_pairs>
       0 dummy nil       "True class, range = 1 - 3"
       1 real location "X location, m. in range of 25.0 - 40.0" error .25
       2 real location "Y location, m. in range of 0.5 - 0.7" error .05
       3 real scalar   "Weight, kg. in range of 5.0 - 10.0" zero_point 0.0 rel_error .001
       4 discrete nominal  "Truth value, range = 1 - 2" range 2
       5 discrete nominal  "Color of foobar, 10 values" range 10
       6 discrete nominal  Spectral_color_group range 6

   MODEL FILE
       A classification of a data set is made with respect to a model which specifies the form of
       the  probability  distribution  function for classes in that data set.  Normally the model
       structure is defined in a model file (file type "model"), containing one or  more  models.
       Internally,  a  model is defined relative to a particular database.  Thus it is identified
       by the corresponding database, the model's model file and its sequential position  in  the
       file.

       Each  model  is  specified  by one or more model group definition lines.  Each model group
       line associates attribute indices with a model term type.

       Here is an example model file:

       # AutoClass C model file -- extension .model
       model_index 0 7
       ignore 0
       single_normal_cn 3
       single_normal_cn 17 18 21
       multi_normal_cn 1 2
       multi_normal_cn 8 9 10
       multi_normal_cn 11 12 13
       single_multinomial default

       Here, the first line is a comment.  The following characters in column 1 make the  line  a
       comment: `!', `#', ` ', `;', and `\n' (empty line).

       The  tokens  "model_index  n m" must appear on the first non-comment line, and precede the
       model term definition lines. n is the zero-based model index, typically 0 where  there  is
       only  one  model  --  the  majority  of  search situations.  m is the number of model term
       definition lines that follow.

       The last seven lines are model group lines.  Each model group line consists of:

       A model term type (one of single_multinomial, single_normal_cm, single_normal_cn,
           multi_normal_cn, or ignore).

       A list of attribute indices (the attribute set list), or the symbol default.  Attribute
           indices are zero-based.  Single model terms may have one or more attribute indices on
           each line, while multi model terms require two or more attribute indices per line.  An
           attribute index must not appear more than once in a model list.
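
       The rules above can be checked mechanically before a search is started.  The
       following sketch is not part of AutoClass; it handles a single model definition and
       its function name and messages are invented for this example.

```python
SINGLE_TERMS = {"single_multinomial", "single_normal_cm", "single_normal_cn", "ignore"}
MULTI_TERMS = {"multi_normal_cn"}
COMMENT_STARTS = ("!", "#", " ", ";")  # an empty line is also a comment


def check_model(text):
    """Return a list of problems found in one model definition."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith(COMMENT_STARTS)]
    if not lines:
        return ["no model definition found"]
    header = lines[0].split()
    if len(header) != 3 or header[0] != "model_index":
        return ["first non-comment line must be 'model_index n m'"]
    problems = []
    groups = lines[1:]
    if len(groups) != int(header[2]):
        problems.append(f"model_index announces {header[2]} lines, found {len(groups)}")
    seen = set()
    for group in groups:
        term, *atts = group.split()
        if term not in SINGLE_TERMS | MULTI_TERMS:
            problems.append(f"unknown model term type: {term}")
            continue
        if atts == ["default"]:
            if term == "ignore":
                problems.append("ignore is not a valid default model term type")
            continue
        if not atts:
            problems.append(f"{term}: no attribute indices given")
        if term in MULTI_TERMS and len(atts) < 2:
            problems.append(f"{term}: needs two or more attribute indices")
        for att in atts:
            if att in seen:
                problems.append(f"attribute {att} appears more than once")
            seen.add(att)
    return problems


example = """\
# AutoClass C model file -- extension .model
model_index 0 7
ignore 0
single_normal_cn 3
single_normal_cn 17 18 21
multi_normal_cn 1 2
multi_normal_cn 8 9 10
multi_normal_cn 11 12 13
single_multinomial default
"""
print(check_model(example))  # the example file passes: no problems reported
```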

       Notes:

       1)     At least one model definition is required (model_index token).

       2)     There may be multiple entries in a model for any model term type.

       3)     Model term types currently consist of:

              single_multinomial
                     models discrete attributes as multinomials, with missing values.

              single_normal_cn
                     models real valued attributes as normals; no missing values.

              single_normal_cm
                     models real valued attributes with missing values.

              multi_normal_cn
                     is a covariant normal model without missing values.

              ignore allows the model to ignore one or more attributes.  ignore is  not  a  valid
                     default model term type.

              See the documentation in models-c.text for further information about specific model
              terms.

       4)     Single_normal_cn, single_normal_cm, and multi_normal_cn modeled data, whose subtype
              is  scalar  (value  distribution  is  away  from  0.0,  and  is thus not a "normal"
              distribution) will be log transformed and modeled with the log-normal  model.   For
              data  whose subtype is location (value distribution is around 0.0), no transform is
              done, and the normal model is used.

SEARCHING

       AutoClass, when invoked in the "search" mode will check the validity of the set  of  data,
       header, model, and search parameter files.  Errors will stop the search from starting, and
       warnings will ask the user whether to continue.   A  history  of  the  error  and  warning
       messages is saved, by default, in the log file.

       Once  you  have  succeeded  in describing your data with a header file and model file that
       pass the AUTOCLASS -SEARCH <...> input checks, you will have entered the  search  domain
       where AutoClass classifies your data.  (At last!)

       The  main  function  to  use  in  finding  a good classification of your data is AUTOCLASS
       -SEARCH, and using it will take most of the computation time.  Searches are invoked with:

       autoclass -search <.db2 file path> <.hd2 file path>
            <.model file path> <.s-params file path>

       All files must be specified as fully qualified relative or absolute pathnames.  File  name
       extensions  (file  types)  for  all  files  are forced to canonical values required by the
       AutoClass program:

               data file   ("ascii")   db2
               data file   ("binary")  db2-bin
               header file             hd2
               model file              model
               search params file      s-params
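
       For scripted runs, the argument list can be assembled and checked against the
       extensions above before AutoClass is started.  This is only a sketch: the function
       name and the "demo" file names are hypothetical, and the extension check merely
       mirrors the table above.

```python
from pathlib import Path

# Canonical extensions from the table above, in command-line order.
CANONICAL = ((".db2", ".db2-bin"), (".hd2",), (".model",), (".s-params",))


def search_command(data, header, model, s_params, program="autoclass"):
    """Build the argument list for a search, checking file extensions."""
    paths = [Path(p) for p in (data, header, model, s_params)]
    for path, allowed in zip(paths, CANONICAL):
        if not path.name.endswith(allowed):
            raise ValueError(f"{path}: expected extension {' or '.join(allowed)}")
    return [program, "-search", *(str(p) for p in paths)]


command = search_command("demo.db2", "demo.hd2", "demo.model", "demo.s-params")
print(" ".join(command))
```

       The resulting list can be passed to subprocess.run().  Note that a search may ask
       for confirmation when it issues warnings, so an unattended run needs input files that
       pass the checks cleanly.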

       The sample-run (/usr/share/doc/autoclass/examples/) that comes with AutoClass  shows  some
       sample  searches,  and browsing these is probably the fastest way to get familiar with how
       to do searches.  The test data sets located under /usr/share/doc/autoclass/examples/  will
       show  you  some  other  header  (.hd2), model (.model), and search params (.s-params) file
       setups.  The remainder of this section describes how  to  do  searches  in  somewhat  more
       detail.

       The  bold  faced  tokens  below  are  generally  search  params file parameters.  For more
       information   on   the    s-params    file,    see    SEARCH    PARAMETERS    below,    or
       /usr/share/doc/autoclass/search-c.text.gz.

   WHAT RESULTS ARE
       AutoClass  is  looking  for  the  best  classification(s)  of  the  data  it  can find.  A
       classification is composed of:

       1)     a set of classes, each of which is described by a set of  class  parameters,  which
              specify  how  the  class is distributed along the various attributes.  For example,
              "height normally distributed with mean 4.67 ft and standard deviation .32 ft",

       2)     a set of class weights, describing what percentage of cases are  likely  to  be  in
              each class.

       3)     a  probabilistic  assignment  of cases in the data to these classes.  I.e. for each
              case, the relative probability that it is a member of each class.

       As a strictly Bayesian system (accept no substitutes!), the quality measure AutoClass uses
       is  the  total  probability that, had you known nothing about your data or its domain, you
       would have found this set of data generated by this underlying model.  This  includes  the
       prior  probability  that the "world" would have chosen this number of classes, this set of
       relative class weights, and this set of parameters for each class, and the likelihood that
       such  a  set  of classes would have generated this set of values for the attributes in the
       data cases.

       These probabilities are typically very small, in the range of e^-30000, and so are usually
       expressed in exponential notation.

   WHAT RESULTS MEAN
       It  is important to remember that all of these probabilities are GIVEN that the real model
       is in the model family that AutoClass has restricted its attention to.   If  AutoClass  is
       looking  for  Gaussian  classes  and  the  real  classes  are  Poisson, then the fact that
       AutoClass found 5 Gaussian classes may not say much about how many Poisson  classes  there
       really are.

       The  relative  probability between different classifications found can be very large, like
       e^1000, so the very best classification found is usually overwhelmingly more probable than
       the  rest  (and  overwhelmingly  less  probable  than  any  better  classifications as yet
       undiscovered).  If AutoClass should manage to find two  classifications  that  are  within
       about exp(5-10) of each other (i.e. one is only about 150 to 22,000 times more probable
       than the other), then you should consider them to be about equally probable, as our
       computation is usually not more accurate than this (and sometimes much less).
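
       Because probabilities near e^-30000 underflow to zero in floating point, such
       comparisons are best made on the logarithms.  The sketch below assumes natural-log
       probabilities as input; the function name and the default tolerance of 10 (the upper
       end of the range quoted above) are choices made for this example.

```python
import math


def compare(log_p_a, log_p_b, tolerance=10.0):
    """Compare two classifications given their natural-log probabilities."""
    diff = log_p_a - log_p_b
    if abs(diff) <= tolerance:
        return "about equally probable"
    return "first is more probable" if diff > 0 else "second is more probable"


print(compare(-30000.0, -30004.0))  # difference of 4: within the tolerance
print(compare(-30000.0, -31000.0))  # difference of 1000: a clear winner
print(math.exp(-30000.0))           # the raw probability underflows to 0.0
```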

   HOW IT WORKS
       AutoClass repeatedly creates a random classification and then tries to massage this into a
       high probability classification through local changes, until it converges to some "local
       maximum".   It  then  remembers  what it found and starts over again, continuing until you
       tell it to stop.  Each effort is called a "try", and the computed probability is  intended
       to  cover  the  whole  volume in parameter space around this maximum, rather than just the
       peak.

       The standard approach to massaging is to

       1)     Compute the probabilistic class memberships of cases using the class parameters and
              the implied relative likelihoods.

       2)     Using  the  new  class members, compute class statistics (like mean) and revise the
              class parameters.

       and repeat till they stop changing.  There are  three  available  convergence  algorithms:
       "converge_search_3"    (the   default),   "converge_search_4"   and   "converge".    Their
       specification is controlled by search params file parameter try_fn_type.
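
       The two steps above have the same shape as the textbook expectation-maximization
       loop.  The sketch below is that textbook loop for one real attribute and normal
       classes, not AutoClass's own convergence code: it uses plain maximum-likelihood
       updates without priors, stops when the class means stop moving, and assumes that no
       class loses all of its members.  The error argument acts as a lower bound on each
       class's standard deviation.  All names are invented for this example.

```python
import math


def normal_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def massage(data, means, sigmas, weights, error=0.05, tol=1e-9, max_cycles=200):
    """Alternate membership and parameter updates until the means settle."""
    for cycle in range(1, max_cycles + 1):
        # Step 1: probabilistic class memberships from the current parameters.
        memberships = []
        for x in data:
            likes = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
            total = sum(likes)
            memberships.append([like / total for like in likes])
        # Step 2: weighted class statistics give the revised parameters.
        new_means, new_sigmas, new_weights = [], [], []
        for j in range(len(means)):
            w_j = sum(row[j] for row in memberships)
            mean = sum(row[j] * x for row, x in zip(memberships, data)) / w_j
            var = sum(row[j] * (x - mean) ** 2 for row, x in zip(memberships, data)) / w_j
            new_means.append(mean)
            new_sigmas.append(max(math.sqrt(var), error))  # error bounds the width below
            new_weights.append(w_j / len(data))
        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means, sigmas, weights = new_means, new_sigmas, new_weights
        if shift < tol:
            break
    return means, sigmas, weights, cycle


data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
means, sigmas, weights, cycles = massage(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, weights, cycles)
```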

   WHEN TO STOP
       You can tell AUTOCLASS -SEARCH to stop by: 1) giving a max_duration (in seconds)  argument
       at the beginning; 2) giving a max_n_tries (an integer) argument at the beginning; or 3) by
       typing a "q" and <return>  after  you  have  seen  enough  tries.   The  max_duration  and
       max_n_tries arguments are useful if you desire to run AUTOCLASS -SEARCH in batch mode.  If
       you are restarting AUTOCLASS -SEARCH from a previous search, the value of max_n_tries  you
       provide,  for  instance  3,  will  tell the program to compute 3 more tries in addition to
       however many it  has  already  done.   The  same  incremental  behavior  is  exhibited  by
       max_duration.

       Deciding  when to stop is a judgment call and it's up to you.  Since the search includes a
       random component, there's always the chance that if you let it keep  going  it  will  find
       something  better.   So you need to trade off how much better it might be with how long it
       might take to find it.  The search status  reports  that  are  printed  when  a  new  best
       classification  is  found  are  intended  to provide you information to help you make this
       tradeoff.

       One clear sign that you should probably stop is if most of the classifications  found  are
       duplicates of previous ones (flagged by "dup" as they are found).  This should only happen
       for very small sets of data or when fixing a very small number of classes, like two.

       Our experience is that for moderately large to extremely large data sets (~200 to  ~10,000
       cases), it is necessary to run AutoClass for at least 50 trials.

   WHAT GETS RETURNED
       Just  before  returning,  AUTOCLASS  -SEARCH  will  give  short  descriptions  of the best
       classifications found.  How many will be described can be controlled with n_final_summary.

       By default AUTOCLASS -SEARCH will write out a  number  of  files,  both  at  the  end  and
       periodically  during  the  search (in case your system crashes before it finishes).  These
       files will all have the same name (taken from the search params pathname [<name>.s-params]),
       and differ only in their file extensions.  If your search runs are very  long  and
       there is a possibility that your machine may crash, you can  have  intermediate  "results"
       files  written  out.   These  can  be used to restart your search run with minimum loss of
       search effort.  See the documentation file /usr/share/doc/autoclass/checkpoint-c.text.

       A ".log" file will hold a listing of most of what was printed to  the  screen  during  the
       run,  unless  you  set  log_file_p  to  false to say you want no such foolishness.  Unless
       results_file_p is false, a binary ".results-bin" file (the default) or an ASCII ".results"
       text file, will hold the best classifications that were returned, and unless search_file_p
       is false, a ".search" file will hold  the  record  of  the  search  tries.  save_compact_p
       controls whether the "results" files are saved as binary or ASCII text.

       If the C global variable "G_safe_file_writing_p" is defined as TRUE in
       "autoclass-c/prog/globals.c", the names of "results" files (those that contain the saved
       classifications)  are  modified  internally to account for redundant file writing.  If the
       search params file name is "my_saved_clsfs" you will  see  the  following  "results"  file
       names (ignoring directories and pathnames for this example)

         save_compact_p = true --
         "my_saved_clsfs.results-bin"     - completely written file
         "my_saved_clsfs.results-tmp-bin" - partially written file, renamed
                             when complete

         save_compact_p = false --
         "my_saved_clsfs.results"    - completely written file
         "my_saved_clsfs.results-tmp"  - partially written file, renamed
                             when complete

       If checkpointing is being done, these additional names will appear:

         save_compact_p = true --
         "my_saved_clsfs.chkpt-bin"  - completely written checkpoint file
         "my_saved_clsfs.chkpt-tmp-bin" - partially written checkpoint file,
                                renamed when complete
         save_compact_p = false --
         "my_saved_clsfs.chkpt" - completely written checkpoint file
         "my_saved_clsfs.chkpt-tmp"    - partially written checkpoint file,
                                renamed when complete
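
       The naming scheme above can be summarized in a few lines.  This sketch only
       reproduces the names listed in this section from a search params path; the function
       and dictionary keys are hypothetical and nothing here is read from AutoClass itself.

```python
from pathlib import Path


def output_names(s_params_path, save_compact_p=True, checkpoint=False):
    """Derive the output file names listed above from the search params path."""
    name = Path(s_params_path).name
    stem = name[: -len(".s-params")] if name.endswith(".s-params") else name
    suffix = "-bin" if save_compact_p else ""
    names = {
        "log": f"{stem}.log",
        "search": f"{stem}.search",
        "results": f"{stem}.results{suffix}",
        "results_tmp": f"{stem}.results-tmp{suffix}",
    }
    if checkpoint:
        names["chkpt"] = f"{stem}.chkpt{suffix}"
        names["chkpt_tmp"] = f"{stem}.chkpt-tmp{suffix}"
    return names


for kind, filename in output_names("my_saved_clsfs.s-params", checkpoint=True).items():
    print(kind, filename)
```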

   HOW TO GET STARTED
       The way to invoke AUTOCLASS -SEARCH is:

       autoclass -search <.db2 file path> <.hd2 file path>
            <.model file path> <.s-params file path>

       To  restart  a previous search, specify that force_new_search_p has the value false in the
       search params file, since its default is true.  Specifying false tells  AUTOCLASS  -SEARCH
       to  try  to  find  a  previous  compatible  search (<...>.results[-bin] & <...>.search) to
       continue from, and will restart using it if found.  To  force  a  new  search  instead  of
       restarting an old one, give the parameter force_new_search_p the value of true, or use the
       default.  If there is an existing search (<...>.results[-bin] &  <...>.search),  the  user
       will be asked to confirm continuation since continuation will discard the existing search.

       If  a  previous search is continued, the message "RESTARTING SEARCH" will be given instead
       of the usual "BEGINNING SEARCH".  It is generally better to  continue  a  previous  search
       than to start a new one, unless you are trying a significantly different search method, in
       which case statistics from the previous search may mislead the current one.

   STATUS REPORTS
       A running commentary on the search will be printed to the  screen  and  to  the  log  file
       (unless  log_file_p  is  false).   Note that the ".log" file will contain a listing of all
       default search params values, and the values of all params that are overridden.

       After each try a very short report (only a few characters long) is given.  After each  new
       best  classification,  a  longer report is given, but no more often than min_report_period
       (default is 30 seconds).

   SEARCH VARIATIONS
       AUTOCLASS -SEARCH by default uses a certain  standard  search  method  or  "try  function"
       (try_fn_type  =  "converge_search_3").  Two others are also available: "converge_search_4"
       and "converge".  They are provided in case your problem is one that may happen to benefit
       from them.  In general the default method will result in finding better classifications at
       the expense of a longer search time.  The default was chosen so as to  be  robust,  giving
       even  performance  across many problems.  The alternatives to the default may do better on
       some problems, but may do substantially worse on others.

       "converge_search_3" uses an absolute stopping criterion (rel_delta_range, default value of
       0.0025)  which  tests  the  variation  of  each class of the delta of the log approximate-
       marginal-likelihood  of  the  class  statistics  with-respect-to  the   class   hypothesis
       (class->log_a_w_s_h_j)  divided  by  the  class  weight  (class->w_j)  between  successive
       convergence cycles.  Increasing this value loosens the convergence and reduces the  number
       of  cycles.   Decreasing  this  value tightens the convergence and increases the number of
       cycles. n_average (default value of 3) specifies how many successive cycles must meet  the
       stopping criterion before the trial terminates.

       "converge_search_4" uses an absolute stopping criterion (cs4_delta_range, default value of
       0.0025) which tests the variation of each class  of  the  slope  for  each  class  of  log
       approximate-marginal-likelihood   of   the  class  statistics  with-respect-to  the  class
       hypothesis  (class->log_a_w_s_h_j)  divided  by  the  class   weight   (class->w_j)   over
       sigma_beta_n_values  (default  value  6)  convergence  cycles.   Increasing  the  value of
       cs4_delta_range loosens the convergence and reduces the number of cycles.  Decreasing this
       value  tightens the convergence and increases the number of cycles.  Computationally, this
       try function is more expensive than "converge_search_3",  but  may  prove  useful  if  the
       computational  "noise"  is  significant compared to the variations in the computed values.
       Key calculations are done in double precision floating point, and  for  the  largest  data
       base  we  have  tested so far (5,420 cases of 93 attributes), computational noise has not
       been a problem, although the value of max_cycles needed to be increased to 400.

       "converge" uses one of two absolute stopping criteria which test  the  variation  of  the
       classification  (clsf) log_marginal (clsf->log_a_x_h) delta between successive convergence
       cycles.  The larger of halt_range (default value 0.5) and (halt_factor *
       current_clsf_log_marginal) is used (default value of halt_factor is 0.0001).  Increasing
       these values loosens the convergence and reduces the number of cycles.   Decreasing  these
       values  tightens  the  convergence and increases the number of cycles.  n_average (default
       value of 3) specifies how many cycles must meet the stopping  criteria  before  the  trial
       terminates.   This  is  a very approximate stopping criterion, but will give you some feel
       for the kind of classifications to expect.  It would be useful for "exploratory"  searches
       of a data base.
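
       One reading of the "converge" criterion is sketched below.  It is an interpretation
       of the description above, not the AutoClass source: it assumes the threshold is the
       larger of halt_range and halt_factor times the magnitude of the current log marginal,
       and that the last n_average cycle-to-cycle deltas must all fall below it.  The
       function name is invented.

```python
def converged(log_marginals, halt_range=0.5, halt_factor=0.0001, n_average=3):
    """True when the last n_average deltas are all below the threshold."""
    if len(log_marginals) < n_average + 1:
        return False  # not enough cycles yet to form n_average deltas
    recent = log_marginals[-(n_average + 1):]
    for previous, current in zip(recent, recent[1:]):
        # The log marginal is negative, so its magnitude is used (assumption).
        threshold = max(halt_range, halt_factor * abs(current))
        if abs(current - previous) >= threshold:
            return False
    return True


history = [-30500.0, -30100.0, -30010.0, -30004.0, -30002.0, -30001.0]
print(converged(history))                # a delta of 6 is still inside the window
print(converged(history + [-30000.5]))   # the last three deltas are 2, 1 and 0.5
```

       With log marginals near -30000 the halt_factor term (about 3) dominates the default
       halt_range of 0.5, which illustrates why this criterion is described as very
       approximate.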

       The  purpose  of reconverge_type = "chkpt" is to complete an interrupted classification by
       continuing from its last checkpoint.  The purpose of reconverge_type  =  "results"  is  to
       attempt further refinement of the best completed classification using a different value of
       try_fn_type ("converge_search_3", "converge_search_4",  "converge").   If  max_n_tries  is
       greater  than  1, then in each case, after the reconvergence has completed, AutoClass will
       perform further search trials based on the parameter values in the <...>.s-params file.

       With the use of reconverge_type (default value ""), you  may  apply  more  than  one  try
       function  to  a  classification.   Say  you  generate  several  exploratory  trials  using
       try_fn_type = "converge", and quit the search saving  .search  and  .results[-bin]  files.
       Then  you can begin another search with try_fn_type = "converge_search_3", reconverge_type
       = "results", and max_n_tries = 1.  This will result in the further convergence of the best
       classification   generated   with   try_fn_type   =   "converge",   with   try_fn_type   =
       "converge_search_3".   When  AutoClass  completes  this  search  try,  you  will  have  an
       additional refined classification.

       A good way to verify that any of the alternate try_fn_type values is generating a well
       converged classification is to run AutoClass in prediction mode on the same data used  for
       generating  the classification.  Then generate and compare the corresponding case or class
       cross  reference  files  for  the  original  classification  and  the  prediction.   Small
       differences  between  these  files  are  to  be expected, while large differences indicate
       incomplete convergence.  Differences between such file pairs should, on average and modulo
       class deletions, decrease monotonically with further convergence.

       The  standard  way  to  create  a random classification to begin a try is with the default
       value of "random" for start_fn_type.  There are currently no other random start functions.
       Specifying "block" for start_fn_type instead produces repeatable non-random searches; the
       <..>.s-params files in the autoclass-c/data/.. sub-directories are specified this way, and
       this is how development testing is done.

       max_cycles controls the maximum number of convergence cycles that will be performed in any
       one trial by the convergence functions.  Its default value  is  200.   The  screen  output
       shows a period (".") for each cycle completed.  If your search trials run for 200 cycles,
       then either your database is very complex (increase the value), or the try_fn_type is not
       adequate for the situation (try another of the available ones, and use converge_print_p to
       get more information on what is going on).

       Specifying converge_print_p to be true will generate a  brief  print-out  for  each  cycle
       which   will   provide   information  so  that  you  can  modify  the  default  values  of
       rel_delta_range & n_average for "converge_search_3"; cs4_delta_range & sigma_beta_n_values
       for "converge_search_4"; and halt_range, halt_factor, and n_average for "converge".  Their
       default values are given in the  <..>.s-params  files  in  the  autoclass-c/data/..   sub-
       directories.

   HOW MANY CLASSES?
       Each new try begins with a certain number of classes and may end up with a smaller number,
       as some classes may drop out of the convergence.  In general, you want to  begin  the  try
       with  some  number  of  classes that previous tries have indicated look promising, and you
       want to be sure you are fishing around elsewhere in case you missed something before.

       n_classes_fn_type = "random_ln_normal" is the default way to make this choice.  It fits  a
       log  normal  to  the  number  of  classes  (usually  called  "j" for short) of the 10 best
       classifications found so far, and randomly selects  from  that.   There  is  currently  no
       alternative.
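
       The selection rule described above can be illustrated with a short Python sketch.  It is
       not AutoClass's own code: the function name choose_n_classes, the rounding, and the floor
       of one class are assumptions made for illustration.

```python
import math
import random
import statistics

def choose_n_classes(best_js, rng=random):
    """Draw a starting class count from a log-normal fitted to best_js.

    best_js: class counts (j) of the best classifications found so far.
    Hypothetical helper; AutoClass's actual sampling details may differ.
    """
    logs = [math.log(j) for j in best_js]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs)
    if sigma == 0.0:
        # All stored classifications agree on j: nothing to sample.
        return max(1, round(math.exp(mu)))
    return max(1, round(rng.lognormvariate(mu, sigma)))

# Made-up j values for the stored classifications.
print(choose_n_classes([5, 7, 10, 7, 5, 8, 6, 7, 9, 7]))
```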

       To start the game off, the default is to go down start_j_list for the first few tries, and
       then switch to n_classes_fn_type.  If you believe that the probable number of classes in
       your database is, say, 75, then instead of using the default value of start_j_list (2, 3,
       5, 7, 10, 15, 25), specify something like 50, 60, 70, 80, 90, 100.

       If one wants to always look for, say, three classes, one can use fixed_j and override  the
       above.  Search status reports will describe what the current method for choosing j is.

   DO I HAVE ENOUGH MEMORY AND DISK SPACE?
       Internally, the storage requirements in the current system are of order n_classes_per_clsf
       * (n_data + n_stored_clsfs * n_attributes *  n_attribute_values).   This  depends  on  the
       number  of  cases,  the  number  of  attributes, the values per attribute (use 2 if a real
       value), and the number of classifications stored away for comparison to see if others  are
       duplicates -- controlled by max_n_store (default value = 10).  The search process does not
       itself consume significant memory, but storage of the results may do so.
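
       As a worked instance of the storage formula, the Python sketch below evaluates it for a
       made-up run.  The result is a relative count of stored items, not a byte count, and the
       function name storage_order is hypothetical.

```python
def storage_order(n_classes_per_clsf, n_data, n_stored_clsfs,
                  n_attributes, n_attribute_values):
    """Order-of-magnitude storage count from the formula in the text."""
    return n_classes_per_clsf * (
        n_data + n_stored_clsfs * n_attributes * n_attribute_values)

# Hypothetical run: 10 classes, 5000 cases, max_n_store = 10,
# 20 real-valued attributes (2 values each).
print(storage_order(10, 5000, 10, 20, 2))  # 10 * (5000 + 400) = 54000
```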

       AutoClass C is configured to handle a maximum of 999 attributes.  If you  attempt  to  run
       with  more  than  that  you  will  get array bound violations.  In that case, change these
       configuration parameters in prog/autoclass.h and recompile AutoClass C:

       #define ALL_ATTRIBUTES                  999
       #define VERY_LONG_STRING_LENGTH         20000
       #define VERY_LONG_TOKEN_LENGTH          500

       For example, these values will handle several thousand attributes:

       #define ALL_ATTRIBUTES                  9999
       #define VERY_LONG_STRING_LENGTH         50000
       #define VERY_LONG_TOKEN_LENGTH          50000

       Disk space taken up by the "log" file will of course depend on the duration of the search.
       n_save  (default  value  =  2) determines how many best classifications are saved into the
       ".results[-bin]" file.  save_compact_p controls whether  the  "results"  and  "checkpoint"
       files  are  saved  as  binary.   Binary  files  are  faster  and more compact, but are not
       portable.  The default value of save_compact_p is true, which causes binary  files  to  be
       written.

       If  the  time  taken  to  save  the  "results"  files  is  a  problem, consider increasing
       min_save_period (default value = 1800 seconds or 30 minutes).  Files  are  saved  to  disk
       this often if there is anything different to report.

   JUST HOW SLOW IS IT?
       Compute   time   is   of   order   n_data   *   n_attributes   *  n_classes  *  n_tries  *
       converge_cycles_per_try.  The major uncertainties in this are the number of basic back and
       forth cycles till convergence in each try, and of course the number of tries.  The number
       of cycles per trial is typically 10-100 for try_fn_type "converge", and 10-200+ for
       "converge_search_3" and "converge_search_4".  The maximum number of cycles is specified by
       max_cycles (default value = 200).  The number of trials is up to you and your available
       computing resources.

       The  running  time  of very large data sets will be quite uncertain.  We advise that a few
       small scale test runs be made on your system to determine a baseline.  Specify  n_data  to
       limit  how many data vectors are read.  Given a very large quantity of data, AutoClass may
       find its most probable classifications at upwards of a  hundred  classes,  and  this  will
       require  that  start_j_list  be  specified  appropriately  (See  above  section  HOW  MANY
       CLASSES?).  If you are quite certain that you only want  a  few  classes,  you  can  force
       AutoClass  to  search  with a fixed number of classes specified by fixed_j.  You will then
       need to run separate searches with each different fixed number of classes.
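
       The scaling relation at the start of this section, combined with a timed small scale
       test run, gives a rough extrapolation.  The Python sketch below shows the arithmetic; the
       function names and all numbers are made up, and real run times may deviate because the
       number of cycles per try is itself uncertain.

```python
def compute_order(n_data, n_attributes, n_classes, n_tries, cycles_per_try):
    """Relative compute cost from the order-of-magnitude formula."""
    return n_data * n_attributes * n_classes * n_tries * cycles_per_try

def estimate_seconds(baseline_seconds, baseline, target):
    """Scale a measured baseline run time to a larger planned run."""
    return baseline_seconds * compute_order(**target) / compute_order(**baseline)

# Hypothetical baseline: a 60 second test run limited with n_data = 1000.
small = dict(n_data=1000, n_attributes=10, n_classes=5,
             n_tries=10, cycles_per_try=50)
# Planned run: 50 times the data and 10 times the tries.
large = dict(small, n_data=50000, n_tries=100)
print(estimate_seconds(60.0, small, large))  # 60 * 50 * 10 = 30000.0
```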

   CHANGING FILENAMES IN A SAVED CLASSIFICATION FILE
       AutoClass caches the data, header, and model file pathnames in  the  saved  classification
       structure  of  the  binary (".results-bin") or ASCII (".results") "results" files.  If the
       "results" and "search" files are moved to  a  different  directory  location,  the  search
       cannot  be  successfully  restarted  if  you  have  used  absolute  pathnames.  Thus it is
       advantageous to invoke AutoClass in a parent directory of the data, header, and model
       files,  so  that  relative pathnames can be used.  Since the pathnames cached will then be
       relative, the files can be moved to a different host  or  file  system  and  restarted  --
       providing the same relative pathname hierarchy exists.

       However,  since the ".results" file is ASCII text, those pathnames could be changed with a
       text editor (save_compact_p must be specified as false).

   SEARCH PARAMETERS
       The search is controlled by the ".s-params" file.  In this file, an empty line or  a  line
       starting  with  one  of  these  characters is treated as a comment: "#", "!", or ";".  The
       parameter name and its value can be separated by an equal sign, a space, or a tab:

            n_clsfs 1
            n_clsfs = 1
            n_clsfs<tab>1

       Spaces are ignored if "=" or "<tab>" are used as separators.  Note there are  no  trailing
       semicolons.
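
       The syntax rules above are simple enough to sketch as a reader.  The Python below is an
       illustration only, not the parser AutoClass uses: values are kept as raw strings, and
       ignoring leading whitespace before the comment test is an assumption.

```python
def parse_params(text):
    """Parse ".s-params"-style text into a dict of raw string values.

    Sketch only: list values and quoted strings are not interpreted.
    """
    params = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Empty lines and lines starting with #, ! or ; are comments.
        if not stripped or stripped[0] in "#!;":
            continue
        if "=" in stripped:
            name, value = stripped.split("=", 1)
        else:
            # No equal sign: split on the first run of spaces or tabs.
            parts = stripped.split(None, 1)
            name = parts[0]
            value = parts[1] if len(parts) > 1 else ""
        params[name.strip()] = value.strip()
    return params

example = '# comment\nn_clsfs 1\nmax_n_tries = 50\nstart_fn_type\t"block"\n'
print(parse_params(example))
```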

       The search parameters, with their default values, are as follows:

       rel_error = 0.01
              Specifies  the  relative  difference measure used by clsf-DS-%=, when deciding if a
              new clsf is a duplicate of an old one.

       start_j_list = 2, 3, 5, 7, 10, 15, 25
              Initially try these numbers of classes, so as not to narrow the search too quickly.
              The  state  of  this  list  is  saved in the <..>.search file and used on restarts,
              unless an override specification of start_j_list is made in the .s-params file  for
              the  restart run.  This list should bracket your expected number of classes, and by
              a wide margin!  "start_j_list = -999" specifies an  empty  list  (allowed  only  on
              restarts).

       n_classes_fn_type = "random_ln_normal"
              Once  start_j_list  is  exhausted,  AutoClass will call this function to decide how
              many classes to start with on the next try, based on the  10  best  classifications
              found so far.  Currently only "random_ln_normal" is available.

       fixed_j = 0
              When  fixed_j > 0, overrides start_j_list and n_classes_fn_type, and AutoClass will
              always use this value for the initial number of classes.

       min_report_period = 30
              Wait at least this time (in seconds) since last report  until  reporting  verbosely
              again.   Should  be  set  longer  than  the  expected  run  time  when checking for
              repeatability of results.  For repeatable  results,  also  see  force_new_search_p,
              start_fn_type  and  randomize_random_p.  NOTE:  At  least  one  of "interactive_p",
              "max_duration", and "max_n_tries" must be active.   Otherwise  AutoClass  will  run
              indefinitely.  See below.

       interactive_p = true
              When  false,  allows  run  to continue until otherwise halted.  When true, standard
              input is queried on each cycle for the quit character "q",  which,  when  detected,
              triggers an immediate halt.

       max_duration = 0
              When  =  0, allows run to continue until otherwise halted.  When > 0, specifies the
              maximum number of seconds to run.

       max_n_tries = 0
              When = 0, allows run to continue until otherwise halted.  When > 0,  specifies  the
              maximum number of tries to make.

       n_save = 2
              Save this many clsfs to disk in the .results[-bin] and .search files.  If 0, don't
              save anything (no .search & .results[-bin] files).

       log_file_p = true
              If false, do not write a log file.

       search_file_p = true
              If false, do not write a search file.

       results_file_p = true
              If false, do not write a results file.

       min_save_period = 1800
              CPU crash protection.  This specifies the maximum time, in seconds, that  AutoClass
              will  run  before  it  saves  the  current results to disk.  The default time is 30
              minutes.

       max_n_store = 10
              Specifies the maximum number of classifications stored internally.

       n_final_summary = 10
              Specifies the number of trials to be printed out after search ends.

       start_fn_type = "random"
              One of {"random", "block"}.  This specifies the type of class initialization.   For
              normal  search,  use "random", which randomly selects instances to be initial class
              means, and adds appropriate variances. For  testing  with  repeatable  search,  use
              "block",  which  partitions the database into successive blocks of near equal size.
              For  repeatable  results,  also  see  force_new_search_p,  min_report_period,   and
              randomize_random_p.

       try_fn_type = "converge_search_3"
              One  of  {"converge_search_3",  "converge_search_4",  "converge"}.   These  specify
              alternate search stopping criteria.  "converge" merely tests the rate of change  of
              the  log_marginal  classification  probability  (clsf->log_a_x_h), without checking
              rate of change of individual classes (see halt_range and halt_factor).
              "converge_search_3"    and    "converge_search_4"    each    monitor    the   ratio
              class->log_a_w_s_h_j/class->w_j for all classes, and continue convergence until all
              pass  the  quiescence  criteria  for  n_average  cycles.  "converge_search_3" tests
              differences between successive  convergence  cycles  (see  rel_delta_range).   This
              provides a reasonable, general purpose stopping criterion.  "converge_search_4"
              averages the ratio over "sigma_beta_n_values" cycles (see  cs4_delta_range).   This
              is preferred when converge_search_3 produces many similar classes.

       initial_cycles_p = true
              If  true,  perform  base_cycle  in  initialize_parameters.   false is used only for
              testing.

       save_compact_p = true
              true saves classifications as machine dependent binary (.results-bin & .chkpt-bin).
              false saves as ascii text (.results & .chkpt).

       read_compact_p = true
              true reads classifications as machine dependent binary (.results-bin & .chkpt-bin).
              false reads as ascii text (.results & .chkpt).

       randomize_random_p = true
              false seeds lrand48, the pseudo-random number function with 1  to  give  repeatable
              test  cases.   true  uses  universal  time  clock  as  the seed, giving semi-random
              searches.  For repeatable results, also see  force_new_search_p,  min_report_period
              and start_fn_type.

       n_data = 0
              With n_data = 0, the entire database is read from .db2.  With n_data > 0, only this
              number of data are read.

       halt_range = 0.5
              Passed to try_fn_type "converge".  With the "converge" try_fn_type, convergence  is
              halted  when  the  larger  of  halt_range  and (halt_factor * current_log_marginal)
              exceeds the difference  between  successive  cycle  values  of  the  classification
              log_marginal  (clsf->log_a_x_h).  Decreasing this value may tighten the convergence
              and increase the number of cycles.

       halt_factor = 0.0001
              Passed to try_fn_type "converge".  With the "converge" try_fn_type, convergence  is
              halted  when  the  larger  of  halt_range  and (halt_factor * current_log_marginal)
              exceeds the difference  between  successive  cycle  values  of  the  classification
              log_marginal  (clsf->log_a_x_h).  Decreasing this value may tighten the convergence
              and increase the number of cycles.

       rel_delta_range = 0.0025
              Passed to try function "converge_search_3", which monitors the ratio of log approx-
              marginal-likelihood of class statistics with-respect-to the class hypothesis
              (class->log_a_w_s_h_j) divided by the class weight (class->w_j), for each class.
              "converge_search_3" halts convergence when the difference of this ratio between
              successive cycles has stayed below "rel_delta_range" for every class for "n_average"
              cycles.  Decreasing "rel_delta_range" tightens the convergence and increases the
              number of cycles.

       cs4_delta_range = 0.0025
              Passed  to  try  function  "converge_search_4",  which  monitors   the   ratio   of
              (class->log_a_w_s_h_j)/(class->w_j),     for     each    class,    averaged    over
              "sigma_beta_n_values" convergence cycles.   "converge_search_4"  halts  convergence
              when   the  maximum  difference  in  average  values  of  this  ratio  falls  below
              "cs4_delta_range".   Decreasing  "cs4_delta_range"  tightens  the  convergence  and
              increases the number of cycles.

       n_average = 3
              Passed  to  try functions "converge_search_3" and "converge".  The number of cycles
              for which the convergence criterion must be satisfied for the trial to terminate.

       sigma_beta_n_values = 6
              Passed to try_fn_type "converge_search_4".  The number of past  values  to  use  in
              computing sigma^2 (noise) and beta^2 (signal).

       max_cycles = 200
              This  is  the  maximum  number  of  cycles  permitted  for any one convergence of a
              classification, regardless of any other stopping criteria.  This is very  dependent
              upon  your  database  and choice of model and convergence parameters, but should be
              about twice the average number of cycles reported in the screen dump and .log file.

       converge_print_p = false
              If true, the selected try function will  print  to  the  screen  values  useful  in
              specifying non-default values for halt_range, halt_factor, rel_delta_range,
              cs4_delta_range, n_average, and sigma_beta_n_values.

       force_new_search_p = true
              If true, will ignore any previous search results, discarding the  existing  .search
              and  .results[-bin]  files  after confirmation by the user; if false, will continue
              the search using the existing .search and  .results[-bin]  files.   For  repeatable
              results, also see min_report_period, start_fn_type and randomize_random_p.

       checkpoint_p = false
              If   true,  checkpoints  of  the  current  classification  will  be  written  every
              "min_checkpoint_period" seconds, with file extension  .chkpt[-bin].  This  is  only
              useful for very large classifications.

       min_checkpoint_period = 10800
              If  checkpoint_p = true, the checkpointed classification will be written this often
              - in seconds (default = 3 hours).

       reconverge_type = ""
              Can be either "chkpt" or "results".  If "checkpoint_p" = true and "reconverge_type"
              = "chkpt", then continue convergence of the classification contained in
              <...>.chkpt[-bin].  If "checkpoint_p" = false and "reconverge_type" = "results",
              continue convergence of the best classification contained in <...>.results[-bin].

       screen_output_p = true
              If  false, no output is directed to the screen.  Assuming log_file_p = true, output
              will be directed to the log file only.

       break_on_warnings_p = true
              The default value asks the user whether or not to continue,  when  data  definition
              warnings  are  found.  If specified as false, then AutoClass will continue, despite
              warnings -- the warning will continue to be output to  the  terminal  and  the  log
              file.

       free_storage_p = true
              The  default  value  tells AutoClass to free the majority of its allocated storage.
              This is not required, and in the case of the DEC Alpha causes a core dump [is this
              still true?].  If specified as false, AutoClass will not attempt to free storage.
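
       The halting rule given above for try_fn_type "converge" (see halt_range and halt_factor)
       can be sketched in Python as follows.  Two points are assumptions rather than documented
       behaviour: absolute values are compared, since the log marginal is negative, and the test
       must hold for n_average consecutive cycles.  The function name converge_halts is
       hypothetical.

```python
def converge_halts(log_marginals, halt_range=0.5, halt_factor=0.0001,
                   n_average=3):
    """Return the cycle index at which "converge" would halt, or None.

    log_marginals: classification log marginal after each cycle.
    Assumed: absolute differences, n_average consecutive passes.
    """
    satisfied = 0
    for i in range(1, len(log_marginals)):
        delta = abs(log_marginals[i] - log_marginals[i - 1])
        threshold = max(halt_range, halt_factor * abs(log_marginals[i]))
        # Count consecutive cycles whose change is under the threshold.
        satisfied = satisfied + 1 if delta < threshold else 0
        if satisfied >= n_average:
            return i
    return None

# Made-up trace: large early changes, then three small ones in a row.
trace = [-1000.0, -900.0, -850.0, -849.9, -849.8, -849.7, -849.6]
print(converge_halts(trace))  # 5
```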

   HOW TO GET AUTOCLASS C TO PRODUCE REPEATABLE RESULTS
       In  some  situations, repeatable classifications are required: comparing basic AutoClass C
       integrity on different platforms, porting AutoClass C to a new platform, etc.  In order to
       accomplish  this  two  things  are  necessary: 1) the same random number generator must be
       used, and 2) the search parameters must be specified properly.

       Random Number Generator. This implementation of AutoClass C uses the Unix  srand48/lrand48
       random  number generator which generates pseudo-random numbers using the well-known linear
       congruential algorithm and 48-bit integer arithmetic.  lrand48() returns non-negative
       long integers uniformly distributed over the interval [0, 2**31).
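
       For readers unfamiliar with this generator, the Python sketch below models the
       srand48/lrand48 recurrence as specified by POSIX.  It is an illustration of why a fixed
       seed gives a repeatable sequence, not code taken from AutoClass; the class name Rand48 is
       hypothetical.

```python
class Rand48:
    """Pure-Python model of POSIX srand48/lrand48 (illustration only)."""

    A = 0x5DEECE66D          # multiplier
    C = 0xB                  # increment
    MASK = (1 << 48) - 1     # arithmetic is modulo 2**48

    def __init__(self, seed):
        self.srand48(seed)

    def srand48(self, seed):
        # High 32 bits come from the seed, low 16 bits are 0x330E.
        self.state = ((seed & 0xFFFFFFFF) << 16) | 0x330E

    def lrand48(self):
        self.state = (self.A * self.state + self.C) & self.MASK
        return self.state >> 17  # top 31 bits: 0 <= value < 2**31

# Two generators seeded with 1 produce the same sequence.
a, b = Rand48(1), Rand48(1)
print([a.lrand48() for _ in range(3)] == [b.lrand48() for _ in range(3)])  # True
```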

       Search Parameters.  The following .s-params file parameters should be specified:

       force_new_search_p = true
       start_fn_type   "block"
       randomize_random_p = false
       ;; specify the number of trials you wish to run
       max_n_tries = 50
       ;; specify a time greater than duration of run
       min_report_period = 30000

       Note  that  no  current  best  classification  reports  will  be  produced.   Only a final
       classification summary will be output.

CHECKPOINTING

       With very large databases there is a significant probability of a system crash during  any
       one  classification  try.   Under  such  circumstances it is advisable to take the time to
       checkpoint the calculations for possible restart.

       Checkpointing is initiated by specifying "checkpoint_p = true" in  the  ".s-params"  file.
       This causes the inner convergence step to save a copy of the classification onto the
       checkpoint file each time the classification is updated, providing  a  certain  period  of
       time has elapsed.  The file extension is ".chkpt[-bin]".

       Each time AutoClass completes a cycle, a "." is output to the screen to provide you with
       information to be used in setting the min_checkpoint_period value (default  10800  seconds
       or  3  hours).   There is obviously a trade-off between frequency of checkpointing and the
       probability that your machine may crash, since the repetitive writing  of  the  checkpoint
       file will slow the search process.

       Restarting AutoClass Search:

       To  recover  the  classification  and  continue  the  search after rebooting and reloading
       AutoClass,  specify  reconverge_type  =  "chkpt"  in   the   ".s-params"   file   (specify
       force_new_search_p as false).

       AutoClass  will  reload  the  appropriate  database and models, provided there has been no
       change  in  their  filenames  since  the  time  they  were  loaded  for  the  checkpointed
       classification  run.   The  ".s-params"  file contains any non-default arguments that were
       provided to the original call.

       In the beginning of a search, before start_j_list has been emptied, it will  be  necessary
       to  trim the original list to what would have remained in the crashed search.  This can be
       determined by looking at the ".log" file to determine what values were already  used.   If
       the  start_j_list  has been emptied, then an empty start_j_list should be specified in the
       ".s-params" file.  This is done either by

               start_j_list =

       or

               start_j_list = -9999
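
       The trimming step can be sketched in Python as below.  It assumes the starting class
       counts already used have been read by hand from the ".log" file, in order; the function
       name remaining_start_j_list is hypothetical.

```python
def remaining_start_j_list(original, used):
    """Drop the leading entries of original that were already tried.

    original: the start_j_list given to the crashed search.
    used: starting class counts found in the ".log" file, in order.
    """
    n = 0
    while n < len(original) and n < len(used) and original[n] == used[n]:
        n += 1
    return original[n:]

rest = remaining_start_j_list([2, 3, 5, 7, 10, 15, 25], [2, 3, 5])
# Format the result as a parameter line for the restart run.
print("start_j_list = " + ", ".join(str(j) for j in rest))
# start_j_list = 7, 10, 15, 25
```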

       Here is a set of scripts to demonstrate check-pointing:

       autoclass -search data/glass/glassc.db2 data/glass/glass-3c.hd2 \
            data/glass/glass-mnc.model data/glass/glassc-chkpt.s-params

       Run 1)
         ## glassc-chkpt.s-params
         max_n_tries = 2
         force_new_search_p = true
         ## --------------------
         ;; run to completion

       Run 2)
         ## glassc-chkpt.s-params
         force_new_search_p = false
         max_n_tries = 10
         checkpoint_p = true
         min_checkpoint_period = 2
         ## --------------------
         ;; after 1 checkpoint, ctrl-C to simulate cpu crash

       Run 3)
         ## glassc-chkpt.s-params
         force_new_search_p = false
         max_n_tries = 1
         checkpoint_p = true
         min_checkpoint_period = 1
         reconverge_type = "chkpt"
         ## --------------------
         ;; checkpointed trial should finish

OUTPUT FILES

       The standard reports are

       1)     Attribute influence values: presents the relative influence or significance of  the
              data's   attributes   both  globally  (averaged  over  all  classes),  and  locally
              (specifically for each class). A heuristic for  relative  class  strength  is  also
              listed;

       2)     Cross-reference  by  case  (datum)  number: lists the primary class probability for
              each datum, ordered by case number.  When report_mode = "data",  additional  lesser
              class probabilities (greater than or equal to 0.001) are listed for each datum;

       3)     Cross-reference  by  class number: for each class the primary class probability and
              any lesser class probabilities (greater than or equal to 0.001) are listed for each
              datum  in  the class, ordered by case number. It is also possible to list, for each
              datum, the values of attributes, which you select.

       The attribute influence values  report  attempts  to  provide  relative  measures  of  the
       "influence"  of  the  data  attributes  on  the  classes found by the classification.  The
       normalized class strengths, the normalized attribute  influence  values  summed  over  all
       classes, and the individual influence values (I[jkl]) are all only relative measures.  They
       carry more meaning than a simple rank ordering, but should not be read as anything
       approaching absolute values.

       The  reports  are output to files whose names and pathnames are taken from the ".r-params"
       file pathname.  The report file types (extensions) are:

       influence values report
              "influ-o-text-n" or "influ-no-text-n"

       cross-reference by case
              "case-text-n"

       cross-reference by class
              "class-text-n"

       or, if report_mode is overridden to "data":

       influence values report
              "influ-o-data-n" or "influ-no-data-n"

       cross-reference by case
              "case-data-n"

       cross-reference by class
              "class-data-n"

       where n is the  classification  number  from  the  "results"  file.   The  first  or  best
       classification  is  numbered  1, the next best 2, etc.  The default is to generate reports
       only for the best classification in the "results" file.  You can produce reports for other
       saved  classifications  by  using  report  params  keywords  n_clsfs and clsf_n_list.  The
       "influ-o-text-n" file type is the default (order_attributes_by_influence_p  =  true),  and
       lists  each  class's  attributes in descending order of attribute influence value.  If the
       value of order_attributes_by_influence_p is overridden to be false in  the  <...>.r-params
       file,  then each class's attributes will be listed in ascending order by attribute number.
       The extension of the file generated will be "influ-no-text-n".   This  method  of  listing
       facilitates the visual comparison of attribute values between classes.
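
       The naming scheme described above can be summarized in a few lines of Python.  This is a
       sketch of the rules as stated in this section, not AutoClass code; the function name
       report_paths and the path "run/glass.r-params" are made up.

```python
from pathlib import PurePosixPath

def report_paths(r_params_path, n=1, report_mode="text", ordered=True):
    """Build the three report file names for classification number n.

    ordered mirrors order_attributes_by_influence_p.
    """
    # Strip the ".r-params" extension and keep the directory part.
    base = PurePosixPath(r_params_path).with_suffix("")
    influ = "influ-o" if ordered else "influ-no"
    return [f"{base}.{kind}-{report_mode}-{n}"
            for kind in (influ, "case", "class")]

for name in report_paths("run/glass.r-params"):
    print(name)
# run/glass.influ-o-text-1
# run/glass.case-text-1
# run/glass.class-text-1
```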

       For example, this command:

            autoclass -reports sample/imports-85c.results-bin \
                 sample/imports-85c.search sample/imports-85c.r-params

       with this line in the ".r-params" file:

            xref_class_report_att_list = 2, 5, 6

       will generate these output files:

            imports-85c.influ-o-text-1
            imports-85c.case-text-1
            imports-85c.class-text-1

       The  AutoClass  C reports provide the capability to compute sigma class contour values for
       specified pairs of real valued attributes, when generating  the  influence  values  report
       with  the  data  option  (report_mode  =  "data").  Note that sigma class contours are not
       generated from discrete type attributes.

       The sigma contours are the two  dimensional  equivalent  of  n-sigma  error  bars  in  one
       dimension.  Specifically, for two independent attributes the n-sigma contour is defined as
       the ellipse where

       ((x - xMean) / xSigma)^2 + ((y - yMean) / ySigma)^2 == n

       With covariant attributes, the n-sigma contours are defined identically,  in  the  rotated
       coordinate system of the distribution's principal axes.  Thus independent attributes give
       ellipses oriented parallel with the attribute axes, while the axes of  sigma  contours  of
       covariant attributes are rotated about the center determined by the means.  In either case
       the sigma contour represents a line where the class probability is constant,  irrespective
       of any other class probabilities.

       With  three  or  more  attributes  the  n-sigma  contours become k-dimensional ellipsoidal
       surfaces.  This code takes advantage of the fact that the parallel projection of an
       n-dimensional ellipsoid, onto any 2-dim plane, is bounded by an ellipse.  In this simplified
       case of projecting the single sigma ellipsoid onto the coordinate planes, it is also true
       that the 2-dim covariances of this ellipse are equal to the corresponding elements of the
       n-dim ellipsoid's covariances.  The Eigen-system of the 2-dim covariance then gives the
       variances w.r.t. the principal components of the ellipse, and the rotation that aligns it
       with the data.  This represents the best way to display a  distribution  in  the  marginal
       plane.
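
       The eigen-system step can be made concrete for the 2-dim case.  The Python sketch below
       uses the closed-form eigen-decomposition of a symmetric 2x2 covariance matrix to obtain
       the one-sigma ellipse; it illustrates the mathematics only and is not taken from the
       AutoClass sources.  The function name ellipse_from_cov is hypothetical.

```python
import math

def ellipse_from_cov(var_x, var_y, cov_xy):
    """Return (major_sigma, minor_sigma, angle) for a 2x2 covariance.

    The matrix is [[var_x, cov_xy], [cov_xy, var_y]].  angle is the
    rotation of the major axis from the x axis, in radians.
    """
    mean = (var_x + var_y) / 2.0
    radius = math.hypot((var_x - var_y) / 2.0, cov_xy)
    lam_major = mean + radius
    lam_minor = max(mean - radius, 0.0)  # guard against rounding below zero
    angle = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    return math.sqrt(lam_major), math.sqrt(lam_minor), angle

# Independent attributes: axes stay parallel to the attribute axes.
print(ellipse_from_cov(4.0, 1.0, 0.0))  # (2.0, 1.0, 0.0)
# Covariant attributes: the ellipse is rotated about the means.
print(ellipse_from_cov(2.0, 2.0, 1.0))
```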

       To  get  contour  values, set the keyword sigma_contours_att_list to a list of real valued
       attribute indices (from .hd2 file), and request an influence values report with  the  data
       option.  For example,

            report_mode = "data"
            sigma_contours_att_list = 3, 4, 5, 8, 15

   OUTPUT REPORT PARAMETERS
       The  contents  of the output report are controlled by the ".r-params" file.  In this file,
       an empty line or a line starting with one of these characters is  treated  as  a  comment:
       "#",  "!",  or ";".  The parameter name and its value can be separated by an equal sign, a
       space, or a tab:

            n_clsfs 1
            n_clsfs = 1
            n_clsfs<tab>1

       Spaces are ignored if "=" or "<tab>" are used as separators.  Note there are  no  trailing
       semicolons.

       The following are the allowed parameters and their default values:

       n_clsfs = 1
              number  of  clsfs in the .results file for which to generate reports, starting with
              the first or "best".

       clsf_n_list =
              if specified, this is a one-based index list of clsfs in  the  clsf  sequence  read
              from the .results file.  It overrides "n_clsfs".  For example:

                   clsf_n_list = 1, 2

              will produce the same output as

                   n_clsfs = 2

              but

                   clsf_n_list = 2

              will only output the "second best" classification report.

       report_type =
              type   of   reports   to   generate:  "all",  "influence_values",  "xref_case",  or
              "xref_class".

       report_mode =
              mode of reports to generate. "text" is formatted text layout.  "data" is  numerical
              -- suitable for further processing.

       comment_data_headers_p = false
              the default value does not insert # in column 1 of most report_mode = "data" header
              lines.  If specified as true, the comment character will be inserted in most header
              lines.

       num_atts_to_list =
              if  specified, the number of attributes to list in influence values report.  if not
              specified, all attributes will be listed.  (e.g. "num_atts_to_list = 5")

       xref_class_report_att_list =
              if specified, a list of attribute numbers (zero-based), whose values will be output
              in the "xref_class" report along with the case probabilities.  if not specified, no
              attribute values will be output.  (e.g. "xref_class_report_att_list = 1, 2, 3")

       order_attributes_by_influence_p = true
              The default value lists each class's attributes in descending  order  of  attribute
              influence  value,  and  uses  ".influ-o-text-n" as the influence values report file
              type.  If specified as false, then  each  class's  attributes  will  be  listed  in
              ascending  order  by attribute number.  The extension of the file generated will be
              "influ-no-text-n".

       break_on_warnings_p = true
              The default value asks the user whether to continue or  not  when  data  definition
              warnings  are  found.  If specified as false, then AutoClass will continue, despite
              warnings -- the warning will continue to be output to the terminal.

       free_storage_p = true
              The default value tells AutoClass to free the majority of  its  allocated  storage.
              This  is not required, and in the case of the DEC Alpha causes a core dump [is this
              still true?].  If specified as false, AutoClass will not attempt to free storage.

       max_num_xref_class_probs = 5
              Determines how many lesser class probabilities will be printed for the case and
              class cross-reference reports.  The default is to print the most probable class
              probability value and up to 4 lesser class probabilities.  Note this is true for
              both the "text" and "data" class cross-reference reports, but only true for the
              "data" case cross-reference report.  The "text" case cross-reference report only
              has the most probable class probability.

       sigma_contours_att_list =
              If specified, a list of real valued attribute indices (from the .hd2 file) that
              will be used to compute sigma class contour values when generating the influence
              values report with the data option (report_mode = "data").  If not specified, there
              will be no sigma class contour output.  (e.g. "sigma_contours_att_list = 3, 4, 5,
              8, 15")

INTERPRETATION OF AUTOCLASS RESULTS

   WHAT HAVE YOU GOT?
       Now you have run AutoClass on your  data  set  --  what  have  you  got?   Typically,  the
       AutoClass search procedure finds many classifications, but only saves the few best.  These
       are now available for inspection and interpretation.  The most important indicator of the
       relative merits of these alternative classifications is the Log total posterior
       probability value.  Note that since the probability lies between 1 and 0, the
       corresponding Log probability is negative and ranges from 0 to negative infinity.  The
       number e raised to the power of the difference between two of these Log probability
       values gives the relative probability of the alternative classifications.  So a
       difference of, say, 100 implies one classification is e^100 ~= 10^43 times more likely
       than the other.  However, these numbers can be very misleading, since they give the
       relative probability of alternative classifications under the AutoClass assumptions.
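
       The arithmetic above can be checked with a few lines of Python.  This is only a sketch:
       the two Log probability values are made-up numbers, not output from any real AutoClass
       run, and the function name is hypothetical.

```python
import math

def relative_probability(log_p_a, log_p_b):
    """Return (ratio, log10_ratio) for P(a)/P(b), given natural-log probabilities."""
    diff = log_p_a - log_p_b
    log10_ratio = diff / math.log(10)
    # math.exp overflows a float for arguments above roughly 709,
    # so report infinity there and rely on log10_ratio instead.
    ratio = math.exp(diff) if diff < 709 else math.inf
    return ratio, log10_ratio

# Hypothetical Log probabilities for two classifications, 100 apart.
ratio, log10_ratio = relative_probability(-16000.0, -16100.0)
print(f"ratio = {ratio:.3g}, i.e. about 10^{log10_ratio:.1f}")
```

       For large differences the base-10 exponent is the more useful number, since the ratio
       itself quickly exceeds what a floating point value can hold.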

   ASSUMPTIONS
       Specifically, the most important AutoClass assumptions are the use of  normal  models  for
       real  variables,  and  the assumption of independence of attributes within a class.  Since
       these assumptions are often violated in practice, the difference in posterior  probability
       of  alternative  classifications  can  be partly due to one classification being closer to
       satisfying  the  assumptions  than  another,  rather  than  to  a   real   difference   in
       classification   quality.   Another  source  of  uncertainty  about  the  utility  of  Log
       probability values is that they do not take into account any specific prior knowledge  the
       user  may have about the domain.  This means that it is often worth looking at alternative
       classifications to see if you can interpret them, but it is worth starting from  the  most
       probable  first.  Note that if the Log probability value is much greater than that for the
       one class case, it is saying that there is overwhelming evidence for some structure in the
       data, and part of this structure has been captured by the AutoClass classification.

   INFLUENCE REPORT
       So  you have now picked a classification you want to examine, based on its Log probability
       value; how do you examine it?  The first thing to do is to generate an "influence"  report
       on   the   classification   using   the   report   generation   facilities  documented  in
       /usr/share/doc/autoclass/reports-c.text.  An influence report is designed to summarize the
       important information buried in the AutoClass data structures.

       The first part of this report gives the heuristic class "strengths".  Class "strength" is
       here defined as the geometric mean probability that any instance "belonging to" a class
       would have been generated from that class's probability model.  It thus provides a
       heuristic measure of how strongly each class predicts "its" instances.
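
       As an illustration of what a geometric mean probability is, the following Python sketch
       computes one from a list of per-instance probabilities.  The input values are invented,
       and how AutoClass itself selects the instances "belonging to" a class and evaluates
       their probabilities is not described here, so treat this as an assumption-laden example
       rather than the program's actual computation.

```python
import math

def class_strength(instance_probs):
    """Geometric mean of per-instance probabilities, computed in log space.

    Working with logarithms avoids underflow when many small
    probabilities would otherwise be multiplied together.
    """
    logs = [math.log(p) for p in instance_probs]
    return math.exp(sum(logs) / len(logs))

# Hypothetical probabilities of three instances under their class model.
probs = [0.5, 0.25, 0.125]
strength = class_strength(probs)
print(f"class strength = {strength:.4f}")
```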

       The second part is a listing of the overall "influence" of each of the attributes used  in
       the  classification.   These  give a rough heuristic measure of the relative importance of
       each  attribute  in  the  classification.   Attribute  "influence  values"  are  a   class
       probability  weighted  average  of  the  "influence"  of each attribute in the classes, as
       described below.

       The next part of the report is a summary description of each of the classes.  The classes
       are arbitrarily numbered from 0 up to n, in order of descending class weight.  A class
       weight of, say, 34.1 means that the weighted sum of membership probabilities for that
       class is 34.1.  Note that a class weight of 34 does not necessarily mean that 34 cases
       belong to that class, since many cases may have only partial membership in that class.
       Within each class, attributes or attribute sets are ordered by the "influence" of their
       model term.

   CROSS ENTROPY
       A commonly used measure of the divergence between two probability distributions is the
       cross entropy (also known as relative entropy or Kullback-Leibler divergence): the sum
       over all possible values x of P(x|c...)*log[P(x|c...)/P(x|g...)], where c... and g...
       define the distributions.  It ranges from zero, for identical distributions, to
       infinity, for distributions placing probability 1 on differing values of an attribute.
       With conditionally independent terms in the probability distributions, the
       cross entropy can be factored to a sum over these terms.  These factors provide a  measure
       of   the   corresponding   modeled   attribute's  influence  in  differentiating  the  two
       distributions.
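
       The factoring into per-attribute terms can be sketched in Python for discrete
       attributes.  The attribute names and probabilities below are invented for illustration,
       and the code is a direct transcription of the formula above, not AutoClass's own
       implementation.

```python
import math

def cross_entropy(p, q):
    """Sum over x of p[x] * log(p[x] / q[x]).

    Terms with p[x] == 0 contribute nothing; a value that p allows
    but q forbids makes the result infinite.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total

# Hypothetical class and global distributions for two independent attributes.
class_dist = {
    "color": {"red": 0.8, "blue": 0.2},
    "size": {"small": 0.5, "large": 0.5},
}
global_dist = {
    "color": {"red": 0.5, "blue": 0.5},
    "size": {"small": 0.5, "large": 0.5},
}

# With independent attributes, the total is the sum of per-attribute terms.
per_attribute = {a: cross_entropy(class_dist[a], global_dist[a]) for a in class_dist}
total = sum(per_attribute.values())
for name, value in per_attribute.items():
    print(f"{name}: {value:.4f}")
print(f"total: {total:.4f}")
```

       Here "size" has the same distribution in the class as globally, so its term is zero and
       the whole divergence is attributed to "color".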

       We define the modeled term's "influence" on a class to be the cross entropy term  for  the
       class   distribution   w.r.t.   the   global   class  distribution  of  the  single  class
       classification.  "Influence" is thus a measure  of  how  strongly  the  model  term  helps
       differentiate  the  class from the whole data set.  With independently modeled attributes,
       the influence can legitimately be ascribed to the attribute itself.  With correlated or
       covariant attribute sets, the cross entropy factor is a function of the entire set, and
       we distribute the influence value equally over the modeled attributes.

   ATTRIBUTE INFLUENCE VALUES
       In the "influence" report on each class, the attribute parameters for that class are given
       in order of highest influence value for the model term attribute sets.  Only the first few
       attribute sets usually have significant influence values.  If  an  influence  value  drops
       below  about  20%  of  the  highest  value,  then  it is probably not significant, but all
       attribute sets are listed for completeness.  In addition to the influence value  for  each
       attribute  set,  the  values of the attribute set parameters in that class are given along
       with the corresponding "global" values.  The global values are computed directly from  the
       data  independent  of  the  classification.   For  example, if the class mean of attribute
       "temperature" is 90 with standard deviation of 2.5, but the  global  mean  is  68  with  a
       standard  deviation  of 16.3, then this class has selected out cases with much higher than
       average temperature, and a rather  small  spread  in  this  high  range.   Similarly,  for
       discrete  attribute  sets,  the  probability of each outcome in that class is given, along
       with the corresponding global probability -- ordered by  its  significance:  the  absolute
       value of (log {<local-probability> / <global-probability>}).  The sign of the significance
       value shows the direction of change from the global  class.   This  information  gives  an
       overview of how each class differs from the average for all the data, in order of the most
       significant differences.
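
       The ordering of discrete outcomes by significance can be sketched in Python as follows.
       The outcome names and probabilities are invented, the function name is hypothetical,
       and all probabilities are assumed to be nonzero.

```python
import math

def rank_outcomes(local_probs, global_probs):
    """Return (outcome, local, global, significance) rows, most significant first.

    Assumes every probability is strictly positive.
    """
    rows = []
    for outcome, local in local_probs.items():
        glob = global_probs[outcome]
        # The sign shows the direction of change from the global value.
        significance = math.log(local / glob)
        rows.append((outcome, local, glob, significance))
    # Order by the absolute value of the significance, largest first.
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows

# Hypothetical probabilities for one discrete attribute in one class.
local_probs = {"yes": 0.9, "no": 0.1}
global_probs = {"yes": 0.5, "no": 0.5}
ranked = rank_outcomes(local_probs, global_probs)
for outcome, local, glob, sig in ranked:
    print(f"{outcome:4s} local={local:.2f} global={glob:.2f} significance={sig:+.3f}")
```

       In this example "no" is listed first: its probability falls by a factor of five in the
       class, which is a larger change on the log scale than the rise of "yes" from 0.5 to 0.9.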

   CLASS AND CASE REPORTS
       Having gained a description of the classes from the "influence" report, you  may  want  to
       follow-up  to see which classes your favorite cases ended up in.  Conversely, you may want
       to see which cases belong to  a  particular  class.   For  this  kind  of  cross-reference
       information  two  complementary reports can be generated.  These are more fully documented
       in /usr/share/doc/autoclass/reports-c.text.  The "class" report lists all the cases which
       have  significant  membership in each class and the degree to which each such case belongs
       to that class.  Cases whose class membership is less than 90% in the  current  class  have
       their  other  class  membership  listed  as well.  The cases within a class are ordered in
       increasing case number.  The alternative "cases" report states which class (or classes)  a
       case  belongs  to,  and  the membership probability in the most probable class.  These two
       reports allow you to find which cases belong to which classes or the other way around.  If
       nearly  every  case  has close to 99% membership in a single class, then it means that the
       classes are well separated, while a high degree of  cross-membership  indicates  that  the
       classes are heavily overlapped.  Highly overlapped classes are an indication that the idea
       of classification is breaking down and that groups of mutually highly overlapped classes,
       a kind of meta class, are probably a better way of understanding the data.

   COMPARING CLASS WEIGHTS AND CLASS/CASE REPORT ASSIGNMENTS
       The class weight, given as the class probability parameter, is essentially the sum, over
       all data instances, of the normalized probability that the instance is a member of the
       class.  It is probably an error on our part that we format this number as an integer in
       the report, rather than emphasizing its real nature.  You will find the actual real value
       recorded as the w_j parameter in the class_DS structures in any .results[-bin] file.

       The  .case  and  .class reports give probabilities that cases are members of classes.  Any
       assignment of cases to classes requires  some  decision  rule.   The  maximum  probability
       assignment  rule is often implicitly assumed, but it cannot be expected that the resulting
       partition  sizes  will  equal  the  class  weights  unless  nearly  all  class  membership
       probabilities  are  effectively  one  or  zero.   With  non-1/0  membership probabilities,
       matching the class weights requires summing the probabilities.
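
       The difference between class weights and maximum probability partition sizes can be seen
       in a small Python example.  The membership table below is invented and stands in for
       the probabilities found in the cross-reference reports.

```python
# Hypothetical membership probabilities: one row per case, one column per class.
membership = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.6, 0.4],
    [0.2, 0.8],
]
n_classes = 2

# Class weight: sum of membership probabilities over all cases.
weights = [sum(row[j] for row in membership) for j in range(n_classes)]

# Partition size under the maximum probability assignment rule.
counts = [0] * n_classes
for row in membership:
    counts[row.index(max(row))] += 1

print("class weights:  ", [round(w, 2) for w in weights])
print("partition sizes:", counts)
```

       The weights come to 2.3 and 1.7, while the maximum probability rule puts 3 cases in the
       first class and 1 in the second.  The two only agree when memberships are close to one
       or zero.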

       In addition, there is the question of completeness of the  EM  (expectation  maximization)
       convergence.   EM  alternates  between  estimating  class  parameters and estimating class
       membership probabilities.  These estimates converge on  each  other,  but  never  actually
       meet.   AutoClass  implements  several  convergence  algorithms  with  alternate  stopping
       criteria using appropriate parameters in the .s-params  file.   Proper  setting  of  these
       parameters,   to   get   reasonably   complete   and  efficient  convergence  may  require
       experimentation.
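
       The alternation described above can be illustrated with a generic EM loop for a mixture
       of two one-dimensional normal classes.  This is a textbook sketch, not one of the
       convergence algorithms implemented in AutoClass: the data, the starting values, the
       stopping rule (relative change in log likelihood) and all names are assumptions made
       for the example.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def em_two_normals(data, means, sds, weights, rel_delta=1e-6, max_cycles=200):
    """Fit a two-class normal mixture; returns (means, sds, weights, resp, cycles)."""
    means, sds, weights = list(means), list(sds), list(weights)
    prev_loglik = None
    cycle = 0
    resp = []
    for cycle in range(1, max_cycles + 1):
        # Estimate class membership probabilities from current parameters.
        resp = []
        loglik = 0.0
        for x in data:
            joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        # Stop when the log likelihood has effectively stopped changing.
        if prev_loglik is not None and abs(loglik - prev_loglik) <= rel_delta * abs(prev_loglik):
            break
        prev_loglik = loglik
        # Re-estimate class parameters from the membership probabilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor keeps a class from collapsing
    return means, sds, weights, resp, cycle

# Hypothetical data with two well separated groups.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
means, sds, weights, resp, cycles = em_two_normals(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print("means:", [round(m, 3) for m in means], "cycles:", cycles)
```

       Tightening rel_delta trades run time for more complete convergence, which is the kind
       of trade-off the stopping parameters in the .s-params file control.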

   ALTERNATIVE CLASSIFICATIONS
       In summary, the various reports that can be generated  give  you  a  way  of  viewing  the
       current classification.  It is usually a good idea to look at alternative classifications
       even though they do not have the highest Log probability values.  These other
       classifications  usually  have  classes that correspond closely to strong classes in other
       classifications, but can differ in the weak classes.  The "strength" of a class  within  a
       classification  can  usually  be  judged  by  how dramatically the highest influence value
       attributes in the class differ from the corresponding global attributes.  If none  of  the
       classifications  seem  quite satisfactory, it is always possible to run AutoClass again to
       generate new classifications.

   WHAT NEXT?
       Finally, the question of what to do after you  have  found  an  insightful  classification
       arises.   Usually,  classification is a preliminary data analysis step for examining a set
       of cases (things, examples, etc.) to see if they can be grouped so  that  members  of  the
       group  are  "similar"  to  each  other.   AutoClass gives such a grouping without the user
       having to define a similarity measure.  The built-in "similarity" measure  is  the  mutual
       predictiveness  of  the  cases.  The next step is to try to "explain" why some objects are
       more like others than those in a different group.  Usually, domain knowledge  suggests  an
       answer.  For example, a classification of people based on income, buying habits, location,
       age, etc., may  reveal  particular  social  classes  that  were  not  obvious  before  the
       classification analysis.  Collecting further data about such classes, such as number of
       cars, what TV shows are watched, etc., would reveal even more.  Longitudinal studies
       would give information about how social classes arise and what influences their
       attitudes -- all of which goes well beyond the initial classification.

PREDICTIONS

       Classifications  can be used to predict class membership for new cases.  So in addition to
       possibly giving you some insight into the structure behind your  data,  you  can  now  use
       AutoClass directly to make predictions, and compare AutoClass to other learning systems.

       This  technique  for  predicting  class  probabilities  is  applicable  to all attributes,
       regardless of data type/sub_type or likelihood model term type.

       In the event that the class membership of a data case does not exceed 0.0099999 for any of
       the  "training"  classes,  the following message will appear in the screen output for each
       case:

               xref_get_data: case_num xxx => class 9999

       Class 9999 members will appear in the "case" and "class" cross-reference  reports  with  a
       class membership of 1.0.

       Cautionary Points:

       The  usual way of using AutoClass is to put all of your data in a data_file, describe that
       data with model and header files, and  run  "autoclass  -search".   Now,  instead  of  one
       data_file you will have two, a training_data_file and a test_data_file.

       It  is most important that both databases have the same AutoClass internal representation.
       Should this not be true, AutoClass will exit or, in some situations, possibly crash.  The
       prediction mode is designed to direct the user into conforming to this requirement.

       Preparation:

       Prediction requires having a training classification and a test  database.   The  training
       classification  is  generated  by  the  running  of  "autoclass  -search"  on the training
       data_file ("data/soybean/soyc.db2"), for example:

           autoclass -search data/soybean/soyc.db2 data/soybean/soyc.hd2
               data/soybean/soyc.model data/soybean/soyc.s-params

       This will produce "soyc.results-bin" and "soyc.search".  Then create a "reports" parameter
       file,  such  as  "soyc.r-params"  (see  /usr/share/doc/autoclass/reports-c.text),  and run
       AutoClass in "reports" mode, such as:

           autoclass -reports data/soybean/soyc.results-bin
               data/soybean/soyc.search data/soybean/soyc.r-params

       This will generate class and case cross-reference files, and  an  influence  values  file.
       The file names are based on the ".r-params" file name:

               data/soybean/soyc.class-text-1
               data/soybean/soyc.case-text-1
               data/soybean/soyc.influ-text-1

       These  will describe the classes found in the training_data_file.  Now this classification
       can be used to predict the probabilistic class  membership  of  the  test_data_file  cases
       ("data/soybean/soyc-predict.db2") in the training_data_file classes.

           autoclass -predict data/soybean/soyc-predict.db2
               data/soybean/soyc.results-bin data/soybean/soyc.search
               data/soybean/soyc.r-params

       This  will  generate  class  and  case  cross-reference files for the test_data_file cases
       predicting their probabilistic class memberships in the training_data_file  classes.   The
       file names are based on the ".db2" file name:

               data/soybean/soyc-predict.class-text-1
               data/soybean/soyc-predict.case-text-1

SEE ALSO

       AutoClass is documented fully here:

       /usr/share/doc/autoclass/introduction-c.text Guide to the documentation

       /usr/share/doc/autoclass/preparation-c.text How to prepare data for use by AutoClass

       /usr/share/doc/autoclass/search-c.text How to run AutoClass to find classifications.

       /usr/share/doc/autoclass/reports-c.text How to examine the classification in various ways.

       /usr/share/doc/autoclass/interpretation-c.text How to interpret AutoClass results.

       /usr/share/doc/autoclass/checkpoint-c.text Protocols for running a checkpointed search.

       /usr/share/doc/autoclass/prediction-c.text Use classifications to predict class membership
       for new cases.

       These provide supporting documentation:

       /usr/share/doc/autoclass/classes-c.text What classification is all about, for beginners.

       /usr/share/doc/autoclass/models-c.text   Brief   descriptions   of    the    model    term
       implementations.

       The mathematical theory behind AutoClass is explained in these documents:

       /usr/share/doc/autoclass/kdd-95.ps  Postscript  file  containing:  P. Cheeseman, J. Stutz,
       "Bayesian Classification (AutoClass): Theory  and  Results",  in  "Advances  in  Knowledge
       Discovery  and Data Mining", Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, &
       Ramasamy Uthurusamy, Eds. The AAAI Press, Menlo Park, expected fall 1995.

       /usr/share/doc/autoclass/tr-fia-90-12-7-01.ps Postscript file containing:  R.  Hanson,  J.
       Stutz,  P.  Cheeseman,  "Bayesian Classification Theory", Technical Report FIA-90-12-7-01,
       NASA Ames Research Center, Artificial Intelligence Branch, May 1991 (The figures  are  not
       included,  since  they were inserted by "cut-and-paste" methods into the original "camera-
       ready" copy.)

AUTHORS

       Dr. Peter Cheeseman
       Principal Investigator - NASA Ames, Computational Sciences Division
       cheesem@ptolemy.arc.nasa.gov

       John Stutz
       Research Programmer - NASA Ames, Computational Sciences Division
       stutz@ptolemy.arc.nasa.gov

       Will Taylor
       Support Programmer - NASA Ames, Computational Sciences Division
       taylor@ptolemy.arc.nasa.gov

SEE ALSO

       multimix(1).

                                         December 9, 2001                            AUTOCLASS(1)