Provided by: autoclass_3.3.6.dfsg.1-1build1_amd64

NAME

       autoclass - automatically discover classes in data

SYNOPSIS

       autoclass -search data_file header_file model_file s_param_file
       autoclass -report results_file search_file r_params_file
       autoclass -predict test_data_file results_file search_file r_params_file

DESCRIPTION

       AutoClass  solves  the problem of automatic discovery of classes in data (sometimes called clustering, or
       unsupervised learning), as distinct from the generation  of  class  descriptions  from  labeled  examples
       (called  supervised  learning).   It  aims  to  discover the "natural" classes in the data.  AutoClass is
       applicable to observations of things that can be described by a set of attributes, without  referring  to
       other  things.   The  data values corresponding to each attribute are limited to be either numbers or the
       elements of a fixed set of symbols.  With numeric data, a measurement error must be provided.

       AutoClass is looking for the best classification(s) of  the  data  it  can  find.   A  classification  is
       composed of:

       1)     A  set  of classes, each of which is described by a set of class parameters, which specify how the
              class is distributed along the various attributes.  For example, "height normally distributed with
              mean 4.67 ft and standard deviation .32 ft",

       2)     A set of class weights, describing what percentage of cases are likely to be in each class.

       3)     A  probabilistic  assignment  of  cases  in  the  data  to these classes.  I.e. for each case, the
              relative probability that it is a member of each class.

       As a strictly Bayesian system (accept no substitutes!), the quality measure AutoClass uses is  the  total
       probability  that,  had you known nothing about your data or its domain, you would have found this set of
       data generated by this underlying model.  This includes the prior probability that the "world" would have
       chosen  this  number  of classes, this set of relative class weights, and this set of parameters for each
       class, and the likelihood that such a set of classes would have generated this  set  of  values  for  the
       attributes in the data cases.

       These  probabilities  are typically very small, in the range of e^-30000, and so are usually expressed in
       exponential notation.
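       Probabilities of this magnitude cannot be represented directly in double precision floating
       point, so they must be stored and compared as natural-log values.  A minimal Python
       illustration (not part of AutoClass):

```python
import math

# e^-30000 underflows IEEE double precision to exactly 0.0,
# so probabilities of this size must be kept as log values.
log_p_a = -30000.0   # log probability of classification A
log_p_b = -30010.0   # log probability of classification B

underflowed = math.exp(log_p_a)   # 0.0 -- the raw value is useless
log_ratio = log_p_a - log_p_b     # 10.0 -- A is e^10 (~22026x) more probable

print(underflowed, log_ratio)
```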

       When run with the -search command, AutoClass searches for a classification.  The required  arguments  are
       the  paths  to  the  four input files, which supply the data, the data format, the desired classification
       model, and the search parameters, respectively.

       By default, AutoClass writes intermediate results in a binary file.  With the -report command,  AutoClass
       generates an ASCII report.  The arguments are the full path names of the .results, .search, and .r-params
       files.

       When run with the -predict command, AutoClass predicts the class membership of a "test" data set based on
       classes found in a "training" data set (see "PREDICTIONS" below).

INPUT FILES

       An  AutoClass data set resides in two files.  There is a header file (file type "hd2") that describes the
       specific data format and attribute definitions.  The actual data values are in a  data  file  (file  type
       "db2").   We  use  two files to allow editing of data descriptions without having to deal with the entire
       data set.  This makes it easy to experiment with different descriptions of the database without having to
       reproduce the data set.  Internally, an AutoClass database structure is identified by its header and data
       files, and the number of data loaded.

       For more detailed information on the formats of these  files,  see  /usr/share/doc/autoclass/preparation-
       c.text.

   DATA FILE
       The data file contains a sequence of data objects (each a datum, or case), terminated by the end of the file. The
       number of values for each data object must be equal to the number of attributes  defined  in  the  header
       file.   Data  objects  must  be  groups of tokens delimited by "new-line".  Attributes are typed as REAL,
       DISCRETE, or DUMMY.  Real attribute values are numbers,  either  integer  or  floating  point.   Discrete
       attribute  values  can  be  strings,  symbols,  or integers.  A dummy attribute value can be any of these
       types.  Dummies are read in but otherwise ignored -- they will be set to zeros in the internal
       database.   Thus  the  actual  values  will  not  be  available  for use in report output.  To have these
       attribute values available, use either type REAL or type DISCRETE, and define their model type as  IGNORE
       in the .model file.  Missing values for any attribute type may be represented either by "?" or by
       another token specified in the header file.  All are translated to a special unique value after being
       read, so the chosen token is effectively reserved for unknown/missing values.

       For example:
             white       38.991306 0.54248405  2 2 1
             red         25.254923 0.5010235   9 2 1
             yellow      32.407973 ?           8 2 1
             all_white   28.953982 0.5267696   0 1 1
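       As an illustration only, a db2-style row could be tokenized as follows.  parse_db2_line is a
       hypothetical helper, not part of the AutoClass distribution (AutoClass does this internally in
       C):

```python
# Minimal sketch: parse a whitespace-delimited db2-style row,
# mapping the unknown token "?" to None.  Hypothetical helper,
# not part of AutoClass itself.
def parse_db2_line(line, unknown_token="?"):
    values = []
    for tok in line.split():
        if tok == unknown_token:
            values.append(None)            # missing value
        else:
            try:
                values.append(float(tok))  # REAL attribute value
            except ValueError:
                values.append(tok)         # DISCRETE symbol
    return values

row = parse_db2_line("yellow      32.407973 ?           8 2 1")
# -> ['yellow', 32.407973, None, 8.0, 2.0, 1.0]
```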

   HEADER FILE
       The header file specifies the data file format and the definitions of the data attributes.  The header
       file functional specification consists of two parts -- the data set format definition specifications,
       and the attribute descriptors.  A ";" in column 1 identifies a comment.

       A header file follows this general format:

           ;; num_db2_format_defs value (number of format def lines
           ;; that follow), range of n is 1 -> 5
           num_db2_format_defs n
           ;; number_of_attributes token and value required
           number_of_attributes <as required>
            ;; following are optional - default values are shown
            ;; separator_char  ' '
            ;; comment_char    ';'
            ;; unknown_token   '?'
            separator_char  ','

           ;; attribute descriptors
           ;; <zero-based att#>  <att_type>  <att_sub_type>  <att_description>
           ;; <att_param_pairs>

       Each attribute descriptor is a line of:

             Attribute index (zero based, beginning in column 1)
             Attribute type.  See below.
              Attribute subtype.  See below.
             Attribute description: symbol (no embedded blanks) or
                   string; <= 40 characters
             Specific property and value pairs.
                   Currently available combinations:

                type           subtype         property type(s)
                ----           --------        ---------------
                dummy          none/nil        --
                discrete       nominal         range
                real           location        error
                real           scalar          zero_point rel_error

       The  ERROR  property should represent your best estimate of the average error expected in the measurement
       and recording of that real attribute.  Lacking better information, the error can  be  taken  as  1/2  the
       minimum  possible  difference  between  measured  values.   It  can  be argued that real values are often
       truncated, so that smaller errors may be justified, particularly for generated data.  But AutoClass  only
       sees  the  recorded  values.   So  it  needs  the  error  in  the recorded values, rather than the actual
       measurement error.  Setting this error much smaller than the minimum expressible difference  implies  the
       possibility  of values that cannot be expressed in the data.  Worse, it implies that two identical values
       must represent measurements that were much closer than they might actually  have  been.   This  leads  to
       over-fitting of the classification.
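       The "half the minimum difference" rule of thumb above can be sketched as follows
       (illustration only; default_error is a hypothetical helper, not an AutoClass function):

```python
# Lacking better information, take the ERROR property as 1/2 the
# minimum difference between distinct recorded values of a REAL
# attribute.  Illustration only, not part of AutoClass.
def default_error(values):
    distinct = sorted(set(values))
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    return min(gaps) / 2.0

print(default_error([2.0, 2.5, 3.5, 5.0]))  # -> 0.25 (min gap 0.5, halved)
```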

       The REL_ERROR property is used for SCALAR reals when the error is proportional to the measured value.
       For SCALAR attributes the ERROR property is not supported; use REL_ERROR instead.

       AutoClass uses the error as a lower bound on the width  of  the  normal  distribution.   So  small  error
       estimates  tend  to give narrower peaks and to increase both the number of classes and the classification
       probability.  Broad error estimates tend to limit the number of classes.

       The scalar ZERO_POINT property is the smallest value that the measurement process  could  have  produced.
       This is often 0.0, or less by some error range.  Similarly, the bounded real's min and max properties are
       exclusive bounds on the attribute's generating process.  For a calculated percentage these would be 0-e
       and 100+e, where e is an error value.  The discrete attribute's range is the number of possible values
       the attribute can take on.  This range must include unknown as a value when such values occur.

       Header File Example:

       !#; AutoClass C header file -- extension .hd2
       !#; the following chars in column 1 make the line a comment:
       !#; '!', '#', ';', ' ', and '\n' (empty line)

       ;#! num_db2_format_defs <num of def lines -- min 1, max 4>
       num_db2_format_defs 2
       ;; required
       number_of_attributes 7
       ;; optional - default values are specified
       ;; separator_char  ' '
       ;; comment_char    ';'
       ;; unknown_token   '?'
       separator_char     ','

        ;; <zero-based att#>  <att_type>  <att_sub_type>  <att_description>
        ;;    <att_param_pairs>
        0 dummy nil       "True class, range = 1 - 3"
        1 real location "X location, m. in range of 25.0 - 40.0" error .25
        2 real location "Y location, m. in range of 0.5 - 0.7" error .05
        3 real scalar   "Weight, kg. in range of 5.0 - 10.0" zero_point 0.0 rel_error .001
       4 discrete nominal  "Truth value, range = 1 - 2" range 2
       5 discrete nominal  "Color of foobar, 10 values" range 10
       6 discrete nominal  Spectral_color_group range 6

   MODEL FILE
       A classification of a data set is made  with  respect  to  a  model  which  specifies  the  form  of  the
       probability  distribution function for classes in that data set.  Normally the model structure is defined
       in a model file (file type "model"), containing one or more  models.   Internally,  a  model  is  defined
       relative  to  a  particular  database.   Thus it is identified by the corresponding database, the model's
       model file and its sequential position in the file.

       Each model is specified by one or more model group definition lines.  Each model  group  line  associates
       attribute indices with a model term type.

       Here is an example model file:

       # AutoClass C model file -- extension .model
       model_index 0 7
       ignore 0
       single_normal_cn 3
       single_normal_cn 17 18 21
       multi_normal_cn 1 2
       multi_normal_cn 8 9 10
       multi_normal_cn 11 12 13
       single_multinomial default

       Here,  the  first  line is a comment.  The following characters in column 1 make the line a comment: `!',
       `#', ` ', `;', and `\n' (empty line).

       The tokens "model_index n m" must appear on the first  non-comment  line,  and  precede  the  model  term
       definition  lines.  n  is  the  zero-based  model index, typically 0 where there is only one model -- the
       majority of search situations.  m is the number of model term definition lines that follow.

       The last seven lines are model group lines.  Each model group line consists of:

       A model term type (one of single_multinomial, single_normal_cm, single_normal_cn, multi_normal_cn, or
           ignore).

       A list of attribute indices (the attribute set list), or the symbol default.  Attribute indices are zero-
           based.  Single model terms may have one or more attribute indices on each line, while multi model
           terms require two or more attribute indices per line.  An attribute index must not appear more than
           once in a model list.
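       These consistency rules can be sketched as follows (check_model is a hypothetical helper,
       not part of the AutoClass distribution):

```python
# Sketch: check that "model_index n m" agrees with the number of term
# lines that follow, and that no attribute index is repeated.
# Hypothetical helper, illustration only.
def check_model(lines):
    # a line is a comment if column 1 is '!', '#', ';', or ' '
    body = [l for l in lines if l.strip() and l[0] not in "!#; "]
    tag, n, m = body[0].split()
    assert tag == "model_index"
    terms = body[1:]
    assert len(terms) == int(m), "term-line count must match m"
    seen = set()
    for term in terms:
        for idx in term.split()[1:]:
            if idx != "default":
                assert idx not in seen, "attribute index repeated"
                seen.add(idx)
    return len(terms)

model = [
    "# AutoClass C model file -- extension .model",
    "model_index 0 3",
    "ignore 0",
    "multi_normal_cn 1 2",
    "single_multinomial default",
]
print(check_model(model))  # -> 3
```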

       Notes:

       1)     At least one model definition is required (model_index token).

       2)     There may be multiple entries in a model for any model term type.

       3)     Model term types currently consist of:

              single_multinomial
                     models discrete attributes as multinomials, with missing values.

              single_normal_cn
                     models real valued attributes as normals; no missing values.

              single_normal_cm
                     models real valued attributes with missing values.

              multi_normal_cn
                     is a covariant normal model without missing values.

              ignore allows the model to ignore one or more attributes.  ignore is not  a  valid  default  model
                     term type.

              See the documentation in models-c.text for further information about specific model terms.

        4)     Data modeled by single_normal_cn, single_normal_cm, or multi_normal_cn whose subtype is scalar
               (value distribution is away from 0.0, and is thus not a "normal" distribution) will be log
               transformed and modeled with the log-normal model.  For data whose subtype is location (value
               distribution is around 0.0), no transform is done, and the normal model is used.
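       Note 4 can be illustrated with a small Python sketch (transformed is a hypothetical helper;
       AutoClass performs this transform internally in C):

```python
import math

# SCALAR attributes (distributed away from 0.0) are log-transformed
# before normal modeling; LOCATION attributes are modeled directly.
# Illustration only.
def transformed(values, subtype, zero_point=0.0):
    if subtype == "scalar":
        return [math.log(v - zero_point) for v in values]  # log-normal model
    return list(values)                                    # plain normal model

weights = [5.0, 7.4, 9.9]                   # scalar: away from zero
print(transformed(weights, "scalar"))
print(transformed([-0.3, 0.1], "location"))
```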

SEARCHING

       AutoClass, when invoked in "search" mode, checks the validity of the set of data, header, model,
       and search parameter files.  Errors will stop the search from starting, and warnings will ask the user
       whether to continue.  A history of the error and warning messages is saved, by default, in the log file.

       Once you have succeeded in describing your data with a  header  file  and  model  file  that  passes  the
       AUTOCLASS  -SEARCH <...> input checks, you will have entered the search domain where AutoClass classifies
       your data.  (At last!)

       The main function to use in finding a good classification of your data is AUTOCLASS -SEARCH, and using it
       will take most of the computation time.  Searches are invoked with:

       autoclass -search <.db2 file path> <.hd2 file path>
            <.model file path> <.s-params file path>

       All  files  must  be  specified  as fully qualified relative or absolute pathnames.  File name extensions
       (file types) for all files are forced to canonical values required by the AutoClass program:

               data file   ("ascii")   db2
               data file   ("binary")  db2-bin
               header file             hd2
               model file              model
               search params file      s-params

       The sample-run (/usr/share/doc/autoclass/examples/) that comes with AutoClass shows some sample searches,
       and  browsing  these  is probably the fastest way to get familiar with how to do searches.  The test data
       sets located under /usr/share/doc/autoclass/examples/ will show  you  some  other  header  (.hd2),  model
       (.model),  and  search params (.s-params) file setups.  The remainder of this section describes how to do
       searches in somewhat more detail.

       The bold faced tokens below are generally search params file parameters.  For more information on the  s-
       params file, see SEARCH PARAMETERS below, or /usr/share/doc/autoclass/search-c.text.gz.

   WHAT RESULTS ARE
       AutoClass  is  looking  for  the  best  classification(s)  of  the data it can find.  A classification is
       composed of:

       1)     a set of classes, each of which is described by a set of class parameters, which specify  how  the
              class is distributed along the various attributes.  For example, "height normally distributed with
              mean 4.67 ft and standard deviation .32 ft",

       2)     a set of class weights, describing what percentage of cases are likely to be in each class.

       3)     a probabilistic assignment of cases in the data  to  these  classes.   I.e.  for  each  case,  the
              relative probability that it is a member of each class.

       As  a  strictly Bayesian system (accept no substitutes!), the quality measure AutoClass uses is the total
       probability that, had you known nothing about your data or its domain, you would have found this  set  of
       data generated by this underlying model.  This includes the prior probability that the "world" would have
       chosen this number of classes, this set of relative class weights, and this set of  parameters  for  each
       class,  and  the  likelihood  that  such a set of classes would have generated this set of values for the
       attributes in the data cases.

       These probabilities are typically very small, in the range of e^-30000, and so are usually  expressed  in
       exponential notation.

   WHAT RESULTS MEAN
       It is important to remember that all of these probabilities are GIVEN that the real model is in the model
       family that AutoClass has restricted its attention to.  If AutoClass is looking for Gaussian classes  and
       the  real  classes  are  Poisson,  then the fact that AutoClass found 5 Gaussian classes may not say much
       about how many Poisson classes there really are.

       The relative probability between different classifications found can be very large, like e^1000,  so  the
       very  best classification found is usually overwhelmingly more probable than the rest (and overwhelmingly
       less probable than any better classifications as yet undiscovered).  If AutoClass should manage  to  find
       two  classifications  that are within about exp(5-10) of each other (i.e. within 100 to 10,000 times more
       probable) then you should consider them to be about equally probable, as our computation is  usually  not
       more accurate than this (and sometimes much less).
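       This rule of thumb can be sketched as (illustration only; comparable is a hypothetical
       helper, not an AutoClass function):

```python
# Classifications whose log probabilities differ by less than ~5-10
# should be treated as roughly equally probable, since the computation
# is usually not more accurate than that.  Illustration only.
def comparable(log_p1, log_p2, threshold=10.0):
    return abs(log_p1 - log_p2) < threshold

print(comparable(-30000.0, -30007.0))   # True: within exp(10)
print(comparable(-30000.0, -31000.0))   # False: e^1000 apart
```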

   HOW IT WORKS
       AutoClass repeatedly creates a random classification and then tries to massage this into a high
       probability classification through local changes, until it converges to some "local maximum".  It then
       remembers  what  it  found  and  starts over again, continuing until you tell it to stop.  Each effort is
       called a "try", and the computed probability is intended to cover the whole  volume  in  parameter  space
       around this maximum, rather than just the peak.

       The standard approach to massaging is to

       1)     Compute  the  probabilistic  class memberships of cases using the class parameters and the implied
              relative likelihoods.

       2)     Using the new class members, compute class statistics (like mean) and revise the class parameters.

       and  repeat  till  they   stop   changing.    There   are   three   available   convergence   algorithms:
       "converge_search_3" (the default), "converge_search_4" and "converge".  Their specification is controlled
       by search params file parameter try_fn_type.
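       The two-step massaging loop above is essentially an expectation-maximization cycle.  A toy
       Python sketch for a one-dimensional mixture of two normals (illustration only; AutoClass's
       actual implementation is in C and covers the full model family):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_step(data, params):
    # 1) probabilistic class memberships from the current parameters
    members = []
    for x in data:
        p = [w * normal_pdf(x, mu, s) for (w, mu, s) in params]
        tot = sum(p)
        members.append([pi / tot for pi in p])
    # 2) revise class weights, means, and sigmas from the new memberships
    new = []
    for j in range(len(params)):
        wj = sum(m[j] for m in members)
        mu = sum(m[j] * x for m, x in zip(members, data)) / wj
        var = sum(m[j] * (x - mu) ** 2 for m, x in zip(members, data)) / wj
        new.append((wj / len(data), mu, max(math.sqrt(var), 1e-3)))
    return new

data = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
params = [(0.5, 0.0, 1.0), (0.5, 6.0, 1.0)]
for _ in range(25):                 # repeat till the parameters stop changing
    params = em_step(data, params)
# the two classes settle near the data clumps around 1.0 and 5.0
```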

   WHEN TO STOP
       You can tell AUTOCLASS -SEARCH to stop by:  1)  giving  a  max_duration  (in  seconds)  argument  at  the
       beginning;  2)  giving  a  max_n_tries  (an integer) argument at the beginning; or 3) by typing a "q" and
       <return> after you have seen enough tries.  The max_duration and max_n_tries arguments are useful if  you
       desire  to  run AUTOCLASS -SEARCH in batch mode.  If you are restarting AUTOCLASS -SEARCH from a previous
       search, the value of max_n_tries you provide, for instance 3, will tell the program  to  compute  3  more
       tries  in  addition  to  however many it has already done.  The same incremental behavior is exhibited by
       max_duration.

       Deciding when to stop is a judgment call and it's  up  to  you.   Since  the  search  includes  a  random
       component, there's always the chance that if you let it keep going it will find something better.  So you
       need to trade off how much better it might be with how long it might take to find it.  The search  status
       reports  that are printed when a new best classification is found are intended to provide you information
       to help you make this tradeoff.

       One clear sign that you should probably stop is if most of the classifications found  are  duplicates  of
       previous  ones (flagged by "dup" as they are found).  This should only happen for very small sets of data
       or when fixing a very small number of classes, like two.

       Our experience is that for moderately large to extremely large data sets (~200 to ~10,000 cases), it is
       necessary to run AutoClass for at least 50 trials.

   WHAT GETS RETURNED
       Just  before returning, AUTOCLASS -SEARCH will give short descriptions of the best classifications found.
       How many will be described can be controlled with n_final_summary.

       By default AUTOCLASS -SEARCH will write out a number of files, both at the end  and  periodically  during
       the  search  (in  case  your system crashes before it finishes).  These files will all have the same name
       (taken from the search params pathname [<name>.s-params]), and differ only in their file extensions.   If
       your  search  runs  are  very  long  and there is a possibility that your machine may crash, you can have
       intermediate "results" files written out.  These can be used to restart your search run with minimum loss
       of search effort.  See the documentation file /usr/share/doc/autoclass/checkpoint-c.text.

       A  ".log"  file  will hold a listing of most of what was printed to the screen during the run, unless you
       set log_file_p to false to say you want no such foolishness.  Unless results_file_p is  false,  a  binary
       ".results-bin"  file  (the  default) or an ASCII ".results" text file, will hold the best classifications
       that were returned, and unless search_file_p is false, a ".search" file  will  hold  the  record  of  the
       search tries. save_compact_p controls whether the "results" files are saved as binary or ASCII text.

       If  the C global variable "G_safe_file_writing_p" is defined as TRUE in "autoclass-c/prog/globals.c", the
       names of "results" files (those that contain  the  saved  classifications)  are  modified  internally  to
       account  for redundant file writing.  If the search params file name is "my_saved_clsfs" you will see the
       following "results" file names (ignoring directories and pathnames for this example):

         save_compact_p = true --
         "my_saved_clsfs.results-bin"     - completely written file
         "my_saved_clsfs.results-tmp-bin" - partially written file, renamed
                             when complete

         save_compact_p = false --
         "my_saved_clsfs.results"    - completely written file
         "my_saved_clsfs.results-tmp"  - partially written file, renamed
                             when complete

       If check pointing is being done, these additional names will appear

         save_compact_p = true --
         "my_saved_clsfs.chkpt-bin"  - completely written checkpoint file
         "my_saved_clsfs.chkpt-tmp-bin" - partially written checkpoint file,
                                renamed when complete
         save_compact_p = false --
         "my_saved_clsfs.chkpt" - completely written checkpoint file
         "my_saved_clsfs.chkpt-tmp"    - partially written checkpoint file,
                                renamed when complete

   HOW TO GET STARTED
       The way to invoke AUTOCLASS -SEARCH is:

       autoclass -search <.db2 file path> <.hd2 file path>
            <.model file path> <.s-params file path>

       To restart a previous search, set force_new_search_p to false in the search params file, since its
       default is true.  Specifying false tells AUTOCLASS -SEARCH to look for a previous compatible search
       (<...>.results[-bin] & <...>.search) and, if one is found, to restart from it.  To force a new search
       instead of restarting an old one, give the parameter force_new_search_p the value of true, or use the
       default.  In that case, if an existing search (<...>.results[-bin] & <...>.search) is present, the
       user will be asked to confirm starting over, since a new search will discard the existing one.

       If  a  previous  search  is continued, the message "RESTARTING SEARCH" will be given instead of the usual
       "BEGINNING SEARCH".  It is generally better to continue a previous search than to start a new one, unless
       you are trying a significantly different search method, in which case statistics from the previous search
       may mislead the current one.

   STATUS REPORTS
       A running commentary on the search will be printed to the screen and to the log file  (unless  log_file_p
       is false).  Note that the ".log" file will contain a listing of all default search params values, and the
       values of all params that are overridden.

       After each try a very short report  (only  a  few  characters  long)  is  given.   After  each  new  best
       classification,  a  longer  report  is  given,  but  no  more often than min_report_period (default is 30
       seconds).

   SEARCH VARIATIONS
       AUTOCLASS -SEARCH by default uses a certain standard search method or "try function" (try_fn_type =
       "converge_search_3").  Two others are also available: "converge_search_4" and "converge".  They are
       provided in case your problem is one that may happen to benefit from them.  In general the default method
       will result in finding better classifications at the expense of a longer search time.  The default was
       chosen so as to be robust, giving even performance across many problems.  The alternatives to the default
       may do better on some problems, but may do substantially worse on others.

       "converge_search_3" uses an absolute stopping criterion (rel_delta_range, default value of 0.0025) which
       tests, for each class, the delta of the log approximate-marginal-likelihood of the class statistics
       with respect to the class hypothesis (class->log_a_w_s_h_j), divided by the class weight
       (class->w_j), between successive convergence cycles.  Increasing this value loosens the convergence and
       reduces the number of cycles.  Decreasing this value tightens the convergence and increases the number of
       cycles.  n_average (default value of 3) specifies how many successive cycles must meet the stopping
       criterion before the trial terminates.
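       The stopping test can be sketched as follows (illustration only; the variable names echo the
       description above, not the C source):

```python
# A trial stops once the per-class relative delta stays below
# rel_delta_range for n_average successive convergence cycles.
# Hypothetical sketch, not the AutoClass C implementation.
def should_stop(deltas, rel_delta_range=0.0025, n_average=3):
    # deltas: per-cycle lists of |delta log_a_w_s_h_j| / w_j, one entry per class
    if len(deltas) < n_average:
        return False
    recent = deltas[-n_average:]
    return all(max(cycle) < rel_delta_range for cycle in recent)

history = [[0.1, 0.2], [0.01, 0.02], [0.002, 0.001],
           [0.001, 0.002], [0.0009, 0.0015]]
print(should_stop(history))  # True: the last 3 cycles are all below 0.0025
```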

       "converge_search_4" uses an absolute stopping criterion (cs4_delta_range, default value of 0.0025) which
       tests, for each class, the slope of the log approximate-marginal-likelihood of the class statistics
       with respect to the class hypothesis (class->log_a_w_s_h_j), divided by the class weight
       (class->w_j), over sigma_beta_n_values (default value 6) convergence cycles.  Increasing the value
       of cs4_delta_range loosens the convergence and reduces the number of cycles.  Decreasing this value
       tightens the convergence and increases the number of cycles.  Computationally, this try function is more
       expensive than "converge_search_3", but may prove useful if the computational "noise" is significant
       compared to the variations in the computed values.  Key calculations are done in double precision
       floating point, and for the largest database we have tested so far (5,420 cases of 93 attributes),
       computational noise has not been a problem, although the value of max_cycles needed to be increased to
       400.

       "converge" uses one of two absolute stopping criteria which test the variation of the classification
       (clsf) log_marginal (clsf->log_a_x_h) delta between successive convergence cycles.  The larger of
       halt_range (default value 0.5) and halt_factor * current_clsf_log_marginal (default value of
       halt_factor is 0.0001) is used.  Increasing these values loosens the convergence and reduces the number of
       cycles.  Decreasing these values tightens the convergence and increases the number of cycles.  n_average
       (default value of 3) specifies how many cycles must meet the stopping criteria before the trial
       terminates.  This is a very approximate stopping criterion, but will give you some feel for the kind of
       classifications to expect.  It would be useful for "exploratory" searches of a database.
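       A sketch of the threshold computation (illustration only; since the log marginal is negative,
       its absolute value is assumed here):

```python
# Stop when the change in clsf->log_a_x_h falls below the larger of
# halt_range and halt_factor * log_marginal.  Hypothetical sketch.
def halt_threshold(log_marginal, halt_range=0.5, halt_factor=0.0001):
    return max(halt_range, abs(halt_factor * log_marginal))

print(halt_threshold(-30000.0))  # the factor term (~3.0) dominates here
print(halt_threshold(-1000.0))   # halt_range (0.5) dominates
```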

       The purpose of reconverge_type = "chkpt" is to complete an interrupted classification by continuing  from
       its  last checkpoint.  The purpose of reconverge_type = "results" is to attempt further refinement of the
       best  completed  classification  using   a   different   value   of   try_fn_type   ("converge_search_3",
       "converge_search_4",  "converge").   If  max_n_tries  is  greater  than  1,  then in each case, after the
       reconvergence has completed, AutoClass will perform further search trials based on the  parameter  values
       in the <...>.s-params file.

       With the use of reconverge_type (default value ""), you may apply more than one try function to a
       classification.  Say you generate several exploratory trials using try_fn_type = "converge", and quit the
       search  saving  .search  and  .results[-bin] files.  Then you can begin another search with try_fn_type =
       "converge_search_3", reconverge_type = "results", and max_n_tries = 1.  This will result in  the  further
       convergence  of  the  best  classification  generated  with  try_fn_type = "converge", with try_fn_type =
       "converge_search_3".  When AutoClass completes this search try,  you  will  have  an  additional  refined
       classification.

       A good way to verify that any of the alternate try_fn_type values is generating a well converged
       classification is to run AutoClass in prediction mode on the same data used for generating the
       classification.  Then generate and compare the corresponding case or class cross reference files for the
       original classification and the prediction.  Small differences between these files are  to  be  expected,
       while  large differences indicate incomplete convergence.  Differences between such file pairs should, on
       average and modulo class deletions, decrease monotonically with further convergence.

       The standard way to create a random classification to begin a try is with the default value of "random"
       for start_fn_type.  The only alternative is "block", which produces repeatable, non-random searches.
       That is how the <..>.s-params files in the autoclass-c/data/.. sub-directories are specified, and how
       development testing is done.

       max_cycles  controls  the maximum number of convergence cycles that will be performed in any one trial by
       the convergence functions.  Its default value is 200.  The screen output shows a period  (".")  for  each
       cycle completed.  If your search trials run for 200 cycles, then either your database is very complex
       (increase the value), or the try_fn_type is not adequate for the situation (try another of the available
       ones, and use converge_print_p to get more information on what is going on).

       Specifying  converge_print_p to be true will generate a brief print-out for each cycle which will provide
       information  so  that  you  can  modify  the  default  values  of   rel_delta_range   &   n_average   for
       "converge_search_3";  cs4_delta_range  &  sigma_beta_n_values  for  "converge_search_4";  and halt_range,
       halt_factor, and n_average for "converge".  Their default values are given in the <..>.s-params files  in
       the autoclass-c/data/..  sub-directories.

   HOW MANY CLASSES?
       Each  new  try  begins  with  a  certain  number of classes and may end up with a smaller number, as some
       classes may drop out of the convergence.  In general, you want to begin  the  try  with  some  number  of
       classes that previous tries have indicated look promising, and you want to be sure you are fishing around
       elsewhere in case you missed something before.

       n_classes_fn_type = "random_ln_normal" is the default way to make this choice.  It fits a log  normal  to
       the  number  of  classes  (usually called "j" for short) of the 10 best classifications found so far, and
       randomly selects from that.  There is currently no alternative.
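       The idea can be sketched as follows (a simplified illustration, not the actual AutoClass code; the
       fitting details inside AutoClass may differ):

```python
import math
import random
import statistics

def random_ln_normal_j(best_js, rng=random):
    """Fit a log normal to the class counts (j) of the best
    classifications found so far, then draw a new j from it.
    Simplified sketch of the "random_ln_normal" idea."""
    logs = [math.log(j) for j in best_js]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)
    return max(2, round(rng.lognormvariate(mu, sigma)))

rng = random.Random(42)
best_js = [3, 4, 4, 5, 5, 6, 6, 7, 8, 10]   # j of the 10 best clsfs
print(random_ln_normal_j(best_js, rng))
```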

       To start the game off, the default is to go down start_j_list for the first few tries, and then switch to
       n_classes_fn_type.  If you believe that the probable number of classes in your database is, say,  75,
       instead of using the default value of start_j_list (2, 3, 5, 7, 10, 15, 25), specify something  like  50,
       60, 70, 80, 90, 100.

       If  one wants to always look for, say, three classes, one can use fixed_j and override the above.  Search
       status reports will describe what the current method for choosing j is.

   DO I HAVE ENOUGH MEMORY AND DISK SPACE?
       Internally, the storage requirements in the current system are of order n_classes_per_clsf  *  (n_data  +
       n_stored_clsfs  * n_attributes * n_attribute_values).  This depends on the number of cases, the number of
       attributes, the values per attribute (use 2 if a real value), and the number  of  classifications  stored
       away  for  comparison  to see if others are duplicates -- controlled by max_n_store (default value = 10).
       The search process does not itself consume significant memory, but storage of the results may do so.
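       That order-of-magnitude estimate can be computed directly; the constant factor (bytes per stored
       value) is machine dependent, and the 8 below is only an assumption:

```python
def storage_order(n_classes_per_clsf, n_data, n_stored_clsfs,
                  n_attributes, n_attribute_values, bytes_per_value=8):
    """Order-of-magnitude internal storage estimate from the formula
    above.  bytes_per_value is an assumed constant factor."""
    return (n_classes_per_clsf
            * (n_data + n_stored_clsfs * n_attributes * n_attribute_values)
            * bytes_per_value)

# 10,000 cases, 20 real attributes (2 values each), max_n_store = 10:
print(storage_order(25, 10_000, 10, 20, 2))  # 2080000 -> roughly 2 MB
```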

       AutoClass C is configured to handle a maximum of 999 attributes.  If you attempt to run  with  more  than
       that  you  will  get  array  bound  violations.   In  that case, change these configuration parameters in
       prog/autoclass.h and recompile AutoClass C:

       #define ALL_ATTRIBUTES                  999
       #define VERY_LONG_STRING_LENGTH         20000
       #define VERY_LONG_TOKEN_LENGTH          500

       For example, these values will handle several thousand attributes:

       #define ALL_ATTRIBUTES                  9999
       #define VERY_LONG_STRING_LENGTH         50000
       #define VERY_LONG_TOKEN_LENGTH          50000

       Disk space taken up by the "log" file will of course depend  on  the  duration  of  the  search.   n_save
       (default  value  =  2) determines how many best classifications are saved into the ".results[-bin]" file.
       save_compact_p controls whether the "results" and "checkpoint" files are saved as binary.   Binary  files
       are  faster  and  more compact, but are not portable.  The default value of save_compact_p is true, which
       causes binary files to be written.

       If the time taken to save the "results" files is a problem, consider increasing min_save_period  (default
       value  =  1800 seconds or 30 minutes).  Files are saved to disk this often if there is anything different
       to report.

   JUST HOW SLOW IS IT?
       Compute time is of order n_data * n_attributes * n_classes * n_tries * converge_cycles_per_try. The major
       uncertainties  in this are the number of basic back and forth cycles till convergence in each try, and of
       course the number of tries.  The number of cycles per trial  is  typically  10-100  for  try_fn_type
       "converge",  and  10-200+ for "converge_search_3" and "converge_search_4".  The maximum number of
       cycles is specified by max_cycles (default value = 200).  The number of trials is up to you and  your
       available computing resources.
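       A small baseline run can be extrapolated with the same proportionality; the example numbers below
       are hypothetical:

```python
def extrapolate_runtime(baseline_seconds, baseline, full):
    """Scale a measured baseline runtime by the ratio of
    n_data * n_attributes * n_classes * n_tries * cycles_per_try.
    baseline and full are tuples of those five factors."""
    ratio = 1.0
    for b, f in zip(baseline, full):
        ratio *= f / b
    return baseline_seconds * ratio

# baseline: 1,000 cases, 10 atts, 5 classes, 2 tries, 20 cycles -> 60 s
# full run: 100,000 cases, 10 atts, 20 classes, 50 tries, 50 cycles
print(extrapolate_runtime(60, (1_000, 10, 5, 2, 20),
                              (100_000, 10, 20, 50, 50)))  # 1500000.0 s
```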

       The  running time of very large data sets will be quite uncertain.  We advise that a few small scale test
       runs be made on your system to determine a baseline.  Specify n_data to limit how many data  vectors  are
       read.   Given  a  very  large  quantity  of data, AutoClass may find its most probable classifications at
       upwards of a hundred classes, and this will require that start_j_list  be  specified  appropriately  (See
       above  section  HOW  MANY  CLASSES?).  If you are quite certain that you only want a few classes, you can
       force AutoClass to search with a fixed number of classes specified by fixed_j.  You will then need to run
       separate searches with each different fixed number of classes.

   CHANGING FILENAMES IN A SAVED CLASSIFICATION FILE
       AutoClass  caches the data, header, and model file pathnames in the saved classification structure of the
       binary (".results-bin") or ASCII (".results") "results" files.  If the "results" and "search"  files  are
       moved  to  a  different  directory location, the search cannot be successfully restarted if you have used
       absolute pathnames.  Thus it is advantageous to invoke AutoClass in a parent directory of the  data,
       header, and model files, so that relative pathnames can be used.  Since the pathnames cached will then be
       relative, the files can be moved to a different host or file system and restarted -- providing  the  same
       relative pathname hierarchy exists.

       However,  since  the  ".results"  file is ASCII text, those pathnames could be changed with a text editor
       (save_compact_p must be specified as false).

   SEARCH PARAMETERS
       The search is controlled by the ".s-params" file.  In this file, an empty line or a  line  starting  with
       one  of these characters is treated as a comment: "#", "!", or ";".  The parameter name and its value can
       be separated by an equal sign, a space, or a tab:

            n_clsfs 1
            n_clsfs = 1
            n_clsfs<tab>1

       Spaces are ignored if "=" or "<tab>" are used as separators.  Note there are no trailing semicolons.
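       The parsing rules above can be sketched as follows (an illustrative reimplementation, not the actual
       AutoClass reader):

```python
def parse_param_line(line):
    """Parse one .s-params/.r-params line into (name, value),
    or return None for empty and comment lines."""
    stripped = line.strip()
    if not stripped or stripped[0] in "#!;":
        return None                      # comment or blank line
    if "=" in stripped:
        name, value = stripped.split("=", 1)
    elif "\t" in stripped:
        name, value = stripped.split("\t", 1)
    else:
        name, value = stripped.split(None, 1)
    return name.strip(), value.strip()   # spaces around "=" / tab ignored

assert parse_param_line("n_clsfs 1") == ("n_clsfs", "1")
assert parse_param_line("n_clsfs = 1") == ("n_clsfs", "1")
assert parse_param_line("n_clsfs\t1") == ("n_clsfs", "1")
assert parse_param_line(";; a comment") is None
```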

       The search parameters, with their default values, are as follows:

       rel_error = 0.01
              Specifies the relative difference measure used by clsf-DS-%=, when deciding if a  new  clsf  is  a
              duplicate of an old one.

       start_j_list = 2, 3, 5, 7, 10, 15, 25
              Initially  try these numbers of classes, so as not to narrow the search too quickly.  The state of
              this list is saved in the <..>.search file and used on restarts, unless an override  specification
              of  start_j_list is made in the .s-params file for the restart run.  This list should bracket your
               expected number of classes, and by a wide margin!  "start_j_list = -9999" specifies an  empty
               list (allowed only on restarts).

       n_classes_fn_type = "random_ln_normal"
              Once  start_j_list  is  exhausted, AutoClass will call this function to decide how many classes to
              start with on the next try, based on the 10 best classifications found  so  far.   Currently  only
              "random_ln_normal" is available.

       fixed_j = 0
              When fixed_j > 0, overrides start_j_list and n_classes_fn_type, and AutoClass will always use this
              value for the initial number of classes.

       min_report_period = 30
              Wait at least this time (in seconds) since last report until reporting verbosely again.  Should be
              set  longer than the expected run time when checking for repeatability of results.  For repeatable
              results, also see force_new_search_p, start_fn_type and randomize_random_p. NOTE: At least one  of
              "interactive_p",  "max_duration",  and "max_n_tries" must be active.  Otherwise AutoClass will run
              indefinitely.  See below.

       interactive_p = true
              When false, allows run to continue until otherwise halted.  When true, standard input  is  queried
              on each cycle for the quit character "q", which, when detected, triggers an immediate halt.

       max_duration = 0
              When  =  0, allows run to continue until otherwise halted.  When > 0, specifies the maximum number
              of seconds to run.

       max_n_tries = 0
              When = 0, allows run to continue until otherwise halted.  When > 0, specifies the  maximum  number
              of tries to make.

       n_save = 2
               Save  this  many clsfs to disk in the .results[-bin] and .search files.  If 0, don't save anything
               (no .search & .results[-bin] files).

       log_file_p = true
              If false, do not write a log file.

       search_file_p = true
              If false, do not write a search file.

       results_file_p = true
              If false, do not write a results file.

       min_save_period = 1800
              CPU crash protection.  This specifies the maximum time, in seconds, that AutoClass will run before
              it saves the current results to disk.  The default time is 30 minutes.

       max_n_store = 10
              Specifies the maximum number of classifications stored internally.

       n_final_summary = 10
              Specifies the number of trials to be printed out after search ends.

       start_fn_type = "random"
              One  of {"random", "block"}.  This specifies the type of class initialization.  For normal search,
              use "random", which randomly selects instances to be initial class  means,  and  adds  appropriate
              variances.  For  testing  with  repeatable search, use "block", which partitions the database into
              successive blocks of near equal  size.   For  repeatable  results,  also  see  force_new_search_p,
              min_report_period, and randomize_random_p.

       try_fn_type = "converge_search_3"
              One  of  {"converge_search_3",  "converge_search_4",  "converge"}.  These specify alternate search
              stopping criteria.  "converge" merely tests the rate of change of the log_marginal  classification
               probability  (clsf->log_a_x_h),  without  checking  rate  of  change  of  individual  classes (see
              halt_range and halt_factor).  "converge_search_3" and "converge_search_4" each monitor  the  ratio
              class->log_a_w_s_h_j/class->w_j  for  all  classes,  and  continue  convergence until all pass the
              quiescence  criteria  for  n_average  cycles.   "converge_search_3"  tests   differences   between
               successive  convergence cycles (see rel_delta_range).  This provides a reasonable, general-purpose
               stopping criterion.  "converge_search_4" averages the ratio over "sigma_beta_n_values" cycles (see
              cs4_delta_range).  This is preferred when converge_search_3 produces many similar classes.

       initial_cycles_p = true
              If true, perform base_cycle in initialize_parameters.  false is used only for testing.

       save_compact_p = true
              true  saves  classifications as machine dependent binary (.results-bin & .chkpt-bin).  false saves
               as ascii text (.results & .chkpt).

       read_compact_p = true
              true reads classifications as machine dependent binary (.results-bin & .chkpt-bin).   false  reads
              as ascii text (.results & .chkpt).

       randomize_random_p = true
               false seeds lrand48, the pseudo-random number function, with 1 to give repeatable  test  cases.
               true  uses  the universal time clock as the seed, giving semi-random searches.  For repeatable
               results, also
              see force_new_search_p, min_report_period and start_fn_type.

       n_data = 0
              With n_data = 0, the entire database is read from .db2.  With n_data > 0, only this number of data
              are read.

       halt_range = 0.5
              Passed to try_fn_type "converge".  With the "converge" try_fn_type, convergence is halted when the
              larger  of  halt_range  and  (halt_factor  *  current_log_marginal) exceeds the difference between
              successive cycle values of the classification  log_marginal  (clsf->log_a_x_h).   Decreasing  this
              value may tighten the convergence and increase the number of cycles.

       halt_factor = 0.0001
              Passed to try_fn_type "converge".  With the "converge" try_fn_type, convergence is halted when the
              larger of halt_range and (halt_factor  *  current_log_marginal)  exceeds  the  difference  between
              successive  cycle  values  of  the classification log_marginal (clsf->log_a_x_h).  Decreasing this
              value may tighten the convergence and increase the number of cycles.

       rel_delta_range = 0.0025
              Passed to try function "converge_search_3", which  monitors  the  ratio  of  log  approx-marginal-
              likelihood of class statistics with-respect-to the class hypothesis (class->log_a_w_s_h_j) divided
               by the class weight (class->w_j), for each class.  "converge_search_3" halts convergence  when
               the  difference  between  successive  cycles  of  this  ratio,  for every class, has remained
               below "rel_delta_range" for "n_average" cycles.  Decreasing "rel_delta_range"  tightens  the
               convergence and increases the
              number of cycles.

       cs4_delta_range = 0.0025
              Passed    to    try    function    "converge_search_4",    which    monitors    the    ratio    of
              (class->log_a_w_s_h_j)/(class->w_j),  for  each   class,   averaged   over   "sigma_beta_n_values"
              convergence  cycles.  "converge_search_4" halts convergence when the maximum difference in average
              values of this ratio falls below "cs4_delta_range".   Decreasing  "cs4_delta_range"  tightens  the
              convergence and increases the number of cycles.

       n_average = 3
              Passed  to  try  functions "converge_search_3" and "converge".  The number of cycles for which the
              convergence criterion must be satisfied for the trial to terminate.

       sigma_beta_n_values = 6
              Passed to try_fn_type "converge_search_4".  The number of past values to use in computing  sigma^2
              (noise) and beta^2 (signal).

       max_cycles = 200
              This  is  the  maximum  number  of  cycles  permitted for any one convergence of a classification,
              regardless of any other stopping criteria.  This is very dependent upon your database  and  choice
              of  model  and  convergence  parameters,  but  should  be about twice the average number of cycles
               reported in the screen dump and .log file.

       converge_print_p = false
              If true, the selected try function will print to the  screen  values  useful  in  specifying  non-
              default  values  for halt_range, halt_factor, rel_delta_range, n_average, sigma_beta_n_values, and
              range_factor.

       force_new_search_p = true
              If  true,  will  ignore  any  previous  search  results,  discarding  the  existing  .search   and
              .results[-bin]  files after confirmation by the user; if false, will continue the search using the
              existing .search and .results[-bin] files.  For repeatable results,  also  see  min_report_period,
              start_fn_type and randomize_random_p.

       checkpoint_p = false
              If  true,  checkpoints of the current classification will be written every "min_checkpoint_period"
               seconds, with file extension .chkpt[-bin].  This is only useful for very large classifications.

       min_checkpoint_period = 10800
              If checkpoint_p = true, the checkpointed classification will be written this often  -  in  seconds
               (default = 3 hours).

        reconverge_type = ""
               Can  be  either  "chkpt"  or "results".  If "checkpoint_p" = true and "reconverge_type" = "chkpt",
               then continue convergence of the classification contained in <...>.chkpt[-bin].  If "checkpoint_p"
               = false and "reconverge_type" = "results",  continue  convergence  of  the  best  classification
               contained in <...>.results[-bin].

       screen_output_p = true
              If false, no output is directed to the  screen.   Assuming  log_file_p  =  true,  output  will  be
              directed to the log file only.

       break_on_warnings_p = true
              The  default  value  asks  the  user whether or not to continue, when data definition warnings are
              found.  If specified as false, then AutoClass will continue, despite warnings -- the warning  will
              continue to be output to the terminal and the log file.

       free_storage_p = true
              The  default  value  tells  AutoClass  to free the majority of its allocated storage.  This is not
               required, and in the case of the DEC Alpha causes a core dump [is this still true?].  If specified
              as false, AutoClass will not attempt to free storage.

   HOW TO GET AUTOCLASS C TO PRODUCE REPEATABLE RESULTS
       In  some  situations,  repeatable  classifications are required: comparing basic AutoClass C integrity on
       different platforms, porting AutoClass C to a new platform, etc.  In order to accomplish this two  things
       are  necessary:  1)  the  same random number generator must be used, and 2) the search parameters must be
       specified properly.

       Random Number Generator. This implementation of AutoClass C uses the Unix srand48/lrand48  random  number
       generator  which  generates  pseudo-random numbers using the well-known linear congruential algorithm and
        48-bit integer arithmetic.  lrand48() returns non-negative long integers uniformly distributed over  the
        interval [0, 2**31).
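        For reference, the 48-bit linear congruential generator behind srand48/lrand48 can be sketched in a
        few lines (a portable reimplementation of the standard algorithm, useful when checking cross-platform
        repeatability):

```python
class Rand48:
    """Minimal srand48/lrand48: X(n+1) = (a*X(n) + c) mod 2**48, with
    the standard constants; lrand48 returns the top 31 bits."""
    A = 0x5DEECE66D
    C = 0xB
    MASK = (1 << 48) - 1

    def __init__(self, seed):
        # srand48 places the 32-bit seed in the high bits over 0x330E
        self.x = ((seed & 0xFFFFFFFF) << 16) | 0x330E

    def lrand48(self):
        self.x = (self.A * self.x + self.C) & self.MASK
        return self.x >> 17          # non-negative, in [0, 2**31)

# Two generators seeded identically produce identical streams:
r1, r2 = Rand48(1), Rand48(1)
assert [r1.lrand48() for _ in range(5)] == [r2.lrand48() for _ in range(5)]
```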

       Search Parameters.  The following .s-params file parameters should be specified:

       force_new_search_p = true
       start_fn_type   "block"
       randomize_random_p = false
       ;; specify the number of trials you wish to run
       max_n_tries = 50
       ;; specify a time greater than duration of run
       min_report_period = 30000

       Note  that  no current best classification reports will be produced.  Only a final classification summary
       will be output.

CHECKPOINTING

       With very large databases  there  is  a  significant  probability  of  a  system  crash  during  any  one
       classification  try.   Under  such  circumstances  it  is  advisable  to  take the time to checkpoint the
       calculations for possible restart.

        Checkpointing is initiated by specifying "checkpoint_p = true" in the ".s-params" file.  This causes  the
        inner  convergence  step  to  save  a  copy of the classification onto the checkpoint file each time the
        classification is updated, provided a certain period of  time  has  elapsed.   The  file  extension  is
       ".chkpt[-bin]".

        Each time AutoClass completes a cycle, a "." is output to the screen to provide you with information  to
       be used in setting the min_checkpoint_period  value  (default  10800  seconds  or  3  hours).   There  is
       obviously a trade-off between frequency of checkpointing and the probability that your machine may crash,
       since the repetitive writing of the checkpoint file will slow the search process.

       Restarting AutoClass Search:

       To recover the classification and continue the search after rebooting and  reloading  AutoClass,  specify
       reconverge_type = "chkpt" in the ".s-params" file (specify force_new_search_p as false).

       AutoClass  will  reload  the  appropriate database and models, provided there has been no change in their
       filenames since the time they were loaded for the checkpointed classification run.  The ".s-params"  file
       contains any non-default arguments that were provided to the original call.

        If the crash occurred early in the search, before start_j_list had been emptied, it will be necessary to
        trim the
       original list to what would have remained in the crashed search.  This can be determined  by  looking  at
       the  ".log"  file to determine what values were already used.  If the start_j_list has been emptied, then
       an empty start_j_list should be specified in the ".s-params" file.  This is done either by

               start_j_list =

       or

               start_j_list = -9999

        Here is a set of scripts to demonstrate checkpointing:

       autoclass -search data/glass/glassc.db2 data/glass/glass-3c.hd2 \
            data/glass/glass-mnc.model data/glass/glassc-chkpt.s-params

       Run 1)
         ## glassc-chkpt.s-params
         max_n_tries = 2
         force_new_search_p = true
         ## --------------------
         ;; run to completion

       Run 2)
         ## glassc-chkpt.s-params
         force_new_search_p = false
         max_n_tries = 10
         checkpoint_p = true
         min_checkpoint_period = 2
         ## --------------------
         ;; after 1 checkpoint, ctrl-C to simulate cpu crash

       Run 3)
         ## glassc-chkpt.s-params
         force_new_search_p = false
         max_n_tries = 1
         checkpoint_p = true
         min_checkpoint_period = 1
         reconverge_type = "chkpt"
         ## --------------------
         ;; checkpointed trial should finish

OUTPUT FILES

       The standard reports are

       1)     Attribute influence values:  presents  the  relative  influence  or  significance  of  the  data's
              attributes both globally (averaged over all classes), and locally (specifically for each class). A
              heuristic for relative class strength is also listed;

       2)     Cross-reference by case (datum) number: lists  the  primary  class  probability  for  each  datum,
              ordered by case number.  When report_mode = "data", additional lesser class probabilities (greater
              than or equal to 0.001) are listed for each datum;

       3)     Cross-reference by class number: for each class the primary class probability and any lesser class
              probabilities  (greater than or equal to 0.001) are listed for each datum in the class, ordered by
               case number.  It is also possible to list, for each datum, the values  of  attributes  which  you
               select.

       The  attribute  influence  values  report attempts to provide relative measures of the "influence" of the
       data attributes on the classes  found  by  the  classification.   The  normalized  class  strengths,  the
       normalized  attribute  influence  values  summed  over  all  classes, and the individual influence values
        (I[jkl]) are all only relative measures and should be interpreted with no more meaning than  a  rank
        ordering, and not as anything approaching absolute values.

       The  reports  are output to files whose names and pathnames are taken from the ".r-params" file pathname.
       The report file types (extensions) are:

       influence values report
              "influ-o-text-n" or "influ-no-text-n"

       cross-reference by case
              "case-text-n"

       cross-reference by class
              "class-text-n"

       or, if report_mode is overridden to "data":

       influence values report
              "influ-o-data-n" or "influ-no-data-n"

       cross-reference by case
              "case-data-n"

       cross-reference by class
              "class-data-n"

       where n is the classification number from the "results"  file.   The  first  or  best  classification  is
       numbered 1, the next best 2, etc.  The default is to generate reports only for the best classification in
       the "results" file.  You can produce reports for other  saved  classifications  by  using  report  params
       keywords    n_clsfs    and    clsf_n_list.     The    "influ-o-text-n"   file   type   is   the   default
       (order_attributes_by_influence_p = true), and lists  each  class's  attributes  in  descending  order  of
       attribute  influence value.  If the value of order_attributes_by_influence_p is overridden to be false in
       the <...>.r-params file, then each class's attributes will be listed  in  ascending  order  by  attribute
       number.   The  extension  of  the  file  generated  will  be  "influ-no-text-n".   This method of listing
       facilitates the visual comparison of attribute values between classes.

       For example, this command:

            autoclass -reports sample/imports-85c.results-bin
                 sample/imports-85c.search sample/imports-85c.r-params

       with this line in the ".r-params" file:

            xref_class_report_att_list = 2, 5, 6

       will generate these output files:

            imports-85.influ-o-text-1
            imports-85.case-text-1
            imports-85.class-text-1

       The AutoClass C reports provide the capability to compute sigma class contour values for specified  pairs
       of  real valued attributes, when generating the influence values report with the data option (report_mode
       = "data").  Note that sigma class contours are not generated from discrete type attributes.

       The sigma contours  are  the  two  dimensional  equivalent  of  n-sigma  error  bars  in  one  dimension.
       Specifically, for two independent attributes the n-sigma contour is defined as the ellipse where

       ((x - xMean) / xSigma)^2 + ((y - yMean) / ySigma)^2 == n

       With covariant attributes, the n-sigma contours are defined identically, in the rotated coordinate system
        of the distribution's principal axes.  Thus independent attributes give ellipses oriented  parallel  with
       the attribute axes, while the axes of sigma contours of covariant attributes are rotated about the center
       determined by the means.  In either case the sigma contour represents a line where the class  probability
       is constant, irrespective of any other class probabilities.
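        For independent attributes, points on the contour can be traced parametrically (a sketch following the
        definition above, where the n-sigma ellipse satisfies the equation with right-hand side n):

```python
import math

def sigma_contour(x_mean, x_sigma, y_mean, y_sigma, n, n_points=8):
    """Points on the n-sigma contour of two independent attributes:
    ((x - xMean)/xSigma)^2 + ((y - yMean)/ySigma)^2 == n."""
    r = math.sqrt(n)
    return [(x_mean + r * x_sigma * math.cos(t),
             y_mean + r * y_sigma * math.sin(t))
            for t in (2 * math.pi * k / n_points for k in range(n_points))]

# Every generated point satisfies the defining equation:
pts = sigma_contour(4.67, 0.32, 10.0, 2.0, n=2)
x, y = pts[1]
value = ((x - 4.67) / 0.32) ** 2 + ((y - 10.0) / 2.0) ** 2
print(round(value, 10))  # 2.0
```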

       With  three or more attributes the n-sigma contours become k-dimensional ellipsoidal surfaces.  This code
       takes advantage of the fact that the parallel projection of an n-dimensional ellipsoid,  onto  any  2-dim
       plane,  is  bounded by an ellipse.  In this simplified case of projecting the single sigma ellipsoid onto
       the coordinate planes, it is also true that the 2-dim covariances  of  this  ellipse  are  equal  to  the
       corresponding  elements  of  the n-dim ellipsoid's covariances.  The Eigen-system of the 2-dim covariance
        then gives the variances w.r.t. the principal components of the ellipse, and the rotation that aligns  it
       with the data.  This represents the best way to display a distribution in the marginal plane.
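        For covariant attribute pairs, the eigen-system of the 2-dim covariance matrix can be computed in
        closed form (a sketch of the technique; AutoClass's own numerics may differ):

```python
import math

def ellipse_from_covariance(var_x, var_y, cov_xy):
    """Principal-component variances and rotation angle (radians) of the
    2-dim covariance matrix [[var_x, cov_xy], [cov_xy, var_y]], via the
    closed-form eigen-system of a symmetric 2x2 matrix."""
    trace, det = var_x + var_y, var_x * var_y - cov_xy ** 2
    disc = math.sqrt(max(trace ** 2 - 4 * det, 0.0))
    major, minor = (trace + disc) / 2, (trace - disc) / 2
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    return major, minor, angle

# A diagonal covariance is already axis-aligned:
print(ellipse_from_covariance(2.0, 1.0, 0.0))  # (2.0, 1.0, 0.0)
```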

       To get contour values, set the keyword sigma_contours_att_list to a list of real valued attribute indices
       (from .hd2 file), and request an influence values report with the data option.  For example,

            report_mode = "data"
            sigma_contours_att_list = 3, 4, 5, 8, 15

   OUTPUT REPORT PARAMETERS
       The contents of the output report are controlled by the ".r-params" file.  In this file, an empty line or
       a  line  starting  with one of these characters is treated as a comment: "#", "!", or ";".  The parameter
       name and its value can be separated by an equal sign, a space, or a tab:

            n_clsfs 1
            n_clsfs = 1
            n_clsfs<tab>1

       Spaces are ignored if "=" or "<tab>" are used as separators.  Note there are no trailing semicolons.

       The following are the allowed parameters and their default values:

       n_clsfs = 1
              number of clsfs in the .results file for which to generate reports, starting  with  the  first  or
              "best".

       clsf_n_list =
              if  specified, this is a one-based index list of clsfs in the clsf sequence read from the .results
              file.  It overrides "n_clsfs".  For example:

                   clsf_n_list = 1, 2

              will produce the same output as

                   n_clsfs = 2

              but

                   clsf_n_list = 2

              will only output the "second best" classification report.

       report_type =
              type of reports to generate: "all", "influence_values", "xref_case", or "xref_class".

       report_mode =
              mode of reports to generate. "text" is formatted text layout.  "data" is numerical -- suitable for
              further processing.

       comment_data_headers_p = false
              the  default  value  does  not insert # in column 1 of most report_mode = "data" header lines.  If
              specified as true, the comment character will be inserted in most header lines.

       num_atts_to_list =
              if specified, the number of attributes to list in influence values report.  if not specified,  all
              attributes will be listed.  (e.g. "num_atts_to_list = 5")

       xref_class_report_att_list =
              if  specified,  a  list  of  attribute  numbers  (zero-based),  whose values will be output in the
              "xref_class" report along with the case probabilities.  if not  specified,  no  attributes  values
              will be output.  (e.g. "xref_class_report_att_list = 1, 2, 3")

       order_attributes_by_influence_p = true
              The  default value lists each class's attributes in descending order of attribute influence value,
              and uses ".influ-o-text-n" as the influence values report file type.  If specified as false,  then
              each  class's  attributes will be listed in ascending order by attribute number.  The extension of
              the file generated will be "influ-no-text-n".

       break_on_warnings_p = true
              The default value asks the user whether to continue or  not  when  data  definition  warnings  are
              found.   If specified as false, then AutoClass will continue, despite warnings -- the warning will
              continue to be output to the terminal.

       free_storage_p = true
              The default value tells AutoClass to free the majority of its  allocated  storage.   This  is  not
              required, and in the case of the DEC Alpha causes a core dump [is this still true?].  If specified
              as false, AutoClass will not attempt to free storage.

       max_num_xref_class_probs = 5
               Determines how many lesser class probabilities will be printed  for  the  case  and  class  cross-
               reference  reports.  The default is to print the most probable class probability value and up to 4
               lesser class probabilities.  Note this is true for  both  the  "text"  and  "data"  class  cross-
               reference  reports,  but  only  true for the "data" case cross-reference report.  The "text" case
               cross-reference report only has the most probable class probability.

       sigma_contours_att_list =
               If specified, a list of real valued attribute indices (from .hd2 file) will be used  to  compute
              class  contour values, when generating influence values report with the data option (report_mode =
              "data").   If  not  specified,  there  will   be   no   sigma   class   contour   output.    (e.g.
              "sigma_contours_att_list = 3, 4, 5, 8, 15")

INTERPRETATION OF AUTOCLASS RESULTS

   WHAT HAVE YOU GOT?
       Now you have run AutoClass on your data set -- what have you got?  Typically, the AutoClass search
       procedure finds many classifications, but only saves the few best.  These are now available for
       inspection and interpretation.  The most important indicator of the relative merits of these alternative
       classifications is the Log total posterior probability value.  Note that since the probability lies
       between 0 and 1, the corresponding Log probability is negative and ranges from 0 to negative infinity.
       Raising e to the power of the difference between these Log probability values gives the relative
       probability of the alternative classifications.  So a difference of, say, 100 implies that one
       classification is e^100 ~= 10^43 times more likely than the other.  However, these numbers can be very
       misleading, since they give the relative probability of alternative classifications under the AutoClass
       assumptions.

   ASSUMPTIONS
       Specifically,  the  most important AutoClass assumptions are the use of normal models for real variables,
       and the assumption of independence of attributes within a  class.   Since  these  assumptions  are  often
       violated  in  practice,  the  difference  in  posterior probability of alternative classifications can be
       partly due to one classification being closer to satisfying the assumptions than another, rather than  to
       a  real  difference  in  classification  quality.  Another source of uncertainty about the utility of Log
       probability values is that they do not take into account any specific prior knowledge the user  may  have
       about the domain.  This means that it is often worth looking at alternative classifications to see if you
       can interpret them, but it is worth starting from  the  most  probable  first.   Note  that  if  the  Log
       probability  value  is  much  greater  than  that  for  the  one  class  case, it is saying that there is
       overwhelming evidence for some structure in the data, and part of this structure has been captured by the
       AutoClass classification.

   INFLUENCE REPORT
       So  you  have now picked a classification you want to examine, based on its Log probability value; how do
       you examine it?  The first thing to do is to generate an "influence" report on the  classification  using
       the  report  generation  facilities  documented in /usr/share/doc/autoclass/reports-c.text.  An influence
       report is designed to summarize the important information buried in the AutoClass data structures.

       The first part of this report gives the heuristic class "strengths".  Class "strength" is here defined as
       the geometric mean probability that any instance "belonging to" the class would have been generated from the
       class probability model.  It thus provides a heuristic measure of how strongly each class predicts  "its"
       instances.
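       The "strength" computation described above can be sketched as follows.  The instance probabilities are
       hypothetical, and this is an illustration rather than AutoClass's internal code:

```python
import math

# Hypothetical probabilities that each instance "belonging to" the class
# would have been generated from the class probability model.
instance_probs = [0.92, 0.85, 0.73, 0.96, 0.88]

# Geometric mean, computed in log space for numerical stability.
strength = math.exp(sum(math.log(p) for p in instance_probs) / len(instance_probs))
print(f"class strength: {strength:.3f}")
```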

       The  second  part  is  a  listing  of  the  overall  "influence"  of  each  of the attributes used in the
       classification.  These give a rough heuristic measure of the relative importance of each attribute in the
       classification.  Attribute "influence values" are a class probability weighted average of the "influence"
       of each attribute in the classes, as described below.

       The next part of the report is a summary description of each of the classes.  The classes are arbitrarily
       numbered from 0 up to n, in order of descending class weight.  A class weight of, say, 34.1 means that
       the weighted sum of membership probabilities for the class is 34.1.  Note that a class weight of 34 does not
       necessarily mean that 34 cases belong to that class, since many cases may have only partial membership in
       that class.  Within each class, attributes or attribute sets are ordered  by  the  "influence"  of  their
       model term.

   CROSS ENTROPY
       A commonly used measure of the divergence between two probability distributions is the cross entropy: the
       sum over all possible values x, of P(x|c...)*log[P(x|c...)/P(x|g...)], where c...  and  g...  define  the
       distributions.  It ranges from zero, for identical distributions, to infinity for distributions placing
       probability 1 on differing  values  of  an  attribute.   With  conditionally  independent  terms  in  the
       probability  distributions,  the  cross entropy can be factored to a sum over these terms.  These factors
       provide a measure  of  the  corresponding  modeled  attribute's  influence  in  differentiating  the  two
       distributions.

       We  define  the  modeled  term's  "influence"  on  a  class  to  be  the cross entropy term for the class
       distribution w.r.t. the global class distribution of the single  class  classification.   "Influence"  is
       thus  a  measure  of  how  strongly the model term helps differentiate the class from the whole data set.
       With independently modeled attributes, the influence  can  legitimately  be  ascribed  to  the  attribute
       itself.   With  correlated  or  covariant  attributes sets, the cross entropy factor is a function of the
       entire set, and we distribute the influence value equally over the modeled attributes.
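       For a single discrete attribute, the cross entropy term and its use as an influence value can be
       sketched as follows (the distributions are hypothetical):

```python
import math

def cross_entropy(p_class, p_global):
    """Sum over outcomes x of P(x|c) * log(P(x|c) / P(x|g)): zero for
    identical distributions, growing as the class distribution diverges
    from the global one."""
    return sum(pc * math.log(pc / pg)
               for pc, pg in zip(p_class, p_global) if pc > 0)

# Hypothetical outcome probabilities for one discrete attribute.
p_class = [0.70, 0.20, 0.10]   # distribution within the class
p_global = [0.30, 0.40, 0.30]  # distribution over the whole data set

influence = cross_entropy(p_class, p_global)
print(f"influence of this attribute: {influence:.3f}")
```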

   ATTRIBUTE INFLUENCE VALUES
       In the "influence" report on each class, the attribute parameters for that class are given  in  order  of
       highest  influence  value  for  the model term attribute sets.  Only the first few attribute sets usually
       have significant influence values.  If an influence value drops below about 20%  of  the  highest  value,
       then  it is probably not significant, but all attribute sets are listed for completeness.  In addition to
       the influence value for each attribute set, the values of the attribute set parameters in that class  are
       given  along  with  the  corresponding "global" values.  The global values are computed directly from the
       data independent of the classification.  For example, if the class mean of attribute "temperature" is  90
       with  standard  deviation  of 2.5, but the global mean is 68 with a standard deviation of 16.3, then this
       class has selected out cases with much higher than average temperature, and a rather small spread in this
       high  range.   Similarly,  for  discrete attribute sets, the probability of each outcome in that class is
       given, along with the corresponding global probability -- ordered by its significance: the absolute value
       of  (log  {<local-probability>  /  <global-probability>}).   The sign of the significance value shows the
       direction of change from the global class.  This information gives an overview of how each class  differs
       from the average for all the data, in order of the most significant differences.
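       The significance ordering for discrete outcomes described above can be sketched as follows (the outcome
       probabilities are hypothetical):

```python
import math

# Hypothetical (outcome, local probability, global probability) triples
# for one discrete attribute in one class.
outcomes = [("low", 0.05, 0.30), ("medium", 0.15, 0.40), ("high", 0.80, 0.30)]

# Significance is the absolute value of log(local / global); the sign of
# the log shows the direction of change from the global class.
ranked = sorted(outcomes, key=lambda o: abs(math.log(o[1] / o[2])), reverse=True)
for name, local, glob in ranked:
    print(f"{name}: significance {math.log(local / glob):+.3f}")
```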

   CLASS AND CASE REPORTS
       Having  gained a description of the classes from the "influence" report, you may want to follow-up to see
       which classes your favorite cases ended up in.  Conversely, you may want to see which cases belong  to  a
       particular  class.   For  this  kind  of  cross-reference  information  two  complementary reports can be
       generated.  These are more fully documented in /usr/share/doc/autoclass/reports-c.text.  The "class"
       report lists all the cases which have significant membership in each class and the degree to which each
       such case belongs to that class.  Cases whose class membership is less than 90% in the current class have
       their  other  class  membership  listed as well.  The cases within a class are ordered in increasing case
       number.  The alternative "cases" report states which class (or  classes)  a  case  belongs  to,  and  the
       membership  probability  in  the  most  probable  class.  These two reports allow you to find which cases
       belong to which classes or the other way around.  If nearly every case has close to 99% membership  in  a
       single  class, then it means that the classes are well separated, while a high degree of cross-membership
       indicates that the classes are heavily overlapped.  Highly overlapped classes are an indication that
       the idea of classification is breaking down, and that a group of mutually highly overlapped classes, a
       kind of meta class, is probably a better way of understanding the data.

   COMPARING CLASS WEIGHTS AND CLASS/CASE REPORT ASSIGNMENTS
       The class weight, given as the class probability parameter, is essentially the sum over all data
       instances of the normalized probability that the instance is a member of the class.  It is probably an
       error on our part that we format this number as an integer in the report,  rather  than  emphasizing  its
       real  nature.   You  will  find  the  actual  real  value  recorded  as the w_j parameter in the class_DS
       structures on any .results[-bin] file.

       The .case and .class reports give probabilities that cases are members of  classes.   Any  assignment  of
       cases  to  classes  requires  some  decision  rule.   The  maximum  probability  assignment rule is often
       implicitly assumed, but it cannot be expected that the resulting partition sizes  will  equal  the  class
       weights  unless  nearly  all  class  membership  probabilities are effectively one or zero.  With non-1/0
       membership probabilities, matching the class weights requires summing the probabilities.
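       The distinction between class weights and maximum-probability assignment counts can be illustrated with
       hypothetical membership probabilities:

```python
# Hypothetical membership probabilities: rows are cases, columns are classes.
memberships = [
    [0.90, 0.10],
    [0.60, 0.40],
    [0.55, 0.45],
    [0.20, 0.80],
]

# Class weight: sum of membership probabilities over all cases.
weights = [sum(case[j] for case in memberships) for j in range(2)]

# Maximum-probability assignment: each case counted once, in its most
# probable class.
counts = [0, 0]
for case in memberships:
    counts[case.index(max(case))] += 1

print("class weights:", [round(w, 2) for w in weights])   # [2.25, 1.75]
print("assignment counts:", counts)                       # [3, 1]
```

       With these non-1/0 memberships the assignment counts (3 and 1) do not match the class weights (2.25 and
       1.75); only summing the probabilities recovers the weights.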

       In addition, there is the question of completeness of the EM (expectation maximization) convergence.   EM
       alternates  between  estimating  class  parameters  and estimating class membership probabilities.  These
       estimates converge on each other, but never actually  meet.   AutoClass  implements  several  convergence
       algorithms  with  alternate stopping criteria using appropriate parameters in the .s-params file.  Proper
       setting  of  these  parameters,  to  get  reasonably  complete  and  efficient  convergence  may  require
       experimentation.
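       The EM alternation described above can be sketched for a one-dimensional, two-class mixture of normals.
       This is an illustrative sketch, not AutoClass's implementation, and the simple log likelihood threshold
       stands in for the convergence parameters in the .s-params file:

```python
import math
import random

random.seed(0)
# Hypothetical 1-D data: two overlapping normal classes.
data = ([random.gauss(0.0, 1.0) for _ in range(100)]
        + [random.gauss(4.0, 1.0) for _ in range(100)])

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Initial guesses for the class parameters and class weights.
mus, sigmas, weights = [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]
prev_ll = -math.inf
for iteration in range(200):
    # E-step: estimate class membership probabilities for each case.
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate class parameters from the memberships.
    for j in range(2):
        nj = sum(r[j] for r in resp)
        mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
        sigmas[j] = max(1e-6, math.sqrt(
            sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj))
        weights[j] = nj / len(data)
    # Stopping criterion: change in total log likelihood below a threshold.
    ll = sum(math.log(sum(w * normal_pdf(x, m, s)
                          for w, m, s in zip(weights, mus, sigmas))) for x in data)
    if ll - prev_ll < 1e-6:
        break
    prev_ll = ll

print(f"stopped after {iteration + 1} iterations; class means ~ {sorted(round(m, 2) for m in mus)}")
```

       The two estimates converge on each other without ever meeting exactly, which is why a stopping
       threshold is needed at all.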

   ALTERNATIVE CLASSIFICATIONS
       In  summary,  the  various  reports  that  can  be  generated  give  you  a  way  of  viewing the current
       classification.  It is usually a good idea to look at alternative classifications even though they do
       not have the highest Log probability values.  These other classifications usually have classes that
       correspond closely to strong classes in other classifications, but can differ in the weak  classes.   The
       "strength"  of  a  class  within  a  classification can usually be judged by how dramatically the highest
       influence value attributes in the class differ from the corresponding global attributes.  If none of  the
       classifications  seem  quite  satisfactory,  it is always possible to run AutoClass again to generate new
       classifications.

   WHAT NEXT?
       Finally, the question of what to do after you have found an insightful classification  arises.   Usually,
       classification  is a preliminary data analysis step for examining a set of cases (things, examples, etc.)
       to see if they can be grouped so that members of the group are "similar" to each other.  AutoClass  gives
       such  a  grouping  without  the  user  having  to define a similarity measure.  The built-in "similarity"
       measure is the mutual predictiveness of the cases.  The next step is to try to "explain" why some objects
       are more like others than those in a different group.  Usually, domain knowledge suggests an answer.  For
       example, a classification of people based on income, buying  habits,  location,  age,  etc.,  may  reveal
       particular social classes that were not obvious before the classification analysis.  Gathering further
       attributes for such classes, such as number of cars or which TV shows are watched, would reveal even
       more about them.  Longitudinal studies would give information about how social
       classes arise and what influences their attitudes -- all  of  which  is  going  way  beyond  the  initial
       classification.

PREDICTIONS

       Classifications can be used to predict class membership for new cases.  So in addition to possibly giving
       you some insight into the structure behind your  data,  you  can  now  use  AutoClass  directly  to  make
       predictions, and compare AutoClass to other learning systems.

       This  technique  for  predicting  class probabilities is applicable to all attributes, regardless of data
       type/sub_type or likelihood model term type.

       In the event that the class membership of a data case does not exceed 0.0099999 for any of the "training"
       classes, the following message will appear in the screen output for each case:

               xref_get_data: case_num xxx => class 9999

       Class  9999 members will appear in the "case" and "class" cross-reference reports with a class membership
       of 1.0.
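       The documented rule can be sketched as follows; the function name and structure are hypothetical, not
       AutoClass source, and the 0.0099999 threshold is taken from the text above:

```python
def xref_class(case_num, memberships, threshold=0.0099999):
    """Report the class for one case: if no training class membership
    exceeds the threshold, report the catch-all class 9999 (a sketch
    of the documented behavior, not AutoClass's own code)."""
    best = max(memberships)
    if best <= threshold:
        print(f"xref_get_data: case_num {case_num} => class 9999")
        return 9999
    return memberships.index(best)

print(xref_class(7, [0.005, 0.003, 0.002]))   # no class exceeds the threshold
print(xref_class(8, [0.10, 0.85, 0.05]))      # ordinary case: most probable class
```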

       Cautionary Points:

       The usual way of using AutoClass is to put all of your data in a data_file, describe that data with model
       and  header  files,  and  run  "autoclass  -search".   Now, instead of one data_file you will have two, a
       training_data_file and a test_data_file.

       It is most important that both databases have the same AutoClass internal representation.  Should this
       not be true, AutoClass will exit, or possibly, in some situations, crash.  The prediction mode is
       designed to direct the user toward conforming to this requirement.

       Preparation:

       Prediction requires having a training classification and a test database.  The training classification
       is generated by running "autoclass -search" on the training data_file ("data/soybean/soyc.db2"), for
       example:

           autoclass -search data/soybean/soyc.db2 data/soybean/soyc.hd2
               data/soybean/soyc.model data/soybean/soyc.s-params

       This will produce "soyc.results-bin" and "soyc.search".  Then create a "reports" parameter file, such  as
       "soyc.r-params"  (see /usr/share/doc/autoclass/reports-c.text), and run AutoClass in "reports" mode, such
       as:

           autoclass -reports data/soybean/soyc.results-bin
               data/soybean/soyc.search data/soybean/soyc.r-params

       This will generate class and case cross-reference files, and an influence values file.   The  file  names
       are based on the ".r-params" file name:

               data/soybean/soyc.class-text-1
               data/soybean/soyc.case-text-1
               data/soybean/soyc.influ-text-1

       These  will describe the classes found in the training_data_file.  Now this classification can be used to
       predict the probabilistic class membership of the test_data_file cases  ("data/soybean/soyc-predict.db2")
       in the training_data_file classes.

           autoclass -predict data/soybean/soyc-predict.db2
               data/soybean/soyc.results-bin data/soybean/soyc.search
               data/soybean/soyc.r-params

       This  will  generate  class  and case cross-reference files for the test_data_file cases predicting their
       probabilistic class memberships in the training_data_file classes.  The  file  names  are  based  on  the
       ".db2" file name:

               data/soybean/soyc-predict.class-text-1
               data/soybean/soyc-predict.case-text-1

SEE ALSO

       AutoClass is documented fully here:

       /usr/share/doc/autoclass/introduction-c.text Guide to the documentation

       /usr/share/doc/autoclass/preparation-c.text How to prepare data for use by AutoClass

       /usr/share/doc/autoclass/search-c.text How to run AutoClass to find classifications.

       /usr/share/doc/autoclass/reports-c.text How to examine the classification in various ways.

       /usr/share/doc/autoclass/interpretation-c.text How to interpret AutoClass results.

       /usr/share/doc/autoclass/checkpoint-c.text Protocols for running a checkpointed search.

       /usr/share/doc/autoclass/prediction-c.text Use classifications to predict class membership for new cases.

       These provide supporting documentation:

       /usr/share/doc/autoclass/classes-c.text What classification is all about, for beginners.

       /usr/share/doc/autoclass/models-c.text Brief descriptions of the model term implementations.

       The mathematical theory behind AutoClass is explained in these documents:

       /usr/share/doc/autoclass/kdd-95.ps   Postscript  file  containing:  P.  Cheeseman,  J.  Stutz,  "Bayesian
       Classification (AutoClass): Theory and Results", in "Advances in Knowledge Discovery  and  Data  Mining",
       Usama  M.  Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, & Ramasamy Uthurusamy, Eds. The AAAI Press,
       Menlo Park, expected fall 1995.

       /usr/share/doc/autoclass/tr-fia-90-12-7-01.ps  Postscript  file  containing:  R.  Hanson,  J.  Stutz,  P.
       Cheeseman,  "Bayesian Classification Theory", Technical Report FIA-90-12-7-01, NASA Ames Research Center,
       Artificial Intelligence Branch, May 1991 (The figures are not included, since they were inserted by "cut-
       and-paste" methods into the original "camera-ready" copy.)

AUTHORS

       Dr. Peter Cheeseman
       Principal Investigator - NASA Ames, Computational Sciences Division
       cheesem@ptolemy.arc.nasa.gov

       John Stutz
       Research Programmer - NASA Ames, Computational Sciences Division
       stutz@ptolemy.arc.nasa.gov

       Will Taylor
       Support Programmer - NASA Ames, Computational Sciences Division
       taylor@ptolemy.arc.nasa.gov

SEE ALSO

       multimix(1).

                                                December 9, 2001                                    AUTOCLASS(1)