NAME

       mlpack_kmeans - k-means clustering

SYNOPSIS

       mlpack_kmeans [-h] [-v] -c int -i string [-a string] [-e] [-C string] [-P] [-I string] [-l] [-m int] [-o string] [-p double] [-r] [-S int] [-s int] [-V]

DESCRIPTION

This program performs K-Means clustering on the given dataset, storing the learned cluster
assignments either as a column of labels in the file containing the input dataset or in a
separate file. Empty clusters are not allowed by default; when a cluster becomes empty,
the point furthest from the centroid of the cluster with maximum variance is taken to fill
that cluster.
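
The empty-cluster rule above can be sketched in a few lines of Python. This is only an
illustration of the policy as described here, not mlpack's implementation: the function
names are hypothetical, "variance" is assumed to mean the mean squared distance of a
cluster's points to its centroid, and only clusters with more than one point are assumed
to be eligible to give up a point.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fill_empty_cluster(points, assignments, centroids, empty):
    """Move one point into the empty cluster `empty`; return that point's index.

    Assumed policy: pick the cluster with the largest variance, then take the
    member of that cluster that lies furthest from its centroid.
    """
    k = len(centroids)
    members = [[] for _ in range(k)]
    for idx, cluster in enumerate(assignments):
        members[cluster].append(idx)

    def variance(cluster):
        # Mean squared distance of the cluster's points to its centroid.
        total = sum(sq_dist(points[i], centroids[cluster]) for i in members[cluster])
        return total / len(members[cluster])

    # Assumption: a donor must keep at least one point after giving one up.
    donors = [c for c in range(k) if len(members[c]) > 1]
    if not donors:
        raise ValueError("no cluster can give up a point")
    donor = max(donors, key=variance)
    furthest = max(members[donor], key=lambda i: sq_dist(points[i], centroids[donor]))
    assignments[furthest] = empty
    return furthest
```

Centroids are left untouched in this sketch; they would be recomputed by the next
iteration.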

Optionally, the Bradley and Fayyad approach ("Refining initial points for k-means
clustering", 1998) can be used to select initial points by specifying the --refined_start
(-r) option. This approach works by taking random samples of the dataset; the number of
samples is set with the --samplings (-S) parameter, and the fraction of the dataset used in
each sample is set with the --percentage (-p) parameter (a value between 0.0 and 1.0).
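
To make the two sampling parameters concrete, the following Python sketch draws the random
samples only; it does not perform the refinement step of Bradley and Fayyad. The function
name is hypothetical, and two details are assumptions not stated on this page: the sample
size is taken as int(percentage * N), and points are drawn uniformly with replacement.

```python
import random


def draw_samplings(num_points, samplings=100, percentage=0.02, seed=None):
    """Return `samplings` lists of point indices, one list per random sample.

    Assumptions: each sample holds int(percentage * num_points) indices,
    drawn uniformly with replacement.
    """
    if not 0.0 < percentage <= 1.0:
        raise ValueError("percentage must be a fraction in (0.0, 1.0]")
    rng = random.Random(seed)
    sample_size = int(percentage * num_points)
    return [
        [rng.randrange(num_points) for _ in range(sample_size)]
        for _ in range(samplings)
    ]
```

With the default values, each of the 100 samples covers roughly 2% of the dataset, so the
cost of the initialization grows with both parameters.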

There are several options available for the algorithm used for each Lloyd iteration,
specified with the --algorithm (-a) option. The standard O(kN) approach can be used
('naive'). Other options include the Pelleg-Moore tree-based algorithm ('pelleg-moore'),
Elkan's triangle-inequality based algorithm ('elkan'), Hamerly's modification to Elkan's
algorithm ('hamerly'), the dual-tree k-means algorithm ('dualtree'), and the dual-tree
k-means algorithm using the cover tree ('dualtree-covertree').
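
For reference, one iteration of the standard O(kN) approach can be sketched as follows.
This is a plain-Python illustration of a textbook Lloyd iteration, not mlpack's code; the
function name is hypothetical, and an empty cluster simply keeps its old centroid here
rather than being refilled.

```python
def lloyd_iteration(points, centroids):
    """Run one naive Lloyd iteration; return (new_centroids, assignments)."""
    k = len(centroids)
    dim = len(points[0])

    # Assignment step: compare every point against every centroid, O(kN).
    assignments = []
    for p in points:
        dists = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
        assignments.append(dists.index(min(dists)))

    # Update step: move each centroid to the mean of its assigned points.
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, a in zip(points, assignments):
        counts[a] += 1
        for d in range(dim):
            sums[a][d] += p[d]

    new_centroids = []
    for c in range(k):
        if counts[c] == 0:
            # Empty cluster: keep the old centroid in this sketch.
            new_centroids.append(list(centroids[c]))
        else:
            new_centroids.append([s / counts[c] for s in sums[c]])
    return new_centroids, assignments
```

The other algorithms are designed to reduce the number of distance computations in the
assignment step.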

As of October 2014, the --overclustering option has been removed. If you want this support
back, let us know -- file a bug at https://github.com/mlpack/mlpack/ or get in touch
through another means.

REQUIRED OPTIONS

       --clusters (-c) [int]
              Number of clusters to find (0 autodetects from initial centroids).

       --input_file (-i) [string]
              Input dataset to perform clustering on.

OPTIONS

       --algorithm (-a) [string]
              Algorithm to use for the Lloyd iteration ('naive', 'pelleg-moore', 'elkan',
              'hamerly', 'dualtree', or 'dualtree-covertree'). Default value 'naive'.

       --allow_empty_clusters (-e)
              Allow empty clusters to be created.

       --centroid_file (-C) [string]
              If specified, the centroids of each cluster will be written to the given file.
              Default value ''.

       --help (-h)
              Default help info.

       --in_place (-P)
              If specified, a column containing the learned cluster assignments will be added to
              the input dataset file. In this case, --output_file is overridden.

       --info [string]
              Get help on a specific module or option.  Default value ''.

       --initial_centroids (-I) [string]
              Start with the specified initial centroids.  Default value ''.

       --labels_only (-l)
              Only output labels into output file.

       --max_iterations (-m) [int]
              Maximum number of iterations before K-Means terminates. Default value 1000.

       --output_file (-o) [string]
              File to write output labels or labeled data to.  Default value ''.

       --percentage (-p) [double]
              Fraction of the dataset (between 0.0 and 1.0) to use for each refined start
              sampling (use when --refined_start is specified). Default value 0.02.

       --refined_start (-r)
              Use the refined initial point strategy by Bradley and Fayyad to choose initial
              points.

       --samplings (-S) [int]
              Number of samplings to perform for refined start (use when --refined_start is
              specified). Default value 100.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used.  Default value 0.

       --verbose (-v)
              Display informational messages and the full list of parameters and timers at the
              end of execution.

       --version (-V)
              Display the version of mlpack.
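
       As an illustration of how the options above combine, the following Python sketch
       assembles an argument list for the program. The helper name and the file names are
       hypothetical, and passing each value as a separate argument after its long option
       is assumed rather than confirmed by this page.

```python
def build_kmeans_command(input_file, clusters, output_file=None,
                         algorithm=None, centroid_file=None,
                         refined_start=False, samplings=None,
                         percentage=None, seed=None):
    """Return an argv list for mlpack_kmeans using only options from this page."""
    argv = ["mlpack_kmeans",
            "--input_file", input_file,
            "--clusters", str(clusters)]
    if output_file:
        argv += ["--output_file", output_file]
    if centroid_file:
        argv += ["--centroid_file", centroid_file]
    if algorithm:
        argv += ["--algorithm", algorithm]
    if refined_start:
        argv.append("--refined_start")
        # --samplings and --percentage only apply with --refined_start.
        if samplings is not None:
            argv += ["--samplings", str(samplings)]
        if percentage is not None:
            argv += ["--percentage", str(percentage)]
    if seed is not None:
        argv += ["--seed", str(seed)]
    return argv


# Hypothetical file names, for illustration only.
command = build_kmeans_command("data.csv", 5, output_file="labels.csv",
                               algorithm="elkan", seed=42)
print(" ".join(command))
```

       Passing the list to subprocess.run(command, check=True) would then run the program,
       assuming mlpack_kmeans is installed and on the PATH.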

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations, and theory, consult the
       documentation found at http://www.mlpack.org or included with your distribution of mlpack.

                                                                                 mlpack_kmeans(1)