NAME

       mlpack_kmeans - k-means clustering

SYNOPSIS

        mlpack_kmeans -c int -i string [-a string] [-e bool] [-P bool] [-I string] [-E bool] [-l bool] [-m int] [-p double] [-r bool] [-S int] [-s int] [-V bool] [-C string] [-o string] [-h -v]

DESCRIPTION

       This  program  performs  K-Means  clustering  on  the  given  dataset.  It can return the learned cluster
       assignments, and the centroids of the clusters. Empty clusters are not allowed by default; when a cluster
       becomes empty, the point furthest from the centroid of the cluster with maximum variance is taken to fill
       that cluster.

       Optionally, the Bradley and Fayyad approach ("Refining initial points for k-means clustering", 1998)  can
       be  used to select initial points by specifying the '--refined_start (-r)' parameter. This approach works
       by taking random samplings of the dataset; to specify the number of  samplings,  the  '--samplings  (-S)'
       parameter  is  used,  and  to  specify  the  percentage  of  the  dataset  to be used in each sample, the
       '--percentage (-p)' parameter is used (it should be a value between 0.0 and 1.0).
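
       The sampling step can be pictured with the short Python sketch below. It is only an illustration and not
       mlpack's implementation: the helper name draw_samplings, sampling without replacement, and the rounding
       of the sample size are all assumptions. The defaults of 100 samplings and 0.02 are the ones listed under
       OPTIONAL INPUT OPTIONS.

```python
import random


def draw_samplings(points, samplings=100, percentage=0.02, seed=42):
    """Draw `samplings` random subsamples, each holding `percentage` of the points.

    Hypothetical helper: sampling without replacement and truncating the
    sample size are assumptions, not documented mlpack behavior.
    """
    rng = random.Random(seed)
    # Size of every subsample; keep at least one point so no sample is empty.
    sample_size = max(1, int(percentage * len(points)))
    return [rng.sample(points, sample_size) for _ in range(samplings)]


# 1000 two-dimensional points, 100 samplings of 2% each.
points = [(float(i), float(i % 7)) for i in range(1000)]
subsamples = draw_samplings(points)
print(len(subsamples), len(subsamples[0]))
```

       In the Bradley and Fayyad approach each subsample is then clustered on its own, and the resulting
       centroids are used to choose the starting points for the full run; that refinement step is omitted here.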

       There are several options available for the algorithm used for each Lloyd iteration, specified  with  the
       '--algorithm  (-a)'  option. The standard O(kN) approach can be used ('naive'). Other options include the
       Pelleg-Moore  tree-based  algorithm  ('pelleg-moore'),  Elkan's   triangle-inequality   based   algorithm
       ('elkan'),  Hamerly's  modification  to  Elkan's  algorithm  ('hamerly'), the dual-tree k-means algorithm
       ('dualtree'), and the dual-tree k-means algorithm using the cover tree ('dualtree-covertree').
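
       For orientation, a single iteration of the standard O(kN) approach might look like the following Python
       sketch. It is not mlpack's code: the function name lloyd_iteration is hypothetical, and keeping the old
       centroid for a cluster that owns no points is a simplification chosen for this example. The other
       algorithms produce the same kind of result while avoiding many of the distance computations.

```python
def lloyd_iteration(points, centroids):
    """One naive Lloyd iteration: assign each point, then recompute the means."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    assignments = []
    for p in points:
        # Scan all k centroids for each of the N points: O(kN) distance evaluations.
        best = min(
            range(k),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
        )
        assignments.append(best)
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += p[d]
    # New centroid = mean of the owned points; an empty cluster keeps its old
    # centroid here (a simplification for this sketch).
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return assignments, new_centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
assignments, centroids = lloyd_iteration(points, [(1.0, 1.0), (9.0, 1.0)])
print(assignments)  # [0, 0, 1, 1]
print(centroids)    # [[0.0, 1.0], [10.0, 1.0]]
```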

       The behavior for when an empty cluster is encountered can be modified  with  the  '--allow_empty_clusters
       (-e)'  option.  When  this  option  is specified and there is a cluster owning no points at the end of an
       iteration, that cluster's centroid will simply remain in its position from the previous iteration. If the
       '--kill_empty_clusters  (-E)'  option  is  specified, then when a cluster owns no points at the end of an
       iteration, the cluster centroid is simply filled with DBL_MAX, killing it and effectively reducing k  for
       the  rest of the computation. Note that the default option when neither empty cluster option is specified
       can be time-consuming  to  calculate;  therefore,  specifying  either  of  these  parameters  will  often
       accelerate runtime.
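
       The three empty-cluster policies can be compared with the Python sketch below. It is a simplified
       illustration under stated assumptions, not mlpack's implementation: the helper fill_empty_cluster and
       its policy names are hypothetical, variance is taken to be the mean squared distance to the centroid,
       and the donor cluster's centroid is not recomputed after it gives up a point.

```python
import sys

DBL_MAX = sys.float_info.max  # same value as C's DBL_MAX for IEEE 754 doubles


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fill_empty_cluster(empty, points, assignments, old_centroids, centroids,
                       policy="default"):
    """Update centroids[empty] (and possibly assignments) in place."""
    if policy == "allow":
        # --allow_empty_clusters: keep the position from the previous iteration.
        centroids[empty] = list(old_centroids[empty])
        return
    if policy == "kill":
        # --kill_empty_clusters: fill with DBL_MAX so no point is ever closest.
        centroids[empty] = [DBL_MAX] * len(old_centroids[empty])
        return
    # Default: find the cluster with maximum variance (assumed here to be the
    # mean squared distance to its centroid) ...
    k = len(centroids)
    totals, counts = [0.0] * k, [0] * k
    for p, a in zip(points, assignments):
        totals[a] += sq_dist(p, centroids[a])
        counts[a] += 1
    donor = max((j for j in range(k) if counts[j] > 0),
                key=lambda j: totals[j] / counts[j])
    # ... and move its furthest point into the empty cluster.
    furthest = max((i for i, a in enumerate(assignments) if a == donor),
                   key=lambda i: sq_dist(points[i], centroids[donor]))
    assignments[furthest] = empty
    centroids[empty] = list(points[furthest])


points = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
assignments = [0, 0, 1, 1, 1]
old_centroids = [[0.0, 0.0], [12.0, 0.0], [50.0, 50.0]]
centroids = [[0.5, 0.0], [13.0, 0.0], [50.0, 50.0]]  # cluster 2 owns no points
fill_empty_cluster(2, points, assignments, old_centroids, centroids)
print(assignments, centroids[2])  # [0, 0, 1, 1, 2] [20.0, 0.0]
```

       The default branch has to scan every point to compute the variances, which is why the other two policies
       tend to be cheaper.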

       Initial centroids may be specified using the '--initial_centroids_file (-I)' parameter, and the maximum
       number of iterations may be specified with the '--max_iterations (-m)' parameter.

       As an example, to use Hamerly's algorithm  to  perform  k-means  clustering  with  k=10  on  the  dataset
       'data.csv',   saving   the   centroids   to  'centroids.csv'  and  the  assignments  for  each  point  to
       'assignments.csv', the following command could be used:

       $ mlpack_kmeans --input_file data.csv --clusters 10 --algorithm hamerly --output_file assignments.csv
       --centroid_file centroids.csv

       To run k-means on that same dataset with initial centroids specified in 'initial.csv' and a maximum of
       500 iterations, storing the output centroids in 'final.csv', the following command may be used:

       $ mlpack_kmeans --input_file data.csv --initial_centroids_file initial.csv --clusters 10 --max_iterations
       500 --centroid_file final.csv
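
       When the program is driven from a script, the same command lines can be assembled programmatically. The
       Python sketch below only builds and prints an argument list from the long option names documented on this
       page; the helper kmeans_command is hypothetical, and passing boolean options as bare flags is an
       assumption about the command-line parser.

```python
import shlex


def kmeans_command(input_file, clusters, **options):
    """Build an argv list for mlpack_kmeans from long option names."""
    argv = ["mlpack_kmeans", "--input_file", input_file,
            "--clusters", str(clusters)]
    for name, value in options.items():
        if value is True:
            # Assumption: boolean options are passed as bare flags.
            argv.append(f"--{name}")
        elif value is not False and value is not None:
            argv += [f"--{name}", str(value)]
    return argv


argv = kmeans_command("data.csv", 10, algorithm="hamerly",
                      max_iterations=500, centroid_file="final.csv")
print(shlex.join(argv))
```

       The resulting list could be handed to subprocess.run(argv, check=True) on a system where mlpack_kmeans is
       installed.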

REQUIRED INPUT OPTIONS

       --clusters (-c) [int]
              Number of clusters to find (0 autodetects from initial centroids).

       --input_file (-i) [string]
              Input dataset to perform clustering on.

OPTIONAL INPUT OPTIONS

       --algorithm (-a) [string]
              Algorithm to use for the Lloyd iteration ('naive', 'pelleg-moore', 'elkan', 'hamerly', 'dualtree',
              or 'dualtree-covertree'). Default value 'naive'.

       --allow_empty_clusters (-e) [bool]
              Allow empty clusters to persist.

       --help (-h) [bool]
              Default help info.

       --in_place (-P) [bool]
              If specified, a column containing the learned cluster assignments  will  be  added  to  the  input
              dataset file. In this case, --output_file is overridden. (Do not use in Python.)

       --info [string]
              Print help on a specific option. Default value ''.

       --initial_centroids_file (-I) [string]
              Start with the specified initial centroids.

       --kill_empty_clusters (-E) [bool]
              Remove empty clusters when they occur.

       --labels_only (-l) [bool]
              Only output labels into output file.

       --max_iterations (-m) [int]
              Maximum number of iterations before k-means terminates. Default value 1000.

       --percentage (-p) [double]
              Percentage  of  dataset  to  use  for  each  refined  start  sampling (use when --refined_start is
              specified). Default value 0.02.

       --refined_start (-r) [bool]
              Use the refined initial point strategy by Bradley and Fayyad to choose initial points.

       --samplings (-S) [int]
              Number of samplings to perform for refined start (use when --refined_start is specified).
              Default value 100.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used.  Default value 0.

       --verbose (-v) [bool]
              Display informational messages and the full list of parameters and timers at the end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS

       --centroid_file (-C) [string]
              If specified, the centroids of each cluster will be written to the given file.

       --output_file (-o) [string]
              Matrix to store output labels or labeled data to.

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations,  and  theory,  consult  the  documentation
       found at http://www.mlpack.org or included with your distribution of mlpack.