Ubuntu Manpage: mlpack_kmeans - k-means clustering

Provided by: mlpack-bin_4.1.0-1ubuntu1_amd64

NAME

       mlpack_kmeans - k-means clustering

SYNOPSIS

        mlpack_kmeans -c int -i unknown [-a string] [-e bool] [-P bool] [-I unknown] [-E bool] [-K bool] [-l bool] [-m int] [-p double] [-r bool] [-S int] [-s int] [-V bool] [-C unknown] [-o unknown] [-h -v]

DESCRIPTION

       This  program  performs K-Means clustering on the given dataset. It can return the learned
       cluster assignments, and the centroids of the clusters. Empty clusters are not allowed  by
       default; when a cluster becomes empty, the point furthest from the centroid of the cluster
       with maximum variance is taken to fill that cluster.
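
       The default empty-cluster rule described above can be sketched in a few lines. This is
       an illustration only, not mlpack's implementation: the helper name
       refill_empty_cluster is hypothetical, and "variance" is assumed here to mean the mean
       squared distance of a cluster's points to its centroid. Recomputing the donor cluster's
       centroid afterwards is omitted.

```python
def refill_empty_cluster(points, assignments, centroids, empty):
    """Move one point into the empty cluster `empty` (hypothetical helper).

    The donor is the cluster with the largest variance; the point moved is
    the donor's member that lies furthest from the donor's centroid.
    Returns the index of the moved point.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def variance(c):
        # Assumption: variance = mean squared distance to the centroid.
        members = [p for p, a in zip(points, assignments) if a == c]
        if not members:
            return -1.0  # empty clusters can never be chosen as donor
        return sum(sq_dist(p, centroids[c]) for p in members) / len(members)

    donor = max(range(len(centroids)), key=variance)
    moved = max(
        (i for i, a in enumerate(assignments) if a == donor),
        key=lambda i: sq_dist(points[i], centroids[donor]),
    )
    assignments[moved] = empty
    centroids[empty] = list(points[moved])
    return moved
```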

       Optionally, the strategy to choose initial  centroids  can  be  specified.  The  k-means++
       algorithm  can  be  used  to  choose  initial centroids with the '--kmeans_plus_plus (-K)'
       parameter.  The  Bradley  and  Fayyad  approach  ("Refining  initial  points  for  k-means
       clustering", 1998) can be used to select initial points by specifying the '--refined_start
       (-r)' parameter. This approach works by taking random samplings of the dataset. The
       '--samplings (-S)' parameter sets the number of samplings, and the '--percentage (-p)'
       parameter sets the fraction of the dataset used in each sample (despite its name, it
       should be a value between 0.0 and 1.0).
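
       As a rough illustration of the k-means++ strategy mentioned above, the following
       sketch picks each new centroid with probability proportional to its squared distance
       from the nearest centroid chosen so far. It is a generic textbook version, not
       mlpack's code; the function name and the seed argument are assumptions made for the
       example, and the input is assumed to contain at least k distinct points.

```python
import random


def kmeans_plus_plus(points, k, seed=0):
    """Choose k initial centroids with k-means++ style seeding (sketch)."""
    rng = random.Random(seed)
    # The first centroid is drawn uniformly at random.
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance from every point to its nearest chosen centroid.
        d2 = [
            min(sum((x - y) ** 2 for x, y in zip(p, c)) for c in centroids)
            for p in points
        ]
        # Points far from all current centroids are more likely to be picked;
        # points that coincide with a centroid have weight zero.
        centroids.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centroids
```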

       There  are  several  options  available  for  the algorithm used for each Lloyd iteration,
       specified with the '--algorithm (-a)' option. The standard  O(kN)  approach  can  be  used
       ('naive').  Other  options include the Pelleg-Moore tree-based algorithm ('pelleg-moore'),
       Elkan's triangle-inequality based algorithm ('elkan'), Hamerly's modification  to  Elkan's
       algorithm  ('hamerly'), the dual-tree k-means algorithm ('dualtree'), and the dual-tree k-
       means algorithm using the cover tree ('dualtree-covertree').
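
       For orientation, one iteration of the standard O(kN) approach can be sketched as
       below: every one of the N points is compared against every one of the k centroids,
       and each centroid is then replaced by the mean of its points. This is a minimal
       sketch and not the code behind the 'naive' option; the name lloyd_step is
       hypothetical, and an empty cluster simply keeps its old centroid here.

```python
def lloyd_step(points, centroids):
    """Run one naive Lloyd iteration (sketch).

    Returns (assignments, new_centroids, counts).
    """
    k, dim = len(centroids), len(centroids[0])

    # Assignment step: N points times k centroids distance evaluations.
    assignments = []
    for p in points:
        dists = [sum((x - y) ** 2 for x, y in zip(p, c)) for c in centroids]
        assignments.append(dists.index(min(dists)))

    # Update step: each centroid becomes the mean of its assigned points.
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, a in zip(points, assignments):
        counts[a] += 1
        for j in range(dim):
            sums[a][j] += p[j]

    new_centroids = [
        [s / counts[c] for s in sums[c]] if counts[c] else list(centroids[c])
        for c in range(k)
    ]
    return assignments, new_centroids, counts
```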

       The behavior  for  when  an  empty  cluster  is  encountered  can  be  modified  with  the
       '--allow_empty_clusters (-e)' option. When this option is specified and there is a cluster
       owning no points at the end of an iteration, that cluster's centroid will simply remain in
       its  position  from  the previous iteration. If the '--kill_empty_clusters (-E)' option is
       specified, then when a cluster owns no points at the end  of  an  iteration,  the  cluster
       centroid is simply filled with DBL_MAX, killing it and effectively reducing k for the rest
       of the computation. Note that the default behavior, used when neither empty-cluster option
       is specified, can be time-consuming to compute; specifying either of these parameters will
       therefore often reduce runtime.
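
       The two optional policies can be contrasted with a small sketch. It is illustrative
       only: the function name and the policy strings "allow" and "kill" are hypothetical,
       and sys.float_info.max stands in for DBL_MAX.

```python
import sys


def handle_empty(centroids, previous, counts, policy):
    """Apply an empty-cluster policy after an iteration (sketch).

    policy "allow": an empty cluster keeps its previous centroid.
    policy "kill":  an empty cluster's centroid is filled with the largest
                    finite double, so no point will be assigned to it again.
    """
    for c, n in enumerate(counts):
        if n > 0:
            continue  # cluster still owns points; nothing to do
        if policy == "allow":
            centroids[c] = list(previous[c])
        elif policy == "kill":
            centroids[c] = [sys.float_info.max] * len(previous[c])
        else:
            raise ValueError("unknown policy: %r" % (policy,))
    return centroids
```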

       Initial centroids may be specified using the '--initial_centroids_file (-I)' parameter, and
       the maximum number of iterations may be specified with the '--max_iterations (-m)'
       parameter.

       As an example, to use Hamerly's algorithm to perform k-means clustering with k=10  on  the
       dataset  'data.csv',  saving the centroids to 'centroids.csv' and the assignments for each
       point to 'assignments.csv', the following command could be used:

       $ mlpack_kmeans --input_file data.csv --clusters 10 --algorithm hamerly
       --output_file assignments.csv --centroid_file centroids.csv

       To run k-means on that same dataset with initial centroids specified in 'initial.csv' and
       a maximum of 500 iterations, storing the output centroids in 'final.csv', the following
       command may be used:

       $  mlpack_kmeans  --input_file data.csv --initial_centroids_file initial.csv --clusters 10
       --max_iterations 500 --centroid_file final.csv
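
       When such commands are launched from a script, the argument list can be assembled
       programmatically. The helper below is a hypothetical convenience, not part of mlpack;
       it only emits long options that are documented on this page and does not run
       anything itself.

```python
def kmeans_argv(input_file, clusters, algorithm=None,
                initial_centroids_file=None, max_iterations=None,
                output_file=None, centroid_file=None):
    """Build an argument list for mlpack_kmeans (hypothetical helper)."""
    argv = ["mlpack_kmeans",
            "--input_file", input_file,
            "--clusters", str(clusters)]
    optional = [
        ("--algorithm", algorithm),
        ("--initial_centroids_file", initial_centroids_file),
        ("--max_iterations", max_iterations),
        ("--output_file", output_file),
        ("--centroid_file", centroid_file),
    ]
    # Only options that were actually given are added to the command line.
    for flag, value in optional:
        if value is not None:
            argv += [flag, str(value)]
    return argv
```

       The resulting list could then be passed to, for example, subprocess.run(argv,
       check=True) on a system where mlpack_kmeans is installed.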

REQUIRED INPUT OPTIONS

       --clusters (-c) [int]
              Number of clusters to find (0 autodetects from initial centroids).

       --input_file (-i) [unknown]
              Input dataset to perform clustering on.

OPTIONAL INPUT OPTIONS

       --algorithm (-a) [string]
              Algorithm to  use  for  the  Lloyd  iteration  ('naive',  'pelleg-moore',  'elkan',
              'hamerly', 'dualtree', or 'dualtree-covertree'). Default value 'naive'.

       --allow_empty_clusters (-e) [bool]
              Allow empty clusters to persist.

       --help (-h) [bool]
              Default help info.

       --in_place (-P) [bool]
              If  specified, a column containing the learned cluster assignments will be added to
              the input dataset file. In this case, --output_file is overridden. (Do not  use  in
              Python.)

       --info [string]
              Print help on a specific option. Default value ''.

       --initial_centroids_file (-I) [unknown]
              Start with the specified initial centroids.

       --kill_empty_clusters (-E) [bool]
              Remove empty clusters when they occur.

       --kmeans_plus_plus (-K) [bool]
              Use the k-means++ initialization strategy to choose initial points.

       --labels_only (-l) [bool]
              Only output labels into output file.

       --max_iterations (-m) [int]
              Maximum number of iterations before k-means terminates. Default value 1000.

       --percentage (-p) [double]
              Percentage   of   dataset  to  use  for  each  refined  start  sampling  (use  when
              --refined_start is specified). Default value 0.02.

       --refined_start (-r) [bool]
              Use the refined initial point strategy by Bradley  and  Fayyad  to  choose  initial
              points.

       --samplings (-S) [int]
              Number of samplings to perform for refined start (use when --refined_start is
              specified). Default value 100.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used.  Default value 0.

       --verbose (-v) [bool]
              Display  informational  messages  and the full list of parameters and timers at the
              end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS

       --centroid_file (-C) [unknown]
              If specified, the centroids of each cluster will be written to the given file.

       --output_file (-o) [unknown]
              Matrix to store output labels or labeled data to.

ADDITIONAL INFORMATION

       For  further  information,  including  relevant papers, citations, and theory, consult the
       documentation found at http://www.mlpack.org or included with your distribution of mlpack.