Ubuntu Manpage: mlpack_kmeans - k-means clustering

Provided by: mlpack-bin_4.5.1-1build2_amd64

NAME

       mlpack_kmeans - k-means clustering

SYNOPSIS

        mlpack_kmeans -c int -i unknown [-a string] [-e bool] [-P bool] [-I unknown] [-E bool] [-K bool] [-l bool] [-m int] [-p double] [-r bool] [-S int] [-s int] [-V bool] [-C unknown] [-o unknown] [-h -v]

DESCRIPTION

       This  program  performs  K-Means  clustering  on  the  given  dataset.  It can return the learned cluster
       assignments, and the centroids of the clusters. Empty clusters are not allowed by default; when a cluster
       becomes empty, the point furthest from the centroid of the cluster with maximum variance is taken to fill
       that cluster.
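
       The default empty-cluster rule described above can be sketched roughly as follows. This is an
       illustrative sketch only, not mlpack's implementation: the function names are hypothetical, and
       "variance" is assumed here to mean the mean squared distance of a cluster's points to its centroid.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def refill_empty_cluster(points, assignments, empty_cluster):
    """Reassign one point to `empty_cluster` in place; return that point's index."""
    # Group point indices by their current cluster.
    members = {}
    for index, cluster in enumerate(assignments):
        members.setdefault(cluster, []).append(index)

    # Find the cluster with the largest variance (assumed: mean squared
    # distance to its centroid). Single-point clusters are skipped so that
    # donating a point never creates a new empty cluster.
    best = None  # (variance, cluster, centroid)
    for cluster, indices in members.items():
        if len(indices) < 2:
            continue
        center = mean_point([points[i] for i in indices])
        variance = sum(squared_distance(points[i], center) for i in indices) / len(indices)
        if best is None or variance > best[0]:
            best = (variance, cluster, center)
    if best is None:
        raise ValueError("no cluster has a point to spare")

    # Move the point furthest from that cluster's centroid into the empty cluster.
    _, donor, center = best
    furthest = max(members[donor], key=lambda i: squared_distance(points[i], center))
    assignments[furthest] = empty_cluster
    return furthest
```

       Recomputing variances for every cluster each time a cluster empties is one reason this default can
       be slower than simply leaving or discarding the empty cluster.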

       Optionally, the strategy to choose initial centroids can be specified. The  k-means++  algorithm  can  be
       used to choose initial centroids with the '--kmeans_plus_plus (-K)' parameter. The Bradley and Fayyad
       approach ("Refining initial points for k-means clustering", 1998) can be used to select initial points by
       specifying the '--refined_start (-r)' parameter. This approach works by taking random  samplings  of  the
       dataset; to specify the number of samplings, the '--samplings (-S)' parameter is used, and to specify the
       percentage of the dataset to be used in each sample, the '--percentage (-p)' parameter is used (despite
       the name, it is a fraction between 0.0 and 1.0, not a value out of 100).
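
       For readers unfamiliar with k-means++ seeding, the idea can be sketched as below. This is a generic
       textbook version written for illustration, not mlpack's code; the function name and the 'seed'
       argument are assumptions of this sketch.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plus_plus(points, k, seed=0):
    """Pick k initial centroids from `points` using k-means++ seeding."""
    rng = random.Random(seed)
    # The first centroid is drawn uniformly at random.
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen centroid; fall back to uniform.
            index = rng.randrange(len(points))
        else:
            # Sample the next centroid with probability proportional to the weight.
            index = rng.choices(range(len(points)), weights=weights, k=1)[0]
        centroids.append(list(points[index]))
    return centroids
```

       Points far from every existing centroid are favored, which tends to spread the initial centroids
       across the dataset.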

       There  are  several options available for the algorithm used for each Lloyd iteration, specified with the
       '--algorithm (-a)' option. The standard O(kN) approach can be used ('naive'). Other options  include  the
       Pelleg-Moore   tree-based   algorithm   ('pelleg-moore'),  Elkan's  triangle-inequality  based  algorithm
       ('elkan'), Hamerly's modification to Elkan's  algorithm  ('hamerly'),  the  dual-tree  k-means  algorithm
       ('dualtree'), and the dual-tree k-means algorithm using the cover tree ('dualtree-covertree').
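
       As a point of reference for the 'naive' option, a single O(kN) Lloyd iteration can be sketched as
       follows. This is a plain illustration with hypothetical names, not mlpack's implementation; the other
       algorithms aim to produce the same result per iteration while avoiding some of the distance
       computations.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_iteration(points, centroids):
    """Run one naive Lloyd step; return (assignments, new_centroids)."""
    k = len(centroids)
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    assignments = []

    # Assignment step: compare every point with every centroid (k * N distances).
    for p in points:
        nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
        assignments.append(nearest)
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]

    # Update step: each centroid becomes the mean of its points. In this sketch
    # a cluster with no points simply keeps its previous centroid.
    new_centroids = [
        [s / counts[c] for s in sums[c]] if counts[c] else list(centroids[c])
        for c in range(k)
    ]
    return assignments, new_centroids
```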

       The behavior for when an empty cluster is encountered can be modified with the '--allow_empty_clusters
       (-e)' option. When this option is specified and there is a cluster owning no points  at  the  end  of  an
       iteration, that cluster's centroid will simply remain in its position from the previous iteration. If the
       '--kill_empty_clusters  (-E)'  option  is  specified, then when a cluster owns no points at the end of an
       iteration, the cluster centroid is simply filled with DBL_MAX, killing it and effectively reducing k  for
       the rest of the computation. Note that the default behavior, used when neither empty-cluster option is
       specified, can be time-consuming to compute; specifying either of these options will therefore often
       reduce runtime.
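
       The two alternative policies can be pictured with the following sketch of a centroid update. The
       function and the 'policy' strings are hypothetical names used only for illustration;
       sys.float_info.max stands in for DBL_MAX.

```python
import sys


def update_centroid(previous, point_sum, count, policy):
    """Return the new centroid for one cluster after an iteration.

    `point_sum` is the coordinate-wise sum of the cluster's points and
    `count` is how many points it owns. `policy` is 'allow' or 'kill'.
    """
    if count > 0:
        # Non-empty cluster: the usual mean of its points.
        return [s / count for s in point_sum]
    if policy == "allow":
        # Like --allow_empty_clusters: keep the previous position.
        return list(previous)
    if policy == "kill":
        # Like --kill_empty_clusters: fill with the largest double so no
        # point will ever be closest to this centroid again.
        return [sys.float_info.max] * len(previous)
    raise ValueError("policy must be 'allow' or 'kill'")
```

       Both branches are constant-time per empty cluster, unlike the default reassignment, which has to
       examine the other clusters.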

       Initial centroids may be specified using the '--initial_centroids_file (-I)' parameter, and
       the maximum number of iterations may be specified with the '--max_iterations (-m)' parameter.

       As an example, to use Hamerly's algorithm  to  perform  k-means  clustering  with  k=10  on  the  dataset
       'data.csv',   saving   the   centroids   to  'centroids.csv'  and  the  assignments  for  each  point  to
       'assignments.csv', the following command could be used:

       $ mlpack_kmeans --input_file data.csv --clusters 10 --algorithm hamerly --output_file assignments.csv
       --centroid_file centroids.csv

       To run k-means on that same dataset with initial centroids specified in 'initial.csv' with a maximum of
       500 iterations, storing the output centroids in 'final.csv', the following command may be used:

       $ mlpack_kmeans --input_file data.csv --initial_centroids_file initial.csv --clusters 10 --max_iterations
       500 --centroid_file final.csv
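
       When the program is driven from a script, invocations like the ones above can be assembled as an
       argument list. The helper below is a hypothetical convenience written for this page, not part of
       mlpack; it only uses long option names documented here and does not run anything itself.

```python
import shlex

# Algorithm names accepted by --algorithm, as listed on this page.
ALGORITHMS = {"naive", "pelleg-moore", "elkan", "hamerly", "dualtree", "dualtree-covertree"}


def build_kmeans_command(input_file, clusters, algorithm=None,
                         initial_centroids_file=None, max_iterations=None,
                         output_file=None, centroid_file=None):
    """Return an mlpack_kmeans argument list; optional values are skipped when None."""
    if algorithm is not None and algorithm not in ALGORITHMS:
        raise ValueError("unknown algorithm: " + algorithm)
    args = ["mlpack_kmeans", "--input_file", input_file, "--clusters", str(clusters)]
    optional = [
        ("--algorithm", algorithm),
        ("--initial_centroids_file", initial_centroids_file),
        ("--max_iterations", max_iterations),
        ("--output_file", output_file),
        ("--centroid_file", centroid_file),
    ]
    for flag, value in optional:
        if value is not None:
            args += [flag, str(value)]
    return args


if __name__ == "__main__":
    command = build_kmeans_command("data.csv", 10,
                                   initial_centroids_file="initial.csv",
                                   max_iterations=500,
                                   centroid_file="final.csv")
    # Print a shell-quoted form for inspection.
    print(shlex.join(command))
```

       The resulting list can be passed to subprocess.run(command, check=True) on a system where
       mlpack_kmeans is installed.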

REQUIRED INPUT OPTIONS

       --clusters (-c) [int]
              Number of clusters to find (0 autodetects from initial centroids).

       --input_file (-i) [unknown]
              Input dataset to perform clustering on.

OPTIONAL INPUT OPTIONS

       --algorithm (-a) [string]
              Algorithm to use for the Lloyd iteration ('naive', 'pelleg-moore', 'elkan', 'hamerly', 'dualtree',
              or 'dualtree-covertree'). Default value 'naive'.

       --allow_empty_clusters (-e) [bool]
              Allow empty clusters to persist.

       --help (-h) [bool]
              Default help info.

       --in_place (-P) [bool]
              If specified, a column containing the learned cluster assignments  will  be  added  to  the  input
              dataset file. In this case, --output_file is overridden. (Do not use in Python.)

       --info [string]
              Print help on a specific option. Default value ''.

       --initial_centroids_file (-I) [unknown]
              Start with the specified initial centroids.

       --kill_empty_clusters (-E) [bool]
              Remove empty clusters when they occur.

       --kmeans_plus_plus (-K) [bool]
              Use the k-means++ initialization strategy to choose initial points.

       --labels_only (-l) [bool]
              Only output labels into output file.

       --max_iterations (-m) [int]
              Maximum number of iterations before k-means terminates. Default value 1000.

       --percentage (-p) [double]
              Percentage  of  dataset  to  use  for  each  refined  start  sampling (use when --refined_start is
              specified). Default value 0.02.

       --refined_start (-r) [bool]
              Use the refined initial point strategy by Bradley and Fayyad to choose initial points.

       --samplings (-S) [int]
              Number of samplings to perform for refined start (use when --refined_start is specified).
              Default value 100.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used.  Default value 0.

       --verbose (-v) [bool]
              Display informational messages and the full list of parameters and timers at the end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS

       --centroid_file (-C) [unknown]
              If specified, the centroids of each cluster will be written to the given file.

       --output_file (-o) [unknown]
              Matrix to store output labels or labeled data to.

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations,  and  theory,  consult  the  documentation
       found at http://www.mlpack.org or included with your distribution of mlpack.

mlpack-4.5.1                                     29 January 2025                                mlpack_kmeans(1)