Provided by: mlpack-bin_3.0.4-1_amd64


       mlpack_kmeans - k-means clustering


        mlpack_kmeans -c int -i string [-a string] [-e bool] [-P bool] [-I string] [-E bool] [-l bool] [-m int] [-p double] [-r bool] [-S int] [-s int] [-V bool] [-C string] [-o string] [-h -v]


       This  program  performs K-Means clustering on the given dataset. It can return the learned
       cluster assignments, and the centroids of the clusters. Empty clusters are not allowed  by
       default; when a cluster becomes empty, the point furthest from the centroid of the cluster
       with maximum variance is taken to fill that cluster.

       Optionally, the  Bradley  and  Fayyad  approach  ("Refining  initial  points  for  k-means
       clustering", 1998) can be used to select initial points by specifying the '--refined_start
       (-r)' parameter. This approach works by taking random samplings of the dataset; to specify
       the  number  of  samplings,  the  '--samplings (-S)' parameter is used, and to specify the
       percentage of the dataset to be used in each sample, the '--percentage (-p)' parameter is
       used (expressed as a fraction between 0.0 and 1.0, so 0.02 means 2% of the points).
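
       As a sketch of the refined start options described above, the following snippet asks
       for 50 samplings that each use 5% of the dataset. The file names 'data.csv' and
       'assignments.csv', k=10, and the values 50 and 0.05 are placeholders chosen for
       illustration, not recommendations. The command is only executed when mlpack_kmeans
       and the input file are actually present; otherwise it is printed.

```shell
# Refined start (Bradley and Fayyad): 50 samplings, each using 5% of the points.
# data.csv, assignments.csv, k=10, 50 and 0.05 are placeholder values.
cmd="mlpack_kmeans --input_file data.csv --clusters 10 --refined_start --samplings 50 --percentage 0.05 --output_file assignments.csv"

# Run only when the binary and the input file are both available.
if command -v mlpack_kmeans >/dev/null 2>&1 && [ -f data.csv ]; then
  $cmd
else
  echo "would run: $cmd"
fi
```

       Leaving out '--samplings' and '--percentage' falls back to the defaults listed in the
       option descriptions.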

       There  are  several  options  available  for  the algorithm used for each Lloyd iteration,
       specified with the '--algorithm (-a)' option. The standard  O(kN)  approach  can  be  used
       ('naive').  Other  options include the Pelleg-Moore tree-based algorithm ('pelleg-moore'),
       Elkan's triangle-inequality based algorithm ('elkan'), Hamerly's modification  to  Elkan's
       algorithm  ('hamerly'), the dual-tree k-means algorithm ('dualtree'), and the dual-tree k-
       means algorithm using the cover tree ('dualtree-covertree').
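
       One way to compare the algorithms listed above is to run each of them on the same
       input. The loop below is a minimal sketch: 'data.csv', the 'centroids-<name>.csv'
       output names, k=10 and the seed 42 are assumptions made for the example. A fixed
       nonzero seed is passed so that the runs do not depend on the current time; whether
       every algorithm then produces identical centroids is not guaranteed by this page.

```shell
# Try each Lloyd-iteration algorithm named above on the same dataset.
# data.csv, centroids-<name>.csv, k=10 and seed 42 are placeholder values.
algorithms="naive pelleg-moore elkan hamerly dualtree dualtree-covertree"
count=0

for alg in $algorithms; do
  cmd="mlpack_kmeans --input_file data.csv --clusters 10 --algorithm $alg --seed 42 --centroid_file centroids-$alg.csv"
  count=$((count + 1))
  # Run only when the binary and the input file are both available.
  if command -v mlpack_kmeans >/dev/null 2>&1 && [ -f data.csv ]; then
    $cmd
  else
    echo "would run: $cmd"
  fi
done
```

       Adding '--verbose' to each command prints the timers at the end of execution, which
       can help when comparing runtimes.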

       The behavior  for  when  an  empty  cluster  is  encountered  can  be  modified  with  the
       '--allow_empty_clusters (-e)' option. When this option is specified and there is a cluster
       owning no points at the end of an iteration, that cluster's centroid will simply remain in
       its  position  from  the previous iteration. If the '--kill_empty_clusters (-E)' option is
       specified, then when a cluster owns no points at the end  of  an  iteration,  the  cluster
       centroid is simply filled with DBL_MAX, killing it and effectively reducing k for the rest
       of the computation. Note that the default option when  neither  empty  cluster  option  is
       specified  can  be  time-consuming  to  calculate;  therefore,  specifying either of these
       parameters will often accelerate runtime.
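
       The three empty-cluster behaviors described above can be selected as in the sketch
       below. The helper 'empty_cluster_flag' and its labels 'default', 'allow' and 'kill'
       are hypothetical names introduced only for this example; the flags themselves are
       the ones documented here. File names and k=10 are placeholders.

```shell
# Map a policy label to the matching mlpack_kmeans flag.
# The labels 'default', 'allow' and 'kill' are hypothetical, used only in this sketch.
empty_cluster_flag() {
  case "$1" in
    default) printf '%s' "" ;;                        # refill from the max-variance cluster
    allow)   printf '%s' "--allow_empty_clusters" ;;  # centroid stays where it was
    kill)    printf '%s' "--kill_empty_clusters" ;;   # centroid removed, k effectively shrinks
    *)       return 1 ;;
  esac
}

flag=$(empty_cluster_flag kill)
cmd="mlpack_kmeans --input_file data.csv --clusters 10 $flag --centroid_file centroids.csv"

# Run only when the binary and the input file are both available.
if command -v mlpack_kmeans >/dev/null 2>&1 && [ -f data.csv ]; then
  $cmd
else
  echo "would run: $cmd"
fi
```

       With '--kill_empty_clusters', fewer than the requested number of usable centroids may
       remain, so downstream tooling should not assume exactly k clusters.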

       Initial centroids may be specified using the '--initial_centroids_file (-I)' parameter,
       and the maximum number of iterations may be specified with the '--max_iterations (-m)'
       parameter.

       As an example, to use Hamerly's algorithm to perform k-means clustering with k=10  on  the
       dataset  'data.csv',  saving the centroids to 'centroids.csv' and the assignments for each
       point to 'assignments.csv', the following command could be used:

       $ mlpack_kmeans --input_file data.csv --clusters 10 --algorithm hamerly \
           --output_file assignments.csv --centroid_file centroids.csv

       To run k-means on that same dataset with initial centroids specified in 'initial.csv' with
       a maximum of 500 iterations, storing the output centroids in 'final.csv', the following
       command may be used:

       $ mlpack_kmeans --input_file data.csv --initial_centroids_file initial.csv --clusters 10 \
           --max_iterations 500 --centroid_file final.csv


       --clusters (-c) [int]
              Number of clusters to find (0 autodetects from initial centroids).

       --input_file (-i) [string]
              Input dataset to perform clustering on.


       --algorithm (-a) [string]
              Algorithm to  use  for  the  Lloyd  iteration  ('naive',  'pelleg-moore',  'elkan',
              'hamerly', 'dualtree', or 'dualtree-covertree'). Default value 'naive'.

       --allow_empty_clusters (-e) [bool]
              Allow empty clusters to persist.

       --help (-h) [bool]
              Default help info.

       --in_place (-P) [bool]
              If  specified, a column containing the learned cluster assignments will be added to
              the input dataset file. In this case, --output_file is overridden. (Do not  use  in

       --info [string]
              Get help on a specific module or option.  Default value ''.

       --initial_centroids_file (-I) [string]
              Start with the specified initial centroids.  Default value ''.

       --kill_empty_clusters (-E) [bool]
              Remove empty clusters when they occur.

       --labels_only (-l) [bool]
              Only output labels into output file.

       --max_iterations (-m) [int]
              Maximum number of iterations before k-means terminates. Default value 1000.

       --percentage (-p) [double]
              Percentage   of   dataset  to  use  for  each  refined  start  sampling  (use  when
              --refined_start is specified). Default value 0.02.

       --refined_start (-r) [bool]
              Use the refined initial point strategy by Bradley and Fayyad to choose initial
              points.

       --samplings (-S) [int]
              Number of samplings to perform for refined start (use when --refined_start is
              specified). Default value 100.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used.  Default value 0.

       --verbose (-v) [bool]
              Display  informational  messages  and the full list of parameters and timers at the
              end of execution.

       --version (-V) [bool]
              Display the version of mlpack.


       --centroid_file (-C) [string]
              If specified, the centroids of each cluster will be  written  to  the  given  file.
              Default value ''.

       --output_file (-o) [string]
              Matrix to store output labels or labeled data to. Default value ''.


       For  further  information,  including  relevant papers, citations, and theory, consult the
       documentation found at or included with your distribution of mlpack.