Ubuntu Manpage: mlpack_kmeans - k-means clustering

Provided by: mlpack-bin_2.2.5-1build1_amd64

NAME

       mlpack_kmeans - k-means clustering

SYNOPSIS

        mlpack_kmeans [-h] [-v]

DESCRIPTION

       This program performs K-Means clustering on the given dataset, storing the learned cluster
       assignments either as a column of labels in the file containing the input dataset or in  a
       separate  file.  Empty  clusters are not allowed by default; when a cluster becomes empty,
       the point furthest from the centroid of the cluster with maximum variance is taken to fill
       that cluster.
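
       The default empty-cluster rule described above can be sketched in a few lines of Python.
       This is an illustration only, not mlpack's implementation: the function name
       fill_empty_cluster is hypothetical, and "variance" is assumed here to mean the mean
       squared distance of a cluster's points to its centroid.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fill_empty_cluster(points, assignments, centroids, empty):
    """Move one point into the empty cluster `empty`; return that point's index.

    Assumption: cluster variance = mean squared distance to the centroid.
    """
    k = len(centroids)
    totals = [0.0] * k
    counts = [0] * k
    for p, c in zip(points, assignments):
        totals[c] += sq_dist(p, centroids[c])
        counts[c] += 1
    variances = [totals[c] / counts[c] if counts[c] else 0.0 for c in range(k)]

    # The cluster with maximum variance donates a point.
    donor = max(range(k), key=lambda c: variances[c])

    # Pick the donor's member that lies furthest from the donor's centroid.
    members = [i for i, c in enumerate(assignments) if c == donor]
    furthest = max(members, key=lambda i: sq_dist(points[i], centroids[donor]))

    assignments[furthest] = empty
    centroids[empty] = list(points[furthest])
    return furthest


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0], [30.0, 0.0]]
assignments = [0, 0, 1, 1, 1]           # cluster 2 owns no points
centroids = [[0.0, 0.5], [50.0 / 3.0, 1.0 / 3.0], [100.0, 100.0]]
moved = fill_empty_cluster(points, assignments, centroids, empty=2)
```

       In this toy input, cluster 1 is the most spread out, so its outlying point at (30, 0) is
       the one moved into the empty cluster.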

       Optionally, the Bradley and Fayyad approach ("Refining initial points for k-means
       clustering", 1998) can be used to select initial points by specifying the --refined_start
       (-r) option. This approach works by taking random samples of the dataset; to specify the
       number of samples, the --samplings parameter is used, and to specify the percentage of the
       dataset to be used in each sample, the --percentage parameter is used (it should be a
       value between 0.0 and 1.0).
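
       To make the roles of the sampling count and the percentage concrete, the following is a
       rough Python sketch of a refined start in the spirit of Bradley and Fayyad. It is a
       simplified reading of the idea, not mlpack's code: the helper names (lloyd, distortion,
       refined_start) are hypothetical, and the inner k-means is a bare-bones version that lets
       an empty cluster keep its previous centroid.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def lloyd(points, centroids, iterations=20):
    """Plain k-means; an empty cluster keeps its previous centroid."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def distortion(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def refined_start(points, k, samplings, percentage, rng):
    """Return k initial centroids chosen from clustered random samples."""
    sample_size = max(k, int(percentage * len(points)))
    # Step 1: cluster each random sample and keep its k centroids.
    solutions = []
    for _ in range(samplings):
        sample = rng.sample(points, sample_size)
        solutions.append(lloyd(sample, rng.sample(sample, k)))
    # Step 2: cluster the pooled centroids, once per solution as the start.
    pooled = [c for solution in solutions for c in solution]
    refined = [lloyd(pooled, solution) for solution in solutions]
    # Step 3: keep the refinement with the lowest distortion over the pool.
    return min(refined, key=lambda cs: distortion(pooled, cs))


rng = random.Random(0)
points = [[i * 0.1] for i in range(50)] + [[100 + i * 0.1] for i in range(50)]
start = refined_start(points, k=2, samplings=5, percentage=0.2, rng=rng)
```

       Here samplings and percentage play the roles of --samplings and --percentage: each of
       the 5 samples holds 20% of the 100 points.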

       There are several options available for the  algorithm  used  for  each  Lloyd  iteration,
       specified  with  the  --algorithm  (-a)  option.  The  standard O(kN) approach can be used
       ('naive'). Other options include the Pelleg-Moore tree-based  algorithm  ('pelleg-moore'),
       Elkan's  triangle-inequality  based algorithm ('elkan'), Hamerly's modification to Elkan's
       algorithm ('hamerly'), the dual-tree k-means algorithm ('dualtree'), and the dual-tree  k-
       means algorithm using the cover tree ('dualtree-covertree').
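
       For orientation, one iteration of the standard O(kN) approach can be sketched as below:
       every one of the N points is compared against all k centroids, then each centroid is
       recomputed as the mean of its points. This is a minimal illustration with a hypothetical
       function name, not the code behind the 'naive' option; in this sketch an empty cluster
       simply keeps its old centroid.

```python
def lloyd_iteration(points, centroids):
    """One naive Lloyd iteration: assign each point, then recompute means."""
    k = len(centroids)
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    assignments = []

    for p in points:                        # N points ...
        best, best_d = 0, float("inf")
        for j, c in enumerate(centroids):   # ... times k centroids
            d = sum((x - y) ** 2 for x, y in zip(p, c))
            if d < best_d:
                best, best_d = j, d
        assignments.append(best)
        counts[best] += 1
        for t in range(dim):
            sums[best][t] += p[t]

    # Sketch-only choice: a cluster with no points keeps its old centroid.
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return assignments, new_centroids


points = [[0.0], [2.0], [10.0], [12.0]]
labels, updated = lloyd_iteration(points, [[1.0], [5.0]])
```

       The tree-based and triangle-inequality variants listed above aim to avoid most of the
       inner-loop distance computations that this version performs.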

       The  behavior  for  when  an  empty  cluster  is  encountered  can  be  modified  with the
       --allow_empty_clusters (-e) option. When this option is specified and there is  a  cluster
       owning no points at the end of an iteration, that cluster's centroid will simply remain in
       its position from the previous iteration. If  the  --kill_empty_clusters  (-E)  option  is
       specified,  then  when  a  cluster  owns no points at the end of an iteration, the cluster
       centroid is simply filled with DBL_MAX, killing it and effectively reducing k for the rest
       of  the  computation.  Note  that  the default option when neither empty cluster option is
       specified can be time-consuming to calculate; therefore, specifying -e or  -E  will  often
       accelerate runtime.
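
       The two alternative behaviors can be contrasted with a small Python sketch of the
       centroid update step. The function and the policy strings 'allow' and 'kill' are
       hypothetical names chosen for this illustration; sys.float_info.max is used as Python's
       counterpart of DBL_MAX.

```python
import sys

DBL_MAX = sys.float_info.max  # largest finite double, as in C's <float.h>


def update_centroids(points, assignments, old_centroids, policy):
    """Recompute centroids; `policy` decides what an empty cluster gets.

    'allow' mimics -e (keep the previous centroid),
    'kill'  mimics -E (fill the centroid with DBL_MAX).
    """
    k = len(old_centroids)
    dim = len(old_centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, a in zip(points, assignments):
        counts[a] += 1
        for t in range(dim):
            sums[a][t] += p[t]

    new_centroids = []
    for j in range(k):
        if counts[j] > 0:
            new_centroids.append([s / counts[j] for s in sums[j]])
        elif policy == "allow":
            new_centroids.append(list(old_centroids[j]))
        elif policy == "kill":
            new_centroids.append([DBL_MAX] * dim)
        else:
            raise ValueError("unknown policy: " + policy)
    return new_centroids


points = [[0.0], [2.0]]
assignments = [0, 0]                 # cluster 1 owns no points
old = [[1.5], [9.0]]
kept = update_centroids(points, assignments, old, "allow")
killed = update_centroids(points, assignments, old, "kill")
```

       A centroid filled with DBL_MAX is further from every point than any live centroid, which
       is why it stops attracting points in later iterations.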

       As of October 2014, the --overclustering option has been removed. If you want this support
       back, let us know by filing a bug at https://github.com/mlpack/mlpack/ or getting in touch
       through another means.

REQUIRED INPUT OPTIONS

       --clusters (-c) [int]
              Number of clusters to find (0 autodetects from initial centroids).

       --input_file (-i) [string]
              Input dataset to perform clustering on.

OPTIONAL INPUT OPTIONS

       --algorithm (-a) [string]
              Algorithm to use for the Lloyd iteration ('naive', 'pelleg-moore', 'elkan',
              'hamerly', 'dualtree', or 'dualtree-covertree'). Default value 'naive'.

       --allow_empty_clusters (-e)
              Allow empty clusters to persist.

       --help (-h)
              Default help info.

       --in_place (-P)
              If specified, a column containing the learned cluster assignments will be added to
              the input dataset file. In this case, --output_file is overridden.

       --info [string]
              Get help on a specific module or option. Default value ''.

       --initial_centroids (-I) [string]
              Start with the specified initial centroids. Default value ''.

       --kill_empty_clusters (-E)
              Remove empty clusters when they occur.

       --labels_only (-l)
              Only output labels into output file.

       --max_iterations (-m) [int]
              Maximum number of iterations before k-means terminates. Default value 1000.

       --percentage (-p) [double]
              Percentage  of  dataset  to  use  for  each  refined  start  sampling   (use   when
              --refined_start is specified). Default value 0.02.

       --refined_start (-r)
              Use  the  refined  initial  point  strategy by Bradley and Fayyad to choose initial
              points.

       --samplings (-S) [int]
              Number of samplings to perform for  refined  start  (use  when  --refined_start  is
              specified).  Default value 100.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used.  Default value 0.

       --verbose (-v)
              Display  informational  messages  and the full list of parameters and timers at the
              end of execution.

       --version (-V)
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS

       --centroid_file (-C) [string]
              If specified, the centroids of each cluster will be written to the given file.
              Default value ''.

       --output_file (-o) [string]
              File to write output labels or labeled data to.  Default value ''.
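
       As an illustration of how the options above combine, the following Python sketch
       assembles an argument list for the program. The helper kmeans_command and the file names
       are hypothetical; only option names listed on this page are used, and the usual
       "--option value" form is assumed.

```python
ALGORITHMS = ("naive", "pelleg-moore", "elkan", "hamerly",
              "dualtree", "dualtree-covertree")


def kmeans_command(input_file, clusters, output_file="", centroid_file="",
                   algorithm="naive", max_iterations=1000, seed=0,
                   allow_empty_clusters=False, kill_empty_clusters=False):
    """Build an argv list for mlpack_kmeans from the options on this page."""
    if algorithm not in ALGORITHMS:
        raise ValueError("unknown algorithm: " + algorithm)
    argv = [
        "mlpack_kmeans",
        "--input_file", input_file,
        "--clusters", str(clusters),
        "--algorithm", algorithm,
        "--max_iterations", str(max_iterations),
        "--seed", str(seed),
    ]
    # String options default to '' and are only passed when set.
    if output_file:
        argv += ["--output_file", output_file]
    if centroid_file:
        argv += ["--centroid_file", centroid_file]
    # Flags take no value.
    if allow_empty_clusters:
        argv.append("--allow_empty_clusters")
    if kill_empty_clusters:
        argv.append("--kill_empty_clusters")
    return argv


argv = kmeans_command("data.csv", 3, output_file="labels.csv",
                      algorithm="elkan", allow_empty_clusters=True)
```

       On a machine where mlpack is installed, a list like this could be handed to
       subprocess.run(argv, check=True).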

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations, and theory, consult the
       documentation found at http://www.mlpack.org or included with your distribution of mlpack.

                                                                  mlpack_kmeans(16 November 2017)