Ubuntu Manpage: mlpack_kmeans - k-means clustering


Provided by: mlpack-bin_2.2.5-1build1_amd64

NAME

       mlpack_kmeans - k-means clustering

SYNOPSIS

        mlpack_kmeans [-h] [-v]

DESCRIPTION

       This  program  performs  K-Means clustering on the given dataset, storing the learned cluster assignments
       either as a column of labels in the file containing the input  dataset  or  in  a  separate  file.  Empty
       clusters  are  not allowed by default; when a cluster becomes empty, the point furthest from the centroid
       of the cluster with maximum variance is taken to fill that cluster.
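
       The default refill rule described above can be sketched in a few lines of Python. This is only an
       illustration, not mlpack's implementation: the function name refill_empty_cluster is hypothetical,
       "variance" is assumed to mean the mean squared distance of a cluster's points to its centroid, and
       placing the refilled centroid on the moved point is also an assumption.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def refill_empty_cluster(points, assignments, centroids, empty):
    """Move one point into the empty cluster `empty` (hypothetical helper).

    The donor is the non-empty cluster with maximum variance; the point taken
    is the donor member furthest from the donor's centroid.
    Modifies `assignments` and `centroids` in place; returns the point index.
    """
    members = {}
    for index, cluster in enumerate(assignments):
        members.setdefault(cluster, []).append(index)

    # Assumption: variance = mean squared distance to the cluster centroid.
    def variance(cluster):
        idx = members[cluster]
        return sum(squared_distance(points[i], centroids[cluster]) for i in idx) / len(idx)

    donor = max(members, key=variance)
    furthest = max(members[donor],
                   key=lambda i: squared_distance(points[i], centroids[donor]))

    assignments[furthest] = empty
    # Assumption: the refilled cluster's centroid is placed on the moved point.
    centroids[empty] = list(points[furthest])
    return furthest


points = [(0, 0), (0, 1), (10, 0), (10, 1), (10, 11)]
assignments = [0, 0, 1, 1, 1]
centroids = [[0, 0.5], [10, 4], [0, 0]]  # cluster 2 owns no points
moved = refill_empty_cluster(points, assignments, centroids, empty=2)
```

       Computing the per-cluster variance requires a pass over every point, which is consistent with the
       default behavior being the most expensive of the empty-cluster policies.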

       Optionally, the Bradley and Fayyad approach ("Refining initial points for k-means clustering", 1998)  can
       be  used  to  select initial points by specifying the --refined_start (-r) option. This approach works by
       taking random samples of the dataset; to specify the number of samples, the --samplings parameter is used,
       and  to  specify  the  percentage of the dataset to be used in each sample, the --percentage parameter is
       used (it should be a value between 0.0 and 1.0).
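
       As a rough illustration of how the two sampling parameters interact, the sketch below draws the
       subsamples only; it does not perform the clustering and refinement steps of the Bradley and Fayyad
       method. The function name draw_samplings is hypothetical, and sampling without replacement and the
       rounding of the sample size are assumptions rather than documented mlpack behavior.

```python
import random


def draw_samplings(points, samplings=100, percentage=0.02, rng_seed=1):
    """Draw `samplings` random subsamples, each holding `percentage` of the data.

    `percentage` is a fraction between 0.0 and 1.0, as the option requires.
    """
    if not 0.0 < percentage <= 1.0:
        raise ValueError("percentage must be in the range (0.0, 1.0]")
    rng = random.Random(rng_seed)
    # Assumption: round to the nearest whole point and keep at least one.
    size = max(1, round(percentage * len(points)))
    # Assumption: each subsample is drawn without replacement.
    return [rng.sample(points, size) for _ in range(samplings)]


dataset = [(float(i), float(i % 7)) for i in range(1000)]
subsamples = draw_samplings(dataset, samplings=5, percentage=0.02)
```

       With the default values (100 samplings at 0.02), each subsample of a 1000-point dataset would hold
       about 20 points under these assumptions.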

       There are several options available for the algorithm used for each Lloyd iteration, specified  with  the
       --algorithm  (-a)  option.  The  standard O(kN) approach can be used ('naive'). Other options include the
       Pelleg-Moore  tree-based  algorithm  ('pelleg-moore'),  Elkan's   triangle-inequality   based   algorithm
       ('elkan'),  Hamerly's  modification  to  Elkan's  algorithm  ('hamerly'), the dual-tree k-means algorithm
       ('dualtree'), and the dual-tree k-means algorithm using the cover tree ('dualtree-covertree').
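
       For orientation, one iteration of the standard O(kN) approach might look like the following sketch. It
       is written for clarity rather than speed and is not mlpack's code; the function name lloyd_iteration
       is hypothetical, and a cluster that ends up empty simply keeps its previous centroid here, which is an
       assumption made to keep the example short.

```python
def lloyd_iteration(points, centroids):
    """Run one naive Lloyd iteration: assign each point, then recompute means.

    Every point is compared against every centroid, giving O(kN) distance
    evaluations for N points and k clusters.
    """
    k = len(centroids)
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    assignments = []

    for p in points:
        # Assignment step: index of the nearest centroid (squared Euclidean).
        nearest = min(
            range(k),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
        )
        assignments.append(nearest)
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]

    # Update step: each centroid becomes the mean of the points it owns.
    new_centroids = []
    for c in range(k):
        if counts[c]:
            new_centroids.append([s / counts[c] for s in sums[c]])
        else:
            # Assumption for this sketch: an empty cluster keeps its position.
            new_centroids.append(list(centroids[c]))
    return assignments, new_centroids


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels, updated = lloyd_iteration(points, [[1, 1], [9, 1]])
```

       The tree-based and triangle-inequality variants produce the same assignments while avoiding many of
       the point-to-centroid distance computations performed in the inner loop.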

       The behavior for when an empty cluster is encountered can be  modified  with  the  --allow_empty_clusters
       (-e)  option.  When  this  option  is  specified and there is a cluster owning no points at the end of an
       iteration, that cluster's centroid will simply remain in its position from the previous iteration. If the
       --kill_empty_clusters  (-E)  option  is  specified,  then  when a cluster owns no points at the end of an
       iteration, the cluster centroid is simply filled with DBL_MAX, killing it and effectively reducing k  for
       the  rest of the computation. Note that the default option when neither empty cluster option is specified
       can be time-consuming to calculate; therefore, specifying -e or -E will often accelerate runtime.
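
       The two alternative policies are simple enough to show side by side. The sketch below is illustrative
       only: the function name handle_empty_clusters and the policy strings "allow" and "kill" are
       hypothetical stand-ins for the -e and -E options, and sys.float_info.max is used as the Python
       counterpart of DBL_MAX.

```python
import sys

DBL_MAX = sys.float_info.max  # largest finite double, as in C's <float.h>


def handle_empty_clusters(new_centroids, previous_centroids, counts, policy):
    """Apply an empty-cluster policy after one iteration.

    policy "allow": an empty cluster keeps its centroid from the previous
                    iteration (corresponds to -e).
    policy "kill":  an empty cluster's centroid is filled with DBL_MAX, so no
                    point will be closest to it again (corresponds to -E).
    """
    if policy not in ("allow", "kill"):
        raise ValueError("policy must be 'allow' or 'kill'")
    result = []
    for c, count in enumerate(counts):
        if count > 0:
            result.append(list(new_centroids[c]))
        elif policy == "allow":
            result.append(list(previous_centroids[c]))
        else:
            result.append([DBL_MAX] * len(previous_centroids[c]))
    return result


previous = [[0.0, 0.0], [5.0, 5.0]]
computed = [[1.0, 1.0], [0.0, 0.0]]  # cluster 1 owned no points this round
counts = [4, 0]
kept = handle_empty_clusters(computed, previous, counts, "allow")
killed = handle_empty_clusters(computed, previous, counts, "kill")
```

       Neither policy needs to inspect the data points, which is why both are cheaper than the default
       refill strategy.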

       As of October 2014, the --overclustering option has been removed. If you want this support back, let us
       know: file a bug at https://github.com/mlpack/mlpack/ or get in touch through other means.

REQUIRED INPUT OPTIONS

       --clusters (-c) [int]
              Number of clusters to find (0 autodetects from initial centroids).

       --input_file (-i) [string]
              Input dataset to perform clustering on.

OPTIONAL INPUT OPTIONS

       --algorithm (-a) [string]
              Algorithm to use for the Lloyd iteration ('naive', 'pelleg-moore', 'elkan', 'hamerly', 'dualtree',
              or 'dualtree-covertree'). Default value 'naive'.

       --allow_empty_clusters (-e)
              Allow empty clusters to persist.

       --help (-h)
              Default help info.

       --in_place (-P)
              If specified, a column containing the learned cluster assignments  will  be  added  to  the  input
              dataset file. In this case, --output_file is overridden.

       --info [string]
              Get help on a specific module or option.  Default value ''.

       --initial_centroids (-I) [string]
              Start with the specified initial centroids.  Default value ''.

       --kill_empty_clusters (-E)
              Remove empty clusters when they occur.

       --labels_only (-l)
              Only output labels into output file.

       --max_iterations (-m) [int]
              Maximum number of iterations before k-means terminates. Default value 1000.

       --percentage (-p) [double]
              Percentage of dataset to use  for  each  refined  start  sampling  (use  when  --refined_start  is
              specified). Default value 0.02.

       --refined_start (-r)
              Use the refined initial point strategy by Bradley and Fayyad to choose initial points.

       --samplings (-S) [int]
              Number of samplings to perform for refined start (use when --refined_start is specified).  Default
              value 100.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used.  Default value 0.

       --verbose (-v)
              Display informational messages and the full list of parameters and timers at the end of execution.

       --version (-V)
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS

       --centroid_file (-C) [string]
              If specified, the centroids of each cluster will be written to the given file.  Default value
              ''.

       --output_file (-o) [string]
              File to write output labels or labeled data to.  Default value ''.
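
       The options above can be combined into a single invocation. The following sketch only assembles the
       argument list from the long option names documented on this page; it does not run the program. The
       helper build_kmeans_command and the file names data.csv, labels.csv, and centroids.csv are
       hypothetical.

```python
import shlex

ALGORITHMS = ("naive", "pelleg-moore", "elkan", "hamerly",
              "dualtree", "dualtree-covertree")


def build_kmeans_command(input_file, clusters, output_file=None,
                         centroid_file=None, algorithm="naive",
                         max_iterations=1000, refined_start=False,
                         samplings=100, percentage=0.02, empty_policy=None):
    """Assemble an mlpack_kmeans argument list (hypothetical helper)."""
    if algorithm not in ALGORITHMS:
        raise ValueError("unknown algorithm: " + algorithm)
    if empty_policy not in (None, "allow", "kill"):
        raise ValueError("empty_policy must be None, 'allow' or 'kill'")

    cmd = ["mlpack_kmeans",
           "--input_file", input_file,
           "--clusters", str(clusters),
           "--algorithm", algorithm,
           "--max_iterations", str(max_iterations)]
    if output_file:
        cmd += ["--output_file", output_file]
    if centroid_file:
        cmd += ["--centroid_file", centroid_file]
    if refined_start:
        # --samplings and --percentage only apply with --refined_start.
        if not 0.0 <= percentage <= 1.0:
            raise ValueError("percentage must be between 0.0 and 1.0")
        cmd += ["--refined_start",
                "--samplings", str(samplings),
                "--percentage", str(percentage)]
    if empty_policy == "allow":
        cmd.append("--allow_empty_clusters")
    elif empty_policy == "kill":
        cmd.append("--kill_empty_clusters")
    return cmd


command = build_kmeans_command("data.csv", 3, output_file="labels.csv",
                               centroid_file="centroids.csv",
                               algorithm="elkan", empty_policy="allow")
print(shlex.join(command))  # shell-quoted form, suitable for copy and paste
```

       The resulting list can be handed to subprocess.run on a system where mlpack_kmeans is installed.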

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations, and theory, consult the documentation
       found at http://www.mlpack.org or included with your distribution of mlpack.

                                                                                 mlpack_kmeans(16 November 2017)