Ubuntu Manpage: mlpack_nca - neighborhood components analysis (nca)

Provided by: mlpack-bin_2.2.5-1build1_amd64

NAME

       mlpack_nca - neighborhood components analysis (nca)

SYNOPSIS

        mlpack_nca [-h] [-v]

DESCRIPTION

This program implements Neighborhood Components Analysis, both a linear dimensionality
reduction technique and a distance learning technique. The method seeks to improve
k-nearest-neighbor classification on a dataset by scaling the dimensions. The method is
nonparametric, and does not require a value of k. It works by using stochastic ("soft")
neighbor assignments and using optimization techniques over the gradient of the accuracy
of the neighbor assignments.
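
The soft neighbor assignment described above can be illustrated with a small sketch. This
is not mlpack's implementation: it assumes the usual formulation in which point i picks
point j as its neighbor with probability proportional to exp(-squared distance), and it
restricts the learned transformation to a per-dimension scaling. The function name
nca_objective and the scale argument are hypothetical, chosen only for this example.

```python
import math

def nca_objective(points, labels, scale=None):
    """Expected number of correctly classified points under soft neighbor assignment.

    points: list of equal-length lists of floats
    labels: list of class labels, one per point
    scale:  per-dimension scaling factors (assumption: a diagonal transformation);
            defaults to no scaling
    """
    dim = len(points[0])
    if scale is None:
        scale = [1.0] * dim
    # Apply the per-dimension scaling to every point.
    z = [[s * x for s, x in zip(scale, p)] for p in points]
    total = 0.0
    for i in range(len(z)):
        # Unnormalized soft-neighbor weights; a point never picks itself.
        weights = []
        for j in range(len(z)):
            if j == i:
                weights.append(0.0)
            else:
                d2 = sum((a - b) ** 2 for a, b in zip(z[i], z[j]))
                weights.append(math.exp(-d2))
        denom = sum(weights)
        if denom == 0.0:
            continue  # the situation behind 'Denominator of p_i is 0!'
        # p_i: probability that point i picks a neighbor sharing its label.
        same = sum(w for w, lab in zip(weights, labels) if lab == labels[i])
        total += same / denom
    return total

points = [[0.0, 0.0], [0.1, 5.0], [3.0, 0.0], [3.1, 5.0]]
labels = [0, 0, 1, 1]
unscaled = nca_objective(points, labels)
rescaled = nca_objective(points, labels, scale=[1.0, 0.0])
```

In this toy dataset the second dimension carries no class information, so suppressing it
(scale 0) raises the objective. An optimizer would search for such a scaling by following
the gradient of this quantity.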

To work, this algorithm needs labeled data. The labels can be given as the last row of the
input dataset (--input_file), or alternatively in a separate file (--labels_file).

This implementation of NCA uses stochastic gradient descent, mini-batch stochastic
gradient descent, or the L-BFGS optimizer. These optimizers do not guarantee global
convergence for a nonconvex objective function (NCA's objective function is nonconvex), so
the final results could depend on the random seed or other optimizer parameters.

Stochastic gradient descent, specified by --optimizer "sgd", depends primarily on two
parameters: the step size (--step_size) and the maximum number of iterations
(--max_iterations). In addition, a normalized starting point can be used (--normalize),
which is necessary if many warnings of the form 'Denominator of p_i is 0!' are given.
Tuning the step size can be a tedious affair. In general, the step size is too large if
the objective is not mostly uniformly decreasing, or if zero-valued denominator warnings
are being issued. The step size is too small if the objective is changing very slowly.
Setting the termination condition can be done easily once a good step size parameter is
found; either increase the maximum iterations to a large number and allow SGD to find a
minimum, or set the maximum iterations to 0 (allowing infinite iterations) and set the
tolerance (--tolerance) to define the maximum allowed difference between objectives for
SGD to terminate. Be careful: setting the tolerance instead of the maximum iterations can
take a very long time and may actually never converge due to the properties of the SGD
optimizer. Note that a single iteration of SGD refers to a single point, so to take a
single pass over the dataset, set --max_iterations equal to the number of points in the
dataset.
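
One plausible cause of the zero-valued denominator warnings is floating-point underflow
when points are far apart. The sketch below assumes the soft-neighbor weight is
exp(-squared distance); the function name soft_neighbor_denominator is hypothetical, and
the rescaled data in the second call only stands in for a normalized starting point. It
is not mlpack's actual normalization.

```python
import math

def soft_neighbor_denominator(point, others):
    """Sum of exp(-squared distance) from `point` to each point in `others`."""
    return sum(
        math.exp(-sum((a - b) ** 2 for a, b in zip(point, q)))
        for q in others
    )

# Points far apart: every weight underflows to 0.0 in double precision.
far = soft_neighbor_denominator([0.0, 0.0], [[40.0, 0.0], [0.0, 50.0]])

# The same layout scaled down by 100 (assumed stand-in for --normalize).
near = soft_neighbor_denominator([0.0, 0.0], [[0.4, 0.0], [0.0, 0.5]])
```

Here far is exactly 0.0 while near is positive, which is consistent with the advice to
use --normalize or a smaller step size when the warnings appear.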

The mini-batch SGD optimizer, specified by --optimizer "minibatch-sgd", has the same
parameters as SGD, but the batch size may also be specified with the --batch_size (-b)
option. Each iteration of mini-batch SGD refers to a single mini-batch.
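
Because an iteration means one point for SGD and one mini-batch for mini-batch SGD, the
value to pass as --max_iterations for a given number of passes over the data can be
worked out as below. The helper name iterations_for_passes is hypothetical, and the
sketch assumes a final partial mini-batch counts as one iteration.

```python
import math

def iterations_for_passes(num_points, passes=1, batch_size=1):
    """Optimizer iterations needed for `passes` full sweeps over the dataset.

    batch_size=1 models plain SGD (one point per iteration); larger values
    model mini-batch SGD (one mini-batch per iteration).
    """
    if num_points <= 0 or passes <= 0 or batch_size <= 0:
        raise ValueError("num_points, passes, and batch_size must be positive")
    # Assumption: a trailing partial batch still costs one iteration.
    return passes * math.ceil(num_points / batch_size)

sgd_one_pass = iterations_for_passes(1000)
minibatch_five_passes = iterations_for_passes(1000, passes=5, batch_size=50)
```

For a 1000-point dataset this gives 1000 iterations for one SGD pass and 100 iterations
for five passes with the default batch size of 50.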

The L-BFGS optimizer, specified by --optimizer "lbfgs", uses a back-tracking line search
algorithm to minimize a function. The following parameters are used by L-BFGS: --num_basis
(specifies the number of memory points used by L-BFGS), --max_iterations,
--armijo_constant, --wolfe, --tolerance (the optimization is terminated when the gradient
norm is below this value), --max_line_search_trials, --min_step and --max_step (which both
refer to the line search routine). For more details on the L-BFGS optimizer, consult
either the mlpack L-BFGS documentation (in lbfgs.hpp) or the vast set of published
literature on L-BFGS.
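
To show how the line search parameters interact, here is a minimal backtracking line
search that enforces only the Armijo (sufficient decrease) condition. It is a sketch, not
mlpack's routine: it omits the Wolfe curvature check controlled by --wolfe, the halving
factor (shrink) is an assumption, and the function name is hypothetical. The default
argument values mirror the defaults listed in this manual page.

```python
def backtracking_line_search(f, grad, x, direction,
                             armijo_constant=1e-4, max_trials=50,
                             min_step=1e-20, max_step=1e20, shrink=0.5):
    """Return a step satisfying the Armijo condition, or None if none is found."""
    fx = f(x)
    # Directional derivative; it must be negative for a descent direction.
    slope = sum(g * d for g, d in zip(grad(x), direction))
    if slope >= 0:
        raise ValueError("direction is not a descent direction")
    step = min(1.0, max_step)
    for _ in range(max_trials):
        if step < min_step:
            return None
        candidate = [xi + step * di for xi, di in zip(x, direction)]
        # Armijo condition: the decrease must be at least a fraction of the
        # decrease predicted by the slope.
        if f(candidate) <= fx + armijo_constant * step * slope:
            return step
        step *= shrink  # assumed shrink factor; back off and try again
    return None

def f(x):
    return x[0] ** 2 + x[1] ** 2

def grad(x):
    return [2 * x[0], 2 * x[1]]

start = [3.0, 4.0]
step = backtracking_line_search(f, grad, start, [-g for g in grad(start)])
```

On this quadratic the full step overshoots to a point with the same objective value, so
the search backs off once and accepts a step of 0.5.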

By default, the SGD optimizer is used.

REQUIRED INPUT OPTIONS

       --input_file (-i) [string]
              Input dataset to run NCA on.

OPTIONAL INPUT OPTIONS

       --armijo_constant (-A) [double]
              Armijo constant for L-BFGS. Default value 0.0001.

       --batch_size (-b) [int]
              Batch size for mini-batch SGD. Default value 50.

       --help (-h)
              Default help info.

       --info [string]
              Get help on a specific module or option.  Default value ''.

       --labels_file (-l) [string]
              File of labels for input dataset. Default value ''.

       --linear_scan (-L)
              Don't shuffle the order in which data points are visited for SGD or mini-batch SGD.

       --max_iterations (-n) [int]
              Maximum number of iterations for SGD or L-BFGS (0 indicates no limit). Default
              value 500000.

       --max_line_search_trials (-T) [int]
              Maximum number of line search trials for L-BFGS. Default value 50.

       --max_step (-M) [double]
              Maximum step of line search for L-BFGS. Default value 1e+20.

       --min_step (-m) [double]
              Minimum step of line search for L-BFGS. Default value 1e-20.

       --normalize (-N)
              Use a normalized starting point for optimization. This is useful when points
              are far apart, or when SGD is returning NaN.

       --num_basis (-B) [int]
              Number of memory points to be stored for L-BFGS.  Default value 5.

       --optimizer (-O) [string]
              Optimizer to use; 'sgd', 'minibatch-sgd', or 'lbfgs'. Default value 'sgd'.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used.  Default value 0.

       --step_size (-a) [double]
              Step size for stochastic gradient descent (alpha). Default value 0.01.

       --tolerance (-t) [double]
              Maximum tolerance for termination of SGD or L-BFGS. Default value 1e-07.

       --verbose (-v)
              Display  informational  messages  and the full list of parameters and timers at the
              end of execution.

       --version (-V)
              Display the version of mlpack.

       --wolfe (-w) [double]
              Wolfe condition parameter for L-BFGS. Default value 0.9.

OPTIONAL OUTPUT OPTIONS

       --output_file (-o) [string]
              Output file for learned distance matrix.  Default value ''.

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations, and theory, consult the
       documentation found at http://www.mlpack.org or included with your distribution of
       mlpack.

                                                                     mlpack_nca(16 November 2017)