Provided by: mlpack-bin_2.0.1-1_amd64

NAME

       mlpack_nca - neighborhood components analysis (nca)

SYNOPSIS

        mlpack_nca [-h] [-v] -i string -o string [-A double] [-l string] [-L] [-n int] [-T int] [-M double] [-m double] [-N] [-B int] [-O string] [-s int] [-a double] [-t double] [-V] [-w double]

DESCRIPTION

       This  program  implements  Neighborhood  Components Analysis, both a linear dimensionality
       reduction technique and a distance learning technique. The  method  seeks  to  improve  k-
       nearest-neighbor  classification  on  a  dataset  by scaling the dimensions. The method is
       nonparametric, and does not require a value of k. It works by using stochastic ("soft")
       neighbor assignments and optimizing, via gradient-based methods, the expected accuracy of
       those neighbor assignments.
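
       Concretely, in the standard NCA formulation (given here only as a sketch of the usual
       definition, not taken from this manual), a linear transformation A defines soft neighbor
       probabilities

              p_ij = exp(-||A x_i - A x_j||^2) / sum_{k != i} exp(-||A x_i - A x_k||^2),  p_ii = 0

       and the objective to be maximized is the expected number of correctly classified points,

              f(A) = sum_i sum_{j in C_i} p_ij,   where C_i = { j : label(j) = label(i) }.

       The 'Denominator of p_i is 0!' warning discussed below corresponds to the denominator of
       p_ij underflowing to zero when points are far apart.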

       To work, this algorithm needs labeled data. The labels can be given either as the last row
       of the input dataset (--input_file) or in a separate file (--labels_file).
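
       For example, a minimal invocation (the file names below are placeholders) might look like:

              $ mlpack_nca -i dataset.csv -l labels.csv -o distance.csv -v

       If the labels are already stored as the last row of dataset.csv, --labels_file can simply
       be omitted.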

       This implementation of NCA uses either stochastic gradient descent or the L-BFGS optimizer.
       Neither optimizer guarantees convergence to a global optimum for a nonconvex objective
       function (and NCA's objective function is nonconvex), so the final results may depend on the
       random seed and other optimizer parameters.

       Stochastic gradient descent, specified by --optimizer  "sgd",  depends  primarily  on  two
       parameters:   the   step   size   (--step_size)  and  the  maximum  number  of  iterations
       (--max_iterations). In addition, a normalized starting point can  be  used  (--normalize),
       which is necessary if many warnings of the form 'Denominator of p_i is 0!' are given.
       Tuning the step size can be a tedious affair. In general, the step size is  too  large  if
       the  objective  is not mostly uniformly decreasing, or if zero-valued denominator warnings
       are being issued.  The step size is too small if the objective is  changing  very  slowly.
       Setting  the  termination  condition can be done easily once a good step size parameter is
       found; either increase the maximum iterations to a large number and allow SGD  to  find  a
       minimum,  or  set  the  maximum iterations to 0 (allowing infinite iterations) and set the
       tolerance (--tolerance) to define the maximum allowed difference  between  objectives  for
       SGD to terminate. Be careful: if the tolerance is used instead of the maximum number of
       iterations, the optimization can take a very long time and, due to the properties of the SGD
       optimizer, may never converge at all.
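
       As a sketch, an SGD run with an explicit step size, a capped number of iterations, and a
       normalized starting point (placeholder file names) might look like:

              $ mlpack_nca -i dataset.csv -l labels.csv -o distance.csv \
                    --optimizer sgd --step_size 0.01 --max_iterations 200000 --normalize -v

       To use tolerance-based termination instead, pass --max_iterations 0 --tolerance 1e-7,
       keeping the caveat above in mind.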

       The  L-BFGS  optimizer, specified by --optimizer "lbfgs", uses a back-tracking line search
       algorithm to minimize a function. The following parameters are used by L-BFGS: --num_basis
       (specifies   the   number   of   memory   points   used   by   L-BFGS),  --max_iterations,
       --armijo_constant, --wolfe, --tolerance (the optimization is terminated when the  gradient
       norm is below this value), --max_line_search_trials, --min_step and --max_step (which both
       refer to the line search routine). For more  details  on  the  L-BFGS  optimizer,  consult
       either  the  mlpack  L-BFGS  documentation  (in  lbfgs.hpp)  or  the vast set of published
       literature on L-BFGS.
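
       As a sketch, an L-BFGS run with a few of these parameters adjusted (placeholder file names,
       illustrative values) might look like:

              $ mlpack_nca -i dataset.csv -l labels.csv -o distance.csv \
                    --optimizer lbfgs --num_basis 10 --max_iterations 1000 --tolerance 1e-6 \
                    --max_line_search_trials 100 -v

       The remaining line search options (--armijo_constant, --wolfe, --min_step, --max_step) can
       usually be left at their defaults.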

       By default, the SGD optimizer is used.

REQUIRED OPTIONS

       --input_file (-i) [string]
              Input dataset to run NCA on.

       --output_file (-o) [string]
              Output file for learned distance matrix.

OPTIONS

       --armijo_constant (-A) [double]
              Armijo constant for L-BFGS. Default value 0.0001.

       --help (-h)
              Default help info.

       --info [string]
              Get help on a specific module or option.  Default value ''.

       --labels_file (-l) [string]
              File of labels for input dataset. Default value ''.

       --linear_scan (-L)
              Don't shuffle the order in which data points are visited for SGD.

       --max_iterations (-n) [int]
              Maximum number of iterations for SGD or L-BFGS  (0  indicates  no  limit).  Default
              value 500000.

       --max_line_search_trials (-T) [int]
              Maximum number of line search trials for L-BFGS. Default value 50.

       --max_step (-M) [double]
              Maximum step of line search for L-BFGS. Default value 1e+20.

       --min_step (-m) [double]
              Minimum step of line search for L-BFGS. Default value 1e-20.

       --normalize (-N)
              Use  a  normalized  starting point for optimization. This is useful for when points
              are far apart, or when SGD is returning NaN.

       --num_basis (-B) [int]
              Number of memory points to be stored for L-BFGS. Default value 5.

       --optimizer (-O) [string]
              Optimizer to use; "sgd" or "lbfgs". Default value 'sgd'.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used.  Default value 0.

       --step_size (-a) [double]
              Step size for stochastic gradient descent (alpha). Default value 0.01.

       --tolerance (-t) [double]
              Maximum tolerance for termination of SGD or L-BFGS. Default value 1e-07.

       --verbose (-v)
              Display informational messages and the full list of parameters and  timers  at  the
              end of execution.

       --version (-V)
              Display the version of mlpack.

       --wolfe (-w) [double]
              Wolfe condition parameter for L-BFGS. Default value 0.9.

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations, and theory, consult the
       documentation found at http://www.mlpack.org or included with your distribution of mlpack.

                                                                                    mlpack_nca(1)