Ubuntu Manpage: mlpack_nca - neighborhood components analysis (nca)

NAME

       mlpack_nca - neighborhood components analysis (nca)

SYNOPSIS

        mlpack_nca [-h] [-v] -i string -o string [-A double] [-l string] [-L] [-n int] [-T int] [-M double] [-m double] [-N] [-B int] [-O string] [-s int] [-a double] [-t double] [-V] [-w double]

DESCRIPTION

This program implements Neighborhood Components Analysis, both a linear dimensionality reduction
technique and a distance learning technique. The method seeks to improve k-nearest-neighbor
classification on a dataset by scaling the dimensions. The method is nonparametric, and does not require
a value of k. It works by using stochastic ("soft") neighbor assignments and using optimization
techniques over the gradient of the accuracy of the neighbor assignments.
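The stochastic neighbor assignments described above can be illustrated with a small sketch. This is not mlpack's implementation: it assumes a simplified model in which each dimension is multiplied by one scale factor, and the function names, toy points, and labels are all hypothetical.

```python
import math

def soft_neighbor_probs(points, scale):
    """Return p[i][j], the probability that point i picks point j as its
    neighbor after dimension d has been multiplied by scale[d]."""
    n = len(points)
    scaled = [[s * x for s, x in zip(scale, p)] for p in points]
    probs = []
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                weights.append(0.0)  # a point never picks itself
            else:
                d2 = sum((a - b) ** 2 for a, b in zip(scaled[i], scaled[j]))
                weights.append(math.exp(-d2))
        total = sum(weights)
        if total == 0.0:
            # Every weight underflowed to zero: the points are too far apart.
            probs.append([0.0] * n)
        else:
            probs.append([w / total for w in weights])
    return probs

def expected_accuracy(points, labels, scale):
    """Average probability that a point picks a neighbor with its own label."""
    probs = soft_neighbor_probs(points, scale)
    n = len(points)
    correct = sum(probs[i][j]
                  for i in range(n) for j in range(n)
                  if i != j and labels[i] == labels[j])
    return correct / n

# Toy data (hypothetical): dimension 0 separates the classes, dimension 1 is noise.
points = [[0.0, 0.0], [0.1, 5.0], [1.0, 0.2], [1.1, 5.1]]
labels = [0, 0, 1, 1]
unscaled = expected_accuracy(points, labels, [1.0, 1.0])
rescaled = expected_accuracy(points, labels, [3.0, 0.0])
```

Here rescaled is larger than unscaled, because stretching the informative dimension and suppressing the noisy one moves same-label points closer together. An optimizer would search for such a scaling by following the gradient of this quantity; no value of k appears anywhere.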

To work, this algorithm needs labeled data. It can be given as the last row of the input dataset
(--input_file), or alternatively in a separate file (--labels_file).
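As a rough illustration of the "labels as the last row" layout, the sketch below splits such a dataset into features and labels. It assumes an in-memory matrix stored as a list of rows in which each column is one data point; how this orientation maps to lines in the file on disk is not covered here. The helper name split_labels is hypothetical.

```python
def split_labels(matrix):
    """Split a dataset whose last row holds the labels.

    matrix is a list of rows; each column is assumed to be one data point.
    Returns (feature_rows, labels)."""
    if len(matrix) < 2:
        raise ValueError("need at least one feature row plus a label row")
    features = matrix[:-1]
    labels = [int(v) for v in matrix[-1]]
    return features, labels

# Three 2-dimensional points with labels 0, 1, 0 in the last row.
dataset = [[0.0, 1.0, 2.0],
           [3.0, 4.0, 5.0],
           [0,   1,   0]]
features, labels = split_labels(dataset)
```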

This implementation of NCA uses either stochastic gradient descent or the L-BFGS optimizer. Neither of
these optimizers guarantees global convergence for a nonconvex objective function (NCA's objective
function is nonconvex), so the final results could depend on the random seed or other optimizer
parameters.

Stochastic gradient descent, specified by --optimizer "sgd", depends primarily on two parameters: the
step size (--step_size) and the maximum number of iterations (--max_iterations). In addition, a
normalized starting point can be used (--normalize), which is necessary if many warnings of the form
'Denominator of p_i is 0!' are given. Tuning the step size can be tedious. In general, the step
size is too large if the objective is not decreasing more or less steadily, or if zero-valued denominator
warnings are being issued. The step size is too small if the objective is changing very slowly. Once a
good step size is found, setting the termination condition is easy: either increase
the maximum iterations to a large number and allow SGD to find a minimum, or set the maximum iterations
to 0 (allowing infinite iterations) and set the tolerance (--tolerance) to define the maximum allowed
difference between objectives for SGD to terminate. Be careful: relying on the tolerance instead of the
maximum iterations can take a very long time and may never converge, due to the properties of the
SGD optimizer.
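The interplay between step size, maximum iterations, and tolerance can be sketched on a toy problem. This is only an illustration under stated assumptions, not mlpack's SGD: the objective f(w) = sum_i (w - x_i)^2, the once-per-pass tolerance check, and the function name sgd are all hypothetical choices made for the example.

```python
def sgd(data, step_size, max_iterations, tolerance, start=0.0):
    """Minimize f(w) = sum_i (w - x_i)^2 by visiting one data point per step.

    max_iterations == 0 means no iteration limit, so only the tolerance
    can stop the loop. Returns (w, iterations_used)."""
    def objective(w):
        return sum((w - x) ** 2 for x in data)

    w = start
    last = objective(w)
    iterations = 0
    while max_iterations == 0 or iterations < max_iterations:
        x = data[iterations % len(data)]     # linear scan, no shuffling
        w -= step_size * 2.0 * (w - x)       # gradient of (w - x)^2
        iterations += 1
        if iterations % len(data) == 0:      # compare objectives once per pass
            current = objective(w)
            if abs(last - current) < tolerance:
                break
            last = current
    return w, iterations

data = [1.0, 2.0, 3.0]
# Terminate on tolerance only (no iteration limit).
w_tol, used_tol = sgd(data, step_size=0.01, max_iterations=0, tolerance=1e-7)
# Terminate on the iteration limit.
w_cap, used_cap = sgd(data, step_size=0.01, max_iterations=10, tolerance=1e-7)
```

In this toy setting the tolerance-only run stops close to the minimizer w = 2, while the capped run stops after exactly 10 steps, well before converging. With a step size that is too large, the tolerance-only variant of this sketch would never terminate, which is why a finite iteration limit is the safer choice while tuning.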

The L-BFGS optimizer, specified by --optimizer "lbfgs", uses a back-tracking line search algorithm to
minimize a function. The following parameters are used by L-BFGS: --num_basis (specifies the number of
memory points used by L-BFGS), --max_iterations, --armijo_constant, --wolfe, --tolerance (the
optimization is terminated when the gradient norm is below this value), --max_line_search_trials,
--min_step and --max_step (which both refer to the line search routine). For more details on the L-BFGS
optimizer, consult either the mlpack L-BFGS documentation (in lbfgs.hpp) or the vast set of published
literature on L-BFGS.
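To show how the line search parameters relate to one another, here is a minimal backtracking line search that checks only the Armijo (sufficient decrease) condition. It is a sketch, not mlpack's routine: the Wolfe condition is omitted, and the initial_step and shrink arguments are assumptions of this example rather than program options. The default values for the other arguments mirror the defaults listed under OPTIONS.

```python
def backtracking_line_search(f, x, grad, direction,
                             armijo_constant=1e-4, max_trials=50,
                             min_step=1e-20, max_step=1e20,
                             initial_step=1.0, shrink=0.5):
    """Return a step size satisfying the Armijo condition, or None.

    f: objective taking a list of floats; x: current point;
    grad: gradient at x; direction: search direction (must point downhill)."""
    fx = f(x)
    # Directional derivative of f at x along the search direction.
    slope = sum(g * d for g, d in zip(grad, direction))
    if slope >= 0:
        raise ValueError("direction is not a descent direction")
    step = initial_step
    for _ in range(max_trials):
        if step < min_step or step > max_step:
            break                            # step left the allowed range
        candidate = [xi + step * di for xi, di in zip(x, direction)]
        # Armijo condition: the decrease must be at least a fraction
        # (armijo_constant) of what the slope predicts.
        if f(candidate) <= fx + armijo_constant * step * slope:
            return step
        step *= shrink                       # backtrack and try again
    return None

def quadratic(v):
    return sum(t * t for t in v)

x0 = [3.0, -4.0]
g0 = [6.0, -8.0]                 # gradient of the quadratic at x0
d0 = [-6.0, 8.0]                 # steepest descent direction
accepted = backtracking_line_search(quadratic, x0, g0, d0)
```

On this example the full step of 1.0 overshoots to a point with the same objective value, so it is rejected, and the halved step of 0.5 is accepted.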

By default, the SGD optimizer is used.
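For scripted use, an invocation can be assembled as an argument list. The helper below is a hypothetical convenience wrapper that only builds the list and does not run the program; the file names are placeholders, and the option names are the long options documented on this page.

```python
def nca_command(input_file, output_file, labels_file=None,
                optimizer="sgd", **options):
    """Build an argument list for mlpack_nca.

    Extra keyword arguments become long options: a value of True yields a
    bare flag (for example normalize=True gives --normalize), any other
    value is passed as the option's argument."""
    argv = ["mlpack_nca",
            "--input_file", input_file,
            "--output_file", output_file]
    if labels_file:
        argv += ["--labels_file", labels_file]
    argv += ["--optimizer", optimizer]
    for name, value in options.items():
        flag = "--" + name
        if value is True:
            argv.append(flag)
        else:
            argv += [flag, str(value)]
    return argv

# SGD with a smaller step size and a normalized starting point.
sgd_argv = nca_command("data.csv", "distance.csv",
                       step_size=0.001, normalize=True)
# L-BFGS with labels in a separate file.
lbfgs_argv = nca_command("data.csv", "distance.csv",
                         labels_file="labels.csv", optimizer="lbfgs",
                         num_basis=10)
```

The resulting list could then be handed to subprocess.run on a system where mlpack_nca is installed.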

REQUIRED OPTIONS

       --input_file (-i) [string]
              Input dataset to run NCA on.

       --output_file (-o) [string]
              Output file for learned distance matrix.

OPTIONS

       --armijo_constant (-A) [double]
              Armijo constant for L-BFGS. Default value 0.0001.

       --help (-h)
              Default help info.

       --info [string]
              Get help on a specific module or option.  Default value ''.

       --labels_file (-l) [string]
              File of labels for input dataset. Default value ''.

       --linear_scan (-L)
              Don't shuffle the order in which data points are visited for SGD.

       --max_iterations (-n) [int]
              Maximum number of iterations for SGD or L-BFGS (0 indicates no limit). Default value 500000.

       --max_line_search_trials (-T) [int]
              Maximum number of line search trials for L-BFGS. Default value 50.

       --max_step (-M) [double]
              Maximum step of line search for L-BFGS. Default value 1e+20.

       --min_step (-m) [double]
              Minimum step of line search for L-BFGS. Default value 1e-20.

       --normalize (-N)
              Use a normalized starting point for optimization. This is useful when points are far apart, or
              when SGD is returning NaN.

       --num_basis (-B) [int]
              Number of memory points to be stored for L-BFGS. Default value 5.

       --optimizer (-O) [string]
              Optimizer to use; "sgd" or "lbfgs". Default value 'sgd'.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used.  Default value 0.

       --step_size (-a) [double]
              Step size for stochastic gradient descent (alpha). Default value 0.01.

       --tolerance (-t) [double]
              Maximum tolerance for termination of SGD or L-BFGS. Default value 1e-07.

       --verbose (-v)
              Display informational messages and the full list of parameters and timers at the end of execution.

       --version (-V)
              Display the version of mlpack.

       --wolfe (-w) [double]
              Wolfe condition parameter for L-BFGS. Default value 0.9.

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations, and theory, consult the documentation
       found at http://www.mlpack.org or included with your distribution of mlpack.

                                                                                                   mlpack_nca(1)