Ubuntu Manpage: mlpack_lmnn - large margin nearest neighbors (lmnn)

NAME

       mlpack_lmnn - large margin nearest neighbors (lmnn)

SYNOPSIS

        mlpack_lmnn -i string [-b int] [-C bool] [-d string] [-k int] [-l string] [-L bool] [-n int] [-N bool] [-O string] [-p int] [-P bool] [-R int] [-A int] [-r double] [-s int] [-a double] [-t double] [-V bool] [-c string] [-o string] [-D string] [-h -v]

DESCRIPTION

       This program implements Large Margin Nearest Neighbors (LMNN), a distance learning
       technique that seeks to improve k-nearest-neighbor classification on a dataset. The
       method reduces the distance between similarly labeled data points (a.k.a. target
       neighbors) and increases the distance between differently labeled points (a.k.a.
       impostors), using standard optimization techniques over the gradient of the distance
       between data points.

       To work, this algorithm needs labeled data. Labels can be given as the last row of the
       input dataset (specified with '--input_file (-i)'), or alternatively as a separate
       matrix (specified with '--labels_file (-l)'). Additionally, a starting point for
       optimization (specified with '--distance_file (-d)') can be given, having (r x d)
       dimensionality. Here r should satisfy 1 <= r <= d. Consequently, a low-rank matrix will
       be optimized. Alternatively, a low-rank distance can be learned by specifying the
       '--rank (-A)' parameter (a low-rank matrix with uniformly distributed values will be
       used as the initial learning point).
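
       As a rough sketch of the low-rank option described above, the call below requests a
       rank-2 matrix through '--rank (-A)' instead of supplying '--distance_file (-d)'. The
       file names, the rank of 2 and the value of k are assumptions chosen for illustration;
       the guard only avoids running the command when mlpack_lmnn or the input files are
       missing.

```shell
# Hypothetical file names; replace them with your own data.
input=iris.csv
labels=iris_labels.csv
rank=2   # assumed value; must satisfy 1 <= rank <= number of dimensions

if command -v mlpack_lmnn >/dev/null 2>&1 && [ -f "$input" ] && [ -f "$labels" ]; then
  # Learn a low-rank distance matrix; the initial point is chosen by the program.
  mlpack_lmnn --input_file "$input" --labels_file "$labels" \
    --k 3 --rank "$rank" --output_file distance.csv
else
  echo "mlpack_lmnn or input files not found; nothing was run"
fi
```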

       The program also requires the number of target neighbors to work with (specified with
       '--k (-k)'). A regularization parameter can also be passed; it acts as a trade-off
       between the pulling and pushing terms (specified with '--regularization (-r)'). In
       addition, this implementation of LMNN includes a parameter to decide the interval after
       which impostors must be recalculated (specified with '--range (-R)').

       Output can be the learned distance matrix (specified with '--output_file (-o)'), the
       transformed dataset (specified with '--transformed_data_file (-D)'), or both.
       Additionally, the mean-centered dataset (specified with '--centered_data_file (-c)') can
       be accessed, provided mean-centering (specified with '--center (-C)') is performed on
       the dataset. Accuracy on the initial dataset and the final transformed dataset can be
       printed by specifying the '--print_accuracy (-P)' parameter.
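
       The output options above can be combined in a single run. The following is a minimal
       sketch, assuming hypothetical file names and assuming that boolean options such as
       '--center (-C)' and '--print_accuracy (-P)' are passed as bare switches.

```shell
# Hypothetical file names; replace them with your own data.
input=iris.csv
labels=iris_labels.csv
centered_out=centered.csv
transformed_out=transformed.csv

if command -v mlpack_lmnn >/dev/null 2>&1 && [ -f "$input" ] && [ -f "$labels" ]; then
  # Mean-center the data, save the learned matrix, the centered data and the
  # transformed data, and print accuracy on the initial and transformed data.
  # --verbose is added so that informational messages are displayed.
  mlpack_lmnn --input_file "$input" --labels_file "$labels" --k 3 \
    --center --centered_data_file "$centered_out" \
    --output_file distance.csv \
    --transformed_data_file "$transformed_out" \
    --print_accuracy --verbose
else
  echo "mlpack_lmnn or input files not found; nothing was run"
fi
```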

       This implementation of LMNN uses AMSGrad, BigBatch_SGD, stochastic gradient descent,
       mini-batch stochastic gradient descent, or the L-BFGS optimizer.

       AMSGrad, specified by the value 'amsgrad' for the parameter '--optimizer (-O)', uses the
       maximum of past squared gradients. It depends primarily on three parameters: the step
       size (specified with '--step_size (-a)'), the batch size (specified with '--batch_size
       (-b)'), and the maximum number of passes (specified with '--passes (-p)'). In addition,
       a normalized starting point can be used by specifying the '--normalize (-N)' parameter.

       BigBatch_SGD, specified by the value 'bbsgd' for the parameter '--optimizer (-O)', depends
       primarily on three parameters: the step size (specified with '--step_size (-a)'), the
       batch size (specified with '--batch_size (-b)'), and the maximum number of passes
       (specified with '--passes (-p)'). In addition, a normalized starting point can be used by
       specifying the '--normalize (-N)' parameter.

       Stochastic gradient descent, specified by the value 'sgd' for the parameter '--optimizer
       (-O)', depends primarily on three parameters: the step size (specified with '--step_size
       (-a)'), the batch size (specified with '--batch_size (-b)'), and the maximum number of
       passes (specified with '--passes (-p)'). In addition, a normalized starting point can be
       used by specifying the '--normalize (-N)' parameter. Furthermore, mean-centering can be
       performed on the dataset by specifying the '--center (-C)' parameter.
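
       The SGD parameters described above might be set explicitly as in the sketch below. The
       file names are hypothetical, the numeric values are simply the defaults listed under
       OPTIONAL INPUT OPTIONS, and boolean options are assumed to be passed as bare switches.

```shell
# Hypothetical file names; replace them with your own data.
input=iris.csv
labels=iris_labels.csv
step_size=0.01   # default listed for --step_size
batch_size=50    # default listed for --batch_size
passes=50        # default listed for --passes

if command -v mlpack_lmnn >/dev/null 2>&1 && [ -f "$input" ] && [ -f "$labels" ]; then
  # Plain SGD with a normalized starting point and mean-centered data.
  mlpack_lmnn --input_file "$input" --labels_file "$labels" --k 3 \
    --optimizer sgd --step_size "$step_size" \
    --batch_size "$batch_size" --passes "$passes" \
    --normalize --center --output_file distance.csv
else
  echo "mlpack_lmnn or input files not found; nothing was run"
fi
```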

       The L-BFGS optimizer, specified by the value 'lbfgs' for the parameter '--optimizer (-O)',
       uses a back-tracking line search algorithm to minimize a function. The following
       parameters are used by L-BFGS: '--max_iterations (-n)' and '--tolerance (-t)' (the
       optimization is terminated when the gradient norm is below this value). For more details
       on the L-BFGS optimizer, consult either the mlpack L-BFGS documentation (in lbfgs.hpp) or
       the vast set of published literature on L-BFGS. In addition, a normalized starting point
       can be used by specifying the '--normalize (-N)' parameter.
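
       A call selecting L-BFGS could look like the sketch below. The file names, the iteration
       limit and the tolerance are assumptions chosen for illustration, not recommended
       settings.

```shell
# Hypothetical file names; replace them with your own data.
input=iris.csv
labels=iris_labels.csv
optimizer=lbfgs
max_iterations=1000   # assumed value; 0 would indicate no limit
tolerance=1e-6        # assumed value; stop when the gradient norm is below it

if command -v mlpack_lmnn >/dev/null 2>&1 && [ -f "$input" ] && [ -f "$labels" ]; then
  # L-BFGS ignores the step size, batch size and passes used by the SGD variants.
  mlpack_lmnn --input_file "$input" --labels_file "$labels" --k 3 \
    --optimizer "$optimizer" --max_iterations "$max_iterations" \
    --tolerance "$tolerance" --output_file distance.csv
else
  echo "mlpack_lmnn or input files not found; nothing was run"
fi
```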

       By default, the AMSGrad optimizer is used.

       Example - Let's say we want to learn a distance on the iris dataset with the number of
       target neighbors set to 3, using the BigBatch_SGD optimizer. A simple call for this will
       look like:

       $ mlpack_lmnn --input_file iris.csv --labels_file iris_labels.csv --k 3 --optimizer
       bbsgd --output_file output.csv

       Another program call, making use of the range and regularization parameters with a
       dataset having labels as the last column, can be made as:

       $ mlpack_lmnn --input_file letter_recognition.csv --k 5 --range 10 --regularization
       0.4 --output_file output.csv

REQUIRED INPUT OPTIONS

       --input_file (-i) [string]
              Input dataset to run LMNN on.

OPTIONAL INPUT OPTIONS

       --batch_size (-b) [int]
              Batch size for mini-batch SGD. Default value 50.

       --center (-C) [bool]
              Perform mean-centering on the dataset. It is useful when the centroid of  the  data
              is far from the origin.

       --distance_file (-d) [string]
              Initial distance matrix to be used as starting point.

       --help (-h) [bool]
              Default help info.

       --info [string]
              Print help on a specific option. Default value ''.

       --k (-k) [int]
              Number of target neighbors to use for each datapoint. Default value 1.

       --labels_file (-l) [string]
              Labels for input dataset.

       --linear_scan (-L) [bool]
              Don't shuffle the order in which data points are visited for SGD or mini-batch SGD.

       --max_iterations (-n) [int]
              Maximum  number  of  iterations  for  L-BFGS  (0 indicates no limit). Default value
              100000.

       --normalize (-N) [bool]
              Use a normalized starting point for optimization. It is useful when points are
              far apart, or when SGD is returning NaN.

       --optimizer (-O) [string]
              Optimizer to use; 'amsgrad', 'bbsgd', 'sgd', or 'lbfgs'. Default value 'amsgrad'.

       --passes (-p) [int]
              Maximum number of full passes over dataset for AMSGrad, BB_SGD and SGD. Default
              value 50.

       --print_accuracy (-P) [bool]
              Print accuracies on initial and transformed dataset.

       --range (-R) [int]
              Number of iterations after which impostors need to be recalculated. Default
              value 1.

       --rank (-A) [int]
              Rank of distance matrix to be optimized.  Default value 0.

       --regularization (-r) [double]
              Regularization for LMNN objective function. Default value 0.5.

       --seed (-s) [int]
              Random seed. If 0, 'std::time(NULL)' is used.  Default value 0.

       --step_size (-a) [double]
              Step size for AMSGrad, BB_SGD and SGD (alpha).  Default value 0.01.

       --tolerance (-t) [double]
              Maximum tolerance for termination of AMSGrad, BB_SGD, SGD or L-BFGS. Default  value
              1e-07.

       --verbose (-v) [bool]
              Display  informational  messages  and the full list of parameters and timers at the
              end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS

       --centered_data_file (-c) [string]
              Output matrix for mean-centered dataset.

       --output_file (-o) [string]
              Output matrix for learned distance matrix.

       --transformed_data_file (-D) [string]
              Output matrix for transformed dataset.

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations,  and  theory,  consult  the
       documentation found at http://www.mlpack.org or included with your distribution of mlpack.