Ubuntu Manpage: mlpack_approx_kfn - approximate furthest neighbor search

NAME

       mlpack_approx_kfn - approximate furthest neighbor search

SYNOPSIS

        mlpack_approx_kfn [-a string] [-e bool] [-x string] [-m unknown] [-k int] [-p int] [-t int] [-q string] [-r string] [-V bool] [-d string] [-n string] [-M unknown] [-h -v]

DESCRIPTION

This program implements two strategies for furthest neighbor search. These strategies are:

• The 'qdafn' algorithm from "Approximate Furthest Neighbor in High Dimensions" by
R. Pagh, F. Silvestri, J. Sivertsen, and M. Skala, in Similarity Search and
Applications 2015 (SISAP).

• The 'DrusillaSelect' algorithm from "Fast approximate furthest neighbors with
data-dependent candidate selection", by R.R. Curtin and A.B. Gardner, in
Similarity Search and Applications 2016 (SISAP).

These two strategies give approximate results for the furthest neighbor search problem and
can be used as fast replacements for other furthest neighbor techniques such as those
found in the mlpack_kfn program. Note that the 'ds' (DrusillaSelect) algorithm typically
requires far fewer tables and projections than the 'qdafn' algorithm.
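
To make "approximate" concrete, the following is a minimal Python sketch, not mlpack's implementation and not a reproduction of either paper's algorithm. It contrasts a brute-force exact search with a simplified candidate-selection scheme that only examines points lying far out along a few random directions. The function names and the parameters `num_directions` and `candidates_per_direction` are hypothetical and do not correspond exactly to '--num_tables' or '--num_projections'.

```python
import math
import random


def exact_kfn(reference, query, k):
    """Brute force: return [(index, distance)] for the k furthest points, furthest first."""
    scored = [(math.dist(point, query), i) for i, point in enumerate(reference)]
    # Sort by descending distance; break ties by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, d) for d, i in scored[:k]]


def approx_kfn_by_projection(reference, query, k, num_directions=4,
                             candidates_per_direction=3, seed=0):
    """Simplified illustration: search only points that are extreme along random directions."""
    rng = random.Random(seed)
    dim = len(reference[0])
    candidates = set()
    for _ in range(num_directions):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Rank all reference points by their projection onto this direction.
        ranked = sorted(
            range(len(reference)),
            key=lambda i: sum(a * b for a, b in zip(reference[i], direction)),
            reverse=True,
        )
        candidates.update(ranked[:candidates_per_direction])
    subset = sorted(candidates)
    # Exact search restricted to the (usually much smaller) candidate set.
    best = exact_kfn([reference[i] for i in subset], query, k)
    return [(subset[local], d) for local, d in best]


if __name__ == "__main__":
    points = [(0, 0), (1, 0), (0, 1), (5, 5), (-4, -3)]
    print(exact_kfn(points, (0, 0), 2))
    print(approx_kfn_by_projection(points, (0, 0), 2))
```

The approximate search can never report a distance larger than the exact one, because it inspects a subset of the same points; the trade-off is that a small candidate set may miss the true furthest neighbor.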

Specify a reference set (the set to search in) with '--reference_file (-r)' and a query
set with '--query_file (-q)'. Algorithm parameters may be set with '--num_tables (-t)' and
'--num_projections (-p)'; if they are omitted, defaults are used. The algorithm to be used
(either 'ds', the default, or 'qdafn') may be specified with '--algorithm (-a)'. Specify
the number of neighbors to search for with '--k (-k)'.

If no query set is specified, the reference set will be used as the query set. The
'--output_model_file (-M)' output parameter may be used to store the built model, and an
input model may be loaded instead of specifying a reference set with the
'--input_model_file (-m)' option.

Results for each query point can be stored with the '--neighbors_file (-n)' and
'--distances_file (-d)' output parameters. Each row of these output matrices holds the k
neighbor indices or distances for one query point.
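
As a hedged illustration of that layout, the Python sketch below pairs up the two output files row by row. It assumes both files are plain comma-separated text with one row per query point and k values per row; the helper name `read_kfn_results` and the sample values are made up for the example, and parsing indices through `float` is a precaution in case they are written as '3.0' rather than '3'.

```python
import csv
import io


def read_kfn_results(neighbors_csv, distances_csv):
    """Return one list of (neighbor_index, distance) pairs per query point."""
    neighbor_rows = [row for row in csv.reader(neighbors_csv) if row]
    distance_rows = [row for row in csv.reader(distances_csv) if row]
    if len(neighbor_rows) != len(distance_rows):
        raise ValueError("files describe different numbers of query points")
    results = []
    for n_row, d_row in zip(neighbor_rows, distance_rows):
        # Column j of both rows refers to the same neighbor of this query point.
        results.append([(int(float(n)), float(d)) for n, d in zip(n_row, d_row)])
    return results


if __name__ == "__main__":
    # Made-up contents standing in for neighbors.csv and distances.csv (k = 2).
    neighbors = io.StringIO("3,4\n0,2\n")
    distances = io.StringIO("7.5,5.0\n2.0,1.0\n")
    for query_index, pairs in enumerate(read_kfn_results(neighbors, distances)):
        print(query_index, pairs)
```

With real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`.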

For example, to find the 5 approximate furthest neighbors with 'reference_set.csv' as the
reference set and 'query_set.csv' as the query set using DrusillaSelect, storing the
furthest neighbor indices to 'neighbors.csv' and the furthest neighbor distances to
'distances.csv', one could call

$ mlpack_approx_kfn --query_file query_set.csv --reference_file reference_set.csv --k 5 \
    --algorithm ds --neighbors_file neighbors.csv --distances_file distances.csv

and to perform approximate all-furthest-neighbors search with k=1 on the set 'data.csv',
storing only the furthest neighbor distances to 'distances.csv', one could call

$ mlpack_approx_kfn --reference_file data.csv --k 1 --distances_file distances.csv

A trained model can be re-used. If a model has been previously saved to 'model.bin', then
we may find 3 approximate furthest neighbors on a query set 'new_query_set.csv' using that
model and store the furthest neighbor indices into 'neighbors.csv' by calling

$ mlpack_approx_kfn --input_model_file model.bin --query_file new_query_set.csv --k 3 \
    --neighbors_file neighbors.csv
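
When the program is driven from a script, invocations like those above can be assembled as an argument list. The following Python sketch uses only the long options documented on this page; the helper name `approx_kfn_command` is hypothetical, and the check that exactly one of the reference set or the input model is supplied is an assumption based on the "instead of" wording above rather than a documented rule.

```python
import shlex


def approx_kfn_command(k, reference_file=None, query_file=None, algorithm=None,
                       input_model_file=None, output_model_file=None,
                       neighbors_file=None, distances_file=None):
    """Build the argv list for one mlpack_approx_kfn run."""
    # Assumption: a run needs a reference set or a saved model, but not both.
    if (reference_file is None) == (input_model_file is None):
        raise ValueError("give exactly one of reference_file or input_model_file")
    options = [
        ("--input_model_file", input_model_file),
        ("--reference_file", reference_file),
        ("--query_file", query_file),
        ("--k", k),
        ("--algorithm", algorithm),
        ("--neighbors_file", neighbors_file),
        ("--distances_file", distances_file),
        ("--output_model_file", output_model_file),
    ]
    argv = ["mlpack_approx_kfn"]
    for flag, value in options:
        if value is not None:
            argv += [flag, str(value)]
    return argv


if __name__ == "__main__":
    # Reuse a saved model on a new query set, as in the last example above.
    cmd = approx_kfn_command(3, input_model_file="model.bin",
                             query_file="new_query_set.csv",
                             neighbors_file="neighbors.csv")
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a system where mlpack_approx_kfn is installed.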

OPTIONAL INPUT OPTIONS

       --algorithm (-a) [string]
              Algorithm to use: 'ds' or 'qdafn'. Default value 'ds'.

       --calculate_error (-e) [bool]
              If set, calculate the average distance error for the first furthest neighbor only.

       --exact_distances_file (-x) [string]
              Matrix containing exact distances to furthest neighbors; this can be used to avoid
              explicit calculation when --calculate_error is set.

       --help (-h) [bool]
              Default help info.

       --info [string]
              Print help on a specific option. Default value ''.

       --input_model_file (-m) [unknown]
              File containing input model.

       --k (-k) [int]
              Number of furthest neighbors to search for. Default value 0.

       --num_projections (-p) [int]
              Number of projections to use in each hash table. Default value 5.

       --num_tables (-t) [int]
              Number of hash tables to use. Default value 5.

       --query_file (-q) [string]
              Matrix containing query points.

       --reference_file (-r) [string]
              Matrix containing the reference dataset.

       --verbose (-v) [bool]
              Display informational messages and the full list of parameters and timers at the
              end of execution.

       --version (-V) [bool]
              Display the version of mlpack.
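
This page does not define the error reported by '--calculate_error'. As one plausible reading, the Python sketch below averages, over all query points, the ratio of the exact furthest-neighbor distance to the approximate one for the first neighbor only. Both the formula and the function name `average_first_neighbor_error` are assumptions made for illustration. Under this definition a value of 1.0 means every approximate result matched the exact distance, and larger values mean worse approximations.

```python
def average_first_neighbor_error(exact_distances, approx_distances):
    """Mean of exact/approximate distance ratios for the first furthest neighbor."""
    if len(exact_distances) != len(approx_distances):
        raise ValueError("need one exact and one approximate distance per query point")
    if not exact_distances:
        raise ValueError("no query points given")
    total = 0.0
    for exact, approx in zip(exact_distances, approx_distances):
        if approx <= 0:
            raise ValueError("approximate distance must be positive")
        total += exact / approx
    return total / len(exact_distances)


if __name__ == "__main__":
    # Made-up distances for two query points.
    print(average_first_neighbor_error([10.0, 8.0], [10.0, 4.0]))
```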

OPTIONAL OUTPUT OPTIONS

       --distances_file (-d) [string]
              Matrix to save furthest neighbor distances to.

       --neighbors_file (-n) [string]
              Matrix to save neighbor indices to.

       --output_model_file (-M) [unknown]
              File to save output model to.

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations, and theory, consult the
       documentation found at http://www.mlpack.org or included with your distribution of mlpack.