Provided by: mlpack-bin_3.4.2-7ubuntu1_amd64

NAME

       mlpack_kde - kernel density estimation

SYNOPSIS

        mlpack_kde [-E double] [-a string] [-b double] [-s int] [-m unknown] [-k string] [-c double] [-C double] [-P double] [-S bool] [-q string] [-r string] [-e double] [-t string] [-V bool] [-M unknown] [-p string] [-h -v]

DESCRIPTION

       This program performs Kernel Density Estimation (KDE), a non-parametric way of
       estimating a probability density function. For each query point, the program estimates
       its probability density by applying a kernel function to each reference point. The
       computational complexity of this is O(N^2) when there are N query points and N reference
       points, but this implementation will typically perform better because it uses an
       approximate dual-tree or single-tree algorithm for acceleration.
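
       As a point of comparison, the exact O(N^2) computation described above can be sketched
       in a few lines. This is only an illustration, not mlpack's implementation: the function
       names are hypothetical, the Gaussian kernel is left unnormalized, and the result is
       simply averaged over the reference points, so the values are proportional to a density
       rather than calibrated to match the program's output.

```python
import math


def gaussian_kernel(distance, bandwidth):
    """Unnormalized Gaussian kernel value for a given distance (assumed form)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def naive_kde(query_points, reference_points, bandwidth):
    """Evaluate the kernel at every (query, reference) pair: O(N^2) work."""
    estimates = []
    for q in query_points:
        total = 0.0
        for r in reference_points:
            # Euclidean distance between the query and the reference point.
            total += gaussian_kernel(math.dist(q, r), bandwidth)
        estimates.append(total / len(reference_points))
    return estimates


# Hypothetical toy data: three reference points near the origin.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
queries = [(0.0, 0.0), (5.0, 5.0)]
densities = naive_kde(queries, reference, bandwidth=1.0)
```

       The query at the origin receives a much larger estimate than the distant query, and
       every pair of points is visited once, which is the cost the tree-based algorithms try
       to avoid.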

       Dual-tree or single-tree optimization avoids many barely relevant calculations (as
       kernel function values decrease with distance), so it is an approximate computation. You
       can specify the maximum relative error tolerance for each query value with '--rel_error
       (-e)' as well as the maximum absolute error tolerance with '--abs_error (-E)'. This
       program uses a Euclidean metric. The kernel function can be selected using the '--kernel
       (-k)' option. You can also choose which type of tree to use for the dual-tree algorithm
       with '--tree (-t)'. It is also possible to select whether to use the dual-tree algorithm
       or the single-tree algorithm using the '--algorithm (-a)' option.
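
       To see why distant reference points are barely relevant, it can help to tabulate kernel
       values at increasing distances. The sketch below assumes the common unnormalized forms
       of the Gaussian and Epanechnikov kernels; the exact forms and normalization used by
       mlpack are not given in this page, and the function names are illustrative.

```python
import math


def gaussian(distance, bandwidth):
    """Assumed unnormalized Gaussian kernel: exp(-d^2 / (2 h^2))."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def epanechnikov(distance, bandwidth):
    """Assumed unnormalized Epanechnikov kernel: max(0, 1 - (d / h)^2)."""
    return max(0.0, 1.0 - (distance / bandwidth) ** 2)


bandwidth = 0.2
distances = [0.0, 0.1, 0.2, 0.5, 1.0]

# Kernel values shrink as the distance grows relative to the bandwidth.
gaussian_values = [gaussian(d, bandwidth) for d in distances]
epanechnikov_values = [epanechnikov(d, bandwidth) for d in distances]

for d, g, e in zip(distances, gaussian_values, epanechnikov_values):
    print(f"distance={d:.1f}  gaussian={g:.6f}  epanechnikov={e:.6f}")
```

       Under these assumptions the Epanechnikov kernel is exactly zero once the distance
       reaches the bandwidth, while the Gaussian kernel decays quickly but never reaches zero,
       which is why skipping far-away points introduces only a small, bounded error.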

       Monte Carlo estimations can be used to accelerate the KDE estimate when the Gaussian
       kernel is used. This provides a probabilistic guarantee on the error of the resulting
       KDE instead of an absolute guarantee. To enable Monte Carlo estimations, the
       '--monte_carlo (-S)' flag can be used, and the success probability can be set with the
       '--mc_probability (-P)' option. The initial sample size for the Monte Carlo estimation
       can be set using '--initial_sample_size (-s)'. This implementation will only consider a
       node as a candidate for the Monte Carlo estimation if its number of descendants is
       larger than the initial sample size multiplied by a coefficient, which can be set using
       '--mc_entry_coef (-C)'. To avoid performing as many computations as an exact approach
       would, the program recurses down the tree once a given fraction of the node's descendant
       points has already been computed. This fraction is set using '--mc_break_coef (-c)'.

       For example, the following will run KDE using the data in 'ref_data.csv' for training  and
       the  data  in 'qu_data.csv' as query data. It will apply an Epanechnikov kernel with a 0.2
       bandwidth to each reference point and use a KD-Tree for the  dual-tree  optimization.  The
       returned predictions will be within 5% of the real KDE value for each query point.

       $  mlpack_kde  --reference_file  ref_data.csv  --query_file  qu_data.csv  --bandwidth  0.2
       --kernel epanechnikov --tree kd-tree --rel_error 0.05 --predictions_file out_data.csv

       The predicted density estimations will be stored in 'out_data.csv'. If no '--query_file
       (-q)' is provided, then KDE will be computed on the '--reference_file (-r)' dataset. It
       is possible to select either a reference dataset or an input model, but not both at the
       same time. If an input model is selected and parameter values are not set (e.g.
       '--bandwidth (-b)'), then default parameter values will be used.
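
       The "within 5%" statement in the example above can be read as a bound on the relative
       deviation of each prediction from the exact KDE value. The following sketch shows one
       way to check such a bound. The helper name and the sample numbers are hypothetical, and
       this page does not say how the program combines '--rel_error' with '--abs_error', so
       only the relative part is shown.

```python
def within_relative_error(estimate, exact, rel_error):
    """Return True if estimate deviates from exact by at most rel_error * |exact|."""
    return abs(estimate - exact) <= rel_error * abs(exact)


# Made-up values for illustration only; they are not output of mlpack_kde.
exact_density = 0.40
ok = within_relative_error(0.41, exact_density, rel_error=0.05)
too_far = within_relative_error(0.43, exact_density, rel_error=0.05)
```

       With a 5% tolerance and an exact value of 0.40, any prediction between 0.38 and 0.42
       would satisfy this check.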

       Building on the previous command, it is also possible to activate Monte Carlo
       estimations if a Gaussian kernel is used. This can provide faster results, but the KDE
       will only have a probabilistic guarantee of meeting the desired error bound (instead of
       an absolute guarantee). The following example will run KDE using a Monte Carlo
       estimation when possible. The results will be within 5% of the real KDE value with a
       95% probability. The initial sample size for the Monte Carlo estimation will be 200
       points, and a node will be a candidate for the estimation only when it contains 700
       (i.e. 3.5*200) points. If a node contains 700 points and 420 (i.e. 0.6*700) have already
       been sampled, then the algorithm will recurse instead of continuing to sample.

       $  mlpack_kde  --reference_file  ref_data.csv  --query_file  qu_data.csv  --bandwidth  0.2
       --kernel   gaussian   --tree  kd-tree  --rel_error  0.05  --predictions_file  out_data.csv
       --monte_carlo  --mc_probability  0.95  --initial_sample_size   200   --mc_entry_coef   3.5
       --mc_break_coef 0.6
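
       The two thresholds quoted for this example follow directly from the option values. The
       arithmetic can be sketched as below; the helper names are hypothetical and only
       reproduce the products described above, not the program's internal logic.

```python
def mc_entry_threshold(initial_sample_size, mc_entry_coef):
    """Node size needed before Monte Carlo estimation is considered."""
    return mc_entry_coef * initial_sample_size


def mc_break_threshold(num_descendants, mc_break_coef):
    """Number of sampled points after which the algorithm recurses instead."""
    return mc_break_coef * num_descendants


# Values taken from the command above.
entry = mc_entry_threshold(initial_sample_size=200, mc_entry_coef=3.5)
limit = mc_break_threshold(num_descendants=700, mc_break_coef=0.6)
print(f"candidate node size: {entry:.0f} points")
print(f"sampling limit for a 700-point node: {limit:.0f} points")
```

       Raising '--mc_entry_coef' restricts Monte Carlo estimation to larger nodes, while
       lowering '--mc_break_coef' makes the algorithm give up on sampling sooner.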

OPTIONAL INPUT OPTIONS

       --abs_error (-E) [double]
              Absolute error tolerance for the prediction. Default value 0.

       --algorithm (-a) [string]
              Algorithm to use for the prediction ('dual-tree', 'single-tree'). Default value
              'dual-tree'.

       --bandwidth (-b) [double]
              Bandwidth of the kernel. Default value 1.

       --help (-h) [bool]
              Default help info.

       --info [string]
              Print help on a specific option. Default value ''.

       --initial_sample_size (-s) [int]
              Initial sample size for Monte Carlo estimations. Default value 100.

       --input_model_file (-m) [unknown]
              Contains pre-trained KDE model.

       --kernel (-k) [string]
              Kernel to use for the prediction ('gaussian', 'epanechnikov', 'laplacian',
              'spherical', 'triangular'). Default value 'gaussian'.

       --mc_break_coef (-c) [double]
              Controls the fraction of a node's descendants that may be sampled before the
              algorithm recurses. Default value 0.4.

       --mc_entry_coef (-C) [double]
              Controls how much larger than the initial sample size the number of a node's
              descendants has to be for the node to be a candidate for Monte Carlo
              estimations. Default value 3.

       --mc_probability (-P) [double]
              Probability of the estimation being bounded by  relative  error  when  using  Monte
              Carlo estimations. Default value 0.95.

       --monte_carlo (-S) [bool]
              Whether to use Monte Carlo estimations when possible.

       --query_file (-q) [string]
              Query dataset to run KDE on.

       --reference_file (-r) [string]
              Input reference dataset to use for KDE.

       --rel_error (-e) [double]
              Relative error tolerance for the prediction.  Default value 0.05.

       --tree (-t) [string]
              Tree to use for the prediction ('kd-tree', 'ball-tree', 'cover-tree', 'octree',
              'r-tree'). Default value 'kd-tree'.

       --verbose (-v) [bool]
              Display informational messages and the full list of parameters and  timers  at  the
              end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS

       --output_model_file (-M) [unknown]
              If specified, the KDE model will be saved here.

       --predictions_file (-p) [string]
              Vector to store density predictions.

ADDITIONAL INFORMATION

       For  further  information,  including  relevant papers, citations, and theory, consult the
       documentation found at http://www.mlpack.org or included with your distribution of mlpack.