Ubuntu Manpage: lfit - general purpose evaluation and regression analysis tool

NAME

       lfit - general purpose evaluation and regression analysis tool

SYNOPSIS

       lfit [method of analysis] [options] <input> [-o, --output <output>]

DESCRIPTION

       The program `lfit` is a standalone, command-line driven tool designed for both interactive
       and batch-processed data analysis and regression. The program runs in one of two modes.
       First, `lfit` supports numerous regression analysis methods that can be used to search
       for "best fit" parameters of model functions in order to model the input data (which are
       read from one or more input files in tabulated form). Second, `lfit` can read input data
       and perform various arithmetic operations on them as specified by the user. This second
       mode is typically used to evaluate the model functions with the parameters derived by
       the regression methods; only slight changes to the command line arguments are needed to
       switch from regression to evaluation.

OPTIONS

   General options:
       -h, --help
              Gives a general summary of the command line options.

       --long-help, --help-long
              Gives a detailed list of command line options.

       --wiki-help, --help-wiki, --mediawiki-help, --help-mediawiki
              Gives a detailed list of command line options in Mediawiki format.

       --version, --version-short, --short-version
              Gives some version information about the program.

       --functions, --list-functions, --function-list
              Lists  the  available arithmetic operations and built-in functions supported by the
              program.

       --examples
              Prints some very basic examples for the program invocation.

   Common options for regression analysis:
       -v, --variable, --variables <list-of-variables>
              Comma-separated list of regression variables.  In  case  of  non-linear  regression
              analysis,  all  of  these  fit  variables  are expected to have some initial values
              (specified as <name>=<value>), otherwise the initial values are  set  to  be  zero.
              Note  that  in  the  case  of  some  of the regression/analysis methods, additional
              parameters should be assigned to these fit/regression variables.  See  the  section
              "Regression analysis methods" for additional details.

       -c, --column, --columns <independent>[:<column index>],...
              Comma-separated list of independent variable names as read from the subsequent
              columns of the primary input data file. If the independent  variables  are  not  in
              sequential  order  in the input file, the optional column indices should be defined
              for each variable, by separating the column index with a colon after  the  name  of
              the  variable. In the case of multiple input files and data blocks, the user should
              assign the individual independent variables and the  respective  column  names  and
              definitions for each file (see later, Sec. "Multiple data blocks").

       -f, --function <model function>
              Model  function  of  the analysis in a symbolic form. This expression for the model
              function  should  contain  built-in  arithmetic  operators,   built-in   functions,
              user-defined  macros  (see  -x,  --define) or functions provided by the dynamically
              loaded external modules (see -d, --dynamic). The model function can depend on  both
              the  fit/regression  variables  (see -v, --variables) and the independent variables
              read from the input file (see -c, --columns). In the case of multiple  input  files
              and  data  blocks,  the  user should assign the respective model functions for each
              data block (see later). Note that some of the analysis methods expect the model
              function to be either differentiable or linear in the fit/regression variables. See
              "Regression analysis methods" below for more details.

       -y, --dependent <dependent expression>
              The dependent variable of the regression analysis,  in  a  form  of  an  arithmetic
              expression.  This  expression  for  the  dependent  variable can depend only on the
              variables read from the input file (see -c, --columns). In  the  case  of  multiple
              input  files  and  data  blocks,  the  user  should assign the respective dependent
              expressions for each data block (see later).

       -o, --output <output file>
              Name  of  the  output  file  into  which  the  fit  results  (the  values  for  the
              fit/regression variables) are written.

   Common options for function evaluation:
       -f, --function <function to evaluate>[...]
              List  of  functions  to  be  evaluated. More expressions can be specified by either
              separating the subsequent  expressions  by  a  comma  or  by  specifying  more  -f,
              --function options in the command line.

       Note that the two basic modes of `lfit` are distinguished only by the presence or
       absence of the -y, --dependent command line argument. In other words, there is no
       explicit command line argument that specifies the mode of `lfit`. If the -y, --dependent
       command line argument is omitted, `lfit` runs in function evaluation mode; otherwise the
       program runs in regression analysis mode.

       -o, --output <output file>
              Name  of  the  output  file  in  which  the  results of the function evaluation are
              written.

   Regression analysis methods:
       -L, --clls, --linear
              The default mode of `lfit`, the classical linear least squares (CLLS)  method.  The
              model   functions   specified   after  -f,  --function  are  expected  to  be  both
              differentiable and linear with respect to the fit/regression variables. Otherwise,
              `lfit` detects the non-differentiable or non-linear property of the model
              function(s) and refuses the analysis. In this case, other regression analysis
              methods can be applied as needed, for instance the Levenberg-Marquardt algorithm
              (NLLM, see -N, --nllm) or the downhill simplex minimization (DHSX, see -D, --dhsx).
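
       As an illustration of what a classical linear least squares fit computes, the following
       Python sketch fits the model a*x+b by solving the 2x2 normal equations directly. It is
       not the implementation used by `lfit`; the function name fit_line and the sample data
       are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b via the 2x2 normal equations."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("x values are degenerate; the fit is not unique")
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b


# Hypothetical data lying exactly on y = 2*x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b)
```

       Because the model is linear in a and b, the solution is obtained in a single step and
       no initial values are needed for the fit variables.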

       -N, --nllm, --nonlinear
              This option implies a regression involving the nonlinear Levenberg-Marquardt (NLLM)
              minimization algorithm. The model function(s) specified after  -f,  --function  are
              expected  to  be  differentiable  with  respect  to  the  fit/regression variables.
              Otherwise, `lfit` detects the non-differentiable property and refuses the analysis.
              There are some fine-tuning parameters of the Levenberg-Marquardt algorithm; see the
              section "Fine-tuning of regression analysis methods" for details on how these
              additional regression parameters can be set. Note that all of the fit/regression
              variables should have a proper initial value, defined in the command line argument
              -v, --variable (see also there).

       -U, --lmnd
              Levenberg-Marquardt minimization with numerical partial derivatives (LMND). Same as
              the NLLM method, except that the partial derivatives of the model function(s) are
              calculated numerically. Therefore, the model function(s) may contain functions
              whose partial derivatives are not known in analytic form. The differences used in
              the computation of the partial derivatives should be declared by the user; see
              also the command line option -q, --differences.
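
       The role of the user-supplied differences can be sketched in Python as follows. The
       example uses central differences; whether `lfit` uses central or one-sided differences
       is not stated here, so this choice, the helper name numerical_partials and the sample
       model are assumptions.

```python
import math


def numerical_partials(model, params, x, differences):
    """Central-difference partial derivatives of model(params, x).

    `differences` maps each variable name to the step used for it,
    in the spirit of <variablename>=<difference>.
    """
    grads = {}
    for name, h in differences.items():
        up = dict(params)
        up[name] += h
        down = dict(params)
        down[name] -= h
        grads[name] = (model(up, x) - model(down, x)) / (2.0 * h)
    return grads


def model(p, x):
    # Hypothetical model function: a*sin(w*x)
    return p["a"] * math.sin(p["w"] * x)


params = {"a": 2.0, "w": 0.5}
differences = {"a": 1e-4, "w": 1e-4}
grads = numerical_partials(model, params, 1.0, differences)
print(grads)
```

       A difference that is too large biases the derivative, while one that is too small
       amplifies rounding errors, which is one reason the choice is left to the user.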

       -D, --dhsx, --downhill
              This option implies a regression involving the nonlinear  downhill  simplex  (DHSX)
              minimization algorithm. The user should specify the proper initial values and their
              uncertainties as <name>=<initial>:<uncertainty>,  unless  the  "fisher"  option  is
              passed  to  the  -P,  --parameters  command line argument (see later in the section
              "Fine-tuning of regression analysis methods"). In the first case, the initial  size
              of  the  simplex  is  based  on the uncertainties provided by the user while in the
              second case, the initial simplex is derived from the eigenvalues  and  eigenvectors
              of   the   Fisher  covariance  matrix.  Note  that  the  model  functions  must  be
              differentiable in the latter case.

       -M, --mcmc
              This option implies the method of Markov Chain Monte-Carlo (MCMC). The model
              function(s) need not be differentiable. However, each of the fit/regression
              variables must have an initial assumption for its uncertainty, which must be
              specified via the command line argument -v, --variable as
              <name>=<initial>:<uncertainty>. In the current implementation of `lfit`, each
              variable has an uncorrelated Gaussian a priori distribution with the specified
              uncertainty. The MCMC algorithm has some fine-tuning parameters; see the section
              "Fine-tuning of regression analysis methods" for more details.

       -K, --mchi, --chi2
              With this option one can perform a "brute force" Chi^2 minimization  by  evaluating
              the value of the merit function of Chi^2 on a grid of the fit/regression variables.
              In this case the grid size and resolution must be  specified  in  a  specific  form
              after  the  -v, --variable command line argument. Namely each of the fit/regression
              variables  intended  to  be  varied   on   a   grid   must   have   a   format   of
              <name>=[<min>:<step>:<max>]  while  the  other ones specified as <name>=<value> are
              kept fixed. The output of this analysis will be a series of lines with N+1 columns,
              where the values of fit/regression variables are followed by the value of the merit
              function. Note that all of the declared fit/regression variables are written to the
              output,  including  the  ones  which  are  fixed  (therefore the output is somewhat
              redundant).
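
       The grid scan described above can be sketched in Python as follows. The sketch assumes
       that the grid includes both end points and that the merit function is the usual sum of
       squared, error-normalized residuals; the helper names and the data are hypothetical.

```python
def frange(lo, step, hi):
    """Yield lo, lo+step, ... up to and including hi (within rounding)."""
    n = int(round((hi - lo) / step))
    for i in range(n + 1):
        yield lo + i * step


def chi2(a, b, data):
    """Sum of squared, error-normalized residuals for the model a*x+b."""
    return sum(((y - (a * x + b)) / err) ** 2 for x, y, err in data)


# Hypothetical data rows: (x, y, error)
data = [(0.0, 1.0, 1.0), (1.0, 3.0, 1.0), (2.0, 5.0, 1.0)]

# "a" is scanned like a=[1:0.5:3], "b" is kept fixed like b=1.
b_fixed = 1.0
rows = [(a, b_fixed, chi2(a, b_fixed, data)) for a in frange(1.0, 0.5, 3.0)]
best = min(rows, key=lambda r: r[2])

for row in rows:
    print(*row)
```

       Each printed row has N+1 = 3 columns: the two variables (including the fixed one)
       followed by the value of the merit function.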

       -E, --emce
              This option implies the method of "refitting to synthetic  data  sets",  or  "error
              Monte-Carlo estimation" (EMCE). This method must be assigned a primary
              minimization algorithm (which can be any of the CLLS, NLLM or DHSX methods). First,
              the program searches the best fit values for the fit/regression variables involving
              the assigned primary minimization algorithm and reports these best  fit  variables.
              Then,  additional  synthetic  data  sets  are generated around this set of best fit
              variables and the minimization is repeated involving the same primary  method.  The
              synthetic  data  sets are generated independently for each input data block, taking
              into account the fit residuals. The noise added to the best fit data  is  generated
              from the power spectrum of the residuals.

       -X, --xmmc
              This option implies an improved/extended version of the Markov Chain Monte-Carlo
              analysis (XMMC). The major differences between the classic MCMC and XMMC methods
              are the following:

              1. The transition distribution is derived from the Fisher covariance matrix.

              2. The program performs an initial minimization of the merit function using the
                 downhill simplex method.

              3. Various sanity checks are performed in order to verify the convergence of the
                 Markov chains (including the comparison of the actual and theoretical
                 transition probabilities, the computation of the autocorrelation lengths of
                 each fit/regression variable series and the comparison of the statistical and
                 Fisher covariance).

       -A, --fima
              Fisher  information  matrix  analysis  (FIMA).  With  this  analysis method one can
              estimate  the  uncertainties  and  correlations  of  the  fit/regression  variables
              involving the method of Fisher matrix analysis. This method does not minimize the
              merit function by adjusting the fit/regression variables; instead, the initial
              values (specified after the -v, --variables option) are expected to be the "best
              fit" ones.
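
       To illustrate how uncertainties and correlations follow from a Fisher matrix, the
       Python sketch below treats the simple model a*x+b with Gaussian errors. It is a
       textbook calculation under stated assumptions, not the code used by `lfit`; the
       function name and the data are hypothetical.

```python
import math


def fisher_covariance_line(xs, errors):
    """Covariance of (a, b) for the model a*x+b from the Fisher matrix.

    The partial derivatives are df/da = x and df/db = 1, so
    F = sum over points of [[x*x, x], [x, 1]] / error**2.
    """
    faa = sum(x * x / e ** 2 for x, e in zip(xs, errors))
    fab = sum(x / e ** 2 for x, e in zip(xs, errors))
    fbb = sum(1.0 / e ** 2 for e in errors)
    det = faa * fbb - fab * fab
    # Inverse of the symmetric 2x2 matrix [[faa, fab], [fab, fbb]].
    return [[fbb / det, -fab / det], [-fab / det, faa / det]]


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
errors = [0.1] * 5
cov = fisher_covariance_line(xs, errors)
sigma_a = math.sqrt(cov[0][0])
sigma_b = math.sqrt(cov[1][1])
correlation = cov[0][1] / (sigma_a * sigma_b)
print(sigma_a, sigma_b, correlation)
```

       No minimization takes place here: the covariance depends only on the partial
       derivatives and the error bars, evaluated at the supplied parameter values.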

   Fine-tuning of regression analysis methods:
       -e, --error <error expression>
              Expression for the  uncertainties.  Note  that  zero  or  negative  uncertainty  is
              equivalent  to  zero  weight,  i.e.  input  lines  with zero or negative errors are
              discarded from the fit.

       -w, --weight <weight expression>
              Expression for the weights. The weight is simply the reciprocal of the uncertainty.
              The  default  error/uncertainty (and therefore the weight) is unity. Note that most
              of the analysis/regression methods are rather sensitive to the uncertainties  since
              the merit function also depends on these.
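
       The relation between errors and weights can be sketched in Python as follows. The
       form of the merit function shown (a sum of squared weighted residuals) is an
       assumption for illustration, and the helper names are hypothetical.

```python
def weights_from_errors(errors):
    """Weight = 1/error; zero or negative errors give zero weight."""
    return [1.0 / e if e > 0 else 0.0 for e in errors]


def merit(residuals, weights):
    """Chi^2-like merit: sum of squared weighted residuals."""
    return sum((w * r) ** 2 for r, w in zip(residuals, weights))


errors = [0.5, 2.0, 0.0, -1.0]
residuals = [1.0, 1.0, 10.0, 10.0]
w = weights_from_errors(errors)
m = merit(residuals, w)
print(w, m)
```

       The last two points have non-positive errors, so they receive zero weight and do not
       contribute to the merit function, however large their residuals are.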

       -P, --parameters <regression parameters>
              This option is followed by a set of optional fine-tuning parameters, which differs
              for each primary regression analysis method:

       default, defaults
              Use the default fine-tune parameters for the given regression method.

       clls, linear
              Use the classic linear least squares method as the primary  minimization  algorithm
              of  the  EMCE  method.  Like  in  the case of the CLLS regression analysis (see -L,
              --clls), the model function(s) must be both differentiable and linear with  respect
              to the fit/regression variables.

       nllm, nonlinear
              Use  the  non-linear  Levenberg-Marquardt  minimization  algorithm  as  the primary
              minimization algorithm of the EMCE method. Like in the case of the NLLM  regression
              analysis  (see  -N,  --nllm),  the  model  function(s)  must be differentiable with
              respect to the fit/regression variables.

       lmnd   Use the  non-linear  Levenberg-Marquardt  minimization  algorithm  as  the  primary
              minimization  algorithm  of  the  EMCE  method.  Like  in  the  case  of -U, --lmnd
              regression  method,  the  parametric  derivatives  of  the  model  function(s)  are
              calculated  by a numerical approximation (see also -U, --lmnd and -q, --differences
              for additional details).

       dhsx, downhill
              Use the downhill simplex (DHSX) minimization as the primary minimization  algorithm
              of  the  EMCE  method. Unless the additional 'fisher' option is specified directly,
              like in the default case of the DHSX regression method, the user should specify the
              uncertainties  of  the fit/regression variables that are used as an initial size of
              the simplex.

       mc, montecarlo
              Use a  primitive  Monte-Carlo  diffusion  minimization  technique  as  the  primary
              minimization   algorithm   of   the  EMCE  method.  The  user  should  specify  the
              uncertainties of the fit/regression variables which are then used to  generate  the
              Monte-Carlo transitions. This primary minimization technique is very slow, so its
              use is not recommended.

       fisher In the case of the DHSX regression method or in the case of the  EMCE  method  when
              the primary minimization is the downhill simplex algorithm, the initial size of the
              simplex is derived from the Fisher covariance approximation evaluated at the  point
              represented  by  the  initial  values  of  the  fit/regression variables. Since the
              derivation  of  the  Fisher  covariance  requires  the  knowledge  of  the  partial
              derivatives  of the model function(s) with respect to the fit/regression variables,
              the(se) model function(s) must be differentiable. On the other hand, the user does
              not have to specify the initial uncertainties after the -v, --variables option
              since these uncertainties are derived automatically from the Fisher covariance.

       skip   In the case of the EMCE and XMMC methods, the initial minimization is skipped.

       lambda=<value>
              Initial value for the "lambda" parameter of the Levenberg-Marquardt algorithm.

       multiply=<value>
              Value of the "lambda multiplier" parameter of the Levenberg-Marquardt algorithm.

       iterations=<max.iterations>
              Maximum number of iterations of the Levenberg-Marquardt algorithm.

       accepted
              Count the accepted transitions in the MCMC and XMMC methods (default).

       nonaccepted
              Count the total (accepted plus non-accepted)  transitions  in  the  MCMC  and  XMMC
              methods.

       gibbs  Use the Gibbs sampler in the MCMC method.

       adaptive
              Use  the  adaptive  XMMC algorithm (i.e. the Fisher covariance is re-computed after
              each accepted transition).

       window=<window size>
              Window size for calculating the  autocorrelation  lengths  for  the  Markov  chains
              (these autocorrelation lengths are reported only in the case of the XMMC method).
              The default value is 20, which is fine in most cases since the typical
              autocorrelation lengths are between 1 and 2 for well-converged chains.

       -q, --difference <variablename>=<difference>[,...]
              The  analysis  method  of  LMND  (Levenberg-Marquardt  minimization using numerical
              derivatives, see -U, --lmnd) requires the differences  that  are  used  during  the
              computations of the partial derivatives of the model function(s). With this option,
              one can specify these differences.

       -k, --separate <variablename>[,...]
              In the case of non-linear regression methods (for instance, DHSX or XMMC), the
              fit/regression variables in which the model functions are linear can be separated
              from the nonlinear part, which makes the minimization process more robust and
              reliable.  Since  the  set  of variables in which the model functions are linear is
              ambiguous, the user should explicitly specify  this  supposedly  linear  subset  of
              regression      variables.      (For      instance,      the     model     function
              "a*b*x+a*cos(x)+b*sin(x)+c*x^2" is linear in both  "(a,c)"  and  "(b,c)"  parameter
              vectors  but  it  is  non-linear  in  "(a,b,c)".)  The  program  checks whether the
              specified subset of regression variables is a linear subset and reports  a  warning
              if  not.  Note that the subset of separated linear variables (defined here) and the
              subset of the fit/regression variables affected by  linear  constraints  (see  also
              section "Constraints") must be disjoint.

       --perturbations <noise level>, --perturbations <key>=<noise level>[,...]
              Additional white noise to be added to each EMCE synthetic data set. Each data
              block (referred to here by the appropriate data block key, see also section "Multiple
              data  blocks")  may  have  different  white noise levels. If there is only one data
              block, this command line argument is followed only by a  single  number  specifying
              the white noise level.

   Additional parameters for Monte-Carlo analysis:
       -s, --seed <random seed>
              Seed  for  the  random number generator. By default this seed is 0, thus all of the
              Monte-Carlo regression analyses (EMCE, MCMC, XMMC and the  optional  generator  for
              the  FIMA  method)  generate reproducible parameter distributions. A positive value
              after this option yields alternative random seeds while all negative values  result
              in an automatic random seed (derived from various available sources, such as
              /dev/[u]random, system time, hardware MAC address and so on). Distributions
              generated with this kind of automatic random seed are therefore not reproducible.

       -i, --[mcmc,emce,xmmc,fima]-iterations <iterations>
              The actual number of Monte-Carlo iterations for the MCMC, EMCE and XMMC methods.
              Additionally, the FIMA method is capable of generating a mock Gaussian distribution
              of the parameters with the same covariance as derived by the Fisher analysis. The
              number of points in this mock distribution is also specified by this command line
              option.
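
       The effect of a fixed seed and an iteration count on a Monte-Carlo analysis can be
       illustrated with a minimal Metropolis sampler in Python. This is a generic sketch, not
       the sampler implemented in `lfit`: the function name metropolis, the use of the
       uncertainties as Gaussian step sizes, the data and Python's own random number generator
       are all assumptions of the example.

```python
import math
import random


def metropolis(chi2, start, steps, iterations, seed=0):
    """Metropolis sampler with independent Gaussian proposal steps."""
    rng = random.Random(seed)
    current = dict(start)
    current_chi2 = chi2(current)
    chain = []
    for _ in range(iterations):
        trial = {k: v + rng.gauss(0.0, steps[k]) for k, v in current.items()}
        trial_chi2 = chi2(trial)
        # Accept with probability min(1, exp(-delta_chi2 / 2)).
        if math.log(1.0 - rng.random()) < -0.5 * (trial_chi2 - current_chi2):
            current, current_chi2 = trial, trial_chi2
        chain.append(dict(current))
    return chain


# Hypothetical data rows (x, y, error) lying on y = 2*x + 1.
data = [(float(x), 2.0 * x + 1.0, 0.1) for x in range(5)]


def chi2(p):
    return sum(((y - (p["a"] * x + p["b"])) / err) ** 2 for x, y, err in data)


start = {"a": 1.5, "b": 0.5}   # like a=1.5:0.05,b=0.5:0.05
steps = {"a": 0.05, "b": 0.05}
chain_1 = metropolis(chi2, start, steps, iterations=2000, seed=0)
chain_2 = metropolis(chi2, start, steps, iterations=2000, seed=0)
chain_3 = metropolis(chi2, start, steps, iterations=2000, seed=1)
print(chain_1 == chain_2, chain_1 == chain_3)
```

       Two runs with the same seed produce identical chains, while a different seed gives a
       different but statistically equivalent chain.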

   Clipping outlier data points:
       -r, --sigma, --rejection-level <level>
              Rejection level in the units of standard deviations.

       -n, --iterations <number of iterations>
              Maximum  number  of iterations in the outlier clipping cycles. The actual number of
              outlier points can be traced by increasing the verbosity of the  program  (see  -V,
              --verbose).

       --[no-]weighted-sigma
              During the derivation of the standard deviation, the contribution of the data
              points can be weighted by the respective weights/error bars (see also
              -w, --weight or -e, --error in the section "Fine-tuning of regression analysis
              methods"). If no weights/error bars are associated with the data points (i.e. both
              the -w, --weight and -e, --error options are omitted), this option has no
              practical effect.

       Note that in the current version of `lfit`, only the CLLS, NLLM and LMND regression methods
       support the outlier clipping described above.
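
       The interplay of the rejection level and the maximum number of clipping iterations can
       be sketched in Python as follows. The sketch clips a fixed list of residuals around
       their mean, whereas a real analysis would refit the model in every cycle; the function
       name sigma_clip and the data are hypothetical.

```python
import statistics


def sigma_clip(values, level, max_iterations):
    """Iteratively drop points farther than level*stddev from the mean."""
    kept = list(values)
    for _ in range(max_iterations):
        mean = statistics.fmean(kept)
        sigma = statistics.pstdev(kept)
        survivors = [v for v in kept if abs(v - mean) <= level * sigma]
        # Stop when nothing was rejected (or when everything would be).
        if len(survivors) == len(kept) or not survivors:
            break
        kept = survivors
    return kept


residuals = [0.1, -0.2, 0.0, 0.15, -0.1, 0.05, -0.05, 8.0]
kept = sigma_clip(residuals, level=2.0, max_iterations=5)
print(kept)
```

       The outlier at 8.0 is rejected in the first cycle; the second cycle rejects nothing,
       so the loop ends before the iteration limit is reached.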

   Multiple data blocks:
       -i<key> <input file name>
              Input file name for the data block named as <key>.

       -c<key> <independent>[:<column index>],...
              Column  definitions  (see  also  -c,  --columns)  for the given data block named as
              <key>.

       -f<key> <model function>
              Expression for the model function assigned to the data block named as <key>.

       -y<key> <dependent expression>
              Expression of the dependent variable for the data block named as <key>.

       -e<key> <errors>
              Expression of the uncertainties for the data block named as <key>.

       -w<key> <weights>
              Expression of the weights for the data block named as <key>. Note that like in  the
              case  of -e, --errors and -w, --weights, only one of the -e<key>, -w<key> arguments
              should be specified.

   Constraints:
       -t, --constraint, --constraints <expression>{=<>}<expression>[,...]
              List of fit and domain constraints  between  the  regression  variables.  Each  fit
              constraint  expression  must be linear in the fit/regression variables. The program
              checks the linearity of the fit constraints and reports an  error  if  any  of  the
              constraints are non-linear. A domain constraint can be any expression involving an
              arbitrary binary arithmetic relation (such as strictly greater than: '>', strictly
              less than: '<', greater than or equal to: '>=' and less than or equal to: '<=').
              Constraints can be specified either by a comma-separated list after a single -t,
              --constraints command line argument or by several such arguments.

       -v, --variable <name>:=<value>
              Another  form  of  specifying  constraints.  The  variable specifications after -v,
              --variable can also be used to define constraints by writing ":="  instead  of  "="
              between the variable name and initial value. Thus, -v <name>:=<value> is equivalent
              to -v <name>=<value> -t <name>=<value>.

   User-defined functions:
       -x, --define, --macro <name>(<parameters>)=<definition expression>
              With this option, the user can define additional functions (also called macros) on
              top of the built-in functions and operators, dynamically loaded functions and
              previously defined macros. Note that each such user-defined function must be
              stand-alone,   i.e.  external  variables  (such  as  fit/regression  variables  and
              independent variables) cannot be  part  of  the  definition  expression,  only  the
              parameters of these functions.

   Dynamically loaded extensions and functions:
       -d, --dynamic <library>:<array>[,...]
              Load  the dynamically linked library (shared object) named <library> and import the
              global `lfit`-compatible set of functions defined in the arrays specified after the
              name of the library. The arrays must be declared with the type 'lfitfunction', as
              defined in the file "lfit.h". Each record in such an array contains information
              about a certain imported function, namely the actual name of this function, flags
              specifying whether the function is differentiable and/or linear in its regression
              parameters, the number of regression variables and independent variables, and the
              actual C subroutine that implements the evaluation of the function (and the
              optional computation of the partial derivatives). The module 'linear.c' and
              'linear.so' provide a simple example that implements the
              "line(a,b,x)=a*x+b"  function.  This  example function has two regression variables
              ("a" and "b") and one independent variable ("x") and the function itself is  linear
              in the regression variables.

   More on outputs:
       -z, --columns-output <column indices>
              Column indices where the results are written in evaluation mode. If this option is
              omitted, the results of the function evaluation are written sequentially. Otherwise,
              the  input  file  is  written  to the output and the appropriate columns (specified
              here) are replaced by the respective results  of  the  function  evaluation.  Thus,
              although  the default column order is sequential, there is a significant difference
              between omitting this option and specifying "-z 1,2,...,N". In the first case,  the
              output  file  contains  only  the results of the function evaluations, while in the
              latter case, the first N columns  of  the  original  file  are  replaced  with  the
              results.
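
       The difference between the two output layouts can be sketched in Python as follows.
       The function write_results and its arguments are hypothetical and only mimic the
       behaviour described above, with 1-based column indices as in "-z 1,2,...,N".

```python
def write_results(input_rows, results, columns=None):
    """Return output rows for evaluation mode.

    columns=None: only the results are written, in sequential order.
    Otherwise: each input row is copied and the given 1-based columns
    are replaced by the respective results.
    """
    if columns is None:
        return [list(r) for r in results]
    out = []
    for row, res in zip(input_rows, results):
        new_row = list(row)
        for col, value in zip(columns, res):
            new_row[col - 1] = value
        out.append(new_row)
    return out


input_rows = [[1, 2, 3], [4, 5, 6]]
results = [[10], [20]]          # one evaluated function per input line
sequential = write_results(input_rows, results)
replaced = write_results(input_rows, results, columns=[2])
print(sequential)
print(replaced)
```

       In the first case the output contains only the evaluated values; in the second case
       the other input columns are carried over unchanged.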

       --errors, --error-line, --error-columns
              Print the uncertainties of the fit/regression variables.

       -F, --format <variable name>=<format>[,...]
              Format of the output in printf-style for each fit/regression variable (see
              printf(3)). The default format is %12.6g (6 significant figures).

       -F, --format <format>[,...]
              Format of the output in evaluation mode. The default format is %12.6g (6 significant
              figures).

       -C, --correlation-format <format>
              Format of the correlation matrix elements. The default format is %6.3f (3 decimal
              places).
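
       The effect of the default formats can be checked with Python's % operator, which
       follows the printf(3) conventions for these conversions. The sample values are
       arbitrary.

```python
value = 3.14159265358979
formatted = "%12.6g" % value       # 6 significant figures, width 12
large = "%12.6g" % 1234567.0       # switches to exponent notation
corr = "%6.3f" % -0.98765          # 3 digits after the decimal point

print(repr(formatted))
print(repr(large))
print(repr(corr))
```

       Note that %12.6g limits the number of significant figures, while %6.3f fixes the
       number of digits after the decimal point.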

       -g, --derived-variable[s] <variable name>=<expression>[,...]
              Some of the regression and analysis methods are capable of computing the
              uncertainties  and  correlations for derived regression variables. These additional
              (and therefore not independent) variables can be defined  with  this  command  line
              option.  In  the  definition  expression  one  should  use  only the fit/regression
              variables (as defined by the -v, --variables command  line  argument).  The  output
              format  of  these  variables can also be specified by the -F, --format command line
              argument.

       -u, --output-fitted <filename>
              Name of an output file into which those lines of the input are written that were
              involved  in  the  final  regression.  This option is useful in the case of outlier
              clipping in order to see what was the actual subset of input data that was used  in
              the fit (see also the -n, --iterations and -r, --sigma options).

       -j, --output-rejected <filename>
              Name of an output file into which those lines of the input are written that were
              rejected from the final regression. This option is useful in the  case  of  outlier
              clipping  in  order  to  see  what  was  the  actual subset of input data where the
              dependent variable represented outlier points (see also the  -n,  --iterations  and
              -r, --sigma options).

       -a, --output-all <filename>
              File  containing  the  lines  of  the input file that were involved in the complete
              regression analysis. This file is simply the original file, only the commented  and
              empty lines are omitted.

       -p, --output-expression <filename>
              In  this  file  the model function is written in which the fit/regression variables
              are replaced by their best-fit values.

       -l, --output-variables <filename>
              List of the names and values of the fit/regression variables in the same format  as
              used  after the -v, --variables command line argument. The content of this file can
              therefore be passed to subsequent invocations of `lfit`.

       --delta
              Write the individual differences between the dependent variables and the
              evaluated best fit model function values for each line in the output files
              specified by the -u, --output-fitted, -j, --output-rejected  and  -a,  --output-all
              command line options.

       --delta-comment
              Same  as --delta, but the differences are written as a comment (i.e. separated by a
              '##' from the original input lines).

       --residual
              Write the final fit residual to the output file (after the  list  of  the  best-fit
              values for the fit/regression variables).

REPORTING BUGS

       Report bugs to <apal@szofi.net>, see also http://fitsh.net/.

COPYRIGHT

       Copyright © 1996, 2002, 2004-2008, 2009; Pal, Andras <apal@szofi.net>