Ubuntu Manpage: grmatch - pairing lines by involving identifier or cross matching

NAME

       grmatch - pairing lines by involving identifier or cross matching

SYNOPSIS

       grmatch [options] -r <reference> -i <input> [-o <output>]

DESCRIPTION

The program `grmatch` matches lines read from two input files, namely a reference file and
an input file. All implemented algorithms are symmetric, in the sense that the result
should be the same if the two files are swapped. The only case in which the order of the
files matters is when a geometrical transformation is also returned (see point matching
below): swapping the files then yields the inverse of the original transformation. The
lines (rows) can be matched using various criteria:

1. Lines can be matched by identifier, where the identifier can be any concatenation of
arbitrary, space-separated columns found in the files. Generally, the identifier is a
single column (e.g. an astronomical catalog identifier). The behaviour of the program can
be tuned for the cases in which more than one row has the same identifier.

2. Lines can be matched using a 2-dimensional point matching algorithm. In this method,
the program expects two columns from each of the reference and input files, which are
treated as X and Y coordinates. Given both point lists, the program tries to find the
geometrical transformation which transforms the points from the frame of the reference
list to the frame of the input list and, simultaneously, tries to find as many pairs as
possible. The parameters of the geometrical transformation and of the whole algorithm can
be fine-tuned.

3. Lines can be matched using an arbitrary (N-) dimensional coordinate matching algorithm.
This method expects N columns from each of the reference and input files, which are
treated as X_1, ..., X_N Cartesian coordinates, and it assumes that both point sets are in
the same reference frame. A point 'A' from the reference list and a point 'P' from the
input list form a pair if the closest point to 'A' in the input list is 'P' and the
closest point to 'P' in the reference list is 'A'.
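
The symmetric pairing rule of the coordinate matching method can be illustrated with a short Python sketch. This is not the implementation used by `grmatch`: the function name `mutual_nearest_pairs` is hypothetical, and the brute-force search is chosen only for readability.

```python
import math

def mutual_nearest_pairs(ref, inp):
    """Return (i, j) pairs where ref[i] and inp[j] are each other's nearest neighbour."""
    def nearest(point, candidates):
        # Index of the candidate closest to `point` (Euclidean distance).
        return min(range(len(candidates)),
                   key=lambda k: math.dist(point, candidates[k]))

    if not ref or not inp:
        return []
    pairs = []
    for i, a in enumerate(ref):
        j = nearest(a, inp)
        # Keep the pair only if the relation also holds in the other direction.
        if nearest(inp[j], ref) == i:
            pairs.append((i, j))
    return pairs

ref = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
inp = [(0.1, 0.0), (9.0, 0.5)]
print(mutual_nearest_pairs(ref, inp))  # prints [(0, 0), (1, 1)]
```

The third reference point stays unpaired: its nearest input point is already closer to another reference point, so the relation does not hold "vice versa".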

OPTIONS

General options:
-h, --help
Gives a general summary of the command line options.

--long-help, --help-long
Gives a detailed list of command line options.

--wiki-help, --help-wiki, --mediawiki-help, --help-mediawiki
Gives a detailed list of command line options in Mediawiki format.

--version, --version-short, --short-version
Gives some version information about the program.

-C, --comment
Comment the output (both the transformation file and the match file).

Options for input/output specifications:
-r <file>, --reference <file>, --input-reference <file>
Mandatory, name of the reference file.

<inputfile>, -i <inputfile>, --input <inputfile>
Name of the input file. If this switch is omitted, the input is read from stdin
(specifying some input is mandatory).

--separator-reference <char>|space, --separator-input <char>|space
Characters separating the fields of the reference and the input files,
respectively. By default, fields are separated by whitespace; this can be made
explicit by specifying 'space' here. Otherwise, <char> must be a single
character. For instance, use '--separator-reference ,' and/or
'--separator-input ,' to process CSV files.

-o <output>, --output <output>, --output-matched <output>
Name of the output file containing the matched lines. Each matched line is a
pasted line: the first part comes from the reference file and the second part from
the input file, and the two parts are joined by a TAB character. This switch is
optional; if it is not specified, no such output is generated.

--output-matched-reference <out>, --output-matched-input <out>
Name of the output file, containing the lines corresponding to matches but only
from the reference file or from the input file, respectively.

--output-excluded-reference <out>, --output-excluded-input <out>
Names of the files which contain the valid but excluded lines from the reference
and from the input. These outputs are disjoint from the previous output, and
together they contain all valid lines.

--output-id <out>
Name of the file which contains only the identifiers of the matched lines. If the
primary matching method was not identifier matching, one should also specify the
column indices of the identifiers with --col-ref-id and --col-inp-id.

--output-transformation <output-transformation-file>
Name of the output file containing the geometrical transformation, in
human-readable format, if the matching method was point matching (otherwise,
this option has no effect). The commented version of this file includes some
statistics about the matching (the total number of lines used and matched,
the required CPU time, the final triangulation level, the fit residuals and
similar quantities).

In all of the above input/output file specifications, replacing the file name with "-"
(a single minus sign) forces reading from stdin or writing to stdout. Note that
everything after a "#" (hash mark) on any line is treated as a comment and is therefore
ignored.
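
The conventions described above (comments after "#", whitespace or single-character separators) and the 1-based column indexing used by the --col-* switches can be sketched in Python as follows. The helper names `parse_line` and `read_coordinates` are hypothetical, and the real parser of `grmatch` may differ in details, for instance in what it accepts as a valid real number.

```python
def parse_line(line, separator=None):
    """Split one line into fields, dropping everything after '#'.

    separator=None means splitting on runs of whitespace (the 'space' default);
    otherwise it is a single character such as ','.
    """
    content = line.split("#", 1)[0].strip()
    if not content:
        return []
    if separator is None:
        return content.split()
    return [field.strip() for field in content.split(separator)]

def read_coordinates(fields, columns):
    """Return the floats found in the given 1-based columns, or None if the
    line should be skipped because a column is missing or not a number."""
    if any(c < 1 for c in columns):
        raise ValueError("column indices start at 1")
    try:
        return tuple(float(fields[c - 1]) for c in columns)
    except (IndexError, ValueError):
        return None

print(parse_line("12.5 7.25 star_1  # bright star"))       # ['12.5', '7.25', 'star_1']
print(read_coordinates(parse_line("12.5,7.25", ","), (1, 2)))  # (12.5, 7.25)
print(read_coordinates(parse_line("12.5 n/a"), (1, 2)))        # None
```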

General options for point matching:
--match-points
This switch forces the usage of the point matching method. By default, this
method is assumed to be used, therefore this switch can be omitted.

--col-ref <x>,<y>, --col-inp <x>,<y>
The column indices containing the X and Y coordinates, for the reference and for
the input file, respectively. The index of the first column is always 1, the
index of the second is 2 and so on. Lines in which these columns do not contain
valid real numbers are omitted.

-a <order>, --order <order>
This switch specifies the polynomial order of the resulting geometrical
transformation. It can be an arbitrary positive integer. Note that if the order is
A, at least (A+1)*(A+2)/2 valid points are needed from both the reference and the
input file to fit the transformation.

--max-distance <maxdist>
The maximal accepted distance between the matched points in the coordinate frame
of the input coordinate list (and not in the coordinate frame of the reference
coordinate list). Possible pairs (which are valid pairs according to the symmetric
coordinate matching algorithms) are excluded if their Euclidean distance is larger
than maxdist. Note that this option has no default value; therefore, if it is
omitted, all possible pairs found by the symmetric matching are returned, which in
practice can lead to unexpected behaviour in certain cases. One should always
specify a reasonable maximal distance, which can be estimated only from knowledge
of the physics behind the input files.

See more options concerning point matching in the section "Fine-tuning of point
matching" below. That section also describes the tuning of the triangulation used by
the point matching algorithm. For a more detailed description of the point matching
algorithms based on pattern and triangle matching, see [1], [2] or [3].
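
The point-count requirement stated for --order above, (A+1)*(A+2)/2, is the number of coefficients of a two-variable polynomial of order A. A minimal Python sketch of the arithmetic follows; the function name `min_points_for_order` is hypothetical.

```python
def min_points_for_order(order):
    """Minimum number of valid points needed to fit a 2D polynomial
    transformation of the given order: (A+1)*(A+2)/2."""
    if order < 1:
        raise ValueError("order must be a positive integer")
    return (order + 1) * (order + 2) // 2

for order in (1, 2, 3):
    print(order, min_points_for_order(order))
# prints:
# 1 3
# 2 6
# 3 10
```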

General options for coordinate matching:
--match-coord, --match-coords
This switch forces the usage of the coordinate matching method. Note that because
of the common options with the point matching method, one should specify this
switch to force the usage of the coordinate matching method (the default method is
point matching, see above).

--col-ref <x>[,<y>,[<z>...]], --col-inp <x>[,<y>,[<z>...]]
The column indices containing the spatial coordinates, for the reference and for
the input file, respectively. The index of the first column is always 1, the
index of the second is 2 and so on. Lines in which these columns do not contain
valid real numbers are omitted. Note that the dimension of the coordinate
matching space is specified indirectly, by the number of column indices listed
here. Because of this, the number of column indices should be the same for the
reference and the input; otherwise, if the dimensions are mismatched, the
program exits unsuccessfully.

--max-distance <maxdist>
The maximal accepted distance between the matched points. Possible pairs (which
are valid pairs according to the symmetric coordinate matching algorithms) are
excluded if their Euclidean distance is larger than maxdist. Note that this option
has no default value; therefore, if it is omitted, all possible pairs found by the
symmetric matching are returned (see also point matching, above).
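
The effect of --max-distance on a list of candidate pairs can be sketched in Python as below. The function name `apply_max_distance` and the data are hypothetical; the sketch only illustrates the rule that pairs farther apart than maxdist are dropped, and that nothing is dropped when no limit is given.

```python
import math

def apply_max_distance(pairs, ref, inp, max_distance=None):
    """Keep only the (i, j) pairs whose Euclidean distance does not exceed
    max_distance; with max_distance=None every candidate pair is kept."""
    if max_distance is None:
        return list(pairs)
    return [(i, j) for i, j in pairs
            if math.dist(ref[i], inp[j]) <= max_distance]

ref = [(0.0, 0.0), (10.0, 0.0)]
inp = [(0.1, 0.0), (9.0, 0.5)]
candidates = [(0, 0), (1, 1)]      # e.g. the result of a symmetric matching

print(apply_max_distance(candidates, ref, inp))        # [(0, 0), (1, 1)]
print(apply_max_distance(candidates, ref, inp, 0.5))   # [(0, 0)]
```

Without a limit, the second pair is accepted even though the two points are more than one unit apart, which is the kind of unexpected result the text warns about.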

General options for identifier matching:
--match-id, --match-identifiers
This switch forces the usage of the identifier matching method.

--col-ref-id <i>[,<j>,[<k>...]], --col-inp-id <i>[,<j>,[<k>...]]
Column index or indices containing the identifiers, from the reference and from
the input file, respectively.

--no-ambiguity, --first-ambiguity, --any-ambiguity, --full-ambiguity
These options tune the behaviour of the matching when there is more than one
occurrence of a given identifier in the reference and/or input file. If
--no-ambiguity is specified, these identifiers are discarded; this is the default
method. If --first-ambiguity is specified, only the first occurrence is treated as
a matched line, independently of the number of occurrences. If the switch
--any-ambiguity is specified, the lines are paired sequentially as long as there
are lines left in both the reference and the input. For example, if there are 4
occurrences in the reference and 6 in the input file of a given identifier, 4
matched pairs are returned. Otherwise, if --full-ambiguity is specified, all
possible combinations of the lines are treated as matched lines. For example, if
there are 4 occurrences in the reference and 6 in the input file of a given
identifier, all 4*6=24 combinations are returned as matched pairs.
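
The four ambiguity modes can be compared with a small Python sketch that pairs the rows sharing one identifier. The function name `pair_duplicates` and the mode strings are hypothetical and only mirror the switches described above; this is an illustration of the described behaviour, not the program's code.

```python
from itertools import product

def pair_duplicates(ref_rows, inp_rows, mode="no"):
    """Pair the rows that share a single identifier.

    mode: "no", "first", "any" or "full", mirroring --no-ambiguity,
    --first-ambiguity, --any-ambiguity and --full-ambiguity.
    """
    if not ref_rows or not inp_rows:
        return []
    ambiguous = len(ref_rows) > 1 or len(inp_rows) > 1
    if mode == "no":
        # Identifiers occurring more than once are discarded.
        return [] if ambiguous else [(ref_rows[0], inp_rows[0])]
    if mode == "first":
        return [(ref_rows[0], inp_rows[0])]
    if mode == "any":
        # Sequential pairing until one of the lists runs out.
        return list(zip(ref_rows, inp_rows))
    if mode == "full":
        return list(product(ref_rows, inp_rows))
    raise ValueError(f"unknown mode: {mode}")

ref_rows = [f"ref{i}" for i in range(4)]
inp_rows = [f"inp{i}" for i in range(6)]
for mode in ("no", "first", "any", "full"):
    print(mode, len(pair_duplicates(ref_rows, inp_rows, mode)))
# prints:
# no 0
# first 1
# any 4
# full 24
```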

Fine-tuning of point matching:
--triangulation <parameters>
This switch is followed by comma-separated directives, which specify the
parameters of the triangulation-based point matching algorithm:

delaunay, level=<level>, full, auto, unitarity=<U>
These directives specify the triangulation level used for point matching.
"delaunay" forces the usage of Delaunay triangles only. This is the fastest
method, however, it only works if the points in the reference and input lists
are almost completely overlapping and describe almost the same point sets
(with a ratio of common points above 60-70%). The "level" directive specifies the
level of the expansion of the Delaunay triangulation (see [1] for more details).
In practice, the lower the ratio of common points and/or the ratio of the
overlap, the higher the level that should be used. Specifying "level=1" or
"level=2" gives a robust but still fast method for general usage. The directive
"full" forces full triangulation. This can be overwhelmingly slow and requires
tons of memory if there are more than 40-50 points (the computing time and the
memory are proportional to the 6th(!) and 3rd power of the number of points,
respectively). The directive "auto" increases the level of the triangulation
expansion automatically until a proper match is found. A match is considered
good if the unitarity of the transformation is less than the unitarity U
specified by the "unitarity=U" directive (see also the notes on unitarity
below).

mixed, conformable, reverse
These directives define the chirality of the triangle spaces to be used.
In practice, this means the following. If we don't know whether the input and
reference lists are inverted with respect to each other, we should use the
"mixed" triangle space. If we are sure that the input and reference lists are
not inverted, we can use the "conformable" triangle space. If we know that the
input and reference lists are inverted, we can use the "reverse" space. Note that
although the "mixed" triangle space can always yield a good match, it is wise
to fix the chirality by specifying "conformable" or "reverse" if we really
know whether the point sets are inverted with respect to each other. If the
chirality is fixed, the program yields more matched pairs, the appropriate
triangulation level can be smaller and, in "auto" mode, the program returns the
match definitely faster.

maxnumber=<max>, maxref=<mr>, maxinp=<mi>
These directives specify the maximal number of points which are used for
triangulation (for any type of triangulation). Specifying "maxnumber" is
equivalent to defining "maxref" and "maxinp" with the same value. Then, the
first <mr> points from the reference and the first <mi> points from the input
list are used to generate the triangle sets. The "first" points are selected
using the optional information found in one of the columns, see the following
switches.

(Note that there should be only one --triangulation switch, all desired directives should
be written in the same argument, separated by commas.)

--col-ref-ordering [-]<w>, --col-inp-ordering [-]<w>
These switches specify one column index each from the reference and from the input
file. These columns are used to order the lists and to select the first "maxref"
and "maxinp" points (see above) for the generation of the two triangle meshes.
Both columns should contain valid real numbers, otherwise the whole(!) line
is excluded (not only from sorting but from the whole matching procedure). If there
is no negative sign before the column index, the data are sorted in
descending(!) order, therefore the lines with the highest(!) values
are selected for triangulation. If there is a negative sign before the index,
the data are sorted in ascending order by these values, therefore the lines
with the smallest(!) values are selected for triangulation. For example, if we want
to match star lists, we might want to use only the brightest ones to
generate the triangle sets. If the brightnesses of the stars are specified by
their fluxes, we should not use the negative sign (the list should be sorted in
descending order to select the first few lines as the brightest stars), and if
the brightness is given as a magnitude, we have to use the negative sign.
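
The ordering rule can be sketched in Python as follows. The function name `select_for_triangulation` and the sample rows are hypothetical; `ascending=True` stands for a negative sign before the column index. The sketch covers only the selection step and simply skips lines with an invalid ordering value.

```python
def select_for_triangulation(rows, column, max_points, ascending=False):
    """Pick the first `max_points` rows after sorting by a 1-based column.

    Descending order is the default (suits fluxes); ascending=True mimics a
    negative column index (suits magnitudes).
    """
    valid = []
    for fields in rows:
        try:
            value = float(fields[column - 1])
        except (IndexError, ValueError):
            continue  # a line without a valid ordering value is excluded
        valid.append((value, fields))
    valid.sort(key=lambda item: item[0], reverse=not ascending)
    return [fields for _, fields in valid[:max_points]]

# Columns: x, y, flux, magnitude (made-up values).
rows = [
    ["10.0", "20.0", "100.0", "15.0"],
    ["30.0", "40.0", "500.0", "13.3"],
    ["50.0", "60.0", "bad",   "bad"],
    ["70.0", "80.0", "50.0",  "15.8"],
]
by_flux = select_for_triangulation(rows, 3, 2)                   # like --col-ref-ordering 3
by_mag = select_for_triangulation(rows, 4, 2, ascending=True)    # like --col-ref-ordering -4
print([r[0] for r in by_flux])  # ['30.0', '10.0']
print([r[0] for r in by_mag])   # ['30.0', '10.0']
```

Both calls select the same two brightest stars, once through the largest fluxes and once through the smallest magnitudes.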

--fit iterations=<N>,firstrejection=<F>,sigma=<S>
Like --triangulation, this switch is followed by comma-separated directives. The
"iterations=<N>" directive specifies the number <N> of iterations for point
matching. The "firstrejection=<F>" directive specifies the serial number <F> of
the first iteration where points farther than the <S> "sigma" level are excluded
in the next iteration. Note that in practice this type of iteration is not really
important (due to, for instance, the limitation of the outliers by the
--max-distance switch), however, some suspicious users can be convinced by such
arguments.

--weight reference|input,column=<wi>,[magnitude],[power=<p>]
These directives specify the weights which are used during the fit of the
geometrical transformation. In practice, this is useful in the following
situation. When we try to match star lists, the fainter stars are believed
to have higher astrometrical errors, therefore they should have a smaller
influence on the fit. We can take the weights from the reference (specify
"reference") or from the input (specify "input"), from the column given by
"column=<wi>". The weights can be derived from stellar magnitudes; if so,
specify "magnitude" to convert the values read in magnitudes to fluxes. The actual
weight is then the "power"th power of the flux. The default value of
"power" is 1, however, for the maximum-likelihood estimation of an assumed
Gaussian distribution, the weights should be the second power of the fluxes.
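
The weighting described for --weight can be sketched in Python. Two assumptions are made here that the text does not state: the magnitude-to-flux conversion is taken to be the usual relation flux = 10^(-0.4*magnitude) with an arbitrary zero point (only relative weights matter in a fit), and a value read without the "magnitude" directive is taken to be a flux already. The function name `fit_weight` is hypothetical.

```python
def fit_weight(value, magnitude=False, power=1.0):
    """Weight of one point in the fit: flux ** power.

    If magnitude=True, `value` is first converted to a flux with the standard
    relation 10**(-0.4 * m) (assumed; the zero point is arbitrary).
    """
    flux = 10.0 ** (-0.4 * value) if magnitude else value
    return flux ** power

# A star 5 magnitudes brighter carries 100 times the weight for power=1 ...
ratio_p1 = fit_weight(10.0, magnitude=True) / fit_weight(15.0, magnitude=True)
# ... and 10000 times the weight for power=2.
ratio_p2 = (fit_weight(10.0, magnitude=True, power=2)
            / fit_weight(15.0, magnitude=True, power=2))
print(round(ratio_p1), round(ratio_p2))
```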

Some notes on unitarity. The unitarity of a geometrical transformation measures how much
it differs from the closest transformation which is affine and is a combination of
dilation, rotation and shift. For such a transformation the unitarity is 0, and if
second-order terms in a transformation distort such a unitary transformation, the
unitarity has the same magnitude as this second-order effect. For example, mapping a
part of a sphere with a size of d degrees has a unitarity of 1-cos(d). Therefore, for
astrometrical purposes, a reasonable value of the critical unitarity in "auto"
triangulation mode can be estimated as 2 or 3 times 1-cos(d/2), where d is the size of
the field in which astrometry is to be performed.
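
The rule of thumb above can be evaluated numerically with a few lines of Python. The function name `critical_unitarity` and the `factor` parameter are hypothetical; the field size is assumed to be given in degrees, as in the text.

```python
import math

def critical_unitarity(field_size_deg, factor=2.0):
    """Estimate factor * (1 - cos(d/2)) for a field of size d degrees;
    the text suggests a factor of 2 or 3."""
    half_size = math.radians(field_size_deg / 2.0)
    return factor * (1.0 - math.cos(half_size))

for d in (0.5, 2.0, 10.0):
    print(f"d = {d:4.1f} deg  ->  unitarity limit = {critical_unitarity(d):.2e}")
```

For a field of 2 degrees this gives a value of about 3e-4, which could then be passed to the "unitarity=<U>" directive of --triangulation.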

REPORTING BUGS

       Report bugs to <apal@szofi.net>, see also https://fitsh.net/.

COPYRIGHT

       Copyright © 1996, 2002, 2004-2008, 2010-2016, 2018-2020; Pal, Andras <apal@szofi.net>