Provided by: vnlog_1.34-2_all

NAME

       vnl-join - joins two log files on a particular field

SYNOPSIS

        $ cat a.vnl
        # a b
        AA 11
        bb 12
        CC 13
        dd 14
        dd 123

        $ cat b.vnl
        # a c
        aa 1
        cc 3
        bb 4
        ee 5
        - 23

        Try to join unsorted data on field 'a':
        $ vnl-join -j a a.vnl b.vnl
        # a b c
        join: /dev/fd/5:3: is not sorted: CC 13
        join: /dev/fd/6:3: is not sorted: bb 4

        Sort the data, and join on 'a':
        $ vnl-join --vnl-sort - -j a a.vnl b.vnl | vnl-align
        # a  b c
        bb  12 4

        Sort the data, and join on 'a', ignoring case:
        $ vnl-join -i --vnl-sort - -j a a.vnl b.vnl | vnl-align
        # a b c
        AA 11 1
        bb 12 4
        CC 13 3

        Sort the data, and join on 'a'. Also print the unmatched lines from both files:
        $ vnl-join -a1 -a2 --vnl-sort - -j a a.vnl b.vnl | vnl-align
        # a  b   c
        -   -   23
        AA   11 -
        CC   13 -
        aa  -    1
        bb   12  4
        cc  -    3
        dd  123 -
        dd   14 -
        ee  -    5

        Sort the data, and join on 'a'. Print the unmatched lines from both files.
        Output ONLY column 'c' from the 2nd input:
        $ vnl-join -a1 -a2 -o 2.c --vnl-sort - -j a a.vnl b.vnl | vnl-align
        # c
        23
        -
        -
         1
         4
         3
        -
        -
         5

DESCRIPTION

         Usage: vnl-join [join options]
                         [--vnl-sort -|[sdfgiMhnRrV]+]
                         [ --vnl-[pre|suf]fix[1|2] xxx    |
                           --vnl-[pre|suf]fix xxx,yyy,zzz |
                           --vnl-autoprefix               |
                           --vnl-autosuffix ]
                         logfile1 logfile2

       This tool joins two vnlog files on a given field. "vnl-join" is a wrapper around the GNU
       coreutils "join" tool. Since this is a wrapper, most commandline options and behaviors of
       the "join" tool are present; consult the join(1) manpage for details. The differences from
       GNU coreutils "join" are:

       •   The input and output to this tool are vnlog files, complete with a legend

       •   The columns are referenced by name, not index. So instead of saying

             join -j1

           to join on the first column, you say

             vnl-join -j time

           to join on column "time".

       •   "-1" and "-2" are supported, but must refer to the same field. Since vnlog knows the
           identity of each field, it makes no sense for "-1" and "-2" to be different. Pass
           "-j" instead; it makes more sense in this context.

       •   "-a-" is available as a shorthand for "-a1 -a2": this is a full outer join, printing
           unmatched records from both of the inputs. Similarly, "-v-" is available as a
           shorthand for "-v1 -v2": this will output only the unique records in both of the
           inputs.

       •   "vnl-join"-specific options are available to adjust the field-naming in the output:

             --vnl-prefix1
             --vnl-suffix1
             --vnl-prefix2
             --vnl-suffix2
             --vnl-prefix
             --vnl-suffix
             --vnl-autoprefix
             --vnl-autosuffix

           See "Field names in the output" below for details.

       •   A "vnl-join"-specific option "--vnl-sort" is available to sort the input and/or
           output. See below for details.

       •   By default we call the "join" tool to do the actual work. If the underlying tool has a
           different name or lives in an odd path, this can be specified by passing "--vnl-tool
           TOOL"

       •   If no "-o" is given, we output the join field, the remaining fields in logfile1, the
           remaining fields in logfile2, and so on. This is what "-o auto" does, except we also
           handle empty vnlogs correctly.

       •   "-e" is not supported because vnlog uses "-" to represent undefined fields.

       •   "--header" is not supported because vnlog assumes a specific header structure, and
           "vnl-join" makes sure that this header is handled properly

       •   "-t" is not supported because vnlog assumes whitespace-separated fields

       •   "--zero-terminated" is not supported because vnlog assumes newline-separated records

       •   Rather than only 2-way joins, this tool supports N-way joins for any N > 2. See below
           for details.

       Past that, everything "join" does is supported, so see that man page for detailed
       documentation. Note that all non-legend comments are stripped out, since it's not obvious
       where they should end up.
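
       The name-based column lookup described above can be sketched with plain coreutils. This
       is only an illustration of the idea, not how "vnl-join" is implemented: the helper name
       "field_index" and the sample files are made up for the example, and the legend is assumed
       to be the first line that starts with "# ".

```shell
# Hypothetical helper (not part of vnlog): print the 1-based index of the
# column named $1 in vnlog file $2. The legend is taken to be the first line
# that starts with "# "; its first token is the "#" itself.
field_index() {
    awk -v name="$1" '
        /^# / {
            for (i = 2; i <= NF; i++)
                if ($i == name) { print i - 1; exit }
            exit 1
        }' "$2"
}

tmp=$(mktemp -d)
printf '# x time\n10 1\n20 2\n' > "$tmp/a.vnl"
printf '# time y\n1 7\n2 8\n'   > "$tmp/b.vnl"

# Translate the column name into the numeric indices that join expects
i1=$(field_index time "$tmp/a.vnl")
i2=$(field_index time "$tmp/b.vnl")

# join knows nothing about legends, so strip every comment line first
grep -v '^#' "$tmp/a.vnl" > "$tmp/a.dat"
grep -v '^#' "$tmp/b.vnl" > "$tmp/b.dat"

joined=$(join -1 "$i1" -2 "$i2" "$tmp/a.dat" "$tmp/b.dat")
rm -rf "$tmp"
printf '%s\n' "$joined"
```

       Here "time" is the second column of the first file and the first column of the second,
       which is exactly the bookkeeping that referring to columns by name avoids.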

   Field names in the output
       By default, the field names in the output match those in the input. This is what you want
       most of the time. It is possible, however, that a column name adjustment is needed. One
       common use case for this is if the files being joined have identically-named columns,
       which would produce duplicate columns in the output.  Example: we fixed a bug in a
       program, and want to compare the results before and after the fix. The program produces an
       x-y trajectory as a function of time, so both the bugged and the bug-fixed programs
       produce a vnlog with a legend

        # time x y

       Joining this on "time" will produce a vnlog with a legend

        # time x y x y

       which is confusing, and not what you want. Instead, we invoke "vnl-join" as

        vnl-join --vnl-suffix1 _buggy --vnl-suffix2 _fixed -j time buggy.vnl fixed.vnl

       And in the output we get a legend

        # time x_buggy y_buggy x_fixed y_fixed

       Much better.
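
       The renaming itself is simple to picture. The following is a rough sketch of the legend
       rewrite using awk; "suffix_legend" is a hypothetical helper written for this example, and
       it assumes the join field is the first column of both legends.

```shell
# Hypothetical helper (not part of vnlog): read a legend on stdin and append
# the suffix $2 to every field name except the join field $1.
suffix_legend() {
    awk -v key="$1" -v suf="$2" '{
        out = "#"
        for (i = 2; i <= NF; i++)
            out = out " " (($i == key) ? $i : ($i suf))
        print out
    }'
}

legend_buggy=$(echo '# time x y' | suffix_legend time _buggy)
legend_fixed=$(echo '# time x y' | suffix_legend time _fixed)

# Output layout: join field, remaining fields of file 1, remaining fields of
# file 2. Assumes the join field is the first column, so fields 3 onward of
# the space-separated legend ("#" is token 1) are the remaining ones.
rest_fixed=$(printf '%s\n' "$legend_fixed" | cut -d' ' -f3-)
combined="$legend_buggy $rest_fixed"
printf '%s\n' "$combined"
```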

       Note that "vnl-join" provides several ways of specifying this. The above works only for
       2-way joins. An alternate syntax is available for N-way joins, a comma-separated list. The
       same could be expressed like this:

        vnl-join --vnl-suffix _buggy,_fixed -j time buggy.vnl fixed.vnl

       Finally, if passing in structured filenames, "vnl-join" can infer the desired prefixes or
       suffixes from the filenames. The same as above could be expressed even more simply:

        vnl-join --vnl-autosuffix -j time buggy.vnl fixed.vnl

       This works by looking at the set of passed-in filenames, and stripping out their common
       leading and trailing strings.
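
       A minimal sketch of that stripping step, for two filenames, in portable shell. The
       helpers "common_prefix" and "common_suffix" are hypothetical names for this example; how
       "vnl-join" turns the remaining strings into the final field suffixes (for instance,
       whether and how a separator is added) is not shown here.

```shell
# Hypothetical helper: print the longest common leading string of $1 and $2
common_prefix() {
    a=$1 b=$2 out=
    while [ -n "$a" ] && [ -n "$b" ]; do
        ca=${a%"${a#?}"}   # first character of a
        cb=${b%"${b#?}"}   # first character of b
        [ "$ca" = "$cb" ] || break
        out=$out$ca
        a=${a#?}; b=${b#?}
    done
    printf '%s\n' "$out"
}

# Hypothetical helper: print the longest common trailing string of $1 and $2
common_suffix() {
    a=$1 b=$2 out=
    while [ -n "$a" ] && [ -n "$b" ]; do
        ca=${a#"${a%?}"}   # last character of a
        cb=${b#"${b%?}"}   # last character of b
        [ "$ca" = "$cb" ] || break
        out=$ca$out
        a=${a%?}; b=${b%?}
    done
    printf '%s\n' "$out"
}

f1=buggy.vnl
f2=fixed.vnl
pre=$(common_prefix "$f1" "$f2")
suf=$(common_suffix "$f1" "$f2")

# Whatever is left after removing the shared parts identifies each file
id1=${f1#"$pre"}; id1=${id1%"$suf"}
id2=${f2#"$pre"}; id2=${id2%"$suf"}
printf '%s %s\n' "$id1" "$id2"
```

       For "buggy.vnl" and "fixed.vnl" there is no common leading string, the common trailing
       string is ".vnl", and the distinguishing parts are "buggy" and "fixed".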

   Sorting of input and output
       The GNU coreutils "join" tool expects sorted columns because it can then take only a
       single pass through the data. If the input isn't sorted, then we can use normal shell
       substitutions to sort it:

        $ vnl-join -j key <(vnl-sort -s -k key a.vnl) <(vnl-sort -s -k key b.vnl)

       For convenience "vnl-join" provides a "--vnl-sort" option. This allows the above to be
       equivalently expressed as

        $ vnl-join -j key --vnl-sort - a.vnl b.vnl

       The "-" after the "--vnl-sort" indicates that we want to sort the input only. If we also
       want to sort the output, pass the short codes "sort" accepts instead of the "-". For
       instance, to sort the input for "join" and to sort the output numerically, in reverse, do
       this:

        $ vnl-join -j key --vnl-sort rg a.vnl b.vnl

       The reason this shorthand exists is to work around a quirk of "join". The sort order is
       assumed by "join" to be lexicographical, without any way to change this. For "sort", this
       is the default sort order, but "sort" has many options to change the sort order, options
       which are sorely missing from "join". A real-world example affected by this is the joining
       of numerical data. If you have "a.vnl":

        # time a
        8 a
        9 b
        10 c

       and "b.vnl":

        # time b
        9  d
        10 e

       Then you cannot use "vnl-join" directly to join the data on time:

        $ vnl-join -j time a.vnl b.vnl
        # time a b
        join: /dev/fd/4:3: is not sorted: 10 c
        join: /dev/fd/5:2: is not sorted: 10 e
        9 b d
        10 c e

       Instead you must re-sort both files lexicographically, join them, and then (because you
       almost certainly want to) sort the result back into numerical order:

        $ vnl-join -j time <(vnl-sort -s -k time a.vnl) <(vnl-sort -s -k time b.vnl) |
          vnl-sort -s -n -k time
        # time a b
        9 b d
        10 c e

       Yuck. The shorthand described earlier makes the interface part of this palatable:

        $ vnl-join -j time --vnl-sort n a.vnl b.vnl
        # time a b
        9 b d
        10 c e

       Note that the input sort is stable: "vnl-join" will invoke "vnl-sort -s".  If you want a
       stable post-sort, you need to ask for it with "--vnl-sort s...".
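
       The same sort, join, re-sort sequence can be reproduced with the underlying coreutils
       tools on legend-free data, which shows what the shorthand has to do. This is a sketch
       under the assumption that the legend has already been stripped; the temporary file names
       are arbitrary.

```shell
tmp=$(mktemp -d)
printf '8 a\n9 b\n10 c\n' > "$tmp/a.dat"
printf '9 d\n10 e\n'      > "$tmp/b.dat"

# Pin the locale so that sort and join agree on the collation order
export LC_ALL=C

# Pre-sort: stable, lexicographic, on the join field. "10" sorts before "8".
sort -s -k1,1 "$tmp/a.dat" > "$tmp/a.sorted"
sort -s -k1,1 "$tmp/b.dat" > "$tmp/b.sorted"

# Join on the first field, then post-sort the result numerically
result=$(join "$tmp/a.sorted" "$tmp/b.sorted" | sort -s -n -k1,1)
rm -rf "$tmp"
printf '%s\n' "$result"
```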

   N-way joins
       The GNU coreutils "join" tool is inherently designed to join exactly two files. "vnl-join"
       extends this capability by chaining together a number of "join" invocations to produce a
       generic N-way join. This works exactly how you would expect with the following caveats:

       •   Full outer joins are supported by passing "-a-", but no other "-a" option is
           supported. This is possible, but wasn't obviously worth the trouble.

       •   "-v" is not supported. Again, this is possible, but wasn't obviously worth the
           trouble.

       •   Similarly, "-o" is not supported. This is possible, but wasn't obviously worth the
           trouble, especially since the desired behavior can be obtained by post-processing with
           "vnl-filter".
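
       The chaining idea can be sketched with plain "join" on legend-free, pre-sorted data. The
       helper "join_all" is hypothetical and only performs inner joins on the first field; it
       illustrates the approach and is not the "vnl-join" implementation.

```shell
# Hypothetical helper: inner-join any number of sorted files on their first
# field by feeding each intermediate result into the next join invocation.
join_all() {
    acc=$(mktemp)
    next=$(mktemp)
    cat "$1" > "$acc"
    shift
    for f in "$@"; do
        join "$acc" "$f" > "$next"
        cp "$next" "$acc"
    done
    cat "$acc"
    rm -f "$acc" "$next"
}

tmp=$(mktemp -d)
printf '1 a1\n2 a2\n3 a3\n' > "$tmp/a.dat"
printf '1 b1\n2 b2\n3 b3\n' > "$tmp/b.dat"
printf '1 c1\n3 c3\n'       > "$tmp/c.dat"

three_way=$(LC_ALL=C join_all "$tmp/a.dat" "$tmp/b.dat" "$tmp/c.dat")
rm -rf "$tmp"
printf '%s\n' "$three_way"
```

       Key 2 is missing from the third file, so it drops out of the inner join.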

BUGS AND CAVEATS

       The underlying "join" tool assumes lexicographic ordering, and matches fields purely based
       on their textual contents. This means that for the purposes of joining, 10, 10.0 and 1.0e1
       are all considered different. If needed, you can normalize your keys with something like
       this:

        vnl-filter -p x='sprintf("%f",x)'
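
       To see why this normalization helps, the same "%f" formatting can be applied with awk
       directly. This sketch deliberately does not invoke "vnl-filter", so it runs without vnlog
       installed; it only demonstrates that the three spellings collapse to one string.

```shell
keys='10
10.0
1.0e1'

# Format every key with "%f", as the sprintf() expression above does.
# Adding 0 forces awk to treat the field as a number.
normalized=$(printf '%s\n' "$keys" | LC_ALL=C awk '{ printf "%f\n", $1 + 0 }')

# Count the distinct textual keys before and after normalization
distinct_before=$(printf '%s\n' "$keys" | sort -u | wc -l | tr -d ' ')
distinct_after=$(printf '%s\n' "$normalized" | sort -u | wc -l | tr -d ' ')
printf '%s -> %s\n' "$distinct_before" "$distinct_after"
```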

COMPATIBILITY

       I use GNU/Linux-based systems exclusively, but everything has been tested functional on
       FreeBSD and OSX in addition to Debian, Ubuntu and CentOS. I can imagine there's something
       I missed when testing on non-Linux systems, so please let me know if you find any issues.

SEE ALSO

       join(1)

REPOSITORY

       https://github.com/dkogan/vnlog/

AUTHOR

       Dima Kogan "<dima@secretsauce.net>"

       Copyright 2018 Dima Kogan "<dima@secretsauce.net>"

       This library is free software; you can redistribute it and/or modify it under the terms of
       the GNU Lesser General Public License as published by the Free Software Foundation; either
       version 2.1 of the License, or (at your option) any later version.

                                            2023-01-14                                VNL-JOIN(1)