Ubuntu Manpage: dirfile-format — the dirfile database format specification file

Provided by: libgetdata-tools_0.7.3-6ubuntu1_amd64

NAME

       dirfile-format — the dirfile database format specification file

DESCRIPTION

       The  dirfile  format  specification  fully  specifies the raw and derived time streams and
       auxiliary information for a dirfile(5) database.

       The format specification is contained in one or more case-sensitive text files located  in
       the  dirfile  tree.   Each  file is known as a fragment.  The primary fragment is the file
       called format located in the base dirfile directory.  This file may contain only  part  of
       the format specification, and may reference other fragments (using the /INCLUDE directive)
       containing  further  format  specification.   This  inclusion  mechanism  may  be   nested
       arbitrarily deep.

       The explicit text encoding of these files is not specified by these standards, but must be
       7-bit ASCII compatible.  Examples  of  acceptable  character  encodings  include  all  the
       ISO  8859  character  sets  (i.e.  Latin-1 through Latin-10, among others), as well as the
       UTF-8 encoding of Unicode and UCS.

SYNTAX

       The format specification is composed of field specification  lines  and  directive  lines,
       optionally  separated  by  blank  lines  or  lines  containing only whitespace.  Lines are
       separated by the line-feed character (0x0A).  Unless escaped (see below),  the  hash  mark
       (#)  is the comment delimiter; the comment delimiter, and any text following it to the end
       of the line, is ignored.

   Tokens
       Both field specification lines and directive lines consist of several tokens separated  by
       whitespace.   Whitespace  consists of one or more whitespace characters.  These are: space
       (0x20), horizontal tab (0x09), vertical tab (0x0B), form-feed (0x0C), and carriage  return
       (0x0D).   The  first token of a directive line is always a reserved word.  The first token
       of a field specification line is never a reserved word.   Any  amount  of  whitespace  may
       precede the first token on a line.

       Since tokens are separated by whitespace, to include a whitespace character in a token, it
       must either be escaped by preceding it with a backslash character (\), or be replaced by a
       character  escape  sequence  (see  below), or else the token must be enclosed in quotation
       marks (").  The quotation marks themselves will be stripped from the token. The null-token
       (that is, the token consisting of zero characters) may be specified by a pair of quotation
       marks with nothing between them ("").  To include a literal quotation mark in a token,  it
       must  be  escaped (\").  Similarly, a hash mark may be included in a token by including it
       in a quoted token or else by escaping it (\#), otherwise the hash mark will be  understood
       as the comment delimiter.

       It  is a syntax error to have a line which contains unmatched quotation marks, or in which
       the last character is the backslash character.

       Several characters when escaped by a preceding  backslash  character  are  interpreted  as
       special characters in tokens.  The character escape sequences are:

              \a     an alert (bell) character (ASCII 0x07 / U+0007)

              \b     a backspace character (ASCII 0x08 / U+0008)

              \e     an escape character (ASCII 0x1B / U+001B)

              \f     a form-feed character (ASCII 0x0C / U+000C)

              \n     a line-feed character (ASCII 0x0A / U+000A)

              \r     a carriage return character (ASCII 0x0D / U+000D)

              \t     a horizontal tab character (ASCII 0x09 / U+0009)

              \v     a vertical tab character (ASCII 0x0B / U+000B)

              \\     a backslash character (ASCII 0x5C / U+005C)

              \ooo   the single byte given by the octal number ooo.

              \xhh   the single byte given by the hexadecimal number hh.

              \uhhhhhhh
                     the  UTF-8  byte  sequence  encoding  the  Unicode  code  point given by the
                     hexadecimal number hhhhhhh.

       Any other character which is escaped is interpreted as the character itself.  (i.e.  \c is
       interpreted as c; also, as pointed out above, \" and \# are interpreted as simply " and #,
       without their special meanings).

       No token may contain the NULL character (ASCII  0x00  /  U+0000).   Furthermore,  although
       support  is  present  to  create UTF-8 byte sequences, tokens are not required to be valid
       UTF-8 sequences.  Any byte sequence not containing the NULL character forms a valid token.
       However, there may be further restrictions on the characters allowed in a token in a
       particular situation (for example, when it is used as a field name).
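
       The tokenising rules above can be sketched in Python as follows.  This is a minimal
       illustration, not the reference parser: the names tokenize, _escape and _digits are
       hypothetical, the line is assumed to arrive as bytes without its trailing line-feed, and
       the treatment of an octal escape larger than 0xFF, or of \x and \u with no digits after
       them, is an assumption, since this document does not define those cases.

```python
WHITESPACE = b" \t\v\f\r"
OCT = b"01234567"
HEX = b"0123456789abcdefABCDEF"
# Single-character escapes from the table above, keyed by byte value.
SIMPLE = {ord("a"): 0x07, ord("b"): 0x08, ord("e"): 0x1B, ord("f"): 0x0C,
          ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("v"): 0x0B,
          ord("\\"): 0x5C}


def _digits(line, start, allowed, limit):
    """Return (digits, end) for at most `limit` digits found at line[start:]."""
    end = start
    while end < len(line) and end - start < limit and line[end] in allowed:
        end += 1
    return line[start:end], end


def _escape(line, i):
    """Decode the escape whose introducing backslash sits at line[i - 1]."""
    c = line[i]
    if c in SIMPLE:
        return bytes([SIMPLE[c]]), i + 1
    if c in OCT:
        digits, end = _digits(line, i, OCT, 3)
        value = int(digits, 8)
        if value > 0xFF:  # assumption: reject values that do not fit in one byte
            raise ValueError("octal escape out of range")
        return bytes([value]), end
    if c in b"xu":
        digits, end = _digits(line, i + 1, HEX, 2 if c == ord("x") else 7)
        if digits and c == ord("x"):
            return bytes([int(digits, 16)]), end
        if digits:  # \u: emit the UTF-8 encoding of the code point
            return chr(int(digits, 16)).encode("utf-8"), end
    return bytes([c]), i + 1  # any other escaped character stands for itself


def tokenize(line):
    """Split one line (bytes, no trailing line-feed) into a list of byte tokens."""
    tokens, cur = [], bytearray()
    in_token = in_quotes = False
    i = 0
    while i < len(line):
        c = line[i]
        if c == 0x5C:  # backslash
            if i + 1 >= len(line):
                raise ValueError("line ends with a backslash")
            decoded, i = _escape(line, i + 1)
            cur += decoded
            in_token = True
        elif c == ord('"'):  # quotation marks are stripped; "" yields the null-token
            in_quotes, in_token = not in_quotes, True
            i += 1
        elif not in_quotes and c == ord("#"):  # unescaped, unquoted comment delimiter
            break
        elif not in_quotes and c in WHITESPACE:
            if in_token:
                tokens.append(bytes(cur))
                cur, in_token = bytearray(), False
            i += 1
        else:
            cur.append(c)
            in_token = True
            i += 1
    if in_quotes:
        raise ValueError("unmatched quotation mark")
    if in_token:
        tokens.append(bytes(cur))
    if any(0 in token for token in tokens):
        raise ValueError("tokens may not contain the NULL character")
    return tokens
```

       With this sketch, the line  name STRING "hello world" # comment  yields the three tokens
       name, STRING and hello world, while a line holding only a comment yields no tokens.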

DIRECTIVES

There are eight reserved words, which cannot be used as field names in the dirfile.
Instead, these specify directives. All reserved words start with an initial forward slash
(/), to distinguish them from field names. Previous versions of the Standards permitted
the omission of the slash. Like the rest of the format specification, directives are case
sensitive.

A number of the directives have fragment scope. A directive with fragment scope only
applies to the fragment in which it is present, plus any sub-fragments indicated by the
/INCLUDE directive, but only if those sub-fragments don't have their own corresponding
directive. Directives which have fragment scope are: /ENCODING, /ENDIAN, /FRAMEOFFSET,
and /PROTECT. Because of these scoping rules, different portions of the dirfile may have
different encodings, endiannesses, frame offsets, or protection levels.

If a directive with fragment scope appears more than once in a fragment, only the last
such directive will be honoured, with the exception that the effect of a directive will
not be propagated to sub-fragments if the directive line appears after the sub-fragment is
included. The scoping rules of the remaining directives are discussed below.
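
One possible reading of these scoping rules is sketched below in Python for the /ENCODING
directive. It is illustrative only: the function effective_encodings and the in-memory
mapping of fragment names to lines are hypothetical, lines are split on plain whitespace
rather than fully tokenised, and the sketch assumes that a sub-fragment inherits whichever
value is in effect in its parent at the point of inclusion.

```python
def effective_encodings(fragments, name="format", inherited=None, result=None):
    """Map each fragment name to the encoding scheme that applies to it.

    `fragments` maps a fragment name to its list of lines; None in the result
    means no scheme was specified (behaviour is then implementation dependent).
    """
    result = {} if result is None else result
    current = inherited
    for line in fragments[name]:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "/ENCODING":
            current = tokens[1]  # a later directive overrides an earlier one
        elif tokens[0] == "/INCLUDE":
            # The sub-fragment sees only directives that precede its inclusion.
            effective_encodings(fragments, tokens[1], current, result)
    result[name] = current  # the last directive in the fragment is honoured
    return result


fragments = {
    "format": ["/INCLUDE early", "/ENCODING gzip", "/INCLUDE late",
               "/INCLUDE own", "/ENCODING bzip2"],
    "early": ["a RAW UINT8 1"],
    "late": ["b RAW UINT8 1"],
    "own": ["/ENCODING none", "c RAW UINT8 1"],
}
encodings = effective_encodings(fragments)
```

Here the fragment named own keeps its own scheme, and early is unaffected by directives
that appear after its inclusion.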

/ENCODING
The /ENCODING directive specifies the encoding scheme used to encode binary files
in the dirfile. The encoding scheme may be one of the predefined names listed
below, which are described in more detail in dirfile-encoding(5), or any other
site-specific encoding scheme. The predefined scheme names are:

none    The dirfile is unencoded.

bzip2   The dirfile is compressed using the bzip2 compression scheme.

gzip    The dirfile is compressed using the gzip compression scheme.

lzma    The dirfile is compressed using the LZMA compression scheme.

slim    The dirfile is compressed using the slim compression scheme.

text    The dirfile is text encoded.

Implementations should fail gracefully when encountering an unknown encoding
scheme. If no encoding scheme is specified, behaviour is implementation dependent.
Syntax is:

/ENCODING <scheme>

The /ENCODING directive has fragment scope.

/ENDIAN
The /ENDIAN directive specifies the endianness of the raw data in the database.
The assumed endianness of raw data in dirfiles which omit this directive is
implementation dependent. Syntax is:

/ENDIAN ( big | little ) [ arm ]

where the "arm" token should be included if double precision floating point data
are stored in the ARM middle-endian format. The /ENDIAN directive has fragment
scope.

/FRAMEOFFSET
The /FRAMEOFFSET directive specifies the frame number of the first frame for which
data exists in binary files associated with RAW fields. Syntax is:

/FRAMEOFFSET <integer>

The /FRAMEOFFSET directive has fragment scope.

/INCLUDE
The /INCLUDE directive specifies another file (called a fragment) to parse for
additional format specification for the dirfile. The inclusion is treated as if
the lines of the fragment were pasted verbatim in place of the /INCLUDE directive
line. The exception to this is that RAW fields specified in the fragment are
located in the directory containing the fragment and not in the directory
containing the parent fragment, and the binary file encoding may be different for
each fragment. The fragment may be specified either with an absolute path, or else
a relative path from the current file. Syntax is:

/INCLUDE <file>

The /INCLUDE directive has no scope: it is processed immediately and has no long-term
effect.

/META
The /META directive specifies a metafield attached to a particular parent field.
The field metadata may be of any allowed type except RAW. Metafields are retrieved
in exactly the same way as regular field data, but the field code specified
consists of the parent and metafield names joined with a forward slash:

<parent-field>/<meta-field>

Metafields may not be specified before their parent field has been specified. Syntax is:

/META <parent-field> {field specification line}

As an illustration of this concept,

/META pfield meta CONST FLOAT64 3.291882

provides a scalar metadatum called meta with value 3.291882 attached to the field
pfield. This particular metafield may be referred to by the field code
"pfield/meta". Note that different parent fields may have metafields with the same
name, since all references to metafields must include the parent field name.
Metafields may not themselves have further sub-metafields.

As an alternative to the /META directive, a metafield may be specified by a
standard field specification line, using

<parent-field>/<meta-field>

as the field name. That is, the above example metafield could have also been
specified as:

pfield/meta CONST FLOAT64 3.291882

The /META directive has no scope: it is processed immediately and has no long-term
effect.

/PROTECT
The /PROTECT directive specifies the advisory protection level of the current
fragment and of the RAW fields defined therein. The protection level indicates
whether writing to the fragment, or to the binary data on disk, is permitted. Syntax
is:

/PROTECT <level>

Four advisory protection levels are defined:

none    No protection at all: data and metadata may be freely changed. This is the
        default, if no /PROTECT directive is present.

format  The dirfile metadata is protected from change, but RAW data on disk may be
        modified.

data    The RAW data on disk is protected from change, but metadata may be modified.

all     Both metadata and data on disk are protected from change.

The /PROTECT directive has fragment scope.

/REFERENCE
The /REFERENCE directive specifies the name of the field to use as the dirfile's
reference field (see dirfile(5)). If no /REFERENCE directive is specified, the
first RAW field encountered is used as the reference field. The /REFERENCE
directive must specify a RAW field. Syntax is:

/REFERENCE <field-code>

The /REFERENCE directive has global scope: if multiple /REFERENCE directives appear
in the dirfile metadata, only the last such will be honoured.

/VERSION
The /VERSION directive specifies the particular version of the Dirfile Standards to
which the dirfile format specification conforms. This directive should occur
before any version dependent syntax is encountered. As of Standards Version 6, no
such syntax exists, and this directive is provided primarily to ease forward
compatibility. Syntax is:

/VERSION <integer>

The /VERSION directive has immediate scope: its effect is immediate, and it applies
only to metadata below it, including and propagating downwards to sub-fragments
after the directive. Its effect will also propagate upwards back to the parent
fragment, and affect subsequent metadata.

FIELD SPECIFICATION LINES

       Any line which does not start with a reserved word is assumed to be a field  specification
       line.  A field specification line consists of at least two tokens.  The first token is the
       field name.  The second token is the field type.  Subsequent tokens are field  parameters.
       The meaning and number of these parameters depend on the field type specified.

   Field Names
       The  first token in a field specification line is the field name.  The field name consists
       of one or more characters, excluding both ASCII control characters (the bytes 0x01 through
       0x1F), and the characters

              &    /    ;    <    >    |    .

       which are reserved (but see below for the use of / to specify metafields).  The field name
       may not be INDEX, which is a special, implicit field  which  contains  the  integer  frame
       index.  Field names are case sensitive.

       If  the  field  name  beginning a field specification line does contain a / character, the
       line is assumed to specify a  metafield.   See  the  /META  directive  above  for  further
       details.

   Field Types
       There  are  thirteen  field types.  Of these, ten are of vector type (BIT, DIVIDE, LINCOM,
       LINTERP, MULTIPLY, PHASE, POLYNOM, RAW, RECIP, and SBIT) and  three  are  of  scalar  type
       (CONST, CARRAY, and STRING).  The possible field types are:

       BIT    The BIT vector field type extracts one or more bits out of an input vector field as
              an unsigned number.  Syntax is:

                     <field-name> BIT <input> <first-bit> [<bits>]

              which specifies field-name to be the value of bits first-bit through
              first-bit+bits-1 of the input vector field input, when input is converted from its
              native type to an (endianness corrected) unsigned 64-bit integer.  If bits is
              omitted, it is assumed to be 1.  Both first-bit and bits may be either literal
              numbers, or else the field code of a CONST or CARRAY field type containing their
              values.  The SBIT field type is a signed version of this field type.
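
              As a rough illustration, the extraction can be written in Python as below.  The
              function name bit_field is hypothetical, and the sketch assumes that bit 0 is
              the least significant bit, which this document does not state explicitly.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def bit_field(value, first_bit, bits=1):
    """Return bits first_bit .. first_bit+bits-1 of value as an unsigned number."""
    unsigned = value & MASK64  # view the sample as an unsigned 64-bit integer
    return (unsigned >> first_bit) & ((1 << bits) - 1)
```

              For instance, bit_field(0b10110100, 2, 3) selects the bit pattern 101.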

       CARRAY The  CARRAY  scalar field type is a list of constants fully specified in the format
              specification metadata.  Syntax is:

                     <field-name> CARRAY <type> <value0> <value1> <value2> ...

              where type may be any supported native data type (see the description  of  the  RAW
              field type below), and value0, value1, &c. are the values of successive elements in
              the scalar list interpreted as indicated by type.  No limit is placed on the number
              of  elements in a CARRAY.  (Note: despite being multivalued, this is not considered
              a vector field since the elements of the CARRAY are not indexed by frames.)

       CONST  The  CONST  scalar  field  type  is  a  constant  fully  specified  in  the  format
              specification metadata.  Syntax is:

                     <field-name> CONST <type> <value>

              where  type  may  be any supported native data type (see the description of the RAW
              field type below), and value is the numerical value of the constant interpreted  as
              indicated by type.

       DIVIDE The DIVIDE vector field type is the quotient of two vector fields.  Syntax is:

                     <field-name> DIVIDE <field1> <field2>

              The derived field will be computed as:

                     field-name[n] = field1[n] / field2[n2]

              with  the  index  n2  computed appropriately for the (potentially differing) sample
              rates of the input fields.  The resultant field will have the same sample  rate  as
              field1.
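
              The phrase "computed appropriately" is not defined further here.  The Python
              sketch below assumes one plausible rule, n2 = floor(n * spf2 / spf1), where spf1
              and spf2 are the samples per frame of the two inputs; the function name divide
              and both parameter names are hypothetical.

```python
def divide(field1, spf1, field2, spf2):
    """Quotient of two vectors, sampled at the rate of field1.

    Assumption: sample n of field1 is paired with sample n * spf2 // spf1 of
    field2, so both samples fall at (nearly) the same position within a frame.
    """
    out = []
    for n, numerator in enumerate(field1):
        n2 = n * spf2 // spf1
        out.append(numerator / field2[n2])
    return out
```

              The same index mapping would apply to the second input of MULTIPLY and to the
              second and third inputs of LINCOM.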

       LINCOM The  LINCOM  vector field type is the linear combination of one, two or three input
              vector fields.  Syntax is:

                     <field-name> LINCOM [<n>] <field1> <a1> <b1> [<field2> <a2>  <b2>  [<field3>
                     <a3> <b3>]]

              where n, if present, indicates the number of input vector fields (1, 2, or 3).  The
              derived field will be computed as:

                     field-name[n] = (a1 * field1[n] + b1) + (a2 *  field2[n2]  +  b2)  +  (a3  *
                     field3[n3] + b3)

              with  the field2 and field3 terms included only if specified and the indices n2 and
              n3 computed appropriately for the (potentially differing) sample rates of the input
              fields.   The  resultant  field  will  have  the  same sample rate as field1.  Each
              supplied co-efficient (a1, b1, a2, &c.) may be either a literal number, or else the
              field code of a CONST or CARRAY field type containing its value.

              If  n  is  not  specified,  the  number  of  fields is determined by looking at the
              supplied parameters.  Since it  is  possible  to  create  a  field  code  which  is
              identical to a literal number, the third token on the line is assumed to be n if
              the entire token can be parsed as a literal number using the rules outlined in
              strtod(3).   That  is,  if the field code specifying field1 could be mistaken for a
              literal number, n must be specified to prevent ambiguity.
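
              The disambiguation rule can be sketched in Python as follows.  The regular
              expression only approximates the forms accepted by strtod(3), and the names
              is_literal_number and lincom_inputs are hypothetical.  The sketch also assumes
              that n, when present, is written as a plain integer.

```python
import re

# Approximation of strtod(3): decimal, hexadecimal, infinity and NaN forms.
_NUMBER = re.compile(r"""[+-]?(
      (\d+\.?\d*|\.\d+)([eE][+-]?\d+)?
    | 0[xX]([0-9a-fA-F]+\.?[0-9a-fA-F]*|\.[0-9a-fA-F]+)([pP][+-]?\d+)?
    | inf(inity)?
    | nan(\([0-9A-Za-z_]*\))?
    )""", re.VERBOSE | re.IGNORECASE)


def is_literal_number(token):
    """True if the entire token reads as a literal number."""
    return _NUMBER.fullmatch(token) is not None


def lincom_inputs(tokens):
    """Split the tokens of a LINCOM line into (field, a, b) triples."""
    params = list(tokens[2:])  # drop the field name and the LINCOM keyword
    if params and is_literal_number(params[0]):
        n = int(params.pop(0))  # the third token is n
    else:
        n = len(params) // 3  # infer n from the number of parameters
    if n not in (1, 2, 3) or len(params) != 3 * n:
        raise ValueError("malformed LINCOM specification")
    return [tuple(params[3 * i:3 * i + 3]) for i in range(n)]
```

              Under this sketch, an input field whose code is 10 can only be used by writing n
              explicitly, as in  out LINCOM 1 10 2 3.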

       LINTERP
              The LINTERP vector field type specifies a table look up  based  on  another  vector
              field.  Syntax is:

                     <field-name> LINTERP <input> <table>

              where  input  is the input vector field for the table lookup, and table is the path
              to the lookup table file for the field.  If this path is relative, it is assumed to
              be  relative  to  the  directory  containing the fragment defining this field.  The
              lookup table file is an ASCII text file with two whitespace separated columns of  x
              and y values.  Values are linearly interpolated between the points specified in the
              lookup table.
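
              A minimal Python sketch of such a lookup follows.  The names load_table and
              linterp are hypothetical, the table is assumed to hold at least two points with
              distinct x values, and extending the first or last segment for inputs outside
              the table is an assumption, since this document does not cover that case.

```python
from bisect import bisect_right


def load_table(text):
    """Parse two whitespace-separated columns of x and y into sorted points."""
    points = []
    for line in text.splitlines():
        columns = line.split()
        if not columns:
            continue  # skip blank lines
        x, y = columns  # raises ValueError unless there are exactly two columns
        points.append((float(x), float(y)))
    return sorted(points)


def linterp(table, x):
    """Linearly interpolate y at x between neighbouring table points."""
    xs = [point[0] for point in table]
    # Index of the right-hand point of the segment, clamped to the table.
    i = min(max(bisect_right(xs, x), 1), len(table) - 1)
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


table = load_table("0 0\n10 100\n\n20 400\n")
```

              With this table, an input of 5 lies halfway along the first segment and an input
              of 15 lies halfway along the second.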

       MULTIPLY
              The MULTIPLY vector field type is the product of two vector fields.  Syntax is:

                     <field-name> MULTIPLY <field1> <field2>

              The derived field will be computed as:

                     field-name[n] = field1[n] * field2[n2]

              with the index n2 computed appropriately for  the  (potentially  differing)  sample
              rates  of  the input fields.  The resultant field will have the same sample rate as
              field1.

       PHASE  The PHASE vector field type shifts an input vector field by the specified number of
              samples.  Syntax is:

                     <field-name> PHASE <input> <shift>

              which  specifies  field-name  to be the input vector field, input, shifted by shift
              samples.  A positive shift indicates a forward  shift,  towards  the  end-of-field.
              Results of shifting past the beginning- or end-of-field are implementation
              dependent.  The shift parameter may be either a literal number, or else the field
              code of a CONST or CARRAY field type containing its value.

       POLYNOM
              The  POLYNOM  vector  field  type specifies a polynomial function of a single input
              vector field.  Syntax is:

                     <field-name> POLYNOM <input> <a0> <a1> [<a2> [<a3> [<a4> [<a5>]]]]

              where <input> is the input field code, and the order of the computed polynomial  is
              determined by how many co-efficients are present in the specification.  The derived
              field is computed as:

                     field-name[n] = a0 + a1 * input[n] + a2 * input[n]**2 + a3 *  input[n]**3  +
                     a4 * input[n]**4 + a5 * input[n]**5

              where ** is the exponentiation operator, and the higher order terms are computed
              only if the corresponding co-efficients ai are specified.  The co-efficients may
              be either literal numbers, or else the field codes of CONST or CARRAY field types
              containing their values.
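
              The computation can be illustrated in Python as below.  The function name
              polynom is hypothetical, and Horner's rule is used here only as a convenient
              way to evaluate the same polynomial; it is not prescribed by the Standards.

```python
def polynom(samples, coefficients):
    """Evaluate a0 + a1*x + ... + ak*x**k for every sample x (k from 1 to 5)."""
    if not 2 <= len(coefficients) <= 6:
        raise ValueError("POLYNOM takes between two and six co-efficients")
    out = []
    for x in samples:
        acc = 0
        for a in reversed(coefficients):  # Horner's rule, highest order first
            acc = acc * x + a
        out.append(acc)
    return out
```

              For example, the co-efficients 1, 2 and 3 describe the second-order polynomial
              1 + 2 * input[n] + 3 * input[n]**2.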

       RECIP  The RECIP vector field type computes the reciprocal of a single input vector field.
              Syntax is:

                     <field-name> RECIP <input> <dividend>

              where  <input>  is  the  input field code and <dividend> is a scalar quantity.  The
              derived field is computed as:

                     field-name[n] = dividend / input[n].

              The dividend may be either a literal number, or else the field code of a CONST
              or CARRAY field type containing its value.

       RAW    The  RAW  vector  field type specifies raw time streams on disk.  In this case, the
              field name should correspond to the name of the file containing  the  time  stream.
              Syntax is:

                     <field-name> RAW <type> <sample-rate>

              where  sample-rate  is  the number of samples per dirfile frame for the time stream
              and type is a token specifying the native data format type:

                     UINT8  unsigned 8-bit integer

                     INT8   signed (two's complement) 8-bit integer

                     UINT16 unsigned 16-bit integer

                     INT16  signed (two's complement) 16-bit integer

                     UINT32 unsigned 32-bit integer

                     INT32  signed (two's complement) 32-bit integer

                     UINT64 unsigned 64-bit integer

                     INT64  signed (two's complement) 64-bit integer

                     FLOAT32 or FLOAT
                            IEEE-754 standard 32-bit single precision floating point number

                     FLOAT64 or DOUBLE
                            IEEE-754 standard 64-bit double precision floating point number

                     COMPLEX64
                            a 64-bit complex number consisting of two  IEEE-754  standard  32-bit
                            single  precision  floating  point  numbers representing the real and
                            imaginary parts of the complex number.

                     COMPLEX128
                            a 128-bit complex number consisting of two IEEE-754  standard  64-bit
                            double  precision  floating  point  numbers representing the real and
                            imaginary parts of the complex number.

              For more information on the storage of complex valued data, see dirfile(5).

              For backwards compatibility, implementations should also  recognise  the  following
              single character type aliases in use prior to Standards Version 5:

                     c      UINT8

                     u      UINT16

                     s      INT16

                     U      UINT32

                     i, S   INT32

                     f      FLOAT32

                     d      FLOAT64

              Types  INT8,  UINT64,  INT64,  COMPLEX64,  and  COMPLEX128 are not supported before
              Standards Version 5, so no single character type aliases  exist  for  these  types.
              Standards Version 8 removed support for these single character type codes.

              The sample-rate parameter may be either a literal number, or else the field code
              of a CONST or CARRAY field type containing its value.
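
              As an illustration, unencoded RAW data of these types could be decoded in Python
              with the standard struct module, as sketched below.  The function name
              decode_raw is hypothetical.  The sketch assumes that complex samples store the
              real part first, and it ignores the ARM middle-endian variant of /ENDIAN.

```python
import struct

# struct format character for each native data format type named above.
STRUCT_CODES = {
    "UINT8": "B", "INT8": "b", "UINT16": "H", "INT16": "h",
    "UINT32": "I", "INT32": "i", "UINT64": "Q", "INT64": "q",
    "FLOAT32": "f", "FLOAT": "f", "FLOAT64": "d", "DOUBLE": "d",
    "COMPLEX64": "f", "COMPLEX128": "d",  # stored as pairs of floats
}


def decode_raw(data, type_name, big_endian=False):
    """Decode the bytes of an unencoded RAW file into a list of samples."""
    code = STRUCT_CODES[type_name]
    prefix = ">" if big_endian else "<"  # standard sizes, no padding
    size = struct.calcsize(prefix + code)
    count = len(data) // size  # ignore a trailing partial value
    values = struct.unpack(f"{prefix}{count}{code}", data[:count * size])
    if type_name.startswith("COMPLEX"):
        # Assumption: each sample is stored as (real, imaginary).
        return [complex(re, im) for re, im in zip(values[0::2], values[1::2])]
    return list(values)
```

              The number of complete frames held by such a file is then the number of decoded
              samples divided by sample-rate, rounded down.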

       SBIT   The SBIT vector field type extracts one or more bits out of an input  vector  field
              as a signed number.  Syntax is:

                     <field-name> SBIT <input> <first-bit> [<bits>]

              which specifies field-name to be the value of bits first-bit through
              first-bit+bits-1 of the input vector field input, when input is converted from its
              native type to an (endianness corrected) signed 64-bit integer.  If bits is
              omitted, it is assumed to be 1.  Both first-bit and bits may be either literal
              numbers, or else the field code of a CONST or CARRAY field type containing their
              values.  The BIT field type is an unsigned version of this field type.
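
              A rough Python illustration follows.  The function name sbit_field is
              hypothetical, and the sketch assumes that bit 0 is the least significant bit and
              that the extracted bits are read as a two's complement number, neither of which
              is stated explicitly in this document.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def sbit_field(value, first_bit, bits=1):
    """Return bits first_bit .. first_bit+bits-1 of value as a signed number."""
    raw = ((value & MASK64) >> first_bit) & ((1 << bits) - 1)
    sign_bit = 1 << (bits - 1)
    # Two's complement: a set top bit makes the extracted value negative.
    return raw - (1 << bits) if raw & sign_bit else raw
```

              For instance, the three-bit pattern 101 is read as -3 by this sketch, whereas
              the BIT field type would read the same bits as 5.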

       STRING The STRING scalar field type is a character string fully specified  in  the  format
              file metadata.  Syntax is:

                     <field-name> STRING <value>

              where  value  is the string value of the field.  Note that value is a single token.
              To include whitespace in the string, enclose value in quotation marks ("), or  else
              escape the whitespace with the backslash character (\).

   Field Parameters
       All  input  vector field parameters should be field codes (see below).  Additionally, some
       of the numerical field parameters may be either literal numbers or else the field code  of
       a CONST field containing the value, or the field code of a CARRAY followed by a left angle
       bracket (<), then a non-negative integer used as the CARRAY element index, then a right
       angle bracket (>), that is:

              field_code<n>

       Parameters  which allow non-literal values are indicated above.  If the angle brackets and
       element index are omitted from a CARRAY field code used as a parameter, the first  element
       in the field (index zero) is assumed.

       Since  it  is  possible  to  create a field code which is identical to a literal number, a
       parameter is assumed to be the field code of a scalar  field  only  if  the  entire  token
       cannot  be parsed as a literal number using the rules outlined in strtod(3).  For example,
       a CONST field whose field code consists solely of digits can never be used as a  parameter
       in a field specification line.
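
       The resolution order described above might be sketched in Python as follows.  The
       function resolve_parameter and the scalars mapping (field code to a number for a CONST,
       or to a list for a CARRAY) are hypothetical, and the regular expression covers only
       decimal literals, omitting the hexadecimal, infinity and NaN forms that strtod(3) also
       accepts.

```python
import re

# Decimal literals only; a simplification of the strtod(3) rules.
_DECIMAL = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")
# A CARRAY field code followed by <n>.
_ELEMENT = re.compile(r"(.+)<(\d+)>")


def resolve_parameter(token, scalars):
    """Return the numerical value denoted by a scalar parameter token."""
    if _DECIMAL.fullmatch(token):
        return float(token)  # a literal number always wins over a field code
    match = _ELEMENT.fullmatch(token)
    if match and isinstance(scalars.get(match.group(1)), list):
        return scalars[match.group(1)][int(match.group(2))]
    value = scalars[token]
    # A CARRAY named without an element index means element zero.
    return value[0] if isinstance(value, list) else value


scalars = {"gain": 2.5, "cal": [1.0, 2.0, 3.0], "123": 9.0}
```

       In this example the CONST field named 123 is unreachable as a parameter, because the
       token 123 is always read as a literal number.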

       A  literal complex number is specified as two real (floating point) numbers separated by a
       semicolon (;) with no intervening whitespace.  So, for example, the tokens

              1;0 0;1 4;0 0;5 9.313e2;74.1

       represent, respectively, the real unit, the imaginary unit,  the  real  number  four,  the
       imaginary  number  5i,  and  the  complex  number  931.3  +  74.1i.  Because the semicolon
       character cannot be used in field names, a complex valued literal can  never  be  mistaken
       for  a  field  code.   This  allows, among other things, the composition of complex valued
       fields from purely real input fields.  For example, a complex  valued  field,  z,  may  be
       created from a real valued field re, representing the real part of the complex number, and
       the real valued field im, representing the imaginary part of the complex number, with  the
       following LINCOM specification:

              z LINCOM re 1 0 im 0;1 0
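
       The following Python sketch parses such literals and evaluates the LINCOM above.  The
       names parse_complex_literal and lincom_z are hypothetical, and Python's float() is used
       as a stand-in for the numeric parsing rules, which it only approximates.

```python
def parse_complex_literal(token):
    """Parse 'real;imaginary' (or a plain real literal) into a complex number."""
    real, separator, imaginary = token.partition(";")
    return complex(float(real), float(imaginary) if separator else 0.0)


def lincom_z(re, im):
    """Evaluate  z LINCOM re 1 0 im 0;1 0  sample by sample."""
    a1, b1 = parse_complex_literal("1"), parse_complex_literal("0")
    a2, b2 = parse_complex_literal("0;1"), parse_complex_literal("0")
    return [(a1 * x + b1) + (a2 * y + b2) for x, y in zip(re, im)]


literals = [parse_complex_literal(t) for t in "1;0 0;1 4;0 0;5 9.313e2;74.1".split()]
```

       Both inputs are assumed here to have the same sample rate, so that the samples can be
       paired one to one.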

   Field Codes
       When  specifying the input to a field, either as a scalar parameter, or as an input vector
       field to a non-RAW vector field, field codes are used.  A field code is one of:

       •   a simple field name, indicating a vector or scalar field

       •   a parent field name, followed by a  forward  slash,  followed  by  a  metafield  name,
           indicating  a metafield.  See the description of the /META directive above for further
           details.

       •   either of the above, followed by a period, followed by a  representation  suffix,  but
           only if the field or metafield specified is not a STRING type field.

       A representation suffix may be used to extract a real number from a complex value.
       The available suffixes and their meanings are:

       .a     This representation indicates the angle (in radians) between the positive real axis
              and the value (i.e. the complex argument).  The argument is in the range [-pi, pi],
              and a branch cut exists along the negative real axis.  At the branch cut, -pi is
              returned if the imaginary part is -0, and pi is returned if the imaginary part is
              +0.  If z=0, zero is returned.

       .i     This representation indicates the projection of the value onto the imaginary axis
              (i.e. the imaginary part of the number).

       .m     This representation indicates the modulus of the value (i.e. its absolute value).

       .r     This representation indicates the projection of the value onto the real axis (i.e.
              the real part of the number).

       If the specified field is purely real,  the  representations  are  calculated  as  if  the
       imaginary part were equal to +0.  For example, given a complex valued vector, z, a vector
       containing the real part of z, re_z, could be produced with:

              re_z PHASE z.r 0

       and similarly for the complex  field's  imaginary  part,  argument,  and  absolute  value.
       (This simple example is not strictly necessary, since z.r could be used wherever re_z
       would be.)
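
       Field codes and representations can be illustrated with the Python sketch below.  The
       names split_field_code and represent are hypothetical.  The sketch relies on the fact
       that the period and the forward slash are reserved in field names, and on Python's
       cmath.phase, whose branch cut behaviour matches the description of .a above.

```python
import cmath


def split_field_code(code):
    """Split a field code into (field or parent, metafield, representation)."""
    base, dot, suffix = code.partition(".")
    parent, slash, meta = base.partition("/")
    return parent, (meta if slash else None), (suffix if dot else None)


def represent(value, suffix):
    """Apply a representation suffix (a, i, m or r) to a real or complex value."""
    z = complex(value)  # a purely real value gets an imaginary part of +0
    if suffix == "a":
        return cmath.phase(z)  # argument in [-pi, pi]
    if suffix == "i":
        return z.imag
    if suffix == "m":
        return abs(z)
    if suffix == "r":
        return z.real
    raise ValueError("unknown representation suffix")
```

       For example, the field code pfield/meta.r splits into the parent pfield, the metafield
       meta and the representation r.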

STANDARDS VERSIONS

This document describes Version 8 of the Dirfile Standards.

Version 8 of the Standards (November 2010) added the DIVIDE, RECIP, and CARRAY field
types, made the forward slash on reserved words mandatory, and prohibited using the single
character data type aliases in the specification of RAW fields. It also introduced the
optional second (arm) token to the /ENDIAN directive.

Version 7 of the Standards (October 2009) added the SBIT and POLYNOM field types, and the
directive-less method of specifying metafields. It also introduced the data types
COMPLEX128 and COMPLEX64, along with the notion of representations. Finally, it made the
number of fields parameter for LINCOM optional.

Version 6 of the Standards (October 2008) added the /ENCODING, /META, /PROTECT, and
/REFERENCE directives, and the CONST and STRING field types. It permitted whitespace in
tokens and introduced the character escape sequences. It allowed CONST fields to be used
as parameters in field specification lines. It also removed FILEFRAM as an alias for
INDEX, and prohibited . but allowed # and \ in field names.

Version 5 of the Standards (August 2008) added VERSION and ENDIAN, slash demarcation of
reserved words, and removed the restriction on field name length. It introduced the data
types INT8, INT64, and UINT64, the new-style type specifiers, and increased the range of
the BIT field type from 32 to 64 bits. It also prohibited the characters &;<>\| in field
names.

Version 4 of the Standards (October 2006) added the PHASE field type.

Version 3 of the Standards (January 2006) added INCLUDE and increased the allowed length
of a field name from 16 to 50 characters.

Version 2 of the Standards (September 2005) added the MULTIPLY field type.

Version 1 of the Standards (November 2004) added FRAMEOFFSET and the optional fourth
argument to the BIT field type.

Version 0 of the Standards (before March 2003) refers to the dirfile standards supported
by the getdata(3) library originally introduced into the kst(1) sources, which contained
support for all other features covered by this document.

AUTHORS

       The    dirfile    specification     was     developed     by     C.     B.     Netterfield
       <netterfield@astro.utoronto.ca>.

       Since  Standards  Version  3, the dirfile specification has been maintained by D. V. Wiebe
       <getdata@ketiltrout.net>.