Provided by: libgetdata-tools_0.7.3-6ubuntu1_amd64

NAME

       dirfile-encoding — dirfile database encoding schemes

DESCRIPTION

       The  Dirfile Standards indicate that RAW fields defined in the database are accompanied by
       binary files containing the field data in the specified  simple  data  type.   In  certain
       situations, it may be advantageous to convert the binary files in the database into a more
       convenient form.  This is accomplished by encoding the  binary  file  into  the  alternate
       form.   A common use-case for encoding a binary file is to compress it to save disk space.
       Only data is modified by an encoding scheme.  Database metadata is unaffected.

       Support for encoding  schemes  is  optional.   An  implementation  need  not  support  any
       particular  encoding  scheme,  or  may only support certain operations with it, but should
       expect to encounter unknown encoding schemes and fail gracefully in such situations.

       Additionally, how a particular encoding is implemented is not  specified  by  the  Dirfile
       Standards,  but,  for  purposes  of  interoperability,  all  dirfile  implementations  are
       encouraged to support the encoding implementation used by the  GetData  dirfile  reference
       implementation, elaborated below.

       An encoding scheme is local to the particular format specification fragment in which it is
       indicated.  This allows a single dirfile to have  binary  files  which  are  stored  using
       multiple encodings, by having them defined in multiple fragments.

       The  rest of this manual page discusses specifics of the encoding framework implemented in
       the GetData library, and does not constitute part of the Dirfile Standards.

THE GETDATA ENCODING FRAMEWORK

       The GetData library provides an  encoding  framework  which  abstracts  binary  file  I/O,
       allowing  for generic support for a wide variety of encoding schemes.  Functions which may
       make use of the encoding framework are:

              gd_add(3),                      gd_add_raw(3),                      gd_add_spec(3),
              gd_alter_encoding(3),                                       gd_alter_endianness(3),
              gd_alter_frameoffset(3),                                         gd_alter_entry(3),
              gd_alter_raw(3),                  gd_alter_spec(3),                  gd_getdata(3),
              gd_move(3), gd_nframes(3), gd_putdata(3), and gd_rename(3).

       Most of the encodings supported by GetData  are  implemented  through  external  libraries
       which handle the actual file I/O and data translation.  All such libraries are optional; a
       build of the library which omits an external library will lack support for the  associated
       encoding  scheme.  In this case, GetData will still properly identify the encoding scheme,
       but  attempts  to  use  GetData  for  file  I/O  via  the  encoding  will  fail  with  the
       GD_E_UNSUPPORTED error code.

       GetData  discovers  the  encoding  scheme of a particular RAW field by noting the filename
       extension of files associated with the  field.   Binary  files  which  form  an  unencoded
       dirfile  have no file extension.  The file extensions used by the other encodings are noted
       below.  Encoding discovery proceeds by searching for files with the  known  list  of  file
       extensions (in an unspecified order) and stopping when the first successful match is made.
       Because of this, when a field has multiple data files  with  different,  supported  file
       extensions  which could legitimately be associated with it, the encoding scheme discovered
       by GetData is not well defined.
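
       The ambiguity described above can be illustrated with a minimal sketch.  It is not
       GetData's own code: the function name find_encodings and the table KNOWN_EXTENSIONS are
       hypothetical, and only the file extensions are taken from this manual page.  Unlike
       GetData, which stops at the first match, the sketch reports every match, so that a
       field with more than one candidate data file can be detected.

```python
import os

# Extensions named in this manual page; "" stands for unencoded data.
# The scheme names on the right are labels chosen for this sketch.
KNOWN_EXTENSIONS = {
    "": "none",
    ".txt": "text",
    ".bz2": "bzip2",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".slm": "slim",
}


def find_encodings(dirfile_path, field_name):
    """Return the scheme of every data file found for field_name."""
    found = []
    for extension, scheme in KNOWN_EXTENSIONS.items():
        candidate = os.path.join(dirfile_path, field_name + extension)
        if os.path.isfile(candidate):
            found.append(scheme)
    return found
```

       A result with more than one entry corresponds to the case in which the encoding
       discovered by GetData is not well defined.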

       In addition to raw (unencoded) data, GetData supports five other  encoding  schemes:  text
       encoding,  bzip2  encoding, gzip encoding, lzma encoding, and slim encoding, all discussed
       below.

   Text Encoding
       The Text Encoding is unique among GetData encoding schemes in that it requires no external
       library.   As  a result, all builds of the library contain full support for this encoding.
       It is meant to serve as a reference encoding and example of  the  encoding  framework  for
       work on other encoding schemes.

       The  Text  Encoding  replaces  the  binary  data files with 7-bit ASCII files containing a
       decimal text encoding of the data, one sample per line.  All operations are  supported  by
       the Text Encoding.  The file extension of the Text Encoding is .txt.
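
       As an illustration of the one-sample-per-line layout, the following sketch converts
       between packed binary samples and decimal text.  The function names are hypothetical,
       only integer data types are handled, and the exact number formatting used by GetData
       is not reproduced here.

```python
import struct


def binary_to_text(raw, fmt="<h"):
    """Render packed integer samples as decimal text, one sample per line.

    fmt is a struct format for a single sample; "<h" (little-endian
    16-bit signed integer) is an arbitrary default for this sketch.
    """
    return "".join(f"{sample[0]}\n" for sample in struct.iter_unpack(fmt, raw))


def text_to_binary(text, fmt="<h"):
    """Pack decimal text, one integer per line, back into binary samples."""
    return b"".join(struct.pack(fmt, int(line)) for line in text.splitlines())
```

       Round-tripping a buffer through both functions should return the original bytes,
       provided every sample fits in the chosen data type.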

   BZip2 Encoding
       The  BZip2  Encoding  compresses  raw binary files using the Burrows-Wheeler block sorting
       text compression algorithm and  Huffman  coding,  as  implemented  in  the  bzip2  format.
       GetData's BZip2 Encoding scheme is  implemented  through  the  bzip2  compression  library
       written by Julian Seward.   GetData's  BZip2  Encoding  framework  currently  lacks  write
       capabilities;  as  a  result,  the  BZip2 Encoding does not support functions which modify
       binary data.

       GetData caches an uncompressed megabyte of data at a time to speed access times.   A  call
       to  gd_nframes(3)  requires  decompression  of  the  entire  binary  file to determine its
       uncompressed size, and may take some time to complete.  The file extension  of  the  BZip2
       Encoding is .bz2.
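
       The cost of counting frames in a bzip2-compressed file can be seen in a short sketch.
       It uses Python's bz2 module rather than GetData, and the function names and the
       one-mebibyte chunk size are choices made for this example.  Because the sketch finds
       the uncompressed length only by decompressing the whole stream, its running time
       grows with the size of the data.

```python
import bz2

CHUNK = 1 << 20  # decompress one mebibyte at a time


def uncompressed_size(path):
    """Count uncompressed bytes by streaming through the whole file."""
    total = 0
    with bz2.open(path, "rb") as stream:
        while True:
            block = stream.read(CHUNK)
            if not block:
                break
            total += len(block)
    return total


def frame_count(path, bytes_per_sample, samples_per_frame):
    """Number of complete frames stored in the compressed file."""
    return uncompressed_size(path) // (bytes_per_sample * samples_per_frame)
```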

   GZip Encoding
       The  GZip  Encoding  compresses  raw  binary  files  using  Lempel-Ziv  coding  (LZ77)  as
       implemented in the gzip format.  GetData's GZip Encoding scheme is implemented through the
       zlib  compression  library  written  by  Jean-loup Gailly and Mark Adler.  GetData's  GZip
       Encoding framework currently lacks write capabilities; as a result, the GZip Encoding does
       not support functions which modify binary data.

       To speed the operation of gd_nframes(3), the GZip Encoding takes the uncompressed size  of
       the  file  from  the gzip footer, which contains the file's uncompressed  size  in  bytes,
       modulo 2^32.  As a result, using a field with an (uncompressed) binary file size of  4 GiB
       or more as the reference field will result in the wrong number of frames being reported.
       The file extension of the GZip Encoding is .gz.
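
       The footer lookup and its wrap-around can be sketched as follows.  The gzip format
       stores the uncompressed size modulo 2^32 (the ISIZE field) in the last four bytes of a
       member, in little-endian order.  The function names below are hypothetical, the file
       is assumed to hold a single gzip member, and the code does not use GetData.

```python
import os
import struct


def gzip_isize(path):
    """Return ISIZE: the uncompressed size in bytes, modulo 2**32."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)  # ISIZE occupies the final four bytes
        (isize,) = struct.unpack("<I", f.read(4))
    return isize


def reported_frames(true_size, bytes_per_frame):
    """Frame count computed from the footer alone, given the true size."""
    return (true_size % 2**32) // bytes_per_frame
```

       With 8 bytes per frame, a file of 2^32 + 800 uncompressed bytes and a file of 800
       uncompressed bytes both yield 100 frames, which is the misreport described above.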

   LZMA Encoding
       The  LZMA Encoding compresses raw binary files using the Lempel-Ziv Markov Chain Algorithm
       (LZMA) as implemented in the xz container  format.   GetData's  LZMA  Encoding  scheme  is
       implemented  through the lzma library, part of the XZ Utils suite written by Lasse Collin,
       Ville Koskinen, and Igor Pavlov.  GetData's LZMA Encoding framework currently lacks  write
       capabilities; as a result, the LZMA Encoding does not support functions which modify binary
       data.

       As with the BZip2 Encoding, GetData caches an uncompressed megabyte of data at a  time  to
       speed  access  times.  A call to gd_nframes(3) requires decompression of the entire binary
       file to determine its uncompressed size, and may take some time  to  complete.   The  file
       extension of the LZMA Encoding is .xz or .lzma.

   Slim Encoding
       The  Slim  Encoding  compresses  raw  binary  files  using the slimlib compression library
       written by Joseph Fowler.  The slimlib library was developed at  Princeton  University  to
       compress  dirfile-like  data.   GetData's  Slim  Encoding  framework currently lacks write
       capabilities; as a result, the Slim Encoding does not support functions which modify binary
       files.  The file extension of the Slim Encoding is .slm.

AUTHOR

       This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.

SEE ALSO

       dirfile(5), dirfile-format(5), bzip2(1), gzip(1), zlib(3).