Provided by: libgetdata-tools_0.7.3-6ubuntu1_amd64

NAME

       dirfile-encoding — dirfile database encoding schemes

DESCRIPTION

       The  Dirfile  Standards  indicate that RAW fields defined in the database are accompanied by binary files
       containing the field data in  the  specified  simple  data  type.   In  certain  situations,  it  may  be
       advantageous  to  convert  the  binary  files  in  the  database  into  a  more convenient form.  This is
       accomplished by encoding the binary file into the alternate form.   A  common  use-case  for  encoding  a
       binary file is to compress it to save disk space.  Only data is modified by an encoding scheme.  Database
       metadata is unaffected.

       Support for encoding schemes is optional.  An implementation need not  support  any  particular  encoding
       scheme,  or  may only support certain operations with it, but should expect to encounter unknown encoding
       schemes and fail gracefully in such situations.

       Additionally, how a particular encoding is implemented is not specified by the  Dirfile  Standards,  but,
       for  purposes  of  interoperability,  all  dirfile implementations are encouraged to support the encoding
       implementation used by the GetData dirfile reference implementation, elaborated below.

       An encoding scheme is local to the particular format specification fragment in  which  it  is  indicated.
       This  allows  a  single dirfile to have binary files which are stored using multiple encodings, by having
       them defined in multiple fragments.

       The rest of this manual page discusses specifics of the encoding framework  implemented  in  the  GetData
       library, and does not constitute part of the Dirfile Standards.

THE GETDATA ENCODING FRAMEWORK

       The  GetData library provides an encoding framework which abstracts binary file I/O, allowing for generic
       support for a wide variety of encoding schemes.  Functions which may make use of the  encoding  framework
       are:

              gd_add(3),    gd_add_raw(3),    gd_add_spec(3),    gd_alter_encoding(3),   gd_alter_endianness(3),
              gd_alter_frameoffset(3),  gd_alter_entry(3),  gd_alter_raw(3),  gd_alter_spec(3),   gd_getdata(3),
              gd_move(3), gd_nframes(3), gd_putdata(3), and gd_rename(3).

       Most  of  the  encodings supported by GetData are implemented through external libraries which handle the
       actual file I/O and data translation.  All such libraries are optional; a  build  of  the  library  which
       omits  an  external  library will lack support for the associated encoding scheme.  In this case, GetData
       will still properly identify the encoding scheme, but attempts to  use  GetData  for  file  I/O  via  the
       encoding will fail with the GD_E_UNSUPPORTED error code.

       GetData discovers the encoding scheme of a particular RAW field by noting the filename extension of files
       associated with the field.  Binary files which form an unencoded dirfile have no file extension.  The
       file extensions used by the other encodings are noted below.  Encoding discovery proceeds by searching
       for files with the known list of file extensions (in an unspecified order) and stopping when the first
       successful match is made.  Because of this, when a field has multiple data files with different,
       supported file extensions which could legitimately be associated with it, the encoding scheme  discovered
       by GetData is not well defined.
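
       The discovery procedure described above can be sketched as follows.  This is a minimal illustration,
       not GetData's actual implementation: the function names, the encoding labels, and the fixed search
       order are all assumptions made for the example (GetData leaves the order unspecified).  Only the file
       extensions themselves are taken from this manual page.

```python
import os

# File extensions named in this manual page.  The empty string stands for
# raw (unencoded) data.  The labels and the iteration order are assumptions
# made for this sketch; GetData does not specify a search order.
KNOWN_EXTENSIONS = {
    "": "none",
    ".txt": "text",
    ".bz2": "bzip2",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".slm": "slim",
}


def discover_encoding(dirfile_path, field_name):
    """Return the label of the first matching data file, or None."""
    for extension, label in KNOWN_EXTENSIONS.items():
        candidate = os.path.join(dirfile_path, field_name + extension)
        if os.path.isfile(candidate):
            return label  # stop at the first successful match
    return None


def matching_files(dirfile_path, field_name):
    """List every data file that could be associated with the field."""
    return [
        field_name + extension
        for extension in KNOWN_EXTENSIONS
        if os.path.isfile(os.path.join(dirfile_path, field_name + extension))
    ]
```

       If matching_files() returns more than one name for a field, the result of discover_encoding() depends
       only on the search order, which is the ambiguity noted above.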

       In  addition  to raw (unencoded) data, GetData supports five other encoding schemes: text encoding, bzip2
       encoding, gzip encoding, lzma encoding, and slim encoding, all discussed below.

   Text Encoding
       The Text Encoding is unique among GetData encoding schemes in that it requires no external library.  As a
       result,  all  builds  of  the  library contain full support for this encoding.  It is meant to serve as a
       reference encoding and example of the encoding framework for work on other encoding schemes.

       The Text Encoding replaces the binary data files  with  7-bit  ASCII  files  containing  a  decimal  text
       encoding  of the data, one sample per line.  All operations are supported by the Text Encoding.  The file
       extension of the Text Encoding is .txt.
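
       As an illustration of the layout described above, the following sketch converts a raw binary file
       into a text file with one decimal sample per line.  The function name raw_to_text and the default
       sample type (little-endian signed 16-bit integers) are assumptions made for the example, and the exact
       number formatting that GetData itself writes is not specified here.

```python
import struct


def raw_to_text(raw_path, text_path, sample_format="<h"):
    """Write each binary sample in raw_path as one decimal line in text_path.

    sample_format is a struct format string for a single sample; "<h"
    (little-endian signed 16-bit integer) is an assumption of this sketch.
    """
    sample_size = struct.calcsize(sample_format)
    with open(raw_path, "rb") as raw, open(text_path, "w", encoding="ascii") as text:
        while True:
            chunk = raw.read(sample_size)
            if len(chunk) < sample_size:
                break  # end of file; a trailing partial sample is ignored
            (value,) = struct.unpack(sample_format, chunk)
            text.write(f"{value}\n")
```

       In a dirfile, the output would be named after the field with the .txt extension appended.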

   BZip2 Encoding
       The BZip2 Encoding compresses raw binary files using the Burrows-Wheeler block sorting  text  compression
       algorithm  and  Huffman  coding,  as implemented in the bzip2 format.  GetData's BZip2 Encoding scheme is
       implemented through the bzip2 compression library written by Julian Seward.  GetData's BZip2 Encoding
       framework currently lacks write capabilities; as a result, the BZip2 Encoding does not support functions
       which modify binary data.

       GetData caches an  uncompressed  megabyte  of  data  at  a  time  to  speed  access  times.   A  call  to
       gd_nframes(3) requires decompression of the entire binary file to determine its uncompressed size, and
       may take some time to complete.  The file extension of the BZip2 Encoding is .bz2.
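
       The cost of determining the uncompressed size can be seen in a short sketch that uses Python's bz2
       module rather than GetData itself.  The function names are hypothetical, and the frame arithmetic
       assumes that a frame occupies sample_size * samples_per_frame bytes and that there is no frame offset.

```python
import bz2

CHUNK_SIZE = 1024 * 1024  # read one mebibyte of uncompressed data at a time


def uncompressed_size(path):
    """Count uncompressed bytes by decompressing the whole file in chunks."""
    total = 0
    with bz2.open(path, "rb") as stream:
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:
                break  # end of the compressed stream
            total += len(chunk)
    return total


def frame_count(path, sample_size, samples_per_frame):
    """Number of complete frames in the file (assumes no frame offset)."""
    return uncompressed_size(path) // (sample_size * samples_per_frame)
```

       Every byte of the file passes through the decompressor before the size is known, which is why the
       running time grows with the size of the data.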

   GZip Encoding
       The GZip Encoding compresses raw binary files using Lempel-Ziv coding (LZ77) as implemented in  the  gzip
       format.  GetData's GZip Encoding scheme is implemented through the zlib compression library written
       by Jean-loup Gailly and Mark Adler.  GetData's GZip Encoding framework currently lacks write
       capabilities; as a result, the GZip Encoding does not support functions which modify binary data.

       To speed the operation of gd_nframes(3), the GZip Encoding reads the uncompressed size of the file from
       the gzip footer, which stores that size in bytes, modulo 2^32.  As a result, using a field with an
       (uncompressed) binary file size of 4 GiB or more as the reference field will result in the
       wrong number of frames being reported.  The file extension of the GZip Encoding is .gz.
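
       The footer lookup and its wrap-around can be illustrated with a short sketch.  It relies on the gzip
       file format, in which the last four bytes of a file hold the uncompressed size as a little-endian
       32-bit integer.  The function names are hypothetical, and this is not GetData's own code.  For a file
       made of several concatenated gzip members, the footer describes only the last member.

```python
import os
import struct


def gzip_footer_size(path):
    """Return the uncompressed size, modulo 2**32, stored in the gzip footer."""
    with open(path, "rb") as stream:
        stream.seek(-4, os.SEEK_END)  # the size field is the last four bytes
        (size_mod_2_32,) = struct.unpack("<I", stream.read(4))
    return size_mod_2_32


def reported_size(true_size):
    """Size that the footer would report for a given true uncompressed size."""
    return true_size % 2**32
```

       For example, a 5 GiB data file would be reported as 1 GiB, since 5 * 2^30 modulo 2^32 is 2^30, and the
       frame count derived from it would be wrong by the same factor.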

   LZMA Encoding
       The LZMA Encoding compresses raw binary files using the  Lempel-Ziv  Markov  Chain  Algorithm  (LZMA)  as
       implemented  in  the xz container format.  GetData's LZMA Encoding scheme is implemented through the lzma
       library, part of the XZ Utils suite written by Lasse Collin, Ville Koskinen, and Igor Pavlov.   GetData's
       LZMA Encoding framework currently lacks write capabilities; as a result, the LZMA Encoding does not
       support functions which modify binary data.

       As with the BZip2 Encoding, GetData caches an uncompressed megabyte of data at a  time  to  speed  access
       times.  A call to gd_nframes(3) requires decompression of the entire binary file to determine its
       uncompressed size, and may take some time to complete.  The file extension of the LZMA Encoding is .xz
       or .lzma.

   Slim Encoding
       The  Slim  Encoding  compresses  raw binary files using the slimlib compression library written by Joseph
       Fowler.  The slimlib library was  developed  at  Princeton  University  to  compress  dirfile-like  data.
       GetData's Slim Encoding framework currently lacks write capabilities; as a result, the Slim Encoding does
       not support functions which modify binary data.  The file extension of the Slim Encoding is .slm.

AUTHOR

       This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.

SEE ALSO

       dirfile(5), dirfile-format(5), bzip2(1), gzip(1), zlib(3).