Provided by: libgetdata-doc_0.11.0-13_all

NAME

       dirfile-encoding — dirfile database encoding schemes

DESCRIPTION

       The  Dirfile Standards indicate that RAW fields defined in the database are accompanied by
       binary files containing the field data in the specified  simple  data  type.   In  certain
       situations, it may be advantageous to convert the binary files in the database into a more
       convenient form.  This is accomplished by encoding the  binary  file  into  the  alternate
       form.   A common use-case for encoding a binary file is to compress it to save disk space.
       Only data is modified by an encoding scheme.  Database metadata is never encoded.

       Support for encoding  schemes  is  optional.   An  implementation  need  not  support  any
       particular  encoding  scheme,  or  may only support certain operations with it, but should
       expect to encounter unknown encoding schemes and fail gracefully in such situations.

       Additionally, how a particular encoding is implemented is not  specified  by  the  Dirfile
       Standards,  but,  for  purposes  of  interoperability,  all  dirfile  implementations  are
       encouraged to support the encoding implementation used by the  GetData  dirfile  reference
       implementation, elaborated below.

       An encoding scheme is local to the particular format specification fragment in which it is
       indicated.  This allows a single dirfile to have  binary  files  which  are  stored  using
       multiple encodings, by having them defined in multiple fragments.

       The  rest of this manual page discusses specifics of the encoding framework implemented in
       the GetData library, and does not constitute part of the Dirfile Standards.

THE GETDATA ENCODING FRAMEWORK

       The GetData library provides an  encoding  framework  which  abstracts  binary  file  I/O,
       allowing  for generic support for a wide variety of encoding schemes.  Functions which may
       make use of the encoding framework are:

              gd_add(3),       gd_add_raw(3),        gd_add_spec(3),        gd_alter_encoding(3),
              gd_alter_endianness(3),         gd_alter_frameoffset(3),         gd_alter_entry(3),
              gd_alter_raw(3), gd_alter_spec(3), gd_flush(3),  gd_getdata(3),  gd_malter_spec(3),
              gd_move(3),   gd_nframes(3),   gd_putdata(3),  gd_raw_close(3),  gd_rename(3),  and
              gd_sync(3).

       Most of the encodings supported by GetData  are  implemented  through  external  libraries
       which handle the actual file I/O and data translation.  All such libraries are optional; a
       build of the library which omits an external library will lack support for the  associated
       encoding  scheme.  In this case, GetData will still properly identify the encoding scheme,
       but  attempts  to  use  GetData  for  file  I/O  via  the  encoding  will  fail  with  the
       GD_E_UNSUPPORTED error code.

       GetData  discovers  the  encoding  scheme of a particular RAW field by noting the filename
       extension of files associated with the  field.   Binary  files  which  form  an  unencoded
       dirfile have no file extension.  The file extensions used by the other encodings are noted
       below.  Encoding discovery proceeds by searching for files with the  known  list  of  file
       extensions (in an unspecified order) and stopping when the first successful match is made.
       Because of this, when a field has multiple data files with different, supported file
       extensions  which could legitimately be associated with it, the encoding scheme discovered
       by GetData is not well defined.
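
       The ambiguity described above can be illustrated with a short sketch.  This is not
       GetData's code: the function name and the table are hypothetical, and only the
       extensions themselves come from this page.  The archive-based encodings (zzip and
       zzslim) are left out because they do not use one file per field.

```python
from pathlib import Path

# Extensions as listed on this page; the table itself is illustrative.
KNOWN_EXTENSIONS = {
    ".bz2": "bzip2",
    ".flac": "flac",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".sie": "sie",
    ".slm": "slim",
    ".txt": "text",
}


def discover_encodings(dirfile, field):
    """Return every encoding whose data file exists for `field`."""
    dirfile = Path(dirfile)
    found = []
    if (dirfile / field).is_file():
        found.append("none")  # an unencoded file has no extension
    for ext, name in KNOWN_EXTENSIONS.items():
        if (dirfile / (field + ext)).is_file():
            found.append(name)
    return found
```

       A result with more than one entry corresponds to the ill-defined case: a search that
       stops at the first match in an unspecified order could report any of them.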

       In addition to raw (unencoded) data, GetData supports nine other  encoding  schemes:  text
       encoding,  bzip2  encoding, flac encoding, gzip encoding, lzma encoding, sie (sample-index
       encoding), slim encoding, zzip encoding, and zzslim encoding, all discussed below.

       The text encoding and the sample-index encoding are implemented by  GetData  natively  and
       need no external library.  As a result, they are always present in the library.

   Out-of-place writes
       Some  of the encodings listed below only support writing via out-of-place writes; that is,
       raw files are written in a temporary location and only moved into place when closed.  As a
       result,  writing  to these encodings requires making a copy of the whole binary data file.
       A further side effect is that a third party trying to concurrently read a Dirfile which
       is being written to using one of these encodings will usually fail.

       Within  GetData, reading from a field so encoded after writing to it will cause writing to
       the temporary file to be finished and then the file  moved  into  place  before  the  read
       occurs,  which may take some time to do.  Encodings which perform out-of-place writes are:
       bzip2, flac, gzip, and lzma.
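
       The general pattern of an out-of-place write can be sketched as follows.  This is a
       minimal illustration using gzip from the Python standard library; the function name is
       hypothetical, and where GetData places or how it names its temporary files is not
       described on this page, so the choices below are assumptions.

```python
import gzip
import os
import tempfile


def write_out_of_place(path, payload):
    """Compress `payload` to a temporary file, then move it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file beside the target so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as raw, \
                gzip.GzipFile(fileobj=raw, mode="wb") as gz:
            gz.write(payload)
        # Until this call, readers of `path` see the old file (or none).
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

       Because the whole file is rewritten before the move, appending even a single sample
       costs a full copy, which is the overhead noted above.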

   BZip2 Encoding
       The BZip2 Encoding compresses raw binary files using the Burrows-Wheeler block
       sorting text compression algorithm and Huffman coding, as implemented in the bzip2 format.
       GetData's BZip2 Encoding scheme is  implemented  through  the  bzip2  compression  library
       written by Julian Seward.  All operations are supported by the BZip2 Encoding, but writing
       occurs out-of-place.  See the Out-of-place writes section above for details.

       GetData caches an uncompressed megabyte of data at a time to speed access times.   A  call
       to  gd_nframes(3)  requires  decompression  of  the  entire  binary  file to determine its
       uncompressed size, and may take some time to complete.  The file extension  of  the  BZip2
       Encoding is .bz2.
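
       Why gd_nframes(3) is slow for this encoding can be seen in a small sketch: without a
       stored size, the only way to learn the uncompressed length is to decompress
       everything.  The function names below are hypothetical, the 1 MiB chunk merely mirrors
       the cache size mentioned above, and any /FRAMEOFFSET is ignored.

```python
import bz2


def uncompressed_size(path, chunk_size=1 << 20):
    """Count the uncompressed bytes in a .bz2 file by decompressing it."""
    total = 0
    with bz2.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)  # at most 1 MiB held at a time
            if not chunk:
                break
            total += len(chunk)
    return total


def frame_count(path, sample_size, samples_per_frame):
    """Whole frames stored in the file (a partial last frame is dropped)."""
    return uncompressed_size(path) // (sample_size * samples_per_frame)
```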

   FLAC Encoding
       The  FLAC  Encoding  compresses  raw  binary  files  using  the Free Lossless Audio Codec.
       GetData's FLAC Encoding scheme is implemented through the libFLAC reference implementation
       developed  by  Josh  Coalson and the Xiph.Org Foundation.  All operations are supported by
       the FLAC Encoding, but writing occurs out-of-place.  See the Out-of-place  writes  section
       above for details.

       The FLAC format only permits samples up to 32 bits, but the libFLAC reference codec can
       only handle samples up to 24 bits.  GetData gets around this by slicing data that is wider
       than 16 bits into multiple 16-bit channels (2, 4, or 8, depending on width).  For
       big-endian data, the most significant 16 bits are in channel 0, the second 16 bits in
       channel 1, and so on.  For little-endian data, this is reversed, with the least
       significant word in channel 0.
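
       The channel layout can be sketched with plain integer arithmetic.  The function below
       is a hypothetical illustration that treats the sample as an unsigned bit pattern; how
       GetData presents each 16-bit word to libFLAC (for instance, its signedness) is not
       covered here.

```python
def split_into_channels(value, width_bits, big_endian):
    """Split a `width_bits`-wide unsigned integer into 16-bit words.

    The returned list is indexed by channel number.
    """
    if width_bits not in (32, 64, 128):
        raise ValueError("only 32-, 64- and 128-bit samples are sliced")
    count = width_bits // 16  # 2, 4 or 8 channels
    # words[0] holds the least significant 16 bits.
    words = [(value >> (16 * i)) & 0xFFFF for i in range(count)]
    # Big-endian data puts the most significant word in channel 0.
    return words[::-1] if big_endian else words
```

       For the 32-bit value 0x12345678, big-endian data yields channels 0x1234 and 0x5678,
       while little-endian data yields 0x5678 and 0x1234.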

       The  sample rate specified in the FLAC header is ignored and may be any valid value.  FLAC
       files written by GetData use a sample rate of 1  Hz.   The  file  extension  of  the  FLAC
       Encoding is .flac.  The Ogg container format is not supported.

   GZip Encoding
       The  GZip  Encoding  compresses  raw  binary  files  using  Lempel-Ziv  coding  (LZ77)  as
       implemented in the gzip format.  GetData's GZip Encoding scheme is implemented through the
       zlib  compression  library  written by Jean-loup Gailly and Mark Adler. All operations are
       supported by the GZip Encoding, but writing occurs  out-of-place.   See  the  Out-of-place
       writes section above for details.

       To speed the operation of gd_nframes(3), the GZip Encoding takes the uncompressed size of
       the file from the gzip footer, which contains the file's uncompressed size in bytes, modulo
       2**32.   As  a  result,  using a field with an (uncompressed) binary file size larger than
       4 GiB as the reference field will result in the wrong number  of  frames  being  reported.
       The file extension of the GZip Encoding is .gz.
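
       The footer lookup can be reproduced in a few lines.  In the gzip format, the last four
       bytes of a member hold the uncompressed size as a little-endian 32-bit unsigned
       integer.  The function name is hypothetical, and the sketch assumes a file with a
       single gzip member.

```python
import os
import struct


def gzip_isize(path):
    """Return the size recorded in the gzip footer (modulo 2**32)."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)  # the last four bytes of the file
        (isize,) = struct.unpack("<I", f.read(4))
    return isize
```

       A file whose uncompressed size is 2**32 + 10 bytes would therefore report a size of
       10 bytes, which is the source of the wrong frame count described above.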

   LZMA Encoding
       The LZMA Encoding compresses raw binary files using the Lempel-Ziv Markov Chain
       Algorithm (LZMA) as implemented in the  xz  container  format.   GetData's  LZMA  Encoding
       scheme  is  implemented  through  the  lzma library, part of the XZ Utils suite written by
       Lasse Collin, Ville Koskinen, and Igor Pavlov.  All operations are supported by  the  LZMA
       Encoding,  but writing occurs out-of-place.  See the Out-of-place writes section above for
       details.  Writing is supported only for the .xz container format, and not for the obsolete
       .lzma format, which can still be read.

       GetData  caches  an uncompressed megabyte of data at a time to speed access times.  A call
       to gd_nframes(3) requires decompression  of  the  entire  binary  file  to  determine  its
       uncompressed  size,  and  may  take some time to complete.  The file extension of the LZMA
       Encoding is .xz or .lzma.

   Sample-Index Encoding
       The Sample-Index Encoding (SIE) compresses raw binary data by replacing runs  of  repeated
       data,  similar  to  run-length encoding.  SIE files contain binary records consisting of a
       64-bit sample number followed by a datum (the size and format of which  is  determined  by
       the  RAW  field's data type in the format metadata).  The sample number indicates the last
       sample of the field which has the specified value.  The first sample with the value is the
       sample  immediately  following the data in the previous record, or sample number zero, for
       the first record.  Sample numbers are  relative  to  any  /FRAMEOFFSET  specified  in  the
       Dirfile  metadata.   All  operations are supported by the Sample-Index Encoding.  The file
       extension of the Sample-Index Encoding is .sie.
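
       The record layout described above can be expanded back into a flat series with a
       short decoder.  This is a sketch under stated assumptions, not GetData's
       implementation: the function name is hypothetical, the sample number is read as
       unsigned, and the byte order and datum type (given as a struct format code) must be
       supplied by the caller to match the dirfile's metadata.

```python
import struct


def decode_sie(blob, datum_format="d", byte_order="<"):
    """Expand SIE records into a flat list of samples.

    `datum_format` is a struct code for the RAW field's data type
    ("d" is a 64-bit float); `byte_order` is "<" or ">".
    """
    record = struct.Struct(byte_order + "Q" + datum_format)
    samples = []
    for last_sample, value in record.iter_unpack(blob):
        # A run starts right after the previous record's last sample
        # (sample 0 for the first record) and ends at `last_sample`.
        run_length = last_sample + 1 - len(samples)
        samples.extend([value] * run_length)
    return samples
```

       Two records, (2, 1.5) and (4, -3.0), decode to five samples: three copies of 1.5
       followed by two copies of -3.0.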

   Slim Encoding
       The Slim Encoding reads compressed raw binary files using the slimlib compression  library
       written  by  Joseph  Fowler.  The slimlib library was developed at Princeton University to
       compress dirfile-like data.  GetData's  Slim  Encoding  framework  currently  lacks  write
       capabilities; as a result, the Slim Encoding does not support functions which modify binary
       files.  The file extension of the Slim Encoding is .slm.

       Using the Slim Encoding with GetData may result  in  unexpected,  but  manageable,  memory
       usage.  See the gd_getdata(3) manual page for details.

   Text Encoding
       The  Text  Encoding  replaces  the  binary  data files with 7-bit ASCII files containing a
       decimal text encoding of the data, one sample per line.  All operations are  supported  by
       the Text Encoding.  The file extension of the Text Encoding is .txt.
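
       Reading such a file needs nothing beyond line-by-line parsing, as the following sketch
       suggests.  The function name is hypothetical, and the exact text formatting GetData
       uses for non-integer types is not described on this page, so the `convert` argument
       is left to the caller.

```python
def read_text_samples(path, convert=int):
    """Read one decimal sample per line from a text-encoded data file."""
    with open(path, "r", encoding="ascii") as f:
        # Blank lines are skipped; each remaining line is one sample.
        return [convert(line) for line in f if line.strip()]
```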

   ZZip Encoding
       The  ZZip  Encoding  reads  compressed  raw  binary  files  using the DEFLATE algorithm as
       implemented in the PKWARE ZIP archive container format.  GetData's ZZip Encoding scheme is
       implemented  through  the zzip library written by Tomi Ollila and Guido Draheim.  The ZZip
       Encoding framework currently lacks write capabilities; as a result, the ZZip Encoding
       does not support functions which modify binary data.

       Unlike  most encoding schemes, the ZZip encoding merges all binary data files defined in a
       given fragment into a single ZIP archive.  The name of this archive is raw.zip by default,
       but  a  different  name  may  be  specified  using  the  second parameter to the /ENCODING
       directive.  For example,

              /ENCODING zzip archive

       indicates that the ZIP archive is called archive.zip.  The  file  extension  of  the  ZZip
       Encoding is .zip.

   ZZSlim Encoding
       The  ZZSlim  Encoding  is  a  convolution  of the Slim Encoding and the ZZip Encoding.  To
       create ZZSlim Encoded files, first the raw data are compressed using the slim library, and
       then  these  slim-compressed files are archived (and compressed again) into a ZIP archive.
       As with the ZZip Encoding, the ZIP archive is raw.zip by default, but a different name may
       be specified with the /ENCODING directive.

       Notably,  since  the  archives have the same name as ZZip Encoded data, automatic encoding
       detection on ZZSlim Encoded data always fails: they are incorrectly identified  as  simply
       ZZip  Encoded.   As  a  result,  an  /ENCODING  directive  in  the  format  file or else a
       GD_ZZSLIM_ENCODED flag passed to gd_open(3) is required to read ZZSlim encoded data.   The
       file extension of the ZZSlim Encoding is .zip.

       Using  the  ZZSlim  Encoding with GetData may result in unexpected, but manageable, memory
       usage.  See the gd_getdata(3) manual page for details.

AUTHOR

       This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.

SEE ALSO

       bzip2(1), flac(1), gzip(1), xz(1), zlib(3), dirfile(5), dirfile-format(5)