Ubuntu Manpage: dirfile-encoding — dirfile database encoding schemes

Provided by: libgetdata-doc_0.9.0-2.2_all

NAME

       dirfile-encoding — dirfile database encoding schemes

DESCRIPTION

The Dirfile Standards indicate that RAW fields defined in the database are accompanied by
binary files containing the field data in the specified simple data type. In certain
situations, it may be advantageous to convert the binary files in the database into a more
convenient form. This is accomplished by encoding the binary file into the alternate
form. A common use-case for encoding a binary file is to compress it to save disk space.
Only data is modified by an encoding scheme. Database metadata is never encoded.

Support for encoding schemes is optional. An implementation need not support any
particular encoding scheme, or may only support certain operations with it, but should
expect to encounter unknown encoding schemes and fail gracefully in such situations.

Additionally, how a particular encoding is implemented is not specified by the Dirfile
Standards, but, for purposes of interoperability, all dirfile implementations are
encouraged to support the encoding implementation used by the GetData dirfile reference
implementation, elaborated below.

An encoding scheme is local to the particular format specification fragment in which it is
indicated. This allows a single dirfile to have binary files which are stored using
multiple encodings, by having them defined in multiple fragments.

The rest of this manual page discusses specifics of the encoding framework implemented in
the GetData library, and does not constitute part of the Dirfile Standards.

THE GETDATA ENCODING FRAMEWORK

The GetData library provides an encoding framework which abstracts binary file I/O,
allowing for generic support for a wide variety of encoding schemes. Functions which may
make use of the encoding framework are:

gd_add(3), gd_add_raw(3), gd_add_spec(3), gd_alter_encoding(3),
gd_alter_endianness(3), gd_alter_frameoffset(3), gd_alter_entry(3),
gd_alter_raw(3), gd_alter_spec(3), gd_flush(3), gd_getdata(3), gd_malter_spec(3),
gd_move(3), gd_nframes(3), gd_putdata(3), gd_raw_close(3), gd_rename(3), and
gd_sync(3).

Most of the encodings supported by GetData are implemented through external libraries
which handle the actual file I/O and data translation. All such libraries are optional; a
build of the library which omits an external library will lack support for the associated
encoding scheme. In this case, GetData will still properly identify the encoding scheme,
but attempts to use GetData for file I/O via the encoding will fail with the
GD_E_UNSUPPORTED error code.

GetData discovers the encoding scheme of a particular RAW field by noting the filename
extension of files associated with the field. Binary files which form an unencoded
dirfile have no file extension. The file extension used by the other encodings are noted
below. Encoding discovery proceeds by searching for files with the known list of file
extensions (in an unspecified order) and stopping when the first successful match is made.
Because of this, when the a field has multiple data files with different, supported file
extensions which could legitimately be associated with it, the encoding scheme discovered
by GetData is not well defined.

In addition to raw (unencoded) data, GetData supports nine other encoding schemes: text
encoding, bzip2 encoding, flac encoding, gzip encoding, lzma encoding, sie (sample-index
encoding), slim encoding, zzip encoding, and zzslim encoding, all discussed below.

The text encoding and the sample-index encoding are implemented by GetData natively and
need no external library. As a result, they are always present in the library.

Out-of-place writes
Some of the encodings listed below only support writing via out-of-place writes; that is,
raw files are written in a temporary location and only moved into place when closed. As a
result, writing to these encodings requires making a copy of the whole binary data file.
A further side effect of this is that a third-party trying to concurrently read a Dirfile
which is being written to using one of these encodings usually doesn't work.

Within GetData, reading from a field so encoded after writing to it will cause writing to
the temporary file to be finished and then the file moved into place before the read
occurs, which may take some time to do. Encodings which perform out-of-place writes are:
bzip2, flac, gzip, and lzma.

BZip2 Encoding
The BZip2 Encoding reads compressed raw binary files using the Burrows-Wheeler block
sorting text compression algorithm and Huffman coding, as implemented in the bzip2 format.
GetData's BZip2 Encoding scheme is implemented through the bzip2 compression library
written by Julian Seward. All operations are supported by the BZip2 Encoding, but writing
occurs out-of-place. See the Out-of-place writes section above for details.

GetData caches an uncompressed megabyte of data at a time to speed access times. A call
to gd_nframes(3) requires decompression of the entire binary file to determine its
uncompressed size, and may take some time to complete. The file extension of the BZip2
Encoding is .bz2.

FLAC Encoding
The FLAC Encoding compresses raw binary files using the Free Lossless Audio Codec.
GetData's FLAC Encoding scheme is implemented through the libFLAC reference implementation
developed by Josh Coalson and the Xiph.Org Foundation. All operations are supported by
the FLAC Encdoing, but writing occurs out-of-place. See the Out-of-place writes section
above for details.

The FLAC format only permits samples up to 32-bits, but the libFLAC reference codec can
only handle samples up to 24-bits. GetData gets around this by slicing data that is wider
than 16-bits into multiple channels (2, 4, or 8, depending on width). For big-ended data,
the most-significant 16-bits are in channel 0, the second 16-bits in channel 1, &c. For
little-ended data, this is reversed, with the least significant word in channel 0.

The sample rate specified in the FLAC header is ignored and may be any valid value. FLAC
files written by GetData use a sample rate of 1 Hz. The file extension of the FLAC
Encoding is .flac. The Ogg container format is not supported.

GZip Encoding
The GZip Encoding compresses raw binary files using Lempel-Ziv coding (LZ77) as
implemented in the gzip format. GetData's GZip Encoding scheme is implemented through the
zlib compression library written by Jean-loup Gailly and Mark Adler. All operations are
supported by the GZip Encoding, but writing occurs out-of-place. See the Out-of-place
writes section above for details.

To speed the operation of gd_nframes(3), the GZip Encoding takes the uncompressed size of
the file the gzip footer, which contains the file's uncompressed size in bytes, modulo
2**32. As a result, using a field with an (uncompressed) binary file size larger than
4 GiB as the reference field will result in the wrong number of frames being reported.
The file extension of the GZip Encoding is .gz.

LZMA Encoding
The LZMA Encoding reads compressed raw binary files using the Lempel-Ziv Markov Chain
Algorithm (LZMA) as implemented in the xz container format. GetData's LZMA Encoding
scheme is implemented through the lzma library, part of the XZ Utils suite written by
Lasse Collin, Ville Koskinen, and Igor Pavlov. All operations are supported by the LZMA
Encoding, but writing occurs out-of-place. See the Out-of-place writes section above for
details. Writing is supported only for the .xz container format, and not for the obsolete
.lzma format, which can still be read.

Sample-Index Encoding
The Sample-Index Encoding (SIE) compresses raw binary data by replacing runs of repeated
data, similar to run-length encoding. SIE files contain binary records consisting of a
64-bit sample number followed by a datum (the size and format of which is determined by
the RAW field's data type in the format metadata). The sample number indicates the last
sample of the field which has the specified value. The first sample with the value is the
sample immediately following the data in the previous record, or sample number zero, for
the first record. Sample numbers are relative to any /FRAMEOFFSET specified in the
Dirfile metadata. All operations are supported by the Sample-Index Encoding. The file
extension of the Sample-Index Encoding is .sie.

Slim Encoding
The Slim Encoding reads compressed raw binary files using the slimlib compression library
written by Joseph Fowler. The slimlib library was developed at Princeton University to
compress dirfile-like data. GetData's Slim Encoding framework currently lacks write
capabilities; as a result, the Slim Encoding does not support function which modify binary
files. The file extension of the Slim Encoding is .slm.

Using the Slim Encoding with GetData may result in unexpected, but manageable, memory
usage. See the gd_getdata(3) manual page for details.

Text Encoding
The Text Encoding replaces the binary data files with 7-bit ASCII files containing a
decimal text encoding of the data, one sample per line. All operations are supported by
the Text Encoding. The file extension of the Text Encoding is .txt.

ZZip Encoding
The ZZip Encoding reads compressed raw binary files using the DEFLATE algorithm as
implemented in the PKWARE ZIP archive container format. GetData's ZZip Encoding scheme is
implemented through the zzip library written by Tomi Ollila and Guido Draheim. The ZZip
Encoding framework currently lacks write capabilities; as a result the ZZip Encoding does
not support functions which modify binary data.

Unlike most encoding schemes, the ZZip encoding merges all binary data files defined in a
given fragment into a single ZIP archive. The name of this archive is raw.zip by default,
but a different name may be specified using the second parameter to the /ENCODING
directive. For example,

/ENCODING zzip archive

indicates that the ZIP archive is called archive.zip. The file extension of the ZZip
Encoding is .zip.

ZZSlim Encoding
The ZZSlim Encoding is a convolution of the Slim Encoding and the ZZip Encoding. To
create ZZSlim Encoded files, first the raw data are compressed using the slim library, and
then these slim-compressed files are archived (and compressed again) into a ZIP archive.
As with the ZZip Encoding, the ZIP archive is raw.zip by default, but a different name may
be specified with the /ENCODING directive.

Notably, since the archives have the same name as ZZip Encoded data, automatic encoding
detection on ZZSlim Encoded data always fails: they are incorrectly identified as simply
ZZip Encoded. As a result, an /ENCODING directive in the format file or else a
GD_ZZSLIM_ENCODED flag passed to gd_open(3) is required to read ZZSlim encoded data. The
file extension of the ZZSlim Encoding is .zip.

Using the ZZSlim Encoding with GetData may result in unexpected, but manageable, memory
usage. See the gd_getdata(3) manual page for details.

AUTHOR

       This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.

NAME

DESCRIPTION

THE GETDATA ENCODING FRAMEWORK

AUTHOR

SEE ALSO