Ubuntu Manpage: dirfile-encoding — dirfile database encoding schemes

Provided by: libgetdata-tools_0.7.3-6ubuntu1_amd64

NAME

       dirfile-encoding — dirfile database encoding schemes

DESCRIPTION

The Dirfile Standards indicate that RAW fields defined in the database are accompanied by
binary files containing the field data in the specified simple data type. In certain
situations, it may be advantageous to convert the binary files in the database into a more
convenient form. This is accomplished by encoding the binary file into the alternate
form. A common use-case for encoding a binary file is to compress it to save disk space.
Only data is modified by an encoding scheme. Database metadata is unaffected.

Support for encoding schemes is optional. An implementation need not support any
particular encoding scheme, or may only support certain operations with it, but should
expect to encounter unknown encoding schemes and fail gracefully in such situations.

Additionally, how a particular encoding is implemented is not specified by the Dirfile
Standards, but, for purposes of interoperability, all dirfile implementations are
encouraged to support the encoding implementation used by the GetData dirfile reference
implementation, elaborated below.

An encoding scheme is local to the particular format specification fragment in which it is
indicated. This allows a single dirfile to have binary files which are stored using
multiple encodings, by having them defined in multiple fragments.

The rest of this manual page discusses specifics of the encoding framework implemented in
the GetData library, and does not constitute part of the Dirfile Standards.

THE GETDATA ENCODING FRAMEWORK

The GetData library provides an encoding framework which abstracts binary file I/O,
allowing for generic support for a wide variety of encoding schemes. Functions which may
make use of the encoding framework are:

gd_add(3), gd_add_raw(3), gd_add_spec(3),
gd_alter_encoding(3), gd_alter_endianness(3),
gd_alter_frameoffset(3), gd_alter_entry(3),
gd_alter_raw(3), gd_alter_spec(3), gd_getdata(3),
gd_move(3), gd_nframes(3), gd_putdata(3), and gd_rename(3).

Most of the encodings supported by GetData are implemented through external libraries
which handle the actual file I/O and data translation. All such libraries are optional; a
build of the library which omits an external library will lack support for the associated
encoding scheme. In this case, GetData will still properly identify the encoding scheme,
but attempts to use GetData for file I/O via the encoding will fail with the
GD_E_UNSUPPORTED error code.

GetData discovers the encoding scheme of a particular RAW field by noting the filename
extension of files associated with the field. Binary files which form an unencoded
dirfile have no file extension. The file extensions used by the other encodings are noted
below. Encoding discovery proceeds by searching for files with the known list of file
extensions (in an unspecified order) and stopping when the first successful match is made.
Because of this, when a field has multiple data files with different, supported file
extensions which could legitimately be associated with it, the encoding scheme discovered
by GetData is not well defined.
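
The discovery rule above can be sketched in a few lines of Python. This is an illustration
only, not GetData's implementation: the function names and scheme labels are hypothetical,
and the sketch assumes that a field's data file is named after the field itself, with the
extension appended.

```python
import os

# Extensions named in this manual page, mapped to scheme labels.
# The labels are illustrative, not GetData identifiers.
KNOWN_EXTENSIONS = {
    ".txt": "text",
    ".bz2": "bzip2",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".slm": "slim",
}


def candidate_encodings(dirfile_path, field_name):
    """Return every scheme for which a data file of the field exists."""
    found = []
    # Assumption: an unencoded field is stored under the bare field name.
    if os.path.isfile(os.path.join(dirfile_path, field_name)):
        found.append("none")
    for ext, scheme in KNOWN_EXTENSIONS.items():
        if os.path.isfile(os.path.join(dirfile_path, field_name + ext)):
            found.append(scheme)
    return found


def discover_encoding(dirfile_path, field_name):
    """Return the first match, or None when no data file is found."""
    matches = candidate_encodings(dirfile_path, field_name)
    return matches[0] if matches else None
```

When candidate_encodings() returns more than one entry, the value returned by
discover_encoding() depends only on search order, which is the ambiguity described above.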

In addition to raw (unencoded) data, GetData supports five other encoding schemes: text
encoding, bzip2 encoding, gzip encoding, lzma encoding, and slim encoding, all discussed
below.

Text Encoding
The Text Encoding is unique among GetData encoding schemes in that it requires no external
library. As a result, all builds of the library contain full support for this encoding.
It is meant to serve as a reference encoding and example of the encoding framework for
work on other encoding schemes.

The Text Encoding replaces the binary data files with 7-bit ASCII files containing a
decimal text encoding of the data, one sample per line. All operations are supported by
the Text Encoding. The file extension of the Text Encoding is .txt.
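
As a rough illustration of the one-sample-per-line layout, the following Python sketch
converts between a binary buffer and decimal text. It is not GetData code: the function
names are hypothetical, the little-endian 16-bit integer type is an assumed example, and
only integer sample types are handled.

```python
import struct


def binary_to_text(raw_bytes, fmt="<h"):
    """Render fixed-width binary samples as decimal text, one per line.

    fmt is a struct format for a single sample; "<h" (little-endian
    int16) is an assumed example type.
    """
    size = struct.calcsize(fmt)
    if len(raw_bytes) % size:
        raise ValueError("data length is not a whole number of samples")
    return "".join(
        str(value) + "\n" for (value,) in struct.iter_unpack(fmt, raw_bytes)
    )


def text_to_binary(text, fmt="<h"):
    """Inverse of binary_to_text for integer sample types."""
    return b"".join(struct.pack(fmt, int(line)) for line in text.splitlines())
```

For example, the three int16 samples 1, -2 and 300 become the three lines "1", "-2" and
"300", and text_to_binary() restores the original six bytes.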

BZip2 Encoding
The BZip2 Encoding compresses raw binary files using the Burrows-Wheeler block sorting
text compression algorithm and Huffman coding, as implemented in the bzip2 format.
GetData's BZip2 Encoding scheme is implemented through the bzip2 compression library
written by Julian Seward. GetData's BZip2 Encoding framework currently lacks write
capabilities; as a result the BZip2 Encoding does not support functions which modify
binary data.

GetData caches an uncompressed megabyte of data at a time to speed access times. A call
to gd_nframes(3) requires decompression of the entire binary file to determine its
uncompressed size, and may take some time to complete. The file extension of the BZip2
Encoding is .bz2.
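
The cost of counting frames can be seen in a short Python sketch that uses the standard
bz2 module. It is an illustration, not GetData's code: the function names are
hypothetical, and the sample size and samples-per-frame values are supplied by the caller.
Every byte is decompressed, one mebibyte at a time, only to be counted.

```python
import bz2


def uncompressed_size(path, chunk_size=1 << 20):
    """Count uncompressed bytes by decompressing the whole file."""
    total = 0
    with bz2.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def frame_count(path, sample_size, samples_per_frame):
    """Frames = uncompressed bytes // (bytes per sample * samples per frame)."""
    return uncompressed_size(path) // (sample_size * samples_per_frame)
```

The running time grows with the uncompressed size of the file, which is why the operation
may be slow on large fields.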

GZip Encoding
The GZip Encoding compresses raw binary files using Lempel-Ziv coding (LZ77) as
implemented in the gzip format. GetData's GZip Encoding scheme is implemented through the
zlib compression library written by Jean-loup Gailly and Mark Adler. GetData's GZip
Encoding framework currently lacks write capabilities; as a result the GZip Encoding does
not support functions which modify binary data.

To speed the operation of gd_nframes(3), the GZip Encoding takes the uncompressed size of
the file from the gzip footer, which records that size in bytes, modulo
2^32. As a result, using a field with an (uncompressed) binary file size larger than
4 GiB as the reference field will result in the wrong number of frames being reported.
The file extension of the GZip Encoding is .gz.
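
The footer shortcut and its wrap-around can be sketched in Python. The gzip format
(RFC 1952) ends each member with a four-byte little-endian ISIZE field holding the
uncompressed length modulo 2^32. The function names below are hypothetical, the sketch is
not GetData's implementation, and it assumes a single-member gzip file.

```python
import struct


def gzip_footer_size(path):
    """Return ISIZE: the uncompressed length modulo 2**32."""
    with open(path, "rb") as f:
        f.seek(-4, 2)  # the last four bytes of the file
        (isize,) = struct.unpack("<I", f.read(4))
    return isize


def reported_frames(true_size, bytes_per_frame):
    """Frame count a footer-based reader would compute for true_size bytes."""
    return (true_size % 2**32) // bytes_per_frame
```

For a 5 GiB field at 8 bytes per frame, the footer holds 1 GiB, so
reported_frames(5 * 2**30, 8) gives 134217728 frames instead of the true 671088640.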

LZMA Encoding
The LZMA Encoding compresses raw binary files using the Lempel-Ziv Markov Chain Algorithm
(LZMA) as implemented in the xz container format. GetData's LZMA Encoding scheme is
implemented through the lzma library, part of the XZ Utils suite written by Lasse Collin,
Ville Koskinen, and Igor Pavlov. GetData's LZMA Encoding framework currently lacks write
capabilities; as a result the LZMA Encoding does not support functions which modify binary
data.

As with the BZip2 Encoding, GetData caches an uncompressed megabyte of data at a time to
speed access times. A call to gd_nframes(3) requires decompression of the entire binary
file to determine its uncompressed size, and may take some time to complete. The file
extension of the LZMA Encoding is .xz or .lzma.
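
Reading a range of samples from a compressed stream can be sketched with Python's
standard lzma module, which detects both the .xz and the legacy .lzma container when
reading. This is a simplified illustration, not GetData's caching code: read_samples() is
a hypothetical helper, and it decompresses from the start of the file in chunks of at most
one mebibyte until the requested range has been read.

```python
import lzma

CHUNK = 1 << 20  # at most one mebibyte of uncompressed data per read


def read_samples(path, first_sample, count, sample_size):
    """Return the bytes of `count` samples starting at `first_sample`."""
    skip = first_sample * sample_size
    want = count * sample_size
    out = bytearray()
    # lzma.open auto-detects the .xz and .lzma formats when reading.
    with lzma.open(path, "rb") as stream:
        # Discard everything before the first requested sample.
        while skip > 0:
            chunk = stream.read(min(CHUNK, skip))
            if not chunk:
                return bytes(out)  # start lies past the end of the data
            skip -= len(chunk)
        # Collect the requested range.
        while len(out) < want:
            chunk = stream.read(min(CHUNK, want - len(out)))
            if not chunk:
                break  # file ended before the full range was read
            out.extend(chunk)
    return bytes(out)
```

Samples near the end of a large file cost more to reach than samples near the start,
because everything before them must be decompressed first.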

Slim Encoding
The Slim Encoding compresses raw binary files using the slimlib compression library
written by Joseph Fowler. The slimlib library was developed at Princeton University to
compress dirfile-like data. GetData's Slim Encoding framework currently lacks write
capabilities; as a result, the Slim Encoding does not support functions which modify binary
files. The file extension of the Slim Encoding is .slm.

AUTHOR

       This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.
