Ubuntu Manpage: dirfile-encoding — dirfile database encoding schemes

Provided by: libgetdata-tools_0.7.3-6ubuntu1_amd64

NAME

       dirfile-encoding — dirfile database encoding schemes

DESCRIPTION

The Dirfile Standards indicate that RAW fields defined in the database are accompanied by binary files
containing the field data in the specified simple data type. In certain situations, it may be
advantageous to convert the binary files in the database into a more convenient form. This is
accomplished by encoding the binary file into the alternate form. A common use-case for encoding a
binary file is to compress it to save disk space. Only data is modified by an encoding scheme. Database
metadata is unaffected.

Support for encoding schemes is optional. An implementation need not support any particular encoding
scheme, or may only support certain operations with it, but should expect to encounter unknown encoding
schemes and fail gracefully in such situations.

Additionally, how a particular encoding is implemented is not specified by the Dirfile Standards, but,
for purposes of interoperability, all dirfile implementations are encouraged to support the encoding
implementation used by the GetData dirfile reference implementation, elaborated below.

An encoding scheme is local to the particular format specification fragment in which it is indicated.
This allows a single dirfile to have binary files which are stored using multiple encodings, by having
them defined in multiple fragments.

The rest of this manual page discusses specifics of the encoding framework implemented in the GetData
library, and does not constitute part of the Dirfile Standards.

THE GETDATA ENCODING FRAMEWORK

The GetData library provides an encoding framework which abstracts binary file I/O, allowing for generic
support for a wide variety of encoding schemes. Functions which may make use of the encoding framework
are:

gd_add(3), gd_add_raw(3), gd_add_spec(3), gd_alter_encoding(3), gd_alter_endianness(3),
gd_alter_frameoffset(3), gd_alter_entry(3), gd_alter_raw(3), gd_alter_spec(3), gd_getdata(3),
gd_move(3), gd_nframes(3), gd_putdata(3), and gd_rename(3).

Most of the encodings supported by GetData are implemented through external libraries which handle the
actual file I/O and data translation. All such libraries are optional; a build of the library which
omits an external library will lack support for the associated encoding scheme. In this case, GetData
will still properly identify the encoding scheme, but attempts to use GetData for file I/O via the
encoding will fail with the GD_E_UNSUPPORTED error code.

GetData discovers the encoding scheme of a particular RAW field by noting the filename extension of files
associated with the field. Binary files which form an unencoded dirfile have no file extension. The
file extensions used by the other encodings are noted below. Encoding discovery proceeds by searching for
files with the known list of file extensions (in an unspecified order) and stopping when the first
successful match is made. Because of this, when a field has multiple data files with different,
supported file extensions which could legitimately be associated with it, the encoding scheme discovered
by GetData is not well defined.
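
The discovery procedure described above can be sketched in Python. This is only an illustration, not
GetData's implementation: the helper name discover_encoding, the scheme labels, and the search order are
assumptions; only the file extensions are taken from this manual page.

```python
import os
import tempfile

# Extensions named in this manual page. The labels and the order are
# illustrative assumptions; the real search order is unspecified.
KNOWN_EXTENSIONS = {
    ".txt": "text",
    ".bz2": "bzip2",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".slm": "slim",
    "": "unencoded",  # raw data files carry no extension
}


def discover_encoding(dirfile_path, field_name):
    """Return the scheme of the first data file found, or None (hypothetical helper)."""
    for extension, scheme in KNOWN_EXTENSIONS.items():
        if os.path.isfile(os.path.join(dirfile_path, field_name + extension)):
            return scheme  # stop at the first successful match
    return None


# Demonstration on a throwaway directory holding a single gzip-encoded field.
with tempfile.TemporaryDirectory() as dirfile:
    open(os.path.join(dirfile, "data.gz"), "wb").close()
    found = discover_encoding(dirfile, "data")
    missing = discover_encoding(dirfile, "other")
```

If both data.gz and data.bz2 were present, the result of this sketch would depend solely on the order of
the table, which mirrors why the discovered scheme is not well defined in that case.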

In addition to raw (unencoded) data, GetData supports five other encoding schemes: text encoding, bzip2
encoding, gzip encoding, lzma encoding, and slim encoding, all discussed below.

Text Encoding
The Text Encoding is unique among GetData encoding schemes in that it requires no external library. As a
result, all builds of the library contain full support for this encoding. It is meant to serve as a
reference encoding and example of the encoding framework for work on other encoding schemes.

The Text Encoding replaces the binary data files with 7-bit ASCII files containing a decimal text
encoding of the data, one sample per line. All operations are supported by the Text Encoding. The file
extension of the Text Encoding is .txt.
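
As a rough illustration of the idea, the following Python sketch converts raw samples to decimal text, one
sample per line, and back again. It assumes 16-bit signed little-endian integer samples; the helper names
are hypothetical, and the exact number formatting written by GetData is not specified on this page.

```python
import struct


def raw_to_text(raw_bytes, sample_format="<h"):
    """Render fixed-size binary integer samples as decimal text, one per line."""
    size = struct.calcsize(sample_format)
    if len(raw_bytes) % size:
        raise ValueError("data length is not a whole number of samples")
    samples = (value for (value,) in struct.iter_unpack(sample_format, raw_bytes))
    return "".join(f"{value}\n" for value in samples)


def text_to_raw(text, sample_format="<h"):
    """Inverse of raw_to_text for integer sample types."""
    return b"".join(
        struct.pack(sample_format, int(line)) for line in text.splitlines() if line.strip()
    )


raw = struct.pack("<3h", 1, -2, 300)   # three INT16 samples
text = raw_to_text(raw)                # "1\n-2\n300\n"
round_trip = text_to_raw(text)
```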

BZip2 Encoding
The BZip2 Encoding compresses raw binary files using the Burrows-Wheeler block sorting text compression
algorithm and Huffman coding, as implemented in the bzip2 format. GetData's BZip2 Encoding scheme is
implemented through the bzip2 compression library written by Julian Seward. GetData's BZip2 Encoding
framework currently lacks write capabilities; as a result the BZip2 Encoding does not support functions
which modify binary data.

GetData caches an uncompressed megabyte of data at a time to speed access times. A call to
gd_nframes(3) requires decompression of the entire binary file to determine its uncompressed size, and
may take some time to complete. The file extension of the BZip2 Encoding is .bz2.
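
The cost of determining the uncompressed size can be seen in a short Python sketch using the standard bz2
module. It is not GetData code: the helper name uncompressed_size is hypothetical, and the one-mebibyte
chunk size is chosen only to echo the cache size mentioned above.

```python
import bz2
import io

CHUNK = 1 << 20  # read one mebibyte of uncompressed data at a time


def uncompressed_size(fileobj):
    """Count uncompressed bytes by decompressing the whole stream (hypothetical helper)."""
    total = 0
    with bz2.open(fileobj, "rb") as stream:
        while True:
            chunk = stream.read(CHUNK)
            if not chunk:
                break
            total += len(chunk)
    return total


# An in-memory stand-in for a .bz2 data file.
payload = bytes(3 * CHUNK + 5)
compressed = io.BytesIO(bz2.compress(payload))
size = uncompressed_size(compressed)
```

Every byte must pass through the decompressor before the total is known, so the running time grows with
the size of the data rather than staying constant.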

GZip Encoding
The GZip Encoding compresses raw binary files using Lempel-Ziv coding (LZ77) as implemented in the gzip
format. GetData's GZip Encoding scheme is implemented through the zlib compression library written
by Jean-loup Gailly and Mark Adler. GetData's GZip Encoding framework currently lacks write
capabilities; as a result the GZip Encoding does not support functions which modify binary data.

To speed the operation of gd_nframes(3), the GZip Encoding reads the file's uncompressed size from the
gzip footer, which stores that size in bytes, modulo 2^32. As a result, using a
field with an (uncompressed) binary file size larger than 4 GiB as the reference field will result in the
wrong number of frames being reported. The file extension of the GZip Encoding is .gz.
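
The footer lookup and its wrap-around can be illustrated with Python's standard gzip module. This sketch
assumes a single-member gzip file, whose last four bytes hold the uncompressed size as a little-endian
32-bit integer; the helper names are hypothetical and the code is not taken from GetData.

```python
import gzip
import struct


def gzip_footer_size(blob):
    """Read the size field stored in the last four bytes of a gzip member."""
    return struct.unpack("<I", blob[-4:])[0]


def size_as_stored(true_size):
    """The footer can only hold the size modulo 2^32."""
    return true_size % 2**32


blob = gzip.compress(b"\x00" * 1000)
reported = gzip_footer_size(blob)        # no decompression needed

# A 5 GiB file would be recorded as 1 GiB in the footer.
wrapped = size_as_stored(5 * 2**30)
```

Here reported equals 1000 and wrapped equals 2^30, so a frame count computed from the footer of a 5 GiB
field would correspond to only 1 GiB of data.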

LZMA Encoding
The LZMA Encoding compresses raw binary files using the Lempel-Ziv Markov Chain Algorithm (LZMA) as
implemented in the xz container format. GetData's LZMA Encoding scheme is implemented through the lzma
library, part of the XZ Utils suite written by Lasse Collin, Ville Koskinen, and Igor Pavlov. GetData's
LZMA Encoding framework currently lacks write capabilities; as a result the LZMA Encoding does not
support functions which modify binary data.

As with the BZip2 Encoding, GetData caches an uncompressed megabyte of data at a time to speed access
times. A call to gd_nframes(3) requires decompression of the entire binary file to determine its
uncompressed size, and may take some time to complete. The file extension of the LZMA Encoding is .xz
or .lzma.

Slim Encoding
The Slim Encoding compresses raw binary files using the slimlib compression library written by Joseph
Fowler. The slimlib library was developed at Princeton University to compress dirfile-like data.
GetData's Slim Encoding framework currently lacks write capabilities; as a result, the Slim Encoding does
not support functions which modify binary data. The file extension of the Slim Encoding is .slm.

AUTHOR

       This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.