Provided by:
libgetdata-tools_0.7.3-6_i386 
NAME
dirfile-encoding -- dirfile database encoding schemes
DESCRIPTION
The Dirfile Standards indicate that RAW fields defined in the database
are accompanied by binary files containing the field data in the
specified simple data type. In certain situations, it may be
advantageous to convert the binary files in the database into a more
convenient form. This is accomplished by encoding the binary file into
the alternate form. A common use-case for encoding a binary file is to
compress it to save disk space. Only data is modified by an encoding
scheme. Database metadata is unaffected.
Support for encoding schemes is optional. An implementation need not
support any particular encoding scheme, or may only support certain
operations with it, but should expect to encounter unknown encoding
schemes and fail gracefully in such situations.
Additionally, how a particular encoding is implemented is not specified
by the Dirfile Standards, but, for purposes of interoperability, all
dirfile implementations are encouraged to support the encoding
implementation used by the GetData dirfile reference implementation,
elaborated below.
An encoding scheme is local to the particular format specification
fragment in which it is indicated. This allows a single dirfile to
have binary files which are stored using multiple encodings, by having
them defined in multiple fragments.
The rest of this manual page discusses specifics of the encoding
framework implemented in the GetData library, and does not constitute
part of the Dirfile Standards.
THE GETDATA ENCODING FRAMEWORK
The GetData library provides an encoding framework which abstracts
binary file I/O, allowing for generic support for a wide variety of
encoding schemes. Functions which may make use of the encoding
framework are:
gd_add(3), gd_add_raw(3), gd_add_spec(3),
gd_alter_encoding(3), gd_alter_endianness(3),
gd_alter_frameoffset(3), gd_alter_entry(3),
gd_alter_raw(3), gd_alter_spec(3), gd_getdata(3),
gd_move(3), gd_nframes(3), gd_putdata(3), and gd_rename(3).
Most of the encodings supported by GetData are implemented through
external libraries which handle the actual file I/O and data
translation. All such libraries are optional; a build of the library
which omits an external library will lack support for the associated
encoding scheme. In this case, GetData will still properly identify
the encoding scheme, but attempts to use GetData for file I/O via the
encoding will fail with the GD_E_UNSUPPORTED error code.
GetData discovers the encoding scheme of a particular RAW field by
noting the filename extension of files associated with the field.
Binary files which form an unencoded dirfile have no file extension.
The file extensions used by the other encodings are noted below.
Encoding discovery proceeds by searching for files with the known list
of file extensions (in an unspecified order) and stopping when the
first successful match is made. Because of this, when a field has
multiple data files with different, supported file extensions which
could legitimately be associated with it, the encoding scheme
discovered by GetData is not well defined.
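The ambiguity described above can be illustrated with a minimal Python
sketch. It is not GetData's implementation: the names KNOWN_EXTENSIONS,
candidate_encodings and discover_encoding are hypothetical, and the
extension table is simply the list given in this manual page. Unlike
GetData, which stops at the first match, the sketch collects every match
and refuses to guess when more than one data file is present.

```python
import os

# Extensions named in this manual page; "" stands for unencoded data.
# This table and the function names are illustrative assumptions,
# not part of the GetData API.
KNOWN_EXTENSIONS = {
    "": "none",
    ".txt": "text",
    ".bz2": "bzip2",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".slm": "slim",
}


def candidate_encodings(dirfile_path, field):
    """Return a sorted list of (filename, scheme) pairs, one for every
    data file that could belong to the RAW field `field`."""
    matches = []
    for ext, scheme in KNOWN_EXTENSIONS.items():
        name = field + ext
        if os.path.isfile(os.path.join(dirfile_path, name)):
            matches.append((name, scheme))
    return sorted(matches)


def discover_encoding(dirfile_path, field):
    """Return the scheme only when exactly one data file matches."""
    matches = candidate_encodings(dirfile_path, field)
    if not matches:
        raise FileNotFoundError(f"no data file found for field {field!r}")
    if len(matches) > 1:
        raise ValueError(f"ambiguous encoding for {field!r}: {matches}")
    return matches[0][1]
```

Running candidate_encodings() over a dirfile before opening it with
GetData is one way to spot fields that have more than one data file.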
In addition to raw (unencoded) data, GetData supports five other
encoding schemes: text encoding, bzip2 encoding, gzip encoding, lzma
encoding, and slim encoding, all discussed below.
Text Encoding
The Text Encoding is unique among GetData encoding schemes in that it
requires no external library. As a result, all builds of the library
contain full support for this encoding. It is meant to serve as a
reference encoding and example of the encoding framework for work on
other encoding schemes.
The Text Encoding replaces the binary data files with 7-bit ASCII files
containing a decimal text encoding of the data, one sample per line.
All operations are supported by the Text Encoding. The file extension
of the Text Encoding is .txt.
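As a rough illustration of "one sample per line", the following Python
sketch converts packed binary samples to decimal text and back. It
assumes integer samples and uses a struct format string (default "<h",
a little-endian 16-bit signed integer) as a stand-in for the RAW field's
data type. The function names are hypothetical, and the exact text
format GetData writes, particularly for floating-point types, is not
specified here.

```python
import struct


def encode_text(raw_bytes, fmt="<h"):
    """Render packed binary samples as decimal text, one sample per line.

    `fmt` is a struct format for a single sample; it is an assumption
    standing in for the data type of the RAW field.
    """
    size = struct.calcsize(fmt)
    if len(raw_bytes) % size:
        raise ValueError("data length is not a whole number of samples")
    values = (value for (value,) in struct.iter_unpack(fmt, raw_bytes))
    return "".join(f"{value}\n" for value in values)


def decode_text(text, fmt="<h"):
    """Inverse of encode_text for integer sample types."""
    return b"".join(struct.pack(fmt, int(line)) for line in text.splitlines())
```

For example, three 16-bit samples 1, -2 and 300 become the three lines
"1", "-2" and "300", and decode_text() restores the original bytes.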
BZip2 Encoding
The BZip2 Encoding compresses raw binary files using the Burrows-
Wheeler block sorting text compression algorithm and Huffman coding, as
implemented in the bzip2 format. GetData's BZip2 Encoding scheme is
implemented through the bzip2 compression library written by Julian
Seward. GetData's BZip2 Encoding framework currently lacks write
capabilities; as a result the BZip2 Encoding does not support functions
which modify binary data.
GetData caches an uncompressed megabyte of data at a time to speed
access times. A call to gd_nframes(3) requires decompression of the
entire binary file to determine its uncompressed size, and may take
some time to complete. The file extension of the BZip2 Encoding is
.bz2.
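The cost of sizing a bzip2 file can be seen in a short Python sketch
that uses the standard bz2 module rather than GetData. The only way to
learn the uncompressed size is to read the stream to its end, here one
mebibyte at a time. The helper names are hypothetical, and the frame
arithmetic assumes a RAW field with a fixed sample size and a fixed
number of samples per frame, as defined by the Dirfile Standards.

```python
import bz2

CHUNK = 1 << 20  # read one uncompressed mebibyte at a time


def uncompressed_size_bz2(path):
    """Count uncompressed bytes by decompressing the whole file."""
    total = 0
    with bz2.open(path, "rb") as stream:
        while True:
            block = stream.read(CHUNK)
            if not block:
                break
            total += len(block)
    return total


def frame_count(n_bytes, sample_size, samples_per_frame):
    """Number of whole frames held in n_bytes of raw data."""
    return n_bytes // (sample_size * samples_per_frame)
```

The running time grows with the uncompressed size of the file, which is
why a frame count on a large compressed field can be slow.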
GZip Encoding
The GZip Encoding compresses raw binary files using Lempel-Ziv coding
(LZ77) as implemented in the gzip format. GetData's GZip Encoding
scheme is implemented through the zlib compression library written
by Jean-loup Gailly and Mark Adler. GetData's GZip Encoding framework
currently lacks write capabilities; as a result the GZip Encoding does
not support functions which modify binary data.
To speed the operation of gd_nframes(3), the GZip Encoding reads the
uncompressed size of the file from the gzip footer, which stores that
size in bytes, modulo 2^32. As a result, using a
field with an (uncompressed) binary file size larger than 4 GiB as the
reference field will result in the wrong number of frames being
reported. The file extension of the GZip Encoding is .gz.
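The footer lookup and its 4 GiB limitation can be sketched in Python.
This is an illustration, not GetData's code. It relies on the gzip file
format, in which the last four bytes of a member hold the uncompressed
size as a little-endian 32-bit unsigned integer, and it assumes a file
with a single gzip member. The function names are hypothetical.

```python
import os
import struct


def gzip_isize(path):
    """Return the size field stored in the last four bytes of a gzip
    file: the uncompressed size in bytes, modulo 2**32."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)  # position four bytes before end of file
        (isize,) = struct.unpack("<I", f.read(4))
    return isize


def reported_size(true_size):
    """Size a footer-based reader would see for a given true size."""
    return true_size % 2**32
```

By this arithmetic, a field whose binary file holds 5 GiB of
uncompressed data appears to hold only 1 GiB, so the frame count
derived from the footer is too small.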
LZMA Encoding
The LZMA Encoding compresses raw binary files using the Lempel-Ziv
Markov Chain Algorithm (LZMA) as implemented in the xz container
format. GetData's LZMA Encoding scheme is implemented through the lzma
library, part of the XZ Utils suite written by Lasse Collin, Ville
Koskinen, and Igor Pavlov. GetData's LZMA Encoding framework currently
lacks write capabilities; as a result the LZMA Encoding does not
support functions which modify binary data.
As with the BZip2 Encoding, GetData caches an uncompressed megabyte of
data at a time to speed access times. A call to gd_nframes(3)
requires decompression of the entire binary file to determine its
uncompressed size, and may take some time to complete. The file
extension of the LZMA Encoding is .xz or .lzma.
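The two file extensions correspond to two container formats. A minimal
Python sketch, using the standard lzma module rather than GetData,
shows that a reader which auto-detects the container can size either
kind of file, again only by decompressing all of it. The function name
is hypothetical, and treating .lzma files as the legacy LZMA container
is an assumption about how such files were produced.

```python
import lzma


def uncompressed_size_lzma(path, chunk=1 << 20):
    """Count uncompressed bytes in an .xz or legacy .lzma file.

    lzma.open() auto-detects the container when reading, so the same
    code handles both extensions.
    """
    total = 0
    with lzma.open(path, "rb") as stream:
        while True:
            block = stream.read(chunk)
            if not block:
                break
            total += len(block)
    return total
```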
Slim Encoding
The Slim Encoding compresses raw binary files using the slimlib
compression library written by Joseph Fowler. The slimlib library was
developed at Princeton University to compress dirfile-like data.
GetData's Slim Encoding framework currently lacks write capabilities;
as a result, the Slim Encoding does not support functions which modify
binary files. The file extension of the Slim Encoding is .slm.
AUTHOR
This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.
SEE ALSO
dirfile(5), dirfile-format(5), bzip2(1), gzip(1), zlib(3).