Ubuntu Manpage: Lucy::Docs::FileFormat - Overview of index file format.

NAME

       Lucy::Docs::FileFormat - Overview of index file format.

OVERVIEW

       It is not necessary to understand the current implementation details of the index file format in order to
       use Apache Lucy effectively, but it may be helpful if you are interested in tweaking for high
       performance, exotic usage, or debugging and development.

       On a file system, an index is a directory.  The files inside have a hierarchical relationship: an index
       is made up of "segments", each of which is an independent inverted index with its own subdirectory; each
       segment is made up of several component parts.

           [index]--|
                    |--snapshot_XXX.json
                    |--schema_XXX.json
                    |--write.lock
                    |
                    |--seg_1--|
                    |         |--segmeta.json
                    |         |--cfmeta.json
                    |         |--cf.dat-------|
                    |                         |--[lexicon]
                    |                         |--[postings]
                    |                         |--[documents]
                    |                         |--[highlight]
                    |                         |--[deletions]
                    |
                    |--seg_2--|
                    |         |--segmeta.json
                    |         |--cfmeta.json
                    |         |--cf.dat-------|
                    |                         |--[lexicon]
                    |                         |--[postings]
                    |                         |--[documents]
                    |                         |--[highlight]
                    |                         |--[deletions]
                    |
                    |--[...]--|

Write-once philosophy

       All segment directory names consist of the string "seg_" followed by a number in base 36: seg_1, seg_5m,
       seg_p9s2 and so on, with higher numbers indicating more recent segments.  Once a segment is finished and
       committed, its name is never re-used and its files are never modified.
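
       As a minimal sketch of the naming rule above, the following Python orders segment directory names by the
       base-36 number they carry.  The helper names "segment_number" and "sort_segments" are hypothetical and
       are not part of Lucy; the sketch only assumes the "seg_" prefix and lowercase base-36 digits described
       above.

```python
import re

# "seg_" followed by one or more base-36 digits (0-9, then a-z).
SEG_NAME = re.compile(r"^seg_([0-9a-z]+)$")

def segment_number(name):
    """Return the integer encoded in a segment directory name, or None."""
    match = SEG_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1), 36)

def sort_segments(names):
    """Order segment directory names from oldest to most recent."""
    segs = [n for n in names if segment_number(n) is not None]
    return sorted(segs, key=segment_number)

names = ["seg_p9s2", "seg_1", "write.lock", "seg_5m"]
print(sort_segments(names))  # ['seg_1', 'seg_5m', 'seg_p9s2']
```

       Sorting on the decoded number rather than on the raw string matters once names differ in length: "seg_z"
       (35) is older than "seg_10" (36), although it sorts later as a string.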

       Old segments become obsolete and can be removed when their data has been consolidated into new segments
       during the process of segment merging and optimization.  A fully-optimized index has only one segment.

Top-level entries

       There are a handful of "top-level" files and directories which belong to the entire index rather than to
       a particular segment.

   snapshot_XXX.json
       A "snapshot" file, e.g. "snapshot_m7p.json", is a list of index files and directories.  Because index
       files, once written, are never modified, the list of entries in a snapshot defines a point-in-time view
       of the data in an index.

       Like segment directories, snapshot files also utilize the unique-base-36-number naming convention; the
       higher the number, the more recent the file.  The appearance of a new snapshot file within the index
       directory constitutes an index update.  While a new segment is being written, new files may be added to
       the index directory, but until a new snapshot file gets written, a Searcher opening the index for reading
       won't know about them.
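
       The point-in-time behaviour can be illustrated with a small Python sketch.  It is not Lucy's reader
       logic: "latest_snapshot" and "visible_entries" are hypothetical helpers, and the snapshot's entry list
       is passed in as a plain Python list rather than parsed from the JSON file, whose exact layout is not
       described here.

```python
import re

SNAPSHOT_NAME = re.compile(r"^snapshot_([0-9a-z]+)\.json$")

def latest_snapshot(directory_listing):
    """Return the snapshot file name with the highest base-36 number, or None."""
    best_name, best_num = None, -1
    for entry in directory_listing:
        match = SNAPSHOT_NAME.match(entry)
        if match:
            num = int(match.group(1), 36)
            if num > best_num:
                best_name, best_num = entry, num
    return best_name

def visible_entries(snapshot_entries, directory_listing):
    """Entries a reader would use: present on disk AND listed in the snapshot."""
    listed = set(snapshot_entries)
    return sorted(e for e in directory_listing if e in listed)

# seg_2 is still being written, so no snapshot lists it yet.
on_disk = ["snapshot_1.json", "schema_1.json", "seg_1", "seg_2", "write.lock"]
print(latest_snapshot(on_disk))                              # snapshot_1.json
print(visible_entries(["schema_1.json", "seg_1"], on_disk))  # ['schema_1.json', 'seg_1']
```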

   schema_XXX.json
       The schema file is a Schema object describing the index's format, serialized as JSON.  It, too, is
       versioned, and a given snapshot file will reference one and only one schema file.

   locks
       By default, only one indexing process may safely modify the index at any given time.  Processes reserve
       an index by laying claim to the "write.lock" file within the "locks/" directory.  A smattering of other
       lock files may be used from time to time, as well.

A segment's component parts

       By default, each segment has up to five logical components: lexicon, postings, document storage,
       highlight data, and deletions.  Binary data from these components gets stored in virtual files within the
       "cf.dat" compound file; metadata is stored in a shared "segmeta.json" file.

   segmeta.json
       The segmeta.json file is a central repository for segment metadata.  In addition to information such as
       document counts and field numbers, it also warehouses arbitrary metadata on behalf of individual index
       components.

   Lexicon
       Each indexed field gets its own lexicon in each segment.  The exact files involved depend on the field's
       type, but generally speaking there will be two parts.  First, there's a primary "lexicon-XXX.dat" file
       which houses a complete term list associating terms with corpus frequency statistics, postings file
       locations, etc.  Second, one or more "lexicon index" files may be present which contain periodic samples
       from the primary lexicon file to facilitate fast lookups.
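
       The idea of periodic samples can be sketched in a few lines of Python.  This illustrates the general
       sampling technique, not Lucy's on-disk encoding: the term list is an in-memory sorted list, list
       positions stand in for byte offsets, and the names "build_lexicon_index", "lookup" and "interval" are
       assumptions made for the example.

```python
import bisect

def build_lexicon_index(terms, interval):
    """Sample every `interval`-th term from a sorted term list."""
    return [(terms[i], i) for i in range(0, len(terms), interval)]

def lookup(terms, index, target, interval):
    """Return the position of `target` in `terms`, or None if absent."""
    sample_terms = [term for term, _ in index]
    # Find the last sample that sorts at or before the target.
    slot = bisect.bisect_right(sample_terms, target) - 1
    if slot < 0:
        return None  # target sorts before the first term
    start = index[slot][1]
    # Scan only the block between this sample and the next one.
    for pos in range(start, min(start + interval, len(terms))):
        if terms[pos] == target:
            return pos
    return None

terms = ["apple", "banana", "cherry", "freedom", "grape", "liberty", "zebra"]
index = build_lexicon_index(terms, 3)
print(lookup(terms, index, "grape", 3))  # 4
print(lookup(terms, index, "fig", 3))    # None
```

       The sampled index stays small enough to search quickly, and each lookup then touches at most "interval"
       entries of the full term list.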

   Postings
       "Posting" is a technical term from the field of information retrieval, defined as a single instance of
       one term indexing one document.  If you are looking at the index in the back of a book, and you see that
       "freedom" is referenced on pages 8, 86, and 240, that would be three postings, which taken together form
       a "posting list".  The same terminology applies to an index in electronic form.

       Each segment has one postings file per indexed field.  When a search is performed for a single term,
       first that term is looked up in the lexicon.  If the term exists in the segment, the record in the
       lexicon will contain information about which postings file to look at and where to look.

       The first thing any posting record tells you is a document id.  By iterating over all the postings
       associated with a term, you can find all the documents that match that term, a process which is analogous
       to looking up page numbers in a book's index.  However, each posting record typically contains other
       information in addition to document id, e.g. the positions at which the term occurs within the field.
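
       A toy model of posting lists in Python may make this concrete.  The "Posting" class and both functions
       are hypothetical and hold everything in memory; they only illustrate how document ids answer a
       single-term query and how positions can answer a two-word phrase query.

```python
from dataclasses import dataclass, field

@dataclass
class Posting:
    doc_id: int
    positions: list = field(default_factory=list)  # token positions within the field

def matching_docs(posting_list):
    """All document ids that contain the term."""
    return [p.doc_id for p in posting_list]

def phrase_docs(first, second):
    """Doc ids where the second term occurs directly after the first term."""
    second_by_doc = {p.doc_id: set(p.positions) for p in second}
    hits = []
    for p in first:
        following = second_by_doc.get(p.doc_id, set())
        if any(pos + 1 in following for pos in p.positions):
            hits.append(p.doc_id)
    return hits

freedom = [Posting(40, [3]), Posting(44, [0, 9])]
of = [Posting(44, [10]), Posting(51, [2])]
print(matching_docs(freedom))    # [40, 44]
print(phrase_docs(freedom, of))  # [44]
```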

   Documents
       The document storage section is a simple database, organized into two files:

       •   documents.dat - Serialized documents.

       •   documents.ix - Document storage index, a solid array of 64-bit integers where each integer location
           corresponds to a document id, and the value at that location points at a file position in the
           documents.dat file.
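
       The two-file layout can be sketched in Python as follows.  Several details here are assumptions made
       for the sake of a runnable example and are not taken from the format itself: big-endian signed integers,
       doc ids counted from 0, and one extra trailing offset marking the end of the last document.

```python
import struct

def build_store(serialized_docs):
    """Build toy (dat, ix) byte strings from a list of serialized documents."""
    dat = bytearray()
    offsets = []
    for blob in serialized_docs:
        offsets.append(len(dat))  # where this document starts in dat
        dat += blob
    offsets.append(len(dat))      # assumed sentinel: end of the last document
    # Assumed encoding: big-endian signed 64-bit integers, packed back to back.
    ix = struct.pack(f">{len(offsets)}q", *offsets)
    return bytes(dat), ix

def fetch(dat, ix, doc_id):
    """Read one document: two adjacent 8-byte offsets give its start and end."""
    start, end = struct.unpack_from(">2q", ix, doc_id * 8)
    return dat[start:end]

dat, ix = build_store([b"alpha", b"be", b"gamma!"])
print(fetch(dat, ix, 1))  # b'be'
```

       Because every slot in the index has the same width, locating a document costs one multiplication and
       one seek, regardless of how many documents the segment holds.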

   Highlight data
       The files which store data used for excerpting and highlighting are organized similarly to the files used
       to store documents.

       •   highlight.dat - Chunks of serialized highlight data, one per doc id.

       •   highlight.ix - Highlight data index -- as with the "documents.ix" file, a solid array of 64-bit file
           pointers.

   Deletions
       When a document is "deleted" from a segment, it is not actually purged right away; it is merely marked as
       "deleted" via a deletions file.  Deletions files contain bit vectors with one bit for each document in
       the segment; if bit #254 is set then document 254 is deleted, and if that document turns up in a search
       it will be masked out.

       It is only when a segment's contents are rewritten to a new segment during the segment-merging process
       that deleted documents truly go away.
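
       A bit vector of this kind can be modelled with a Python bytearray.  The bit order within each byte
       (least significant bit first) and the function names are assumptions chosen for the sketch, not a
       statement about how the deletions file is actually encoded.

```python
def mark_deleted(bits, doc_id):
    """Set the bit for doc_id in a bytearray-backed bit vector."""
    byte_index, bit_index = divmod(doc_id, 8)
    bits[byte_index] |= 1 << bit_index

def is_deleted(bits, doc_id):
    """True if the bit for doc_id is set; ids past the end count as live."""
    byte_index, bit_index = divmod(doc_id, 8)
    if byte_index >= len(bits):
        return False
    return bool(bits[byte_index] & (1 << bit_index))

def mask_hits(hits, bits):
    """Drop deleted documents from a list of matching doc ids."""
    return [doc_id for doc_id in hits if not is_deleted(bits, doc_id)]

bits = bytearray(32)  # room for 256 documents, all live
mark_deleted(bits, 254)
print(mask_hits([40, 254, 44], bits))  # [40, 44]
```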

Compound Files

       If you peer inside an index directory, you won't actually find any files named "documents.dat",
       "highlight.ix", etc. unless there is an indexing process underway.  What you will find instead is one
       "cf.dat" and one "cfmeta.json" file per segment.

       To minimize the need for file descriptors at search-time, all per-segment binary data files are
       concatenated together in "cf.dat" at the close of each indexing session.  Information about where each
       file begins and ends is stored in "cfmeta.json".  When the segment is opened for reading, a single file
       descriptor per "cf.dat" file can be shared among several readers.
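
       The concatenation scheme can be sketched in Python.  The JSON key names used below ("files", "offset",
       "length") are assumptions for illustration; the text above only establishes that "cfmeta.json" records
       where each virtual file begins and ends.

```python
import json

def build_compound(files):
    """Concatenate named byte strings; return (cf_dat, cfmeta_json)."""
    data = bytearray()
    meta = {}
    for name, blob in files.items():
        # Record where this virtual file lives inside the compound file.
        meta[name] = {"offset": len(data), "length": len(blob)}
        data += blob
    return bytes(data), json.dumps({"files": meta})

def read_virtual_file(cf_dat, cfmeta_json, name):
    """Return the bytes of one virtual file from the compound file."""
    entry = json.loads(cfmeta_json)["files"][name]
    start = entry["offset"]
    return cf_dat[start:start + entry["length"]]

cf_dat, cfmeta = build_compound({
    "documents.dat": b"DOCS",
    "documents.ix": b"IX",
    "highlight.dat": b"HL",
})
print(read_virtual_file(cf_dat, cfmeta, "documents.ix"))  # b'IX'
```

       A reader built this way needs only one open handle on the compound file and can hand out slices of it
       to each component.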

A Typical Search

       Here's a simplified narrative, dramatizing how a search for "freedom" against a given segment plays out:

       1.  The searcher asks the relevant Lexicon Index, "Do you know anything about 'freedom'?"  Lexicon Index
           replies, "Can't say for sure, but if the main Lexicon file does, 'freedom' is probably somewhere
           around byte 21008".

       2.  The main Lexicon tells the searcher, "One moment, let me scan our records...  Yes, we have 2 documents
           which contain 'freedom'.  You'll find them in seg_6/postings-4.dat starting at byte 66991."

       3.  The Postings file says, "Yep, we have 'freedom', all right!  Document id 40 has 1 'freedom', and
           document 44 has 8.  If you need to know more, like if any 'freedom' is part of the phrase 'freedom of
           speech', ask me about positions!"

       4.  If the searcher is only looking for 'freedom' in isolation, that's where it stops.  It now knows
           enough to assign the documents scores against "freedom", with the 8-freedom document likely ranking
           higher than the single-freedom document.