Ubuntu Manpage: glimpseindex - index whole file systems to be searched by glimpse

NAME

       glimpseindex - index whole file systems to be searched by glimpse

OVERVIEW

Glimpse (which stands for GLobal IMPlicit SEarch) is a popular UNIX indexing and query
system that allows you to search through a large set of files very quickly. Glimpseindex
is the indexing program for glimpse. Glimpse supports most of agrep's options (agrep is
our powerful version of grep) including approximate matching (e.g., finding misspelled
words), Boolean queries, and even some limited forms of regular expressions. It is used
in the same way, except that you don't have to specify file names. So, if you are looking
for a needle anywhere in your file system, all you have to do is say glimpse needle and
all lines containing needle will appear preceded by the file name. See man glimpse for
details on how to use glimpse.

Glimpseindex provides three indexing options: a tiny index (2-3% of the total size of all
files), a small index (7-8%) and a medium-size index (20-30%). Search times are normally
better with larger indexes (although unless files are quite large, the small index is just
about as good as the medium one). To index all your files, you say glimpseindex ~ for
tiny index (where ~ stands for the home directory), glimpseindex -o ~ for small index, and
glimpseindex -b ~ for medium.

Please submit bug reports or comments at http://webglimpse.net/bugzilla/ Mail
majordomo@webglimpse.net with SUBSCRIBE WGUSERS in the message body to be added to the
webglimpse mailing list, where glimpse discussion is also directed. HTML version of these
manual pages can be found in http://webglimpse.net/docs/glimpseindexhelp.html Also, see
the glimpse home pages in http://webglimpse.net/glimpse/

SYNOPSIS

       glimpseindex [ -abEfFiInostT -w number -dD filename(s) -H directory -M number -S number  ]
       directory_name[s]

INTRODUCTION

Glimpseindex builds an index of all text files in all the directories specified and all
their subdirectories (recursively). It is also possible to build several separate indexes
(possibly even overlapping). The simplest way to index your files is to say

glimpseindex -o ~

The index consists of several files (described in detail below), all with the prefix
.glimpse_ stored in the user's home directory (unless otherwise specified with the -H
option). Files with one of the following suffixes are not indexed: ".o", ".gz", ".Z",
".z", ".hqx", ".zip", ".tar". (Unless the -z option is used, see below.) In addition,
glimpseindex attempts to determine whether a file is a text file and does not index files
that it thinks are not text files. Numbers are not indexed unless the -n option is used.
It is possible to prevent specified files from being indexed by adding their names to the
.glimpse_exclude file (described below). The -o option builds a larger index than without
it (typically about 7-8% vs. 2-3% without -o) allowing for a faster search (1-5 times
faster). The -b builds an even larger index and allows an even faster search some of the
time (-b is helpful mostly when large files are present). There is an incremental
indexing option -f, which updates an existing index by determining which files have been
created or modified since the index was built and adding them to the index (see -f).
Glimpseindex is reasonably fast, taking about 20 minutes to index 15,000 files of about
200MB (on an Dec Alpha 233) and 2-4 minutes to update an existing index. (Your mileage may
vary.) It is also possible to increment the index by adding a specific file (the -a
option).

Once an index is built, searching for pattern is as easy as saying

glimpse pattern

(See man glimpse for all glimpse's options and features.)

A DETAILED DESCRIPTION OF GLIMPSEINDEX

Glimpse does not automatically index files. You have to tell it to do it. This can be
done manually, but a better way is to set it to run every night. It is probably a good
idea to run glimpseindex manually for the first time to be sure it works properly. The
following is a simple script to run glimpseindex every night. We assume that this script
is stored in a file called glimpse.script:

glimpseindex -o -t -w 5000 ~ >& .glimpse_out
at -m 0300 glimpse.script
(It might be interesting to collect all the outputs of glimpse by changing >& to >>& so
that the file .glimpse_out maintains a history. In this case the file must be created
before the first time >>& is used. If you use ksh, replace '>&' with '2>&1'.)

Glimpseindex stores the names of all the files that it indexed in the file
.glimpse_filenames. Each file is listed by its full path name as obtained at the time the
files were indexed. For example, /usr1/udi/file1. Glimpse uses this full name when it
performs the search, so the name must match the current name. This may become a problem
when the indexing and the search are done from different machines (e.g., through NFS),
which may cause the path names to be different. For example,
/tmp_mnt/R/xxx/xxx/usr1/udi/file1. (The same is true for several other .glimpse files.
See below.)

Glimpseindex does not follow symbolic links unless they are explicitly included in the
.glimpse_include file (described below).

Glimpseindex makes an effort to identify non-text files such as binary files, compressed
files, uuencoded files, postscript files, binhex files, etc. These files are
automatically not indexed. In addition, all files whose names end with `.o', `.gz', `.Z',
`.z', `.hqx', `.zip', or `.tar' will not be indexed (unless they are specifically included
in .glimpse_include - see below).

The options for glimpseindex are as follows:

-a adds the given file[s] and/or directories to an existing index. Any given
directory will be traversed recursively and all files will be indexed (unless they
appear in .glimpse_exclude; see below). Using this option is generally much faster
than indexing everything from scratch, although in rare cases the index may not be
as good. If for some reason the index is full (which can happen unless -o or -b
are used) glimpseindex -a will produce an error message and will exit without
changing the original index.

-b builds a medium-size index (20-30% of the size of all files), allowing faster
search. This option forces glimpseindex to store an exact (byte level) pointer to
each occurrence of each word (except for some very common words belonging to the
stop list).

-B uses a hash table that is 4 times bigger (256k entries instead of 64K) to speed up
indexing. The memory usage will increase typically by about 2 MB. This option is
only for indexing speed; it does not affect the final index.

-d filename(s)
deletes the given file(s) from the index.

-D filename(s)
deletes the given file(s) from the list of file names, but not from the index.
This is much faster than -d, and the file(s) will not be found by glimpse.
However, the index itself will not become smaller.

-E does not run a check on file types. Glimpse normally attempts to exclude non-text
files, but this attempt is not always perfect. With -E, glimpseindex indexes all
files, except those that are specifically excluded in .glimpse_exclude and those
whose file names end with one of the excluded suffixes.

-f incremental indexing. glimpseindex scans all files and adds to the index only
those files that were created or modified after the current index was built. If
there is no current index or if this procedure fails, glimpseindex automatically
reverts to the default mode (which is to index everything from scratch). This
option may create an inefficient index for several reasons, one of which is that
deleted files are not really deleted from the index. Unless changes are small,
mostly additions, and -o is used, we suggest to use the default mode as much as
possible.

-F Glimpseindex receives the list of files to index from standard input.

-H directory
Put or update the index and all other .glimpse files (listed below) in "directory".
The default is the home directory. When glimpse is run, the -H option must be used
to direct glimpse to this directory, because glimpse assumes that the index is in
the home directory (see also the -H option in glimpse).

-i Make .glimpse_include (SEE GLIMPSEINDEX FILES) take precedence over
.glimpse_exclude, so that, for example, one can exclude everything (by putting *)
and then explicitly include files.

-I Instead of indexing, only show (print to standard out) the list of files that would
be indexed. It is useful for filtering purposes. ("glimpseindex -I dir |
glimpseindex -F" is the same as "glimpseindex dir".)

-M x Tells glimpseindex to use x MB of memory for temporary tables. The more memory you
allow the faster glimpseindex will run. The default is x=2. The value of x must
be a positive integer. Glimpseindex will need more memory than x for other things,
and glimpseindex may perform some 'forks', so you'll have to experiment if you want
to use this option. WARNING: If x is too large you may run out of swap space.

-n Index numbers as well as text. The default is not to index numbers. This is
useful when searching for dates or other identifying numbers, but it may make the
index very large if there are lots of numbers. In general, glimpseindex strips
away any non-alphabetic character. For example, the string abc123 will be indexed
as abc if the -n option is not used and as abc123 if it is used. Glimpse provides
warnings (in .glimpse_messages) for all files in which more than half the words
that were added to the index from that file had digits in them (this is an attempt
to identify data files that should probably not be indexed). One can use the
.glimpse_exclude file to exclude data files or any other files. (See GLIMPSEINDEX
FILES.)

-o Build a small index rather than tiny (meaning 7-9% of the sizes of all files - your
mileage may vary) allowing faster search. This option forces glimpseindex to
allocate one block per file (a block usually contains many files). A detailed
explanation of how blocks affect glimpse can be found in the glimpse article. (See
also LIMITATIONS.)

-R Recompute .glimpse_filenames_index from .glimpse_filenames. The file
.glimpse_filenames_index speeds up processing. Glimpseindex usually computes it
automatically. However, if for some reason one wants to change the path names of
the files listed in .glimpse_filenames, then running glimpseindex -R recomputes
.glimpse_filenames_index. This is useful if the index is computed on one machine,
but is used on another (with the same hierarchy). The names of the files listed in
.glimpse_filenames are used in runtime, so changing them can be done at any time in
any way (as long as just the names not the content is changed). This is not really
an option in the regular sense; rather, it is a program by itself, and it is meant
as a post-processing step. (Available only from version 3.6.)

-s supports structured queries. This option was added to support the Harvest project
and it is applicable mostly in that context. See STRUCTURED QUERIES below for more
information and also http://harvest.sourceforge.net/ for more information about the
Harvest project.

-S k The number k determines the size of the stop-list. The stop-list consists of words
that are too common and are not indexed (e.g., 'the' or 'and'). Instead of having
a fixed stop-list, glimpseindex figures out the words that are too common for every
index separately. The rules are different for the different indexing options. The
tiny index contains all words (the savings from a stop-list are too small to
bother). The small index (-o), the number k is a percentage threshold. A word
will be in the stop list if it appears in at least k% of all files. The default
value is 80%. (If there are less than 256 files, then the stop-list is not
maintained.) The medium index (-b) counts all occurrences of all words, and a word
is added to the stop-list if it appears at least k times per MByte. The default
value is 500. A query that includes a stop list word is of course less efficient.
(See also LIMITATIONS below.)

-t (A new option in version 3.5.) The order in which files are indexed is determined
by scanning the directories, which is mostly arbitrary. With the -t option,
combined with either -o and -b, the indexed files are stored in reversed order of
modification age (younger files first). Results of queries are then automatically
returned in this order. Furthermore, glimpse can filter results by age; for
example, asking to look at only files that are at most 5 days old.

-T builds the turbo file. Starting at version 3.0, this is the default, so using this
option has no effect.

-w k Glimpseindex does a reasonable, but not a perfect, job of determining which files
should not be indexed. Sometimes a large text file should not be indexed; for
example, a dictionary may match most queries. The -w option stores in a file
called .glimpse_messages (in the same directory as the index) the list of all files
that contribute at least k new words to the index. The user can look at this list
of files and decide which should or should not be indexed. The file
.glimpse_exclude contains files that will not be indexed (see more below). We
recommend to set k to about 1000. This is not an exact measure. For example, if
the same file appears twice, then the second copy will not contribute any new words
to the dictionary (but if you exclude the first copy and index again, the second
copy will contribute).

-X (starting at version 4.0B1) Extract titles from HTML pages and add the titles to
the index (in .glimpse_filenames). (This feature was added to improve the
performance of WebGlimpse.) Works only on files whose names end with .html, .htm,
.shtml, and .shtm. (see glimpse.h/EXTRACT_INFO_SUFFIX to add to these suffixes.)
The routine to extract titles is called extract_info, in index/filetype.c. This
feature can be modified in various ways to extract info from many filetypes. The
titles are appended to the corresponding filenames with a space separator.
Glimpseindex assumes that filenames don't have spaces in them.

-z Allow customizable filtering, using the file .glimpse_filters to perform the
programs listed there for each match. The best example is compress/decompress. If
.glimpse_filters include the line
*.Z uncompress <
(separated by tabs) then before indexing any file that matches the pattern "*.Z"
(same syntax as the one for .glimpse_exclude) the command listed is executed first
(assuming input is from stdin, which is why uncompress needs <) and its output
(assuming it goes to stdout) is indexed. The file itself is not changed (i.e., it
stays compressed). Then if glimpse -z is used, the same program is used on these
files on the fly. Any program can be used (we run 'exec'). For example, one can
filter out parts of files that should not be indexed. Glimpseindex tries to apply
all filters in .glimpse_filters in the order they are given. For example, if you
want to uncompress a file and then extract some part of it, put the compression
command (the example above) first and then another line that specifies the
extraction. Note that this can slow down the search because the filters need to be
run before files are searched.

GLIMPSEINDEX FILES

All files used by glimpse are located at the directory(ies) where the index(es) is (are)
stored and have .glimpse_ as a prefix. The first two files (.glimpse_exclude and
.glimpse_include) are optionally supplied by the user. The other files are built and read
by glimpse.

.glimpse_exclude
contains a list of files that glimpseindex is explicitly told to ignore. In
general, the syntax of .glimpse_exclude/include is the same as that of agrep (or
any other grep). The lines in the .glimpse_exclude file are matched to the file
names, and if they match, the files are excluded. Notice that agrep matches to
parts of the string! e.g., agrep /ftp/pub will match /home/ftp/pub and
/ftp/pub/whatever. So, if you want to exclude /ftp/pub/core, you just list it, as
is, in the .glimpse_exclude file. If you put "/home/ftp/pub/cdrom" in
.glimpse_exclude, every file name that matches that string will be excluded,
meaning all files below it. You can use ^ to indicate the beginning of a file
name, and $ to indicate the end of one, and you can use * and ? in the usual way.
For example /ftp/*html will exclude /ftp/pub/foo.html, but will also exclude
/home/ftp/pub/html/whatever; if you want to exclude files that start with /ftp and
end with html use ^/ftp*html$ Notice that putting a * at the beginning or at the
end is redundant (in fact, in this case glimpseindex will remove the * when it does
the indexing). No other meta characters are allowed in .glimpse_exclude (e.g.,
don't use .* or # or |). Lines with * or ? must have no more than 30 characters.
Notice that, although the index itself will not be indexed, the list of file names
(.glimpse_filenames) will be indexed unless it is explicitly listed in
.glimpse_exclude.

.glimpse_filters
See the description above for the -z option.

.glimpse_include
contains a list of files that glimpseindex is explicitly told to include in the
index even though they may look like non-text files. Symbolic links are followed
by glimpseindex only if they are specifically included here. The syntax is the
same as the one for .glimpse_exclude (see there). If a file is in both
.glimpse_exclude and .glimpse_include it will be excluded unless -i is used.

.glimpse_filenames
contains the list of all indexed file names, one per line. This is an ASCII file
that can also be used with agrep to search for a file name leading to a fast find
command. For example,
glimpse 'count#\.c$' ~/.glimpse_filenames
will output the names of all (indexed) .c files that have 'count' in their name
(including anywhere on the path from the index). Setting the following alias in
the .login file may be useful:
alias findfile 'glimpse -h :1 ~/.glimpse_filenames'

.glimpse_index
contains the index. The index consists of lines, each starting with a word
followed by a list of block numbers (unless the -o or -b options are used, in which
case each word is followed by an offset into the file .glimpse_partitions where all
pointers are kept). The block/file numbers are stored in binary form, so this is
not an ASCII file.

.glimpse_messages
contains the output of the -w option (see above).

.glimpse_partitions
contains the partition of the indexed space into blocks and, when the index is
built with the -o or -b options, some part of the index. This file is used
internally by glimpse and it is a non-ASCII file.

.glimpse_statistics
contains some statistics about the makeup of the index. Useful for some advanced
applications and customization of glimpse.

STRUCTURED QUERIES

Glimpse can search for Boolean combinations of "attribute=value" terms by using the
Harvest SOIF parser library (in glimpse/libtemplate). To search this way, the index must
be made by using the -s option of glimpseindex (this can be used in conjunction with other
glimpseindex options). For glimpse and glimpseindex to recognize "structured" files, they
must be in SOIF format. In this format, each value is prefixed by an attribute-name with
the size of the value (in bytes) present in "{}" after the name of the attribute. For
example, The following lines are part of an SOIF file:
type{17}: Directory-Listing
md5{32}: 3858c73d68616df0ed58a44d306b12ba
Any string can serve as an attribute name. Glimpse "pattern;type=Directory-Listing" will
search for "pattern" only in files whose type is "Directory-Listing". The file itself is
considered to be one "object" and its name/url appears as the first attribute with an "@"
prefix; e.g., @FILE { http://xxx... } The scope of Boolean operations changes from records
(lines) to whole files when structured queries are used in glimpse (since individual query
terms can look at different attributes and they may not be "covered" by the record/line).
Note that glimpse can only search for patterns in the value parts of the SOIF file: there
are some attributes (like the TTL, MD5, etc.) that are interpreted by Harvest's internal
routines. See RFC 2655 for more detailed information of the SOIF format.

REFERENCES

       1.     U.  Manber  and  S.  Wu,  "GLIMPSE:  A Tool to Search Through Entire File Systems,"
              Usenix Winter 1994 Technical Conference (best paper award), San Francisco  (January
              1994),  pp.  23-32.   Also,  Technical Report #TR 93-34, Dept. of Computer Science,
              University of Arizona, October 1993 (a postscript file is  available  by  anonymous
              ftp at ftp://webglimpse.net/pub/glimpse/TR93-34.ps).

       2.     S.  Wu  and U. Manber, "Fast Text Searching Allowing Errors," Communications of the
              ACM 35 (October 1992), pp. 83-91.

LIMITATIONS

The index of glimpse is word based. A pattern that contains more than one word cannot be
found in the index. The way glimpse overcomes this weakness is by splitting any multi-
word pattern into its set of words and looking for all of them in the index. For example,
glimpse 'linear programming' will first consult the index to find all files containing
both linear and programming, and then apply agrep to find the combined pattern. This is
usually an effective solution, but it can be slow for cases where both words are very
common, but their combination is not.

The index of glimpse stores all patterns in lower case. When glimpse searches the index
it first converts all patterns to lower case, finds the appropriate files, and then
searches the actual files using the original patterns. So, for example, glimpse ABCXYZ
will first find all files containing abcxyz in any combination of lower and upper cases,
and then searches these files directly, so only the right cases will be found. One
problem with this approach is discovering misspellings that are caused by wrong cases.
For example, glimpse -B abcXYZ will first search the index for the best match to abcxyz
(because the pattern is converted to lower case); it will find that there are matches with
no errors, and will go to those files to search them directly, this time with the original
upper cases. If the closest match is, say AbcXYZ, glimpse may miss it, because it doesn't
expect an error. Another problem is speed. If you search for "ATT", it will look at the
index for "att". Unless you use -w to match the whole word, glimpse may have to search
all files containing, for example, "Seattle" which has "att" in it.

There is no size limit for simple patterns and simple patterns with Boolean AND or OR.
More complicated patterns are currently limited to approximately 30 characters. Lines are
limited to 1024 characters. Records are limited to 48K, and may be truncated if they are
larger than that. The limit of record length can be changed by modifying the parameter
Max_record in agrep.h.

Each line in .glimpse_exclude or .glimpse_include that contains a * or a ? must not exceed
30 characters length.

Glimpseindex does not index words of size > 64.

A medium-size index (-b) may lead to actually slower query times if the files are all very
small.

Under -b, it may be impossible to make the stop list empty. Glimpseindex is using the
"sort" routine, and all occurrences of a word appear at some point on one line. Sort is
limiting the size of lines it can handle (the value depends on the platform; ours is
16KB). If the lines are too big, the word is added to the stop list.

BUGS

       Please submit bug reports or comments at http://webglimpse.net/bugzilla/

DIAGNOSTICS

       (Only in version 3.6 and above.)
       exit status 0: terminated normally;
       exit status 1: glimpseindex errors (e.g., bad option combos, no files were indexed, etc.)
       exit status 2: system errors (e.g., write failed, sort failed, malloc failed).

AUTHORS

       Udi Manber and Burra Gopal, Department of Computer Science, University of Arizona, and Sun
       Wu, the National Chung-Cheng University, Taiwan. Now maintained by Golda Velez at Internet
       WorkShop (Email:  gvelez@webglimpse.net)

                                        November 10, 1997                         GLIMPSEINDEX(1)