Provided by: glimpse_4.18.7-6_amd64

NAME

       glimpseindex - index whole file systems to be searched by glimpse

OVERVIEW

       Glimpse (which stands for GLobal IMPlicit SEarch) is a popular UNIX indexing and query system that allows
       you to search through a large set of files very  quickly.   Glimpseindex  is  the  indexing  program  for
       glimpse.   Glimpse  supports  most  of  agrep's options (agrep is our powerful version of grep) including
       approximate matching (e.g., finding misspelled words), Boolean queries, and even some  limited  forms  of
       regular  expressions.  It is used in the same way, except that you don't have to specify file names.  So,
       if you are looking for a needle anywhere in your file system, all you have to do is  say  glimpse  needle
       and  all  lines  containing needle will appear preceded by the file name.  See man glimpse for details on
       how to use glimpse.

       Glimpseindex provides three indexing options: a tiny index (2-3% of the total size of all files), a small
       index  (7-8%)  and  a  medium-size  index (20-30%).  Search times are normally better with larger indexes
       (although unless files are quite large, the small index is just about as good as  the  medium  one).   To
       index  all  your  files,  you  say glimpseindex ~ for tiny index (where ~ stands for the home directory),
       glimpseindex -o ~ for small index, and glimpseindex -b ~ for medium.
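
       As a rough illustration of what these percentages mean in disk space, the sketch below applies them to a
       corpus size given in MB.  The function name estimate_index_mb is hypothetical and is not part of glimpse;
       the ranges are only the ones quoted above, so actual index sizes may differ.

```shell
# estimate_index_mb is a hypothetical helper, not part of glimpse.
# It applies the percentage ranges quoted above to a corpus size in MB.
estimate_index_mb() {
    total_mb=$1
    # Integer arithmetic: low and high percentage of the corpus size.
    echo "tiny $((total_mb * 2 / 100))-$((total_mb * 3 / 100)) MB"
    echo "small (-o) $((total_mb * 7 / 100))-$((total_mb * 8 / 100)) MB"
    echo "medium (-b) $((total_mb * 20 / 100))-$((total_mb * 30 / 100)) MB"
}

estimate_index_mb 200
```

       For a 200 MB collection this prints ranges of 4-6 MB, 14-16 MB, and 40-60 MB respectively.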

       Please submit bug reports or comments at http://webglimpse.net/bugzilla/.  To join the webglimpse mailing
       list, where glimpse discussion is also directed, mail majordomo@webglimpse.net with SUBSCRIBE WGUSERS in
       the message body.  An HTML version of these manual pages can be found at
       http://webglimpse.net/docs/glimpseindexhelp.html.  See also the glimpse home page at
       http://webglimpse.net/glimpse/.

SYNOPSIS

       glimpseindex  [  -abEfFiInostT  -w  number  -dD  filename(s)  -H  directory  -M  number   -S   number   ]
       directory_name[s]

INTRODUCTION

       Glimpseindex  builds  an  index  of  all  text  files  in  all  the  directories  specified and all their
       subdirectories (recursively).  It is also possible to  build  several  separate  indexes  (possibly  even
       overlapping).  The simplest way to index your files is to say

       glimpseindex -o ~

       The  index consists of several files (described in detail below), all with the prefix .glimpse_ stored in
       the user's home directory (unless otherwise specified with  the  -H  option).   Files  with  one  of  the
       following  suffixes  are  not  indexed:  ".o", ".gz", ".Z", ".z", ".hqx", ".zip", ".tar".  (Unless the -z
       option is used, see below.)  In addition, glimpseindex attempts to determine whether a  file  is  a  text
       file  and  does not index files that it thinks are not text files.  Numbers are not indexed unless the -n
       option is used.  It is possible to prevent specified files from being indexed by adding  their  names  to
       the  .glimpse_exclude  file  (described  below).   The  -o  option  builds a larger index than without it
       (typically about 7-8% vs. 2-3% without -o) allowing for a faster  search  (1-5  times  faster).   The  -b
       builds  an even larger index and allows an even faster search some of the time (-b is helpful mostly when
       large files are present).  There is an incremental indexing option -f, which updates an existing index by
       determining  which  files  have been created or modified since the index was built and adding them to the
       index (see -f).  Glimpseindex is reasonably fast, taking about 20 minutes to index 15,000 files totaling
       about 200MB (on a DEC Alpha 233) and 2-4 minutes to update an existing index.  (Your mileage may vary.)
       It is also possible to extend the index by adding specific files (the -a option).

       Once an index is built, searching for a pattern is as easy as saying

       glimpse pattern

       (See man glimpse for all glimpse's options and features.)

A DETAILED DESCRIPTION OF GLIMPSEINDEX

       Glimpse does not index files automatically; you must run glimpseindex yourself.  This can be done
       manually, but a better way is to schedule it to run every night.  It is probably a good idea to run
       glimpseindex manually the first time to be sure it works properly.  The following is a simple script to
       run glimpseindex every night.  We assume that this script is stored in a file called glimpse.script:

       glimpseindex -o -t -w 5000 ~ >& .glimpse_out
       at -m 0300 glimpse.script
       (It might be useful to collect all the output of glimpseindex by changing >& to >>& so that the file
       .glimpse_out maintains a history.  In this case the file must be created before the first time >>& is
       used.  The >& syntax is for csh; if you use ksh, replace '>& .glimpse_out' with '> .glimpse_out 2>&1'.)
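
       For Bourne-style shells (sh, ksh, bash), the indexing line of the script might be sketched as follows.
       The variables GLIMPSEINDEX and GLIMPSE_LOG are hypothetical conveniences introduced here so the command
       and log location can be overridden; they are not settings that glimpse itself reads.

```shell
#!/bin/sh
# Sketch of the indexing step of glimpse.script for sh/ksh/bash.
# GLIMPSEINDEX and GLIMPSE_LOG are illustrative variables, not glimpse settings.
run_nightly_index() {
    # Append stdout and stderr so the log keeps a history across runs.
    ${GLIMPSEINDEX:-glimpseindex} -o -t -w 5000 "$HOME" \
        >> "${GLIMPSE_LOG:-$HOME/.glimpse_out}" 2>&1
}
# Call run_nightly_index from the script that the nightly job executes.
```

       In these shells >> creates the log file if it does not exist yet, so no separate creation step is needed.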

       Glimpseindex stores the names of all the files that it indexed in the file .glimpse_filenames.  Each file
       is listed by its full path  name  as  obtained  at  the  time  the  files  were  indexed.   For  example,
       /usr1/udi/file1.   Glimpse  uses  this  full name when it performs the search, so the name must match the
       current name.  This may become a problem when the  indexing  and  the  search  are  done  from  different
       machines  (e.g.,  through  NFS),  which  may  cause  the  path  names  to  be  different.   For  example,
       /tmp_mnt/R/xxx/xxx/usr1/udi/file1.  (The same is true for several other .glimpse files.  See below.)

       Glimpseindex does not follow symbolic links unless they are explicitly included in  the  .glimpse_include
       file (described below).

       Glimpseindex makes an effort to identify non-text files such as binary files, compressed files, uuencoded
       files, postscript files, binhex files, etc.  These files are automatically not indexed.  In addition, all
       files whose names end with `.o', `.gz', `.Z', `.z', `.hqx', `.zip', or `.tar' will not be indexed (unless
       they are specifically included in .glimpse_include - see below).

       The options for glimpseindex are as follows:

       -a     adds the given file[s] and/or directories to an existing  index.   Any  given  directory  will  be
              traversed  recursively  and all files will be indexed (unless they appear in .glimpse_exclude; see
              below).  Using this option is  generally  much  faster  than  indexing  everything  from  scratch,
              although  in rare cases the index may not be as good.  If for some reason the index is full (which
              can happen unless -o or -b are used) glimpseindex -a will produce an error message and  will  exit
              without changing the original index.

       -b     builds a medium-size index (20-30% of the size of all files), allowing faster search.  This option
              forces glimpseindex to store an exact (byte level) pointer to each occurrence of each word (except
              for some very common words belonging to the stop list).

       -B     uses a hash table that is 4 times bigger (256K entries instead of 64K) to speed up indexing.  The
              memory usage will increase typically by about 2 MB.  This option is only for  indexing  speed;  it
              does not affect the final index.

       -d filename(s)
              deletes the given file(s) from the index.

       -D filename(s)
              deletes  the  given  file(s)  from  the  list of file names, but not from the index.  This is much
              faster than -d, and the file(s) will not be found by glimpse.  However, the index itself will  not
              become smaller.

       -E     does not run a check on file types.  Glimpse normally attempts to exclude non-text files, but this
              attempt is not always perfect.  With -E, glimpseindex indexes all files,  except  those  that  are
              specifically  excluded in .glimpse_exclude and those whose file names end with one of the excluded
              suffixes.

       -f     incremental indexing.  glimpseindex scans all files and adds to the index only  those  files  that
              were  created  or  modified after the current index was built.  If there is no current index or if
              this procedure fails, glimpseindex automatically reverts to the default mode (which  is  to  index
              everything from scratch).  This option may create an inefficient index for several reasons, one of
              which is that deleted files are not really deleted from the index.  Unless changes are small,
              mostly additions, and -o is used, we suggest using the default mode as much as possible.

       -F     Glimpseindex receives the list of files to index from standard input.

       -H directory
              Put  or  update the index and all other .glimpse files (listed below) in "directory".  The default
              is the home directory.  When glimpse is run, the -H option must be used to direct glimpse to  this
              directory, because glimpse assumes that the index is in the home directory (see also the -H option
              in glimpse).

       -i     Make .glimpse_include (SEE GLIMPSEINDEX FILES) take precedence over .glimpse_exclude, so that, for
              example, one can exclude everything (by putting *) and then explicitly include files.

       -I     Instead  of  indexing,  only show (print to standard out) the list of files that would be indexed.
              It is useful for filtering purposes.  ("glimpseindex -I dir | glimpseindex  -F"  is  the  same  as
              "glimpseindex dir".)

       -M x   Tells  glimpseindex  to  use  x  MB of memory for temporary tables.  The more memory you allow the
              faster glimpseindex will run.  The default is x=2.  The value of x must  be  a  positive  integer.
              Glimpseindex  will  need  more  memory  than x for other things, and glimpseindex may perform some
              'forks', so you'll have to experiment if you want to use this option.  WARNING: If x is too  large
              you may run out of swap space.

       -n     Index  numbers  as  well  as  text.   The  default  is  not to index numbers.  This is useful when
              searching for dates or other identifying numbers, but it may make the index very  large  if  there
              are  lots  of  numbers.   In  general, glimpseindex strips away any non-alphabetic character.  For
              example, the string abc123 will be indexed as abc if the -n option is not used and as abc123 if it
              is  used.   Glimpse provides warnings (in .glimpse_messages) for all files in which more than half
              the words that were added to the index from that file had digits in them (this is  an  attempt  to
              identify  data  files that should probably not be indexed).  One can use the .glimpse_exclude file
              to exclude data files or any other files.  (See GLIMPSEINDEX FILES.)

       -o     Build a small index rather than tiny (meaning 7-9% of the sizes of all files -  your  mileage  may
              vary)  allowing  faster search.  This option forces glimpseindex to allocate one block per file (a
              block usually contains many files).  A detailed explanation of how blocks affect  glimpse  can  be
              found in the glimpse article.  (See also LIMITATIONS.)

       -R     Recompute  .glimpse_filenames_index  from  .glimpse_filenames.   The file .glimpse_filenames_index
              speeds up processing.  Glimpseindex usually computes  it  automatically.   However,  if  for  some
              reason  one wants to change the path names of the files listed in .glimpse_filenames, then running
              glimpseindex -R recomputes .glimpse_filenames_index.  This is useful if the index is  computed  on
              one machine, but is used on another (with the same hierarchy).  The names of the files listed in
              .glimpse_filenames are used at run time, so they can be changed at any time in any way (as long
              as only the names, not the content, are changed).  This is not really an option in the regular
              sense; rather, it is a program by itself, and it is meant as a post-processing step.  (Available
              only from version 3.6.)

       -s     supports  structured  queries.   This  option  was  added to support the Harvest project and it is
              applicable mostly in that context.  See STRUCTURED QUERIES below for  more  information  and  also
              http://harvest.sourceforge.net/ for more information about the Harvest project.

       -S k   The  number  k determines the size of the stop-list.  The stop-list consists of words that are too
              common and are not indexed  (e.g.,  'the'  or  'and').   Instead  of  having  a  fixed  stop-list,
              glimpseindex  figures out the words that are too common for every index separately.  The rules are
              different for the different indexing options.  The tiny index contains all words (the savings from
              a stop-list are too small to bother).  For the small index (-o), the number k is a percentage
              threshold: a word will be in the stop-list if it appears in at least k% of all files.  The
              default value is 80%.  (If there are fewer than 256 files, then the stop-list is not maintained.)
              The medium index (-b) counts all occurrences of all words, and a word is added to the stop-list if
              it  appears  at  least k times per MByte.  The default value is 500.  A query that includes a stop
              list word is of course less efficient.  (See also LIMITATIONS below.)

       -t     (A new option in version 3.5.)  The order in which files are indexed is determined by scanning the
              directories, which is mostly arbitrary.  With the -t option, combined with either -o or -b, the
              indexed files are stored in reversed order of modification age (younger files first).  Results  of
              queries are then automatically returned in this order.  Furthermore, glimpse can filter results by
              age; for example, asking to look at only files that are at most 5 days old.

       -T     builds the turbo file.  Starting at version 3.0, this is the default, so using this option has  no
              effect.

       -w k   Glimpseindex  does  a  reasonable, but not a perfect, job of determining which files should not be
              indexed.  Sometimes a large text file should not be indexed; for example, a dictionary  may  match
              most  queries.   The -w option stores in a file called .glimpse_messages (in the same directory as
              the index) the list of all files that contribute at least k new words to the index.  The user  can
              look  at  this  list  of  files  and  decide  which  should  or  should  not be indexed.  The file
              .glimpse_exclude contains files that will not be indexed (see more below).  We recommend setting k
              to about 1000.  This is not an exact measure.  For example, if the same file appears twice, then
              the second copy will not contribute any new words to the dictionary (but if you exclude the  first
              copy and index again, the second copy will contribute).

       -X     (starting  at  version  4.0B1)  Extract titles from HTML pages and add the titles to the index (in
              .glimpse_filenames).  (This feature was added to improve the performance  of  WebGlimpse.)   Works
              only    on    files   whose   names   end   with   .html,   .htm,   .shtml,   and   .shtm.    (see
              glimpse.h/EXTRACT_INFO_SUFFIX to add to these suffixes.)  The routine to extract titles is  called
              extract_info,  in  index/filetype.c.  This feature can be modified in various ways to extract info
              from many filetypes.  The titles  are  appended  to  the  corresponding  filenames  with  a  space
              separator.  Glimpseindex assumes that filenames don't have spaces in them.

       -z     Allow customizable filtering, using the file .glimpse_filters to run the programs listed there
              on each matching file.  The best example is compress/decompress.  If .glimpse_filters includes the
              line
              *.Z   uncompress <
              (separated by tabs) then before indexing any file that matches the pattern "*.Z" (same  syntax  as
              the  one for .glimpse_exclude) the command listed is executed first (assuming input is from stdin,
              which is why uncompress needs <) and its output (assuming it goes to stdout) is indexed.  The file
              itself  is  not changed (i.e., it stays compressed).  Then if glimpse -z is used, the same program
              is used on these files on the fly.  Any program can be used (we run 'exec').  For example, one can
              filter  out parts of files that should not be indexed.  Glimpseindex tries to apply all filters in
              .glimpse_filters in the order they are given.  For example, if you want to uncompress a  file  and
              then extract some part of it, put the decompression command (the example above) first and then
              another line that specifies the extraction.  Note that this can slow down the search  because  the
              filters need to be run before files are searched.
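
       The tab-separated layout of .glimpse_filters described under -z can be sketched as follows.  INDEX_DIR
       is a hypothetical variable that defaults to a scratch directory here; the *.gz entry is an assumption
       modeled on the *.Z example (gunzip reads standard input and writes standard output when given no file
       names) and is not taken from the glimpse documentation.

```shell
# Write a two-entry .glimpse_filters file into an index directory.
# INDEX_DIR is a hypothetical location; it defaults to a scratch directory.
INDEX_DIR=${INDEX_DIR:-$(mktemp -d)}
{
    # Pattern, a tab, then the command that reads the file on stdin.
    printf '%s\t%s\n' '*.Z'  'uncompress <'
    # Assumed entry for gzip files, following the same shape.
    printf '%s\t%s\n' '*.gz' 'gunzip <'
} > "$INDEX_DIR/.glimpse_filters"

# Show the file with each tab made visible as a | character.
tr '\t' '|' < "$INDEX_DIR/.glimpse_filters"
```

       With such a file in place, the index would be built with glimpseindex -z -H "$INDEX_DIR" followed by the
       directories to index, and searched with glimpse -z -H "$INDEX_DIR".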

GLIMPSEINDEX FILES

       All files used by glimpse are located in the directory (or directories) where the indexes are stored and
       have .glimpse_ as a prefix.  The files .glimpse_exclude, .glimpse_filters, and .glimpse_include are
       optionally supplied by the user.  The other files are built and read by glimpse.

       .glimpse_exclude
              contains  a  list of files that glimpseindex is explicitly told to ignore.  In general, the syntax
              of .glimpse_exclude/include is the same as that of agrep (or any other grep).  The  lines  in  the
              .glimpse_exclude  file  are  matched to the file names, and if they match, the files are excluded.
              Notice that agrep matches to parts of the string!  e.g., agrep /ftp/pub will  match  /home/ftp/pub
              and  /ftp/pub/whatever.  So, if you want to exclude /ftp/pub/core, you just list it, as is, in the
              .glimpse_exclude file.  If you put "/home/ftp/pub/cdrom" in .glimpse_exclude, every file name that
              matches  that  string will be excluded, meaning all files below it.  You can use ^ to indicate the
              beginning of a file name, and $ to indicate the end of one, and you can use * and ? in  the  usual
              way.    For   example   /ftp/*html   will   exclude   /ftp/pub/foo.html,  but  will  also  exclude
              /home/ftp/pub/html/whatever;  if you want to exclude files that start with /ftp and end with  html
              use  ^/ftp*html$  Notice that putting a * at the beginning or at the end is redundant (in fact, in
              this case glimpseindex will remove the * when it does the indexing).  No other meta characters are
              allowed  in  .glimpse_exclude (e.g., don't use .* or # or |).  Lines with * or ? must have no more
              than 30 characters.  Notice that, although the index itself will not be indexed, the list of  file
              names (.glimpse_filenames) will be indexed unless it is explicitly listed in .glimpse_exclude.

       .glimpse_filters
              See the description above for the -z option.

       .glimpse_include
              contains  a list of files that glimpseindex is explicitly told to include in the index even though
              they may look like non-text files.  Symbolic links are followed by glimpseindex only if  they  are
              specifically  included  here.  The syntax is the same as the one for .glimpse_exclude (see there).
              If a file is in both .glimpse_exclude and .glimpse_include it will be excluded unless -i is used.

       .glimpse_filenames
              contains the list of all indexed file names, one per line.  This is an ASCII file that can also be
              used with agrep to search for a file name leading to a fast find command.  For example,
              glimpse 'count#\.c$' ~/.glimpse_filenames
              will  output  the  names  of  all  (indexed)  .c  files that have 'count' in their name (including
              anywhere on the path from the index).  Setting the following csh alias in the .login file may be
              useful (\!:1 stands for the first argument given to the alias):
              alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'

       .glimpse_index
              contains  the index.  The index consists of lines, each starting with a word followed by a list of
              block numbers (unless the -o or -b options are used, in which case each word  is  followed  by  an
              offset into the file .glimpse_partitions where all pointers are kept).  The block/file numbers are
              stored in binary form, so this is not an ASCII file.

       .glimpse_messages
              contains the output of the -w option (see above).

       .glimpse_partitions
              contains the partition of the indexed space into blocks and, when the index is built with  the  -o
              or  -b  options, some part of the index.  This file is used internally by glimpse and it is a non-
              ASCII file.

       .glimpse_statistics
              contains some statistics about the makeup of the index.  Useful for some advanced applications and
              customization of glimpse.
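
       The substring rule described for .glimpse_exclude can be illustrated with ordinary grep standing in for
       agrep.  This is only an approximation for plain strings and the ^ anchor; the sample paths are made up,
       and glimpseindex's own matcher (in particular its handling of * and ?) may differ in detail.

```shell
# grep stands in for agrep here; the paths below are invented examples.
paths='/home/ftp/pub/a.txt
/ftp/pub/whatever
/usr/ftp.txt'

# Unanchored: the pattern matches anywhere in the path.
unanchored=$(printf '%s\n' "$paths" | grep -c '/ftp/pub')
# Anchored with ^: the pattern matches only at the start of the path.
anchored=$(printf '%s\n' "$paths" | grep -c '^/ftp/pub')
echo "unanchored=$unanchored anchored=$anchored"
```

       The unanchored pattern matches two of the three paths, while the anchored one matches only
       /ftp/pub/whatever, which is why ^ is needed to exclude a tree without also excluding /home/ftp/pub.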

STRUCTURED QUERIES

       Glimpse  can  search for Boolean combinations of "attribute=value" terms by using the Harvest SOIF parser
       library (in glimpse/libtemplate).  To search this way, the index must be made by using the -s  option  of
       glimpseindex  (this  can  be  used  in  conjunction  with  other  glimpseindex  options). For glimpse and
       glimpseindex to recognize "structured" files, they must be in SOIF format. In this format, each value  is
       prefixed by an attribute-name with the size of the value (in bytes) present in "{}" after the name of the
       attribute.  For example, the following lines are part of an SOIF file:
       type{17}:       Directory-Listing
       md5{32}:        3858c73d68616df0ed58a44d306b12ba
       Any string can serve as an attribute name.  The query glimpse "pattern;type=Directory-Listing" will search for
       "pattern"  only  in  files  whose  type  is "Directory-Listing".  The file itself is considered to be one
       "object" and its name/url appears as the first attribute with an "@" prefix; e.g., @FILE {  http://xxx...
       } The scope of Boolean operations changes from records (lines) to whole files when structured queries are
       used in glimpse (since individual query terms can look at  different  attributes  and  they  may  not  be
       "covered"  by the record/line).  Note that glimpse can only search for patterns in the value parts of the
       SOIF file: there are some attributes (like the TTL, MD5, etc.) that are interpreted by Harvest's internal
       routines.  See RFC 2655 for more detailed information on the SOIF format.
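
       The size-in-braces convention can be checked with a small sketch that counts the bytes of a value and
       prints one attribute line.  The function soif_attr is hypothetical, and a colon followed by a tab is
       assumed as the separator between the size and the value.

```shell
# soif_attr NAME VALUE: print one attribute line with the byte count in {}.
# soif_attr is a hypothetical helper; the ":<tab>" separator is an assumption.
soif_attr() {
    # wc -c counts bytes; the arithmetic expansion strips any padding.
    len=$(($(printf '%s' "$2" | wc -c)))
    printf '%s{%s}:\t%s\n' "$1" "$len" "$2"
}

soif_attr type Directory-Listing
soif_attr md5 3858c73d68616df0ed58a44d306b12ba
```

       The two calls reproduce the sizes 17 and 32 shown in the example lines above.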

REFERENCES

       1.     U.  Manber  and S. Wu, "GLIMPSE: A Tool to Search Through Entire File Systems," Usenix Winter 1994
              Technical Conference (best paper award), San Francisco (January 1994), pp. 23-32.  Also, Technical
              Report  #TR  93-34,  Dept.  of Computer Science, University of Arizona, October 1993 (a postscript
              file is available by anonymous ftp at ftp://webglimpse.net/pub/glimpse/TR93-34.ps).

       2.     S. Wu and U. Manber, "Fast Text Searching Allowing Errors," Communications of the ACM 35  (October
              1992), pp. 83-91.

SEE ALSO

       agrep(1), ed(1), ex(1), glimpse(1), glimpseserver(1), grep(1V), sh(1), csh(1).

LIMITATIONS

       The  index  of  glimpse is word based.  A pattern that contains more than one word cannot be found in the
       index.  The way glimpse overcomes this weakness is by splitting any multi-word pattern into  its  set  of
       words  and  looking  for  all of them in the index.  For example, glimpse 'linear programming' will first
       consult the index to find all files containing both linear and programming, and then apply agrep to  find
       the  combined  pattern.   This  is usually an effective solution, but it can be slow for cases where both
       words are very common, but their combination is not.

       The index of glimpse stores all patterns in lower  case.   When  glimpse  searches  the  index  it  first
       converts  all  patterns  to  lower  case, finds the appropriate files, and then searches the actual files
       using the original patterns.  So, for example, glimpse ABCXYZ will first find all files containing abcxyz
       in any combination of lower and upper cases, and then search these files directly, so only the right
       cases will be found.  One problem with this approach is discovering misspellings that are caused by wrong
       cases.   For example, glimpse -B abcXYZ will first search the index for the best match to abcxyz (because
       the pattern is converted to lower case); it will find that there are matches with no errors, and will  go
       to  those  files  to search them directly, this time with the original upper cases.  If the closest match
       is, say AbcXYZ, glimpse may miss it, because it doesn't expect an error.  Another problem is  speed.   If
       you  search  for  "ATT", it will look at the index for "att".  Unless you use -w to match the whole word,
       glimpse may have to search all files containing, for example, "Seattle" which has "att" in it.

       There is no size limit for simple patterns and simple patterns with Boolean AND or OR.  More  complicated
       patterns  are  currently  limited  to approximately 30 characters.  Lines are limited to 1024 characters.
       Records are limited to 48K, and may be truncated if they are larger  than  that.   The  limit  of  record
       length can be changed by modifying the parameter Max_record in agrep.h.

       Each line in .glimpse_exclude or .glimpse_include that contains a * or a ? must not exceed 30 characters
       in length.

       Glimpseindex does not index words longer than 64 characters.

       A medium-size index (-b) may lead to actually slower query times if the files are all very small.

       Under -b, it may be impossible to make the stop list empty.  Glimpseindex uses the "sort" routine, and
       all occurrences of a word appear at some point on one line.  Sort limits the size of the lines it can
       handle (the value depends on the platform; ours is 16KB).  If the lines are too big, the word is added
       to the stop list.
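
       The stop-list thresholds described under -S can be worked through with the sketch below.  Both functions
       are hypothetical and only restate the documented rules with their default values of k; treating "per
       MByte" as per MB of indexed text is an assumption, and the line-length effect described in the previous
       paragraph is not modeled.

```shell
# is_stop_word_small FILES_WITH_WORD TOTAL_FILES [K_PERCENT]
# Rule for -o: stop word if it appears in at least k% of all files.
is_stop_word_small() {
    with=$1 total=$2 k=${3:-80}
    # No stop list is maintained below 256 files.
    [ "$total" -ge 256 ] || return 1
    [ $((with * 100)) -ge $((k * total)) ]
}

# is_stop_word_medium OCCURRENCES TOTAL_MB [K_PER_MB]
# Rule for -b: stop word if it occurs at least k times per MB.
is_stop_word_medium() {
    occ=$1 mb=$2 k=${3:-500}
    [ "$occ" -ge $((k * mb)) ]
}

if is_stop_word_small 900 1000; then echo "small: stop word"; fi
if is_stop_word_medium 120000 200; then echo "medium: stop word"; fi
```

       A word found in 900 of 1000 files (90%) passes the default 80% threshold, and 120000 occurrences in 200
       MB is 600 per MB, above the default of 500.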

BUGS

       Please submit bug reports or comments at http://webglimpse.net/bugzilla/

DIAGNOSTICS

       (Only in version 3.6 and above.)
       exit status 0: terminated normally;
       exit status 1: glimpseindex errors (e.g., bad option combos, no files were indexed, etc.)
       exit status 2: system errors (e.g., write failed, sort failed, malloc failed).
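
       A wrapper script might translate these exit codes into log messages, as in the sketch below.  The
       function describe_status is hypothetical; it covers only the three documented codes and reports anything
       else as unknown.

```shell
# describe_status CODE: map a glimpseindex exit status to a short message.
# describe_status is a hypothetical helper, not part of glimpse.
describe_status() {
    case $1 in
        0) echo "ok" ;;
        1) echo "glimpseindex error (bad options, nothing indexed, ...)" ;;
        2) echo "system error (write, sort, or malloc failed)" ;;
        *) echo "unknown status $1" ;;
    esac
}

# Typical use after a run:
#   glimpseindex -o "$HOME"; describe_status $?
describe_status 2
```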

AUTHORS

       Udi  Manber  and  Burra  Gopal,  Department  of  Computer Science, University of Arizona, and Sun Wu, the
       National Chung-Cheng University, Taiwan. Now maintained by  Golda  Velez  at  Internet  WorkShop  (Email:
       gvelez@webglimpse.net)

                                                November 10, 1997                                GLIMPSEINDEX(1)