Provided by: glimpse_4.18.7-6_amd64

NAME

       glimpseindex - index whole file systems to be searched by glimpse

OVERVIEW

       Glimpse  (which  stands  for  GLobal IMPlicit SEarch) is a popular UNIX indexing and query
       system that allows you to search through a large set of files very quickly.   Glimpseindex
       is  the  indexing program for glimpse.  Glimpse supports most of agrep's options (agrep is
       our powerful version of grep) including approximate  matching  (e.g.,  finding  misspelled
       words),  Boolean  queries, and even some limited forms of regular expressions.  It is used
       in the same way, except that you don't have to specify file names.  So, if you are looking
       for  a  needle  anywhere in your file system, all you have to do is say glimpse needle and
       all lines containing needle will appear preceded by the file name.  See  man  glimpse  for
       details on how to use glimpse.

       Glimpseindex  provides three indexing options: a tiny index (2-3% of the total size of all
       files), a small index (7-8%) and a medium-size index (20-30%).  Search times are  normally
       better with larger indexes (although unless files are quite large, the small index is just
       about as good as the medium one).  To index all your files, you  say  glimpseindex  ~  for
       tiny index (where ~ stands for the home directory), glimpseindex -o ~ for small index, and
       glimpseindex -b ~ for medium.
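
       As a rough illustration of the percentages above, the following sketch estimates the
       index size for a tree of a given total size.  The helper name estimate_index_kb is
       hypothetical and not part of glimpse; actual index sizes depend on the files indexed.

```shell
#!/bin/sh
# Rough index-size estimate from the percentages quoted above (illustrative only).
# estimate_index_kb is a hypothetical helper, not part of glimpse.
estimate_index_kb() {
    total_kb=$1
    echo "tiny (default): $((total_kb * 2 / 100))-$((total_kb * 3 / 100)) KB"
    echo "small (-o): $((total_kb * 7 / 100))-$((total_kb * 8 / 100)) KB"
    echo "medium (-b): $((total_kb * 20 / 100))-$((total_kb * 30 / 100)) KB"
}

# Example with a 200000 KB tree; for a real tree, du -sk "$HOME" | cut -f1
# gives the total in KB.
estimate_index_kb 200000
```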

       Please submit bug reports or comments at http://webglimpse.net/bugzilla/ .  To join the
       webglimpse mailing list, where glimpse discussion is also directed, mail
       majordomo@webglimpse.net with SUBSCRIBE WGUSERS in the message body.  An HTML version of
       these manual pages can be found at http://webglimpse.net/docs/glimpseindexhelp.html .
       Also see the glimpse home page at http://webglimpse.net/glimpse/ .

SYNOPSIS

       glimpseindex [ -abEfFiInostT -w number -dD filename(s) -H directory -M number -S number  ]
       directory_name[s]

INTRODUCTION

       Glimpseindex  builds  an  index of all text files in all the directories specified and all
       their subdirectories (recursively).  It is also possible to build several separate indexes
       (possibly even overlapping).  The simplest way to index your files is to say

       glimpseindex -o ~

       The  index  consists  of  several  files  (described in detail below), all with the prefix
       .glimpse_ stored in the user's home directory (unless  otherwise  specified  with  the  -H
       option).   Files  with  one  of the following suffixes are not indexed: ".o", ".gz", ".Z",
       ".z", ".hqx", ".zip", ".tar".  (Unless the -z option is used, see  below.)   In  addition,
       glimpseindex  attempts to determine whether a file is a text file and does not index files
       that it thinks are not text files.  Numbers are not indexed unless the -n option is  used.
       It  is possible to prevent specified files from being indexed by adding their names to the
       .glimpse_exclude file (described below).  The -o option builds a larger index than without
       it  (typically  about  7-8%  vs.  2-3% without -o) allowing for a faster search (1-5 times
       faster).  The -b builds an even larger index and allows an even faster search some of  the
       time  (-b  is  helpful  mostly  when  large  files  are present).  There is an incremental
       indexing option -f, which updates an existing index by determining which files  have  been
       created  or  modified  since  the  index  was built and adding them to the index (see -f).
       Glimpseindex is reasonably fast, taking about 20 minutes to index 15,000  files  of  about
       200MB (on a DEC Alpha 233) and 2-4 minutes to update an existing index. (Your mileage may
       vary.)  It is also possible to increment the index by  adding  a  specific  file  (the  -a
       option).

       Once an index is built, searching for a pattern is as easy as saying

       glimpse pattern

       (See man glimpse for all glimpse's options and features.)

A DETAILED DESCRIPTION OF GLIMPSEINDEX

       Glimpse  does  not  automatically index files.  You have to tell it to do it.  This can be
       done manually, but a better way is to set it to run every night.  It is  probably  a  good
       idea  to  run  glimpseindex manually for the first time to be sure it works properly.  The
       following is a simple script to run glimpseindex every night.  We assume that this  script
       is stored in a file called glimpse.script:

       glimpseindex -o -t -w 5000 ~ >& .glimpse_out
       at -m 0300 glimpse.script
       (It might be interesting to collect all the outputs of glimpseindex by changing >& to >>&
       so that the file .glimpse_out maintains a history.  In this case the file must be created
       before the first time >>& is used.  The >& syntax is for csh; if you use sh or ksh, write
       '> .glimpse_out 2>&1' instead of '>& .glimpse_out'.)
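
       For Bourne-style shells, the nightly job above might be wrapped as in the following
       sketch.  The names run_nightly_index and GLIMPSEINDEX_BIN are hypothetical and introduced
       only for illustration; the glimpseindex options are the ones used in the script above.

```shell
#!/bin/sh
# Bourne-shell sketch of the nightly job above. run_nightly_index and
# GLIMPSEINDEX_BIN are hypothetical names introduced for illustration.
run_nightly_index() {
    log_file=$1
    shift
    # Append stdout and stderr so the log keeps a history
    # (in sh, ">>" creates the file if it does not exist yet).
    "${GLIMPSEINDEX_BIN:-glimpseindex}" -o -t -w 5000 "$@" >>"$log_file" 2>&1
}

# Intended use (not run here; requires glimpseindex):
#   run_nightly_index "$HOME/.glimpse_out" "$HOME"
```

       Making the binary overridable is a design choice that lets the wrapper be exercised
       without touching a real index.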

       Glimpseindex   stores   the   names  of  all  the  files  that  it  indexed  in  the  file
       .glimpse_filenames.  Each file is listed by its full path name as obtained at the time the
       files  were  indexed.   For example, /usr1/udi/file1.  Glimpse uses this full name when it
       performs the search, so the name must match the current name.  This may become  a  problem
       when  the  indexing  and  the search are done from different machines (e.g., through NFS),
       which   may    cause    the    path    names    to    be    different.     For    example,
       /tmp_mnt/R/xxx/xxx/usr1/udi/file1.   (The  same  is true for several other .glimpse files.
       See below.)
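
       One way to notice such a path mismatch is to check whether the listed names still exist
       on the machine doing the search.  The sketch below is a hypothetical helper, not part of
       glimpse; it assumes one path per line, skips lines that do not start with "/", and
       assumes the index was built without -X (which appends titles to the names).

```shell
#!/bin/sh
# Hypothetical check: report how many absolute paths in a file list do not
# exist on this machine. Assumes one path per line; lines that do not start
# with "/" are skipped.
count_missing_paths() {
    list_file=$1
    missing=0
    while IFS= read -r path; do
        case $path in
            /*) [ -e "$path" ] || missing=$((missing + 1)) ;;
        esac
    done <"$list_file"
    echo "$missing"
}

# Intended use (not run here):
#   count_missing_paths "$HOME/.glimpse_filenames"
```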

       Glimpseindex does not follow symbolic links unless they are  explicitly  included  in  the
       .glimpse_include file (described below).

       Glimpseindex  makes  an effort to identify non-text files such as binary files, compressed
       files,  uuencoded  files,  postscript  files,  binhex  files,  etc.    These   files   are
       automatically not indexed.  In addition, all files whose names end with `.o', `.gz', `.Z',
       `.z', `.hqx', `.zip', or `.tar' will not be indexed (unless they are specifically included
       in .glimpse_include - see below).

       The options for glimpseindex are as follows:

       -a     adds  the  given  file[s]  and/or  directories  to  an  existing  index.  Any given
              directory will be traversed recursively and all files will be indexed (unless  they
              appear in .glimpse_exclude; see below).  Using this option is generally much faster
              than indexing everything from scratch, although in rare cases the index may not  be
              as  good.   If  for some reason the index is full (which can happen unless -o or -b
              are used) glimpseindex -a will produce an  error  message  and  will  exit  without
              changing the original index.

       -b     builds  a  medium-size  index  (20-30%  of  the size of all files), allowing faster
              search.  This option forces glimpseindex to store an exact (byte level) pointer  to
              each  occurrence  of  each word (except for some very common words belonging to the
              stop list).

       -B     uses a hash table that is 4 times bigger (256K entries instead of 64K) to speed  up
              indexing.   The memory usage will increase typically by about 2 MB.  This option is
              only for indexing speed; it does not affect the final index.

       -d filename(s)
              deletes the given file(s) from the index.

       -D filename(s)
              deletes the given file(s) from the list of file names,  but  not  from  the  index.
              This  is  much  faster  than  -d,  and  the  file(s)  will not be found by glimpse.
              However, the index itself will not become smaller.

       -E     does not run a check on file types.  Glimpse normally attempts to exclude  non-text
              files,  but  this attempt is not always perfect.  With -E, glimpseindex indexes all
              files, except those that are specifically excluded in  .glimpse_exclude  and  those
              whose file names end with one of the excluded suffixes.

       -f     incremental  indexing.   glimpseindex  scans  all  files and adds to the index only
              those files that were created or modified after the current index  was  built.   If
              there  is  no  current index or if this procedure fails, glimpseindex automatically
              reverts to the default mode (which is to  index  everything  from  scratch).   This
              option  may  create  an inefficient index for several reasons, one of which is that
              deleted files are not really deleted from the index.   Unless  changes  are  small,
              mostly additions, and -o is used, we suggest using the default mode as much as
              possible.

       -F     Glimpseindex receives the list of files to index from standard input.

       -H directory
              Put or update the index and all other .glimpse files (listed below) in "directory".
              The default is the home directory.  When glimpse is run, the -H option must be used
              to direct glimpse to this directory, because glimpse assumes that the index  is  in
              the home directory (see also the -H option in glimpse).

       -i     Make    .glimpse_include    (SEE   GLIMPSEINDEX   FILES)   take   precedence   over
              .glimpse_exclude, so that, for example, one can exclude everything (by  putting  *)
              and then explicitly include files.

       -I     Instead of indexing, only show (print to standard out) the list of files that would
              be indexed.  It  is  useful  for  filtering  purposes.   ("glimpseindex  -I  dir  |
              glimpseindex -F" is the same as "glimpseindex dir".)

       -M x   Tells glimpseindex to use x MB of memory for temporary tables.  The more memory you
              allow the faster glimpseindex will run.  The default is x=2.  The value of  x  must
              be a positive integer.  Glimpseindex will need more memory than x for other things,
              and glimpseindex may perform some 'forks', so you'll have to experiment if you want
              to use this option.  WARNING: If x is too large you may run out of swap space.

       -n     Index  numbers  as  well  as  text.   The default is not to index numbers.  This is
              useful when searching for dates or other identifying numbers, but it may  make  the
              index  very  large  if  there are lots of numbers.  In general, glimpseindex strips
              away any non-alphabetic character.  For example, the string abc123 will be  indexed
              as  abc if the -n option is not used and as abc123 if it is used.  Glimpse provides
              warnings (in .glimpse_messages) for all files in which more  than  half  the  words
              that  were added to the index from that file had digits in them (this is an attempt
              to identify data files that should probably not  be  indexed).   One  can  use  the
              .glimpse_exclude  file to exclude data files or any other files.  (See GLIMPSEINDEX
              FILES.)

       -o     Build a small index rather than tiny (meaning 7-9% of the sizes of all files - your
              mileage  may  vary)  allowing  faster  search.   This option forces glimpseindex to
              allocate one block per file (a block usually  contains  many  files).   A  detailed
              explanation of how blocks affect glimpse can be found in the glimpse article.  (See
              also LIMITATIONS.)

       -R     Recompute   .glimpse_filenames_index    from    .glimpse_filenames.     The    file
              .glimpse_filenames_index  speeds  up  processing.  Glimpseindex usually computes it
              automatically.  However, if for some reason one wants to change the path  names  of
              the  files  listed  in  .glimpse_filenames, then running glimpseindex -R recomputes
              .glimpse_filenames_index.  This is useful if the index is computed on one  machine,
              but is used on another (with the same hierarchy).  The names of the files listed in
              .glimpse_filenames are used in runtime, so changing them can be done at any time in
              any way (as long as only the names, not the content, are changed).  This is not really
              an option in the regular sense;  rather, it is a program by itself, and it is meant
              as a post-processing step.  (Available only from version 3.6.)

       -s     supports  structured queries.  This option was added to support the Harvest project
              and it is applicable mostly in that context.  See STRUCTURED QUERIES below for more
              information and also http://harvest.sourceforge.net/ for more information about the
              Harvest project.

       -S k   The number k determines the size of the stop-list.  The stop-list consists of words
              that  are too common and are not indexed (e.g., 'the' or 'and').  Instead of having
              a fixed stop-list, glimpseindex figures out the words that are too common for every
              index separately.  The rules are different for the different indexing options.  The
              tiny index contains all words (the savings  from  a  stop-list  are  too  small  to
              bother).  For the small index (-o), the number k is a percentage threshold: a word
              is put in the stop-list if it appears in at least k% of all files.  The default
              value is 80%.  (If there are fewer than 256 files, the stop-list is not
              maintained.)  The medium index (-b) counts all occurrences of all words, and a word
              is  added  to  the stop-list if it appears at least k times per MByte.  The default
              value is 500.  A query that includes a stop list word is of course less  efficient.
              (See also LIMITATIONS below.)

       -t     (A  new option in version 3.5.)  The order in which files are indexed is determined
              by scanning the directories, which  is  mostly  arbitrary.   With  the  -t  option,
              combined with either -o or -b, the indexed files are stored in reversed order of
              modification age (younger files first).  Results of queries are then  automatically
              returned  in  this  order.   Furthermore,  glimpse  can  filter results by age; for
              example, asking to look at only files that are at most 5 days old.

       -T     builds the turbo file.  Starting at version 3.0, this is the default, so using this
              option has no effect.

       -w k   Glimpseindex  does  a reasonable, but not a perfect, job of determining which files
              should not be indexed.  Sometimes a large text file  should  not  be  indexed;  for
              example,  a  dictionary  may  match  most  queries.  The -w option stores in a file
              called .glimpse_messages (in the same directory as the index) the list of all files
              that  contribute at least k new words to the index.  The user can look at this list
              of  files  and  decide  which  should  or  should  not  be   indexed.    The   file
              .glimpse_exclude contains files that will not be indexed (see more below).  We
              recommend setting k to about 1000.  This is not an exact measure.  For example, if
              the same file appears twice, then the second copy will not contribute any new words
              to the dictionary (but if you exclude the first copy and index  again,  the  second
              copy will contribute).

       -X     (starting  at  version  4.0B1) Extract titles from HTML pages and add the titles to
              the index  (in  .glimpse_filenames).   (This  feature  was  added  to  improve  the
              performance  of WebGlimpse.)  Works only on files whose names end with .html, .htm,
              .shtml, and .shtm.  (see glimpse.h/EXTRACT_INFO_SUFFIX to add to  these  suffixes.)
              The  routine  to  extract titles is called extract_info, in index/filetype.c.  This
              feature can be modified in various ways to extract info from many  filetypes.   The
              titles  are  appended  to  the  corresponding  filenames  with  a  space separator.
              Glimpseindex assumes that filenames don't have spaces in them.

       -z     Allow customizable filtering, using the file .glimpse_filters to run the programs
              listed there on each matching file.  The best example is compress/decompress.  If
              .glimpse_filters includes the line
              *.Z   uncompress <
              (separated by tabs) then before indexing any file that matches  the  pattern  "*.Z"
              (same  syntax as the one for .glimpse_exclude) the command listed is executed first
              (assuming input is from stdin, which is why uncompress  needs  <)  and  its  output
              (assuming  it goes to stdout) is indexed.  The file itself is not changed (i.e., it
              stays compressed).  Then if glimpse -z is used, the same program is used  on  these
              files  on  the fly.  Any program can be used (we run 'exec').  For example, one can
              filter out parts of files that should not be indexed.  Glimpseindex tries to  apply
              all  filters  in .glimpse_filters in the order they are given.  For example, if you
              want to uncompress a file and then extract some part of it, put the decompression
              command (the example above) first and then another line that specifies the
              extraction.  Note that this can slow down the search because the filters need to be
              run before files are searched.
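
       The -I and -F options described above can be combined into a filtering pipeline.  The
       following is a minimal sketch: drop_unwanted and its two patterns are hypothetical, and
       it assumes that -I prints one file name per line.

```shell
#!/bin/sh
# drop_unwanted is a hypothetical filter: it removes log files and anything
# under a "tmp" directory from a newline-separated file list on stdin.
drop_unwanted() {
    grep -v -e '\.log$' -e '/tmp/'
}

# Intended use (not run here; requires glimpseindex):
#   glimpseindex -I dir | drop_unwanted | glimpseindex -F
```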

GLIMPSEINDEX FILES

       All  files  used by glimpse are located at the directory(ies) where the index(es) is (are)
       stored and have .glimpse_  as  a  prefix.   The  first  two  files  (.glimpse_exclude  and
       .glimpse_include) are optionally supplied by the user.  The other files are built and read
       by glimpse.

       .glimpse_exclude
              contains a list of files that  glimpseindex  is  explicitly  told  to  ignore.   In
              general,  the  syntax  of .glimpse_exclude/include is the same as that of agrep (or
              any other grep).  The lines in the .glimpse_exclude file are matched  to  the  file
              names,  and  if  they  match, the files are excluded.  Notice that agrep matches to
              parts  of  the  string!   e.g.,  agrep  /ftp/pub  will  match   /home/ftp/pub   and
              /ftp/pub/whatever.   So, if you want to exclude /ftp/pub/core, you just list it, as
              is,  in  the  .glimpse_exclude  file.   If   you   put   "/home/ftp/pub/cdrom"   in
              .glimpse_exclude,  every  file  name  that  matches  that  string will be excluded,
              meaning all files below it.  You can use ^ to indicate  the  beginning  of  a  file
              name,  and  $ to indicate the end of one, and you can use * and ? in the usual way.
              For example /ftp/*html  will  exclude  /ftp/pub/foo.html,  but  will  also  exclude
              /home/ftp/pub/html/whatever;  if you want to exclude files that start with /ftp and
              end with html use ^/ftp*html$.  Notice that putting a * at the beginning or at the
              end is redundant (in fact, in this case glimpseindex will remove the * when it does
              the indexing).  No other meta characters are  allowed  in  .glimpse_exclude  (e.g.,
              don't  use  .* or # or |).  Lines with * or ? must have no more than 30 characters.
              Notice that, although the index itself will not be indexed, the list of file  names
              (.glimpse_filenames)   will   be   indexed   unless  it  is  explicitly  listed  in
              .glimpse_exclude.

       .glimpse_filters
              See the description above for the -z option.

       .glimpse_include
              contains a list of files that glimpseindex is explicitly told  to  include  in  the
              index  even  though they may look like non-text files.  Symbolic links are followed
              by glimpseindex only if they are specifically included here.   The  syntax  is  the
              same  as  the  one  for  .glimpse_exclude  (see  there).   If  a  file  is  in both
              .glimpse_exclude and .glimpse_include it will be excluded unless -i is used.

       .glimpse_filenames
              contains the list of all indexed file names, one per line.  This is an  ASCII  file
              that  can  also be used with agrep to search for a file name leading to a fast find
              command.  For example,
              glimpse 'count#\.c$' ~/.glimpse_filenames
              will output the names of all (indexed) .c files that have  'count'  in  their  name
              (including  anywhere  on  the path from the index).  Setting the following alias in
              the .login file may be useful:
              alias findfile 'glimpse -h :1 ~/.glimpse_filenames'

       .glimpse_index
              contains the index.  The index  consists  of  lines,  each  starting  with  a  word
              followed by a list of block numbers (unless the -o or -b options are used, in which
              case each word is followed by an offset into the file .glimpse_partitions where all
              pointers  are  kept).  The block/file numbers are stored in binary form, so this is
              not an ASCII file.

       .glimpse_messages
              contains the output of the -w option (see above).

       .glimpse_partitions
              contains the partition of the indexed space into blocks  and,  when  the  index  is
              built  with  the  -o  or  -b  options,  some  part of the index.  This file is used
              internally by glimpse and it is a non-ASCII file.

       .glimpse_statistics
              contains some statistics about the makeup of the index.  Useful for  some  advanced
              applications and customization of glimpse.
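
       As an illustration of the user-supplied files described above, the following sketch
       writes a .glimpse_exclude and a .glimpse_filters file into an index directory.  The
       helper name write_glimpse_config is hypothetical; the patterns and the filter line are
       the examples given in this manual page.

```shell
#!/bin/sh
# write_glimpse_config is a hypothetical helper that seeds two user-supplied
# files in an index directory (the directory that would be passed to -H).
write_glimpse_config() {
    index_dir=$1
    # One pattern per line; patterns match parts of file names, so the first
    # line excludes everything below /home/ftp/pub/cdrom.
    printf '%s\n' '/home/ftp/pub/cdrom' '^/ftp*html$' >"$index_dir/.glimpse_exclude"
    # Pattern and command separated by a tab, as described for the -z option.
    printf '*.Z\tuncompress <\n' >"$index_dir/.glimpse_filters"
}

# Intended use (not run here):
#   write_glimpse_config "$HOME"
```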

STRUCTURED QUERIES

       Glimpse  can  search  for  Boolean  combinations  of  "attribute=value" terms by using the
       Harvest SOIF parser library (in glimpse/libtemplate).  To search this way, the index  must
       be made by using the -s option of glimpseindex (this can be used in conjunction with other
       glimpseindex options). For glimpse and glimpseindex to recognize "structured" files,  they
       must  be  in SOIF format. In this format, each value is prefixed by an attribute-name with
       the size of the value (in bytes) present in "{}" after the name  of  the  attribute.   For
       example, the following lines are part of an SOIF file:
       type{17}:       Directory-Listing
       md5{32}:        3858c73d68616df0ed58a44d306b12ba
       Any  string can serve as an attribute name.  Glimpse "pattern;type=Directory-Listing" will
       search for "pattern" only in files whose type is "Directory-Listing".  The file itself  is
       considered  to be one "object" and its name/url appears as the first attribute with an "@"
       prefix; e.g., @FILE { http://xxx... }.  The scope of Boolean operations changes from records
       (lines) to whole files when structured queries are used in glimpse (since individual query
       terms can look at different attributes and they may not be "covered" by the  record/line).
       Note  that glimpse can only search for patterns in the value parts of the SOIF file: there
       are some attributes (like the TTL, MD5, etc.) that are interpreted by  Harvest's  internal
       routines.  See RFC 2655 for more detailed information on the SOIF format.
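
       The size field in "{}" can be computed mechanically.  The sketch below prints one
       attribute line in the style of the example above; soif_attr is a hypothetical helper,
       and the tab between the colon and the value is an assumption about the separator.

```shell
#!/bin/sh
# soif_attr is a hypothetical helper that prints one SOIF-style attribute
# line, computing the byte count of the value for the "{}" field.
soif_attr() {
    name=$1
    value=$2
    # wc -c counts bytes; the arithmetic expansion strips any padding
    # that some wc implementations add around the number.
    size=$(($(printf '%s' "$value" | wc -c)))
    printf '%s{%d}:\t%s\n' "$name" "$size" "$value"
}

soif_attr type Directory-Listing
soif_attr md5 3858c73d68616df0ed58a44d306b12ba
```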

REFERENCES

       1.     U.  Manber  and  S.  Wu,  "GLIMPSE:  A Tool to Search Through Entire File Systems,"
              Usenix Winter 1994 Technical Conference (best paper award), San Francisco  (January
              1994),  pp.  23-32.   Also,  Technical Report #TR 93-34, Dept. of Computer Science,
              University of Arizona, October 1993 (a postscript file is  available  by  anonymous
              ftp at ftp://webglimpse.net/pub/glimpse/TR93-34.ps).

       2.     S.  Wu  and U. Manber, "Fast Text Searching Allowing Errors," Communications of the
              ACM 35 (October 1992), pp. 83-91.

SEE ALSO

       agrep(1), ed(1), ex(1), glimpse(1), glimpseserver(1), grep(1V), sh(1), csh(1).

LIMITATIONS

       The index of glimpse is word based.  A pattern that contains more than one word cannot  be
       found  in  the  index.  The way glimpse overcomes this weakness is by splitting any multi-
       word pattern into its set of words and looking for all of them in the index.  For example,
       glimpse  'linear  programming'  will  first consult the index to find all files containing
       both linear and programming, and then apply agrep to find the combined pattern.   This  is
       usually  an  effective  solution,  but  it can be slow for cases where both words are very
       common, but their combination is not.

       The index of glimpse stores all patterns in lower case.  When glimpse searches  the  index
       it  first  converts  all  patterns  to  lower  case, finds the appropriate files, and then
       searches the actual files using the original patterns.  So, for  example,  glimpse  ABCXYZ
       will  first  find all files containing abcxyz in any combination of lower and upper cases,
       and then search these files directly, so only the right cases will be found.  One
       problem  with  this  approach  is discovering misspellings that are caused by wrong cases.
       For example, glimpse -B abcXYZ will first search the index for the best  match  to  abcxyz
       (because the pattern is converted to lower case); it will find that there are matches with
       no errors, and will go to those files to search them directly, this time with the original
       upper cases.  If the closest match is, say AbcXYZ, glimpse may miss it, because it doesn't
       expect an error.  Another problem is speed.  If you search for "ATT", it will look at  the
       index  for  "att".   Unless you use -w to match the whole word, glimpse may have to search
       all files containing, for example, "Seattle" which has "att" in it.

       There is no size limit for simple patterns and simple patterns with  Boolean  AND  or  OR.
       More complicated patterns are currently limited to approximately 30 characters.  Lines are
       limited to 1024 characters.  Records are limited to 48K, and may be truncated if they  are
       larger  than  that.   The limit of record length can be changed by modifying the parameter
       Max_record in agrep.h.

       Each line in .glimpse_exclude or .glimpse_include that contains a * or a ? must not exceed
       30 characters in length.
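
       A pattern file can be checked against this limit before indexing.  The following is a
       minimal sketch; long_wildcard_lines is a hypothetical helper, not part of glimpse.

```shell
#!/bin/sh
# long_wildcard_lines is a hypothetical check: it prints every line of a
# pattern file that contains * or ? and is longer than 30 characters.
long_wildcard_lines() {
    awk 'length($0) > 30 && /[*?]/' "$1"
}

# Intended use (not run here):
#   long_wildcard_lines "$HOME/.glimpse_exclude"
```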

       Glimpseindex does not index words of size > 64.

       A medium-size index (-b) may lead to actually slower query times if the files are all very
       small.

       Under -b, it may be impossible to make the stop list empty.  Glimpseindex uses the
       "sort" routine, and all occurrences of a word appear at some point on one line.  Sort
       limits the size of lines it can handle (the value depends on the platform; ours is
       16KB).  If the lines are too big, the word is added to the stop list.

BUGS

       Please submit bug reports or comments at http://webglimpse.net/bugzilla/

DIAGNOSTICS

       (Only in version 3.6 and above.)
       exit status 0: terminated normally;
       exit status 1: glimpseindex errors (e.g., bad option combos, no files were indexed, etc.)
       exit status 2: system errors (e.g., write failed, sort failed, malloc failed).
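
       A calling script can branch on these exit statuses.  The sketch below maps them to short
       messages; describe_index_status is a hypothetical helper, and the wording of the
       messages is illustrative.

```shell
#!/bin/sh
# describe_index_status is a hypothetical helper mapping the exit statuses
# listed above to short messages.
describe_index_status() {
    case $1 in
        0) echo "ok" ;;
        1) echo "glimpseindex error (bad options or no files indexed)" ;;
        2) echo "system error (write, sort, or malloc failed)" ;;
        *) echo "unexpected exit status $1" ;;
    esac
}

# Intended use (not run here; requires glimpseindex):
#   glimpseindex -o "$HOME"; describe_index_status $?
```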

AUTHORS

       Udi Manber and Burra Gopal, Department of Computer Science, University of Arizona, and Sun
       Wu, the National Chung-Cheng University, Taiwan. Now maintained by Golda Velez at Internet
       WorkShop (Email:  gvelez@webglimpse.net)

                                        November 10, 1997                         GLIMPSEINDEX(1)