Provided by: glimpse_4.18.7-3build1_amd64

NAME

       glimpse - search quickly through entire file systems

OVERVIEW

       Glimpse  (which  stands for GLobal IMPlicit SEarch) is a very popular UNIX indexing and query system that
       allows you to search through a large set of files very quickly.  Glimpse supports most of agrep's options
       (agrep  is our powerful version of grep) including approximate matching (e.g., finding misspelled words),
       Boolean queries, and even some limited forms of regular expressions.  It is used in the same way,  except
       that  you  don't  have  to specify file names.  So, if you are looking for a needle anywhere in your file
       system, all you have to do is say glimpse needle and all lines containing needle will appear preceded  by
       the file name.

       To  use  glimpse  you  first  need to index your files with glimpseindex.  For example, glimpseindex -o ~
       will index everything at or below your home directory.  See man glimpseindex for more details.

       Glimpse is also available for web sites, as a set of tools called WebGlimpse.  (The old glimpseHTTP is no
       longer supported and is not recommended.)  See http://webglimpse.net/ for more information.

       Glimpse includes all of agrep and can be used instead of agrep by giving a file name(s) at the end of the
       command.  This will cause glimpse to ignore the index and run agrep as usual.  For  example,  glimpse  -1
       pattern  file  is  the  same  as agrep -1 pattern file.  Agrep is distributed as a self-contained package
       within glimpse, and can be used separately.  We added a new option to agrep:  -r searches recursively the
       directory  and  everything  below  it  (see agrep options below); it is used only when glimpse reverts to
       agrep.

       Mail majordomo@webglimpse.net with SUBSCRIBE wgusers in the body to be  added  to  the  Webglimpse  users
       mailing list.  This is now the location where glimpse questions are also discussed.  Bugs can be reported
       at http://webglimpse.net/bugzilla/ and an HTML version of these manual pages can be found at
       http://webglimpse.net/docs/glimpsehelp.html (see also the glimpse home pages at
       http://webglimpse.net/glimpse).

SYNOPSIS

       glimpse - [almost all letters] pattern

INTRODUCTION

       We start with simple ways to use glimpse and describe all the options in detail later on.  Once an  index
       is built, using glimpseindex, searching for pattern is as easy as saying

       glimpse pattern

       The  output  of  glimpse  is  similar to that of agrep (or any other grep).  The pattern can be any agrep
       legal pattern including a regular expression or a Boolean query (e.g., searching for Tucson  AND  Arizona
       is done by glimpse 'Tucson;Arizona').

       The speed of glimpse depends mainly on the number and sizes of the files that contain a match and only to
       a second degree on the total size of all indexed files.  If the pattern is reasonably uncommon, then  all
       matches  will  be  reported  in  a  few  seconds  even  if  the  indexed files total 500MB or more.  Some
       information on how glimpse works and a reference to a detailed article are given below.

       Most of agrep's (and other greps') options are supported, including approximate matching.  For example,

       glimpse -1 'Tuson;Arezona'

       will output all lines containing both patterns allowing one spelling error in any of the patterns (either
       insertion, deletion, or substitution), which in this case is definitely needed.
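
       As a rough illustration of what "one spelling error" means here, the following Python sketch computes
       the classic edit distance between two words.  It only illustrates how errors are counted; it is not
       how agrep itself searches, and the function name edit_distance is ours, not part of glimpse.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

print(edit_distance("Tuson", "Tucson"))     # one missing letter
print(edit_distance("Arezona", "Arizona"))  # one wrong letter
```

       Both misspellings are within one error of the intended word, which is why -1 is enough in the example
       above.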

       glimpse -w -i 'parent'

       specifies case-insensitive matching (-i) and matching on complete words (-w).  So 'Parent' and 'PARENT' will match,
       'parent/child' will match, but 'parenthesis' or 'parents' will not  match.   (Starting  at  version  3.0,
       glimpse  can be much faster when these two options are specified, especially for very large indexes.  You
       may want to set an alias especially for "glimpse -w -i".)
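
       The word-matching rule can be approximated with an ordinary regular expression.  The Python sketch
       below is only a model of the behavior described above (the helper matches_word is hypothetical and
       says nothing about how glimpse implements -w and -i internally).

```python
import re

def matches_word(pattern, line):
    """True if pattern occurs in line as a whole word, ignoring case."""
    # A word match must not be preceded or followed by a letter or digit.
    regex = r"(?<![A-Za-z0-9])" + re.escape(pattern) + r"(?![A-Za-z0-9])"
    return re.search(regex, line, flags=re.IGNORECASE) is not None

for text in ["Parent", "PARENT", "parent/child", "parenthesis", "parents"]:
    print(text, matches_word("parent", text))
```

       The first three strings match and the last two do not, mirroring the cases listed above.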

       The -F option provides a pattern that must match the file name.  For example,

       glimpse -F '\.c$' needle

       will find the pattern needle in all files whose name ends with .c.  (Glimpse will first check  its  index
       to  determine  which  files may contain the pattern and then run agrep on the file names to further limit
       the search.)  The -F option should not be put at the end after the main pattern (e.g., "glimpse needle -F
       hay" is incorrect).

A Detailed Description of All the Options of Glimpse

       -#     #  is  an integer between 1 and 8 specifying the maximum number of errors permitted in finding the
              approximate matches (the default is zero).  Generally, each insertion, deletion,  or  substitution
              counts  as  one  error.   It  is possible to adjust the relative cost of insertions, deletions and
              substitutions (see -I -D and -S options).  Since the index  stores  only  lower  case  characters,
              errors  of  substituting  upper  case  with  lower case may be missed (see LIMITATIONS).  Allowing
              errors in the match requires more time and can slow down the match by a factor of  2-4.   Be  very
              careful when specifying more than one error, as the number of matches tends to grow very quickly.

       -a     prints  attribute  names.   This  option  applies  only to Harvest SOIF structured data (used with
              glimpseindex -s).  (See http://harvest.sourceforge.net/ for more  information  about  the  Harvest
              project.)

       -A     used for glimpse internals.

       -b     prints  the  byte  offset  (from  the  beginning of the file) of the end of each match.  The first
              character in a file has offset 0.

       -B     Best match mode.  (Warning: -B sometimes misses matches.  It is safer to  specify  the  number  of
              errors explicitly.)  When -B is specified and no exact matches are found, glimpse will continue to
              search until the closest matches (i.e., the ones with minimum number  of  errors)  are  found,  at
              which  point  the  following message will be shown: "the best match contains x errors, there are y
              matches, output them? (y/n)" This message refers to the number of  matches  found  in  the  index.
              There  may  be  many more matches in the actual text (or there may be none if -F is used to filter
              files).  When the -#, -c, or -l options are specified, the -B option is ignored.  In  general,  -B
              may  be  slower than -#, but not by very much.  Since the index stores only lower case characters,
              errors of substituting upper case with lower case may be missed (see LIMITATIONS).

       -c     Display only the count of matching records.  Only files with count > 0 are displayed.

       -C     tells glimpse to send its queries to glimpseserver.

       -d 'delim'
              Define delim to be the separator between two records.  The default value is '$', namely  a  record
              is by default a line.  delim can be a string of size at most 8 (with possible use of ^ and $), but
              not a regular expression.  Text between two delim's, before the first delim, and  after  the  last
              delim  is  considered  as  one  record.  For example, -d '$$' defines paragraphs as records and -d
              '^From ' defines mail messages as records.  glimpse matches each record separately.   This  option
              does  not currently work with regular expressions.  The -d option is especially useful for Boolean
              AND queries, because the patterns need not appear in the same line but in the  same  record.   For
              example,  glimpse -F mail -d '^From ' 'glimpse;arizona;announcement' will output all mail messages
              (in their entirety) that have the 3 patterns anywhere in the message  (or  the  header),  assuming
              that  files  with 'mail' in their name contain mail messages.  If you want the scope of the record
              to be the whole file, use the -W option.  Glimpse warning: Use this  option  with  care.   If  the
              delimiter  is  set to match mail messages, for example, and glimpse finds the pattern in a regular
              file, it may not find the delimiter and will therefore output the whole file.  (The  -t  option  -
              see  below - can be used to put the delim at the end of the record.)  Performance Note: Agrep (and
              glimpse) resorts to more complex search when the -d option is used.   The  search  is  slower  and
              unfortunately no more than 32 characters can be used in the pattern.
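
              To make the idea of a record concrete, here is a small Python sketch that splits a mailbox-like
              text into records at lines starting with "From " and keeps the records containing all of the
              given words.  It is a toy model under our own assumptions: the delimiter is kept at the start
              of each record, matching is plain case-sensitive substring search, and the function names are
              hypothetical.

```python
import re

def split_records(text, start_marker):
    """Split text into records, each beginning at a line that starts
    with start_marker (text before the first marker is its own record)."""
    # Zero-width split just before every line that begins with the marker.
    parts = re.split(r"(?m)^(?=" + re.escape(start_marker) + ")", text)
    return [p for p in parts if p]

def records_with_all(text, start_marker, words):
    """Records that contain every word (a Boolean AND scoped to the record)."""
    return [r for r in split_records(text, start_marker)
            if all(w in r for w in words)]

mbox = (
    "From alice\nSubject: glimpse announcement\nReleased in arizona.\n"
    "From bob\nSubject: lunch\nNothing about glimpse here.\n"
)
hits = records_with_all(mbox, "From ", ["glimpse", "arizona", "announcement"])
print(len(hits))
```

              Only the first message is reported, even though its three words sit on different lines.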

       -Dk    Set  the  cost  of a deletion to k (k is a positive integer).  This option does not currently work
              with regular expressions.

       -e pattern
              Same as a simple pattern argument, but useful when the pattern begins with a `-'.

       -E     prints the lines in the index (as they appear in the index) which match the pattern.  Used  mostly
              for  debugging  and  maintenance  of  the  index.  This is not an option that a user needs to know
              about.

       -f file_name
              this option has a different meaning for agrep than for glimpse: In glimpse, only the  files  whose
              names   are   listed   in   file_name  are  matched.   (The  file  names  have  to  appear  as  in
              .glimpse_filenames.)  In agrep, the file_name contains the list of the patterns that are searched.
              (Starting at version 3.6, this option for glimpse is much faster for large files.)

       -F file_pattern
              limits the search to those files whose name (including the whole path) matches file_pattern.  This
              option can be used in a variety of applications to provide  limited  search  even  for  one  large
              index.  If file_pattern matches a directory, then all files with this directory on their path will
              be considered.  To limit the search to actual file names,  use  $  at  the  end  of  the  pattern.
              file_pattern  can  be a regular expression and even a Boolean pattern.  This option is implemented
              by running agrep file_pattern on the list of file  names  obtained  from  the  index.   Therefore,
              searching  the  index  itself  takes the same amount of time, but limiting the second phase of the
              search to only a few files can speed up the search significantly.  For example,

              glimpse -F 'src#\.c$' needle

              will search for needle in all .c files with src somewhere along the  path.   The  -F  file_pattern
              must  appear  before  the  search  pattern  (e.g., glimpse needle -F '\.c$' will not work).  It is
              possible to use some of agrep's options when matching file names.  In this  case  all  options  as
              well  as  the  file_pattern  should  be  in quotes.  (-B and -v do not work very well as part of a
              file_pattern.)  For example,

              glimpse -F '-1 \.html' pattern

              will allow one spelling error when matching .html to the file names (so ".htm" and  ".shtml"  will
              match as well).

              glimpse -F '-v \.c$' counter

              will search for 'counter' in all files except for .c files.
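
              The file-name phase can be pictured as a grep over the list of indexed path names.  The Python
              sketch below models that with the standard re module; filter_files and the sample paths are
              made up for illustration, and Python's regex syntax stands in for agrep's (so the wild card #
              is written as .* here).

```python
import re

def filter_files(paths, file_regex, invert=False):
    """Keep paths whose full name matches file_regex (or, with invert=True,
    those that do not), like a grep over the list of file names."""
    rx = re.compile(file_regex)
    return [p for p in paths if bool(rx.search(p)) != invert]

paths = ["/home/u/src/main.c", "/home/u/src/notes.txt", "/home/u/doc/old.c"]
# Counterpart of -F 'src#\.c$': src somewhere on the path, name ends in .c
print(filter_files(paths, r"src.*\.c$"))
# Counterpart of -F '-v \.c$': everything except .c files
print(filter_files(paths, r"\.c$", invert=True))
```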

       -g     prints the file number (its position in the .glimpse_filenames file) rather than its name.

       -G     Output the (whole) files that contain a match.

       -h     Do not display filenames.

       -H directory_name
              searches  for  the  index and the other .glimpse files in directory_name.  The default is the home
              directory.  This option is useful, for example, if several different indexes  are  maintained  for
              different archives (e.g., one for mail messages, one for source code, one for articles).

       -i     Case-insensitive search — e.g., "A" and "a" are considered equivalent.  Glimpse's index stores all
              patterns in lower case (see LIMITATIONS below).  Performance Note: When -i is used  together  with
              the  -w  option,  the  search  may  become  much  faster.   It is recommended to have -i and -w as
              defaults, for example, through an alias.  We use the following alias in our .cshrc file
              alias glwi 'glimpse -w -i'

       -Ik    Set the cost of an insertion to k (k is a positive integer).  This option does not currently  work
              with regular expressions.

       -j     If the index was constructed with the -t option, then -j will output the files' last modification
              dates in addition to everything else.  There are no major performance penalties for this option.

       -J host_name
              used in conjunction with glimpseserver (-C) to connect to one particular server.

       -k     No symbol in the pattern is treated as a meta character.  For example, glimpse -k 'a(b|c)*d'  will
              find  the  occurrences  of a(b|c)*d whereas glimpse 'a(b|c)*d' will find substrings that match the
              regular expression 'a(b|c)*d'.  (The only exception is ^ at the beginning of the pattern and $  at
              the  end  of  the pattern, which are still interpreted in the usual way.  Use \^ or \$ if you need
              them verbatim.)

       -K port_number
              used in conjunction with glimpseserver (-C) to connect to one particular server at  the  specified
              TCP port number.

       -l     Output only the names of files that contain a match.  This option differs from the -N option in that
              the files themselves are searched, but the matching lines are not shown.

       -L x | x:y | x:y:z
              if one number is given, it is a limit on the total number of matches.  Glimpse  outputs  only  the
              first  x  matches.   If  -l  is  used (i.e., only file names are sought), then the limit is on the
              number of files; otherwise, the limit is on the number of  records.   If  two  numbers  are  given
              (x:y), then y is an added limit on the total number of files.  If three numbers are given (x:y:z),
              then z is an added limit on the number of matches per file.  If any of the x, y, or z is set to 0,
              it  means  to  ignore  it  (in  other words 0 = infinity in this case);  for example, -L 0:10 will
              output all matches to the first 10 files that contain a match.  This option is particularly useful
              for servers that need to limit the amount of output provided to clients.
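
              The interaction of the three numbers can be sketched in Python as follows.  This is our reading
              of the description above, not glimpse's code: apply_limits is a hypothetical helper, matches
              are assumed to arrive grouped by file, and the -l variant (limit on files only) is not modeled.

```python
def apply_limits(matches, x=0, y=0, z=0):
    """matches: list of (file_name, record) pairs in output order.
    x caps total records, y caps distinct files, z caps records per file;
    0 means 'no limit'."""
    out, per_file = [], {}
    for name, record in matches:
        if x and len(out) >= x:
            break
        if name not in per_file:
            if y and len(per_file) >= y:
                continue  # file limit reached: skip records from new files
            per_file[name] = 0
        if z and per_file[name] >= z:
            continue      # this file already produced z records
        per_file[name] += 1
        out.append((name, record))
    return out

matches = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("c", 1)]
print(apply_limits(matches, 0, 2))     # like -L 0:2
print(apply_limits(matches, 3))        # like -L 3
print(apply_limits(matches, 0, 0, 1))  # like -L 0:0:1
```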

       -m     used for glimpse internals.

       -M     used for glimpse internals.

       -n     Each  matching  record  (line)  is  prefixed by its record (line) number in the file.  Performance
              Note: To compute the record/line number, agrep needs to search for all record delimiters (or  line
              breaks), which can slow down the search.

       -N     searches only the index (so the search is faster).  If the index was built with -o or -b, then the result is the
              number of files that have a potential match plus a prompt to ask if  you  want  to  see  the  file
              names.   (If  -y is used, then there is no prompt and the names of the files will be shown.)  This
              could be a way to get the matching file names without even having access to the files  themselves.
              However,  because  only the index is searched, some potential matches may not be real matches.  In
              other words, with -N you will not miss any file but you may get extra files.  For  example,  since
              the index stores everything in lower case, a case-sensitive query may match a file that has only a
              case-insensitive match.  Boolean queries may match a file that has all the keywords but not in the
              same  line  (indexing  with -b allows glimpse to figure out whether the keywords are close, but it
              cannot figure out from the index whether they are exactly on the same line or in the  same  record
              without looking at the file).  If the index was not built with -o or -b, then this option outputs
              the number of blocks matching the pattern.  This is useful as an indication of how long the search
              will  take.   All files are partitioned into usually 200-250 blocks.  The file .glimpse_statistics
              contains the total number of blocks (or glimpse -N a will give a pretty good estimate; only blocks
              with no occurrences of 'a' will be missed).

       -o     the  opposite  of -t: the delimiter is not output at the tail, but at the beginning of the matched
              record.

       -O     the file names are not printed before every matched record; instead, each filename is printed just
              once, and all the matched records within it are printed after it.

       -p     (from  version 4.0B1 only) Supports reading compressed set of filenames.  The -p option allows you
              to  utilize  compressed  `neighborhoods'  (sets  of  filenames)  to  limit  your  search,  without
              uncompressing them.  Added mostly for WebGlimpse.  The usage is:
              "-p  filename:X:Y:Z"  where  "filename"  is the file with compressed neighborhoods, X is an offset
              into that file (usually 0, must be a multiple of sizeof(int)), Y is the length glimpse must access
              from  that  file  (if  0, then whole file; must be a multiple of sizeof(int)), and Z must be 2 (it
              indicates that "filename" has the sparse-set representation of compressed neighborhoods: the other
              values are for internal use only).  Note that any colon ":" in filename must be escaped using a
              backslash (\).

       -P     used for glimpse internals.

       -q     prints the offsets of the beginning and end of each matched record.  The difference between -q and
              -b  is that -b prints the offsets of the actual matched string, while -q prints the offsets of the
              whole record where the match occurred.  The output format is  @x{y},  where  x  is  the  beginning
              offset and y is the end offset.

       -Q     when  used together with -N glimpse not only displays the filename where the match occurs, but the
              exact occurrences (offsets) as seen in the index.  This option is relevant only if the  index  was
              built  with  -b;   otherwise,  the offsets are not available in the index.  This option is ignored
              when not used with -N.

       -r     This option is an agrep option and it will be ignored in glimpse, unless glimpse is  used  with  a
              file  name  at  the end which makes it run as agrep.  If the file name is a directory name, the -r
              option will search (recursively) the whole directory and everything below it.  (The glimpse  index
              will not be used.)

       -R k   defines the maximum size (in bytes) of a record.  The maximum value (which is the default) is 48K.
              Defining the maximum to be lower than the default may speed up some searches.

       -s     Work silently, that is, display nothing except error messages.  This is useful  for  checking  the
              error status.

       -Sk    Set  the  cost  of  a substitution to k (k is a positive integer).  This option does not currently
              work with regular expressions.
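
              The effect of the -D, -I, and -S costs can be pictured with a weighted edit distance.  The
              Python sketch below is a generic textbook formulation, not agrep's implementation.  It assumes
              that a "deletion" is a pattern character missing from the text and an "insertion" is an extra
              character in the text; agrep's exact convention may differ.

```python
def weighted_distance(pattern, text, ins=1, dele=1, sub=1):
    """Cheapest way to turn pattern into text, where dropping a pattern
    character costs `dele`, adding a text character costs `ins`, and
    replacing one character with another costs `sub`."""
    prev = [j * ins for j in range(len(text) + 1)]
    for i, cp in enumerate(pattern, start=1):
        cur = [i * dele]
        for j, ct in enumerate(text, start=1):
            cur.append(min(
                prev[j] + dele,                          # drop cp
                cur[j - 1] + ins,                        # add ct
                prev[j - 1] + (0 if cp == ct else sub),  # keep or replace
            ))
        prev = cur
    return prev[-1]

print(weighted_distance("color", "colour"))         # one insertion
print(weighted_distance("color", "colour", ins=3))  # same edit, higher cost
print(weighted_distance("cat", "cut", sub=2))
```

              With a total error budget of 1, raising the insertion cost to 3 would rule out the
              color/colour match in this model.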

       -t     Similar to the -d option, except that the delimiter is assumed to appear at the end of the record.
              Glimpse  will  output the record starting from the end of delim to (and including) the next delim.
              (See warning for the -d option.)

       -T directory
              Use directory as a place where temporary files are built.  (Glimpse produces some small  temporary
              files usually in /tmp.)  This option is useful mainly in the context of structured queries for the
              Harvest project, where the temporary files may be non-trivial, and the /tmp directory may not have
              enough space for them.

       -U     (starting  at  version  4.0B1)  Interprets  an  index  created  with  the  -X  or the -U option in
              glimpseindex.  Useful mostly for WebGlimpse or similar web  applications.   When  glimpse  outputs
              matches, it will display the filename, the URL, and the title automatically.

       -v     (This  option  is an agrep option and it will be ignored in glimpse, unless glimpse is used with a
              file name at the end which makes it run as agrep.)  Output all records/lines that do not contain a
              match.  (Glimpse does not support the NOT operator yet.)

       -V     prints the current version of glimpse.

       -w     Search  for the pattern as a word — i.e., surrounded by non-alphanumeric characters.  For example,
              glimpse -w car will match car, but not  characters  and  not  car10.   The  non-alphanumeric  must
              surround  the  match;   they  cannot be counted as errors.  This option does not work with regular
              expressions.  Performance Note: When -w is used together with the -i option, the search may become
              much  faster.   The -w will not work with $, ^, and _ (see BUGS below).  It is recommended to have
              -i and -w as defaults, for example, through an alias.  We use the following alias  in  our  .cshrc
              file
              alias glwi 'glimpse -w -i'

       -W     The default for Boolean AND queries is that they cover one record (the default for a record is one
              line) at a time.  For example, glimpse 'good;bad' will output all lines containing both 'good' and
              'bad'.   The  -W option changes the scope of Booleans to be the whole file.  Within a file glimpse
              will output all matches to any of the patterns.  So, glimpse -W 'good;bad' will output  all  lines
              containing  'good'  or  'bad', but only in files that contain both patterns.  The NOT operator '~'
              can be used only with -W.  It is described later on.  The OR operator  is  essentially  unaffected
              (unless  it  is  in  combination  with the other Boolean operations).  For structured queries, the
              scope is always the whole attribute or file.
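
              The difference between the two scopes can be shown with a short Python sketch.  Both functions
              are hypothetical models of the behavior described above (plain substring matching, one record
              per line), not glimpse internals.

```python
def and_per_line(lines, words):
    """Default scope: a line is output only if it contains every word."""
    return [ln for ln in lines if all(w in ln for w in words)]

def and_per_file(lines, words):
    """-W-style scope: the file as a whole must contain every word;
    then every line containing any of the words is output."""
    if not all(any(w in ln for ln in lines) for w in words):
        return []
    return [ln for ln in lines if any(w in ln for w in words)]

doc = ["a good start", "a bad ending", "good and bad together", "neutral"]
print(and_per_line(doc, ["good", "bad"]))
print(and_per_file(doc, ["good", "bad"]))
```

              The first call reports only the line holding both words; the second reports all three lines
              that mention either word, because the file as a whole contains both.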

       -x     The pattern must match the whole line.  (This option  is  translated  to  -w  when  the  index  is
              searched and it is used only when the actual text is searched.  It is of limited use in glimpse.)

       -X     (from  version 4.0B1 only) Output the names of files that contain a match even if these files have
              been deleted since the index was built.  Without this option  glimpse  will  simply  ignore  these
              files.

       -y     Do not prompt.  Proceed with the match as if the answer to any prompt is y.  Servers (or any other
              scripts) using glimpse will probably want to use this option.

       -Y k   If the index was constructed with the -t option, then -Y k will output only matches to files that
              were created or modified within the last k days.  There are no major performance penalties for
              this option.

       -z     Allow customizable filtering, using the file .glimpse_filters to run the programs listed there
              for each match.  The best example is compress/decompress.  If .glimpse_filters includes the line
              *.Z   uncompress <
              (separated  by  tabs) then before indexing any file that matches the pattern "*.Z" (same syntax as
              the one for .glimpse_exclude) the command listed is executed first (assuming input is from  stdin,
              which is why uncompress needs <) and its output (assuming it goes to stdout) is indexed.  The file
              itself is not changed (i.e., it stays compressed).  Then if glimpse -z is used, the  same  program
              is used on these files on the fly.  Any program can be used (we run 'exec').  For example, one can
              filter out parts of files that should not be indexed.  Glimpseindex tries to apply all filters  in
              .glimpse_filters  in  the order they are given.  For example, if you want to uncompress a file and
              then extract some part of it, put the compression command  (the  example  above)  first  and  then
              another  line  that specifies the extraction.  Note that this can slow down the search because the
              filters need to be run before files are searched.  (See also glimpseindex.)

       -Z     No op.  (It's useful for glimpse's internals. Trust us.)

       The characters `$', `^', `*', `[', `]', `^', `|', `(', `)', `!', and `\'  can  cause  unexpected  results
       when  included  in  the  pattern,  as  these characters are also meaningful to the shell.  To avoid these
       problems, enclose the entire pattern in single quotes, i.e., 'pattern'.  Do not use double quotes (").

PATTERNS

       glimpse supports a large  variety  of  patterns,  including  simple  strings,  strings  with  classes  of
       characters, sets of strings, wild cards, and regular expressions (see LIMITATIONS).

       Strings
              Strings  are  any  sequence of characters, including the special symbols `^' for beginning of line
              and `$' for end of line.  The following special characters ( `$', `^', `*', `[',  `^',  `|',  `(',
              `)',  `!', and `\' ) as well as the following meta characters special to glimpse (and agrep): `;',
              `,', `#', `<', `>', `-', and `.', should be preceded by `\' if they are to be matched  as  regular
              characters.  For example, \^abc\\ corresponds to the string ^abc\, whereas ^abc corresponds to the
              string abc at the beginning of a line.

       Classes of characters
              a list of characters inside [] (in order)  corresponds  to  any  character  from  the  list.   For
              example,  [a-ho-z]  is any character between a and h or between o and z.  The symbol `^' inside []
              complements the list.  For example, [^i-n] denotes any character in the character set except
              the characters 'i' through 'n'.  The symbol `^' thus has two meanings, but this is consistent with egrep.
              The symbol `.' (don't care) stands for any symbol (except for the newline symbol).

       Boolean operations
              Glimpse supports an `AND' operation denoted by the symbol `;', an `OR' operation denoted by the
              symbol  `,',  a  limited  version  of a 'NOT' operation (starting at version 4.0B1) denoted by the
              symbol `~', or any combination.  For example, glimpse 'pizza;cheeseburger' will output  all  lines
              containing both patterns.  glimpse -F 'gnu;\.c$' 'define;DEFAULT' will output all lines containing
              both 'define' and 'DEFAULT' (anywhere in the line, not necessarily in order) in files  whose  name
              contains  'gnu'  and  ends  with .c.  glimpse '{political,computer};science' will match 'political
              science' or 'science of computers'.  The NOT operation works only together with the -W option and
              it generally applies only to the whole file rather than to individual records.  Its output may
              sometimes seem counterintuitive.  Use with care.  glimpse -W 'fame;~glory' will output  all  lines
              containing  'fame'  in  all files that contain 'fame' but do not contain 'glory'; This is the most
              common use of NOT, and in this case it works as expected.   glimpse  -W  '~{fame;glory}'  will  be
              limited to files that do not contain both words, and will output all lines containing one of them.

       Wild cards
              The  symbol  '#'  is used to denote a sequence of any number (including 0) of arbitrary characters
              (see LIMITATIONS).  The symbol # is equivalent to .* in egrep.  In fact, .* will work too, because
              it  is  a  valid  regular  expression  (see  below),  but unless this is part of an actual regular
              expression, # will work faster.  (Currently glimpse is experiencing some problems with #.)

       Combination of exact and approximate matching
              Any pattern inside angle brackets <> must match the text exactly even if the match is with errors.
              For  example,  <mathemat>ics matches mathematical with one error (replacing the last s with an a),
              but mathe<matics> does not match mathematical no matter how many errors are allowed.  (This option
              is buggy at the moment.)

       Regular expressions
              Since  the index is word based, a regular expression must match words that appear in the index for
              glimpse to find  it.   Glimpse  first  strips  the  regular  expression  from  all  non-alphabetic
              characters,  and  searches  the  index  for  all  remaining  words.   It  then applies the regular
              expression matching algorithm to the files found in the index.  For  example,  glimpse  'abc.*xyz'
              will  search  the  index for all files that contain both 'abc' and 'xyz', and then search directly
              for 'abc.*xyz' in those files.  (If you use glimpse -w  'abc.*xyz',  then  'abcxyz'  will  not  be
              found, because glimpse will think that abc and xyz need to match whole words.)  The syntax
              of regular expressions in glimpse is in general the same as that for agrep.  The  union  operation
              `|',  Kleene  closure  `*', and parentheses () are all supported.  Currently '+' is not supported.
              Regular expressions are currently limited to approximately 30 characters (generally excluding meta
              characters).   Some  options  (-d,  -w,  -t,  -x,  -D,  -I, -S) do not currently work with regular
              expressions.  The maximal number of errors for regular expressions that use '*' or '|' is 4.  (See
              LIMITATIONS.)
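
              The two-phase strategy described above (consult the word index first, then run the real regular
              expression only on candidate files) can be sketched in Python.  This is a toy model under our
              own assumptions: the index is a plain dictionary, lookups are exact whole-word lookups (closer
              to the -w behavior), and build_index and two_phase_search are hypothetical names.

```python
import re

def build_index(files):
    """Map each lower-cased word to the set of file names containing it."""
    index = {}
    for name, text in files.items():
        for word in re.findall(r"[A-Za-z]+", text.lower()):
            index.setdefault(word, set()).add(name)
    return index

def two_phase_search(files, index, regex):
    """Phase 1: use the words in the regex to pick candidate files from
    the index. Phase 2: run the real regex only on those candidates."""
    words = re.findall(r"[A-Za-z]+", regex.lower())
    candidates = set(files)
    for w in words:
        candidates &= index.get(w, set())
    rx = re.compile(regex)
    return sorted(n for n in candidates if rx.search(files[n]))

files = {
    "a.txt": "abc then xyz",
    "b.txt": "xyz comes before abc",
    "c.txt": "only abc here",
}
index = build_index(files)
print(two_phase_search(files, index, r"abc.*xyz"))
```

              Here c.txt is never opened in the second phase because the index shows it lacks 'xyz', while
              b.txt passes the index check but fails the actual regular expression.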

       Structured queries
              Glimpse  supports  some  form  of  structured queries using Harvest's SOIF format.  See STRUCTURED
              QUERIES below for details.

EXAMPLES

       (Run "glimpse '^glimpse' this-file" to get a list of all examples, some of which were given earlier.)

       glimpse -F 'haystack.h$' needle
              finds all needles in all haystack.h files.

       glimpse -2 -F html Anestesiology
              outputs all occurrences of Anestesiology with two errors in files with  html  somewhere  in  their
              full name.

       glimpse -l -F '\.c$' variablename
              lists the names of all .c files that contain variablename (the -l option lists file names rather
              than outputting the matched lines).

       glimpse -F 'mail;1993' 'windsurfing;Arizona'
              finds all lines containing windsurfing and Arizona in all files having `mail' and '1993' somewhere
              in their full name.

       glimpse -F mail 't.j@#uk'
              finds  all mail addresses (search only files with mail somewhere in their name) from the uk, where
              the login name ends with t.j, where the . stands for any one character.  (This is very  useful  to
              find a login name of someone whose middle name you don't know.)

       glimpse -F mbox -h -G  . > MBOX
              concatenates all files whose name matches `mbox' into one big one.

SEARCHING IN COMPRESSED FILES

       Glimpse  includes  an  optional new compression program, called cast, which allows glimpse (and agrep) to
       search the compressed files without having to decompress them.   The  search  is  actually  significantly
       faster  when  the  files are compressed.  However, we have not tested cast as thoroughly as we would have
       liked, and a mishap in a compression algorithm can cause loss of data, so for now we recommend using
       cast very carefully.  We do not support or maintain cast.  (Unless you specifically use cast, the
       default is to ignore it.)

GLIMPSEINDEX FILES

       All files used by glimpse are located in the directory or directories where the indexes are stored and have
       .glimpse_  as  a  prefix.   The  first  two  files (.glimpse_exclude and .glimpse_include) are optionally
       supplied by the user.  The other files are built and read by glimpse.

       .glimpse_exclude
              contains a list of files that glimpseindex is explicitly told to ignore.  In general,  the  syntax
              of  .glimpse_exclude/include  is  the same as that of agrep (or any other grep).  The lines in the
              .glimpse_exclude file are matched to the file names, and if they match, the  files  are  excluded.
              Notice that agrep matches parts of the string; e.g., agrep /ftp/pub will match /home/ftp/pub
              and /ftp/pub/whatever.  So, if you want to exclude /ftp/pub/core, you just list it, as is, in  the
              .glimpse_exclude file.  If you put "/home/ftp/pub/cdrom" in .glimpse_exclude, every file name that
              matches that string will be excluded, meaning all files below it.  You can use ^ to  indicate  the
              beginning  of  a file name, and $ to indicate the end of one, and you can use * and ? in the usual
              way.   For  example  /ftp/*html  will   exclude   /ftp/pub/foo.html,   but   will   also   exclude
              /home/ftp/pub/html/whatever;   if you want to exclude files that start with /ftp and end with html
              use ^/ftp*html$.  Notice that putting a * at the beginning or at the end is redundant (in fact, in
              this case glimpseindex will remove the * when it does the indexing).  No other meta characters are
              allowed in .glimpse_exclude (e.g., don't use .* or # or |).  Lines with * or ? must have  no  more
              than  30 characters.  Notice that, although the index itself will not be indexed, the list of file
              names (.glimpse_filenames) will be indexed unless it is explicitly listed in .glimpse_exclude.

       .glimpse_filters
              See the description above for the -z option.

       .glimpse_include
              contains a list of files that glimpseindex is explicitly told to include in the index even  though
              they  may  look like non-text files.  Symbolic links are followed by glimpseindex only if they are
              specifically included here.  If a file is in both .glimpse_exclude and .glimpse_include it will be
              excluded.

       .glimpse_filenames
              contains the list of all indexed file names, one per line.  This is an ASCII file that can also be
              used with agrep to search for a file name, which gives a fast find command.  For example,
              glimpse 'count#\.c$' ~/.glimpse_filenames
              will output the names of all (indexed) .c  files  that  have  'count'  in  their  name  (including
              anywhere  on  the  path  from  the  index).  Setting the following alias in the .login file may be
              useful:
              alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'

       .glimpse_index
              contains the index.  The index consists of lines, each starting with a word followed by a list  of
              block  numbers  (unless  the  -o or -b options are used, in which case each word is followed by an
              offset into the file .glimpse_partitions where all pointers are kept).  The block/file numbers are
              stored in binary form, so this is not an ASCII file.

       .glimpse_messages
              contains the output of the -w option (see above).

       .glimpse_partitions
              contains  the  partition of the indexed space into blocks and, when the index is built with the -o
              or -b options, some part of the index.  This file is used internally by glimpse and it is  a  non-
              ASCII file.

       .glimpse_statistics
              contains some statistics about the makeup of the index.  Useful for some advanced applications and
              customization of glimpse.

       .glimpse_turbo
              An added data structure (used under glimpseindex -o or -b only) that helps  to  speed  up  queries
              significantly for large indexes.  Its size is 0.25MB.  Glimpse will work without it if needed.
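
       The matching rules described for .glimpse_exclude above can be approximated in a few lines of Python.
       This is only a sketch of the documented behaviour, not glimpseindex's own code: the function names
       are hypothetical, and it assumes that * matches any run of characters (including /), that ? matches
       exactly one character, and that ^ and $ act as anchors only at the very start and end of a line.

```python
import re

def exclude_pattern_to_regex(line):
    """Translate one .glimpse_exclude line into a Python regex (approximation)."""
    parts = []
    for i, ch in enumerate(line):
        if ch == "*":
            parts.append(".*")           # assumed: * matches any run of characters
        elif ch == "?":
            parts.append(".")            # assumed: ? matches exactly one character
        elif ch == "^" and i == 0:
            parts.append("^")            # anchor at the beginning of the file name
        elif ch == "$" and i == len(line) - 1:
            parts.append("$")            # anchor at the end of the file name
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return re.compile("".join(parts))

def is_excluded(path, exclude_lines):
    """True if any exclude line matches some part of the path."""
    # An unanchored search mirrors "agrep matches parts of the string".
    return any(exclude_pattern_to_regex(line).search(path) for line in exclude_lines)
```

       With these assumptions, is_excluded("/home/ftp/pub/html/whatever", ["/ftp/*html"]) is true, while
       the anchored line ^/ftp*html$ leaves that path alone and still excludes /ftp/pub/foo.html.  Because
       the search is unanchored, a leading or trailing * adds nothing, which matches the remark above.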

STRUCTURED QUERIES

       Glimpse  can  search for Boolean combinations of "attribute=value" terms by using the Harvest SOIF parser
       library (in glimpse/libtemplate).  To search this way, the index must be made by using the -s  option  of
       glimpseindex  (this  can  be  used  in  conjunction  with  other  glimpseindex  options). For glimpse and
       glimpseindex to recognize "structured" files, they must be in SOIF format. In this format, each value  is
       prefixed by an attribute-name with the size of the value (in bytes) present in "{}" after the name of the
       attribute.  For example, the following lines are part of an SOIF file:
       type{17}:       Directory-Listing
       md5{32}:        3858c73d68616df0ed58a44d306b12ba
       Any string can serve as an attribute name.  The command glimpse "pattern;type=Directory-Listing" will search for
       "pattern"  only  in  files  whose  type  is "Directory-Listing".  The file itself is considered to be one
       "object" and its name/URL appears as the first attribute with an "@" prefix; e.g., @FILE { http://xxx... }.
       The scope of Boolean operations changes from records (lines) to whole files when structured queries are
       used in glimpse (since individual query terms can look at  different  attributes  and  they  may  not  be
       "covered"  by the record/line).  Note that glimpse can only search for patterns in the value parts of the
       SOIF file: there are some attributes (like the TTL, MD5, etc.) that are interpreted by Harvest's internal
       routines.  See RFC 2655 for more detailed information on the SOIF format.
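
       As an illustration of the "name{size}: value" layout described above, the following Python sketch
       formats and checks a single attribute line.  The function names are hypothetical, the tab after the
       colon is an assumption about the separator, and values that span several lines are not handled.

```python
def soif_attribute(name, value):
    """Format one attribute line as name{size-in-bytes}:<TAB>value."""
    size = len(value.encode("utf-8"))   # the size counts bytes, not characters
    return f"{name}{{{size}}}:\t{value}"

def parse_soif_attribute(line):
    """Split a single-line attribute into (name, value) and verify its size."""
    head, _, rest = line.partition(":")
    name, _, size_text = head.partition("{")
    size = int(size_text.rstrip("}"))
    value = rest.lstrip(" \t")          # accept tabs or spaces after the colon
    if len(value.encode("utf-8")) != size:
        raise ValueError(f"{name}: declared {size} bytes, found {len(value.encode('utf-8'))}")
    return name, value
```

       For the example above, soif_attribute("type", "Directory-Listing") produces a line whose declared
       size is 17, which is the byte length of the value.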

REFERENCES

       1.     U.  Manber  and S. Wu, "GLIMPSE: A Tool to Search Through Entire File Systems," Usenix Winter 1994
              Technical Conference (best paper award), San Francisco (January 1994), pp. 23-32.  Also, Technical
              Report  #TR  93-34,  Dept.  of Computer Science, University of Arizona, October 1993 (a postscript
              file is available by anonymous ftp at ftp://webglimpse.net/pub/glimpse/TR93-34.ps).

       2.     S. Wu and U. Manber, "Fast Text Searching Allowing Errors," Communications of the ACM 35  (October
              1992), pp. 83-91.

SEE ALSO

       agrep(1), ed(1), ex(1), glimpseindex(1), glimpseserver(1), grep(1), sh(1), csh(1).

LIMITATIONS

       The  index  of  glimpse is word based.  A pattern that contains more than one word cannot be found in the
       index.  The way glimpse overcomes this weakness is by splitting any multi-word pattern into  its  set  of
       words  and  looking  for  all of them in the index.  For example, glimpse 'linear programming' will first
       consult the index to find all files containing both linear and programming, and then apply agrep to  find
       the  combined  pattern.   This  is usually an effective solution, but it can be slow for cases where both
       words are very common, but their combination is not.
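
       The two-stage lookup for multi-word patterns can be pictured with a toy word index in Python.  This
       is a simplified sketch, not glimpse's implementation: the real index is block based, and every name
       below (build_index, candidate_files, phrase_search, the sample files) is made up for illustration.

```python
def build_index(contents):
    """Map each lower-cased word to the set of files that contain it."""
    index = {}
    for name, text in contents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(name)
    return index

def candidate_files(index, pattern):
    """Files that contain every word of the pattern, according to the index."""
    words = pattern.lower().split()
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))

def phrase_search(index, contents, pattern):
    """Scan only the candidate files for lines containing the whole pattern."""
    hits = []
    for name in sorted(candidate_files(index, pattern)):
        for line in contents[name].splitlines():
            if pattern in line:
                hits.append((name, line))
    return hits

contents = {
    "a.txt": "linear programming is fun\n",
    "b.txt": "linear algebra\nprogramming in C\n",
    "c.txt": "nonlinear optics\n",
}
index = build_index(contents)
```

       Here both a.txt and b.txt are candidates for 'linear programming', but only a.txt contains the
       combined pattern.  The scan of b.txt is wasted work, which is the slow case described above when
       both words are common and their combination is not.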

       As was mentioned in the section on PATTERNS above, some characters serve as meta characters  for  glimpse
       and  need  to  be  preceded  by  '\' to search for them.  The most common examples are the characters '.'
       (which stands for a wild card), and '*' (the Kleene closure).  So, "glimpse ab.de" will match abcde,  but
       "glimpse ab\.de" will not, and "glimpse ab*de" will not match ab*de, but "glimpse ab\*de" will.  The meta
       character - is translated automatically to a hyphen unless it appears between [] (in which case it denotes
       a range of characters).

       The  index  of  glimpse  stores  all  patterns  in  lower case.  When glimpse searches the index it first
       converts all patterns to lower case, finds the appropriate files, and  then  searches  the  actual  files
       using the original patterns.  So, for example, glimpse ABCXYZ will first find all files containing abcxyz
       in any combination of lower and upper cases, and then search these files directly, so only the right
       cases will be found.  One problem with this approach is discovering misspellings that are caused by wrong
       cases.  For example, glimpse -B abcXYZ will first search the index for the best match to abcxyz  (because
       the  pattern is converted to lower case); it will find that there are matches with no errors, and will go
       to those files to search them directly, this time with the original upper cases.  If  the  closest  match
       is,  say  AbcXYZ, glimpse may miss it, because it doesn't expect an error.  Another problem is speed.  If
       you search for "ATT", it will look at the index for "att".  Unless you use -w to match  the  whole  word,
       glimpse may have to search all files containing, for example, "Seattle" which has "att" in it.
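
       The speed problem with "ATT" can be sketched in Python with a toy lower-case index.  This is an
       illustration under simplifying assumptions (substring lookup in the index unless a whole-word match
       is requested), and the names files_to_scan and grep_exact and the sample files are hypothetical.

```python
def files_to_scan(index, pattern, whole_word=False):
    """Files the lower-case index nominates for a pattern."""
    key = pattern.lower()                 # the index only stores lower case
    nominated = set()
    for word, files in index.items():
        if word == key or (not whole_word and key in word):
            nominated |= files
    return nominated

def grep_exact(contents, names, pattern):
    """Second stage: keep only files that contain the pattern in its original case."""
    return sorted(n for n in names if pattern in contents[n])

contents = {
    "phone.txt": "ATT bill due",
    "travel.txt": "Flight to Seattle",
    "notes.txt": "please pay attention",
}
index = {}
for name, text in contents.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(name)
```

       Without a whole-word match, all three files are nominated for "ATT" (because of "seattle" and
       "attention"), yet only phone.txt survives the exact-case scan.  With whole_word=True only phone.txt
       is nominated in the first place.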

       There  is  no  size  limit  for  simple  patterns  and  simple patterns within Boolean expressions.  More
       complicated patterns, such as regular expressions, are currently limited to approximately 30  characters.
       Lines  are  limited  to  1024  characters.   Records are limited to 48K, and may be truncated if they are
       larger than that.  The limit of record length can be changed by modifying  the  parameter  Max_record  in
       agrep.h.

       Glimpseindex does not index words longer than 64 characters.

BUGS

       In some rare cases, regular expressions using * or # may not match correctly.

       A  query that contains no alphanumeric characters is not recommended (unless glimpse is used as agrep and
       the file names are provided).  This is an understatement.

       The notion of "match to the whole word" (the -w option) can be tricky sometimes.  For example, glimpse -w
       'word$'  will  not  match  'word' appearing at the end of a line, because the extra '$' makes the pattern
       more than just one simple word.  The same thing can happen with ^ and with _.  To be on  the  safe  side,
       use the -w option only when the patterns are actual words.

       Please send bug reports or comments to gvelez@webglimpse.net.

DIAGNOSTICS

       Exit status is 0 if any matches are found, 1 if none, 2 for syntax errors or inaccessible files.
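
       A script that calls glimpse can branch on these exit codes.  The Python sketch below is one possible
       wrapper: the function names are hypothetical, and run_glimpse assumes a glimpse binary on the PATH
       (subprocess raises FileNotFoundError otherwise).

```python
import subprocess

# Meanings taken from the exit codes listed above.
EXIT_MEANINGS = {
    0: "matches found",
    1: "no matches",
    2: "syntax error or inaccessible files",
}

def describe_exit_status(code):
    """Turn a glimpse exit status into a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit status {code}")

def run_glimpse(args):
    """Run glimpse with a list of arguments; return (description, stdout)."""
    result = subprocess.run(["glimpse", *args], capture_output=True, text=True)
    return describe_exit_status(result.returncode), result.stdout
```

       For example, run_glimpse(["-l", "needle"]) returns the description together with whatever glimpse
       printed, so a caller can tell "no matches" apart from a failed search.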

AUTHORS

       Udi  Manber  and  Burra  Gopal,  Department  of  Computer Science, University of Arizona, and Sun Wu, the
       National Chung-Cheng University, Taiwan.  Now maintained by Golda Velez at Internet WorkShop (email:
       gvelez@webglimpse.net).

                                                November 10, 1997                                     GLIMPSE(1)