lunar (1) glimpse.1.gz

Provided by: glimpse_4.18.7-6_amd64 bug

NAME

       glimpse - search quickly through entire file systems

OVERVIEW

       Glimpse  (which  stands  for  GLobal  IMPlicit SEarch) is a very popular UNIX indexing and
       query system that allows you to search through a large set of files very quickly.  Glimpse
       supports  most  of  agrep's  options  (agrep  is  our  powerful version of grep) including
       approximate matching (e.g., finding misspelled words),  Boolean  queries,  and  even  some
       limited  forms  of regular expressions.  It is used in the same way, except that you don't
       have to specify file names.  So, if you are looking for a needle  anywhere  in  your  file
       system,  all  you  have  to  do is say glimpse needle and all lines containing needle will
       appear preceded by the file name.

       To use glimpse you first need  to  index  your  files  with  glimpseindex.   For  example,
       glimpseindex  -o  ~   will  index  everything  at  or  below your home directory.  See man
       glimpseindex for more details.

       Glimpse is also available for web sites, as a set of tools called  WebGlimpse.   (The  old
       glimpseHTTP  is  no  longer supported and is not recommended.)  See http://webglimpse.net/
       for more information.

       Glimpse includes all of agrep and can be used instead of agrep by giving a file name(s) at
       the  end  of  the  command.   This will cause glimpse to ignore the index and run agrep as
       usual.  For example, glimpse -1 pattern file is the same as agrep -1 pattern file.   Agrep
       is distributed as a self-contained package within glimpse, and can be used separately.  We
       added a new option to agrep:  -r searches recursively the directory and  everything  below
       it (see agrep options below); it is used only when glimpse reverts to agrep.

       Mail  majordomo@webglimpse.net  with  SUBSCRIBE  wgusers  in  the  body to be added to the
       Webglimpse users mailing list.  This is now the location where glimpse questions are  also
       discussed.   Bugs can be reported at http://webglimpse.net/bugzilla/ HTML version of these
       manual pages can be found in  http://webglimpse.net/docs/glimpsehelp.html  Also,  see  the
       glimpse home pages in http://webglimpse.net/glimpse

SYNOPSIS

       glimpse - [almost all letters] pattern

INTRODUCTION

       We  start with simple ways to use glimpse and describe all the options in detail later on.
       Once an index is built, using glimpseindex, searching for pattern is as easy as saying

       glimpse pattern

       The output of glimpse is similar to that of agrep (or any other grep).  The pattern can be
       any agrep legal pattern including a regular expression or a Boolean query (e.g., searching
       for Tucson AND Arizona is done by glimpse 'Tucson;Arizona').

       The speed of glimpse depends mainly on the number and sizes of the files  that  contain  a
       match  and only to a second degree on the total size of all indexed files.  If the pattern
       is reasonably uncommon, then all matches will be reported in a few  seconds  even  if  the
       indexed  files total 500MB or more.  Some information on how glimpse works and a reference
       to a detailed article are given below.

       Most of agrep (and other grep's) options are supported,  including  approximate  matching.
       For example,

       glimpse -1 'Tuson;Arezona'

       will  output  all lines containing both patterns allowing one spelling error in any of the
       patterns (either insertion, deletion, or substitution), which in this case  is  definitely
       needed.

       glimpse -w -i 'parent'

       specifies  case  insensitive  (-i)  and  match  on  complete  words (-w).  So 'Parent' and
       'PARENT' will match, 'parent/child' will match, but 'parenthesis' or  'parents'  will  not
       match.   (Starting  at  version 3.0, glimpse can be much faster when these two options are
       specified, especially for very large indexes.  You may want to set an alias especially for
       "glimpse -w -i".)

       The -F option provides a pattern that must match the file name.  For example,

       glimpse -F '\.c$' needle

       will  find  the  pattern needle in all files whose name ends with .c.  (Glimpse will first
       check its index to determine which files may contain the pattern and then run agrep on the
       file names to further limit the search.)  The -F option should not be put at the end after
       the main pattern (e.g., "glimpse needle -F hay" is incorrect).

A Detailed Description of All the Options of Glimpse

       -#     # is an integer between 1 and 8 specifying the maximum number of  errors  permitted
              in  finding  the  approximate  matches  (the  default  is  zero).   Generally, each
              insertion, deletion, or substitution counts as one error.  It is possible to adjust
              the  relative  cost  of  insertions,  deletions and substitutions (see -I -D and -S
              options).   Since  the  index  stores  only  lower  case  characters,   errors   of
              substituting  upper case with lower case may be missed (see LIMITATIONS).  Allowing
              errors in the match requires more time and can slow down the match by a  factor  of
              2-4.  Be very careful when specifying more than one error, as the number of matches
              tend to grow very quickly.

       -a     prints attribute names.  This option applies only to Harvest SOIF  structured  data
              (used   with  glimpseindex  -s).   (See  http://harvest.sourceforge.net/  for  more
              information about the Harvest project.)

       -A     used for glimpse internals.

       -b     prints the byte offset (from the beginning of the file) of the end of  each  match.
              The first character in a file has offset 0.

       -B     Best  match  mode.   (Warning: -B sometimes misses matches.  It is safer to specify
              the number of errors explicitly.)  When -B is specified and no  exact  matches  are
              found,  glimpse  will  continue to search until the closest matches (i.e., the ones
              with minimum number of errors) are found, at which point the following message will
              be  shown:  "the  best  match  contains x errors, there are y matches, output them?
              (y/n)" This message refers to the number of matches found in the index.  There  may
              be  many  more  matches  in  the actual text (or there may be none if -F is used to
              filter files).  When the -#, -c, or -l options are  specified,  the  -B  option  is
              ignored.   In  general,  -B may be slower than -#, but not by very much.  Since the
              index stores only lower case characters, errors of  substituting  upper  case  with
              lower case may be missed (see LIMITATIONS).

       -c     Display  only  the  count  of  matching  records.   Only  files  with count > 0 are
              displayed.

       -C     tells glimpse to send its queries to glimpseserver.

       -d 'delim'
              Define delim to be the separator between two records.  The default  value  is  '$',
              namely  a  record  is  by  default a line.  delim can be a string of size at most 8
              (with possible use of ^ and $), but not a regular  expression.   Text  between  two
              delim's,  before  the  first  delim,  and after the last delim is considered as one
              record.  For example, -d '$$' defines paragraphs as records and -d '^From ' defines
              mail  messages  as  records.   glimpse matches each record separately.  This option
              does not currently work with regular expressions.   The  -d  option  is  especially
              useful  for  Boolean  AND queries, because the patterns need not appear in the same
              line  but  in  the  same  record.   For  example,  glimpse  -F  mail  -d   '^From '
              'glimpse;arizona;announcement'  will  output  all mail messages (in their entirety)
              that have the 3 patterns anywhere in the message (or  the  header),  assuming  that
              files  with  'mail'  in their name contain mail messages.  If you want the scope of
              the record to be the whole file, use the -W  option.   Glimpse  warning:  Use  this
              option with care.  If the delimiter is set to match mail messages, for example, and
              glimpse finds the pattern in a regular file, it may not find the delimiter and will
              therefore  output  the whole file.  (The -t option - see below - can be used to put
              the delim at the end of the record.)  Performance Note: Agrep (and glimpse) resorts
              to  more  complex  search  when  the  -d  option is used.  The search is slower and
              unfortunately no more than 32 characters can be used in the pattern.

       -Dk    Set the cost of a deletion to k (k is a positive integer).  This  option  does  not
              currently work with regular expressions.

       -e pattern
              Same as a simple pattern argument, but useful when the pattern begins with a `-'.

       -E     prints  the  lines  in  the  index  (as  they  appear in the index) which match the
              pattern.  Used mostly for debugging and maintenance of the index.  This is  not  an
              option that a user needs to know about.

       -f file_name
              this  option  has  a different meaning for agrep than for glimpse: In glimpse, only
              the files whose names are listed in file_name are matched.  (The file names have to
              appear as in .glimpse_filenames.)  In agrep, the file_name contains the list of the
              patterns that are searched.  (Starting at version 3.6, this option for  glimpse  is
              much faster for large files.)

       -F file_pattern
              limits  the  search  to  those  files whose name (including the whole path) matches
              file_pattern.  This option can be used in a  variety  of  applications  to  provide
              limited search even for one large index.  If file_pattern matches a directory, then
              all files with this directory on their path  will  be  considered.   To  limit  the
              search  to actual file names, use $ at the end of the pattern.  file_pattern can be
              a regular expression and even a Boolean pattern.  This  option  is  implemented  by
              running  agrep  file_pattern  on  the  list  of file names obtained from the index.
              Therefore, searching the index itself takes the same amount of time,  but  limiting
              the  second  phase  of  the  search  to  only  a  few files can speed up the search
              significantly.  For example,

              glimpse -F 'src#\.c$' needle

              will search for needle in all .c files with src somewhere along the path.   The  -F
              file_pattern  must appear before the search pattern (e.g., glimpse needle -F '\.c$'
              will not work).  It is possible to use some of agrep's options when  matching  file
              names.   In  this case all options as well as the file_pattern should be in quotes.
              (-B and -v do not work very well as part of a file_pattern.)  For example,

              glimpse -F '-1 \.html' pattern

              will allow one spelling error when matching .html to the file names (so ".htm"  and
              ".shtml" will match as well).

              glimpse -F '-v \.c$' counter

              will search for 'counter' in all files except for .c files.

       -g     prints  the  file  number (its position in the .glimpse_filenames file) rather than
              its name.

       -G     Output the (whole) files that contain a match.

       -h     Do not display filenames.

       -H directory_name
              searches for the index and the other .glimpse files in directory_name.  The default
              is  the  home  directory.  This option is useful, for example, if several different
              indexes are maintained for different archives (e.g., one for mail messages, one for
              source code, one for articles).

       -i     Case-insensitive  search  — e.g., "A" and "a" are considered equivalent.  Glimpse's
              index stores all patterns in lower case (see LIMITATIONS below).  Performance Note:
              When -i is used together with the -w option, the search may become much faster.  It
              is recommended to have -i and -w as defaults, for example, through  an  alias.   We
              use the following alias in our .cshrc file
              alias glwi 'glimpse -w -i'

       -Ik    Set  the cost of an insertion to k (k is a positive integer).  This option does not
              currently work with regular expressions.

       -j     If the index was constructed with the -t option, then -j will output the files last
              modification  dates in addition to everything else.  There are no major performance
              penalties for this option.

       -J host_name
              used in conjunction with glimpseserver (-C) to connect to one particular server.

       -k     No symbol in the pattern is treated as a meta character.  For example,  glimpse  -k
              'a(b|c)*d'  will  find  the occurrences of a(b|c)*d whereas glimpse 'a(b|c)*d' will
              find substrings that match the regular expression 'a(b|c)*d'.  (The only  exception
              is  ^  at  the  beginning of the pattern and $ at the end of the pattern, which are
              still interpreted in the usual way.  Use \^ or \$ if you need them verbatim.)

       -K port_number
              used in conjunction with glimpseserver (-C) to connect to one particular server  at
              the specified TCP port number.

       -l     Output  only the files names that contain a match.  This option differs from the -N
              option in that the files themselves are searched, but the matching  lines  are  not
              shown.

       -L x | x:y | x:y:z
              if  one  number  is  given,  it is a limit on the total number of matches.  Glimpse
              outputs only the first x matches.  If  -l  is  used  (i.e.,  only  file  names  are
              sought),  then  the limit is on the number of files; otherwise, the limit is on the
              number of records.  If two numbers are given (x:y), then y is an added limit on the
              total  number  of  files.   If  three numbers are given (x:y:z), then z is an added
              limit on the number of matches per file.  If any of the x, y, or z is set to 0,  it
              means  to  ignore  it  (in other words 0 = infinity in this case);  for example, -L
              0:10 will output all matches to the first 10 files  that  contain  a  match.   This
              option  is particularly useful for servers that needs to limit the amount of output
              provided to clients.

       -m     used for glimpse internals.

       -M     used for glimpse internals.

       -n     Each matching record (line) is prefixed by its record (line) number  in  the  file.
              Performance  Note: To compute the record/line number, agrep needs to search for all
              record delimiters (or line breaks), which can slow down the search.

       -N     searches only the index (so the search is faster).  If -o or -b are used  then  the
              result  is  the number of files that have a potential match plus a prompt to ask if
              you want to see the file names.  (If -y is used, then there is no  prompt  and  the
              names  of  the  files will be shown.)  This could be a way to get the matching file
              names without even having access to the files themselves.   However,  because  only
              the  index  is  searched, some potential matches may not be real matches.  In other
              words, with -N you will not miss any  file  but  you  may  get  extra  files.   For
              example,  since  the  index stores everything in lower case, a case-sensitive query
              may match a file that has only a case-insensitive match.  Boolean queries may match
              a  file that has all the keywords but not in the same line (indexing with -b allows
              glimpse to figure out whether the keywords are close, but it cannot figure out from
              the  index  whether they are exactly on the same line or in the same record without
              looking at the file).  If the index was not build with -o or -b, then  this  option
              outputs the number of blocks matching the pattern.  This is useful as an indication
              of how long the search will take.  All files are partitioned into  usually  200-250
              blocks.   The  file  .glimpse_statistics  contains  the  total number of blocks (or
              glimpse -N a will give a pretty good estimate; only blocks with no  occurrences  of
              'a' will be missed).

       -o     the  opposite  of -t: the delimiter is not output at the tail, but at the beginning
              of the matched record.

       -O     the file names are not printed before every matched record; instead, each  filename
              is printed just once, and all the matched records within it are printed after it.

       -p     (from  version  4.0B1  only)  Supports reading compressed set of filenames.  The -p
              option allows you to utilize compressed  `neighborhoods'  (sets  of  filenames)  to
              limit  your  search, without uncompressing them.  Added mostly for WebGlimpse.  The
              usage is:
              "-p filename:X:Y:Z" where "filename" is the file with compressed  neighborhoods,  X
              is  an  offset  into that file (usually 0, must be a multiple of sizeof(int)), Y is
              the length glimpse must access from that file (if 0, then whole  file;  must  be  a
              multiple  of  sizeof(int)),  and  Z must be 2 (it indicates that "filename" has the
              sparse-set representation of compressed neighborhoods: the  other  values  are  for
              internal  use  only).  Note  that any colon ":" in filename must be escaped using a
              backslash .

       -P     used for glimpse internals.

       -q     prints the offsets of the beginning and end of each matched record.  The difference
              between -q and -b is that -b prints the offsets of the actual matched string, while
              -q prints the offsets of the whole record where the  match  occurred.   The  output
              format is @x{y}, where x is the beginning offset and y is the end offset.

       -Q     when  used  together with -N glimpse not only displays the filename where the match
              occurs, but the exact occurrences (offsets) as seen in the index.  This  option  is
              relevant  only  if  the  index  was  built with -b;  otherwise, the offsets are not
              available in the index.  This option is ignored when used not with -N.

       -r     This option is an agrep option and it will be ignored in glimpse, unless glimpse is
              used  with a file name at the end which makes it run as agrep.  If the file name is
              a directory name, the -r option will search (recursively) the whole  directory  and
              everything below it.  (The glimpse index will not be used.)

       -R k   defines  the  maximum size (in bytes) of a record.  The maximum value (which is the
              default) is 48K.  Defining the maximum to be lower than the default  may  speed  up
              some searches.

       -s     Work  silently, that is, display nothing except error messages.  This is useful for
              checking the error status.

       -Sk    Set the cost of a substitution to k (k is a positive integer).   This  option  does
              not currently work with regular expressions.

       -t     Similar to the -d option, except that the delimiter is assumed to appear at the end
              of the record.  Glimpse will output the record starting from the end  of  delim  to
              (and including) the next delim.  (See warning for the -d option.)

       -T directory
              Use  directory  as a place where temporary files are built.  (Glimpse produces some
              small temporary files usually in /tmp.)   This  option  is  useful  mainly  in  the
              context  of  structured  queries for the Harvest project, where the temporary files
              may be non-trivial, and the /tmp directory may not have enough space for them.

       -U     (starting at version 4.0B1) Interprets an index created  with  the  -X  or  the  -U
              option  in glimpseindex.  Useful mostly for WebGlimpse or similar web applications.
              When glimpse outputs matches, it will display the filename, the URL, and the  title
              automatically.

       -v     (This  option  is an agrep option and it will be ignored in glimpse, unless glimpse
              is used with a file name at the end which makes  it  run  as  agrep.)   Output  all
              records/lines  that  do  not  contain  a  match.  (Glimpse does not support the NOT
              operator yet.)

       -V     prints the current version of glimpse.

       -w     Search for the pattern as a word — i.e., surrounded by non-alphanumeric characters.
              For  example, glimpse -w car will match car, but not characters and not car10.  The
              non-alphanumeric must surround the match;  they cannot be counted as errors.   This
              option  does  not work with regular expressions.  Performance Note: When -w is used
              together with the -i option, the search may become much faster.  The  -w  will  not
              work  with  $,  ^,  and _ (see BUGS below).  It is recommended to have -i and -w as
              defaults, for example, through an alias.  We use the following alias in our  .cshrc
              file
              alias glwi 'glimpse -w -i'

       -W     The  default for Boolean AND queries is that they cover one record (the default for
              a record is one line) at a time.  For example, glimpse 'good;bad' will  output  all
              lines  containing  both  'good'  and  'bad'.   The  -W  option changes the scope of
              Booleans to be the whole file.  Within a file glimpse will output  all  matches  to
              any  of  the  patterns.  So, glimpse -W 'good;bad' will output all lines containing
              'good' or 'bad', but only in files that contain both patterns.   The  NOT  operator
              '~'  can  be  used  only  with  -W.   It is described later on.  The OR operator is
              essentially unaffected  (unless  it  is  in  combination  with  the  other  Boolean
              operations).   For  structured  queries, the scope is always the whole attribute or
              file.

       -x     The pattern must match the whole line.  (This option is translated to -w  when  the
              index  is  searched and it is used only when the actual text is searched.  It is of
              limited use in glimpse.)

       -X     (from version 4.0B1 only) Output the names of files that contain a  match  even  if
              these  files  have  been  deleted  since  the index was built.  Without this option
              glimpse will simply ignore these files.

       -y     Do not prompt.  Proceed with the match as  if  the  answer  to  any  prompt  is  y.
              Servers (or any other scripts) using glimpse will probably want to use this option.

       -Y k   If the index was constructed with the -t option, then -Y x will output only matches
              to files that were created or modified within the last x days.  There are no  major
              performance penalties for this option.

       -z     Allow  customizable  filtering,  using  the  file  .glimpse_filters  to perform the
              programs listed there for each match.  The best example is compress/decompress.  If
              .glimpse_filters include the line
              *.Z   uncompress <
              (separated  by  tabs)  then before indexing any file that matches the pattern "*.Z"
              (same syntax as the one for .glimpse_exclude) the command listed is executed  first
              (assuming  input  is  from  stdin,  which is why uncompress needs <) and its output
              (assuming it goes to stdout) is indexed.  The file itself is not changed (i.e.,  it
              stays  compressed).   Then if glimpse -z is used, the same program is used on these
              files on the fly.  Any program can be used (we run 'exec').  For example,  one  can
              filter  out parts of files that should not be indexed.  Glimpseindex tries to apply
              all filters in .glimpse_filters in the order they are given.  For example,  if  you
              want  to  uncompress  a  file and then extract some part of it, put the compression
              command (the example  above)  first  and  then  another  line  that  specifies  the
              extraction.  Note that this can slow down the search because the filters need to be
              run before files are searched.  (See also glimpseindex.)

       -Z     No op.  (It's useful for glimpse's internals. Trust us.)

       The characters `$', `^', `', `[', `]', `^',  `|',  `(',  `)',  `!',  and  `\'  can  cause
       unexpected  results  when included in the pattern, as these characters are also meaningful
       to the shell.  To avoid these problems, enclose the entire pattern in single quotes, i.e.,
       'pattern'.  Do not use double quotes (").

PATTERNS

       glimpse  supports  a  large  variety  of  patterns, including simple strings, strings with
       classes of  characters,  sets  of  strings,  wild  cards,  and  regular  expressions  (see
       LIMITATIONS).

       Strings
              Strings  are  any  sequence  of  characters,  including the special symbols `^' for
              beginning of line and `$' for end of line.  The following special characters ( `$',
              `^',  `',  `[',  `^',  `|', `(', `)', `!', and `\' ) as well as the following meta
              characters special to glimpse (and agrep): `;', `,', `#', `<', `>', `-',  and  `.',
              should  be  preceded  by  `\' if they are to be matched as regular characters.  For
              example, \^abc\\ corresponds to the string ^abc\, whereas ^abc corresponds  to  the
              string abc at the beginning of a line.

       Classes of characters
              a  list  of  characters  inside [] (in order) corresponds to any character from the
              list.  For example, [a-ho-z] is any character between a and h or between o  and  z.
              The  symbol  `^'  inside  []  complements the list.  For example, [^i-n] denote any
              character in the character set except character 'i' to 'n'.  The  symbol  `^'  thus
              has  two  meanings, but this is consistent with egrep.  The symbol `.' (don't care)
              stands for any symbol (except for the newline symbol).

       Boolean operations
              Glimpse supports an `AND' operation denoted by the symbol  `;'  an  `OR'  operation
              denoted  by  the  symbol  `,',  a limited version of a 'NOT' operation (starting at
              version 4.0B1) denoted by the symbol `~', or any combination.  For example, glimpse
              'pizza;cheeseburger'  will  output  all lines containing both patterns.  glimpse -F
              'gnu;\.c$' 'define;DEFAULT' will output all  lines  containing  both  'define'  and
              'DEFAULT'  (anywhere  in  the  line,  not necessarily in order) in files whose name
              contains 'gnu' and ends with .c.  glimpse '{political,computer};science' will match
              'political  science'  or  'science  of  computers'.   The  NOT operation works only
              together with the -W option and it is generally applies  only  to  the  whole  file
              rather to individual records.  Its output may sometimes seem counterintuitive.  Use
              with care.  glimpse -W 'fame;~glory' will output all lines containing 'fame' in all
              files  that  contain 'fame' but do not contain 'glory'; This is the most common use
              of NOT, and in this case it works as expected.  glimpse -W '~{fame;glory}' will  be
              limited  to  files  that  do  not  contain  both  words,  and will output all lines
              containing one of them.

       Wild cards
              The symbol '#' is used to  denote  a  sequence  of  any  number  (including  0)  of
              arbitrary characters (see LIMITATIONS).  The symbol # is equivalent to .* in egrep.
              In fact, .* will work too, because it is a valid regular  expression  (see  below),
              but  unless  this  is  part  of  an  actual regular expression, # will work faster.
              (Currently glimpse is experiencing some problems with #.)

       Combination of exact and approximate matching
              Any pattern inside angle brackets <> must match the text exactly even if the  match
              is  with  errors.   For  example, <mathemat>ics matches mathematical with one error
              (replacing the last s with an a), but mathe<matics> does not match mathematical  no
              matter how many errors are allowed.  (This option is buggy at the moment.)

       Regular expressions
              Since the index is word based, a regular expression must match words that appear in
              the index for glimpse to find it.  Glimpse first strips the regular expression from
              all  non-alphabetic characters, and searches the index for all remaining words.  It
              then applies the regular expression matching algorithm to the files  found  in  the
              index.   For  example,  glimpse 'abc.*xyz' will search the index for all files that
              contain both 'abc' and 'xyz', and then search  directly  for  'abc.*xyz'  in  those
              files.  (If you use glimpse -w 'abc.*xyz', then 'abcxyz' will not be found, because
              glimpse will think that abc and xyz need to be matches to whole words.)  The syntax
              of  regular  expressions  in glimpse is in general the same as that for agrep.  The
              union operation `|', Kleene closure `*', and  parentheses  ()  are  all  supported.
              Currently  '+'  is  not  supported.   Regular  expressions are currently limited to
              approximately 30 characters (generally excluding meta  characters).   Some  options
              (-d,  -w,  -t, -x, -D, -I, -S) do not currently work with regular expressions.  The
              maximal number of errors for regular expressions that use '*' or  '|'  is  4.  (See
              LIMITATIONS.)

       structured queries
              Glimpse  supports some form of structured queries using Harvest's SOIF format.  See
              STRUCTURED QUERIES below for details.

EXAMPLES

       (Run "glimpse '^glimpse' this-file" to get a list of all  examples,  some  of  which  were
       given earlier.)

       glimpse -F 'haystack.h$' needle
              finds all needles in all haystack.h's files.

       glimpse -2 -F html Anestesiology
              outputs  all  occurrences  of  Anestesiology  with  two  errors  in files with html
              somewhere in their full name.

       glimpse -l -F '\.c$' variablename
              lists the names of all .c files that contain variablename (the -l option lists file
              names rather than output the matched lines).

       glimpse -F 'mail;1993' 'windsurfing;Arizona'
              finds  all  lines containing windsurfing and Arizona in all files having `mail' and
              '1993' somewhere in their full name.

       glimpse -F mail 't.j@#uk'
              finds all mail addresses (search only files with mail somewhere in their name) from
              the  uk,  where  the  login  name  ends  with  t.j,  where the . stands for any one
              character.  (This is very useful to find a login name of someone whose middle  name
              you don't know.)

       glimpse -F mbox -h -G  . > MBOX
              concatenates all files whose name matches `mbox' into one big one.

SEARCHING IN COMPRESSED FILES

       Glimpse  includes  an  optional new compression program, called cast, which allows glimpse
       (and agrep) to search the compressed files without having to decompress them.  The  search
       is  actually  significantly  faster  when  the files are compressed.  However, we have not
       tested cast as thoroughly as we would have liked, and a mishap in a compression  algorithm
       can  cause  loss of data, so we recommend at this point to use cast very carefully.  We do
       not support or maintain cast.  (Unless you specifically use cast, the default is to ignore
       it.)

GLIMPSEINDEX FILES

       All  files  used by glimpse are located at the directory(ies) where the index(es) is (are)
       stored and have .glimpse_  as  a  prefix.   The  first  two  files  (.glimpse_exclude  and
       .glimpse_include) are optionally supplied by the user.  The other files are built and read
       by glimpse.

       .glimpse_exclude
              contains a list of files that  glimpseindex  is  explicitly  told  to  ignore.   In
              general,  the  syntax  of .glimpse_exclude/include is the same as that of agrep (or
              any other grep).  The lines in the .glimpse_exclude file are matched  to  the  file
              names,  and  if  they  match, the files are excluded.  Notice that agrep matches to
              parts  of  the  string!   e.g.,  agrep  /ftp/pub  will  match   /home/ftp/pub   and
              /ftp/pub/whatever.   So, if you want to exclude /ftp/pub/core, you just list it, as
              is,  in  the  .glimpse_exclude  file.   If   you   put   "/home/ftp/pub/cdrom"   in
              .glimpse_exclude,  every  file  name  that  matches  that  string will be excluded,
              meaning all files below it.  You can use ^ to indicate  the  beginning  of  a  file
              name,  and  $ to indicate the end of one, and you can use * and ? in the usual way.
              For example /ftp/*html  will  exclude  /ftp/pub/foo.html,  but  will  also  exclude
              /home/ftp/pub/html/whatever;  if you want to exclude files that start with /ftp and
              end with html use ^/ftp*html$ Notice that putting a * at the beginning  or  at  the
              end is redundant (in fact, in this case glimpseindex will remove the * when it does
              the indexing).  No other meta characters are  allowed  in  .glimpse_exclude  (e.g.,
              don't  use  .* or # or |).  Lines with * or ? must have no more than 30 characters.
              Notice that, although the index itself will not be indexed, the list of file  names
              (.glimpse_filenames)   will   be   indexed   unless  it  is  explicitly  listed  in
              .glimpse_exclude.

       .glimpse_filters
              See the description above for the -z option.

       .glimpse_include
              contains a list of files that glimpseindex is explicitly told  to  include  in  the
              index  even  though they may look like non-text files.  Symbolic links are followed
              by glimpseindex only if they are specifically included here.  If a file is in  both
              .glimpse_exclude and .glimpse_include it will be excluded.

       .glimpse_filenames
              contains  the  list of all indexed file names, one per line.  This is an ASCII file
              that can also be used with agrep to search for a file name leading to a  fast  find
              command.  For example,
              glimpse 'count#\.c$' ~/.glimpse_filenames
              will  output  the  names  of all (indexed) .c files that have 'count' in their name
              (including anywhere on the path from the index).  Setting the  following  alias  in
              the .login file may be useful:
              alias findfile 'glimpse -h :1 ~/.glimpse_filenames'

       .glimpse_index
              contains  the  index.   The  index  consists  of  lines,  each starting with a word
              followed by a list of block numbers (unless the -o or -b options are used, in which
              case each word is followed by an offset into the file .glimpse_partitions where all
              pointers are kept).  The block/file numbers are stored in binary form, so  this  is
              not an ASCII file.

       .glimpse_messages
              contains the output of the -w option (see above).

       .glimpse_partitions
              contains  the  partition  of  the  indexed space into blocks and, when the index is
              built with the -o or -b options, some  part  of  the  index.   This  file  is  used
              internally by glimpse and it is a non-ASCII file.

       .glimpse_statistics
              contains  some  statistics about the makeup of the index.  Useful for some advanced
              applications and customization of glimpse.

       .glimpse_turbo
              An added data structure (used under glimpseindex -o or -b only) that helps to speed
              up queries significantly for large indexes.  Its size is 0.25MB.  Glimpse will work
              without it if needed.

STRUCTURED QUERIES

       Glimpse can search for Boolean  combinations  of  "attribute=value"  terms  by  using  the
       Harvest  SOIF parser library (in glimpse/libtemplate).  To search this way, the index must
       be made by using the -s option of glimpseindex (this can be used in conjunction with other
       glimpseindex  options). For glimpse and glimpseindex to recognize "structured" files, they
       must be in SOIF format. In this format, each value is prefixed by an  attribute-name  with
       the  size  of  the  value (in bytes) present in "{}" after the name of the attribute.  For
       example, The following lines are part of an SOIF file:
       type{17}:       Directory-Listing
       md5{32}:        3858c73d68616df0ed58a44d306b12ba
       Any string can serve as an attribute name.  Glimpse "pattern;type=Directory-Listing"  will
       search  for "pattern" only in files whose type is "Directory-Listing".  The file itself is
       considered to be one "object" and its name/url appears as the first attribute with an  "@"
       prefix; e.g., @FILE { http://xxx... } The scope of Boolean operations changes from records
       (lines) to whole files when structured queries are used in glimpse (since individual query
       terms  can look at different attributes and they may not be "covered" by the record/line).
       Note that glimpse can only search for patterns in the value parts of the SOIF file:  there
       are  some  attributes (like the TTL, MD5, etc.) that are interpreted by Harvest's internal
       routines.  See RFC 2655 for more detailed information of the SOIF format.

REFERENCES

       1.     U. Manber and S. Wu, "GLIMPSE: A Tool  to  Search  Through  Entire  File  Systems,"
              Usenix  Winter 1994 Technical Conference (best paper award), San Francisco (January
              1994), pp. 23-32.  Also, Technical Report #TR 93-34,  Dept.  of  Computer  Science,
              University  of  Arizona,  October 1993 (a postscript file is available by anonymous
              ftp at ftp://webglimpse.net/pub/glimpse/TR93-34.ps).

       2.     S. Wu and U. Manber, "Fast Text Searching Allowing Errors," Communications  of  the
              ACM 35 (October 1992), pp. 83-91.

SEE ALSO

       agrep(1), ed(1), ex(1), glimpseindex(1), glimpseserver(1), grep(1), sh(1), csh(1).

LIMITATIONS

       The  index of glimpse is word based.  A pattern that contains more than one word cannot be
       found in the index.  The way glimpse overcomes this weakness is by  splitting  any  multi-
       word pattern into its set of words and looking for all of them in the index.  For example,
       glimpse 'linear programming' will first consult the index to  find  all  files  containing
       both  linear  and programming, and then apply agrep to find the combined pattern.  This is
       usually an effective solution, but it can be slow for cases  where  both  words  are  very
       common, but their combination is not.

       As  was  mentioned  in  the  section  on  PATTERNS  above,  some  characters serve as meta
       characters for glimpse and need to be preceded by '\' to search for them.  The most common
       examples  are  the  characters  '.'  (which  stands  for a wild card), and '*' (the Kleene
       closure).  So, "glimpse ab.de" will match  abcde,  but  "glimpse  ab\.de"  will  not,  and
       "glimpse  ab*de" will not match ab*de, but "glimpse ab\*de" will.  The meta character - is
       translated automatically to a hyphen unless it  appears  between  []  (in  which  case  it
       denotes a range of characters).

       The  index  of glimpse stores all patterns in lower case.  When glimpse searches the index
       it first converts all patterns to lower  case,  finds  the  appropriate  files,  and  then
       searches  the  actual  files using the original patterns.  So, for example, glimpse ABCXYZ
       will first find all files containing abcxyz in any combination of lower and  upper  cases,
       and  then  searches  these  files  directly,  so  only the right cases will be found.  One
       problem with this approach is discovering misspellings that are  caused  by  wrong  cases.
       For  example,  glimpse  -B abcXYZ will first search the index for the best match to abcxyz
       (because the pattern is converted to lower case); it will find that there are matches with
       no errors, and will go to those files to search them directly, this time with the original
       upper cases.  If the closest match is, say AbcXYZ, glimpse may miss it, because it doesn't
       expect  an error.  Another problem is speed.  If you search for "ATT", it will look at the
       index for "att".  Unless you use -w to match the whole word, glimpse may  have  to  search
       all files containing, for example, "Seattle" which has "att" in it.

       There is no size limit for simple patterns and simple patterns within Boolean expressions.
       More  complicated  patterns,  such  as  regular  expressions,  are  currently  limited  to
       approximately  30  characters.  Lines are limited to 1024 characters.  Records are limited
       to 48K, and may be truncated if they are larger than that.  The limit of record length can
       be changed by modifying the parameter Max_record in agrep.h.

       Glimpseindex does not index words of size > 64.

BUGS

       In some rare cases, regular expressions using * or # may not match correctly.

       A  query  that  contains  no alphanumeric characters is not recommended (unless glimpse is
       used as agrep and the file names are provided).  This is an understatement.

       The notion of "match to the whole word" (the -w option)  can  be  tricky  sometimes.   For
       example,  glimpse -w 'word$' will not match 'word' appearing at the end of a line, because
       the extra '$' makes the pattern more than just one simple word.  The same thing can happen
       with  ^  and with _.  To be on the safe side, use the -w option only when the patterns are
       actual words.

       Please send bug reports or comments to gvelez@webglimpse.net.

DIAGNOSTICS

       Exit status is 0 if any matches are found, 1 if none, 2 for syntax errors or  inaccessible
       files.

AUTHORS

       Udi Manber and Burra Gopal, Department of Computer Science, University of Arizona, and Sun
       Wu, the National Chung-Cheng University, Taiwan. Now maintained by Golda Velez at Internet
       WorkShop (Email:  gvelez@webglimpse.net)

                                        November 10, 1997                              GLIMPSE(1)