Ubuntu Manpage: lookup - interactive file search and display

NAME

   lookup - interactive file search and display

SYNOPSIS

   lookup [ args ] [ file ...  ]

DESCRIPTION

   Lookup allows the quick interactive search of text files. It supports ASCII, JIS-ROMAN, and
   Japanese EUC Packed formatted text, and has an integrated romaji→kana converter.

THIS MANUAL

   Lookup is flexible for a variety of applications. This manual  will,  however,  focus  on  the
   application  of  searching Jim Breen's edict (Japanese-English dictionary) and kanjidic (kanji
   database). Being familiar with the content and format of these files would be helpful. See the
   INFO  section  near  the  end  of this manual for information on how to obtain these files and
   their documentation.

OVERVIEW OF MAJOR FEATURES

   The following just mentions some major features to whet your appetite  to  actually  read  the
   whole manual (-:

   Romaji-to-Kana Converter
      Lookup can convert romaji to kana for you, even“on the fly”as you type.

   Fuzzy Searching
      Searches  can be a bit“vague”or“fuzzy”, so that you'll be able to find“東京”even if you try
      to search for“ときょ”(the proper yomikata being“とうきょう”).

   Regular Expressions
      Lookup uses powerful and expressive regular expressions for searching. One can easily
      specify complex searches to the effect of “I want lines that look like such-and-such, but
      not like this-and-that, but that also have this particular characteristic....”

   Wildcard ``Glob'' Patterns
      Optionally, can use well-known filename wildcard patterns instead of  full-fledged  regular
      expressions.

   Filters
      You  can have lookup not list certain lines that would otherwise match your search, yet can
      optionally save them for quick review. For example, you could have  all  name-only  entries
      from edict filtered from normal output.

   Automatic Modifications
      Similarly,  you  can  do  a  standard  search-and-replace  on lines just before they print,
      perhaps to remove information you don't care to see  on  most  searches.  For  example,  if
      you're  generally  not interested in kanjidic's info on Chinese readings, you can have them
      removed from lines before printing.

   Smart Word-Preference Mode
      You can have lookup list only entries with whole words that match your search  (as  opposed
      to  an  embedded  match,  such  as  finding“the”inside“them”), but if no whole-word matches
      exist, will go ahead and list any entry that matches the search.

   Handy Features
      Other handy features include a dynamically settable  and  parameterized  prompt,  automatic
      highlighting  of that part of the line that matches your search, an output pager, readline-
      like input  with  horizontal  scrolling  for  long  input  lines,  a“.lookup”startup  file,
      automated programmability, and much more. Read on!

REGULAR EXPRESSIONS

   Lookup  makes  liberal  use of regular expressions (or regex for short) in controlling various
   aspects of the searches. If you are not familiar with the important concepts of regexes,  read
   the tutorial appendix of this manual before continuing.

JAPANESE CHARACTER ENCODING METHODS

   Internally, lookup works with Japanese packed-format EUC, and all files loaded must be encoded
   similarly. If you have files encoded in JIS or Shift-JIS, you must first convert them  to  EUC
   before loading (see the INFO section for programs that can do this).

   Interactive  input  and  output encoding, however, may be selected via the -jis, -sjis, and
   -euc invocation flags (default is -euc), or by various  commands  to  the  program  (described
   later).

   Make  sure to use the encoding appropriate for your system.  If you're using kterm under the X
   Window System, you can use lookup's -jis flag to match kterm's default JIS encoding.  Or,  you
   might use kterm's“-km euc”startup option (or menu selection) to put kterm into EUC mode. Also,
   I have found kterm's scrollbar (“-sb -sl 500”) to be quite useful.

   With many“English”fonts in Japan, the character that normally prints as a backslash (halfwidth
   version  of  ＼)  in The States appears as a yen symbol (the half-width version of ￥). How it
   will appear on your system is a function of what font you use and what output encoding  method
   you choose, which may be different from the font and method that was used to print this manual
   (both of which may be different from what's printed on your keyboard's appropriate key).  Make
   sure to keep this in mind while reading.

STARTUP

   Let's assume that your copy of edict is in ~/lib/edict. You can start the program simply with

           lookup ~/lib/edict

   You'll  note that lookup spends some time building an index before the default“lookup> ”prompt
   appears.

   Lookup gains much of its search speed by constructing an index of the file(s) to be  searched.
   Since  building  the  index  can be time consuming itself, you can have lookup write the built
   index to a file that can be quickly loaded the next time you run  the  program.   Index  files
   will be given a“.jin”(Jeffrey's Index) ending.

   Let's build the indices for edict and kanjidic now:

           lookup -write ~/lib/edict ~/lib/kanjidic

   This will create the index files
          ~/lib/edict.jin
          ~/lib/kanjidic.jin
   and exit.

   You can now restart lookup, automatically using the pre-computed index files, as:

          lookup ~/lib/edict ~/lib/kanjidic

   You  should  then  be  presented  with  the  prompt without having to wait for the index to be
   constructed (but see the section on Operating System concerns for possible reasons of delay).

INPUT

   There are basically two types of input: searches and commands.  Commands  do  such  things  as
   tell  lookup  to load more files or set flags. Searches report lines of a file that match some
   search specifier (where lines to search for are specified by one or more regular expressions).

   The input syntax may perhaps at first seem odd, but has  been  designed  to  be  powerful  and
   concise. A bit of time invested to learn it well will pay off greatly when you need it.

BRIEF EXAMPLE

Assuming you've started lookup with edict and kanjidic as noted above, let's try a few
searches. In these examples, the
“search [edict]> ”
is the prompt. Note that the space after the‘>’is part of the prompt.

Given the input:

search [edict]> tranquil

lookup will report all lines with the string“tranquil”in them. There are currently about a
dozen such lines, two of which look like:

安らか [やすらか] /peaceful (an)/tranquil/calm/restful/
安らぎ [やすらぎ] /peace/tranquility/

Notice that lines with“tranquil”and“tranquility”matched? This is because“tranquil”was embedded
in the word“tranquility”. You could restrict the search to only the word“tranquil”by
prepending the special“start of word”symbol‘<’and appending the special“end of
word”symbol‘>’to the regex, as in:

search [edict]> <tranquil>

This is the regular expression that says“the beginning of a word, followed by a‘t’,‘r’,
...,‘l’, which is at the end of a word.”The current version of edict has just three matching
entries.
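
The difference between an embedded match and a whole-word match can be sketched in Python. This is only an illustration, not lookup's implementation: the helper names (translate_word_markers, search) are hypothetical, and the translation naively assumes that every‘<’and‘>’in the pattern is a word marker, mapping both to Python's \b word-boundary assertion.

```python
import re

EDICT_SAMPLE = [
    "安らか [やすらか] /peaceful (an)/tranquil/calm/restful/",
    "安らぎ [やすらぎ] /peace/tranquility/",
]

def translate_word_markers(lookup_regex):
    # Naive assumption: every '<' and '>' is a start/end-of-word marker.
    return lookup_regex.replace("<", r"\b").replace(">", r"\b")

def search(lookup_regex, lines, fold=True):
    # fold=True mimics the default case-insensitive ("case folding") search.
    flags = re.IGNORECASE if fold else 0
    pattern = re.compile(translate_word_markers(lookup_regex), flags)
    return [line for line in lines if pattern.search(line)]

embedded = search("tranquil", EDICT_SAMPLE)      # matches both sample lines
whole_word = search("<tranquil>", EDICT_SAMPLE)  # matches only the first line
```

With the two sample lines, the plain pattern selects both entries while the bracketed pattern selects only the entry containing the standalone word.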

Let's try another:

search [edict]> fukushima

This is a search for the“English”fukushima -- ways to search for kana or kanji will be
explored later. Note that among the several lines selected and printed are:

副島 [ふくしま] /Fukushima (pn,pl)/
木曽福島 [きそふくしま] /Kisofukushima (pl)/

By default, searches are done in a case-insensitive manner --‘F’and‘f’are treated the same by
lookup, at least so far as the matching goes. This is called case folding.

Let's give a command to turn this option off, so that‘f’and‘F’won't be considered the same.
Here's an odd point about lookup's input syntax: the default setting is that all command lines
must begin with a space. The space is the (default) command-introduction character and tells
the input parser to expect a command rather than a search regular expression. It is a common
mistake at first to forget the leading space when issuing a command. Be careful.

Try the command“ fold”to report the current status of case-folding. Notice that as soon as
you type the space, the prompt changes to
“lookup command> ”
as a reminder that now you're typing a command rather than a search specification.

lookup command> fold

The reply should be“file #0's case folding is on”.

You can actually turn it off with“ fold off”. Now try the search for“fukushima”again. Notice
that this time the entries with“Fukushima”aren't listed? Now try the search
string“Fukushima”and see that the entries with“fukushima”aren't listed.

Case folding is usually very convenient (it also makes corresponding katakana and hiragana
match the same), so don't forget to turn it back on:

lookup command> fold on

JAPANESE INPUT

Lookup has an automatic romaji→kana converter. A leading‘/’indicates that romaji is to follow.
Try typing“/tokyo”and you'll see it convert to“/ときょ”as you type. When you hit return,
lookup will list all lines that have a“ときょ”somewhere in them. Well, sort of. Look
carefully at the lines which match. Among them (if you had case folding back on) you'll see:

キリスト教 [キリストきょう] /Christianity/
東京 [とうきょう] /Toukyou (pl)/Tokyo/current capital of Japan/
凸鏡 [とっきょう] /convex lens/

The first one has“ときょ”in it (as“トきょ”, where the katakana“ト”matches in a case-
insensitive manner to the hiragana“と”), but you might consider the others unexpected, since
they don't have“ときょ”in them. They're close (“とうきょ”and“とっきょ”), but not exact. This
is the result of lookup's“fuzzification”. Try the command“ fuzz”(again, don't forget the
command-introduction space). You'll see that fuzzification is turned on. Turn it off
with“ fuzz off”and try“/tokyo”(which will convert as you type) again. This time you only get
the lines which have“ときょ”exactly (well, case folding is still on, so it might match
katakana as well).

In a fuzzy search, length of vowels is ignored --“と”is considered the same as“とう”, for
example. Also, the presence or absence of any“っ”character is ignored, and the pairs じ ぢ, ず
づ, え ゑ, and お を are considered identical in a fuzzy search.
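
The fuzzy rules just listed can be approximated by normalizing both the query and the line before comparing them. The sketch below is an assumption-laden illustration: lookup itself expands the query into a regular expression rather than normalizing text, the names fuzzy_key and fuzzy_match are hypothetical, only hiragana is handled, and the vowel-length rule is simplified to "a vowel kana that repeats the previous kana's vowel (or う after an o-sound) is dropped".

```python
import unicodedata

# Pairs treated as identical in a fuzzy search (folded to one member).
FOLD = str.maketrans({"ぢ": "じ", "づ": "ず", "ゑ": "え", "を": "お"})
VOWEL_KANA = {"あ": "A", "い": "I", "う": "U", "え": "E", "お": "O"}

def kana_vowel(ch):
    # Unicode names such as "HIRAGANA LETTER TO" end in the kana's vowel.
    name = unicodedata.name(ch, "")
    if not name.startswith("HIRAGANA LETTER"):
        return None
    return name[-1] if name[-1] in "AIUEO" else None

def fuzzy_key(text):
    out = []
    prev_vowel = None
    for ch in text.translate(FOLD):
        if ch in "っー":
            continue  # small tsu and the long-vowel mark are ignored
        vowel = VOWEL_KANA.get(ch)
        lengthens = vowel is not None and prev_vowel is not None and (
            vowel == prev_vowel or (vowel == "U" and prev_vowel == "O"))
        if lengthens:
            continue  # vowel length is ignored
        out.append(ch)
        prev_vowel = kana_vowel(ch)
    return "".join(out)

def fuzzy_match(query, line):
    return fuzzy_key(query) in fuzzy_key(line)
```

Under these simplified rules, とうきょう and とっきょう both reduce to the same key as ときょ, which is why a fuzzy search for ときょ finds them.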

It might be convenient to consider a fuzzy search to be a“pronunciation search”. Special
note: fuzzification will not be performed if a regular expression“*”,“+”,or“?”modifies a non-
ASCII character. This is not an issue when input patterns are filename-like wildcard patterns
(discussed below).

In addition to kana fuzziness, there's one special case for kanji when fuzziness is on. The
kanji repeater mark“々”will be recognized such that“時々”and“時時”will match each-other.

Turn fuzzification back on (“fuzz on”), and search for all whole words which sound
like“tokyo”. That search would be specified as:

search [edict]> /<tokyo>

(again, the“tokyo”will be converted to“ときょ”as you type). My copy of edict has the three
lines

東京 [とうきょう] /Toukyou (pl)/Tokyo/current capital of Japan/
特許 [とっきょ] /special permission/patent/
凸鏡 [とっきょう] /convex lens/

This kind of whole-word romaji-to-kana search is so common, there's a special short cut.
Instead of typing“/<tokyo>”, you can type“[tokyo]”. The leading‘[’means“start
romaji”and“start of word”. Were you to type“<tokyo>”instead (without a leading‘/’or‘[’to
indicate romaji-to-kana conversion), you would get all lines with the English whole-
word“tokyo”in them. That would be a reasonable request as well, but not what we want at the
moment.

Besides the kana conversion, you can use any cut-and-paste that your windowing system might
provide to get Japanese text onto the search line. Cut“ときょ”from somewhere and paste onto
the search line. When hitting enter to run the search, you'll notice that it is done without
fuzzification (even if the fuzzification flag was“on”). That's because there's no leading‘/’.
Not only does a leading‘/’ indicate that you want the romaji-to-kana conversion, but that you
want it done fuzzily.

So, if you'd like fuzzy cut-and-paste, just type a leading‘/’ before pasting (or go back and
prepend one after pasting).

These examples have all been pretty simple, but you can use all the power that regexes have to
offer. As a slightly more complex example, the search“<gr[ea]y>”would look for all lines with
the words“grey”or“gray”in them. Since the‘[’isn't the first character of the line, it doesn't
mean what was mentioned above (start-of-word romaji). In this case, it's just the regular-
expression“class”indicator.

If you feel more comfortable using filename-like“*.txt”wildcard patterns, you can use
the“wildcard on”command to have patterns be considered this way.

This has been a quick introduction to the basics of lookup.

It can be very powerful and much more complex. Below is a detailed description of its various
parts and features.

READLINE INPUT

   The  actual keystrokes are read by a readline-ish package that is pretty standard. In addition
   to just typing away, the following keystrokes are available:

     ^B  / ^F     move left/right one character on the line
     ^A  / ^E     move to the start/end of the line
     ^H  / ^G     delete one character to the left/right of the cursor
     ^U  / ^K     delete all characters to the left/right of the cursor
     ^P  / ^N     previous/next lines on the history list
     ^L or ^R     redraw the line
     ^D           delete char under the cursor, or EOF if line is empty
     ^space       force romaji conversion (^@ on some systems)

   If automatic romaji-to-kana conversion is turned on (as it is by default), there  are  certain
   situations  where  the  conversion  will  be  done, as we saw above. Lower-case romaji will be
   converted to hiragana, while upper-case  romaji  to  katakana.   This  usually  won't  matter,
   though, as case folding will treat hiragana and katakana the same in the searches.

   In  exactly  what  situations  the  automatic conversion will be done is intended to be rather
   intuitive once the basic idea is learned.  However, at any time, one can use control-space  to
   convert  the  ASCII  to  the  left of the cursor to kana. This can be particularly useful when
   needing to enter kana on a command line (where auto conversion is never done; see below).

ROMAJI FLAVOR

   Most flavors of romaji are recognized. Special  or  non-obvious  items  are  mentioned  below.
   Lowercase are converted to hiragana, uppercase to katakana.

   Long vowels can be entered by repeating the vowel, or with‘-’or‘^’.

   In  situations  where  an“n”could  be  vague, as in“na”being な or んあ, use a single quote to
   force ん.  Therefore,「kenichi」→けにち while「ken'ichi」→けんいち.

   The romaji has been richly extended with many non-standard combinations such as ふぁ or  ちぇ,
   which are represented in intuitive ways:「fa」→ふぁ,「che」→ちぇ, etc.

   Various other mappings of interest:

     wo →を     we→ゑ      wi→ゐ
     VA →ヴァ   VI→ヴィ    VU→ヴ      VE→ヴェ    VO→ヴォ
     di →ぢ     dzi→ぢ     dya→ぢゃ   dyu→ぢゅ   dyo→ぢょ
     du →づ     tzu→づ     dzu→づ

   (the following kana are all smaller versions of the regular kana)

     xa →ぁ     xi→ぃ      xu→ぅ      xe→ぇ      xo→ぉ
     xu →ぅ     xtu→っ     xwa→ゎ     xka→ヵ     xke→ヶ
     xya→ゃ     xyu→ゅ     xyo→ょ
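
   A table-driven, longest-match-first conversion is one way to picture how such mappings
   behave. The sketch below is illustrative only: ROMAJI_TABLE holds just a handful of
   entries (those mentioned above plus a few standard syllables), romaji_to_kana is a
   hypothetical name, and uppercase-to-katakana conversion and long vowels are left out.

```python
# A deliberately tiny table; a real converter needs the full syllabary.
ROMAJI_TABLE = {
    "to": "と", "kyo": "きょ", "ke": "け", "ni": "に", "chi": "ち", "i": "い",
    "n'": "ん", "n": "ん",          # a quote forces ん before a vowel
    "fa": "ふぁ", "che": "ちぇ",    # extended combinations
    "wo": "を", "di": "ぢ", "du": "づ", "xtu": "っ",
}

def romaji_to_kana(text):
    out = []
    i = 0
    while i < len(text):
        # Try the longest chunk first so "ni" wins over "n" + "i".
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            kana = ROMAJI_TABLE.get(chunk)
            if kana is not None:
                out.append(kana)
                i += len(chunk)
                break
        else:
            out.append(text[i])  # pass unknown characters through unchanged
            i += 1
    return "".join(out)
```

   With this table, "kenichi" yields けにち while "ken'ichi" yields けんいち, matching the
   single-quote rule described above.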

INPUT SYNTAX

   Any  input  line  beginning  with  a  space  (or  whichever  character  is set as the command-
   introduction character) is processed as a  command  to  lookup  rather  than  a  search  spec.
   Automatic  kana  conversion  is never done on these lines (but forced conversion with control-
   space may be done at any time).

   Other lines are taken as search regular expressions, with the following special cases:

   ?  A line consisting of a single question mark will report  the  current  command-introduction
      character (the default is a space, but can be changed with the“cmdchar”command).

   =  If  a  line  begins  with‘=’,  the  line  (without  the‘=’)  is  taken  as a search regular
      expression, and no automatic (or internal -- see below) kana conversion is done anywhere on
      the line (although again, conversion can always be forced with control-space).  This can be
      used to initiate a search where the beginning of  the  regex  is  the  command-introduction
      character,  or  in  certain  situations  where automatic kana conversion is temporarily not
      desired.

   /  A line beginning with‘/’indicates romaji input for  the  whole  line.   If  automatic  kana
      conversion  is turned on, the conversion will be done in real-time, as the romaji is typed.
      Otherwise it will be done internally once the line is entered.  Regardless, the presence of
      the  leading‘/’indicates  that  any  kana  (either  converted  or cut-and-pasted in) should
      be“fuzzified”if fuzzification is turned on.

      As an addition to the above, if the line doesn't begin with‘=’or  the  command-introduction
      character  (and  automatic  conversion  is  turned  on),‘/’  anywhere on the line initiates
      automatic conversion for the following word.

   [  A line beginning with‘[’is taken to be romaji (just as a line beginning  with‘/’),  and  the
      converted  romaji is subject to fuzzification (if turned on).  However, if‘[’is used rather
      than‘/’, an implied‘<’“beginning of word”is prepended to the resulting kana  regex.   Also,
      any ending‘]’on such a line is converted to the“ending of word”specifier‘>’in the resulting
      regex.

   In addition to the above, lines may have certain prefixes and suffixes to control  aspects  of
   the search or command:

   !  Various  flags  can  be  toggled  for  the  duration  of  a particular search by prepending
      a“!!”sequence to the input line.

      Sequences are shown below, along with commands related to each:

       !F! …  Filtration is toggled for this line (filter)
       !M! …  Modification is toggled for this line (modify)
       !w! …  Word-preference mode is toggled for this line (word)
       !c! …  Case folding is toggled for this line (fold)
       !f! …  Fuzzification is toggled for this line (fuzz)
       !W! …  Wildcard-pattern mode is toggled for this line (wildcard)
       !r! …  Raw. Force fuzzification off for this line
       !h! …  Highlighting is toggled for this line (highlight)
       !t! …  Tagging is toggled for this line (tag)
       !d! …  Displaying is on for this line (display)

      The letters can be combined, as in“!cf!”.

      The final‘!’ can be omitted if the first character after  the  sequence  is  not  an  ASCII
      letter.

      If no letters are given (“!!”), “!f!” is the default.

      These  last  two  points  can be conveniently combined in the common case of“!/romaji”which
      would be the same as“!f!/romaji”.

      The special sequence“!?”lists the above, as well as indicates which  are  currently  turned
      on.

      Note  that  the  letters  accepted  in  a“!!”sequence  are  many of the indicators shown by
      the“files”command.

   +  A‘+’prepended to anything above will cause the final search regex to be printed.  This  can
      be  useful  to  see  when and what kind of fuzzification and/or internal kana conversion is
      happening. Consider:

        search [edict]> +/わかる
        a match is“わ[ぁあー]*っ?か[ぁあー]*る[ぅうおぉー]*”

      Due to the leading‘/’, the kana is fuzzified, which explains the somewhat complex resulting
      regex. For comparison, note:

        search [edict]> +わかる
        a match is“わかる”
        search [edict]> +!/わかる
        a match is“わかる”

      As  the‘+’shows,  these  are  not  fuzzified. The first one has no leading‘/’or‘[’to induce
      fuzzification, while the second  has  the‘!’line  prefix  (which  is  the  default  version
      of“!f!”), which toggles fuzzification mode to“off”for that line.

   ,  The  default of all searches and most commands is to work with the first file loaded (edict
      in these examples). One can change this default (see the“select”command) or, by appending a
      comma+digit  sequence  at  the  end  of an input line, force that line to work with another
      previously-loaded file. An appended“,1”works with the first extra file loaded (in these
      examples, kanjidic).  An appended“,2”works with the 2nd extra file loaded, etc.

      An  appended“,0”works  with  the original first file (and can be useful if the default file
      has been changed via the“select”command).

      The following sequence shows a common usage:

        search [edict]> [ときょと]
        東京都 [とうきょうと] /Tokyo Metropolitan area/

      cutting and pasting the 都 from above, and adding a“,1”to search kanjidic:

        search [edict]> 都,1
        都 4554 N4769 S11  ..... ト ツ みやこ {metropolis} {capital}
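
   The prefixes and suffixes above compose in a fixed order, which a small parser can make
   concrete. This is a rough sketch under stated assumptions, not lookup's actual parser:
   parse_line and its result keys are hypothetical, the flag letters are taken from the
   list above, the special“!?”sequence is not handled, and the leading‘/’,‘[’and‘=’markers
   are left in the returned pattern.

```python
import re

FLAG_LETTERS = "FMwcfWrhtd"
# "!letters!" where the final "!" may be dropped before a non-letter.
PREFIX = re.compile(r"!([" + FLAG_LETTERS + r"]*)(?:!|(?![A-Za-z]))")
SUFFIX = re.compile(r",(\d+)$")  # trailing comma+digit slot selector

def parse_line(line):
    show_regex = line.startswith("+")
    if show_regex:
        line = line[1:]
    toggles = ""
    match = PREFIX.match(line)
    if match:
        toggles = match.group(1) or "f"  # bare "!" or "!!" defaults to "f"
        line = line[match.end():]
    slot = None
    match = SUFFIX.search(line)
    if match:
        slot = int(match.group(1))
        line = line[:match.start()]
    return {"show_regex": show_regex, "toggles": toggles,
            "slot": slot, "pattern": line}
```

   For instance, parse_line("+!/わかる") reports that the regex should be shown, that
   fuzzification is toggled, and that the remaining pattern is“/わかる”.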

FILENAME-LIKE WILDCARD MATCHING

   When wildcard-pattern mode is selected, patterns are considered as extended “*.txt”-like
   patterns. This is often more convenient for users not familiar with regular expressions. To
   have this mode selected by default, put

      default wildcard on

   into your“.lookup”file (see“STARTUP FILE”below).

   When wildcard mode is on, only “*”, “?”, “+”, and “.” are affected. See the entry for the
   “wildcard” command below for details.

   Other  features,  such  as  the multiple-pattern searches (described below) and other regular-
   expression metacharacters are available.
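
   One plausible reading of how the four affected characters are treated is sketched below.
   The mapping is an assumption made for illustration (the entry for the“wildcard”command
   is the authority): “*” becomes “.*”, “?” becomes “.”, and “+” and “.” are taken
   literally. The name wildcard_to_regex is hypothetical.

```python
import re

# Assumed treatment of the four characters affected by wildcard mode.
WILDCARD_MAP = {"*": ".*", "?": ".", "+": r"\+", ".": r"\."}

def wildcard_to_regex(pattern):
    # Every other character, including other regex metacharacters, is kept.
    return "".join(WILDCARD_MAP.get(ch, ch) for ch in pattern)

grey_or_gray = re.compile(wildcard_to_regex("gr?y"))
```

   Under this assumption the wildcard pattern “china*japan” turns into the regular
   expression “china.*japan”.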

MULTIPLE-PATTERN SEARCHES

   You can put multiple patterns in a single search specifier.  For example consider

     search [edict]> china||japan

   The first part (“china”) will select all lines that have“china”in them. Then, from among those
   lines,  the  second part will select lines that have“japan”in them.  The“||”is not part of any
   pattern -- it is lookup's“pipe”mechanism.

   The above example is very different from the single pattern  “china|japan”which  would  select
   any   line   that   had   either“china”or“japan”.   With“china||japan”,  you  get  lines  that
   have“china”and then also have“japan”as well.

   Note that it is also different  from  the  regular  expression“china.*japan”(or  the  wildcard
   pattern“china*japan”)which  would  select  lines  having“china,  then  maybe  some stuff, then
   japan”.  But consider the case when“japan”comes on  the  line  before“china”.  Just  for  your
   comparison,  the multiple-pattern specifier“china||japan”is pretty much the same as the single
   regular expression“china.*japan|japan.*china”.

   If you use“|!|”instead of“||”, it will mean“...and then lines not matching...”.

   Consider a way to find all lines of kanjidic that do have a Halpern number, but don't  have  a
   Nelson number:

       search [edict]> <H\d+>|!|<N\d+>

   If  you  then  wanted  to  restrict  the  listing  to  those that also had a“jinmeiyou”marking
   (kanjidic's“G9”field) and had a reading of あき, you could make it:

       search [edict]> <H\d+>|!|<N\d+>||<G9>||<あき>

   A prepended‘+’would explain:

       a match is“<H\d+>”
       and not“<N\d+>”
       and“<G9>”
       and“<あき>”

   The“|!|”and“||”can be used to make up to ten separate regular expressions in  any  one  search
   specification.

   Again,  it  is  important  to  stress  that“||”does not mean“or”(as it does in a C program, or
   as‘|’does within a regular expression).  You might find it convenient to read“||”as“and also”,
   while reading“|!|”as“but not”.

   It  is  also  important  to  stress that any whitespace around the“||”and“|!|”construct is not
   ignored, but kept as part of the regex on either side.
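
   The“and also”and“but not”reading of the pipe constructs can be sketched as a chain of
   filters. This is a minimal illustration, not lookup's code: parse_spec and search are
   hypothetical names, Python's regex syntax stands in for lookup's, and the ten-pattern
   limit is not enforced. Whitespace around the separators is kept as part of each pattern,
   as described above.

```python
import re

def parse_spec(spec):
    # Split on "|!|" and "||" while leaving a single "|" inside a pattern.
    parts = re.split(r"(\|!\||\|\|)", spec)
    stages = [(True, parts[0])]
    for operator, pattern in zip(parts[1::2], parts[2::2]):
        stages.append((operator == "||", pattern))  # False means "but not"
    return stages

def search(spec, lines):
    stages = parse_spec(spec)
    return [line for line in lines
            if all(bool(re.search(pattern, line, re.IGNORECASE)) == wanted
                   for wanted, pattern in stages)]

SAMPLE = ["china and japan", "japan then china", "china only", "japan only"]
```

   On the sample lines,“china||japan”keeps the first two,“china|!|japan”keeps only
   “china only”, and the single pattern“china|japan”keeps all four.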

COMBINATION SLOTS

Each file, when loaded, is assigned to a“slot”via which subsequent references to the file are
then made. The slot may then be searched, have filters and flags set, etc.

A special kind of slot, called a“combination slot”,rather than representing a single file, can
represent multiple previously-loaded slots. Searches against a combination slot (or“combo
slot”for short) search all those previously-loaded slots associated with it (called“component
slots”). Combo slots are set up with the combine command.

A combo slot has no filter or modify spec, but can have a local prompt and flags just like
normal file slots. The flags, however, have special meanings with combo slots. Most combo-
slot flags act as a mask against the component-slot flags; when acted upon as a member of the
combo, a component-slot's flag will be disabled if the corresponding combo-slot's flag is
disabled.

Exceptions to this are the autokana, fuzz, and tag flags.

The autokana and fuzz flags govern a combo slot exactly the same as a regular file slot.
When a slot is searched as a component of a combination slot, the component slot's fuzz (and
autokana) flags, or lack thereof, are ignored.

The tag flag is quite different altogether; see the tag command for complete information.

Consider the following output from the files command:

┏━┳━━━━┯━━┳━━━┳━━━━━━━━━━━━━━
┃ 0┃F wcfh d│a I ┃ 2762k┃/usr/jfriedl/lib/edict
┃ 1┃FM cf d│a I ┃ 705k┃/usr/jfriedl/lib/kanjidic
┃ 2┃F cfh@d│a ┃ 1k┃/usr/jfriedl/lib/local.words
┃*3┃FM cfhtd│a ┃ combo┃kotoba (#2, #0)
┗━┻━━━━┷━━┻━━━┻━━━━━━━━━━━━━━

See the discussion of the files command below for basic explanation of the output.

As can be seen, slot #3 is a combination slot with the name“kotoba”with component slots two
and zero. When a search is initiated on this slot, first slot #2“local.words”will be searched,
then slot #0“edict”. Because the combo slot's filter flag is on, the component slots' filter
flag will remain on during the search. The combo slot's word flag is off, however, so slot
#0's word flag will be forced off during the search.

See the combine command for information about creating combo slots.
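
The masking rule can be sketched with sets of flag letters. This is an illustration under assumptions, not lookup's implementation: effective_flags is a hypothetical name, the letters follow the files listing above, and the tag flag is left out entirely because it follows separate rules.

```python
MASKED = set("FMwcWhd")      # combo flag acts as a mask on the component flag
COMBO_GOVERNED = set("fa")   # fuzz and autokana come from the combo slot alone

def effective_flags(component, combo):
    # A masked flag stays on only if it is on in both slots.
    effective = {flag for flag in component if flag in MASKED and flag in combo}
    # fuzz/autokana ignore the component slot's own setting.
    effective |= {flag for flag in combo if flag in COMBO_GOVERNED}
    return effective

edict_flags = set("Fwcfhda")    # slot #0 in the listing above
kotoba_flags = set("FMcfhtda")  # combo slot #3
during_combo_search = effective_flags(edict_flags, kotoba_flags)
```

For slot #0 searched through the combo, the filter flag stays on while the word flag is forced off, as described above.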

PAGER

   Lookup has a built-in pager (à la more).  Upon filling a screen with text, the string
       --MORE [space,return,c,q]--
   is shown. A space will allow another screen of text; a return will allow one more  line.  A‘c’
   will  allow  output text to continue unpaged until the next command. A‘q’ will flush output of
   the current command.

   If supported by the OS, lookup's idea of the screen size is automatically set upon startup and
   window  resize.   Lookup must know the width of the screen in doing both the horizontal input-
   line scrolling, and for knowing when a long line wraps on the screen.

   The pager parameters can be set manually with the“pager”command.

COMMANDS

Any line intended to be a command must begin with the command-introduction character (the
default is a space, but can be set via the“cmdchar”command). However, that character is not
part of the command itself and won't be shown in the following list of commands.

There are a number of commands that work with the selected file or selected slot (both meaning
the same thing). The selected file is the one indicated by an appended comma+digit, as
mentioned above. If no such indication is given, the default selected file is used (usually
the first file loaded, but can be changed with the“select”command).

Some commands accept a boolean argument, such as to turn a flag on or off. In all such cases,
a“1”or“on”means to turn the flag on, while a“0”or“off”is used to turn it off. Some flags are
per-file (“fuzz”,“fold”, etc.), and a command to set such a flag normally sets the flag for
the selected file only. However, the default value inherited by subsequently loaded files can
be set by prepending“default”to the command. This is particularly useful in the startup file
before any files are loaded (see the section STARTUP FILE).

Items separated by‘|’are mutually exclusive possibilities (i.e. a boolean argument
is“1|on|0|off”).

Items shown in brackets (‘[’and‘]’) are optional. All commands that accept a boolean argument
to set a flag or mode do so optionally -- with no argument the command will report the current
status of the mode or flag.

Any command that allows an argument in quotes (such as load, etc.) allows the use of single or
double quotes.

The commands:

[default] autokana [boolean]
Automatic romaji → kana conversion for the selected file is turned on or off (default is
on). However, if“default”is specified, the value to be inherited as the default by
subsequently-loaded files is set (or reported).

Can be temporarily disabled by a prepended‘=’,as described in the INPUT SYNTAX section.

clear|cls
Attempts to clear the screen. If you're using a kterm it'll just output the appropriate tty
control sequence. Otherwise it'll try to run the“clear”command.

cmdchar ['one-byte-char']
The default command-introduction character is a space, but it may be changed via this
command. The single quotes surrounding the character are required. If no argument is given,
the current value is printed.

An input line consisting of a single question mark will also print the current value
(useful for when you don't know the current value).

Woe to the one that sets the command-introduction character to one of the other special
input-line characters, such as‘+’,‘/’, etc.

combine ["name"] [ num += ] slotnum ...
Creates or adds file slots to a combination slot (see the COMBINATION SLOTS section for
general information). Note that“combo”may be used as the command as well.

Assuming for this example that slots 0-2 are loaded with the files curly, moe, and larry,
we can create a combination slot that will reference all three:

combo "three stooges" 2, 0, 1

The command will report

creating combo slot #3 (three stooges): 2 0 1

The name is optional. It will appear in the files list, and may also be used to specify
the slot as an argument to the select command.

A search via the newly created combo slot would search in the order specified on the combo
command line: first larry, then curly, and finally moe.

If you later load another file (say, jeffrey to slot #4), you can then add it to the
previously made combo:

combo 3 += 4

(the“+=”wording comes from the C programming language where it means“add on to”). Adding
to a combination always adds slots to the end of the list.

You can take the opportunity of adding the slot to also change the name, if you like:

combo "four stooges" 3 += 4

The reply would be
adding to combo slot #3(four stooges): 4

A file slot can be a component of any particular combo slot only once. When reporting the
created or added slot numbers, the number will appear in parentheses if it had already been
a member of the list.

Furthermore, only file slots can be component members of combo slots. Attempting to combine
combo slot X to combo slot Y will result in having X's component file slots (rather than the
combo slot itself) added to Y.
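
The membership rules for combo slots (append at the end, each file slot at most once, combo slots expanded into their components) can be sketched as follows. The function name add_to_combo and the combos mapping are hypothetical and only model the bookkeeping, not the command itself.

```python
def add_to_combo(combo, new_slots, combos=None):
    """Append slots to the combo list in place.

    Returns (added, repeated); repeated holds slots that were already
    members, which the command would report in parentheses.
    """
    combos = combos or {}  # maps a combo slot number to its component list
    added, repeated = [], []
    for slot in new_slots:
        # A combo slot contributes its component file slots, not itself.
        for component in combos.get(slot, [slot]):
            if component in combo:
                repeated.append(component)
            else:
                combo.append(component)
                added.append(component)
    return added, repeated

stooges = []
add_to_combo(stooges, [2, 0, 1])  # combo "three stooges" 2, 0, 1
add_to_combo(stooges, [4])        # combo 3 += 4
```

After the two calls the search order is 2, 0, 1, 4; adding slot 0 again would leave the list unchanged and report it as a repeat.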

command debug [boolean]
Sets the internal command parser debugging flag on or off (default is off).

debug [boolean]
Sets the internal general-debugging flag on or off (default is off).

describe specifier
This command will tell you how a character (or each character in a string) is encoded in
the various encoding methods:

lookup command> describe "気"
“気”as EUC is 0xb5a4 (181 164; 265 \244)
as JIS is 0x3524 ( 53 36; 65 \044 "5$")
as KUTEN is 2104 ( 0x1504; 25 \004)
as S-JIS is 0x8b1f (139 31; 213 \037)

The quotes surrounding the character or string to describe are optional. You can also give
a regular ASCII character and have the double-width version of the character described....
indicating“A”, for example, would describe“Ａ”. Specifier can also be a four-digit kuten
value, in which case the character with that kuten will be described.

If a four-digit specifier has a hex digit in it, or if it is preceded by“0x”, the value is
taken as a JIS code. You can precede the value by“jis”,“sjis”,“euc”, or“kuten”to force
interpretation to the requested code.

Finally, specifier can be a string of stripped JIS (JIS w/o the kanji-in and kanji-out
codes, or with the codes but without the escape characters in them). For
example“F|K\”would describe the two characters 日 and 本.
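
The relationship between the EUC, JIS, and kuten values shown above can be reproduced with Python's standard euc_jp codec. This is a sketch for two-byte JIS X 0208 characters only; the function names describe and from_stripped_jis are hypothetical, and Shift-JIS is deliberately left out.

```python
def describe(ch):
    euc = ch.encode("euc_jp")
    if len(euc) != 2:
        raise ValueError("expected a two-byte JIS X 0208 character")
    # JIS is EUC with the high bit of each byte cleared.
    jis = bytes(byte & 0x7F for byte in euc)
    # Kuten is the row/column pair: each JIS byte minus 0x20, in decimal.
    kuten = "%02d%02d" % (jis[0] - 0x20, jis[1] - 0x20)
    return {"euc": euc.hex(), "jis": jis.hex(), "kuten": kuten}

def from_stripped_jis(text):
    # Stripped JIS is plain ASCII; setting the high bits gives EUC bytes.
    raw = bytes(byte | 0x80 for byte in text.encode("ascii"))
    return raw.decode("euc_jp")
```

For 気 this gives EUC b5a4, JIS 3524, and kuten 2104, in line with the output above, and the stripped-JIS string“F|K\”decodes to 日本.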

encoding [euc|sjis|jis]
The same as the -euc, -jis, and -sjis command-line options, sets the encoding method for
interactive input and output (or reports the current status). Finer control over the output
encoding can be achieved with the output encoding command. A separate encoding for input
can be set with the input encoding command.

files [ - | long ]
Lists what files are loaded in what slots, and some status information about them, as with:

┏━┳━━━━━┯━━┳━━━┳━━━━━━━━━━━━━━
┃ 0┃F wcf h d │a I ┃ 2762k┃/usr/jfriedl/lib/edict
┃ 1┃FM cf d │a I ┃ 705k┃/usr/jfriedl/lib/kanjidic
┃ 2┃F cfWh@d │a ┃ 1k┃/usr/jfriedl/lib/local.words
┃*3┃FM cf htd │a ┃ combo┃kotoba (#2, #0)
┃ 4┃ cf d │a ┃ 205k┃/usr/dict/words
┗━┻━━━━━┷━━┻━━━┻━━━━━━━━━━━━━━

The first section is the slot number, with a“*”beside the default slot (as set by the
select command).

The second section shows per-slot flags and status. Letters are shown if the flag is on,
omitted if off. In the list below, related commands are given for each item:

F … if there is a filter {but '#' if disabled}. (filter)
M … if there is a modify spec {but '%' if disabled}. (modify)
w … if word-preference mode is turned on. (word)
c … if case folding is turned on. (fold)
f … if fuzzification is turned on. (fuzz)
W … if wildcard-pattern mode is turned on (wildcard)
h … if highlighting is turned on. (highlight)
t … if there is a tag {but @ if disabled} (tag)
d … if found lines should be displayed (display)
─────────────────────────────────
a … if autokana is turned on (autokana)
P … if there is a file-specific local prompt (prompt)
I … if the file is loaded with a precomputed index (load)
d … if the display flag is on (display)

Note that the letters in the upper section directly correspond to the“!!”sequence
characters described in the INPUT SYNTAX section.

If there is a digit at the end of the flag section, it indicates that only #/10 of the file
is actually loaded into memory (as opposed to the file having been completely loaded).
Unloaded files will be loaded while lookup is idle, or when first used.

If the slot is a combination slot (as slot #3 is in the example above), that is noted in
the third section, and the combination name and component slot numbers are noted in the
fourth. Also, for combination slots (which have no filter or modify specifications, only
the flags), F and/or M are shown if the corresponding mode is allowed during searches via
the combo slot. See the tag command for info about t with respect to combination slots.

If an argument (either“-”or“long”will work) is given to the command, a short message about
what the flags mean is also printed.

filter ["label"] [!] /regex/[i]
Sets the filter for the selected slot (which must contain a file and not a combination).
If a filter is set and active for a file, any line matching the given regex is filtered
from the output (if the‘!’is put before the regex, any line not matching the regex is
filtered). The label, which isn't required, merely acts as documentation in various
diagnostics.

As an example, consider that edict lines often have“(pn)”on them to indicate that the given
English is a place name. Often these place names can be a bother, so it would be nice to
elide them from the output unless specifically requested. Consider the example:

lookup command> filter "name" /(pn)/
search [edict]> [きの]
機能 [きのう] /function/faculty/
帰納 [きのう] /inductive/
昨日 [きのう] /yesterday/
≫3 "name" lines filtered≪

In the example,‘/’characters are used to delimit the start and stop of the regex (as is
common with many programs). However, any character can be used. A final‘i’, if present,
indicates that the regex should be applied in a case-insensitive manner.

The filter, once set, can be enabled or disabled with the other form of the“filter”command
(described below). It can also be temporarily turned off (or, if disabled, temporarily
turned on) by the“!F!”line prefix.

Filtered lines can optionally be saved and then displayed if you so desire. See the“saved
list size”and“show”commands.

Note that if you have saving enabled and only one line would be filtered, it is simply
printed at the end (rather than printing a one-line message about how one line was filtered).

By the way, a better“name”filter for edict would be:

filter "name" #^[^/]+/[^/]*<p[ln]>[^/]*/$#

as it would filter all entries that had only one English section, that section being a
name. It is also an example of using something other than‘/’to delimit a regex, as it
makes things a bit easier to read.
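
The effect of a filter can be sketched in Python. This is a rough model under stated
assumptions, not lookup's implementation: apply_filter and NAME_ONLY are hypothetical names,
Python's \b stands in for lookup's‘<’and‘>’word anchors, and the sample entries are made up
in edict's layout purely for illustration.

```python
import re


def apply_filter(lines, pattern, negate=False, ignore_case=False):
    """Split lines into (shown, filtered) the way a slot filter would."""
    rx = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    shown, filtered = [], []
    for line in lines:
        hit = rx.search(line) is not None
        # With negate (the '!' form), lines that do NOT match are filtered.
        (filtered if hit != negate else shown).append(line)
    return shown, filtered


# Python spelling of the "better" name filter given above.
NAME_ONLY = r"^[^/]+/[^/]*\bp[ln]\b[^/]*/$"

# Made-up entries, for illustration only.
SAMPLE = [
    "AAA [aaa] /Sometown (pn)/",
    "BBB [bbb] /Othertown (pn)/a kind of hat/",
    "CCC [ccc] /function/faculty/",
]
```

Here apply_filter(SAMPLE, NAME_ONLY) withholds only the first entry, since the second has
another English section besides the name, while the simpler pattern "(pn)" withholds both.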

filter [boolean]
Enables or disables the filter for the selected slot. If no argument is given, displays
the current filter and status.

[default] fold [boolean]
The selected slot's case folding is turned on or off (default is on), or reported if no
argument given. However, if“default”is specified, the value to be inherited as the default
by subsequently-loaded files is set (or reported).

Can be temporarily toggled by the“!c!”line prefix.

[default] fuzz [boolean]
The selected slot's fuzzification is turned on or off (default is on), or reported if no
argument given. However, if“default”is specified, the value to be inherited as the default
by subsequently-loaded files is set (or reported).

Can be temporarily toggled by the“!f!”line prefix.

help [regex]
Without an argument gives a short help list. With an argument, lists only commands whose
help string is picked up by the given regex.

[default] highlight [boolean]
Sets matched-string highlighting on or off for the selected slot (default off), or reports
the current status if no argument is given. However, if“default”is specified, the value to
be inherited as the default by subsequently-loaded files is set (or reported).

If on, shows in bold or reverse video (see below) that part of the line which was matched
by the search regex. If multiple regexes were given, that part matched by the first regex
is shown.

Note that a regex might match a portion of a line which is later removed by a modify
parameter. In this case, no highlighting is done.

Can be temporarily toggled by the“!h!”line prefix.

highlight style [bold | inverse | standout | <___>]
Sets the style of highlighting for when highlighting is done. Inverse (inverse video) and
standout are the same. The default is bold. You can also give an HTML tag, such
as“<BOLD>”and items will be wrapped by <BOLD>...</BOLD>. This would be particularly useful
when the output is going to a CGI, as when lookup has been built in a server configuration.

Note that the highlighting is effected by using raw VT100/xterm control sequences. This
isn't very nice if your terminal doesn't understand them. Sorry.

if {expression} command...

If the evaluated expression is non-zero, the command will be executed.

Note that {} rather than () surround the expression.

Expression may be composed of numbers, operators, parentheses, etc. In addition to the
normal +, -, *, and /, there are:

x && y …
!x …‘not’Yields 1 if x is zero, 0 if non-zero.
x & y …‘and’Yields 1 if both x and y are non-zero, 0 otherwise.
x | y …‘or’ Yields 1 if x or y (or both) is non-zero, 0 otherwise

There may also be the special tokens true and false which are 1 and 0 respectively.

There are also checked, matched, printed, nonword, and filtered which correspond to the
values printed by the stats command.

An example use might be the following kind of thing in a computer-generated script:

!d!expect this line
if {!printed} msg Oops! couldn't find "expect this line"
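
The truth values of the‘not’,‘and’, and‘or’operators, and the way the if command gates its
target command, can be modelled in a few lines of Python. The names l_not, l_and, l_or, and
run_if are hypothetical, and the sketch does not parse {expression} text; it only shows the
documented results.

```python
def l_not(x):
    """'not': 1 if x is zero, 0 if non-zero."""
    return 1 if x == 0 else 0


def l_and(x, y):
    """'and': 1 if both x and y are non-zero, 0 otherwise."""
    return 1 if x != 0 and y != 0 else 0


def l_or(x, y):
    """'or': 1 if x or y (or both) is non-zero, 0 otherwise."""
    return 1 if x != 0 or y != 0 else 0


def run_if(value, command, run):
    """Call run(command) only when value is non-zero, like the if command."""
    if value != 0:
        run(command)
```

In the script example above, printed would be 0 when the expected line was not found, so
l_not(printed) is 1 and the msg command runs.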

input encoding [ euc | sjis ]
Used to set (or report) what encoding to use when 8-bit bytes are found in the interactive
input (all flavors of JIS are always recognized). Also see the encoding and output
encoding commands.

limit [value]
Sets the number of lines to print during any search before aborting (or reports the current
number if no value given). Default is 100.

Output limiting is disabled if set to zero.

log [ to [+] file ]
Begins logging the program output to file (the Japanese encoding method being the same as
for screen output). If“+”is given, the log is appended to any text that might have
previously been in file, in which case a leading dashed line is inserted into the file.

If no arguments are given, reports the current logging status.

log - | off
If only“-”or off is given, any currently-opened log file is closed.

load [-now|-whenneeded] "filename"
Loads the named file to the next available slot. If a precomputed index is found
(as“filename.jin”)it is loaded as well. Otherwise, an index is generated internally.

The file to be loaded (and the index, if loaded) will be loaded during idle times. This
allows a startup file to list many files to be loaded, but not have to wait for each of
them to load in turn. Using the “-now”flag causes the load to happen immediately, while
using the “-whenneeded”option (can be shortened to “-wn”)causes the load to happen only
when the slot is first accessed.

Invoke lookup as
% lookup -writeindex filename
to generate and write an index file, which will then be automatically used in the future.

If the file has already been loaded, the file is not re-read, but the previously-read file
is shared. The new slot will, however, have its own separate flags, prompt, filter, etc.

modify /regex/replace/[ig]
Sets the modify parameter for the selected file. If a file has a modify parameter
associated with it, each line selected during a search will have that part of the line
which matches regex (if any) replaced by the replacement string before being printed.

Like the filter command, the delimiter need not be‘/’; any non-space character is fine. If
a final‘i’is given, the regex is applied in a case-insensitive manner. If a final‘g’is
given, the replacement is done to all matches in the line, not just the first part that
might match regex.

The replacement may have embedded“\1”, etc. in it to refer to parts of the matched text (see
the tutorial on regular expressions).

The modify parameter, once set, may be enabled or disabled with the other form of the
modify command (described below). It may also be temporarily toggled via the“!m!”line
prefix.

A silly example for the ultra-nationalist might be:
modify /<Japan>/Dainippon Teikoku/g
So that a line such as
日銀 [にちぎん] /Bank of Japan/
would come out as
日銀 [にちぎん] /Bank of Dainippon Teikoku/

As a real example of the modify command with kanjidic, consider that it is likely that one
is not interested in all the various fields each entry has. The following can be used to
remove the info on the U, N, Q, M, E, B, C, and Y fields from the output:

modify /( [UNQMECBY]\S+)+//g,1

It's sort of complex, but works. Note that here the replacement part is empty, meaning to
just remove those parts which matched. The result of such a search of 日 would normally
print

日 467c U65e5 N2097 B72 B73 S4 G1 H3027 F1 Q6010.0 MP5.0714 ＼
MN13733 E62 Yri4 P3-3-1 ニチ ジツ ひ -び -か {day}

but with the above modify spec, appears more simply as

日 467c S4 G1 H3027 F1 P3-3-1 ニチ ジツ ひ -び -か {day}
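
The same search-and-replace can be tried outside lookup with Python's re.sub. This is a
sketch under assumptions: apply_modify is a hypothetical helper, Python's regex flavor is
used in place of lookup's (so‘<’and‘>’become \b), and the final‘g’and‘i’are modelled as
keyword arguments.

```python
import re


def apply_modify(line, pattern, replacement, global_=False, ignore_case=False):
    """Replace the first match (or every match, if global_) of pattern in line."""
    flags = re.IGNORECASE if ignore_case else 0
    count = 0 if global_ else 1      # for re.sub, count=0 means "replace every match"
    return re.sub(pattern, replacement, line, count=count, flags=flags)


# The kanjidic entry quoted above, joined onto one line.
KANJIDIC_LINE = ("日 467c U65e5 N2097 B72 B73 S4 G1 H3027 F1 Q6010.0 MP5.0714 "
                 "MN13733 E62 Yri4 P3-3-1 ニチ ジツ ひ -び -か {day}")

print(apply_modify(KANJIDIC_LINE, r"( [UNQMECBY]\S+)+", "", global_=True))
print(apply_modify("日銀 [にちぎん] /Bank of Japan/", r"\bJapan\b",
                   "Dainippon Teikoku", global_=True))
```

The first call prints the shortened kanjidic line shown above; the second prints the
modified 日銀 entry from the earlier example.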

modify [boolean]
Enables or disables the modify parameter for the selected file, or reports the current
status if no argument is given.

msg string
The given string is printed.

Most likely used in a script as the target command of an if command.

output encoding [ euc | sjis | jis...]
Used to set exactly what kind of encoding should be used for program output (also see the
input encoding command). Used when the encoding command is not detailed enough for one's
needs.

If no argument is given, reports the current output encoding. Otherwise, arguments can
usually be any reasonable dash-separated combination of:

euc
Selects EUC for the output encoding.

sjis
Selects Shift-JIS for the output encoding.

jis[78|83|90][-ascii|-roman]
Selects JIS for the output encoding. If no year (78, 83, or 90) given, 78 is used.
Can optionally specify that“English”should be encoded as regular ASCII (the default
when JIS selected) or as JIS-ROMAN.

212
Indicates that JIS X0212-1990 should be supported (ignored for Shift-JIS output).

no212
Indicates that JIS X0212-1990 should not be supported (default setting). This
places JIS X0212-1990 characters under the domain of disp, nodisp, code, or mark
(described below).

hwk
Indicates that half width kana should be left as-is (default setting).

nohwk
Indicates that half width kana should be stripped from the output. (not yet
implemented).

foldhwk
Indicates that half width kana should be folded to their full-width counterparts.
(not yet implemented).

disp
Indicates that non-displayable characters (such as JIS X0212-1990 while the output
encoding method is Shift-JIS) should be passed along anyway (most likely resulting in
screen garbage).

nodisp
Indicates that non-displayable characters should be quietly stripped from the output.

code
Indicates that non-displayable characters should be printed as their octal codes
(default setting).

mark
Indicates that non-displayable characters should be printed as“★”.

Of course, not all options make sense in all combinations, or at all times. When the
current (or new) output encoding is reported, a complete and exact specifier representing
the selected output encoding is shown. An example might be“jis78-ascii-no212-hwk-code”.

pager [ boolean | size ]
Turns on or off an output pager, sets its idea of the screen size, or reports the current
status.

Size can be a single number indicating the number of lines to be printed
between“MORE?”prompts (usually a few lines less than the total screen height, the default
being 20 lines). It can also be two numbers in the form“#x#”where the first number is the
width (in half-width characters; default 80) and the second is the lines-per-page as above.

If the pager is on, every page of output will result in a“MORE?”prompt, at which there are
four possible responses. A space will allow one more full page to print. A return will
allow one more line. A‘c’(for“continue”) will allow all the rest of the output (for the current
command) to proceed without pause, while a‘q’(for“quit”) will flush the output for the
current command.

If supported by the OS, the pager size parameters are set appropriately from the window
size upon startup or window resize.

The default pager status is“off”.

[local] prompt "string"
Sets the prompt string. If“local”is indicated, sets the prompt string for the selected
slot only. Otherwise, sets the global default prompt string.

Prompt strings may have the special %-sequences shown below, with related commands given in
parentheses:

%N … the default slot's file or combo name.
%n … like %N, but any leading path is not shown if a filename.
%# … the default slot's number.
%S … the“command-introduction”character (cmdchar)
%0 … the running program's name
%F='string' … string shown if filtering enabled (filter)
%M='string' … string shown if modification enabled (modify)
%w='string' … string shown if word mode on (word)
%c='string' … string shown if case folding on (fold)
%f='string' … string shown if fuzzification on (fuzz).
%W='string' … string shown if wildcard-pat. mode on (wildcard).
%d='string' … string shown if displaying on (display).
%C='string' … string shown if currently entering a command.
%l='string' … string shown if logging is on (log).
%L … the name of the current output log, if any (log)

For the tests (%f, etc), you can put‘!’just after the‘%’to reverse the sense of the test
(e.g. %!f="no fuzz"). The reverse of %F is if a filter is installed but disabled (i.e.
string will never be shown if there is no filter for the default file). The modify %M
works comparably.

Also, you can use an alternative form for the items that take an argument string. Replacing
the quotes with parentheses will treat string as a recursive prompt specifier. For example,
the specifier

%C='command'%!C(%f='fuzzy 'search:)

would result in a“command”prompt if entering a command, while it would result in either
a“fuzzy search:”or a“search:”prompt if not entering a command. The parenthesized
constructs may be nested.

Note that the letters of the test constructs are the same as the letters for
the“!!”sequences described in INPUT SYNTAX.

An example of a nice prompt command might be:

prompt "%C(%0 command)%!C(%w'*'%!f'raw '%n)> "

With this prompt specification, the prompt would normally appear as“filename> ”but when
fuzzification is turned off as“raw filename> ”. And if word-preference mode is on, the
whole thing has a“*”prepended. However if a command is being entered, the prompt would
then become“name command”, where name was the program's name (system dependent, but most
likely“lookup”).

The default prompt format string is“%C(%0 command)%!C(search [%n])> ”.
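
How a prompt specifier expands can be sketched in Python. This is a simplified model, not
lookup's parser: expand_prompt is a hypothetical function, it handles only the %X='string'
and %X(...) forms plus plain substitutions, and it does not cope with parentheses inside
quoted strings. The flags and values dictionaries are assumptions standing in for lookup's
internal state.

```python
def expand_prompt(spec, flags, values):
    """Expand a subset of the %-sequences described above.

    flags maps test letters (e.g. "C", "f") to booleans; values maps
    substitution letters (e.g. "n", "0") to strings.
    """
    out, i = [], 0
    while i < len(spec):
        if spec[i] != "%" or i + 1 >= len(spec):
            out.append(spec[i])
            i += 1
            continue
        i += 1
        negate = spec[i] == "!"
        if negate:
            i += 1
        key = spec[i]
        i += 1
        if key in values:                       # plain substitution such as %n
            out.append(values[key])
            continue
        wanted = flags.get(key, False) != negate
        if spec.startswith("='", i):            # %X='literal text'
            end = spec.index("'", i + 2)
            if wanted:
                out.append(spec[i + 2:end])
            i = end + 1
        elif spec.startswith("(", i):           # %X(nested prompt specifier)
            depth, j = 1, i + 1
            while depth:
                depth += {"(": 1, ")": -1}.get(spec[j], 0)
                j += 1
            if wanted:
                out.append(expand_prompt(spec[i + 1:j - 1], flags, values))
            i = j
    return "".join(out)
```

Given the default format string with %0 as“lookup”and %n as“edict”, this sketch yields the
two prompts seen in the examples in this manual, one while entering a command and one while
searching.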

regex debug [boolean]
Sets the internal regex debugging flag (turn on if you want billions of lines of stuff
spewed to your screen).

saved list size [value]
During a search, lines that match might be elided from the output due to filters or word-
preference mode. This command sets the number of such lines to remember during any one
search, such that they may be later displayed (before the next search) by the show command.

The default is 100.

select [ num | name | . ]
If num is given, sets the default slot to that slot number. If name is given, sets the
default slot to the first slot found with a file (or combination) loaded with that name.
The incantation“select .”merely sets the default slot to itself, which can be useful in
script files where you want to indicate that any subsequent flags changes should work with
whatever file was the default at the time the script was sourced.

If no argument is given, simply reports the current default slot (also see the files
command).

In command files loaded via the source command, or as the startup file, commands dealing
with per-slot items (flags, local prompt, filters, etc.) work with the file or slot last
selected. The last such selected slot remains selected once the load is complete.

Interactively, the default slot will become the selected slot for subsequent searches and
commands that aren't augmented with an appended“,#”(as described in the INPUT SYNTAX
section).

show
Shows any lines elided from the previous search (either due to a filter or word-preference
mode).

Will apply any modifications (see the“modify”command) if modifications are enabled for the
file. You can use the“!m!”line prefix as well with this command (in this case, put
the“!m!”before the command-indicator character).

The length of the list is controlled by the“saved list size”command.

source "filename"
Commands are read from filename and executed.

In the file, all lines beginning with“#”are ignored as comments (note that comments must
appear on a line by themselves, as“#”is a reasonable character to have within commands).

Lines whose first non-blank character is“=”,“!”,or“+”are considered searches, while all
other non-blank lines are considered lookup commands. Therefore, there is no need for
lines to begin with the command-introduction character. However, leading whitespace is
always OK.

For search lines, take care that any trailing whitespace is deleted if undesired, as
trailing whitespace (like all non-leading whitespace) is kept as part of the regular
expression.
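
The line-classification rules above can be summarized in a short Python sketch. The function
name classify is hypothetical, and treating an indented“#”line as a comment is an assumption
(the text only says that leading whitespace is always OK).

```python
def classify(line):
    """Classify one line of a command file as blank, comment, search, or command."""
    body = line.lstrip()            # leading whitespace is always OK
    if not body.strip():
        return "blank"
    if body.startswith("#"):
        return "comment"            # assumption: indentation before '#' is allowed
    if body[0] in "=!+":
        return "search"
    return "command"
```

Note that a“#”later in a line does not start a comment, so a filter command that
uses“#”as its regex delimiter is still classified as a command.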

Within a command file, commands that modify per-file flags and such always work with the
most-recently loaded (or selected) file. Therefore, something along the lines of

load "my.word.list"
set word on

load "my.kanji.list"
set word off
set local prompt "enter kanji> "

would work as might make intuitive sense.

Since a script file must have a load or select before any per-slot flag is set, one can
use“select .”to facilitate command scripts that are to work with“the current slot”.

spinner [value]
Set the value of the spinner (A silly little feature). If set to a non-zero value, will
cause a spinner to spin while a file is being checked, one increment per value lines in the
file actually checked against the search specifier. Default is off (i.e. zero).

stats
Shows information about how many lines of the text file were checked against the last
search specifier, and how many lines matched and were printed.

tag [boolean] ["string"]
Enable, disable, or set the tag for the selected slot.

If the slot is not a combination slot, a tag string may be set (the quotes are required).

If a tag string is set and enabled for a file, the string is prepended to each matching
output line printed.

Unlike the filter and modify commands which automatically enable the function when a
parameter is set, a tag is not automatically enabled when set. It can be enabled while
being set via“tag on "string"”or could be enabled subsequently via just“tag on”.

If the selected slot is a combination slot, only the enable/disable status may be changed
(on by default). No tag string may be set.

The reason for the special treatment lies in the special nature of how tags work in
conjunction with combination files.

During a search when the selected slot is a combination slot, each file which is a member
of the combination has its per-file flags disabled if their corresponding flag is disabled
in the original combination slot. This allows the combination slot's flags to act as
a“mask”to blot out each component file's per-file flags.

The tag flag, however, is special in that the component file's tag flag is turned on if the
combination slot's tag flag is turned on (and, of course, the component file has a tag
string registered).

The intended use of this is that one might set a (disabled) tag to a file, yet direct
searches against that file will have no prepended tag. However, if the file is searched as
part of a combination slot (and the combination slot's tag flag is on), the tag will be
prepended, allowing one to easily understand from which file an output line comes.
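
The masking rule, and the tag exception to it, can be written out as a small Python sketch.
The function effective_flags and the flag names used as dictionary keys are hypothetical;
lookup's internal representation is not documented here.

```python
def effective_flags(combo_flags, file_flags, file_tag_string=None):
    """Flags in force for one component file searched through a combo slot."""
    effective = {
        name: on and combo_flags.get(name, False)   # combo flag acts as a mask
        for name, on in file_flags.items()
        if name != "tag"
    }
    # The tag flag is special: the combo's setting turns it on, provided
    # the component file has a tag string registered.
    effective["tag"] = combo_flags.get("tag", False) and bool(file_tag_string)
    return effective
```

So a file whose tag is set but disabled prints no tag when searched directly, yet prints it
when searched through a combination slot whose tag flag is on.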

verbose [boolean]
Sets verbose mode on or off, or reports the current status (default on). Many commands
reply with a confirmation if verbose mode is turned on.

version
Reports the current version of the program.

[default] wildcard [boolean]
The selected slot's patterns are considered wildcard patterns if turned on, regular
expressions if turned off. The current status is reported if no argument given. However,
if“default”is specified, the pattern-type to be inherited as the default by subsequently-
loaded files is set (or reported).

Can be temporarily toggled by the“!W!”line prefix.

When wildcard patterns are selected, the changed metacharacters are:“*”means“any
stuff”,“?”means“any one character”,while“+”and“.”become unspecial. Other regex items such
as“|”,“(”,“[”,etc. are unchanged.

What“*”and“?”will actually match depends upon the status of word-mode, as well as on the
pattern itself. If word-mode is on, or if the pattern begins with the start-of-
word“<”or“[”,only non-spaces will be matched. Otherwise, any character will be matched.

In summary, when wildcard mode is on, the input pattern is affected in the following ways:

* is changed to the regular expression .* or \S*
? is changed to the regular expression . or \S
+ is changed to the regular expression \+
. is changed to the regular expression \.

Because filename patterns are often called“filename globs”,the command“glob”can be used in
place of“wildcard”.
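
The wildcard rules described above amount to a character-by-character translation, sketched
here in Python. The function glob_to_regex is hypothetical and only covers the four changed
metacharacters; everything else is passed through untouched, as the text says.

```python
def glob_to_regex(pattern, word_mode=False):
    """Translate a wildcard pattern into a regex (sketch of the rules above)."""
    # Non-spaces only if word-mode is on or the pattern starts with '<' or '['.
    nonspace = word_mode or pattern[:1] in ("<", "[")
    one = r"\S" if nonspace else "."
    table = {"*": one + "*", "?": one, "+": r"\+", ".": r"\."}
    return "".join(table.get(ch, ch) for ch in pattern)
```

For instance, the wildcard pattern gr?y becomes the regex gr.y, and with word-mode on a*b
becomes a\S*b.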

[default] word|wordpreference [boolean]
The selected file's word-preference mode is turned on or off (default is off), or the
current setting is reported if no argument is specified. However, if“default”is specified, the
value to be inherited as the default by subsequently-loaded files is set (or reported).

In word-preference mode, entries are searched for as if the search regex had a
leading‘<’and a trailing‘>’, resulting in a list of entries with a whole-word match of the
regex. However, if there are none, but there are non-word entries, the non-word entries
are shown (the“saved list”is used for this -- see that command). This makes it an“if there
are whole words like this, show me, otherwise show me whatever you've got”mode.

If there are both word and non-word entries, the non-word entries are remembered in the
saved list (rather than any possible filtered entries being remembered there).

One caveat: if a search matches a line in more than one place, and the first is not a
whole-word, while one of the others is, the line will be considered a non-whole-word match.
For example, the search「japan」with word-preference mode on will not list an entry such
as“/Japanese/language in Japan/”, as the first“Japan”is part of“Japanese”and not a whole
word. If you really need just whole-word entries, use the‘<’and‘>’yourself.

The mode may be temporarily toggled via the“!w!”line prefix.

The rules defining what lines are filtered, remembered, discarded, and shown for each
permutation of search are rather complex, but the end result is rather intuitive.
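
A rough Python model of word-preference mode, including the first-match caveat, is given
below. It is a sketch only: word_preference is a hypothetical function, filters are ignored,
and“whole word”is approximated by checking that the characters next to the match are not
alphanumeric.

```python
import re


def word_preference(lines, pattern):
    """Return (shown, saved) for a word-preference search (rough sketch)."""
    rx = re.compile(pattern, re.IGNORECASE)
    whole, partial = [], []
    for line in lines:
        m = rx.search(line)
        if m is None:
            continue
        # Only the FIRST match on the line is examined, as in the caveat above.
        before = line[m.start() - 1:m.start()]
        after = line[m.end():m.end() + 1]
        is_whole = not (before.isalnum() or after.isalnum())
        (whole if is_whole else partial).append(line)
    if whole:
        return whole, partial      # non-word entries go to the saved list
    return partial, []             # no whole-word entries: show what there is
```

With made-up entries ending in“/Japan/”and“/Japanese/language in Japan/”, a search
for“japan”shows the first and saves the second; if only the second entry exists, it is
shown.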

quit | leave | bye | exit
Exits the program.

STARTUP FILE

   If the file“~/.lookup”is present, commands are read from it during lookup startup.

   The  file  is  read in the same way as the source command reads files (see that entry for more
   information on file format, etc.)

   However, if there had been files  loaded  via  command-line  arguments,  commands  within  the
   startup  file  to load files (and their associated commands such as to set per-file flags) are
   ignored.

   Similarly, any use of the command-line flags -euc, -jis, or -sjis will disable in the  startup
   file the commands dealing with setting the input and/or output encodings.

   The  special  treatment  mentioned in the above two paragraphs only applies to commands within
   the startup file itself, and does not apply to commands in command-files that might be sourced
   from within the startup file.

   The following is a reasonable example of a startup file:
     ## turn verbose mode off during startup file processing
     verbose off

     prompt "%C([%#]%0)%!C(%w'*'%!f'raw '%n)> "
     spinner 200
     pager on

     ## The filter for edict will hit for entries that
     ## have only one English part, and that English part
     ## having a pl or pn designation.
     load ~/lib/edict
     filter "name" #^[^/]+/[^/]*<p[ln]>[^/]*/$#
     highlight on
     word on

     ## The filter for kanjidic will hit for entries without a
     ## frequency-of-use number.  The modify spec will remove
     ## fields with the named initial code (U,N,Q,M,E, and Y)
     load ~/lib/kanjidic
     filter "uncommon" !/<F\d+>/
     modify /( [UNQMEY]\S+)+//g

     ## Use the same filter for my local word file,
     ## but turn off by default.
     load ~/lib/local.words
     filter "name" #^[^/]+/[^/]*<p[ln]>[^/]*/$#
     filter off
     highlight on
     word on
     ## Want a tag for my local words, but only when
     ## accessed via the combo below
     tag off "》"

     combine "words" 2 0
     select words

     ## turn verbosity back on for interactive use.
     verbose on

COMMAND-LINE ARGUMENTS

With the use of a startup file, command-line arguments are rarely needed. In practical use,
they are only needed to create an index file, as in:

lookup -write textfile

Any command line arguments that aren't flags are taken to be files which are loaded in turn
during startup. In this case, any“load”,“filter”, etc. commands in the startup file are
ignored.

The following flags are supported:

-help
Reports a short help message and exits.

-write
Creates index files for the named files and exits. No startup file is read.

-euc
Sets the input and output encoding method to EUC (currently the default). Exactly the same
as the“encoding euc”command.

-jis
Sets the input and output encoding method to JIS. Exactly the same as the“encoding
jis”command.

-sjis
Sets the input and output encoding method to Shift-JIS. Exactly the same as the“encoding
sjis”command.

-v -version
Prints the version string and exits.

-norc
Indicates that the startup file should not be read.

-rc file
The named file is used as the startup file, rather than the default“~/.lookup”. It is an
error for the file not to exist.

-percent num
When an index is built, letters that appear on more than num percent (default 50) of the
lines are elided from the index. The thought is that if a search will have to check most
of the lines in a file anyway, one may as well save the large amount of space in the index
file needed to represent that information, and the time/space tradeoff shifts, as the
indexing of oft-occurring letters provides a diminishing return.

Smaller indexes can be made by using a smaller number.
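
The idea behind -percent can be illustrated with a toy index in Python. This says nothing
about lookup's real index format, which is not described here; build_index is a hypothetical
function that maps each character to the lines containing it and then drops the characters
that appear on too many lines.

```python
def build_index(lines, percent=50):
    """Map each character to the set of line numbers containing it,
    dropping characters found on more than `percent` percent of the lines."""
    index = {}
    for number, line in enumerate(lines):
        for ch in set(line):
            index.setdefault(ch, set()).add(number)
    cutoff = len(lines) * percent / 100
    return {ch: found for ch, found in index.items() if len(found) <= cutoff}
```

A search for a dropped character simply has to check every line, which it would mostly have
had to do anyway.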

-noindex
Indicates that any files loaded via the command line should not be loaded with any
precomputed index, but recalculated on the fly.

-verbose
Has metric tons of stats spewed whenever an index is created.

-port ###
For the (undocumented) server configuration only, tells which port to listen on.

OPERATING SYSTEM CONSIDERATIONS

I/O primitives and behaviors vary with the operating system. On my operating system, I
can“read”a file by mapping it into memory, which is a pretty much instant procedure regardless
of the size of the file. When I later access that memory, the appropriate sections of the
file are automatically read into memory by the operating system as needed.

This results in lookup starting up and presenting a prompt very quickly, but causes the first
few searches that need to check a lot of lines in the file to go more slowly (as lots of the
file will need to be read in). However, once the bulk of the file is in, searches will go very
fast. The win here is that the rather long file-load times are amortized over the first few
(or few dozen, depending upon the situation) searches rather than always faced right at
command startup time.

On the other hand, on an operating system without the mapping ability, lookup would start up
very slowly as all the files and indexes are read into memory, but would then search quickly
from the beginning, all the file already having been read.

To get around the slow startup, particularly when many files are loaded, lookup uses lazy
loading if it can: a file is not actually read into memory at the time the load command is
given. Rather, it will be read when first actually accessed. Furthermore, files are loaded
while lookup is idle, such as when waiting for user input. See the files command for more
information.
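
The memory-mapping behavior described above can be tried with Python's mmap module. This is
only an illustration of the operating-system facility, not lookup's code, and map_file is a
hypothetical helper.

```python
import mmap


def map_file(path):
    """Map a non-empty file read-only; the OS reads pages when first touched."""
    with open(path, "rb") as handle:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
```

Creating the mapping returns almost at once regardless of file size; the cost of reading is
paid later, when a search such as mapped.find(b"...") first touches each part of the file.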

REGULAR EXPRESSIONS, A BRIEF TUTORIAL

Regular expressions (“regex”for short) are a“code”used to indicate what kind of text you're
looking for. They're how one searches for things in the editors“vi”,“stevie”,“mifes”etc., or
with the grep commands. There are differences among the various regex flavors in use -- I'll
describe the flavor used by lookup here. Also, in order to be clear for the common case, I
might tell a few lies, but nothing too heinous.

The regex「a」means“any line with an‘a’in it.” Simple enough.

The regex「ab」means“any line with an‘a’immediately followed by a‘b’”. So the line
I am feeling flabby
would“match”the regex「ab」because, indeed, there's an“ab”on that line. But it wouldn't match
the line

this line has no a followed _immediately_ by a b

because, well, what the line says is true.

In most cases, letters and numbers in a regex just mean that you're looking for those letters
and numbers in the order given. However, there are some special characters used within a
regex.

A simple example would be a period. Rather than indicate that you're looking for a period, it
means“any character”. So the silly regex「.」would mean“any line that has any character on
it.”Well, maybe not so silly... you can use it to find non-blank lines.

But more commonly it's used as part of a larger regex. Consider the regex「gray」. It wouldn't
match the line

The sky was grey and cloudy.

because of the different spelling (grey vs. gray). But the regex「gr.y」asks for“any line
with a‘g’,‘r’, some character, and then a‘y’”. So this would get“grey”and“gray”.

A special construct somewhat similar to‘.’would be the character class. A character class
starts with a‘[’and ends with a‘]’, and will match any character given in between. An
example might be
gr[ea]y

which would match lines with a‘g’,‘r’, an‘e’or an‘a’, and then a‘y’. Inside a character class
you can list as many characters as you want to.
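
If you'd like to try the patterns so far, here is a small sketch in Python, whose re module agrees with lookup on these simple constructs. The helper name matching_lines and the sample lines are made up for illustration; none of this is part of lookup itself.

```python
import re

def matching_lines(pattern, lines):
    """Hypothetical helper: return the lines that contain a match for pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

lines = [
    "I am feeling flabby",
    "this line has no a followed _immediately_ by a b",
    "The sky was grey and cloudy.",
    "A gray cat sat on the mat.",
    "",
]

ab_hits = matching_lines(r"ab", lines)          # literal 'a' then 'b'
gray_hits = matching_lines(r"gray", lines)      # one spelling only
dot_hits = matching_lines(r"gr.y", lines)       # '.' stands for any character
class_hits = matching_lines(r"gr[ea]y", lines)  # 'e' or 'a' only
nonblank = matching_lines(r".", lines)          # any line with something on it

for line in class_hits:
    print(line)
```

Note that「gr.y」is looser than「gr[ea]y」: it would also accept something like“gr8y”, which the character class rejects.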

For example the simple regex「x[0123456789]y」would match any line with a digit sandwiched
between an‘x’and a‘y’.

The order of the characters within the character class doesn't really
matter...「[513467289]」would be the same as「[0123456789]」.

But as a short cut, you could put「[0-9]」instead of「[0123456789]」. So the character
class「[a-z]」would match any lower-case letter, while the character class「[a-zA-Z0-9]」would
match any letter or digit.

The character‘-’is special within a character class, but only if it's not the first thing.
Another character that's special in a character class is‘^’, if it is the first thing.
It“inverts”the class so that it will match any character not listed. The
class「[^a-zA-Z0-9]」would match any line with spaces or punctuation on it.
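
Ranges and inverted classes can be checked the same way. Again, this is just a Python sketch under the assumption that its character classes behave like lookup's for plain ASCII; the helper hits and the sample strings are invented for the example.

```python
import re

digit_between = re.compile(r"x[0-9]y")   # same as x[0123456789]y
not_alnum = re.compile(r"[^a-zA-Z0-9]")  # anything but a letter or digit
dash_first = re.compile(r"[-az]")        # '-' comes first, so it is a literal dash

def hits(regex, lines):
    """Hypothetical helper: keep the lines on which regex matches somewhere."""
    return [line for line in lines if regex.search(line)]

digit_hits = hits(digit_between, ["x7y", "xay", "x42y"])
punct_hits = hits(not_alnum, ["abc123", "abc 123", "done!"])
dash_hits = hits(dash_first, ["m-n", "mno", "jazz"])

print(digit_hits, punct_hits, dash_hits)
```

The string“x42y”is rejected because the class stands for exactly one character, and there are two digits between the‘x’and the‘y’.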

There are some special short-hand sequences for some common character classes. The
sequence「\d」means“digit”, and is the same as「[0-9]」. 「\w」means“word element”and is the
same as「[0-9a-zA-Z_]」. 「\s」means“space-type thing”and is the same as「[ \t]」(「\t」means
tab).

You can also use「\D」,「\W」, and「\S」to mean things not a digit, word element, or space-
type thing.
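
The short-hand sequences exist in Python as well, so they can be tried there. One assumption to keep in mind: Python's versions are wider by default (they know about Unicode), so this sketch passes re.ASCII to stay close to the classes described here. Even then, Python's「\s」covers a few more whitespace characters than just space and tab.

```python
import re

# re.ASCII limits \d, \w and \s to ASCII characters.
has_digit = re.compile(r"\d", re.ASCII)
has_nondigit = re.compile(r"\D", re.ASCII)
word_run = re.compile(r"\w+", re.ASCII)
space_gap = re.compile(r"\s+", re.ASCII)

text = "room_12\tis free"
words = word_run.findall(text)  # runs of word elements
gaps = space_gap.findall(text)  # runs of space-type things

print(words)
print(gaps)
```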

Another special character would be‘?’. This means“maybe one of whatever was just before it,
but none is fine too”. In the regex「bikes? for rent」, the“whatever”would be the‘s’, so this
would match lines with either“bikes for rent”or“bike for rent”.

Parentheses are also special, and can group things together. In the regex

big (fat harry)? deal

the“whatever”for the‘?’would be“fat harry”. But be careful to pay attention to details...
this regex would match
I don't see what the big fat harry deal is!
but not
I don't see what the big deal is!

That's because if you take away the“whatever”of the‘?’, you end up with
big  deal
Notice that there are two spaces between the words, and the regex didn't allow for that. The
regex to get either line above would be
big (fat harry )?deal
or
big( fat harry)? deal
Do you see how they're essentially the same?
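
The two-space trap is easy to see in a quick experiment. This is a Python sketch (the variable names are mine), assuming that‘?’and parentheses behave in Python as they do in lookup.

```python
import re

lines = [
    "I don't see what the big fat harry deal is!",
    "I don't see what the big deal is!",
]

too_strict = re.compile(r"big (fat harry)? deal")  # needs two spaces when the group is absent
fixed_a = re.compile(r"big (fat harry )?deal")     # trailing space moved inside the group
fixed_b = re.compile(r"big( fat harry)? deal")     # leading space moved inside the group

strict_hits = [line for line in lines if too_strict.search(line)]
fixed_a_hits = [line for line in lines if fixed_a.search(line)]
fixed_b_hits = [line for line in lines if fixed_b.search(line)]

print(len(strict_hits), len(fixed_a_hits), len(fixed_b_hits))
```

Only the first line gets past the strict version, while both corrected versions accept both lines.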

Similar to‘?’is‘*’, which means“any number, including none, of whatever's right in front”. It
more or less means that whatever is tagged with‘*’is allowed, but not required, so something
like
I (really )*hate peas
would match“I hate peas”,“I really hate peas!”,“I really really hate peas”, etc.

Similar to both‘?’and‘*’is‘+’, which means“at least one of whatever just in front, but more is
fine too”. The regex「mis+pelling」would match“mispelling”,“misspelling”,“missspelling”, etc.
Actually, it's just the same as「miss*pelling」but simpler to type. The
regex「ss*」means“an‘s’, followed by zero or more‘s’”, while「s+」means“one or more‘s’”. Both
are really the same.
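
To convince yourself that「s+」and「ss*」really are the same, you can run both over a few spellings. A Python sketch, with a made-up word list, assuming‘*’and‘+’work in Python as described here:

```python
import re

peas = re.compile(r"I (really )*hate peas")
plus = re.compile(r"mis+pelling")
star = re.compile(r"miss*pelling")

spellings = ["mipelling", "mispelling", "misspelling", "missspelling"]
plus_hits = [word for word in spellings if plus.fullmatch(word)]
star_hits = [word for word in spellings if star.fullmatch(word)]

sentences = ["I hate peas", "I really really hate peas", "I love peas"]
peas_hits = [s for s in sentences if peas.search(s)]

print(plus_hits == star_hits)
print(peas_hits)
```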

The special character‘|’means“or”. Unlike‘+’,‘*’, and‘?’which act on the thing immediately
before, the‘|’is more“global”. The regex
give me (this|that) one
would match lines that had“give me this one”or“give me that one”in them.

You can even combine more than two:
give me (this|that|the other) one

How about:
[Ii]t is a (nice |sunny |bright |clear )*day

Here, the“whatever”immediately before the‘*’is
(nice |sunny |bright |clear )
So this regex would match all the following lines:
It is a day.
I think it is a nice day.
It is a clear sunny day today.
If it is a clear sunny nice sunny sunny sunny bright day then....
Notice how the「[Ii]t」matches either“It”or“it”?
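
Here is the same regex tried out in Python, as a sketch that assumes‘|’and‘*’combine there the way they do in lookup. The line about a cold day is my own addition, to show something that fails.

```python
import re

pick = re.compile(r"give me (this|that|the other) one")
day = re.compile(r"[Ii]t is a (nice |sunny |bright |clear )*day")

lines = [
    "It is a day.",
    "I think it is a nice day.",
    "It is a clear sunny day today.",
    "It is a cold day.",
]
day_hits = [line for line in lines if day.search(line)]

# group(1) holds whichever alternative inside the parens matched.
choice = pick.search("please give me the other one").group(1)

print(day_hits)
print(choice)
```

The last line fails because“cold ”is not one of the listed alternatives, so nothing can bridge the gap between“It is a ”and“day”.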

Note that the above regex would also match
fruit is a day
because it indeed fulfills all requirements of the regex, even though the“it”is really part of
the word“fruit”. To address concerns like this, which are common, there are‘<’and‘>’, which
mean“word break”. The regex「<it」would match any line with“it”beginning a word,
while「it>」would match any line with“it”ending a word. And, of course,「<it>」would match
any line with the word“it”in it.

Going back to the regex to find grey/gray, that would make more sense, then, as
<gr[ae]y>
which would match only the words“grey”and“gray”.

Somewhat similar are‘^’and‘$’, which mean“beginning of line”and“end of line”, respectively
(but not in a character class, of course). So the regex「^fun」would find any line that begins
with the letters“fun”, while「^fun>」would find any line that begins with the word“fun”.
「^fun$」would find any line that was exactly“fun”.

Finally,「^\s*fun\s*$」would match any line that was“fun”exactly, but perhaps also had leading
and/or trailing whitespace.
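
Word breaks and line anchors can be tried in Python too, with one translation: Python's re module has no‘<’or‘>’, so this sketch writes「\b」wherever the lookup regex would use them. That substitution, and the sample strings, are my assumptions for the example.

```python
import re

loose = re.compile(r"[Ii]t is a")       # also finds the "it" inside "fruit"
word_it = re.compile(r"\b[Ii]t is a")   # lookup: <[Ii]t is a
gray_word = re.compile(r"\bgr[ae]y\b")  # lookup: <gr[ae]y>
starts_fun = re.compile(r"^fun")        # begins with the letters "fun"
starts_word = re.compile(r"^fun\b")     # lookup: ^fun>
only_fun = re.compile(r"^fun$")         # exactly "fun"
padded_fun = re.compile(r"^\s*fun\s*$") # "fun" plus optional surrounding whitespace

print(bool(loose.search("fruit is a day")))    # True
print(bool(word_it.search("fruit is a day")))  # False
print(bool(gray_word.search("a greyhound")))   # False
print(bool(starts_word.search("funny")))       # False
print(bool(padded_fun.search("  fun\t")))      # True
```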

That's pretty much it. There are more complex things, some of which I'll mention in the list
below, but even with these few simple constructs one can specify very detailed and complex
patterns.

Let's summarize some of the special things in regular expressions:

Items that are basic units:
char        Any non-special character matches itself.
\char       Special chars, when preceded by \, become non-special.
.           Matches any one character (except \n).
\n          Newline.
\t          Tab.
\r          Carriage Return.
\f          Formfeed.
\d          Digit. Just a short-hand for [0-9].
\w          Word element. Just a short-hand for [0-9a-zA-Z_].
\s          Whitespace. Just a short-hand for [\t \n\r\f].
\## \###    Two or three digit octal number indicating a single byte.
[chars]     Matches a character if it's one of the characters listed.
[^chars]    Matches a character if it's not one of the ones listed.

The \char items above can be used within a character class,
but not the items below.

\D          Anything not \d.
\W          Anything not \w.
\S          Anything not \s.
\a          Any ASCII character.
\A          Any multibyte character.
\k          Any (not half-width) katakana character (including ー).
\K          Any character not \k (except \n).
\h          Any hiragana character.
\H          Any character not \h (except \n).
(regex)     Parens make the regex one unit.
(?:regex)   [from perl5] Grouping-only parens -- can't use for \# (below).
\c          Any JISX0208 kanji (kuten rows 16-84).
\C          Any character not \c (except \n).
\#          Match whatever was matched by the #th paren from the left.

With“☆”to indicate one“unit”as above, the following may be used:

☆?          A ☆ allowed, but not required.
☆+          At least one ☆ required, but more ok.
☆*          Any number of ☆ ok, but none required.

There are also ways to match“situations”:

\b          A word boundary.
<           Same as \b.
>           Same as \b.
^           Matches the beginning of the line.
$           Matches the end of the line.

Finally, the“or”is

reg1|reg2   Match if either reg1 or reg2 match.

Note that“\k”and the like aren't allowed in character classes, so
something such as「[\k\h]」to try to get all kana won't work.
Use 「(\k|\h)」instead.
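
Two items from the summary, the「\#」back-reference and the kana alternation, can be approximated in Python as a rough sketch. Python has no「\k」or「\h」, and it works on Unicode strings rather than EUC bytes, so the two ranges below are stand-ins I chose to cover the basic hiragana and full-width katakana; they are assumptions, not lookup's own definitions.

```python
import re

HIRAGANA = "[ぁ-ん]"    # assumed stand-in for \h
KATAKANA = "[ァ-ヶー]"  # assumed stand-in for \k, including the long-vowel mark ー

# Mirrors (\k|\h), using grouping-only parens since nothing refers back to it.
kana = re.compile(f"(?:{KATAKANA}|{HIRAGANA})+")

# \1 means "whatever the first paren matched", so this finds a doubled character.
doubled = re.compile(r"(\w)\1")

kana_runs = kana.findall("東京タワーに行く")
first_double = doubled.search("misspelling").group(0)

print(kana_runs)
print(first_double)
```

Unlike lookup, Python would also accept the two ranges merged into a single character class; the alternation is kept here only to mirror the form recommended above.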

BUGS

   Needs full support for half-width katakana and JIS X 0212-1990.
   Non-EUC (JIS & SJIS) items not tested well.
   Probably won't work on non-UNIX systems.
   Screen control codes (for clear and highlight commands) are hard-coded for ANSI/VT100/kterm.

AUTHOR

   Jeffrey Friedl (jfriedl@nff.ncl.omron.co.jp)

INFO

   Jim  Breen's  text  files  edict  and  kanjidic  and  their   documentation   can   be   found
   in“pub/nihongo”on ftp.cc.monash.edu.au (130.194.1.106).

   Information  on  input and output encoding and codes can be found in Ken Lunde's Understanding
   Japanese Information Processing (日本語情報処理) published by O'Reilly and  Associates.   ISBN
   1-56592-043-0.  There is also a Japanese edition published by SoftBank.

   A  program  to convert files among the various encoding methods is Dr. Ken Lunde's jconv, which
   can also be found on ftp.cc.monash.edu.au.  Jconv is  also  useful  for  converting  halfwidth
   katakana (which lookup doesn't yet support well) to full-width.

                                                                                        LOOKUP(1)