Provided by: wordnet-base_3.0-11ubuntu0.1_all

NAME

       index.noun,  data.noun,  index.verb,  data.verb,  index.adj,  data.adj,
       index.adv, data.adv - WordNet database files

       noun.exc, verb.exc, adj.exc, adv.exc - morphology exception lists

       sentidx.vrb, sents.vrb - files used by search code to display sentences
       illustrating the use of some specific verbs

DESCRIPTION

       For  each  syntactic  category,  two  files are needed to represent the
       contents of the WordNet database - index.pos and data.pos, where pos is
       noun,  verb,  adj  or  adv.   The other auxiliary files are used by the
       WordNet library’s searching functions and are needed to run the various
       WordNet browsers.

       Each  index  file  is  an  alphabetized  list of all the words found in
       WordNet in the corresponding part of speech.  On each  line,  following
       the   word,   is  a  list  of  byte  offsets  (synset_offsets)  in  the
       corresponding data file, one  for  each  synset  containing  the  word.
       Words  in the index file are in lower case only, regardless of how they
       were  entered  in  the  lexicographer  files.    This   folds   various
       orthographic  representations  of  the  word  into  one  line  enabling
       database searches to be  case  insensitive.   See  wninput(5WN)  for  a
       detailed description of the lexicographer files.
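
       The case folding described above can be sketched as follows.   This  is
       a  minimal  illustration,  not  the  WordNet library's own search code:
       find_index_line is a hypothetical helper, the sample lines are made up,
       and  the  sketch  assumes  the  lines are sorted in plain byte order.  A
       query is folded to lower case, spaces are replaced by underscores  (the
       form used for collocations), and the sorted lines are binary searched.

```python
import bisect

def find_index_line(sorted_lines, word):
    """Return the index line whose lemma matches word, or None."""
    # Fold case and join collocations with underscores, as the index files do.
    key = word.lower().replace(" ", "_")
    probe = key + " "                      # the lemma is followed by one space
    i = bisect.bisect_left(sorted_lines, probe)
    if i < len(sorted_lines) and sorted_lines[i].startswith(probe):
        return sorted_lines[i]
    return None

# Made-up sample content; the first line imitates a header line, which
# begins with two spaces and therefore sorts before every lemma.
sample_index = [
    "  1 made-up header line",
    "cat n 1 0 1 0 00000100",
    "dog n 1 0 1 0 00000200",
    "hot_dog n 1 0 1 0 00000300",
]

print(find_index_line(sample_index, "Dog"))
print(find_index_line(sample_index, "Hot Dog"))
print(find_index_line(sample_index, "emu"))
```

       Searching for "Dog", "DOG" or "dog" reaches the same line,  because  the
       query is folded before the comparison.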

       A data file for a syntactic category contains information corresponding
       to the synsets that were specified in  the  lexicographer  files,  with
       relational  pointers resolved to synset_offsets.  Each line corresponds
       to a synset.  Pointers are followed and hierarchies traversed by moving
       from one synset to another via the synset_offsets.
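
       Moving to a synset through its synset_offset amounts to a seek followed
       by  a line read.  The sketch below uses an in-memory stand-in for a data
       file; read_synset_line and the two synset lines are hypothetical and do
       not come from the WordNet database.

```python
import io

def read_synset_line(data_file, synset_offset):
    """Seek to a byte offset and return the synset line that starts there."""
    data_file.seek(int(synset_offset))     # offsets are decimal byte positions
    line = data_file.readline().decode("ascii").rstrip("\n")
    # Every data line begins with its own offset, which allows a sanity check.
    if line[:8] != synset_offset:
        raise ValueError("no synset line starts at offset " + synset_offset)
    return line

# Made-up two-line stand-in for a data.pos file (no license header).
first = b"00000000 03 n 01 thing 0 000 | first made-up synset\n"
second_offset = "%08d" % len(first)
second = (second_offset + " 03 n 01 item 0 000 | second made-up synset\n").encode("ascii")
fake_data = io.BytesIO(first + second)

print(read_synset_line(fake_data, second_offset))
```

       The file is opened in binary mode in  this  sketch  so  that  the  seek
       position is a true byte offset.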

       The  exception  list files, pos.exc, are used to help the morphological
       processor find base forms from irregular inflections.

       The files sentidx.vrb and sents.vrb contain sentences illustrating  the
       use  of  specific  senses  of  some verbs.  These files are used by the
       searching software in response to a request for verb  sentence  frames.
       Generic  sentence frames are displayed when an illustrative sentence is
       not present.

       The various database files are in ASCII formats that are easily read by
       both  humans  and  machines.   All  fields, unless otherwise noted, are
       separated by one space character, and all lines  are  terminated  by  a
       newline  character.   Fields enclosed in italicized square brackets may
       not be present.

       See wngloss(7WN) for a glossary of WordNet terminology and a discussion
       of the database’s content and logical organization.

   Index File Format
       Each  index  file  begins  with  several  lines  containing a copyright
       notice, version number and license agreement.  These  lines  all  begin
       with  two  spaces and the line number so they do not interfere with the
       binary search algorithm that is used to look up entries  in  the  index
       files.   All  other  lines  are  in the following format.  In the field
       descriptions,  number  always  refers  to  a  decimal  integer   unless
       otherwise defined.

       lemma  pos  synset_cnt  p_cnt  [ptr_symbol...]  sense_cnt  tagsense_cnt   synset_offset  [synset_offset...]

       lemma          lower   case   ASCII   text   of  word  or  collocation.
                      Collocations are formed by joining individual words with
                      an underscore (_) character.

       pos            Syntactic  category: n for noun files, v for verb files,
                      a for adjective files, r for adverb files.

       All remaining fields are with respect to senses of lemma in pos.

       synset_cnt     Number of synsets that lemma is in.  This is the  number
                      of  senses  of  the  word  in WordNet. See Sense Numbers
                      below for a discussion of how sense numbers are assigned
                      and the order of synset_offsets in the index files.

       p_cnt          Number  of  different  pointers  that  lemma  has in all
                      synsets containing it.

       ptr_symbol     A space separated  list  of  p_cnt  different  types  of
                      pointers  that  lemma  has in all synsets containing it.
                      See wninput(5WN) for a list of pointer_symbols.  If  all
                      senses  of lemma have no pointers, this field is omitted
                      and p_cnt is 0.

       sense_cnt      Same as synset_cnt above.  This is  redundant,  but  the
                      field was preserved for compatibility reasons.

       tagsense_cnt   Number  of  senses of lemma that are ranked according to
                      their frequency of occurrence  in  semantic  concordance
                      texts.

       synset_offset  Byte  offset  in  data.pos  file  of a synset containing
                      lemma.  Each synset_offset in the list corresponds to  a
                      different  sense  of lemma in WordNet.  synset_offset is
                      an 8 digit, zero-filled decimal integer that can be used
                      with fseek(3) to read a synset from the data file.  When
                      passed to  read_synset(3WN)  along  with  the  syntactic
                      category,  a data structure containing the parsed synset
                      is returned.
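
       The  field  layout above can be read with a few lines of code.  This is
       a sketch only: parse_index_line is a hypothetical helper and the sample
       entry uses made-up counts and offsets.  Header lines, which begin  with
       two spaces, are skipped.

```python
def parse_index_line(line):
    """Split one index.pos entry into named fields; return None for header lines."""
    if line.startswith("  "):
        return None                        # copyright/license header line
    fields = line.split()
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    # p_cnt pointer symbols follow; the list is absent when p_cnt is 0.
    rest = fields[4 + p_cnt:]
    return {
        "lemma": fields[0],
        "pos": fields[1],
        "synset_cnt": synset_cnt,
        "ptr_symbols": fields[4:4 + p_cnt],
        "sense_cnt": int(rest[0]),
        "tagsense_cnt": int(rest[1]),
        "synset_offsets": rest[2:2 + synset_cnt],
    }

# Made-up entry for illustration; not copied from the database.
entry = parse_index_line("dog n 2 3 @ ~ #m 2 1 00001234 00005678")
print(entry["lemma"], entry["ptr_symbols"], entry["synset_offsets"])
```

       The  offsets  are  kept  as  strings  so  that  the  8  digit,  zero-filled
       form is preserved; int() converts one for use as a seek position.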

   Data File Format
       Each data file begins with several lines containing a copyright notice,
       version  number  and license agreement.  These lines all begin with two
       spaces and the line number.  All  other  lines  are  in  the  following
       format.  Integer fields are of fixed length, and are zero-filled.

       synset_offset  lex_filenum  ss_type  w_cnt  word  lex_id  [word  lex_id...]  p_cnt  [ptr...]  [frames...]  |  gloss

       synset_offset  Current  byte  offset  in  the  file represented as an 8
                      digit decimal integer.

       lex_filenum    Two  digit  decimal   integer   corresponding   to   the
                      lexicographer  file  name  containing  the  synset.  See
                      lexnames(5WN)  for  the  list  of  filenames  and  their
                      corresponding numbers.

       ss_type        One character code indicating the synset type:

                      n    NOUN
                      v    VERB
                      a    ADJECTIVE
                      s    ADJECTIVE SATELLITE
                      r    ADVERB

       w_cnt          Two  digit  hexadecimal integer indicating the number of
                      words in the synset.

       word           ASCII form of a word as entered in  the  synset  by  the
                      lexicographer,   with   spaces  replaced  by  underscore
                      characters (_).  The text of the word is case sensitive,
                      in  contrast  to its form in the corresponding index.pos
                      file, which contains only lower-case forms.  In data.adj,
                      a  word  is  followed  by  a syntactic marker if one was
                      specified in the lexicographer file.  A syntactic marker
                      is  appended,  in  parentheses,  onto  word  without any
                      intervening spaces.  See wninput(5WN) for a list of  the
                      syntactic markers for adjectives.

       lex_id         One  digit  hexadecimal integer that, when appended onto
                      lemma,   uniquely   identifies   a   sense   within    a
                      lexicographer  file.   lex_id numbers usually start with
                      0, and are incremented as additional senses of the  word
                      are  added  to  the  same  file,  although  there  is no
                      requirement that the numbers  be  consecutive  or  begin
                      with  0.   Note  that  a  value of 0 is the default, and
                      therefore is not present in lexicographer files.

       p_cnt          Three digit decimal integer  indicating  the  number  of
                      pointers from this synset to other synsets.  If p_cnt is
                      000 the synset has no pointers.

       ptr            A pointer from this synset to another.  ptr  is  of  the
                      form:

                      pointer_symbol  synset_offset  pos  source/target

                      where  synset_offset  is  the  byte offset of the target
                      synset in the data file corresponding to pos.

                      The  source/target  field  distinguishes   lexical   and
                      semantic  pointers.  It is a four byte field, containing
                      two  two-digit  hexadecimal  integers.   The  first  two
                      digits indicate the word number in the current  (source)
                      synset;  the last two digits indicate the word number in
                      the   target   synset.   A  value  of  0000  means  that
                      pointer_symbol represents a  semantic  relation  between
                      the  current  (source)  synset  and  the  target  synset
                      indicated by synset_offset.

                      A  lexical  relation  between  two  words  in  different
                      synsets  is represented by non-zero values in the source
                      and target word numbers.  The first and last  two  bytes
                      of  this  field  indicate the word numbers in the source
                      and target  synsets,  respectively,  between  which  the
                      relation  holds.   Word numbers are assigned to the word
                      fields in a synset, from left to right,  beginning  with
                      1.

                      See  wninput(5WN)  for  a  list  of pointer_symbols, and
                      semantic and lexical pointer classifications.

       frames         In data.verb only, a list of  numbers  corresponding  to
                      the  generic  verb  sentence  frames  for  words  in the
                      synset.  frames is of the form:

                      f_cnt   +   f_num  w_num  [ +   f_num  w_num...]

                      where f_cnt is a two digit  decimal  integer  indicating  the
                      number  of  generic  frames listed, f_num is a two digit
                      decimal integer frame number, and w_num is a  two  digit
                      hexadecimal  integer  indicating  the word in the synset
                      that the frame applies to.  As with  pointers,  if  this
                      number  is 00, f_num applies to all words in the synset.
                      If  non-zero,  it  is  applicable  only  to   the   word
                      indicated.   Word  numbers are assigned as described for
                      pointers.  Each f_num  w_num pair is preceded  by  a  +.
                      See  wninput(5WN)  for  the text of the generic sentence
                      frames.

       gloss          Each synset contains a gloss.  A gloss is represented as
                      a  vertical  bar  (|),  followed  by  a text string that
                      continues until the end of  the  line.   The  gloss  may
                      contain  a definition, one or more example sentences, or
                      both.
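
       The following sketch reads one data line according to the fields above,
       including  the  mixed  decimal and hexadecimal widths.  parse_data_line
       is a hypothetical helper, and the sample line is  made  up  rather  than
       taken  from  data.verb.   A  syntactic marker, if present, is left as part
       of the word.

```python
def parse_data_line(line):
    """Parse one data.pos synset line into a dictionary."""
    body, _, gloss = line.partition("|")   # the gloss follows the vertical bar
    f = body.split()
    synset = {
        "synset_offset": f[0],
        "lex_filenum": int(f[1]),
        "ss_type": f[2],
        "words": [],
        "pointers": [],
        "frames": [],
        "gloss": gloss.strip(),
    }
    w_cnt = int(f[3], 16)                  # two digit hexadecimal
    i = 4
    for _ in range(w_cnt):
        synset["words"].append((f[i], int(f[i + 1], 16)))  # (word, lex_id)
        i += 2
    p_cnt = int(f[i])                      # three digit decimal
    i += 1
    for _ in range(p_cnt):
        symbol, offset, pos, src_tgt = f[i:i + 4]
        synset["pointers"].append({
            "symbol": symbol,
            "synset_offset": offset,
            "pos": pos,
            "source": int(src_tgt[:2], 16),  # 0 and 0: semantic pointer
            "target": int(src_tgt[2:], 16),  # non-zero: word numbers from 1
        })
        i += 4
    if synset["ss_type"] == "v" and i < len(f):
        f_cnt = int(f[i])                  # two digit decimal
        i += 1
        for _ in range(f_cnt):
            # f[i] is the literal "+" that precedes each f_num w_num pair.
            synset["frames"].append((int(f[i + 1]), int(f[i + 2], 16)))
            i += 3
    return synset

# Made-up verb synset line for illustration.
sample = ("00001000 29 v 02 walk 0 stroll 1 002 @ 00002000 v 0000 "
          "+ 00003000 n 0201 02 + 01 00 + 08 02 | move on foot; made-up gloss")
parsed = parse_data_line(sample)
print(parsed["words"], parsed["frames"])
```

       In  the  sample,  the first pointer has a source/target of 0000 and is
       therefore semantic, while the second (0201) relates word 2 of this synset
       to word 1 of the target synset.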

   Sense Numbers
       Senses in WordNet are generally ordered from most to  least  frequently
       used,  with  the  most  common  sense  numbered 1.  Frequency of use is
       determined by the number of times a sense  is  tagged  in  the  various
       semantic  concordance  texts.   Senses that are not semantically tagged
       follow the ordered senses.  The tagsense_cnt field for  each  entry  in
       the  index.pos  files indicates how many of the senses in the list have
       been tagged.

       The cntlist(5WN) file provided with the database lists  the  number  of
       times each sense is tagged in the semantic concordances.  The data from
       cntlist is used by grind(1WN) to order the senses of each  word.   When
       the  index.pos  files  are  generated, the synset_offsets are output in
       sense number order, with sense 1 first in the list.   Senses  with  the
       same  number of semantic tags are assigned unique but consecutive sense
       numbers.  The WordNet  OVERVIEW  search  displays  all  senses  of  the
       specified word, in all syntactic categories, and indicates which of the
       senses are represented in the semantically tagged texts.
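
       Because  the  synset_offsets  are  written  in  sense  number order, sense
       numbers can be recovered from list position alone.  The sketch below is
       an  illustration  under  that  reading; number_senses is a hypothetical
       helper, and it assumes the tagged senses are the first  tagsense_cnt
       entries, since untagged senses follow the ordered ones.

```python
def number_senses(synset_offsets, tagsense_cnt):
    """Return (sense_number, synset_offset, is_tagged) for each sense."""
    return [
        (number, offset, number <= tagsense_cnt)
        for number, offset in enumerate(synset_offsets, start=1)
    ]

# Made-up offsets: three senses, of which the first two are tagged.
for sense in number_senses(["00001234", "00005678", "00009012"], 2):
    print(sense)
```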

   Exception List File Format
       Exception lists are alphabetized lists of inflected forms of words  and
       their  base  forms.  The first field of each line is an inflected form,
       followed by a space separated list of one or more  base  forms  of  the
       word.  There is one exception list file for each syntactic category.

       Note  that  the  noun  and  verb  exception  lists  were  automatically
       generated from a machine-readable dictionary, and  contain  many  words
       that  are  not in WordNet.  Also, for many of the inflected forms, base
       forms could be easily derived using the standard  rules  of  detachment
       programmed into Morphy (see morphy(7WN)).  These anomalies are  allowed
       to remain in the exception list files, as they do no harm.
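
       An exception list can be loaded into a lookup table as sketched  below.
       load_exceptions  is  a hypothetical helper and the sample lines are for
       illustration only; Morphy's actual handling  is  described  in  its  own
       manual page.

```python
def load_exceptions(lines):
    """Map each inflected form to its list of base forms."""
    table = {}
    for line in lines:
        fields = line.split()
        if fields:                         # ignore blank lines
            table[fields[0]] = fields[1:]  # first field is the inflected form
    return table

# Illustrative sample lines in the pos.exc layout.
noun_exc = load_exceptions([
    "axes ax axis",
    "geese goose",
    "mice mouse",
])

print(noun_exc.get("geese", []))
print(noun_exc.get("axes", []))
print(noun_exc.get("tables", []))
```

       A  form  that  is  absent  from  the  table, such as "tables" here, is one
       that would be left to the regular rules of detachment.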

   Verb Example Sentences
       For some verb senses, example sentences illustrating  the  use  of  the
       verb  sense  can  be  displayed.   Each  line  of  the file sentidx.vrb
       contains a sense_key followed by a space and a comma separated list  of
       example  sentence  template  numbers,  in  decimal.  The file sents.vrb
       lists all of the example sentence templates.  Each line begins with the
       template  number followed by a space.  The rest of the line is the text
       of a template example sentence, with %s used as a  placeholder  in  the
       text  for  the  verb.  Both files are sorted alphabetically so that the
       sense_key and template sentence number can  be  used  as  indices,  via
       binsrch(3WN), into the appropriate file.

       When  a  request  for FRAMES is made, the WordNet search code looks for
       the sense in sentidx.vrb.  If found, the sentence templates listed  are
       retrieved from sents.vrb, and the %s is replaced with the verb.  If the
       sense is not found, the applicable generic sentence frames  listed  in
       the frames field of the synset are displayed.
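
       The  lookup  just  described  can  be  sketched  as  follows.   The helper
       names, the sense_key value and the template text are all  hypothetical;
       only the line layouts follow the description above.  Dictionaries stand
       in for the binary search that the library performs.

```python
def parse_sentidx_line(line):
    """sentidx.vrb: sense_key, a space, then comma separated template numbers."""
    sense_key, _, numbers = line.strip().partition(" ")
    return sense_key, [int(n) for n in numbers.split(",") if n]

def parse_sents_line(line):
    """sents.vrb: template number, a space, then the template text."""
    number, _, template = line.rstrip("\n").partition(" ")
    return int(number), template

def example_sentences(sense_key, verb, sentidx, sents):
    """Fill the %s placeholder of each template listed for sense_key."""
    return [sents[n].replace("%s", verb) for n in sentidx.get(sense_key, [])]

# Made-up file contents for illustration.
sentidx = dict([parse_sentidx_line("walk%2:38:00:: 10,11")])
sents = dict([
    parse_sents_line("10 The children %s to the playground"),
    parse_sents_line("11 They %s down the road"),
])

print(example_sentences("walk%2:38:00::", "walk", sentidx, sents))
print(example_sentences("run%2:38:00::", "run", sentidx, sents))
```

       An  empty  result,  as  for  the  second call, corresponds to the case in
       which the search code falls back to the generic sentence frames.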

NOTES

       Information  in  the data.pos and index.pos files represents all of the
       word senses and synsets in the WordNet database.  The word, lex_id, and
       lex_filenum  fields  together  uniquely  identify  each  word  sense in
       WordNet.   These  can  be  encoded  in  a  sense_key  as  described  in
       senseidx(5WN).   Each synset in the database can be uniquely identified
       by combining the synset_offset for the  synset  with  a  code  for  the
       syntactic  category  (since  it  is  possible  for synsets in different
       data.pos files to have the same synset_offset).
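
       A  combined  identifier  of  this  kind might be built as shown below.
       synset_id and its output format are assumptions made for  illustration,
       not a format defined by WordNet.

```python
def synset_id(pos, synset_offset):
    """Join a syntactic category code and an offset into one identifier."""
    return "%s-%08d" % (pos, int(synset_offset))

# The same offset value in two different data.pos files gives distinct ids.
noun_id = synset_id("n", "00001740")
verb_id = synset_id("v", "00001740")
print(noun_id, verb_id, noun_id != verb_id)
```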

       The WordNet system provides both command line and window-based  browser
       interfaces  to  the database.  Both interfaces utilize a common library
       of search and morphology code.  The source code  for  the  library  and
       interfaces is included in the WordNet package.  See wnintro(3WN) for an
       overview of the WordNet source code.

ENVIRONMENT VARIABLES (UNIX)

       WNHOME              Base   directory   for   WordNet.     Default    is
                           /usr/local/WordNet-3.0.

       WNSEARCHDIR         Directory  in  which  the WordNet database has been
                           installed.  Default is WNHOME/dict.
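
       The  defaults  above  suggest  the  following  resolution  order.   This
       is a sketch of that reading, not the  library's  own  code:
       wordnet_search_dir  is  a  hypothetical helper, and it assumes that
       WNSEARCHDIR, when set, takes precedence over WNHOME.

```python
import os
import posixpath

def wordnet_search_dir(environ=None):
    """Return the database directory implied by WNSEARCHDIR and WNHOME."""
    if environ is None:
        environ = os.environ
    search_dir = environ.get("WNSEARCHDIR")
    if search_dir:
        return search_dir
    home = environ.get("WNHOME", "/usr/local/WordNet-3.0")
    # posixpath is used because these variables apply to UNIX systems.
    return posixpath.join(home, "dict")

print(wordnet_search_dir({}))
print(wordnet_search_dir({"WNHOME": "/opt/wordnet"}))
```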

REGISTRY (WINDOWS)

       HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome
                           Base   directory   for   WordNet.     Default    is
                           C:\Program Files\WordNet\3.0.

FILES

       index.pos           database index files

       data.pos            database data files

       *.vrb               files of sentences illustrating the use of verbs

       pos.exc             morphology exception lists

SEE ALSO

       grind(1WN),     wn(1WN),    wnb(1WN),    wnintro(3WN),    binsrch(3WN),
       wnintro(5WN), cntlist(5WN), lexnames(5WN), senseidx(5WN), wninput(5WN),
       morphy(7WN), wngloss(7WN), wngroups(7WN), wnstats(7WN).