Provided by: wordnet-base_3.0-35_all

NAME

       index.noun, data.noun, index.verb, data.verb, index.adj, data.adj, index.adv, data.adv - WordNet database
       files

       noun.exc, verb.exc, adj.exc, adv.exc - morphology exception lists

       sentidx.vrb, sents.vrb - files used by search code to display sentences  illustrating  the  use  of  some
       specific verbs

DESCRIPTION

       For  each  syntactic  category,  two files are needed to represent the contents of the WordNet database -
       index.pos and data.pos, where pos is one of noun, verb, adj or adv.  The other auxiliary files are used by the
       WordNet library's searching functions and are needed to run the various WordNet browsers.

       Each  index  file  is an alphabetized list of all the words found in WordNet in the corresponding part of
       speech.  On each  line,  following  the  word,  is  a  list  of  byte  offsets  (synset_offsets)  in  the
       corresponding  data  file, one for each synset containing the word.  Words in the index file are in lower
       case only, regardless of  how  they  were  entered  in  the  lexicographer  files.   This  folds  various
       orthographic representations of the word into one line enabling database searches to be case insensitive.
       See wninput(5WN) for a detailed description of the lexicographer files.

       A data file for a syntactic  category  contains  information  corresponding  to  the  synsets  that  were
       specified  in  the  lexicographer  files, with relational pointers resolved to synset_offsets.  Each line
       corresponds to a synset.  Pointers are followed and hierarchies traversed by moving from  one  synset  to
       another via the synset_offsets.

       The  exception  list  files,  pos.exc,  are used to help the morphological processor find base forms from
       irregular inflections.

       The files sentidx.vrb and sents.vrb contain sentences illustrating the use of  specific  senses  of  some
       verbs.  These files are used by the searching software in response to a request for verb sentence frames.
       Generic sentence frames are displayed when an illustrative sentence is not present.

       The various database files are in ASCII formats that are easily read by both humans  and  machines.   All
       fields,  unless  otherwise noted, are separated by one space character, and all lines are terminated by a
       newline character.  Fields enclosed in square brackets are optional and may be absent.

       See wngloss(7WN) for a glossary of WordNet terminology and a discussion of  the  database's  content  and
       logical organization.

   Index File Format
       Each  index  file  begins  with  several  lines containing a copyright notice, version number and license
       agreement.  These lines all begin with two spaces and the line number so they do not interfere  with  the
       binary  search  algorithm that is used to look up entries in the index files.  All other lines are in the
       following format.  In the field descriptions, number always refers to a decimal integer unless  otherwise
       defined.

       lemma  pos  synset_cnt  p_cnt  [ptr_symbol...]  sense_cnt  tagsense_cnt   synset_offset  [synset_offset...]

       lemma          lower  case  ASCII  text  of  word  or  collocation.   Collocations  are formed by joining
                      individual words with an underscore (_) character.

       pos            Syntactic category: n for noun files, v for verb files,  a  for  adjective  files,  r  for
                      adverb files.

       All remaining fields are with respect to senses of lemma in pos.

       synset_cnt     Number  of synsets that lemma is in.  This is the number of senses of the word in WordNet.
                      See Sense Numbers below for a discussion of how sense numbers are assigned and  the  order
                      of synset_offsets in the index files.

       p_cnt          Number of different pointers that lemma has in all synsets containing it.

       ptr_symbol     A  space separated list of p_cnt different types of pointers that lemma has in all synsets
                      containing it. See wninput(5WN) for a list of pointer_symbols.  If  all  senses  of  lemma
                      have no pointers, this field is omitted and p_cnt is 0.

       sense_cnt      Same as synset_cnt above.  This is redundant, but the field was preserved for compatibility
                      reasons.

       tagsense_cnt   Number of senses of lemma that are ranked according to their frequency  of  occurrence  in
                      semantic concordance texts.

       synset_offset  Byte offset in data.pos file of a synset containing lemma.  Each synset_offset in the list
                      corresponds to a different sense of lemma in WordNet.  synset_offset is an 8 digit,  zero-
                      filled decimal integer that can be used with fseek(3) to read a synset from the data file.
                      When passed to read_synset(3WN) along  with  the  syntactic  category,  a  data  structure
                      containing the parsed synset is returned.
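
       As an illustration of the layout above, the following Python sketch splits one index line into its
       fields.  The names parse_index_line, IndexEntry and is_header_line, and the sample line, are invented
       for this example and are not part of the WordNet library.  Offsets are kept as strings so that the
       zero fill is preserved.

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    lemma: str
    pos: str
    synset_cnt: int
    ptr_symbols: List[str]
    sense_cnt: int
    tagsense_cnt: int
    synset_offsets: List[str]


def is_header_line(line: str) -> bool:
    # Copyright and license lines begin with two spaces and a line number.
    return line.startswith("  ")


def parse_index_line(line: str) -> IndexEntry:
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    # p_cnt pointer symbols follow; the list is empty when p_cnt is 0.
    ptr_symbols = fields[4:4 + p_cnt]
    rest = fields[4 + p_cnt:]
    sense_cnt, tagsense_cnt = int(rest[0]), int(rest[1])
    # One 8 digit offset per synset containing the lemma.
    synset_offsets = rest[2:2 + synset_cnt]
    return IndexEntry(lemma, pos, synset_cnt, ptr_symbols,
                      sense_cnt, tagsense_cnt, synset_offsets)


# Invented sample line, not copied from a real index file.
SAMPLE_INDEX_LINE = "example_word n 2 2 @ ~ 2 1 00001740 00002137"
entry = parse_index_line(SAMPLE_INDEX_LINE)
```

       A caller reading a whole index file would skip any line for which is_header_line() is true before
       calling parse_index_line().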

   Data File Format
       Each  data  file  begins  with  several  lines  containing a copyright notice, version number and license
       agreement.  These lines all begin with two spaces and the line  number.   All  other  lines  are  in  the
       following format.  Integer fields are of fixed length, and are zero-filled.

       synset_offset  lex_filenum  ss_type  w_cnt  word  lex_id  [word  lex_id...]  p_cnt  [ptr...]  [frames...]  |  gloss

       synset_offset  Current byte offset in the file represented as an 8 digit decimal integer.

       lex_filenum    Two  digit  decimal  integer  corresponding  to the lexicographer file name containing the
                      synset.  See lexnames(5WN) for the list of filenames and their corresponding numbers.

       ss_type        One character code indicating the synset type:

                      n    NOUN
                      v    VERB
                      a    ADJECTIVE
                      s    ADJECTIVE SATELLITE
                      r    ADVERB

       w_cnt          Two digit hexadecimal integer indicating the number of words in the synset.

       word           ASCII form of a word as entered in the synset by the lexicographer, with  spaces  replaced
                      by  underscore characters (_).  The text of the word is case sensitive, in contrast to its
                      form in the corresponding  index.pos  file,  that  contains  only  lower-case  forms.   In
                      data.adj,  a  word  is  followed  by  a  syntactic  marker  if  one  was  specified in the
                      lexicographer file.  A syntactic marker is appended, in parentheses, onto word without any
                      intervening spaces.  See wninput(5WN) for a list of the syntactic markers for adjectives.

       lex_id         One  digit hexadecimal integer that, when appended onto lemma, uniquely identifies a sense
                      within a lexicographer file.  lex_id numbers usually start with 0, and are incremented  as
                      additional senses of the word are added to the same file, although there is no requirement
                      that the numbers be consecutive or begin with 0.  Note that a value of 0 is  the  default,
                      and therefore is not present in lexicographer files.

       p_cnt          Three  digit  decimal  integer indicating the number of pointers from this synset to other
                      synsets.  If p_cnt is 000 the synset has no pointers.

       ptr            A pointer from this synset to another.  ptr is of the form:

                      pointer_symbol  synset_offset  pos  source/target

                      where synset_offset is the byte offset of the target synset in the data file corresponding
                      to pos.

                      The  source/target  field  distinguishes lexical and semantic pointers.  It is a four byte
                      field, containing two two-digit hexadecimal integers.  The first two digits indicate the
                      word  number  in the current (source) synset, the last two digits indicate the word number
                      in the target synset.  A value of 0000 means that  pointer_symbol  represents  a  semantic
                      relation  between  the  current  (source)  synset  and  the  target  synset  indicated  by
                      synset_offset.

                      A lexical relation between two words in  different  synsets  is  represented  by  non-zero
                      values  in the source and target word numbers.  The first and last two bytes of this field
                      indicate the word numbers in the source and target synsets,  respectively,  between  which
                      the  relation  holds.  Word numbers are assigned to the word fields in a synset, from left
                      to right, beginning with 1.

                      See wninput(5WN)  for  a  list  of  pointer_symbols,  and  semantic  and  lexical  pointer
                      classifications.

       frames         In data.verb only, a list of numbers corresponding to the generic verb sentence frames for
                      words in the synset.  frames is of the form:

                      f_cnt   +   f_num  w_num  [ +   f_num  w_num...]

                      where f_cnt is a two digit decimal integer indicating the number of generic frames listed,
                      f_num  is  a  two digit decimal integer frame number, and w_num is a two digit hexadecimal
                      integer indicating the word in the synset that the frame applies to.  As with pointers, if
                      this  number  is  00,  f_num  applies  to  all  words  in  the synset.  If non-zero, it is
                      applicable only to the word  indicated.   Word  numbers  are  assigned  as  described  for
                      pointers.   Each  f_num  w_num  pair is preceded by a +.  See wninput(5WN) for the text of
                      the generic sentence frames.

       gloss          Each synset contains a gloss.  A gloss is represented as a vertical bar (|), followed by a
                      text string that continues until the end of the line.  The gloss may contain a definition,
                      one or more example sentences, or both.
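
       The field descriptions above can be sketched as a small Python parser.  This is an illustration under
       stated assumptions, not the WordNet library's own code: parse_data_line and the dictionary keys are
       hypothetical names, and the sample line is invented.  Fields are consumed by count rather than by
       searching for a +, since a + may also occur among the pointer symbols.

```python
from typing import Any, Dict


def parse_data_line(line: str) -> Dict[str, Any]:
    # Everything after the first vertical bar is the gloss.
    body, _, gloss = line.rstrip("\n").partition("|")
    f = body.split()
    w_cnt = int(f[3], 16)              # two digit hexadecimal
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append({"word": f[i], "lex_id": int(f[i + 1], 16)})
        i += 2
    p_cnt = int(f[i])                  # three digit decimal
    i += 1
    pointers = []
    for _ in range(p_cnt):
        symbol, offset, pos, src_tgt = f[i:i + 4]
        pointers.append({
            "symbol": symbol,
            "synset_offset": offset,
            "pos": pos,
            "source": int(src_tgt[:2], 16),   # word number in this synset
            "target": int(src_tgt[2:], 16),   # word number in target synset
            "semantic": src_tgt == "0000",    # 0000 marks a semantic pointer
        })
        i += 4
    frames = []
    if i < len(f):                     # frames are present in data.verb only
        f_cnt = int(f[i])              # two digit decimal
        i += 1
        for _ in range(f_cnt):
            # f[i] is the literal "+" that precedes each f_num w_num pair.
            frames.append({"f_num": int(f[i + 1]), "w_num": int(f[i + 2], 16)})
            i += 3
    return {
        "synset_offset": f[0],
        "lex_filenum": int(f[1]),
        "ss_type": f[2],
        "words": words,
        "pointers": pointers,
        "frames": frames,
        "gloss": gloss.strip(),
    }


# Invented sample line in the data.verb layout, not real WordNet data.
SAMPLE_DATA_LINE = ("00001740 29 v 02 alpha 0 beta_gamma 1 002 "
                    "@ 00002000 v 0000 + 00003000 n 0101 "
                    "02 + 02 00 + 08 01 | a made-up gloss; \"an example\"")
synset = parse_data_line(SAMPLE_DATA_LINE)
```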

   Sense Numbers
       Senses in WordNet are generally ordered from most to least frequently used, with the  most  common  sense
       numbered  1.   Frequency  of  use  is  determined by the number of times a sense is tagged in the various
       semantic concordance texts.  Senses that are not semantically tagged  follow  the  ordered  senses.   The
       tagsense_cnt  field  for  each  entry in the index.pos files indicates how many of the senses in the list
       have been tagged.

       The cntlist(5WN) file provided with the database lists the number of times each sense is  tagged  in  the
       semantic  concordances.   The  data  from cntlist is used by grind(1WN) to order the senses of each word.
       When the index.pos files are generated, the synset_offsets are output in sense number order, with sense 1
       first  in  the  list.   Senses  with the same number of semantic tags are assigned unique but consecutive
       sense numbers.  The WordNet OVERVIEW search displays all senses of the specified word, in  all  syntactic
       categories, and indicates which of the senses are represented in the semantically tagged texts.
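
       Because the synset_offsets are written in sense number order and untagged senses follow the tagged
       ones, sense numbers can be recovered from an index entry by position.  The Python sketch below assumes
       this reading of the text above; number_senses is a hypothetical helper, and the offsets in the usage
       line are invented.

```python
from typing import List, Tuple


def number_senses(synset_offsets: List[str],
                  tagsense_cnt: int) -> List[Tuple[int, str, bool]]:
    # Sense numbers start at 1 and follow the order of the offsets.
    # The first tagsense_cnt senses are taken to be the tagged ones.
    return [(n, offset, n <= tagsense_cnt)
            for n, offset in enumerate(synset_offsets, start=1)]


senses = number_senses(["00002137", "00001740", "00004475"], tagsense_cnt=2)
```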

   Exception List File Format
       Exception lists are alphabetized lists of inflected forms of words and their base forms.  The first field
       of each line is an inflected form, followed by a space separated list of one or more base  forms  of  the
       word.  There is one exception list file for each syntactic category.

       Note  that  the  noun  and  verb  exception  lists  were  automatically generated from a machine-readable
       dictionary, and contain many words that are not in WordNet.  Also, for many of the inflected forms,  base
       forms could be easily derived using the standard rules of detachment programmed into Morphy (see
       morphy(7WN)).  These anomalies are allowed to remain in the exception list files, as they do no harm.
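
       A minimal Python sketch of reading an exception list in the format described above.  The function
       names and the sample lines are illustrative only; the rules of detachment that Morphy applies to
       regular inflections are not shown.

```python
from typing import Dict, Iterable, List


def load_exceptions(lines: Iterable[str]) -> Dict[str, List[str]]:
    table: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.split()
        if fields:
            # First field is the inflected form, the rest are base forms.
            table[fields[0]] = fields[1:]
    return table


def base_forms(table: Dict[str, List[str]], word: str) -> List[str]:
    # An empty list means the word has no entry in this exception list.
    return table.get(word, [])


# Illustrative lines in the pos.exc layout.
exceptions = load_exceptions(["axes ax axis\n", "geese goose\n"])
```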

   Verb Example Sentences
       For some verb senses, example sentences illustrating the use of the verb sense can  be  displayed.   Each
       line  of  the  file  sentidx.vrb  contains  a sense_key followed by a space and a comma separated list of
       example sentence template numbers, in decimal.  The file sents.vrb lists  all  of  the  example  sentence
       templates.   Each  line begins with the template number followed by a space.  The rest of the line is the
       text of a template example sentence, with %s used as a placeholder in the text for the verb.  Both  files
       are  sorted alphabetically so that the sense_key and template sentence number can be used as indices, via
       binsrch(3WN), into the appropriate file.

       When a request for FRAMES is made, the WordNet search code looks for the sense in sentidx.vrb.  If found,
       the sentence templates listed are retrieved from sents.vrb, and the %s is replaced with the verb.  If
       the sense is not found, the applicable generic sentence frames listed in frames are displayed.
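
       The lookup described above can be sketched in Python as follows.  The function names, the sense_key
       and the template texts are invented for illustration (see senseidx(5WN) for the real sense_key
       encoding), and plain dictionaries stand in for the binary search that the WordNet code performs.

```python
from typing import Dict, List, Tuple


def parse_sentidx_line(line: str) -> Tuple[str, List[int]]:
    # sense_key, a space, then comma separated decimal template numbers.
    sense_key, _, numbers = line.strip().partition(" ")
    return sense_key, [int(n) for n in numbers.split(",") if n]


def parse_sents_line(line: str) -> Tuple[int, str]:
    # Template number, a space, then the template text containing %s.
    number, _, template = line.rstrip("\n").partition(" ")
    return int(number), template


def example_sentences(sense_key: str, verb: str,
                      sentidx: Dict[str, List[int]],
                      sents: Dict[int, str]) -> List[str]:
    # An empty result means the caller should fall back to generic frames.
    return [sents[n].replace("%s", verb) for n in sentidx.get(sense_key, [])]


# Invented sample data in the two file layouts.
sentidx = dict([parse_sentidx_line("alpha%2:38:00:: 1,3\n")])
sents = dict(parse_sents_line(s) for s in [
    "1 They %s every morning\n",
    "3 We %s in the park\n",
])
```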

NOTES

       Information in the data.pos and index.pos files represents all of the word  senses  and  synsets  in  the
       WordNet database.  The word, lex_id, and lex_filenum fields together uniquely identify each word sense in
       WordNet.  These can be encoded in a sense_key as described in senseidx(5WN).  Each synset in the database
       can  be  uniquely  identified by combining the synset_offset for the synset with a code for the syntactic
       category (since it is possible for synsets in different data.pos files to have the same synset_offset).
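
       As a sketch of this identification scheme, the Python below treats a (pos, synset_offset) pair as the
       synset identifier and uses the offset to seek into the matching data file.  read_synset_line is a
       hypothetical helper, the in-memory files are invented stand-ins for data.noun and data.verb, and the
       files are assumed to be opened in binary mode so that seek positions are byte offsets.

```python
import io
from typing import BinaryIO, Dict, Tuple

SynsetId = Tuple[str, str]   # (syntactic category code, 8 digit offset)


def read_synset_line(data_files: Dict[str, BinaryIO],
                     synset_id: SynsetId) -> str:
    pos, offset = synset_id
    data_file = data_files[pos]
    # The offset is the byte position of the synset's line in that file.
    data_file.seek(int(offset))
    return data_file.readline().decode("ascii").rstrip("\n")


# Invented stand-ins: both files hold a synset at the same offset, so the
# offset alone would be ambiguous. Real files start with header lines.
demo_files = {
    "n": io.BytesIO(b"00000000 03 n 01 alpha 0 000 | made-up noun synset\n"),
    "v": io.BytesIO(b"00000000 29 v 01 alpha 0 000 | made-up verb synset\n"),
}
noun_line = read_synset_line(demo_files, ("n", "00000000"))
verb_line = read_synset_line(demo_files, ("v", "00000000"))
```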

       The WordNet system provides both command line and window-based browser interfaces to the database.  Both
       interfaces  utilize  a common library of search and morphology code.  The source code for the library and
       interfaces is included in the WordNet package.  See wnintro(3WN) for an overview of  the  WordNet  source
       code.

ENVIRONMENT VARIABLES (UNIX)

       WNHOME              Base directory for WordNet.  Default is /usr/local/WordNet-3.0.

       WNSEARCHDIR         Directory in which the WordNet database has been installed.  Default is WNHOME/dict.
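
       A short Python sketch of how a program might apply these defaults.  It mirrors only the defaults listed
       above and is not the WordNet library's own lookup code; wordnet_search_dir is a hypothetical name.

```python
import os
import posixpath
from typing import Mapping, Optional

DEFAULT_WNHOME = "/usr/local/WordNet-3.0"


def wordnet_search_dir(environ: Optional[Mapping[str, str]] = None) -> str:
    env = os.environ if environ is None else environ
    wnhome = env.get("WNHOME", DEFAULT_WNHOME)
    # WNSEARCHDIR, when set, takes precedence over WNHOME/dict.
    return env.get("WNSEARCHDIR", posixpath.join(wnhome, "dict"))
```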

REGISTRY (WINDOWS)

       HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome
                           Base directory for WordNet.  Default is C:\Program Files\WordNet\3.0.

FILES

       index.pos           database index files

       data.pos            database data files

       *.vrb               files of sentences illustrating the use of verbs

       pos.exc             morphology exception lists

SEE ALSO

       grind(1WN),  wn(1WN),  wnb(1WN),  wnintro(3WN),  binsrch(3WN), wnintro(5WN), cntlist(5WN), lexnames(5WN),
       senseidx(5WN), wninput(5WN), morphy(7WN), wngloss(7WN), wngroups(7WN), wnstats(7WN).