Ubuntu Manpage: index.sense, sense.idx

Provided by: wordnet-sense-index_3.0-37_all

NAME

       index.sense, sense.idx - WordNet's sense index

DESCRIPTION

The WordNet sense index provides an alternate method for accessing synsets and word senses
in the WordNet database. It is useful to applications that retrieve synsets or other
information related to a specific sense in WordNet, rather than all the senses of a word
or collocation. It can also be used with tools like grep and Perl to find all senses of a
word in one or more parts of speech. A specific WordNet sense, encoded as a sense_key,
can be used as an index into this file to obtain its WordNet sense number, the database
byte offset of the synset containing the sense, and the number of times it has been tagged
in the semantic concordance texts.
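
As a rough illustration of the grep-style use mentioned above, the following Python sketch filters sense index lines for one word, optionally restricted to certain synset types. The function name senses_of and the sample lines are assumptions made for this example; the sample values are invented and are not copied from a real index.sense file.

```python
def senses_of(lines, lemma, ss_types=None):
    """Yield sense index lines whose sense_key belongs to `lemma`.

    ss_types, if given, is a set of one-digit synset types as strings,
    for example {"1", "2"} for nouns and verbs.
    """
    # Lemmas are lower case, with underscores joining collocation words.
    prefix = lemma.lower().replace(" ", "_") + "%"
    for line in lines:
        if not line.startswith(prefix):
            continue
        # The character right after '%' is the one-digit ss_type.
        ss_type = line[len(prefix):len(prefix) + 1]
        if ss_types is None or ss_type in ss_types:
            yield line.rstrip("\n")


# Invented sample lines in the documented four-field layout.
sample = [
    "dog%1:05:00:: 00001111 1 5\n",
    "dog%2:38:00:: 00002222 1 0\n",
    "dog_days%1:28:00:: 00003333 1 0\n",
]

for entry in senses_of(sample, "dog", ss_types={"1"}):
    print(entry)
```

With a real installation, the same function could be given an open file object for the sense index instead of the sample list.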

Concatenating the lemma and lex_sense fields of a semantically tagged word (represented in
a <wf ... > attribute/value pair) in a semantic concordance file, using % as the
concatenation character, creates the sense_key for that sense, which can in turn be used
to search the sense index file.
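
The concatenation described above can be sketched in Python as follows. The helper name make_sense_key is hypothetical, and the lemma and lex_sense values are assumed to have already been extracted from the concordance file by some other means.

```python
def make_sense_key(lemma, lex_sense):
    """Join a lemma and a lex_sense value with '%' to form a sense_key."""
    return f"{lemma}%{lex_sense}"


# Example values only; they are not taken from a real concordance file.
key = make_sense_key("dog", "1:05:00::")
print(key)
```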

A sense_key is the best way to represent a sense in semantic tagging or other systems that
refer to WordNet senses. sense_keys are independent of WordNet sense numbers and
synset_offsets, which vary between versions of the database. Using the sense index and a
sense_key, the corresponding synset (via the synset_offset) and WordNet sense number can
easily be obtained. A mapping from noun sense_keys in WordNet 1.6 to corresponding 2.0
sense_keys is provided with version 2.0, and is described in sensemap(5WN).

See wndb(5WN) for a thorough discussion of the WordNet database files.

File Format
The sense index file lists all of the senses in the WordNet database with each line
representing one sense. The file is in alphabetical order, fields are separated by one
space, and each line is terminated with a newline character.

Each line is of the form:

sense_key synset_offset sense_number tag_cnt

sense_key is an encoding of the word sense. Programs can construct a sense key in this
format and use it as a binary search key into the sense index file. The format of a
sense_key is described below.
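
A minimal sketch of a binary search by sense_key, written in Python with the standard bisect module. It searches lines already loaded into memory rather than seeking within the file on disk, and it assumes that the file's ordering agrees with Python's string comparison of the keys. The function name find_sense_line and the sample lines are invented for illustration.

```python
import bisect


def find_sense_line(lines, sense_key):
    """Return the sense index line for sense_key, or None if absent.

    `lines` must already be sorted by sense_key (assumed here to match
    Python's ordinary string ordering).
    """
    # The sense_key is the first space-separated field of each line.
    keys = [line.split(" ", 1)[0] for line in lines]
    i = bisect.bisect_left(keys, sense_key)
    if i < len(keys) and keys[i] == sense_key:
        return lines[i].rstrip("\n")
    return None


# Invented sample lines, sorted by key.
sorted_lines = [
    "dog%1:05:00:: 00001111 1 5\n",
    "dog%2:38:00:: 00002222 1 0\n",
    "dog_days%1:28:00:: 00003333 1 0\n",
]

print(find_sense_line(sorted_lines, "dog%2:38:00::"))
```

For repeated lookups, building the list of keys once and reusing it would avoid recomputing it on every call.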

synset_offset is the byte offset of the synset containing the sense in the database
"data" file corresponding to the part of speech encoded in the sense_key.
synset_offset is an 8 digit, zero-filled decimal integer, and can be used with fseek(3) to
read a synset from the data file. When passed to the WordNet library function
read_synset() along with the syntactic category, a data structure containing the parsed
synset is returned.
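
The fseek(3) approach can be approximated in Python as shown below. This sketch does not use the WordNet library; it only seeks to the offset and returns the raw line. The data file names in the table follow the data.pos naming described in wndb(5WN), the function name read_synset_line is hypothetical, and the in-memory file used in the demonstration is a stand-in for a real data file.

```python
import io

# Data file per synset type digit; satellites are assumed to share the
# adjective data file.
DATA_FILE_FOR_SS_TYPE = {
    "1": "data.noun",
    "2": "data.verb",
    "3": "data.adj",
    "4": "data.adv",
    "5": "data.adj",
}


def read_synset_line(data_file, synset_offset):
    """Seek to synset_offset in a binary file object and read one line."""
    # int() accepts the 8 digit, zero-filled form, e.g. "00000012".
    data_file.seek(int(synset_offset))
    return data_file.readline().decode("ascii").rstrip("\n")


# Stand-in for a data file: a 12-byte first line, then a fake synset line.
fake_data = io.BytesIO(b"header line\n00000012 fake synset line\n")
print(read_synset_line(fake_data, "00000012"))
```

With real files, the data file would be opened in binary mode so that the offset is interpreted in bytes.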

sense_number is a decimal integer indicating the sense number of the word, within the part
of speech encoded in sense_key, in the WordNet database. See wndb(5WN) for information
about how sense numbers are assigned.

tag_cnt is a decimal integer giving the number of times the sense is tagged in various semantic
concordance texts. A tag_cnt of 0 indicates that the sense has not been semantically
tagged.
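
The four fields described above can be split apart with a few lines of Python. This is only a sketch: the names SenseIndexEntry and parse_sense_index_line are invented for this example, and the sample line is illustrative rather than copied from a real sense index.

```python
from typing import NamedTuple


class SenseIndexEntry(NamedTuple):
    sense_key: str
    synset_offset: int
    sense_number: int
    tag_cnt: int


def parse_sense_index_line(line):
    """Split one sense index line into its four space-separated fields."""
    sense_key, synset_offset, sense_number, tag_cnt = line.rstrip("\n").split(" ")
    return SenseIndexEntry(
        sense_key, int(synset_offset), int(sense_number), int(tag_cnt)
    )


# Invented sample line in the documented layout.
entry = parse_sense_index_line("dog%1:05:00:: 00001111 1 5\n")
print(entry)
# The offset can be restored to its 8 digit, zero-filled form if needed.
print(f"{entry.synset_offset:08d}")
```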

Sense Key Encoding
A sense_key is represented as:

lemma%lex_sense

where lex_sense is encoded as:

ss_type:lex_filenum:lex_id:head_word:head_id

lemma is the ASCII text of the word or collocation as found in the WordNet database index
file for the sense's part of speech (pos). lemma is in lower case, and collocations are
formed by joining individual words with an underscore (_) character.

ss_type is a one digit decimal integer representing the synset type for the sense. See
Synset Type below for a listing of the numbers corresponding to each synset type.

lex_filenum is a two digit decimal integer representing the name of the lexicographer file
containing the synset for the sense. See lexnames(5WN) for the list of lexicographer file
names and their corresponding numbers.

lex_id is a two digit decimal integer that, when appended onto lemma, uniquely identifies
a sense within a lexicographer file. lex_id numbers usually start with 00, and are
incremented as additional senses of the word are added to the same file, although there is
no requirement that the numbers be consecutive or begin with 00. Note that a value of 00
is the default, and therefore is not present in lexicographer files. Only non-default
lex_id values must be explicitly assigned in lexicographer files. See wninput(5WN) for
information on the format of lexicographer files.

head_word is only present if the sense is in an adjective satellite synset. It is the
lemma of the first word of the satellite's head synset.

head_id is a two digit decimal integer that, when appended onto head_word, uniquely
identifies the sense of head_word within a lexicographer file, as described for lex_id.
There is a value in this field only if head_word is present.

Synset Type
The synset type is encoded as follows:

1 NOUN
2 VERB
3 ADJECTIVE
4 ADVERB
5 ADJECTIVE SATELLITE
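
The encoding above can be decoded with a short Python sketch such as the following. The function name parse_sense_key and the example keys are hypothetical, and the code assumes that neither the lex_sense fields nor head_word contain a colon.

```python
SS_TYPE_NAMES = {
    1: "NOUN",
    2: "VERB",
    3: "ADJECTIVE",
    4: "ADVERB",
    5: "ADJECTIVE SATELLITE",
}


def parse_sense_key(sense_key):
    """Split a sense_key into lemma and the five lex_sense fields."""
    # Split on the last '%' so that the lemma is everything before it.
    lemma, lex_sense = sense_key.rsplit("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return {
        "lemma": lemma,
        "ss_type": int(ss_type),
        "ss_type_name": SS_TYPE_NAMES[int(ss_type)],
        "lex_filenum": int(lex_filenum),
        "lex_id": int(lex_id),
        # head_word and head_id are empty unless the sense is a satellite.
        "head_word": head_word or None,
        "head_id": int(head_id) if head_id else None,
    }


# Hypothetical keys: one non-satellite sense and one satellite sense.
print(parse_sense_key("dog%1:05:00::"))
print(parse_sense_key("quick%5:00:00:fast:01"))
```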

NOTES

       For non-satellite senses the head_word and head_id fields have no values; however, the
       field separator character (:) is still present.

ENVIRONMENT VARIABLES (UNIX)

       WNHOME              Base directory for WordNet.  Default is /usr/local/WordNet-3.0.

       WNSEARCHDIR         Directory in which the WordNet database has been installed.  Default
                           is WNHOME/dict.

REGISTRY (WINDOWS)

       HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome
                           Base directory for WordNet.  Default is C:\Program Files\WordNet\3.0.

FILES