Ubuntu Manpage: hspell - Hebrew spellchecker (C API)

NAME

       hspell - Hebrew spellchecker (C API)

SYNOPSIS

       #include <hspell.h>

       int hspell_init(struct dict_radix **dictp, int flags);

       void hspell_uninit(struct dict_radix *dictp);

       int hspell_check_word(struct dict_radix *dict, const char *word, int *preflen);

       void hspell_trycorrect(struct dict_radix *dict, const char *word, struct corlist *cl);

       int corlist_init(struct corlist *cl);

       int corlist_free(struct corlist *cl);

       int corlist_n(struct corlist *cl);

       char *corlist_str(struct corlist *cl, int i);

       unsigned int hspell_is_canonic_gimatria(const char *word);

       typedef  int  hspell_word_split_callback_func(const  char *word, const char *baseword, int
       preflen, int prefspec);

       int     hspell_enum_splits(struct     dict_radix     *dict,     const     char      *word,
       hspell_word_split_callback_func *enumf);

       void hspell_set_dictionary_path(const char *path);

       const char *hspell_get_dictionary_path(void);

DESCRIPTION

       This  manual  describes  the  C  API  of  the  Hspell Hebrew spellchecker. Please refer to
       hspell(1) for a description of the Hspell project,  its  spelling  standard,  and  how  it
       works.

       The  hspell_init() function must be called first to initialize the Hspell library. It sets
       up some global structures (see CAVEATS section) and then reads  the  necessary  dictionary
       files  (whose  places  are  fixed  when  the library is built). The 'dictp' parameter is a
       pointer to a struct dict_radix* object, which is modified to point to  a  newly  allocated
       dictionary.  A typical hspell_init() call therefore looks like

          struct dict_radix *dict;
          hspell_init(&dict, flags);

       Note  that  the  (struct  dict_radix*) type is an opaque pointer - the library user has no
       access to the separate fields in this structure.

       The 'flags' parameter is a bitwise OR of flags that modify Hspell's default behavior.
       Turning on HSPELL_OPT_HE_SHEELA allows Hspell to recognize the interrogative He prefix
       (he ha-she'ela). HSPELL_OPT_DEFAULT is a synonym for turning on no special flag, i.e.,
       it evaluates to 0.

       hspell_init()  returns  0  on  success, or negative numbers on errors. Currently, the only
       error is -1, meaning the dictionary files could not be read.

       The hspell_uninit() function undoes the effects of hspell_init(), freeing any memory  that
       was allocated during initialization.

       The hspell_check_word() function checks whether a given word is a correct Hebrew word
       (possibly with prefix particles attached in a syntactically correct manner). It returns
       1 if the word is correct, or 0 if it is incorrect.

       The 'word' parameter should be a single Hebrew word, in the iso8859-8 encoding, possibly
       containing the ASCII quote or double-quote characters (signifying the geresh and
       gershayim used in Hebrew for abbreviations, acronyms, and a few foreign sounds). If the
       calling program works with other encodings, it must convert the word to iso8859-8
       first. In particular, cp1255 (the MS-Windows Hebrew encoding) extensions to iso8859-8,
       such as niqqud characters, geresh or gershayim, are currently not recognized and must
       be removed from the word prior to calling hspell_check_word().
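
       For callers that hold text in UTF-8, the conversion described above can be sketched as
       follows. This is an illustration only and not part of Hspell: the function name
       utf8_to_iso8859_8() is hypothetical. It relies on the fact that the Unicode letters
       alef to tav (U+05D0 to U+05EA) map to the iso8859-8 bytes 0xE0 to 0xFA, silently drops
       the range U+0591 to U+05C7 (cantillation marks, niqqud, and a few punctuation signs),
       and rejects everything else outside ASCII, including the Unicode geresh and gershayim,
       leaving it to the caller to decide how to handle them.

```c
#include <stddef.h>

/*
 * Hypothetical helper: convert a UTF-8 Hebrew word to iso8859-8.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 on unsupported input or if 'out' is too small.
 */
int utf8_to_iso8859_8(const char *in, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    if (outsize == 0)
        return -1;

    while (*p != '\0') {
        unsigned int cp;

        if (*p < 0x80) {
            cp = *p++;                   /* ASCII, including ' and " */
        } else if ((*p & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            cp = ((unsigned int)(*p & 0x1F) << 6) | (p[1] & 0x3F);
            p += 2;
            if (cp < 0x80)
                return -1;               /* overlong encoding */
        } else {
            return -1;                   /* Hebrew needs only 2-byte sequences */
        }

        if (cp >= 0x0591 && cp <= 0x05C7)
            continue;                    /* drop cantillation marks and niqqud */

        if (cp >= 0x05D0 && cp <= 0x05EA)
            cp = cp - 0x05D0 + 0xE0;     /* alef..tav -> 0xE0..0xFA */
        else if (cp >= 0x80)
            return -1;                   /* not representable in this sketch */

        if (n + 1 >= outsize)
            return -1;                   /* keep room for the NUL */
        out[n++] = (char)cp;
    }
    out[n] = '\0';
    return (int)n;
}
```

       The resulting buffer can be passed directly as the 'word' argument of
       hspell_check_word(). A program that already depends on iconv(3) could use it instead of
       a hand-written table, after stripping niqqud itself.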

       Into  the  'preflen'  parameter,  the  function  writes  back  the number of characters it
       recognized as a prefix particle - the rest of the 'word' is a stand-alone  word.   Because
       Hebrew  words  typically  can  be read in several different ways, this feature (of getting
       just one prefix from one possible reading) is usually not very useful, and it is likely to
       be removed in a future version.

       The hspell_enum_splits() function provides a way to get all possible splits of the
       given 'word' into an optional prefix particle and a stand-alone word. For each possible
       (and legal, as some words cannot accept certain prefixes) split, a user-defined callback
       function is called. This callback function is given the whole word, the stand-alone
       word, the length of the prefix, and a bitfield which describes what types of words this
       prefix can be attached to. Note that in some cases, a word beginning with the letter waw
       gets this waw doubled before a prefix, so sometimes
       strlen(word)!=strlen(baseword)+preflen.

       The hspell_trycorrect() function tries to find a list of possible corrections for an
       incorrect word. Because in Hebrew the word density is high (a random string of letters,
       especially if short, has a high probability of being a correct word), this function
       bases its corrections on the assumption of a spelling error (replacement of letters
       that sound alike, missing or spurious immot qri'a), not a typo (slipped finger on the
       keyboard, etc.) - see also CAVEATS.

       hspell_trycorrect() stores the correction list in a structure of type struct corlist.
       This structure must first be initialized with a call to corlist_init() and subsequently
       freed with corlist_free(). The corlist_n() macro returns the number of words held in an
       initialized corlist, and corlist_str() returns the i'th word. Accordingly, here is an
       example usage of hspell_trycorrect():

          struct corlist cl;
          int i;

          /* 'dict' comes from hspell_init(); 'w' is the misspelled word */
          printf ("Found misspelled word %s. Possible corrections:\n", w);
          corlist_init (&cl);
          hspell_trycorrect (dict, w, &cl);
          for (i=0; i<corlist_n(&cl); i++) {
              printf ("%s\n", corlist_str(&cl, i));
          }
          corlist_free (&cl);

       The hspell_is_canonic_gimatria() function checks whether the given word is a canonic
       gimatria - i.e., the proper way to write in gimatria the number it represents. The
       caller might want to accept canonic gimatria as proper Hebrew words, even if
       hspell_check_word() previously reported such a word to be non-existent.
       hspell_is_canonic_gimatria() returns the number represented as gimatria in 'word' if it
       is indeed proper gimatria (in canonic form), or 0 otherwise.

       hspell_init() normally reads the dictionary files from a path compiled into the
       library. This makes sense when the library's code and the dictionaries are distributed
       together, but in some scenarios the library user might want to use Hspell dictionaries
       that are already present on the system in an arbitrary path. The function
       hspell_set_dictionary_path() can be used to set this path, and should be called before
       hspell_init(). The given path is that of the word list; the other input files have that
       path with an appended suffix. hspell_get_dictionary_path() can be used to find the
       current path. On many installations, this defaults to
       "/usr/local/share/hspell/hebrew.wgz".

LINKING

       On  most  systems,  the Hspell library is compiled to use the Zlib library for reading the
       compressed dictionaries. Therefore, a program linking with the Hspell library must also be
       linked with the Zlib library (usually, by adding "-lz" to the compilation line).

       Programs that use autoconf to search for the Hspell library should remember to tell
       AC_CHECK_LIB to also link with the -lz library when checking for -lhspell.

CAVEATS

       While the API described here has been stable for years, it may change in the future.
       Users are encouraged to compare the values of the integer macros HSPELL_VERSION_MAJOR
       and HSPELL_VERSION_MINOR to those expected by the writer of the program. A third macro,
       HSPELL_VERSION_EXTRA, contains a string which can describe subrelease modifications
       (e.g., beta versions).

       The current Hspell C API is very low-level, in the sense that it leaves the user to
       implement many features that some users expect a spell-checker to provide. For example,
       it doesn't provide any facilities for a user-defined personal dictionary. It also has
       separate functions for checking valid Hebrew words and valid gimatria, and no function
       to do both. It is assumed that the caller (for example, a bigger spell-checking library
       or a word processor) will already have these facilities. If not, you may wish to look
       at the sources of hspell(1) for an example implementation.

       Currently there is no concept of separate Hspell "contexts" in an application. Some of
       the context is now global for the entire application: currently, a single list of legal
       prefix-particles is kept, and the dictionary read by hspell_init() is always read from
       the global default place. This may be solved in a later version, e.g., by switching to
       an API like:

          context = hspell_new_context();
          hspell_set_dictionary_path(context, "/some/path/hebrew.wgz");
          hspell_init(context, flags);
          ...
          hspell_check_word(context, word, preflenp);

       Note that despite the global context mentioned above, after initialization all
       functions described here are thread-safe, because they only read the dictionary data
       and never write to it.

       hspell_trycorrect() is not as powerful as it could have been, with typos or certain
       kinds of spelling mistakes not giving useful correction suggestions. Along with more
       types of corrections, hspell_trycorrect() needs a better way to order the likelihood of
       the corrections, as an unordered list of 100 corrections would be just as useful (or
       rather, useless) as none.

       In some cases of errors during hspell_init(), warning messages are printed to the
       standard error stream. This is a bad thing for a library to do.

       There are too many CAVEATS in this manual.

VERSION

       The version of hspell described by this manual page is 1.2.

COPYRIGHT

       Copyright  (C)  2000-2012,  Nadav  Har'El  <nyh@math.technion.ac.il>  and  Dan  Kenigsberg
       <danken@cs.technion.ac.il>.

       Hspell is free software, released under the  GNU  Affero  General  Public  License  (AGPL)
       version  3.   Note that not only the programs in the distribution, but also the dictionary
       files and the generated word lists, are licensed under the AGPL.  There is no warranty  of
       any kind.

       See the LICENSE file for more information and the exact license terms.

       The latest version of this software can be found in http://hspell.ivrix.org.il/