Provided by: spamassassin_3.4.2-0ubuntu0.14.04.1_all

NAME

       Mail::SpamAssassin::Plugin::TextCat - TextCat language guesser

SYNOPSIS

         loadplugin     Mail::SpamAssassin::Plugin::TextCat

DESCRIPTION

       This plugin will try to guess the language used in the message body text.

       You can use the "ok_languages" directive to set which languages are considered okay for incoming mail.
       If the guessed language is not okay, "UNWANTED_LANGUAGE_BODY" is triggered.

       The plugin always adds the results to an "X-Language" name-value pair in the message metadata.
       This may be useful as Bayes tokens and can also be used in rules for scoring. The results can also be
       added to marked-up messages using "add_header", with the _LANGUAGES_ tag. See Mail::SpamAssassin::Conf
       for details.

       Note: the language cannot always be recognized with sufficient confidence.  In that case, no action is
       taken.

       You can use the _TEXTCATRESULTS_ tag to view the internal ngram scoring, which may help when fine-tuning
       settings.
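
       As a minimal sketch, both tags can be exposed as message headers with "add_header". The header names
       "Languages" and "TextCat-Results" are arbitrary choices for illustration; SpamAssassin prefixes added
       headers with "X-Spam-":

         add_header all Languages _LANGUAGES_
         add_header all TextCat-Results _TEXTCATRESULTS_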

USER OPTIONS

       ok_languages xx [ yy zz ... ]      (default: all)
           This option is used to specify which languages are considered okay for incoming mail.  SpamAssassin
           will try to detect the language used in the message body text.

           Note that the language cannot always be recognized with sufficient confidence. In that case, no
           action is taken.

           The rule "UNWANTED_LANGUAGE_BODY" is triggered if none of the languages detected are in the "ok"
           list. Note that this is the only effect of the "ok" list. It does not act as a whitelist against any
           other form of spam scanning.

           In your configuration, you must use the two- or three-letter language specifier in lowercase, not the
           English name for the language.  You may also specify "all" if a desired language is not listed, or if
           you want to allow any language.  The default setting is "all".

           Examples:

             ok_languages all         (allow all languages)
             ok_languages en          (only allow English)
             ok_languages en ja zh    (allow English, Japanese, and Chinese)

           Note: if there are multiple ok_languages lines, only the last one is used.

           Select the languages to allow from the list below:

           af   - Afrikaans
           am   - Amharic
           ar   - Arabic
           be   - Byelorussian
           bg   - Bulgarian
           bs   - Bosnian
           ca   - Catalan
           cs   - Czech
           cy   - Welsh
           da   - Danish
           de   - German
           el   - Greek
           en   - English
           eo   - Esperanto
           es   - Spanish
           et   - Estonian
           eu   - Basque
           fa   - Persian
           fi   - Finnish
           fr   - French
           fy   - Frisian
           ga   - Irish Gaelic
           gd   - Scottish Gaelic
           he   - Hebrew
           hi   - Hindi
           hr   - Croatian
           hu   - Hungarian
           hy   - Armenian
           id   - Indonesian
           is   - Icelandic
           it   - Italian
           ja   - Japanese
           ka   - Georgian
           ko   - Korean
           la   - Latin
           lt   - Lithuanian
           lv   - Latvian
           mr   - Marathi
           ms   - Malay
           ne   - Nepali
           nl   - Dutch
           no   - Norwegian
           pl   - Polish
           pt   - Portuguese
           qu   - Quechua
           rm   - Rhaeto-Romance
           ro   - Romanian
           ru   - Russian
           sa   - Sanskrit
           sco  - Scots
           sk   - Slovak
           sl   - Slovenian
           sq   - Albanian
           sr   - Serbian
           sv   - Swedish
           sw   - Swahili
           ta   - Tamil
           th   - Thai
           tl   - Tagalog
           tr   - Turkish
           uk   - Ukrainian
           vi   - Vietnamese
           yi   - Yiddish
           zh   - Chinese (both Traditional and Simplified)
           zh.big5   - Chinese (Traditional only)
           zh.gb2312 - Chinese (Simplified only)
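
           As a sketch, a site that expects only English and German mail might combine the directive with a
           local score for the rule. The score value 2.5 is an arbitrary illustration, not a recommended
           setting:

             ok_languages en de
             score UNWANTED_LANGUAGE_BODY 2.5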

       inactive_languages xx [ yy zz ... ]          (default: see below)
           This option is used to specify which languages will not be considered when trying to guess the
           language.  For performance reasons, supported languages that have fewer than about 5 million speakers
           are disabled by default.  Note that listing a language in "ok_languages" automatically enables it for
           that user.

           The default setting is:

           bs cy eo et eu fy ga gd is la lt lv rm sa sco sl yi

           That list is Bosnian, Welsh, Esperanto, Estonian, Basque, Frisian, Irish Gaelic, Scottish Gaelic,
           Icelandic, Latin, Lithuanian, Latvian, Rhaeto-Romance, Sanskrit, Scots, Slovenian, and Yiddish.
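
           For example, a site that receives Welsh mail could remove "cy" from the default list. This is a
           sketch that assumes the directive replaces the whole list, so every other language that should stay
           disabled is repeated:

             inactive_languages bs eo et eu fy ga gd is la lt lv rm sa sco sl yi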

       textcat_max_languages N (default: 3)
           The maximum number of languages any one message can simultaneously match before its classification is
           considered unknown.  You can try reducing this to 2 or possibly even 1 for more confident results, as
           it's unusual for a message to contain multiple languages.

           Read the description for textcat_acceptable_score as well, as these settings are closely related.
           Scoring affects how many languages might be matched, and this option sets the "false positive limit"
           beyond which the engine is assumed unable to decide which languages the message really contains.

       textcat_optimal_ngrams N (default: 0)
           Ngrams that occur fewer times than this number in the message are removed before comparison.  This
           can be used to speed up the program for longer inputs.  For shorter inputs, this should be set to 0.

       textcat_max_ngrams N (default: 400)
           The maximum number of ngrams that should be compared with each of the language models (note that
           each of those models is used completely).

       textcat_acceptable_score N (default: 1.02)
           Include any language that scores at least "textcat_acceptable_score" in the returned list of
           languages.

           This setting is effectively a ratio relative to the best score. Any language whose internal ngram
           score is within that ratio of the best score (for example, within 2% for a setting of 1.02) is
           included in the results.  Values larger than 1.05 are not recommended, as they can generate many
           false matches.  A setting of 1.00 would mean that the single best-scoring language is always forcibly
           selected, but this is not recommended, as textcat_max_languages then cannot do its job of classifying
           the language as uncertain.

           Read the description for textcat_max_languages as well, as these settings are closely related.

           You can use the _TEXTCATRESULTS_ tag to view the internal ngram scoring, which may help when
           fine-tuning settings.
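
           As an illustration with hypothetical numbers, a stricter setup could look like the following:

             textcat_max_languages    2
             textcat_acceptable_score 1.02

           Under this sketch, if the best internal score were 1000, languages scoring within 2% of it (up to
           1020, assuming that lower scores indicate a closer match) would be returned alongside the best
           match.  If more than two languages in total fell inside that window, the message language would be
           treated as unknown.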