Ubuntu Manpage: Lingua::EN::Sentence - split text into sentences

Provided by: liblingua-en-sentence-perl_0.33-2_all

NAME

       Lingua::EN::Sentence - split text into sentences

SYNOPSIS

               use Lingua::EN::Sentence qw( get_sentences add_acronyms );

               add_acronyms('lt','gen');               ## adding support for 'Lt. Gen.'
               my $text = q{
               A sentence usually ends with a dot, exclamation or question mark optionally followed by a space!
               A string followed by 2 carriage returns denotes a sentence, even though it doesn't end in a dot

               Dots after single letters such as U.S.A. or in numbers like -12.34 will not cause a split
               as well as common abbreviations such as Dr. I. Smith, Ms. A.B. Jones, Apr. Calif. Esq.
               and (some text) ellipsis such as ... or . . are ignored.
               Some valid cases cannot be detected, such as the answer is X. It cannot easily be
               differentiated from the single letter-dot sequence to abbreviate a person's given name.
               Numbered points within a sentence will not cause a split 1. Like this one.
               See the code for all the rules that apply.
               This string has 7 sentences.
               };

               my $sentences = get_sentences($text);   # Get the sentences.
               my $i = 0;                              # Sentence counter.
               foreach my $sent (@$sentences)
               {
                       $i++;
                       print("SENTENCE $i:$sent\n");
               }

DESCRIPTION

       The "Lingua::EN::Sentence" module contains the function get_sentences, which splits text
       into its constituent sentences, based on a regular expression and a list of abbreviations
       (built-in and user-supplied).

       Certain well-known exceptions, such as abbreviations, may cause incorrect segmentation.
       Some of them are already built into this code and are taken care of. Still, if you see
       that there are words causing the get_sentences function to fail, you can add those to the
       module so that it recognises them.  Note that abbreviations are case sensitive, so
       'Mrs.' is recognised but not 'mrs.'.

ALGORITHM

       The first step is to mark the dot ending an abbreviation by changing it to a special
       character, so that it no longer causes a sentence split. The original dot is restored
       after the sentences are split.

       Basically, I use a 'brute' regular expression to split the text into sentences.  (Well,
       nothing is yet split - I just mark the end-of-sentence). Then I look into a set of rules
       which decide when an end-of-sentence is justified and when it's a mistake. In case of a
       mistake, the end-of-sentence mark is removed. What are such mistakes?

       Letter-dot sequences: U.S.A., i.e., e.g.

       Dot sequences: '..' or '...' or 'text . . more text'

       Two carriage returns denote the end of a sentence even if it doesn't end with a dot.
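
       The steps above can be sketched in a few lines of plain Perl. This is a simplified
       illustration, not the module's actual code: the function name split_sentences_sketch,
       the two placeholder characters and the reduced rule set (only the '..' and '...' rule
       is applied) are assumptions made for this example, and abbreviations are treated as
       literal strings rather than regular expressions.

```perl
use strict;
use warnings;

# Hypothetical placeholders chosen for this sketch only.
my $ABBR_DOT = "\x01";   # stands in for a protected dot
my $EOS      = "\x00";   # provisional end-of-sentence mark

sub split_sentences_sketch {
    my ($text, @abbreviations) = @_;

    # Step 1: protect the dot that ends a known abbreviation.
    my $abbr = join '|', map { quotemeta } @abbreviations;
    $text =~ s/\b($abbr)\./$1$ABBR_DOT/g if length $abbr;

    # Protect letter-dot sequences (U.S.A., e.g.) and dots inside numbers.
    $text =~ s/\b([A-Za-z])\./$1$ABBR_DOT/g;
    $text =~ s/(\d)\.(\d)/$1$ABBR_DOT$2/g;

    # Step 2: 'brute' marking of every candidate end of sentence.
    $text =~ s/([.!?]+)(\s+|\z)/$1$EOS/g;
    $text =~ s/\n\s*\n/$EOS/g;            # two carriage returns

    # Step 3: remove marks that are mistakes (here: after '..' or '...').
    $text =~ s/(\.{2,})\Q$EOS\E/$1 /g;

    # Step 4: split on the marks, restore the dots, trim, drop empty pieces.
    my @sentences;
    for my $s (split /\Q$EOS\E/, $text) {
        $s =~ s/\Q$ABBR_DOT\E/./g;
        $s =~ s/^\s+|\s+$//g;
        push @sentences, $s if $s =~ /[A-Za-z0-9]/;
    }
    return \@sentences;
}

my $demo = split_sentences_sketch(
    "Dr. Smith paid 12.34 dollars. He left... quickly! Done", 'Dr');
print "$_\n" for @$demo;
# Prints three lines:
#   Dr. Smith paid 12.34 dollars.
#   He left... quickly!
#   Done
```

       Marking first and splitting last keeps each rule independent: a rule only has to add
       or remove a mark, and the text is cut once at the very end.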

LIMITATIONS

       1) John F. Kennedy was a former president
       2) The answer is F. That ends the quiz

       In the first example, F. is detected as a person's initial and not the end of a sentence.
       But this means we cannot detect the true end of the first sentence in example 2, which
       is after the 'F.'. This case is not common though.
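
       The ambiguity can be reproduced with a small stand-alone splitter. The function
       naive_split below is hypothetical and much cruder than the module: it assumes that a
       dot after a lone capital letter is always an initial, which is exactly the assumption
       that makes the two examples indistinguishable.

```perl
use strict;
use warnings;

# Hypothetical splitter: break after . ! or ? followed by white space,
# unless the dot follows a lone capital letter (treated as an initial).
sub naive_split {
    my ($text) = @_;
    return [ split /(?<!\b[A-Z]\.)(?<=[.!?])\s+/, $text ];
}

for my $text (
    "John F. Kennedy was a former president.",
    "The answer is F. That ends the quiz.",
    "It rained. We left.",
) {
    my $parts = naive_split($text);
    printf "%d part(s): %s\n", scalar @$parts, $text;
}
```

       Both example strings come back as a single part, while the third string is split in
       two, because nothing in the characters around 'F.' tells an initial from an answer.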

FUNCTIONS

       All functions used should be requested in the 'use' clause. None is exported by default.

       get_sentences( $text )
           The get_sentences function takes a scalar containing ASCII text as an argument and
           returns a reference to an array of sentences that the text has been split into.
           Returned sentences are trimmed of white space at the beginning and end.  Strings
           with no alphanumeric characters in them are not returned as sentences.

       add_acronyms( @acronyms )
           This function is used for adding acronyms not supported by this code.  The input
           should be regular expressions for matching the desired acronyms, but should not
           include the final period ("."). So, for example, "blv?d" matches "blvd." and "bld.".
           "a\.mlf" will match "a.mlf.". You do not need to bother with acronyms consisting of
           single letters and dots (e.g. "U.S.A."), as these are found automatically. Note also
           that acronyms are searched for on a case insensitive basis.

           Please see 'Acronym/Abbreviations list' section for the abbreviations already
           supported by this module.
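
           To check what a pattern will cover before passing it to add_acronyms, it can be
           tried against sample words in plain Perl. The helper matches_acronym is
           hypothetical and only mimics the idea of "pattern followed by a final period";
           how the module itself anchors the patterns and handles letter case is not shown
           here.

```perl
use strict;
use warnings;

# Patterns in the form described above: regular expressions without the final period.
my @patterns = ('blv?d', 'a\.mlf');

# Hypothetical helper: count the patterns that match $word, where $word
# includes its final period.
sub matches_acronym {
    my ($word) = @_;
    return scalar grep { $word =~ /\A(?:$_)\.\z/ } @patterns;
}

for my $word ('blvd.', 'bld.', 'a.mlf.', 'blvvd.', 'axmlf.') {
    printf "%-8s %s\n", $word, matches_acronym($word) ? 'matches' : 'no match';
}
```

           The first three words match and the last two do not: "v?" allows at most one
           "v", and the escaped "\." matches only a literal dot.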

       get_acronyms( )
           This function will return the defined list of acronyms.

       set_acronyms( @my_acronyms )
           This function replaces the predefined acronym list with the given list. See
           "add_acronyms" for details on the input specifications.

       get_EOS( )
           This function returns the value of the string used to mark the end of sentence.  You
           might want to see what it is, and to make sure your text doesn't contain it.  You can
           use set_EOS() to alter the end-of-sentence string to whatever you desire.

       set_EOS( $new_EOS_string )
           This function alters the end-of-sentence string used to mark the end of sentences.
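
           One way to make sure the text does not contain the marker is to test a few
           candidate strings and keep the first one that is absent. The helper
           pick_safe_marker and the candidate strings are assumptions made for this sketch;
           the chosen value would then be passed to set_EOS().

```perl
use strict;
use warnings;

# Hypothetical helper: return the first candidate marker that does not occur
# in $text, or undef if every candidate collides with the text.
sub pick_safe_marker {
    my ($text, @candidates) = @_;
    for my $marker (@candidates) {
        return $marker if index($text, $marker) < 0;
    }
    return;
}

my $text   = "This text already contains the literal token <EOS> in it.";
my $marker = pick_safe_marker($text, '<EOS>', "\001EOS\001");

# With Lingua::EN::Sentence loaded and set_EOS imported, one could now call:
#   set_EOS($marker) if defined $marker;
print defined $marker ? "found a safe marker\n" : "no safe marker\n";
```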

       set_locale( $new_locale )
           Receives a language locale in the form language_country.character-set, for example
           "fr_CA.ISO8859-1" for Canadian French using character set ISO8859-1.

           Returns a reference to a hash containing the current locale formatting values.
           Returns undef if passed undef.

           The following will set the LC_COLLATE behaviour to Argentinian Spanish.  NOTE: The
           naming and availability of locales depends on your operating system.  Please consult
           the perllocale manpage for how to find out which locales are available on your
           system.

           $loc = set_locale( "es_AR.ISO8859-1" );

           This actually does this:

           $loc = setlocale( LC_ALL, "es_AR.ISO8859-1" );

Acronym/Abbreviations list

       You can use the get_acronyms() function to get the acronyms.  The list has become too
       long to reproduce in the documentation.

       If I come across a good general-purpose list, I'll incorporate it into this module.  Feel
       free to suggest such lists.

FUTURE WORK

               [1] Object-oriented usage
               [2] Supporting more than just English/French
               [3] Code optimization. Currently everything is based on regular expressions that
                   are not well optimized
               [4] Possibly use more semantic heuristics for detecting the beginning of a sentence

REPOSITORY

       <https://github.com/kimryan/Lingua-EN-Sentence>

AUTHOR

       Shlomo Yona shlomo@cs.haifa.ac.il

       Currently being maintained by Kim Ryan, kimryan at CPAN d o t org

COPYRIGHT AND LICENSE

       Copyright (c) 2001-2016 Shlomo Yona. All rights reserved.  Copyright (c) 2022 Kim Ryan.
       All rights reserved.

       This library is free software; you can redistribute it and/or modify it under the same
       terms as Perl itself.
