Ubuntu Manpage: Text::Ngram - Ngram analysis of text


Provided by: libtext-ngram-perl_0.15-2_amd64

NAME

       Text::Ngram - Ngram analysis of text

SYNOPSIS

         use Text::Ngram qw(ngram_counts add_to_counts);
         my $text   = "abcdefghijklmnop";
         my $hash_r = ngram_counts($text, 3); # Window size = 3
         # $hash_r => { abc => 1, bcd => 1, ... }

         add_to_counts($more_text, 3, $hash_r);

DESCRIPTION

       n-Gram analysis is a field in textual analysis which uses sliding window character sequences in order to
       aid topic analysis, language determination and so on. The n-gram spectrum of a document can be used to
       compare and filter documents in multiple languages, prepare word prediction networks, and perform
       spelling correction.

       The neat thing about n-grams, though, is that they're really easy to determine. For n=3, for instance, we
       compute the n-gram counts like so:

           the cat sat on the mat
           ---                     $counts{"the"}++;
            ---                    $counts{"he "}++;
             ---                   $counts{"e c"}++;
              ...
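
       The sliding window shown above can be sketched in a few lines of pure Perl. This is only an
       illustration under simple assumptions: naive_ngram_counts is a hypothetical helper, not part of
       Text::Ngram, and it does none of the module's lowercasing or punctuation handling.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Ngram: count character n-grams
# with a plain sliding window (no case folding, no punctuation handling).
sub naive_ngram_counts {
    my ($text, $n) = @_;
    my %counts;
    # Slide a window of width $n over the text, one character at a time.
    # If the text is shorter than $n the range is empty and nothing is counted.
    for my $i (0 .. length($text) - $n) {
        $counts{ substr($text, $i, $n) }++;
    }
    return \%counts;
}

my $counts = naive_ngram_counts("the cat sat on the mat", 3);

# Print the histogram, one n-gram per line.
printf "'%s' => %d\n", $_, $counts->{$_} for sort keys %$counts;
```

       In this sketch "the" and "he " are each counted twice, because both occurrences of "the" in the
       sentence are followed by a space.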

       This module provides an efficient XS-based implementation of n-gram spectrum analysis.

       There are two functions which can be imported:

   ngram_counts
       This first function returns a hash reference with the n-gram histogram of the text for the given window
       size. The default window size is 5.

           $href = ngram_counts(\%config, $text, $window_size);

       As of version 0.14, the %config may instead be passed in as named arguments:

           $href = ngram_counts($text, $window_size, %config);

       The only necessary parameter is $text.

       The possible values for %config are:

       flankbreaks

       If set to 1 (default), breaks are flanked by spaces; if set to 0, they're not. Breaks are punctuation and
       other non-alphabetic characters, which, unless you use "punctuation => 1" in your configuration, do not
       make it into the returned hash.

       Here's an example, supposing you're using the default value for punctuation (0):

         my $text = "Hello, world";
         my $hash = ngram_counts($text, 5);

       That produces the following ngrams:

         {
           'Hello' => 1,
           'ello ' => 1,
           ' worl' => 1,
           'world' => 1,
         }

       On the other hand, this:

         my $text = "Hello, world";
         my $hash = ngram_counts({flankbreaks => 0}, $text, 5);

       Produces the following ngrams:

         {
           'Hello' => 1,
           ' worl' => 1,
           'world' => 1,
         }

       lowercase

       If set to 0, casing is preserved. If set to 1, all letters are lowercased before counting ngrams. Default
       is 1.

           # Get all ngrams of size 4 preserving case
           $href_p = ngram_counts( {lowercase => 0}, $text, 4 );
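
       To see why case folding matters for the counts, here is a minimal pure-Perl sketch. It assumes a
       plain sliding-window counter (naive_counts, a hypothetical helper that is not part of Text::Ngram)
       and only imitates the effect of the lowercase option.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Ngram: sliding-window counter with an
# optional case-folding step, to show how lowercasing merges n-grams.
sub naive_counts {
    my ($text, $n, $lowercase) = @_;
    $text = lc $text if $lowercase;
    my %counts;
    $counts{ substr($text, $_, $n) }++ for 0 .. length($text) - $n;
    return \%counts;
}

my $folded    = naive_counts("The theme", 3, 1);   # imitates lowercase => 1
my $preserved = naive_counts("The theme", 3, 0);   # imitates lowercase => 0

print "folded    'the': $folded->{the}\n";
print "preserved 'the': $preserved->{the}\n";
print "preserved 'The': $preserved->{The}\n";
```

       With folding, "The" and "the" land in the same bucket; without it, they are counted separately.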

       punctuation

       If set to 0 (default), punctuation is removed before calculating the ngrams.  Set to 1 to preserve it.

           # Get all ngrams of size 2 preserving punctuation
           $href_p = ngram_counts( {punctuation => 1}, $text, 2 );

       spaces

       If set to 0 (default is 1), no ngrams containing spaces will be returned.

          # Get all ngrams of size 3 that do not contain spaces
          $href = ngram_counts( {spaces => 0}, $text, 3);

       If you're going to request both types of ngrams, then the best way to avoid calculating the same thing
       twice is probably this:

           $href_with_spaces = ngram_counts($text, $window);   # $window is optional
           # Copy the hash so the original counts stay intact
           $href_no_spaces = { %$href_with_spaces };
           # Drop every ngram that contains a space
           for (keys %$href_no_spaces) { delete $href_no_spaces->{$_} if / / }

   add_to_counts
       This incrementally adds to the supplied hash; if $window is zero or undefined, then the window size is
       computed from the hash keys.

           add_to_counts($more_text, $window, $href);
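
       The idea of inferring the window size from the hash keys can be sketched in pure Perl as follows.
       naive_add_to_counts is a hypothetical stand-in, not the module's XS code; in particular, dying on
       an empty hash is an assumption of this sketch and says nothing about how add_to_counts itself
       behaves in that case.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl stand-in for add_to_counts. It skips the text
# normalisation (case, punctuation) that the real module applies.
sub naive_add_to_counts {
    my ($text, $window, $counts) = @_;

    # No window given: infer it from the length of an existing key.
    if (!$window) {
        my ($any_key) = keys %$counts;
        die "cannot infer window size from an empty hash\n"
            unless defined $any_key;
        $window = length $any_key;
    }

    # Increment the existing histogram in place.
    $counts->{ substr($text, $_, $window) }++
        for 0 .. length($text) - $window;

    return $counts;
}

my $counts = { abc => 1, bcd => 1 };            # trigram counts for "abcd"
naive_add_to_counts("bcde", undef, $counts);    # window inferred as 3

print "$_ => $counts->{$_}\n" for sort keys %$counts;
```

       Because every key in an n-gram histogram has the same length, any one key is enough to recover
       the window size.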

TO DO

       •     Look further into the tests. Sort them and add more.

AUTHOR

       Maintained by Alberto Simoes, "ambs@cpan.org".

       Previously maintained by Jose Castro, "cog@cpan.org".  Originally created by Simon Cozens,
       "simon@cpan.org".

COPYRIGHT AND LICENSE

       Copyright 2006 by Alberto Simoes

       Copyright 2004 by Jose Castro

       Copyright 2003 by Simon Cozens

       This library is free software; you can redistribute it and/or modify it under the same terms as Perl
       itself.
