Ubuntu Manpage: String::Approx - Perl extension for approximate matching (fuzzy matching)

name
synopsis
description
match
substitute
index
slice
distance
controlling the cache
notes
version
changes from version 2
acknowledgements
author
copyright and license

Provided by: libstring-approx-perl_3.28-2build4_amd64

NAME

       String::Approx - Perl extension for approximate matching (fuzzy matching)

SYNOPSIS

         use String::Approx 'amatch';

         print if amatch("foobar");

         my @matches = amatch("xyzzy", @inputs);

         my @catches = amatch("plugh", ['2'], @inputs);

DESCRIPTION

String::Approx lets you match and substitute strings approximately. With this you can emulate errors:
typing errors, spelling errors, closely related vocabularies (colour color), genetic mutations (GAG ACT),
abbreviations (McScot, MacScot).

NOTE: String::Approx suits the task of string matching, not string comparison, and it works for strings,
not for text.

If you want to compare strings for similarity, you probably just want the Levenshtein edit distance
(explained below), the Text::Levenshtein and Text::LevenshteinXS modules in CPAN. See also
Text::WagnerFischer and Text::PhraseDistance. (There are functions for this in String::Approx, e.g.
adist(), but their results sometimes differ from the bare Levenshtein et al.)

If you want to compare things like text or source code, consisting of words or tokens and phrases and
sentences, or expressions and statements, you should probably use some other tool than String::Approx,
like for example the standard UNIX diff(1) tool, or the Algorithm::Diff module from CPAN.

The measure of approximateness is the Levenshtein edit distance. It is the total number of "edits":
insertions,

word world

deletions,

monkey money

and substitutions

sun fun

required to transform a string to another string. For example, to transform "lead" into "gold", you need
three edits:

lead gead goad gold

The edit distance of "lead" and "gold" is therefore three, or 75%.

String::Approx uses the Levenshtein edit distance as its measure, but String::Approx is not well-suited
for comparing strings of different length, in other words, if you want a "fuzzy eq", see above.
String::Approx is more like regular expressions or index(), it finds substrings that are close matches.>

MATCH

               use String::Approx 'amatch';

               $matched     = amatch("pattern")
               $matched     = amatch("pattern", [ modifiers ])

               $any_matched = amatch("pattern", @inputs)
               $any_matched = amatch("pattern", [ modifiers ], @inputs)

               @match       = amatch("pattern")
               @match       = amatch("pattern", [ modifiers ])

               @matches     = amatch("pattern", @inputs)
               @matches     = amatch("pattern", [ modifiers ], @inputs)

       Match pattern approximately.  In list context return the matched @inputs.  If no inputs are given, match
       against the $_.  In scalar context return true if any of the inputs match, false if none match.

       Notice that the pattern is a string.  Not a regular expression.  None of the regular expression notations
       (^, ., *, and so on) work.  They are characters just like the others.  Note-on-note: some limited form of
       "regular expressionism" is planned in future: for example character classes ([abc]) and any-chars (.).
       But that feature will be turned on by a special modifier (just a guess: "r"), so there should be no
       backward compatibility problem.

       Notice also that matching is not symmetric.  The inputs are matched against the pattern, not the other
       way round.  In other words: the pattern can be a substring, a submatch, of an input element.  An input
       element is always a superstring of the pattern.

   MODIFIERS
       With the modifiers you can control the amount of approximateness and certain other control variables.
       The modifiers are one or more strings, for example "i", within a string optionally separated by
       whitespace.  The modifiers are inside an anonymous array: the [ ] in the syntax are not notational, they
       really do mean [ ], for example [ "i", "2" ].  ["2 i"] would be identical.

       The implicit default approximateness is 10%, rounded up.  In other words: every tenth character in the
       pattern may be an error, an edit.  You can explicitly set the maximum approximateness by supplying a
       modifier like

               number
               number%

       Examples: "3", "15%".

       Note that "0%" is not rounded up, it is equal to 0.

       Using a similar syntax you can separately control the maximum number of insertions, deletions, and
       substitutions by prefixing the numbers with I, D, or S, like this:

               Inumber
               Inumber%
               Dnumber
               Dnumber%
               Snumber
               Snumber%

       Examples: "I2", "D20%", "S0".

       You can ignore case ("A" becames equal to "a" and vice versa) by adding the "i" modifier.

       For example

               [ "i 25%", "S0" ]

       means ignore case, allow every fourth character to be "an edit", but allow no substitutions.  (See NOTES
       about disallowing substitutions or insertions.)

       NOTE: setting "I0 D0 S0" is not equivalent to using index().  If you want to use index(), use index().

SUBSTITUTE

               use String::Approx 'asubstitute';

               @substituted = asubstitute("pattern", "replacement")
               @substituted = asubstitute("pattern", "replacement", @inputs)
               @substituted = asubstitute("pattern", "replacement", [ modifiers ])
               @substituted = asubstitute("pattern", "replacement",
                                          [ modifiers ], @inputs)

       Substitute approximate pattern with replacement and return as a list <copies> of @inputs, the
       substitutions having been made on the elements that did match the pattern.  If no inputs are given,
       substitute in the $_.  The replacement can contain magic strings $&, $`, $' that stand for the matched
       string, the string before it, and the string after it, respectively.  All the other arguments are as in
       amatch(), plus one additional modifier, "g" which means substitute globally (all the matches in an
       element and not just the first one, as is the default).

       See "BAD NEWS" about the unfortunate stinginess of asubstitute().

INDEX

               use String::Approx 'aindex';

               $index   = aindex("pattern")
               @indices = aindex("pattern", @inputs)
               $index   = aindex("pattern", [ modifiers ])
               @indices = aindex("pattern", [ modifiers ], @inputs)

       Like amatch() but returns the index/indices at which the pattern matches approximately.  In list context
       and if @inputs are used, returns a list of indices, one index for each input element.  If there's no
       approximate match, -1 is returned as the index.

       NOTE: if there is character repetition (e.g. "aa") either in the pattern or in the text, the returned
       index might start "too early".  This is consistent with the goal of the module of matching "as early as
       possible", just like regular expressions (that there might be a "less approximate" match starting later
       is of somewhat irrelevant).

       There's also backwards-scanning arindex().

SLICE

               use String::Approx 'aslice';

               ($index, $size)   = aslice("pattern")
               ([$i0, $s0], ...) = aslice("pattern", @inputs)
               ($index, $size)   = aslice("pattern", [ modifiers ])
               ([$i0, $s0], ...) = aslice("pattern", [ modifiers ], @inputs)

       Like aindex() but returns also the size (length) of the match.  If the match fails, returns an empty list
       (when matching against $_) or an empty anonymous list corresponding to the particular input.

       NOTE: size of the match will very probably be something you did not expect (such as longer than the
       pattern, or a negative number).  This may or may not be fixed in future releases. Also the beginning of
       the match may vary from the expected as with aindex(), see above.

       If the modifier

               "minimal_distance"

       is used, the minimal possible edit distance is returned as the third element:

               ($index, $size, $distance) = aslice("pattern", [ modifiers ])
               ([$i0, $s0, $d0], ...)     = aslice("pattern", [ modifiers ], @inputs)

DISTANCE

               use String::Approx 'adist';

               $dist = adist("pattern", $input);
               @dist = adist("pattern", @input);

       Return the edit distance or distances between the pattern and the input or inputs.  Zero edit distance
       means exact match.  (Remember that the match can 'float' in the inputs, the match is a substring match.)
       If the pattern is longer than the input or inputs, the returned distance or distances is or are negative.

               use String::Approx 'adistr';

               $dist = adistr("pattern", $input);
               @dist = adistr("pattern", @inputs);

       Return the relative edit distance or distances between the pattern and the input or inputs.  Zero
       relative edit distance means exact match, one means completely different.  (Remember that the match can
       'float' in the inputs, the match is a substring match.)  If the pattern is longer than the input or
       inputs, the returned distance or distances is or are negative.

       You can use adist() or adistr() to sort the inputs according to their approximateness:

               my %d;
               @d{@inputs} = map { abs } adistr("pattern", @inputs);
               my @d = sort { $d{$a} <=> $d{$b} } @inputs;

       Now @d contains the inputs, the most like "pattern" first.

CONTROLLING THE CACHE

"String::Approx" maintains a LU (least-used) cache that holds the 'matching engines' for each instance of
a pattern+modifiers. The cache is intended to help the case where you match a small set of patterns
against a large set of string. However, the more engines you cache the more you eat memory. If you have
a lot of different patterns or if you have a lot of memory to burn, you may want to control the cache
yourself. For example, allowing a larger cache consumes more memory but probably runs a little bit
faster since the cache fills (and needs flushing) less often.

The cache has two parameters: max and purge. The first one is the maximum size of the cache and the
second one is the cache flushing ratio: when the number of cache entries exceeds max, max times purge
cache entries are flushed. The default values are 1000 and 0.75, respectively, which means that when the
1001st entry would be cached, 750 least used entries will be removed from the cache. To access the
parameters you can use the calls

$now_max = String::Approx::cache_max();
String::Approx::cache_max($new_max);

$now_purge = String::Approx::cache_purge();
String::Approx::cache_purge($new_purge);

$limit = String::Approx::cache_n_purge();

To be honest, there are actually two caches: the first one is used far the patterns with no modifiers,
the second one for the patterns with pattern modifiers. Using the standard parameters you will therefore
actually cache up to 2000 entries. The above calls control both caches for the same price.

To disable caching completely use

String::Approx::cache_disable();

Note that this doesn't flush any possibly existing cache entries, to do that use

String::Approx::cache_flush_all();

NOTES

       Because matching is by substrings, not by whole strings, insertions and substitutions produce often very
       similar results: "abcde" matches "axbcde" either by insertion or substitution of "x".

       The maximum edit distance is also the maximum number of edits.  That is, the "I2" in

               amatch("abcd", ["I2"])

       is useless because the maximum edit distance is (implicitly) 1.  You may have meant to say

               amatch("abcd", ["2D1S1"])

       or something like that.

       If you want to simulate transposes

               feet fete

       you need to allow at least edit distance of two because in terms of our edit primitives a transpose is
       first one deletion and then one insertion.

   TEXT POSITION
       The starting and ending positions of matching, substituting, indexing, or slicing can be changed from the
       beginning and end of the input(s) to some other positions by using either or both of the modifiers

               "initial_position=24"
               "final_position=42"

       or the both the modifiers

               "initial_position=24"
               "position_range=10"

       By setting the "position_range" to be zero you can limit (anchor) the operation to happen only once (if a
       match is possible) at the position.

VERSION

       Major release 3.

CHANGES FROM VERSION 2

   GOOD NEWS
       The version 3 is 2-3 times faster than version 2
       No pattern length limitation
           The algorithm is independent on the pattern length: its time complexity is O(kn), where k is the
           number of edits and n the length of the text (input).  The preprocessing of the pattern will of
           course take some O(m) (m being the pattern length) time, but amatch() and asubstitute() cache the
           result of this preprocessing so that it is done only once per pattern.

   BAD NEWS
       You do need a C compiler to install the module
           Perl's regular expressions are no more used; instead a faster and more scalable algorithm written in
           C is used.

       asubstitute() is now always stingy
           The string matched and substituted is now always stingy, as short as possible.  It used to be as long
           as possible.  This is an unfortunate change stemming from switching the matching algorithm.  Example:
           with edit distance of two and substituting for "word" from "cork" and "wool" previously did match
           "cork" and "wool".  Now it does match "or" and "wo".  As little as possible, or, in other words, with
           as much approximateness, as many edits, as possible.  Because there is no need to match the "c" of
           "cork", it is not matched.

       no more aregex() because regular expressions are no more used
       no more "compat1" for String::Approx version 1 compatibility

ACKNOWLEDGEMENTS

       The following people have provided valuable test cases, documentation clarifications, and other feedback:

       Jared August, Arthur Bergman, Anirvan Chatterjee, Steve A. Chervitz, Aldo Calpini, David Curiel, Teun van
       den Dool, Alberto Fontaneda, Rob Fugina, Dmitrij Frishman, Lars Gregersen, Kevin Greiner, B. Elijah
       Griffin, Mike Hanafey, Mitch Helle, Ricky Houghton, 'idallen', Helmut Jarausch, Damian Keefe, Ben
       Kennedy, Craig Kelley, Franz Kirsch, Dag Kristian, Mark Land, J. D. Laub, John P. Linderman, Tim Maher,
       Juha Muilu, Sergey Novoselov, Andy Oram, Ji Y Park, Eric Promislow, Nikolaus Rath, Stefan Ram, Slaven
       Rezic, Dag Kristian Rognlien, Stewart Russell, Slaven Rezic, Chris Rosin, Pasha Sadri, Ilya Sandler, Bob
       J.A. Schijvenaars, Ross Smith, Frank Tobin, Greg Ward, Rich Williams, Rick Wise.

       The matching algorithm was developed by Udi Manber, Sun Wu, and Burra Gopal in the Department of Computer
       Science, University of Arizona.

AUTHOR

       Jarkko Hietaniemi <jhi@iki.fi>

COPYRIGHT AND LICENSE

       Copyright 2001-2013 by Jarkko Hietaniemi

       This library is free software; you can redistribute it and/or modify under either the terms of the
       Artistic License 2.0, or the GNU Library General Public License, Version 2.  See the files Artistic and
       LGPL for more details.

       Furthermore: no warranties or obligations of any kind are given, and the separate file COPYRIGHT must be
       included intact in all copies and derived materials.