Ubuntu Manpage: Text::Unidecode -- US-ASCII transliterations of Unicode text

Provided by: libtext-unidecode-perl_0.04-2_all

NAME

       Text::Unidecode -- US-ASCII transliterations of Unicode text

SYNOPSIS

         use utf8;
         use Text::Unidecode;
         print unidecode(
           "\x{5317}\x{4EB0}\n"
            # those are the Chinese characters for Beijing
         );

         # That prints: Bei Jing

DESCRIPTION

It often happens that you have non-Roman text data in Unicode, but you can't display it -- usually
because you're trying to show it to a user via an application that doesn't support Unicode, or because
the fonts you need aren't accessible. You could represent the Unicode characters as "???????" or
"\15BA\15A0\1610...", but that's nearly useless to the user who actually wants to read what the text
says.

What Text::Unidecode provides is a function, "unidecode(...)" that takes Unicode data and tries to
represent it in US-ASCII characters (i.e., the universally displayable characters between 0x00 and 0x7F).
The representation is almost always an attempt at transliteration -- i.e., conveying, in Roman letters,
the pronunciation expressed by the text in some other writing system. (See the example in the synopsis.)

Unidecode's ability to transliterate is limited by two factors:

• The amount and quality of data in the original

So if you have Hebrew data that has no vowel points in it, then Unidecode cannot guess what vowels
should appear in a pronounciation. S f y hv n vwls n th npt, y wn't gt ny vwls n th tpt. (This is a
specific application of the general principle of "Garbage In, Garbage Out".)

• Basic limitations in the Unidecode design

Writing a real and clever transliteration algorithm for any single language usually requires a lot of
time, and at least a passable knowledge of the language involved. But Unicode text can convey more
languages than I could possibly learn (much less create a transliterator for) in the entire rest of
my lifetime. So I put a cap on how intelligent Unidecode could be, by insisting that it support only
context-insensitive transliteration. That means missing the finer details of any given writing
system, while still hopefully being useful.

Unidecode, in other words, is quick and dirty. Sometimes the output is not so dirty at all: Russian and
Greek seem to work passably; and while Thaana (Divehi, AKA Maldivian) is a definitely non-Western writing
system, setting up a mapping from it to Roman letters seems to work pretty well. But sometimes the
output is very dirty: Unidecode does quite badly on Japanese and Thai.

If you want a smarter transliteration for a particular language than Unidecode provides, then you should
look for (or write) a transliteration algorithm specific to that language, and apply it instead of (or at
least before) applying Unidecode.

In other words, Unidecode's approach is broad (knowing about dozens of writing systems), but shallow (not
being meticulous about any of them).

FUNCTIONS

       Text::Unidecode provides one function, "unidecode(...)", which is exported by default.  It can be used in
       a variety of calling contexts:

       "$out = unidecode($in);" # scalar context
           This returns a copy of $in, transliterated.

       "$out = unidecode(@in);" # scalar context
           This is the same as "$out = unidecode(join '', @in);"

       "@out = unidecode(@in);" # list context
           This returns a list consisting of copies of @in, each transliterated.  This is the same  as  "@out  =
           map scalar(unidecode($_)), @in;"

       "unidecode(@items);" # void context
       "unidecode(@bar, $foo, @baz);" # void context
           Each  item on input is replaced with its transliteration.  This is the same as "for(@bar, $foo, @baz)
           { $_ = unidecode($_) }"

       You should make a minimum of assumptions about the output  of  "unidecode(...)".   For  example,  if  you
       assume  an  all-alphabetic  (Unicode)  string  passed  to  "unidecode(...)" will return an all-alphabetic
       string, you're wrong -- some alphabetic Unicode  characters  are  transliterated  as  strings  containing
       punctuation (e.g., the Armenian letter at 0x0539 currently transliterates as "T`".

       However, these are the assumptions you can make:

       •   Each character 0x0000 - 0x007F transliterates as itself.  That is, "unidecode(...)" is 7-bit pure.

       •   The  output  of  "unidecode(...)" always consists entirely of US-ASCII characters -- i.e., characters
           0x0000 - 0x007F.

       •   All Unicode characters translate to a sequence of (any number of) characters that are newline  ("\n")
           or  in  the  range  0x0020-0x007E.   That is, no Unicode character translates to "\x01", for example.
           (Altho if you have a "\x01" on input, you'll get a "\x01" in output.)

       •   Yes, some transliterations produce a "\n" -- but just a few, and only with good  reason.   Note  that
           the value of newline ("\n") varies from platform to platform -- see "perlport" in perlport.

       •   Some Unicode characters may transliterate to nothing (i.e., empty string).

       •   Very  many Unicode characters transliterate to multi-character sequences.  E.g., Han character 0x5317
           transliterates as the four-character string "Bei ".

       •   Within these constraints, I may change the transliteration of characters  in  future  versions.   For
           example,  if  someone  convinces  me  that the Armenian letter at 0x0539, currently transliterated as
           "T`", would be better transliterated as "D", I may well make that change.

DESIGN GOALS AND CONSTRAINTS

       Text::Unidecode is meant to be a transliterator-of-last resort, to be used once you've decided  that  you
       can't  just  display  the  Unicode  data  as  is,  and  once you've decided you don't have a more clever,
       language-specific transliterator available.  It transliterates context-insensitively -- that is, a  given
       character  is  replaced  with the same US-ASCII (7-bit ASCII) character or characters, no matter what the
       surrounding character are.

       The main reason I'm making Text::Unidecode work with only context-insensitive substitution is  that  it's
       fast,  dumb,  and  straightforward enough to be feasable.  It doesn't tax my (quite limited) knowledge of
       world languages.  It doesn't require me writing a hundred lines of code to get the  Thai  syllabification
       right  (and  never  knowing  whether I've gotten it wrong, because I don't know Thai), or spending a year
       trying to get Text::Unidecode to use the ChaSen algorithm for Japanese, or trying to write heuristics for
       telling the difference between Japanese, Chinese, or Korean, so it knows how to transliterate  any  given
       Uni-Han  glyph.  And moreover, context-insensitive substitution is still mostly useful, but still clearly
       couldn't be mistaken for authoritative.

       Text::Unidecode is an example of the 80/20 rule in action -- you get 80% of the usefulness using just 20%
       of a "real" solution.

       A "real" approach to transliteration  for  any  given  language  can  involve  such  increasingly  tricky
       contextual factors as these

       The previous / preceding character(s)
           What  a given symbol "X" means, could depend on whether it's followed by a consonant, or by vowel, or
           by some diacritic character.

       Syllables
           A character "X" at end of a syllable could mean something different from when it's at  the  start  --
           which is especially problematic when the language involved doesn't explicitly mark where one syllable
           stops and the next starts.

       Parts of speech
           What  "X"  sounds  like  at  the end of a word, depends on whether that word is a noun, or a verb, or
           what.

       Meaning
           By semantic context, you can tell that this ideogram "X" means "shoe" (pronounced one  way)  and  not
           "time"  (pronounced  another),  and  that's  how  you know to transliterate it one way instead of the
           other.

       Origin of the word
           "X" means one thing in loanwords and/or placenames (and derivatives thereof), and another  in  native
           words.

       "It's just that way"
           "X"  normally  makes  the  /X/  sound, except for this list of seventy exceptions (and words based on
           them, sometimes indirectly).  Or: you never can tell which of the three ways to  pronounce  "X"  this
           word actually uses; you just have to know which it is, so keep a dictionary on hand!

       Language
           The  character  "X" is actually used in several different languages, and you have to figure out which
           you're looking at before you can determine how to transliterate it.

       Out of a desire to avoid being mired in any of these kinds of contextual factors, I chose to exclude  all
       of them and just stick with context-insensitive replacement.

TODO

       Things  that  need  tending  to are detailed in the TODO.txt file, included in this distribution.  Normal
       installs probably don't leave the TODO.txt lying  around,  but  if  nothing  else,  you  can  see  it  at
       http://search.cpan.org/search?dist=Text::Unidecode

MOTTO

       The Text::Unidecode motto is:

         It's better than nothing!

       ...in  both  meanings:  1)  seeing  the  output  of "unidecode(...)" is better than just having all font-
       unavailable Unicode characters replaced with "?"'s, or rendered as gibberish;  and  2)  it's  the  worst,
       i.e., there's nothing that Text::Unidecode's algorithm is better than.

CAVEATS

       If you get really implausible nonsense out of "unidecode(...)", make sure that the input data really is a
       utf8 string.  See "perlunicode" in perlunicode.

THANKS

       Thanks to Harald Tveit Alvestrand, Abhijit Menon-Sen, and Mark-Jason Dominus.

COPYRIGHT AND DISCLAIMERS

       Copyright (c) 2001 Sean M. Burke. All rights reserved.

       This library is free software; you can redistribute it and/or modify it under  the  same  terms  as  Perl
       itself.

       This  program  is  distributed in the hope that it will be useful, but without any warranty; without even
       the implied warranty of merchantability or fitness for a particular purpose.

       Much of Text::Unidecode's internal data is based on data from The Unicode Consortium,  with  which  I  am
       unafiliated.

AUTHOR

       Sean M. Burke "sburke@cpan.org"

perl v5.10.0                                       2001-07-14                               Text::Unidecode(3pm)