Ubuntu Manpage: Text::Unidecode -- plain ASCII transliterations of Unicode text

Provided by: libtext-unidecode-perl_1.27-1_all

NAME

       Text::Unidecode -- plain ASCII transliterations of Unicode text

SYNOPSIS

         use utf8;
         use Text::Unidecode;
         print unidecode(
           "XX\n"
           # Chinese characters for Beijing (U+5317 U+4EB0)
         );

         # That prints: Bei Jing

DESCRIPTION

       It often happens that you have non-Roman text data in Unicode, but you can't display it-- usually because
       you're trying to show it to a user via an application that doesn't support Unicode, or because the fonts
       you need aren't accessible.  You could represent the Unicode characters as "???????" or
       "\15BA\15A0\1610...", but that's nearly useless to the user who actually wants to read what the text
       says.

       What Text::Unidecode provides is a function, "unidecode(...)" that takes Unicode data and tries to
       represent it in US-ASCII characters (i.e., the universally displayable characters between 0x00 and 0x7F).
       The representation is almost always an attempt at transliteration-- i.e., conveying, in Roman letters,
       the pronunciation expressed by the text in some other writing system.  (See the example in the synopsis.)

       NOTE:

       To make sure your perldoc/Pod viewing setup for viewing this page is working: The six-letter word
       "resume" should look like "resume" with an "/" accent on each "e".

       For further tests, and help if that doesn't work, see below, "A POD ENCODING TEST".

DESIGN PHILOSOPHY

Unidecode's ability to transliterate from a given language is limited by two factors:

• The amount and quality of data in the written form of the original language

So if you have Hebrew data that has no vowel points in it, then Unidecode cannot guess what vowels
should appear in a pronunciation. S f y hv n vwls n th npt, y wn't gt ny vwls n th tpt. (This is a
specific application of the general principle of "Garbage In, Garbage Out".)

• Basic limitations in the Unidecode design

Writing a real and clever transliteration algorithm for any single language usually requires a lot of
time, and at least a passable knowledge of the language involved. But Unicode text can convey more
languages than I could possibly learn (much less create a transliterator for) in the entire rest of
my lifetime. So I put a cap on how intelligent Unidecode could be, by insisting that it support only
context-insensitive transliteration. That means missing the finer details of any given writing
system, while still hopefully being useful.

Unidecode, in other words, is quick and dirty. Sometimes the output is not so dirty at all: Russian and
Greek seem to work passably; and while Thaana (Divehi, AKA Maldivian) is a definitely non-Western writing
system, setting up a mapping from it to Roman letters seems to work pretty well. But sometimes the
output is very dirty: Unidecode does quite badly on Japanese and Thai.

If you want a smarter transliteration for a particular language than Unidecode provides, then you should
look for (or write) a transliteration algorithm specific to that language, and apply it instead of (or at
least before) applying Unidecode.

In other words, Unidecode's approach is broad (knowing about dozens of writing systems), but shallow (not
being meticulous about any of them).

FUNCTIONS

       Text::Unidecode provides one function, "unidecode(...)", which is exported by default.  It can be used in
       a variety of calling contexts:

       "$out = unidecode( $in );" # scalar context
           This returns a copy of $in, transliterated.

       "$out = unidecode( @in );" # scalar context
           This is the same as "$out = unidecode(join "", @in);"

       "@out = unidecode( @in );" # list context
           This  returns  a  list consisting of copies of @in, each transliterated.  This is the same as "@out =
           map scalar(unidecode($_)), @in;"

       "unidecode( @items );" # void context
       "unidecode( @bar, $foo, @baz );" # void context
           Each item on input is replaced with its transliteration.  This is the same as "for(@bar, $foo,  @baz)
           { $_ = unidecode($_) }"

       You  should  make  a  minimum  of  assumptions about the output of "unidecode(...)".  For example, if you
       assume an all-alphabetic (Unicode) string  passed  to  "unidecode(...)"  will  return  an  all-alphabetic
       string,  you're  wrong--  some  alphabetic  Unicode  characters  are transliterated as strings containing
       punctuation (e.g., the Armenian letter "X" (U+0539), currently transliterates as "T`" (capital-T  then  a
       backtick).

       However, these are the assumptions you can make:

       •   Each character 0x0000 - 0x007F transliterates as itself.  That is, "unidecode(...)" is 7-bit pure.

       •   The  output  of  "unidecode(...)"  always consists entirely of US-ASCII characters-- i.e., characters
           0x0000 - 0x007F.

       •   All Unicode characters translate to a sequence of (any number of) characters that are newline  ("\n")
           or  in  the  range  0x0020-0x007E.   That is, no Unicode character translates to "\x01", for example.
           (Although if you have a "\x01" on input, you'll get a "\x01" in output.)

       •   Yes, some transliterations produce a "\n" but it's just a few, and only with good reason.  Note  that
           the value of newline ("\n") varies from platform to platform-- see perlport.

       •   Some Unicode characters may transliterate to nothing (i.e., empty string).

       •   Very  many  Unicode  characters  transliterate  to multi-character sequences.  E.g., Unihan character
           U+5317, "X", transliterates as the four-character string "Bei ".

       •   Within these constraints, I may change the transliteration of characters  in  future  versions.   For
           example, if someone convinces me that that the Armenian letter "X", currently transliterated as "T`",
           would be better transliterated as "D", I may well make that change.

       •   Unfortunately,  there are many characters that Unidecode doesn't know a transliteration for.  This is
           generally because the character has been added since I last revised the Unidecode data  tables.   I'm
           always catching up!

DESIGN GOALS AND CONSTRAINTS

       Text::Unidecode  is  meant to be a transliterator of last resort, to be used once you've decided that you
       can't just display the Unicode data as is, and  once  you've  decided  you  don't  have  a  more  clever,
       language-specific transliterator available, or once you've already applied smarter algorithms or mappings
       that you prefer and you now just want Unidecode to do cleanup.

       Unidecode transliterates context-insensitively-- that is, a given character is replaced with the same US-
       ASCII (7-bit ASCII) character or characters, no matter what the surrounding characters are.

       The  main  reason I'm making Text::Unidecode work with only context-insensitive substitution is that it's
       fast, dumb, and straightforward enough to be feasible.  It doesn't tax my (quite  limited)  knowledge  of
       world  languages.   It doesn't require me writing a hundred lines of code to get the Thai syllabification
       right (and never knowing whether I've gotten it wrong, because I don't know Thai),  or  spending  a  year
       trying to get Text::Unidecode to use the ChaSen algorithm for Japanese, or trying to write heuristics for
       telling  the  difference between Japanese, Chinese, or Korean, so it knows how to transliterate any given
       Uni-Han glyph.  And moreover, context-insensitive substitution is still mostly useful, but still  clearly
       couldn't be mistaken for authoritative.

       Text::Unidecode  is an example of the 80/20 rule in action-- you get 80% of the usefulness using just 20%
       of a "real" solution.

       A "real" approach to transliteration  for  any  given  language  can  involve  such  increasingly  tricky
       contextual factors as these:

       The previous / preceding character(s)
           What  a given symbol "X" means, could depend on whether it's followed by a consonant, or by vowel, or
           by some diacritic character.

       Syllables
           A character "X" at end of a syllable could mean something different from when  it's  at  the  start--
           which is especially problematic when the language involved doesn't explicitly mark where one syllable
           stops and the next starts.

       Parts of speech
           What  "X"  sounds  like  at  the end of a word, depends on whether that word is a noun, or a verb, or
           what.

       Meaning
           By semantic context, you can tell that this ideogram "X" means "shoe" (pronounced one  way)  and  not
           "time"  (pronounced  another),  and  that's  how  you know to transliterate it one way instead of the
           other.

       Origin of the word
           "X" means one thing in loanwords and/or placenames (and derivatives thereof), and another  in  native
           words.

       "It's just that way"
           "X"  normally  makes  the  /X/  sound, except for this list of seventy exceptions (and words based on
           them, sometimes indirectly).  Or: you never can tell which of the three ways to  pronounce  "X"  this
           word actually uses; you just have to know which it is, so keep a dictionary on hand!

       Language
           The  character  "X" is actually used in several different languages, and you have to figure out which
           you're looking at before you can determine how to transliterate it.

       Out of a desire to avoid being mired in any of these kinds of contextual factors, I chose to exclude  all
       of them and just stick with context-insensitive replacement.

A POD ENCODING TEST

       •   "Brontee"  is  six  characters  that  should  look  like  "Bronte",  but  with double-dots on the "e"
           character.

       •   "Resume" is six characters that should look like "Resume", but  with  /-shaped  accents  on  the  "e"
           characters.

       •   "laeti" should be four letters long-- the second letter should not be two letters "ae", but should be
           a single letter that looks like an "a" entirely fused with an "e".

       •   "XXXXXX" is six Greek characters that should look kind of like: xpovoc

       •   "XXX XXX XXXXX" is three short Russian words that should look a lot like: KAK BAC 3OBYT

       •   "XX" is two Malayalam characters that should look like: sw

       •   "XXXX" is four Chinese characters that should look like: "Y=+-"

       •   "XXXXX" is five characters that should look like: Hello

       If  all of those come out right, your Pod viewing setup is working fine-- welcome to the 2010s!  If those
       are    full    of    garbage    characters,    consider    viewing    this    page     as     HTML     at
       <https://metacpan.org/pod/Text::Unidecode> or <http://search.cpan.org/perldoc?Text::Unidecode>

       If  things look mostly okay, but the Malayalam and/or the Chinese are just question-marks or empty boxes,
       it's probably just that your computer lacks the fonts for those.

TODO

       Lots:

       * Rebuild the Unihan database.  (Talk about hitting a moving target!)

       * Add tone-numbers for Mandarin hanzi?   Namely:  In  Unihan,  when  tone  marks  are  present  (like  in
       "kMandarin:  dao",  should I continue to transliterate as just "Dao", or should I put in the tone number:
       "Dao4"?  It would be pretty jarring to have digits appear where  previously  there  was  just  alphabetic
       stuff--  But  tone  numbers  make  Chinese  more  readable.   (I have a clever idea about doing this, for
       Unidecode v2 or v3.)

       * Start dealing with characters over U+FFFF.  Cuneiform! Emojis! Whatever!

       * Fill in all the little characters that have crept into the Misc Symbols Etc blocks.

       * More things that need tending to are detailed in the TODO.txt  file,  included  in  this  distribution.
       Normal  installs  probably  don't leave the TODO.txt lying around, but if nothing else, you can see it at
       <http://search.cpan.org/search?dist=Text::Unidecode>

MOTTO

       The Text::Unidecode motto is:

         It's better than nothing!

       ...in both meanings: 1) seeing the output of "unidecode(...)"  is  better  than  just  having  all  font-
       unavailable  Unicode  characters  replaced  with  "?"'s, or rendered as gibberish; and 2) it's the worst,
       i.e., there's nothing that Text::Unidecode's algorithm is  better  than.   All  sensible  transliteration
       algorithms (like for German, see below) are going to be smarter than Unidecode's.

WHEN YOU DON'T LIKE WHAT UNIDECODE DOES

       I will repeat the above, because some people miss it:

       Text::Unidecode  is  meant to be a transliterator of last resort, to be used once you've decided that you
       can't just display the Unicode data as is, and  once  you've  decided  you  don't  have  a  more  clever,
       language-specific  transliterator  available-- or once you've already applied a smarter algorithm and now
       just want Unidecode to do cleanup.

       In other words, when you don't like what Unidecode does, do it yourself.  Really, that's what  the  above
       says.  Here's how you would do this for German, for example:

       In  German,  there's  the  typographical  convention that an umlaut (the double-dots on: ae oe ue) can be
       written as an "-e", like with "Schoen" becoming  "Schoen".   But  Unidecode  doesn't  do  that--  I  have
       Unidecode simply drop the umlaut accent and give back "Schon".

       (I chose this not because I'm a big meanie, but because generally changing "ue" to "ue" is disastrous for
       all  text  that's not in German.  Finnish "Hyvaeae paeivaeae" would turn into "Hyvaeae paeivaeae".  And I
       discourage you from being  yet  another  German  who  emails  me,  trying  to  impel  me  to  consider  a
       typographical nicety of German to be more important than all other languages.)

       If  you  know  that  the  text  you're  handling is probably in German, and you want to apply the "umlaut
       becomes -e" rule, here's how to do it for yourself (and then use Unidecode as the fallback afterwards):

         use utf8;  # <-- probably necessary.

         our( %German_Characters ) = qw(
          Ae AE   ae ae
          Oe OE   oe oe
          Ue UE   ue ue
          ss ss
         );

         use Text::Unidecode qw(unidecode);

         sub german_to_ascii {
           my($german_text) = @_;

           $german_text =~
             s/([AeaeOeoeUeuess])/$German_Characters{$1}/g;

           # And now, as a *fallthrough*:
           $german_text = unidecode( $german_text );
           return $german_text;
         }

       To pick another example, here's something that's not about a  specific  language,  but  simply  having  a
       preference  that may or may not agree with Unidecode's (i.e., mine).  Consider the "X" symbol.  Unidecode
       changes that to "Y=".  If you want "X" as "YEN", then...

         use Text::Unidecode qw(unidecode);

         sub my_favorite_unidecode {
           my($text) = @_;

           $text =~ s/X/YEN/g;

           # ...and anything else you like, such as:
           $text =~ s/X/Euro/g;

           # And then, as a fallback,...
           $text = unidecode($text);

           return $text;
         }

       Then if you do:

         print my_favorite_unidecode("You just won X250,000 and X40,000!!!");

       ...you'll get:

         You just won YEN250,000 and Euro40,000!!!

       ...just as you like it.

       (By the way, the reason I don't have Unidecode just turn "X" into "YEN" is  that  the  same  symbol  also
       stands  for  yuan,  the  Chinese  currency.  A "Y=" is nicely, safely neutral as to whether we're talking
       about yen or yuan-- Japan, or China.)

       Another example: for hanzi/kanji/hanja, I have designed Unidecode to transliterate according to the value
       that that character has in Mandarin (otherwise Cantonese,...).  Some users have complained that  applying
       Unidecode to Japanese produces gibberish.

       To  make a long story short: transliterating from Japanese is difficult and it requires a lot of context-
       sensitivity.  If you have text that you're fairly sure is in Japanese, you're going  to  have  to  use  a
       Japanese-specific  algorithm  to  transliterate Japanese into ASCII.  (And then you can call Unidecode on
       the output from that-- it is useful for, for example, turning  XXXXXXXXX  characters  into  their  normal
       (ASCII) forms.

       (Note,  as  of October 2015: I have titanic but tentative plans for making the value of Unihan characters
       be something you could set parameters for at runtime, in changing the order of "Mandarin  else  Cantonese
       else..."  in the value retrieval.  Currently that preference list is hardwired on my end, at module-build
       time.  Other options I'm considering allowing for: whether the Mandarin and Cantonese values should  have
       the tone numbers on them; whether every Unihan value should have a terminal space; and maybe other clever
       stuff I haven't thought of yet.)

CAVEATS

       If you get really implausible nonsense out of "unidecode(...)", make sure that the input data really is a
       utf8 string.  See perlunicode and perlunitut.

THANKS

       Thanks  to (in only the sloppiest of sorta-chronological order): Jordan Lachler, Harald Tveit Alvestrand,
       Melissa Axelrod, Abhijit Menon-Sen, Mark-Jason Dominus, Joe  Johnston,  Conrad  Heiney,  fileformat.info,
       Philip  Newton, XX, TomaX Xolc, Mike Doherty, JT Smith and the MadMongers, Arden Ogg, Craig Copris, David
       Cusimano, Brendan Byrd, Hex Martin, and many other pals who have helped with  the  ideas  or  values  for
       Unidecode's  transliterations,  or  whose  help  has  been  in the secret F5 tornado that constitutes the
       internals of Unidecode's implementation.

       And thank you to the many people who have encouraged me to plug away at this project.  A decade  went  by
       before  I  had  any  idea that more than about four or five people were using or getting any value out of
       Unidecode.  I am told that in reality, that my figure was missing some zeroes on the end!

PORTS

       Some wonderful people have ported Unidecode to other languages!

       •   Python: <https://pypi.python.org/pypi/Unidecode>

       •   PHP: <https://github.com/silverstripe-labs/silverstripe-unidecode>

       •   Ruby: <http://www.rubydoc.info/gems/unidecode/1.0.0/frames>

       •   JavaScript: <https://www.npmjs.org/package/unidecode>

       •   Java: <https://github.com/xuender/unidecode>

       I can't vouch for the details of each port, but these are clever people, so I'm sure they did a fine job.

LICENSE

       Copyright (c) 2001, 2014, 2015 Sean M. Burke.

       Unidecode is distributed under the Perl Artistic License ( perlartistic ), namely:

       This library is free software; you can redistribute it and/or modify it under  the  same  terms  as  Perl
       itself.

       This  program  is  distributed in the hope that it will be useful, but without any warranty; without even
       the implied warranty of merchantability or fitness for a particular purpose.

DISCLAIMER

       Much of Text::Unidecode's internal data is based on data from The Unicode Consortium,  with  which  I  am
       unaffiliated.   A  good  deal  of  the internal data comes from suggestions that have been contributed by
       people other than myself.

       The views and conclusions contained in my software and documentation are my  own--  they  should  not  be
       interpreted  as  representing  official policies, either expressed or implied, of The Unicode Consortium;
       nor should they be interepreted as necessarily the views or conclusions of people who have contributed to
       this project.

AUTHOR

       Your pal, Sean M. Burke "sburke@cpan.org"

perl v5.20.2                                       2015-10-21                               Text::Unidecode(3pm)