Ubuntu Manpage: Lingua::DE::ASCII - Perl extension to convert German umlauts to and from ASCII

Provided by: liblingua-de-ascii-perl_0.11-1.1_all

NAME

       Lingua::DE::ASCII - Perl extension to convert German umlauts to and from ASCII

SYNOPSIS

         use Lingua::DE::ASCII;
         print to_ascii("Umlaute wie ä,ö,ü,ß oder auch é usw. " .
                        "sind nicht im ASCII Format " .
                        "und werden deshalb umgeschrieben");
         print to_latin1("Dies muesste auch rueckwaerts funktionieren ma cherie");

DESCRIPTION

This module converts German texts to and from plain ASCII.

It has two methods, "to_ascii" and "to_latin1", which do exactly what their names say.

Please note that both methods take a single scalar as their argument, not a whole list.

to_ascii($string)
The "to_ascii" method is simple. It replaces each printable ANSI character (codes 160..255) with a
(hopefully) sensible ASCII representation, which may be more than one character. The ANSI characters with
codes 128..159 are not printable and are removed by default. The transliteration is defined by
the global %Lingua::DE::ASCII::ANSI_TO_ASCII_TRANSLITERATION variable. You can modify this variable if
you want to change the transliteration behaviour.
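
The table-driven idea described above can be sketched in a few lines of core Perl. This is only an
illustration, not the module's actual code: %table and to_ascii_sketch are hypothetical names, the table
covers just a handful of characters, and silently dropping unmapped characters is a choice of this sketch.

```perl
use strict;
use warnings;
use utf8;    # the source below contains literal non-ASCII characters

# Hypothetical, deliberately tiny transliteration table (not the
# module's %ANSI_TO_ASCII_TRANSLITERATION, whose contents are not shown here).
my %table = (
    'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
    'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
    'ß' => 'ss', 'é' => 'e',  '©' => '(C)',
);

sub to_ascii_sketch {
    my ($text) = @_;
    # Remove the non-printable codes 128..159.
    $text =~ s/[\x{80}-\x{9F}]//g;
    # Replace each character in 160..255 via the table; characters
    # missing from this small table are dropped (a choice of this sketch).
    $text =~ s{([\x{A0}-\x{FF}])}{ $table{$1} // '' }ge;
    return $text;
}

print to_ascii_sketch("Grüße aus Köln"), "\n";
```

Keeping the mapping in a hash is what makes the behaviour adjustable: changing one entry changes the
output without touching the substitution code.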

to_latin1($string)
The "to_latin1" method is very complex (more than 700 lines of code). It translates 7-bit ASCII
representations back into a reasonable German ANSI representation. Mainly, it changes 'ae' to 'ä', 'oe'
to 'ö', 'ue' to 'ü' and 'ss' to 'ß'. It also changes some other character sequences, e.g. '(C)' to '©', and
in words like 'Crepe' it restores the proper spelling 'Crêpe'.

Of course, the method tries to change text only where it should. That explains the enormous complexity of
this method, as it tries to solve a hard linguistic problem with a bit of logic and many regular expressions
(see also BUGS if you are interested in known problems).
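
To see why a plain search-and-replace is not enough, consider the following core-Perl sketch. The function
naive_to_latin1 is hypothetical and not part of the module; it blindly rewrites every "ae", "oe", "ue" and
"ss", which damages words that legitimately contain those letter pairs.

```perl
use strict;
use warnings;
use utf8;                              # literal umlauts in the source
binmode STDOUT, ':encoding(UTF-8)';    # print them as UTF-8

# Hypothetical naive converter: replaces every digraph unconditionally.
sub naive_to_latin1 {
    my ($text) = @_;
    my %umlaut = (ae => 'ä', oe => 'ö', ue => 'ü', ss => 'ß');
    $text =~ s/(ae|oe|ue|ss)/$umlaut{$1}/g;
    return $text;
}

# "aktuell" and "Poet" contain "ue" and "oe" that are not umlauts.
for my $word (qw(muesste aktuell Poet)) {
    printf "%-8s -> %s\n", $word, naive_to_latin1($word);
}
```

The naive version turns "aktuell" into "aktüll" and "Poet" into "Pöt", which is exactly the kind of damage
the many special-case rules are meant to avoid.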

It's quicker to let "to_latin1" work on one big (even multiline) string than to call it many times with
small strings (such as single lines or words). The reason is that the method works with a lot of regular
expressions (nearly every line of code contains a regexp). Because Perl optimizes them well,
especially for long strings, you can gain a good speed advantage if you need it.
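
One way to follow this advice is to join the lines, convert once, and split the result again. The helper
below is a sketch under assumptions: convert_in_one_call is a hypothetical name, the lines are assumed to
contain no newline characters themselves, and the converter is assumed to leave newlines untouched. A
stand-in converter is used so that the example runs without the module installed.

```perl
use strict;
use warnings;

# Hypothetical helper: run a converter once over all lines instead of
# once per line. $convert is any code reference taking one scalar.
sub convert_in_one_call {
    my ($convert, @lines) = @_;
    my $joined = join "\n", @lines;
    my $result = $convert->($joined);
    # The -1 limit keeps trailing empty lines.
    return split /\n/, $result, -1;
}

# Stand-in converter (simply uppercases its argument) for demonstration.
my $stand_in = sub { return uc $_[0] };

my @lines = ("erste zeile", "zweite zeile");
print "$_\n" for convert_in_one_call($stand_in, @lines);
```

In real use, after "use Lingua::DE::ASCII;", the code reference \&to_latin1 would take the place of the
stand-in.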

At the moment you can't change the behaviour of the "to_latin1" method (e.g. switching from the new
German spelling to the old one), and I'm not sure whether I will make that possible. Please let me know if
you feel that it would be important or very convenient in some case.

EXPORT
to_ascii($string) to_latin1($string)

BUGS

This is only a stupid computer program faced with a very hard AI problem, so there will be some words
that will always be hard to translate back from ASCII to the Latin-1 encoding. A known example is the
difference between "Maß(einheit)" and "Masseentropie" or similar. Other examples are "flösse" and "Flöße",
"(Der Schornstein) ruße" and "Russe", "Geheimtuer(isch)" and "Geheimtür", "anzu-ecken" and "anzuecken",
or even a lone "ß" or "ss". It is also hard to find the right spelling for the prefixes
"miß-" or "miss-". When in doubt, I tried to use the more common word; where doubt remains, the program
is conservative, meaning that it prefers not to translate to an umlaut. The reason is that a text is
still readable with one "ae", "oe", "ue" or "ss" too many, but a wrong "ä", "ö", "ü" or "ß" can make it
very hard to read.
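
The root of the problem is that the ASCII form loses information: different words can collapse to the same
ASCII spelling. The following sketch groups a few words by their ASCII form. The function fold and the word
list are assumptions chosen for illustration and are not taken from the module.

```perl
use strict;
use warnings;
use utf8;                              # literal umlauts in the source
binmode STDOUT, ':encoding(UTF-8)';    # print them as UTF-8

# Hypothetical minimal ASCII folding, lowercase letters only.
sub fold {
    my ($word) = @_;
    my %map = ('ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss');
    $word =~ s/([äöüß])/$map{$1}/g;
    return $word;
}

# Group words by their ASCII form; any group with more than one
# member cannot be restored from ASCII without extra knowledge.
my %by_ascii;
push @{ $by_ascii{ fold($_) } }, $_
    for qw(Maßen Massen Geheimtür Geheimtuer Hund);

for my $ascii (sort keys %by_ascii) {
    my @words = @{ $by_ascii{$ascii} };
    print "$ascii <- @words\n" if @words > 1;
}
```

Every group that is printed marks an ASCII spelling for which the way back is a guess, which is why a
conservative default is a reasonable choice.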

I tested it with a huge list of German words, but please tell me if you find a bug.

This module is intended for ANSI-encoded text; this encoding differs from, for example, the Windows encoding.

Misspelled words cause the program to make many additional mistakes. When in doubt, it's better to write
using the new Rechtschreibung (the reformed German spelling).

The "to_latin1" method is not very quick (but quick enough to work interactively with text files of about
100 KB). It's programmed to handle as many exceptions as possible.

I avoided locale-dependent character handling (thus it should work on every computer), but the price is
that in some rare cases of words with multiple umlauts (like "Häkeltülle") some buggy conversions can
occur. Please tell me if you find such words.

The "to_latin1" method also knows enough to cope with some basic English, so that a few English words
don't confuse everything and you can also include some code snippets in your text. However, it is strongly
recommended to use American English instead of British English. In particular, many plural forms (ending in
"oes") are hard to handle, and I often decided not to implement an extra rule, as this is a
"Lingua::DE::*" module and not an English one.

TESTS

       The test scripts (run by e.g. "make test") take a long time, because I test against a huge
       German word list. Normally you can abort the test run if nothing has failed in the first few seconds.
       The tests also show a progress bar (either a Term::ProgressBar, if installed, or just simple
       text output), so that you can watch the progress :-)

       There are two major reasons why I included so many test words even in the CPAN release. On the one hand,
       I wanted to give you a chance to detect strange behaviour under uncommon circumstances. (For example, I
       haven't tested it on an operating system with a non-German locale, and I have also arranged for words
       to be tested under a random environment to uncover unexpected errors.) On the other hand, I wanted to
       give you a chance to find out for yourself whether a "to_latin1" result is a bug or a feature. (Just
       search the test files to see whether a strange-looking word is tested for and thus intended.)

       There is also a test with 1000 common English words (containing an ae, oe, ue or ss), as German
       nowadays is often mixed with many of them, and this module should not be confused by them.

AUTHOR

       Janek Schleicher, <bigj@kamelfreund.de>

POD ERRORS

       Hey! The above document had some coding errors, which are explained below:

       Around line 949:
           Non-ASCII character seen before =encoding in 'ä,ö,ü,ß'. Assuming ISO8859-1

perl v5.20.2                                       2003-09-02                                       ASCII(3perl)