Provided by: libencode-zapcp1252-perl_0.40-2_all

Name

       Encode::ZapCP1252 - Zap Windows Western Gremlins

Synopsis

         use Encode::ZapCP1252;

         # Zap or fix in-place.
         zap_cp1252 $latin1_text;
         fix_cp1252 $utf8_text;

         # Zap or fix copy.
         my $clean_latin1 = zap_cp1252 $latin1_text;
         my $fixed_utf8   = fix_cp1252 $utf8_text;

Description

       Have you ever been processing a Web form submission or a feed, assuming that the incoming
       text was encoded as specified in the Content-Type header or in the XML declaration, only
       to end up with a bunch of junk because someone pasted in content from Microsoft Word?
       Well, this is because Microsoft uses a superset of the Latin-1 encoding called "Windows
       Western" or "CP1252". If the specified encoding is Latin-1, most things will come out
       right, but a few things, such as curly quotes, em dashes, and ellipses, may not. The
       differences are well-known; you can see a nice chart documenting them on
       Wikipedia <https://en.wikipedia.org/wiki/Windows-1252>.

       Of course, that won't really help you. What will help you is to quit using Latin-1 and
       switch to UTF-8. Then you can just convert from CP1252 to UTF-8 without losing a thing,
       just like this:

         use Encode;
         $text = decode 'cp1252', $text, 1;
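
       As a slightly fuller sketch of that round trip, the following uses only the core Encode
       module. The sample bytes and variable names are made up for illustration, and LEAVE_SRC
       is added here (it is not in the one-liner above) so that the input variable survives the
       strict decode:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: CP1252 bytes for curly quotes (0x93, 0x94) and an em dash (0x97).
my $cp1252_bytes = "\x93Hello\x94 \x97 world";

# Decode to Perl characters. FB_CROAK dies on malformed input;
# LEAVE_SRC keeps decode from modifying $cp1252_bytes in place.
my $chars = decode('cp1252', $cp1252_bytes,
                   Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Encode the characters as UTF-8 bytes for storage or output.
my $utf8_bytes = encode('UTF-8', $chars);

# Show the resulting bytes in hex so the output stays plain ASCII.
print join(' ', map { sprintf '%02X', ord($_) } split(//, $utf8_bytes)), "\n";
```

       Each CP1252 gremlin becomes a three-byte UTF-8 sequence, so nothing is lost in the
       conversion.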

       But I know that there are those of you out there stuck with Latin-1 and who don't want any
       junk characters from Word users. That's where this module comes in. Its "zap_cp1252"
       function will zap those CP1252 gremlins for you, turning them into their appropriate ASCII
       approximations.

       Another case that can occasionally come up is when you're reading in text that
       claims to be UTF-8, but it still ends up with some CP1252 gremlins mixed in with properly
       encoded characters. I've seen examples of just this sort of thing when processing GMail
       messages and attempting to insert them into a UTF-8 database, as well as in some feeds
       processed by, say, Yahoo! Pipes. Doesn't work so well. For such cases, there's
       "fix_cp1252", which converts those CP1252 gremlins into their UTF-8 equivalents.

Usage

       This module exports two subroutines: "zap_cp1252()" and "fix_cp1252()", each of which
       accepts a single argument:

         zap_cp1252 $text;
         fix_cp1252 $text;

       When called in a void context, as in these examples, the "zap_cp1252()" and "fix_cp1252()"
       subroutines perform in-place conversions of any CP1252 gremlins into their appropriate
       ASCII approximations or UTF-8 equivalents, respectively. Note that because the conversion
       happens in place, the data to be converted cannot be a string constant; it must be a
       scalar variable.

       When called in a scalar or list context, on the other hand, a copy will be modified and
       returned. The original string will be unchanged:

         my $clean_latin1 = zap_cp1252 $latin1_text;
         my $fixed_utf8   = fix_cp1252 $utf8_text;

       In this case, even constant values can be processed. Either way, "undef"s will be ignored.
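
       The difference between the two calling contexts can be sketched as follows. The sample
       string and variable names are invented for illustration; the expected values in the
       comments follow from the conversion table in this document:

```perl
use strict;
use warnings;
use Encode::ZapCP1252;

# Hypothetical Latin-1 byte string polluted with CP1252 curly quotes and an ellipsis.
my $latin1_text = "\x93Quoted\x94 text\x85";

# Scalar context: a cleaned copy is returned and the original is left alone.
my $clean_copy = zap_cp1252 $latin1_text;
print "$clean_copy\n";                 # "Quoted" text...

# The same holds for fix_cp1252; on a byte string the gremlins become UTF-8 bytes.
my $fixed_copy = fix_cp1252 $latin1_text;

# Void context: the variable itself is modified.
zap_cp1252 $latin1_text;
print "$latin1_text\n";                # "Quoted" text...
```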

       In Perl 5.10 and higher, the functions may optionally be called with no arguments, in
       which case $_ will be converted, instead:

         zap_cp1252; # Modify $_ in-place.
         fix_cp1252; # Modify $_ in-place.
         my $zapped = zap_cp1252; # Copy $_ and return zapped
         my $fixed = fix_cp1252; # Copy $_ and return fixed
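
       One place the no-argument form is handy is a loop, where $_ is aliased to each element
       in turn. A minimal sketch, assuming Perl 5.10 or higher and made-up sample data:

```perl
use strict;
use warnings;
use Encode::ZapCP1252;

# Hypothetical lines of Latin-1 text containing CP1252 quotes and an en dash.
my @lines = ("\x93one\x94", "two \x96 three");

for (@lines) {
    zap_cp1252;    # no argument: operates on $_, which aliases the array element
}

print "$_\n" for @lines;
```

       Because the call is in void context, each array element is cleaned in place.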

       In Perl 5.8.8 and higher, the conversion will work even when the string is decoded to
       Perl's internal form (usually via "decode 'ISO-8859-1', $text") or the string is encoded
       (and thus simply processed by Perl as a series of bytes). The conversion will even work on
       a string that has not been decoded but has had its "utf8" flag flipped anyway (usually by
       an injudicious use of "Encode::_utf8_on()"). This is to enable the highest possible
       likelihood of removing those CP1252 gremlins no matter what kind of processing has already
       been executed on the string.
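
       As a rough illustration, the sketch below zaps the same text once as raw bytes and once
       after decoding it as ISO-8859-1. The sample string and variable names are assumptions
       made for the example:

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::ZapCP1252;

# Hypothetical input: "cafe" with a Latin-1 e-acute (0xE9), then CP1252 curly quotes.
my $bytes = "caf\xe9 \x93menu\x94";

# The same text decoded to Perl's internal form, as if it really were ISO-8859-1.
my $decoded = decode('ISO-8859-1', $bytes);

# Scalar context, so both originals are left untouched.
my $zapped_bytes   = zap_cp1252 $bytes;
my $zapped_decoded = zap_cp1252 $decoded;

print $zapped_bytes eq $zapped_decoded ? "same result\n" : "different results\n";
```

       In both cases the quotes are replaced and the e-acute, which is ordinary Latin-1, is
       left alone.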

       That said, although "fix_cp1252()" takes a conservative approach to replacing text in
       Unicode strings, it should be used as a very last option. Really, avoid that situation if
       you can.

Conversion Table

       Here's how the characters are converted to ASCII and UTF-8. The ASCII conversions are not
       perfect, but they should be good enough for general cleanup. If you want perfect, switch
       to UTF-8 and be done with it!

          Hex | Char  | ASCII | UTF-8 Name
         -----+-------+-------+-------------------------------------------
         0x80 |   €   |   e   | EURO SIGN
         0x82 |   ‚   |   ,   | SINGLE LOW-9 QUOTATION MARK
         0x83 |   ƒ   |   f   | LATIN SMALL LETTER F WITH HOOK
         0x84 |   „   |   ,,  | DOUBLE LOW-9 QUOTATION MARK
         0x85 |   …   |  ...  | HORIZONTAL ELLIPSIS
         0x86 |   †   |   +   | DAGGER
         0x87 |   ‡   |   ++  | DOUBLE DAGGER
         0x88 |   ˆ   |   ^   | MODIFIER LETTER CIRCUMFLEX ACCENT
         0x89 |   ‰   |   %   | PER MILLE SIGN
         0x8a |   Š   |   S   | LATIN CAPITAL LETTER S WITH CARON
         0x8b |   ‹   |   <   | SINGLE LEFT-POINTING ANGLE QUOTATION MARK
         0x8c |   Œ   |   OE  | LATIN CAPITAL LIGATURE OE
         0x8e |   Ž   |   Z   | LATIN CAPITAL LETTER Z WITH CARON
         0x91 |   ‘   |   '   | LEFT SINGLE QUOTATION MARK
         0x92 |   ’   |   '   | RIGHT SINGLE QUOTATION MARK
         0x93 |   “   |   "   | LEFT DOUBLE QUOTATION MARK
         0x94 |   ”   |   "   | RIGHT DOUBLE QUOTATION MARK
         0x95 |   •   |   *   | BULLET
         0x96 |   –   |   -   | EN DASH
         0x97 |   —   |   --  | EM DASH
         0x98 |   ˜   |   ~   | SMALL TILDE
         0x99 |   ™   |  (tm) | TRADE MARK SIGN
         0x9a |   š   |   s   | LATIN SMALL LETTER S WITH CARON
         0x9b |   ›   |   >   | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
         0x9c |   œ   |   oe  | LATIN SMALL LIGATURE OE
         0x9e |   ž   |   z   | LATIN SMALL LETTER Z WITH CARON
         0x9f |   Ÿ   |   Y   | LATIN CAPITAL LETTER Y WITH DIAERESIS
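
       To see a few rows of the table in action, a short sketch like the following should do.
       The %samples hash and its labels are hypothetical and exist only for this example:

```perl
use strict;
use warnings;
use Encode::ZapCP1252;

# A handful of CP1252 bytes picked from the table above, with made-up labels.
my %samples = (
    'euro sign'  => "\x80",
    'ellipsis'   => "\x85",
    'ligature'   => "\x8c",
    'em dash'    => "\x97",
    'trade mark' => "\x99",
);

# Zap a copy of each byte and remember the ASCII approximation.
my %zapped;
for my $name (sort keys %samples) {
    $zapped{$name} = zap_cp1252 $samples{$name};
    printf "%-10s => %s\n", $name, $zapped{$name};
}
```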

   Changing the Tables
       Don't like these conversions? You can modify them to your heart's content by accessing
       this module's internal conversion tables. For example, if you wanted "zap_cp1252()" to use
       an uppercase "E" for the euro sign, just do this:

         local $Encode::ZapCP1252::ascii_for{"\x80"} = 'E';

       Or if, for some reason, you wanted the UTF-8 equivalent for a bullet converted by
       "fix_cp1252()" to be a black square, you can assign the bytes (never a Unicode string)
       like so:

         local $Encode::ZapCP1252::utf8_for{"\x95"} = Encode::encode_utf8("\x{25A0}"); # BLACK SQUARE

       Just remember, without "local" this would be a global change. In that case, be careful if
       your code zaps CP1252 elsewhere. Of course, it shouldn't really be doing that. These
       functions are just for cleaning up messes in one spot in your code, not for making a
       fundamental part of your text handling. For that, use Encode.
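
       The scoping effect of "local" can be sketched like this. The sample string and the
       $inside and $outside variables are assumptions made for the example:

```perl
use strict;
use warnings;
use Encode::ZapCP1252;

# Hypothetical Latin-1 text with a CP1252 euro sign.
my $price = "\x80 10";

my $inside;
{
    # The override lasts only until the end of this block.
    local $Encode::ZapCP1252::ascii_for{"\x80"} = 'E';
    $inside = zap_cp1252 $price;
}

# Outside the block the default mapping is back in effect.
my $outside = zap_cp1252 $price;

print "inside: $inside, outside: $outside\n";
```

       Keeping the override inside a small block limits it to the one spot that needs it.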

See Also

       Encode
       Encoding::FixLatin
       Wikipedia: Windows-1252 <https://en.wikipedia.org/wiki/Windows-1252>

Support

       This module is stored in an open GitHub repository
       <https://github.com/theory/encode-zapcp1252/>. Feel free to fork and contribute!

       Please file bug reports via GitHub Issues
       <https://github.com/theory/encode-zapcp1252/issues/> or by sending mail to
       bug-Encode-CP1252@rt.cpan.org <mailto:bug-Encode-CP1252@rt.cpan.org>.

Author

       David E. Wheeler <david@justatheory.com>

Acknowledgments

       My thanks to Sean Burke for sending me his original method for converting CP1252 gremlins
       to more-or-less appropriate ASCII characters, and to Karl Williamson for more correct
       handling of Unicode strings.

Copyright and License

       Copyright (c) 2005-2020 David E. Wheeler. Some Rights Reserved.

       This module is free software; you can redistribute it and/or modify it under the same
       terms as Perl itself.