Provided by: libcatalyst-perl_5.90115-1_all bug

Name

       Catalyst::UTF8 - All About UTF8 and Catalyst Encoding

Description

       Starting in 5.90080 Catalyst will enable UTF8 encoding by default for text like body
       responses.  In addition we've made a ton of fixes around encoding and utf8 scattered
       throughout the codebase.  This document attempts to give an overview of the assumptions
       and practices that  Catalyst uses when dealing with UTF8 and encoding issues.  You should
       also review the Changes file, Catalyst::Delta and Catalyst::Upgrading for more.

       We attempt to describe all relevant processes, try to give some advice and explain where
       we may have been exceptional to respect our commitment to backwards compatibility.

UTF8 in Controller Actions

       Using UTF8 characters in your Controller classes and actions.

   Summary
       In this section we will review changes to how UTF8 characters can be used in controller
       actions, how it looks in the debugging screens (and your logs) as well as how you
       construct URL objects to actions with UTF8 paths (or using UTF8 args or captures).

   Unicode in Controllers and URLs
           package MyApp::Controller::Root;

           use utf8;
           use base 'Catalyst::Controller';

           sub heart_with_arg :Path('♥') Args(1)  {
             my ($self, $c, $arg) = @_;
           }

           sub base :Chained('/') CaptureArgs(0) {
             my ($self, $c) = @_;
           }

             sub capture :Chained('base') PathPart('♥') CaptureArgs(1) {
               my ($self, $c, $capture) = @_;
             }

               sub arg :Chained('capture') PathPart('♥') Args(1) {
                 my ($self, $c, $arg) = @_;
               }

   Discussion
       In the example controller above we have constructed two matchable URL routes:

           http://localhost/root/♥/{arg}
           http://localhost/base/♥/{capture}/♥/{arg}

       The first one is a classic Path type action and the second uses Chaining, and spans three
       actions in total.  As you can see, you can use unicode characters in your Path and
       PathPart attributes (remember to use the "utf8" pragma to allow these multibyte characters
       in your source).  The two constructed matchable routes would match the following incoming
       URLs:

           (heart_with_arg) -> http://localhost/root/%E2%99%A5/{arg}
           (base/capture/arg) -> http://localhost/base/%E2%99%A5/{capture}/%E2%99%A5/{arg}

       That path path "%E2%99%A5" is url encoded unicode (assuming you are hitting this with a
       reasonably modern browser).  Its basically what goes over HTTP when your type a browser
       location that has the unicode 'heart' in it.  However we will use the unicode symbol in
       your debugging messages:

           [debug] Loaded Path actions:
           .-------------------------------------+--------------------------------------.
           | Path                                | Private                              |
           +-------------------------------------+--------------------------------------+
           | /root/♥/*                          | /root/heart_with_arg                  |
           '-------------------------------------+--------------------------------------'

           [debug] Loaded Chained actions:
           .-------------------------------------+--------------------------------------.
           | Path Spec                           | Private                              |
           +-------------------------------------+--------------------------------------+
           | /base/♥/*/♥/*                       | /root/base (0)                       |
           |                                     | -> /root/capture (1)                 |
           |                                     | => /root/arg                         |
           '-------------------------------------+--------------------------------------'

       And if the requested URL uses unicode characters in your captures or args (such as
       "http://localhost:/base/♥/♥/♥/♥") you should see the arguments and captures as their
       unicode characters as well:

           [debug] Arguments are "♥"
           [debug] "GET" request for "base/♥/♥/♥/♥" from "127.0.0.1"
           .------------------------------------------------------------+-----------.
           | Action                                                     | Time      |
           +------------------------------------------------------------+-----------+
           | /root/base                                                 | 0.000080s |
           | /root/capture                                              | 0.000075s |
           | /root/arg                                                  | 0.000755s |
           '------------------------------------------------------------+-----------'

       Again, remember that we are display the unicode character and using it to match actions
       containing such multibyte characters BUT over HTTP you are getting these as URL encoded
       bytes.  For example if you looked at the PSGI $env value for "REQUEST_URI" you would see
       (for the above request)

           REQUEST_URI => "/base/%E2%99%A5/%E2%99%A5/%E2%99%A5/%E2%99%A5"

       So on the incoming request we decode so that we can match and display unicode characters
       (after decoding the URL encoding).  This makes it straightforward to use these types of
       multibyte characters in your actions and see them incoming in captures and arguments.
       Please keep this in might if you are doing for example regular expression matching, length
       determination or other string comparisons, you will need to try these incoming variables
       as though UTF8 strings.  For example in the following action:

               sub arg :Chained('capture') PathPart('♥') Args(1) {
                 my ($self, $c, $arg) = @_;
               }

       when $arg is "♥" you should expect "length($arg)" to be 1 since it is indeed one character
       although it will take more than one byte to store.

   UTF8 in constructing URLs via $c->uri_for
       For the reverse (constructing meaningful URLs to actions that contain multibyte characters
       in their paths or path parts, or when you want to include such characters in your captures
       or arguments) Catalyst will do the right thing (again just remember to use the "utf8"
       pragma).

           use utf8;
           my $url = $c->uri_for( $c->controller('Root')->action_for('arg'), ['♥','♥']);

       When you stringify this object (for use in a template, for example) it will automatically
       do the right thing regarding utf8 encoding and url encoding.

           http://localhost/base/%E2%99%A5/%E2%99%A5/%E2%99%A5/%E2%99%A5

       Since again what you want is a properly url encoded version of this.  In this case your
       string length will reflect URL encoded bytes, not the character length.  Ultimately what
       you want to send over the wire via HTTP needs to be bytes.

UTF8 in GET Query and Form POST

       What Catalyst does with UTF8 in your GET and classic HTML Form POST

   UTF8 in URL query and keywords
       The same rules that we find in URL paths also cover URL query parts.  That is if one types
       a URL like this into the browser

               http://localhost/example?♥=♥♥

       When this goes 'over the wire' to your application server its going to be as percent
       encoded bytes:

               http://localhost/example?%E2%99%A5=%E2%99%A5%E2%99%A5

       When Catalyst encounters this we decode the percent encoding and the utf8 so that we can
       properly display this information (such as in the debugging logs or in a response.)

               [debug] Query Parameters are:
               .-------------------------------------+--------------------------------------.
               | Parameter                           | Value                                |
               +-------------------------------------+--------------------------------------+
               | ♥                                   | ♥♥                                   |
               '-------------------------------------+--------------------------------------'

       All the values and keys that are part of $c->req->query_parameters will be utf8 decoded.
       So you should not need to do anything special to take those values/keys and send them to
       the body response (since as we will see later Catalyst will do all the necessary encoding
       for you).

       Again, remember that values of your parameters are now decode into Unicode strings.  so
       for example you'd expect the result of length to reflect the character length not the byte
       length.

       Just like with arguments and captures, you can use utf8 literals (or utf8 strings) in
       $c->uri_for:

               use utf8;
               my $url = $c->uri_for( $c->controller('Root')->action_for('example'), {'♥' => '♥♥'});

       When you stringify this object (for use in a template, for example) it will automatically
       do the right thing regarding utf8 encoding and url encoding.

               http://localhost/example?%E2%99%A5=%E2%99%A5%E2%99%A5

       Since again what you want is a properly url encoded version of this.  Ultimately what you
       want to send over the wire via HTTP needs to be bytes (not unicode characters).

       Remember if you use any utf8 literals in your source code, you should use the "use utf8"
       pragma.

       NOTE: Assuming UTF-8 in your query parameters and keywords may be an issue if you have
       legacy code where you created URL in templates manually and used an encoding other than
       UTF-8.  In these cases you may find versions of Catalyst after 5.90080+ will incorrectly
       decode.  For backwards compatibility we offer three configurations settings, here
       described in order of precedence:

       "do_not_decode_query"

       If true, then do not try to character decode any wide characters in your request URL query
       or keywords.  You will need to handle this manually in your action code (although if you
       choose this setting, chances are you already do this).

       "default_query_encoding"

       This setting allows one to specify a fixed value for how to decode your query, instead of
       using the default, UTF-8.

       "decode_query_using_global_encoding"

       If this is true we decode using whatever you set "encoding" to.

   UTF8 in Form POST
       In general most modern browsers will follow the specification, which says that POSTed form
       fields should be encoded in the same way that the document was served with.  That means
       that if you are using modern Catalyst and serving UTF8 encoded responses, a browser is
       supposed to notice that and encode the form POSTs accordingly.

       As a result since Catalyst now serves UTF8 encoded responses by default, this means that
       you can mostly rely on incoming form POSTs to be so encoded.  Catalyst will make this
       assumption and decode accordingly (unless you explicitly turn off encoding...)  If you are
       running Catalyst in developer debug, then you will see the correct unicode characters in
       the debug output.  For example if you generate a POST request:

               use Catalyst::Test 'MyApp';
               use utf8;

               my $res = request POST "/example/posted", ['♥'=>'♥', '♥♥'=>'♥'];

       Running in CATALYST_DEBUG=1 mode you should see output like this:

           [debug] Body Parameters are:
           .-------------------------------------+--------------------------------------.
           | Parameter                           | Value                                |
           +-------------------------------------+--------------------------------------+
           | ♥                                   | ♥                                    |
           | ♥♥                                  | ♥                                    |
           '-------------------------------------+--------------------------------------'

       And if you had a controller like this:

               package MyApp::Controller::Example;

               use base 'Catalyst::Controller';

               sub posted :POST Local {
                       my ($self, $c) = @_;
                       $c->res->content_type('text/plain');
                       $c->res->body("hearts => ${\$c->req->post_parameters->{♥}}");
               }

       The following test case would be true:

               use Encode 2.21 'decode_utf8';
               is decode_utf8($req->content), 'hearts => ♥';

       In this case we decode so that we can print and compare strings with multibyte characters.

       NOTE  In some cases some browsers may not follow the specification and set the form POST
       encoding based on the server response.  Catalyst itself doesn't attempt any workarounds,
       but one common approach is to use a hidden form field with a UTF8 value (You might be
       familiar with this from how Ruby on Rails has HTML form helpers that do that
       automatically).  In that case some browsers will send UTF8 encoded if it notices the
       hidden input field contains such a character.  Also, you can add an HTML attribute to your
       form tag which many modern browsers will respect to set the encoding
       (accept-charset="utf-8").  And lastly there are some javascript based tricks and
       workarounds for even more odd cases (just search the web for this will return a number of
       approaches.  Hopefully as more compliant browsers become popular these edge cases will
       fade.

       NOTE  It is possible for a form POST multipart response (normally a file upload) to
       contain inline content with mixed content character sets and encoding.  For example one
       might create a POST like this:

           use utf8;
           use HTTP::Request::Common;

           my $utf8 = 'test ♥';
           my $shiftjs = 'test テスト';
           my $req = POST '/root/echo_arg',
               Content_Type => 'form-data',
                 Content =>  [
                   arg0 => 'helloworld',
                   Encode::encode('UTF-8','♥') => Encode::encode('UTF-8','♥♥'),
                   arg1 => [
                     undef, '',
                     'Content-Type' =>'text/plain; charset=UTF-8',
                     'Content' => Encode::encode('UTF-8', $utf8)],
                   arg2 => [
                     undef, '',
                     'Content-Type' =>'text/plain; charset=SHIFT_JIS',
                     'Content' => Encode::encode('SHIFT_JIS', $shiftjs)],
                   arg2 => [
                     undef, '',
                     'Content-Type' =>'text/plain; charset=SHIFT_JIS',
                     'Content' => Encode::encode('SHIFT_JIS', $shiftjs)],
                 ];

       In this case we've created a POST request but each part specifies its own content
       character set (and setting a content encoding would also be possible).  Generally one
       would not run into this situation in a web browser context but for completeness sake
       Catalyst will notice if a multipart POST contains parts with complex or extended header
       information.  In these cases we will try to inspect the meta data and do the right thing
       (in the above case we'd use SHIFT_JIS to decode, not UTF-8).  However if after inspecting
       the headers we cannot figure out how to decode the data, in those cases it will not
       attempt to apply decoding to the form values.  Instead the part will be represented as an
       instance of an object Catalyst::Request::PartData which will contain all the header
       information needed for you to perform custom parser of the data.

       Ideally we'd fix Catalyst to be smarter about decoding so please submit your cases of this
       so we can add intelligence to the parser and find a way to extract a valid value out of
       it.

UTF8 Encoding in Body Response

       When does Catalyst encode your response body and what rules does it use to determine when
       that is needed.

   Summary
               use utf8;
               use warnings;
               use strict;

               package MyApp::Controller::Root;

               use base 'Catalyst::Controller';
               use File::Spec;

               sub scalar_body :Local {
                       my ($self, $c) = @_;
                       $c->response->content_type('text/html');
                       $c->response->body("<p>This is scalar_body action ♥</p>");
               }

               sub stream_write :Local {
                       my ($self, $c) = @_;
                       $c->response->content_type('text/html');
                       $c->response->write("<p>This is stream_write action ♥</p>");
               }

               sub stream_write_fh :Local {
                       my ($self, $c) = @_;
                       $c->response->content_type('text/html');

                       my $writer = $c->res->write_fh;
                       $writer->write_encoded('<p>This is stream_write_fh action ♥</p>');
                       $writer->close;
               }

               sub stream_body_fh :Local {
                       my ($self, $c) = @_;
                       my $path = File::Spec->catfile('t', 'utf8.txt');
                       open(my $fh, '<', $path) || die "trouble: $!";
                       $c->response->content_type('text/html');
                       $c->response->body($fh);
               }

   Discussion
       Beginning with Catalyst version 5.90080 You no longer need to set the encoding
       configuration (although doing so won't hurt anything).

       Currently we only encode if the content type is one of the types which generally expects a
       UTF8 encoding.  This is determined by the following regular expression:

           our $DEFAULT_ENCODE_CONTENT_TYPE_MATCH = qr{text|xml$|javascript$};
           $c->response->content_type =~ /$DEFAULT_ENCODE_CONTENT_TYPE_MATCH/

       This is a global variable in Catalyst::Response which is stored in the
       "encodable_content_type" attribute of $c->response.  You may currently alter this directly
       on the response or globally.  In the future we may offer a configuration setting for this.

       This would match content-types like the following (examples)

           text/plain
           text/html
           text/xml
           application/javascript
           application/xml
           application/vnd.user+xml

       You should set your content type prior to header finalization if you want Catalyst to
       encode.

       NOTE We do not attempt to encode "application/json" since the two most commonly used
       approaches (Catalyst::View::JSON and Catalyst::Action::REST) have already configured their
       JSON encoders to produce properly encoding UTF8 responses.  If you are rolling your own
       JSON encoding, you may need to set the encoder to do the right thing (or override the
       global regular expression to include the JSON media type).

   Encoding with Scalar Body
       Catalyst supports several methods of supplying your response with body content.  The first
       and currently most common is to set the Catalyst::Response ->body with a scalar string (
       as in the example):

               use utf8;

               sub scalar_body :Local {
                       my ($self, $c) = @_;
                       $c->response->content_type('text/html');
                       $c->response->body("<p>This is scalar_body action ♥</p>");
               }

       In general you should need to do nothing else since Catalyst will automatically encode
       this string during body finalization.  The only matter to watch out for is to make sure
       the string has not already been encoded, as this will result in double encoding errors.

       NOTE pay attention to the content-type setting in the example.  Catalyst inspects that
       content type carefully to determine if the body needs encoding).

       NOTE If you set the character set of the response Catalyst will skip encoding IF the
       character set is set to something that doesn't match $c->encoding->mime_name. We will
       assume if you are setting an alternative character set, that means you want to handle the
       encoding yourself.  However it might be easier to set $c->encoding for a given response
       cycle since you can override this for a given response.  For example here's how to
       override the default encoding and set the correct character set in the response:

           sub override_encoding :Local {
             my ($self, $c) = @_;
             $c->res->content_type('text/plain');
             $c->encoding(Encode::find_encoding('Shift_JIS'));
             $c->response->body("テスト");
           }

       This will use the alternative encoding for a single response.

       NOTE If you manually set the content-type character set to whatever
       $c->encoding->mime_name is set to, we STILL encode, rather than assume your manual setting
       is a flag to override.  This is done to support backward compatible assumptions (in
       particular Catalyst::View::TT has set a utf-8 character set in its default content-type
       for ages, even though it does not itself do any encoding on the body response).  If you
       are going to handle encoding manually you may set $c->clear_encoding for a single request
       response cycle, or as in the above example set an alternative encoding.

   Encoding with streaming type responses
       Catalyst offers two approaches to streaming your body response.  Again, you must remember
       to set your content type prior to streaming, since invoking a streaming response will
       automatically finalize and send your HTTP headers (and your content type MUST be one that
       matches the regular expression given above.)

       Also, if you are going to override $c->encoding (or invoke $c->clear_encoding), you should
       do that before anything else!

       The first streaming method is to use the "write" method on the response object.  This
       method allows 'inlined' streaming and is generally used with blocking style servers.

               sub stream_write :Local {
                       my ($self, $c) = @_;
                       $c->response->content_type('text/html');
                       $c->response->write("<p>This is stream_write action ♥</p>");
               }

       You may call the "write" method as often as you need to finish streaming all your content.
       Catalyst will encode each line in turn as long as the content-type meets the 'encodable
       types' requirement and $c->encoding is set (which it is, as long as you did not change
       it).

       NOTE If you try to change the encoding after you start the stream, this will invoke an
       error response.  However since you've already started streaming this will not show up as
       an HTTP error status code, but rather error information in your body response and an error
       in your logs.

       NOTE If you use ->body AFTER using ->write (for example you may do this to write your HTML
       HEAD information as fast as possible) we expect the contents to body to be encoded as it
       normally would be if you never called ->write.  In general unless you are doing weird
       custom stuff with encoding this is likely to just already do the correct thing.

       The second way to stream a response is to get the response writer object and invoke
       methods on that directly:

               sub stream_write_fh :Local {
                       my ($self, $c) = @_;
                       $c->response->content_type('text/html');

                       my $writer = $c->res->write_fh;
                       $writer->write_encoded('<p>This is stream_write_fh action ♥</p>');
                       $writer->close;
               }

       This can be used just like the "write" method, but typically you request this object when
       you want to do a nonblocking style response since the writer object can be closed over or
       sent to a model that will invoke it in a non blocking manner.  For more on using the
       writer object for non blocking responses you should review the "Catalyst" documentation
       and also you can look at several articles from last years advent, in particular:

       <http://www.catalystframework.org/calendar/2013/10>,
       <http://www.catalystframework.org/calendar/2013/11>,
       <http://www.catalystframework.org/calendar/2013/12>,
       <http://www.catalystframework.org/calendar/2013/13>,
       <http://www.catalystframework.org/calendar/2013/14>.

       The main difference this year is that previously calling ->write_fh would return the
       actual Plack writer object that was supplied by your Plack application handler, whereas
       now we wrap that object in a lightweight decorator object that proxies the "write" and
       "close" methods and supplies an additional "write_encoded" method.  "write_encoded" does
       the exact same thing as "write" except that it will first encode the string when
       necessary.  In general if you are streaming encodable content such as HTML this is the
       method to use.  If you are streaming binary content, you should just use the "write"
       method (although if the content type is set correctly we would skip encoding anyway, but
       you may as well avoid the extra noop overhead).

       The last style of content response that Catalyst supports is setting the body to a
       filehandle like object.  In this case the object is passed down to the Plack application
       handler directly and currently we do nothing to set encoding.

               sub stream_body_fh :Local {
                       my ($self, $c) = @_;
                       my $path = File::Spec->catfile('t', 'utf8.txt');
                       open(my $fh, '<', $path) || die "trouble: $!";
                       $c->response->content_type('text/html');
                       $c->response->body($fh);
               }

       In this example we create a filehandle to a text file that contains UTF8 encoded
       characters. We pass this down without modification, which I think is correct since we
       don't want to double encode.  However this may change in a future development release so
       please be sure to double check the current docs and changelog.  Its possible a future
       release will require you to to set a encoding on the IO layer level so that we can be sure
       to properly encode at body finalization.  So this is still an edge case we are writing
       test examples for.  But for now if you are returning a filehandle like response, you are
       expected to make sure you are following the PSGI specification and return raw bytes.

   Override the Encoding on Context
       As already noted you may change the current encoding (or remove it) by setting an
       alternative encoding on the context;

           $c->encoding(Encode::find_encoding('Shift_JIS'));

       Please note that you can continue to change encoding UNTIL the headers have been
       finalized.  The last setting always wins.  Trying to change encoding after header
       finalization is an error.

   Setting the Content Encoding HTTP Header
       In some cases you may set a content encoding on your response.  For example if you are
       encoding your response with gzip.  In this case you are again on your own.  If we notice
       that the content encoding header is set when we hit finalization, we skip automatic
       encoding:

           use Encode;
           use Compress::Zlib;
           use utf8;

           sub gzipped :Local {
             my ($self, $c) = @_;

             $c->res->content_type('text/plain');
             $c->res->content_type_charset('UTF-8');
             $c->res->content_encoding('gzip');

             $c->response->body(
               Compress::Zlib::memGzip(
                 Encode::encode_utf8("manual_1 ♥")));
           }

       If you are using Catalyst::Plugin::Compress you need to upgrade to the most recent version
       in order to be compatible with changes introduced in Catalyst 5.90080.  Other plugins may
       require updates (please open bugs if you find them).

       NOTE Content encoding may be set to 'identify' and we will still perform automatic
       encoding if the content type is encodable and an encoding is present for the context.

   Using Common Views
       The following common views have been updated so that their tests pass with default UTF8
       encoding for Catalyst:

       Catalyst::View::TT, Catalyst::View::Mason, Catalyst::View::HTML::Mason,
       Catalyst::View::Xslate

       See Catalyst::Upgrading for additional information on Catalyst extensions that require
       upgrades.

       In generally for the common views you should not need to do anything special.  If your
       actual template files contain UTF8 literals you should set configuration on your View to
       enable that.  For example in TT, if your template has actual UTF8 character in it you
       should do the following:

           MyApp::View::TT->config(ENCODING => 'utf-8');

       However Catalyst::View::Xslate wants to do the UTF8 encoding for you (We assume that the
       authors of that view did this as a workaround to the fact that until now encoding was not
       core to Catalyst.  So if you use that view, you either need to tell it to not encode, or
       you need to turn off encoding for Catalyst.

           MyApp::View::Xslate->config(encode_body => 0);

       or

           MyApp->config(encoding=>undef);

       Preference is to disable it in the View.

       Other views may be similar.  You should review View documentation and test during
       upgrading.  We tried to make sure most common views worked properly and noted all
       workaround but if we missed something please alert the development team (instead of
       introducing a local hack into your application that will mean nobody will ever upgrade
       it...).

   Setting the response from an external PSGI application.
       Catalyst::Response allows one to set the response from an external PSGI application.  If
       you do this, and that external application sets a character set on the content-type, we
       "clear_encoding" for the rest of the response.  This is done to prevent double encoding.

       NOTE Even if the character set of the content type is the same as the encoding set in
       $c->encoding, we still skip encoding.  This is a regrettable difference from the general
       rule outlined above, where if the current character set is the same as the current
       encoding, we encode anyway.  Nevertheless I think this is the correct behavior since the
       earlier rule exists only to support backward compatibility with Catalyst::View::TT.

       In general if you want Catalyst to handle encoding, you should avoid setting the content
       type character set since Catalyst will do so automatically based on the requested response
       encoding.  Its best to request alternative encodings by setting $c->encoding and if you
       really want manual control of encoding you should always $c->clear_encoding so that
       programmers that come after you are very clear as to your intentions.

   Disabling default UTF8 encoding
       You may encounter issues with your legacy code running under default UTF8 body encoding.
       If so you can disable this with the following configurations setting:

               MyApp->config(encoding=>undef);

       Where "MyApp" is your Catalyst subclass.

       If you do not wish to disable all the Catalyst encoding features, you may disable specific
       features via two additional configuration options:  'skip_body_param_unicode_decoding' and
       'skip_complex_post_part_handling'.  The first will skip any attempt to decode POST
       parameters in the creating of body parameters and the second will skip creation of
       instances of Catalyst::Request::PartData in the case that the multipart form upload
       contains parts with a mix of content character sets.

       If you believe you have discovered a bug in UTF8 body encoding, I strongly encourage you
       to report it (and not try to hack a workaround in your local code).  We also recommend
       that you regard such a workaround as a temporary solution.  It is ideal if Catalyst
       extension authors can start to count on Catalyst doing the right thing for encoding.

Conclusion

       This document has attempted to be a complete review of how UTF8 and encoding works in the
       current version of Catalyst and also to document known issues, gotchas and backward
       compatible hacks.  Please report issues to the development team.

Author

       John Napiorkowski jjnapiork@cpan.org <email:jjnapiork@cpan.org>