
NAME
kitchen - kitchen 1.2.5
Author: Toshio Kuratomi
Date: 19 March 2011
Version: 1.0.x
We’ve all done it. In the process of writing a brand new application we’ve discovered that we need a
little bit of code that we’ve invented before. Perhaps it’s something to handle unicode text. Perhaps
it’s something to make a bit of python-2.5 code run on python-2.4. Whatever it is, it ends up being a
tiny bit of code that seems too small to worry about pushing into its own module so it sits there, a part
of your current project, waiting to be cut and pasted into your next project. And the next. And the
next. And since that little bitty bit of code proved so useful to you, it’s highly likely that it proved
useful to someone else as well. Useful enough that they’ve written it and copy and pasted it over and
over into each of their new projects.
Well, no longer! Kitchen aims to pull these small snippets of code into a few python modules which you
can import and use within your project. No more copy and paste! Now you can let someone else maintain
and release these small snippets so that you can get on with your life.
This package forms the core of Kitchen. It contains some useful modules for using newer python standard
library modules on older python versions, text manipulation, PEP 386 versioning, and initializing
gettext. With this package we’re trying to provide a few useful features that don’t have too many
dependencies outside of the python standard library. We’ll be releasing other modules that drop into the
kitchen namespace to add other features (possibly with larger deps) as time goes on.
REQUIREMENTS
We’ve tried to keep the core kitchen module’s requirements lightweight. At the moment kitchen only
requires
python 2.4 or later
WARNING:
Kitchen-1.1.0 was the last release that supported python-2.3.x
Soft Requirements
If found, these libraries will be used to make the implementation of some part of kitchen better in some
way. If they are not present, the API that they enable will still exist but may function in a different
manner.
chardet
Used in guess_encoding() and guess_encoding_to_xml() to help guess encoding of byte strings being
converted. If not present, unknown encodings will be converted as if they were latin1.
OTHER RECOMMENDED LIBRARIES
These libraries implement commonly used functionality that everyone seems to invent. Rather than
reinvent their wheel, I simply list the things that they do well for now. Perhaps if people can’t find
them normally, I’ll add them as requirements in setup.py or link them into kitchen’s namespace. For now,
I just mention them here:
bunch Bunch is a dictionary that supports attribute lookup as well as bracket notation for access.
Setting it apart from most homebrewed implementations is the bunchify() function, which will
descend nested structures of lists and dicts, transforming the dicts into Bunches.
hashlib
Python 2.5 and forward have a hashlib library that provides secure hash functions to python. If
you’re developing for python2.4 though, you can install the standalone hashlib library and have
access to the same functions.
iterutils
The python documentation for itertools has some examples of other nice iterable functions that can
be built from the itertools functions. This third-party module creates those recipes as a module.
ordereddict
Python 2.7 and forward have an OrderedDict that provides a dict whose items are ordered (and
indexable) as well as named.
unittest2
Python 2.7 has an updated unittest library with new functions not present in the python standard
library for Python 2.6 or less. If you want to use those new functions but need your testing
framework to be compatible with older Python the unittest2 library provides the update as an
external module.
nose If you want to use a test discovery tool instead of the unittest framework, nose provides a
simple-to-use way to do that.
LICENSE
This python module is distributed under the terms of the GNU Lesser General Public License Version 2 or
later.
NOTE:
Some parts of this module are licensed under terms less restrictive than the LGPLv2+. If you separate
these files from the work as a whole you are allowed to use them under the less restrictive licenses.
The following is a list of the files that are known:
Python 2 license
_subprocess.py, test_subprocess.py, defaultdict.py, test_defaultdict.py, _base64.py, and
test_base64.py
CONTENTS
Using kitchen to write good code
Kitchen’s functions won’t automatically make you a better programmer. You have to learn when and how to
use them as well. This section of the documentation is intended to show you some of the ways that you
can apply kitchen’s functions to problems that may have arisen in your life. The goal of this section is
to give you enough information to understand what the kitchen API can do for you and where in the
Kitchen API docs to look for something that can help you with your next issue. Along the way, you might
pick up the knack for identifying issues with your code before you publish it. And that will make you a
better coder.
Overcoming frustration: Correctly using unicode in python2
In python-2.x, there are two types that deal with text.
1. str is for strings of bytes. These are very similar in nature to how strings are handled in C.
2. unicode is for strings of unicode code points.
NOTE:
Just what the dickens is “Unicode”?
One mistake that people encountering this issue for the first time make is confusing the unicode type
and the encodings of unicode stored in the str type. In python, the unicode type stores an abstract
sequence of code points. Each code point represents a grapheme. By contrast, byte str stores a
sequence of bytes which can then be mapped to a sequence of code points. Each unicode encoding
(UTF-8, UTF-7, UTF-16, UTF-32, etc) maps different sequences of bytes to the unicode code points.
What does that mean to you as a programmer? When you’re dealing with text manipulations (finding the
number of characters in a string or cutting a string on word boundaries) you should be dealing with
unicode strings as they abstract characters in a manner that’s appropriate for thinking of them as a
sequence of letters that you will see on a page. When dealing with I/O, reading to and from the disk,
printing to a terminal, sending something over a network link, etc, you should be dealing with byte
str as those devices are going to need to deal with concrete implementations of what bytes represent
your abstract characters.
In the python2 world many APIs use these two classes interchangeably but there are several important APIs
where only one or the other will do the right thing. When you give the wrong type of string to an API
that wants the other type, you may end up with an exception being raised (UnicodeDecodeError or
UnicodeEncodeError). However, these exceptions aren’t always raised because python implicitly converts
between types… sometimes.
Frustration #1: Inconsistent Errors
Although converting when possible seems like the right thing to do, it’s actually the first source of
frustration. A programmer can test out their program with a string like: The quick brown fox jumped over
the lazy dog and not encounter any issues. But when they release their software into the wild, someone
enters the string: I sat down for coffee at the café and suddenly an exception is thrown. The reason?
The mechanism that converts between the two types is only able to deal with ASCII characters. Once you
throw non-ASCII characters into your strings, you have to start dealing with the conversion manually.
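You can watch this happen in a quick python2 session; the pure-ASCII concatenation implicitly converts,
while the one containing the bytes for é blows up:
>>> 'The quick brown fox' + u' jumped over the lazy dog'
u'The quick brown fox jumped over the lazy dog'
>>> 'I sat down for coffee at the caf\xc3\xa9' + u' and ordered a drink'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 32: ordinal not in range(128)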
So, if I manually convert everything to either byte str or unicode strings, will I be okay? The answer
is…. sometimes.
Frustration #2: Inconsistent APIs
The problem you run into when converting everything to byte str or unicode strings is that you’ll be
using someone else’s API quite often (this includes the APIs in the python standard library) and find
that the API will only accept byte str or only accept unicode strings. Or worse, that the code will
accept either when you’re dealing with strings that consist solely of ASCII but throw an error when you
give it a string that’s got non-ASCII characters. When you encounter these APIs you first need to
identify which type will work better and then you have to convert your values to the correct type for
that code. Thus the programmer that wants to proactively fix all unicode errors in their code needs to
do two things:
1. You must keep track of what type your sequences of text are. Does my_sentence contain unicode or str?
If you don’t know that then you’re going to be in for a world of hurt.
2. Anytime you call a function you need to evaluate whether that function will do the right thing with
str or unicode values. Sending the wrong value here will lead to a UnicodeError being thrown when the
string contains non-ASCII characters.
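A concrete illustration of why tracking the type matters (assuming a UTF-8 terminal): the same four
characters have different lengths depending on which type holds them, because the é occupies two bytes:
>>> len(u'café')
4
>>> len('café')
5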
NOTE:
There is one mitigating factor here. The python community has been standardizing on using unicode in
all its APIs. Although there are some APIs that you need to send byte str to in order to be safe
(including things as ubiquitous as print(), as we’ll see in the next section), it’s getting easier and
easier to use unicode strings with most APIs.
Frustration #3: Inconsistent treatment of output
Alright, since the python community is moving to using unicode strings everywhere, we might as well
convert everything to unicode strings and use that by default, right? Sounds good most of the time but
there’s at least one huge caveat to be aware of. Anytime you output text to the terminal or to a file,
the text has to be converted into a byte str. Python will try to implicitly convert from unicode to byte
str… but it will throw an exception if the bytes are non-ASCII:
>>> string = unicode(raw_input(), 'utf8')
café
>>> log = open('/var/tmp/debug.log', 'w')
>>> log.write(string)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
Okay, this is simple enough to solve: Just convert to a byte str and we’re all set:
>>> string = unicode(raw_input(), 'utf8')
café
>>> string_for_output = string.encode('utf8', 'replace')
>>> log = open('/var/tmp/debug.log', 'w')
>>> log.write(string_for_output)
>>>
So that was simple, right? Well… there’s one gotcha that makes things a bit harder to debug sometimes.
When you attempt to write non-ASCII unicode strings to a file-like object you get a traceback every time.
But what happens when you use print()? The terminal is a file-like object so it should raise an
exception right? The answer to that is…. sometimes:
$ python
>>> print u'café'
café
No exception. Okay, we’re fine then?
We are until someone does one of the following:
• Runs the script in a different locale:
$ LC_ALL=C python
>>> # Note: if you're using a good terminal program when running in the C locale
>>> # The terminal program will prevent you from entering non-ASCII characters
>>> # python will still recognize them if you use the codepoint instead:
>>> print u'caf\xe9'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
• Redirects output to a file:
$ cat test.py
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
print u'café'
$ ./test.py >t
Traceback (most recent call last):
File "./test.py", line 4, in <module>
print u'café'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
Okay, the locale thing is a pain but understandable: the C locale doesn’t understand any characters
outside of ASCII so naturally attempting to display those won’t work. Now why does redirecting to a file
cause problems? It’s because print() in python2 is treated specially. Whereas the other file-like
objects in python always convert to ASCII unless you set them up differently, using print() to output to
the terminal will use the user’s locale to convert before sending the output to the terminal. When
print() is not outputting to the terminal (being redirected to a file, for instance), print() decides
that it doesn’t know what locale to use for that file and so it tries to convert to ASCII instead.
So what does this mean for you, as a programmer? Unless you have the luxury of controlling how your
users use your code, you should always, always, always convert to a byte str before outputting strings to
the terminal or to a file. Python even provides you with a facility to do just this. If you know that
every unicode string you send to a particular file-like object (for instance, stdout) should be converted
to a particular encoding you can use a codecs.StreamWriter object to convert from a unicode string into a
byte str. In particular, codecs.getwriter() will return a StreamWriter class that will help you to wrap
a file-like object for output. Using our print() example:
$ cat test.py
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print u'café'
$ ./test.py >t
$ cat t
café
Frustrations #4 and #5 – The other shoes
In English, there’s a saying “waiting for the other shoe to drop”. It means that when one event (usually
bad) happens, you come to expect another event (usually worse) to come after. In this case we have two
other shoes.
Frustration #4: Now it doesn’t take byte strings?!
If you wrap sys.stdout using codecs.getwriter() and think you are now safe to print any variable without
checking its type I am afraid I must inform you that you’re not paying enough attention to Murphy’s Law.
The StreamWriter that codecs.getwriter() provides will take unicode strings and transform them into byte
str before they get to sys.stdout. The problem is if you give it something that’s already a byte str it
tries to transform that as well. To do that it tries to turn the byte str you give it into unicode and
then transform that back into a byte str… and since it uses the ASCII codec to perform those conversions,
chances are that it’ll blow up when making them:
>>> import codecs
>>> import sys
>>> UTF8Writer = codecs.getwriter('utf8')
>>> sys.stdout = UTF8Writer(sys.stdout)
>>> print 'café'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
To work around this, kitchen provides an alternate version of codecs.getwriter() that can deal with both
byte str and unicode strings. Use kitchen.text.converters.getwriter() in place of the codecs version
like this:
>>> import sys
>>> from kitchen.text.converters import getwriter
>>> UTF8Writer = getwriter('utf8')
>>> sys.stdout = UTF8Writer(sys.stdout)
>>> print u'café'
café
>>> print 'café'
café
Frustration #5: Exceptions
Okay, so we’ve gotten ourselves this far. We convert everything to unicode strings. We’re aware that we
need to convert back into byte str before we write to the terminal. We’ve worked around the inability of
the standard getwriter() to deal with both byte str and unicode strings. Are we all set? Well, there’s
at least one more gotcha: raising exceptions with a unicode message. Take a look:
>>> class MyException(Exception):
...     pass
...
>>> raise MyException(u'Cannot do this')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
__main__.MyException: Cannot do this
>>> raise MyException(u'Cannot do this while at a café')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
__main__.MyException:
>>>
No, I didn’t truncate that last line; raising an exception really cannot handle non-ASCII characters in a
unicode message and will output the exception without the message if the message contains them. What
happens if we try to use the handy dandy getwriter() trick to work around this?
>>> import sys
>>> from kitchen.text.converters import getwriter
>>> sys.stderr = getwriter('utf8')(sys.stderr)
>>> raise MyException(u'Cannot do this')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
__main__.MyException: Cannot do this
>>> raise MyException(u'Cannot do this while at a café')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
__main__.MyException>>>
Not only did this also fail, it even swallowed the trailing newline that’s normally there…. So how to
make this work? Transform from unicode strings to byte str manually before outputting:
>>> from kitchen.text.converters import to_bytes
>>> raise MyException(to_bytes(u'Cannot do this while at a café'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
__main__.MyException: Cannot do this while at a café
>>>
WARNING:
If you use codecs.getwriter() on sys.stderr, you’ll find that raising an exception with a byte str is
broken by the default StreamWriter as well. Don’t do that or you’ll have no way to output non-ASCII
characters. If you want to use a StreamWriter to encode other things on stderr while still having
working exceptions, use kitchen.text.converters.getwriter().
Frustration #6: Inconsistent APIs Part deux
Sometimes you do everything right in your code but other people’s code fails you. With unicode issues
this happens more often than we want. A glaring example of this is when you get values back from a
function that aren’t consistently unicode string or byte str.
An example from the python standard library is gettext. The gettext functions are used to help translate
messages that you display to users in the users’ native languages. Since most languages contain letters
outside of the ASCII range, the values that are returned contain unicode characters. gettext provides
you with ugettext() and ungettext() to return these translations as unicode strings and gettext(),
ngettext(), lgettext(), and lngettext() to return them as encoded byte str. Unfortunately, even though
they’re documented to return only one type of string or the other, the implementation has corner cases
where the wrong type can be returned.
This means that even if you separate your unicode string and byte str correctly before you pass your
strings to a gettext function, afterwards, you might have to check that you have the right sort of string
type again.
NOTE:
kitchen.i18n provides alternate gettext translation objects that return only byte str or only unicode
string.
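For instance (this mirrors the larger example near the end of this document), the translation objects
from kitchen.i18n hand you methods with consistent return types:
from kitchen.i18n import get_translation_object
translations = get_translation_object('example')
_ = translations.ugettext    # always returns unicode strings
b_ = translations.lgettext   # always returns byte str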
A few solutions
Now that we’ve identified the issues, can we define a comprehensive strategy for dealing with them?
Convert text at the border
If you get some piece of text from a library, read from a file, etc, turn it into a unicode string
immediately. Since python is moving in the direction of unicode strings everywhere it’s going to be
easier to work with unicode strings within your code.
If your code is heavily involved with using things that are bytes, you can do the opposite and convert
all text into byte str at the border and only convert to unicode when you need it for passing to another
library or performing string operations on it.
In either case, the important thing is to pick a default type for strings and stick with it throughout
your code. When you mix the types it becomes much easier to operate on a string with a function that can
only use the other type by mistake.
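Here’s a minimal sketch of converting at the border using kitchen’s to_unicode() (read_config() and the
utf-8 default are illustrative assumptions, not part of any API):
from kitchen.text.converters import to_unicode

def read_config(filename):
    # Decode at the border; everything past this point deals in unicode
    raw = open(filename, 'r').read()
    return to_unicode(raw, encoding='utf-8', errors='replace')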
NOTE:
In python3, the abstract unicode type becomes much more prominent. The type named str is the
equivalent of python2’s unicode and python3’s bytes type replaces python2’s str. Most APIs deal in
the unicode type of string with just some pieces that are low level dealing with bytes. The implicit
conversions between bytes and unicode are removed and whenever you want to make the conversion you need
to do so explicitly.
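Under python3 the same flow looks like this minimal sketch; both directions are explicit:
# python3: no implicit conversion; both directions are explicit
text = b'caf\xc3\xa9'.decode('utf-8')    # bytes -> str (unicode text)
data = text.encode('utf-8')              # str -> bytes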
When the data needs to be treated as bytes (or unicode) use a naming convention
Sometimes you’re converting nearly all of your data to unicode strings but you have one or two values
where you have to keep byte str around. This is often the case when you need to use the value verbatim
with some external resource. For instance, filenames or key values in a database. When you do this, use
a naming convention for the data you’re working with so you (and others reading your code later) don’t
get confused about what’s being stored in the value.
If you need both a textual string to present to the user and a byte value for an exact match, consider
keeping both versions around. You can either use two variables for this or a dict whose key is the byte
value.
NOTE:
You can use the naming convention used in kitchen as a guide for implementing your own naming
convention. It prefixes byte str variables of unknown encoding with b_ and byte str of known encoding
with the encoding name like: utf8_. If the default was to handle str and only keep a few unicode
values, those variables would be prefixed with u_.
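Applied to a single value, that convention looks something like this (raw_value_from_api is a
hypothetical placeholder for bytes you received from elsewhere):
from kitchen.text.converters import to_unicode

b_filename = raw_value_from_api             # byte str of unknown encoding
u_filename = to_unicode(b_filename)         # unicode string
utf8_filename = u_filename.encode('utf-8')  # byte str of a known encoding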
When outputting data, convert back into bytes
When you go to send your data back outside of your program (to the filesystem, over the network,
displaying to the user, etc) turn the data back into a byte str. How you do this will depend on the
expected output format of the data. For displaying to the user, you can use the user’s default encoding
using locale.getpreferredencoding(). For entering into a file, your best bet is to pick a single
encoding and stick with it.
WARNING:
When using the encoding that the user has set (for instance, with locale.getpreferredencoding()),
remember that they may have their encoding set to something that can’t display every single unicode
character. That means when you convert from unicode to a byte str you need to decide what should
happen if the byte value is not valid in the user’s encoding. For purposes of displaying messages to
the user, it’s usually okay to use the replace encoding error handler to replace the invalid
characters with a question mark or other symbol meaning the character couldn’t be displayed.
You can use kitchen.text.converters.getwriter() to do this automatically for sys.stdout. When creating
exception messages be sure to convert to bytes manually.
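As a minimal sketch of the display case described above:
import locale

encoding = locale.getpreferredencoding()
u_msg = u'I sat down for coffee at the caf\xe9'
# 'replace' swaps in a marker for anything the user's encoding cannot
# express instead of raising UnicodeEncodeError
print u_msg.encode(encoding, 'replace')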
When writing unittests, include non-ASCII values and both unicode and str types
Unless you know that a specific portion of your code will only deal with ASCII, be sure to include
non-ASCII values in your unittests. Including a few characters from several different scripts is highly
advised as well because some code may have special cased accented roman characters but not know how to
handle characters used in Asian alphabets.
Similarly, unless you know that a portion of your code will only be given unicode strings or only byte
str, be sure to try variables of both types in your unittests. When doing this, make sure that the
variables are also non-ASCII as python’s implicit conversion will mask problems with pure ASCII data. In
many cases, it makes sense to check what happens if byte str and unicode strings that won’t decode in the
present locale are given.
Be vigilant about spotting poor APIs
Make sure that the libraries you use return only unicode strings or byte str. Unittests can help you
spot issues here by running many variations of data through your functions and checking that you’re still
getting the types of string that you expect.
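Here’s a sketch of what such a unittest might look like; it exercises kitchen’s to_unicode() with
non-ASCII data in both string types and checks the returned type, not just the value:
# -*- coding: utf-8 -*-
import unittest
from kitchen.text.converters import to_unicode

class TestToUnicode(unittest.TestCase):
    def test_non_ascii_bytes_and_unicode(self):
        # Pure ASCII data would let implicit conversion mask type problems
        self.assertEqual(to_unicode('café'), u'café')
        self.assertEqual(to_unicode(u'café'), u'café')
        # Verify the type that comes back, not just the value
        self.assertTrue(isinstance(to_unicode('café'), unicode))

if __name__ == '__main__':
    unittest.main()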
Example: Putting this all together with kitchen
The kitchen library provides a wide array of functions to help you deal with byte str and unicode strings
in your program. Here’s a short example that uses many kitchen functions to do its work:
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import locale
import os
import sys
import unicodedata
from kitchen.text.converters import getwriter, to_bytes, to_unicode
from kitchen.i18n import get_translation_object

if __name__ == '__main__':
    # Setup gettext driven translations but use the kitchen functions so
    # we don't have the mismatched bytes-unicode issues.
    translations = get_translation_object('example')
    # We use _() for marking strings that we operate on as unicode
    # This is pretty much everything
    _ = translations.ugettext
    # And b_() for marking strings that we operate on as bytes.
    # This is limited to exceptions
    b_ = translations.lgettext

    # Setup stdout
    encoding = locale.getpreferredencoding()
    Writer = getwriter(encoding)
    sys.stdout = Writer(sys.stdout)

    # Load data.  Format is filename\0description
    # description should be utf-8 but filename can be any legal filename
    # on the filesystem
    # Sample datafile.txt:
    #   /etc/shells\x00Shells available on caf\xc3\xa9.lan
    #   /var/tmp/file\xff\x00File with non-utf8 data in the filename
    #
    # And to create /var/tmp/file\xff (under bash or zsh) do:
    #   echo 'Some data' > /var/tmp/file$'\377'
    datafile = open('datafile.txt', 'r')
    data = {}
    for line in datafile:
        # We're going to keep filename as bytes because we will need the
        # exact bytes to access files on a POSIX operating system.
        # description, we'll immediately transform into unicode type.
        b_filename, description = line.split('\0', 1)

        # to_unicode defaults to decoding output from utf-8 and replacing
        # any problematic bytes with the unicode replacement character.
        # We accept mangling of the description here knowing that our file
        # format is supposed to use utf-8 in that field and that the
        # description will only be displayed to the user, not used as
        # a key value.
        description = to_unicode(description, 'utf-8').strip()
        data[b_filename] = description
    datafile.close()

    # We're going to add a pair of extra fields onto our data to show the
    # length of the description and the filesize.  We put those between
    # the filename and description because we haven't checked that the
    # description is free of NULLs.
    datafile = open('newdatafile.txt', 'w')

    # Name filename with a b_ prefix to denote byte string of unknown encoding
    for b_filename in data:
        # Since we have the byte representation of filename, we can read
        # any filename
        if os.access(b_filename, os.F_OK):
            size = os.path.getsize(b_filename)
        else:
            size = 0
        # Because the description is unicode type, we know the number of
        # characters corresponds to the length of the normalized unicode
        # string.
        length = len(unicodedata.normalize('NFC', data[b_filename]))

        # Print a summary to the screen.
        # Note that we do not let implicit type conversion from str to
        # unicode transform b_filename into a unicode string.  That might
        # fail as python would use the ASCII codec.  Instead we use
        # to_unicode() to explicitly transform in a way that we know will
        # not traceback.
        print _(u'filename: %s') % to_unicode(b_filename)
        print _(u'file size: %s') % size
        print _(u'desc length: %s') % length
        print _(u'description: %s') % data[b_filename]

        # First combine the unicode portion
        line = u'%s\0%s\0%s' % (size, length, data[b_filename])
        # Since the filenames are bytes, turn everything else to bytes
        # before combining.  Turning into unicode first would be wrong as
        # the bytes in b_filename might not convert
        b_line = '%s\0%s\n' % (b_filename, to_bytes(line))

        # Just to demonstrate that getwriter will pass bytes through fine
        print b_('Wrote: %s') % b_line
        datafile.write(b_line)
    datafile.close()

    # And just to show how to properly deal with an exception.
    # Note two things about this:
    # 1) We use the b_() function to translate the string.  This returns a
    #    byte string instead of a unicode string
    # 2) We're using the b_() function returned by kitchen.  If we had
    #    used the one from gettext we would need to convert the message to
    #    a byte str first
    message = u'Demonstrate the proper way to raise exceptions.  Sincerely, \u3068\u3057\u304a'
    raise Exception(b_(message))
SEE ALSO:
kitchen.text.converters
Designing Unicode Aware APIs
APIs that deal with byte str and unicode strings are difficult to get right. Here are a few strategies
with pros and cons of each.
Contents
• Designing Unicode Aware APIs
• Take either bytes or unicode, output only unicode
• Take either bytes or unicode, output the same type
• Separate functions
• Deciding whether to take str or unicode when no value is returned
• Writing to external data
• Updating data structures
• APIs to Avoid
• Returning unicode unless a conversion fails
• Ignoring values with no chance of recovery
• Raising a UnicodeException with no chance of recovery
• Knowing your data
• Do you need to operate on both bytes and unicode?
• Can you restrict the encodings?
• Single byte encodings
• Multibyte encodings
• Fixed width
• Variable Width
• ASCII compatible
• Escaped
• Other
Take either bytes or unicode, output only unicode
In this strategy, you allow the user to enter either unicode strings or byte str but what you give back
is always unicode. This strategy is easy for novice end users to start using immediately as they will be
able to feed either type of string into the function and get back a string that they can use in other
places.
However, it does lead to the novice writing code that functions correctly when testing it with ASCII-only
data but fails when given data that contains non-ASCII characters. Worse, if your API is not designed to
be flexible, the consumer of your code won’t be able to easily correct those problems once they find
them.
Here’s a good API that uses this strategy:
from kitchen.text.converters import to_unicode
def truncate(msg, max_length, encoding='utf8', errors='replace'):
    msg = to_unicode(msg, encoding, errors)
    return msg[:max_length]
The call to truncate() starts with the essential parameters for performing the task. It ends with two
optional keyword arguments that define the encoding to use to transform from a byte str to unicode and
the strategy to use if undecodable bytes are encountered. The defaults may vary depending on the use
cases you have in mind. When the output is generally going to be printed for the user to see,
errors='replace' is a good default. If you are constructing keys to a database, raising an exception
(with errors='strict') may be a better default. In either case, having both parameters allows the person
using your API to choose how they want to handle any problems. Having the values is also a clue to them
that a conversion from byte str to unicode string is going to occur.
NOTE:
If you’re targeting python-3.1 and above, errors='surrogateescape' may be a better default than
errors='strict'. You need to be mindful of a few things when using surrogateescape though:
• surrogateescape will cause issues if a non-ASCII compatible encoding is used (for instance, UTF-16
and UTF-32.) That makes it unhelpful in situations where a true general purpose method of encoding
must be found. PEP 383 mentions that surrogateescape was specifically designed with the limitations
of translating using system locales (where ASCII compatibility is generally seen as inescapable) so
you should keep that in mind.
• If you use surrogateescape to decode from bytes to unicode you will need to use an error handler
other than strict to encode, as the lone surrogate that this error handler creates makes for invalid
unicode that must be handled when encoding. In Python-3.1.2 or less, a bug in the encoder error
handlers means that you can only use surrogateescape to encode; anything else will throw an error.
Evaluate your usages of the variables in question to see what makes sense.
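For example, under python-3.1+ an undecodable byte round-trips through a lone surrogate like this:
>>> b = b'caf\xff'
>>> s = b.decode('utf-8', 'surrogateescape')
>>> s
'caf\udcff'
>>> s.encode('utf-8', 'surrogateescape')
b'caf\xff'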
Here’s a bad example of using this strategy:
from kitchen.text.converters import to_unicode
def truncate(msg, max_length):
    msg = to_unicode(msg)
    return msg[:max_length]
In this example, we don’t have the optional keyword arguments for encoding and errors. A user who uses
this function is more likely to miss the fact that a conversion from byte str to unicode is going to
occur. And once an error is reported, they will have to look through their backtrace and think harder
about where they want to transform their data into unicode strings instead of having the opportunity to
control how the conversion takes place in the function itself. Note that the user does have the ability
to make this work by making the transformation to unicode themselves:
from kitchen.text.converters import to_unicode
msg = to_unicode(msg, encoding='euc_jp', errors='ignore')
new_msg = truncate(msg, 5)
Take either bytes or unicode, output the same type
This strategy is sometimes called polymorphic because the type of data that is returned is dependent on
the type of data that is received. The concept is that when you are given a byte str to process, you
return a byte str in your output. When you are given unicode strings to process, you return unicode
strings in your output.
This can work well for end users as the ones that know about the difference between the two string types
will already have transformed the strings to their desired type before giving it to this function. The
ones that don’t can remain blissfully ignorant (at least, as far as your function is concerned) as the
function does not change the type.
In cases where the encoding of the byte str is known or can be discovered based on the input data this
works well. If you can’t figure out the input encoding, however, this strategy can fail in any of the
following cases:
1. It needs to do an internal conversion between byte str and unicode string.
2. It cannot return the same data as either a unicode string or byte str.
3. You may need to deal with byte strings that are not byte-compatible with ASCII.
First, a couple examples of using this strategy in a good way:
def translate(msg, table):
    replacements = table.keys()
    new_msg = []
    for index, char in enumerate(msg):
        if char in replacements:
            new_msg.append(table[char])
        else:
            new_msg.append(char)
    return ''.join(new_msg)
In this example, all of the strings that we use (except the empty string, which is okay because it
doesn’t have any characters to encode) come from outside of the function. Due to that, the user is
responsible for making sure that msg and the keys and values in table all match in terms of type
(unicode vs str) and encoding (you can do some error checking to make sure the user gave all the same
type but you can’t do the same for the user giving different encodings). You do not need to make changes
to the string that require you to know the encoding or type of the string; everything is a simple
replacement of one element in the sequence of characters in msg with the character in table.
import json
from kitchen.text.converters import to_unicode, to_bytes

def first_field_from_json_data(json_string):
    '''Return the first field in a json data structure.

    The format of the json data is a simple list of strings.
    '["one", "two", "three"]'
    '''
    if isinstance(json_string, unicode):
        # On all python versions, json.loads() returns unicode if given
        # a unicode string
        return json.loads(json_string)[0]

    # Byte str: figure out which encoding we're dealing with
    if '\x00' not in json_string[:2]:
        encoding = 'utf8'
    elif '\x00\x00\x00' == json_string[:3]:
        encoding = 'utf-32-be'
    elif '\x00\x00\x00' == json_string[1:4]:
        encoding = 'utf-32-le'
    elif '\x00' == json_string[0] and '\x00' == json_string[2]:
        encoding = 'utf-16-be'
    else:
        encoding = 'utf-16-le'

    data = json.loads(unicode(json_string, encoding))
    return data[0].encode(encoding)
In this example the function takes either a byte str type or a unicode string that has a list in json
format and returns the first field from it as the type of the input string. The first section of code is
very straightforward; we receive a unicode string, parse it with a function, and then return the first
field from our parsed data (which json.loads() has already converted to unicode strings).
The second portion that deals with byte str is not so straightforward. Before we can parse the string we
have to determine what characters the bytes in the string map to. If we didn’t do that, we wouldn’t be
able to properly find which characters are present in the string. In order to do that we have to figure
out the encoding of the byte str. Luckily, the json specification states that all strings are unicode
and encoded with one of UTF32be, UTF32le, UTF16be, UTF16le, or UTF-8. It further defines the format such
that the first two characters are always ASCII. Each of these has a different sequence of NULLs when
they encode an ASCII character. We can use that to detect which encoding was used to create the byte
str.
Finally, we return the byte str by encoding the unicode back to a byte str.
As you can see, in this example we have to convert from byte str to unicode and back. But we know from
the json specification that byte str has to be one of a limited number of encodings that we are able to
detect. That ability makes this strategy work.
Now for some examples of using this strategy in ways that fail:
import unicodedata

def first_char(msg):
    '''Return the first character in a string'''
    if not isinstance(msg, unicode):
        try:
            msg = unicode(msg, 'utf8')
        except UnicodeError:
            msg = unicode(msg, 'latin1')
    msg = unicodedata.normalize('NFC', msg)
    return msg[0]
If you look at that code and think that there’s something fragile and prone to breaking in the try:
except: block you are correct in being suspicious. This code will fail on multi-byte character sets that
aren’t UTF-8. It can also fail on data where the sequence of bytes is valid UTF-8 but the bytes are
actually of a different encoding. The reason this code fails is that we don’t know what encoding the
bytes are in and the code must convert from a byte str to a unicode string in order to function.
In order to make this code robust we must know the encoding of msg. The only way to know that is to ask
the user so the API must do that:
import unicodedata

def number_of_chars(msg, encoding='utf8', errors='strict'):
    if not isinstance(msg, unicode):
        msg = unicode(msg, encoding, errors)
    msg = unicodedata.normalize('NFC', msg)
    return len(msg)
Another example of failure:
import os

def listdir(directory):
    files = os.listdir(directory)
    if isinstance(directory, str):
        return files
    # files could contain both bytes and unicode
    new_files = []
    for filename in files:
        if not isinstance(filename, unicode):
            # What to do here?
            continue
        new_files.append(filename)
    return new_files
This function illustrates the second failure mode. Here, not all of the possible values can be
represented as unicode without knowing more about the encoding of each of the filenames involved. Since
each filename could have a different encoding there are a few different options to pursue. We could make
this function always return byte str since that can accurately represent anything that could be returned.
If we want to return unicode we need to at least allow the user to specify what to do in case of an error
decoding the bytes to unicode. We can also let the user specify the encoding to use for doing the
decoding but that won’t help in all cases since not all files will be in the same encoding (or even
necessarily in any encoding):
import locale
import os

def listdir(directory, encoding=locale.getpreferredencoding(), errors='strict'):
    # Note: In python-3.1+, surrogateescape may be a better default
    files = os.listdir(directory)
    if isinstance(directory, str):
        return files
    new_files = []
    for filename in files:
        if not isinstance(filename, unicode):
            filename = unicode(filename, encoding=encoding, errors=errors)
        new_files.append(filename)
    return new_files
Note that although we use errors in this example as what to pass to the codec that decodes to unicode we
could also have an errors argument that decides other things to do like skip a filename entirely, return
a placeholder (Nondisplayable filename), or raise an exception.
This leaves us with one last failure to describe:
def first_field(csv_string):
    '''Return the first field in a comma separated values string.'''
    try:
        return csv_string[:csv_string.index(',')]
    except ValueError:
        return csv_string
This code looks simple enough. The hidden error here is that we are searching for a comma character in a
byte str but not all encodings will use the same sequence of bytes to represent the comma. If you use an
encoding that’s not ASCII compatible on the byte level, then the literal comma ',' in the above code will
match inappropriate bytes. Some examples of how it can fail:
• Will find the byte representing an ASCII comma in another character
• Will find the comma but leave trailing garbage bytes on the end of the string
• Will not match the character that represents the comma in this encoding
There are two ways to solve this. You can either take the encoding value from the user or you can take
the separator value from the user. Of the two, taking the encoding is the better option for two reasons:
1. Taking a separator argument doesn’t clearly document for the API user that the reason they must give
it is to properly match the encoding of the csv_string. They’re just as likely to think that it’s
simply a way to specify an alternate character (like “:” or “|”) for the separator.
2. It’s possible for a variable width encoding to reuse the same byte sequence for different characters
in multiple sequences.
NOTE:
UTF-8 is resistant to this as any character’s sequence of bytes will never be a subset of another
character’s sequence of bytes.
With that in mind, here’s how to improve the API:
def first_field(csv_string, encoding='utf-8', errors='replace'):
    if not isinstance(csv_string, unicode):
        u_string = unicode(csv_string, encoding, errors)
        is_unicode = False
    else:
        u_string = csv_string
        is_unicode = True
    try:
        field = u_string[:u_string.index(u',')]
    except ValueError:
        return csv_string
    if not is_unicode:
        field = field.encode(encoding, errors)
    return field
NOTE:
If you decide you’ll never encounter a variable width encoding that reuses byte sequences you can use
this code instead:
def first_field(csv_string, encoding='utf-8'):
    try:
        return csv_string[:csv_string.index(','.encode(encoding))]
    except ValueError:
        return csv_string
Separate functions
Sometimes you want to be able to take either byte str or unicode strings, perform similar operations on
either one and then return data in the same format as was given. Probably the easiest way to do that is
to have separate functions for each and adopt a naming convention to show that one is for working with
byte str and the other is for working with unicode strings:
def translate_b(msg, table):
    '''Replace values in str with other byte values like unicode.translate'''
    if not isinstance(msg, str):
        raise TypeError('msg must be of type str')
    str_table = [chr(s) for s in xrange(0, 256)]
    delete_chars = []
    for chr_val in (k for k in table.keys() if isinstance(k, int)):
        if chr_val > 255:
            raise ValueError('Keys in table must not exceed 255')
        if table[chr_val] is None:
            delete_chars.append(chr(chr_val))
        elif isinstance(table[chr_val], int):
            if table[chr_val] > 255 or table[chr_val] < 0:
                raise TypeError('table values cannot be more than 255 or less than 0')
            str_table[chr_val] = chr(table[chr_val])
        else:
            if not isinstance(table[chr_val], str):
                raise TypeError('character mapping must return integer, None or str')
            str_table[chr_val] = table[chr_val]
    str_table = ''.join(str_table)
    delete_chars = ''.join(delete_chars)
    return msg.translate(str_table, delete_chars)

def translate(msg, table):
    '''Replace values in a unicode string with other values'''
    if not isinstance(msg, unicode):
        raise TypeError('msg must be of type unicode')
    return msg.translate(table)
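For instance, used with mappings that follow each function’s conventions:
>>> translate_b('abc', {ord('a'): ord('z')})
'zbc'
>>> translate(u'abc', {ord(u'a'): u'z'})
u'zbc'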
There are several things that we have to do in this API:
• Because the function names might not be enough of a clue to the user of the functions of the value
types that are expected, we have to check that the types are correct.
• We keep the behaviour of the two functions as close to the same as possible, just with byte str and
unicode strings substituted for each other.
Deciding whether to take str or unicode when no value is returned
Not all functions have a return value. Sometimes a function is there to interact with something external
to python, for instance, writing a file out to disk or a method exists to update the internal state of a
data structure. One of the main questions with these APIs is whether to take byte str, unicode string,
or both. The answer depends on your use case but I’ll give some examples here.
Writing to external data
When your information is going to an external data source like writing to a file you need to decide
whether to take in unicode strings or byte str. Remember that most external data sources are not going
to be dealing with unicode directly. Instead, they’re going to be dealing with a sequence of bytes that
may be interpreted as unicode. With that in mind, you either need to have the user give you a byte str
or convert to a byte str inside the function.
Next you need to think about the type of data that you’re receiving. If it’s textual data, (for
instance, this is a chat client and the user is typing messages that they expect to be read by another
person) it probably makes sense to take in unicode strings and do the conversion inside your function.
On the other hand, if this is a lower level function that’s passing data into a network socket, it
probably should be taking byte str instead.
Just as noted in the API notes above, you should specify an encoding and errors argument if you need to
transform from unicode string to byte str and you are unable to guess the encoding from the data itself.
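A sketch of such a function (write_message() and its defaults are illustrative assumptions; to_bytes()
is the kitchen converter, and it passes byte str through unchanged):
from kitchen.text.converters import to_bytes

def write_message(msg, stream, encoding='utf-8', errors='replace'):
    # Convert at the border; textual unicode comes in, bytes go out
    stream.write(to_bytes(msg, encoding=encoding, errors=errors))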
Updating data structures
Sometimes your API is just going to update a data structure and not immediately output that data
anywhere. Just as when writing external data, you should think about both what your function is going to
do with the data eventually and what the caller of your function is thinking that they’re giving you.
Most of the time, you’ll want to take unicode strings and enter them into the data structure as unicode
when the data is textual in nature. You’ll want to take byte str and enter them into the data structure
as byte str when the data is not text. Use a naming convention so the user knows what’s expected.
APIs to Avoid
There are a few APIs that are just wrong. If you catch yourself making an API that does one of these
things, change it before anyone sees your code.
Returning unicode unless a conversion fails
This type of API usually deals with byte str at some point and converts it to unicode because it’s
usually thought to be text. However, there are times when the bytes fail to convert to a unicode string.
When that happens, this API returns the raw byte str instead of a unicode string. One example of this is
present in the python standard library: python2’s os.listdir():
>>> import os
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> os.mkdir('/tmp/mine')
>>> os.chdir('/tmp/mine')
>>> open('nonsense_char_\xff', 'w').close()
>>> open('all_ascii', 'w').close()
>>> os.listdir(u'.')
[u'all_ascii', 'nonsense_char_\xff']
The problem with APIs like this is that they cause failures that are hard to debug because they don’t
happen where the variables are set. For instance, let’s say you take the filenames from os.listdir() and
give them to this function:
def normalize_filename(filename):
    '''Change spaces and dashes into underscores'''
    return filename.translate({ord(u' '): u'_', ord(u'-'): u'_'})
When you test this, you use filenames that all are decodable in your preferred encoding and everything
seems to work. But when this code is run on a machine that has filenames in multiple encodings the
filenames returned by os.listdir() suddenly include byte str. And byte str has a different
translate() method that takes different arguments. So the code raises an exception where it’s not
immediately obvious that os.listdir() is at fault.
Ignoring values with no chance of recovery
An early version of python3 attempted to fix the os.listdir() problem pointed out in the last section by
returning all values that were decodable to unicode and omitting the filenames that were not. This led
to the following output:
>>> import os
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> os.mkdir('/tmp/mine')
>>> os.chdir('/tmp/mine')
>>> open(b'nonsense_char_\xff', 'w').close()
>>> open('all_ascii', 'w').close()
>>> os.listdir('.')
['all_ascii']
The issue with this type of code is that it is silently doing something surprising. The caller expects
to get a full list of files back from os.listdir(). Instead, it silently ignores some of the files,
returning only a subset. This leads to code that doesn’t do what is expected, and the omission may go
unnoticed until the code is in production and someone notices that something important is being missed.
Raising a UnicodeException with no chance of recovery
Believe it or not, a few libraries exist that make it impossible to deal with unicode text without
raising a UnicodeError. What seems to occur in these libraries is that the library has functions that
expect to receive a unicode string. However, internally, those functions call other functions that
expect to receive a byte str. The programmer of the API was smart enough to convert from a unicode
string to a byte str but they did not give the user the chance to specify the encodings to use or how to
deal with errors. This results in exceptions when the user passes in a byte str because the initial
function wants a unicode string, and exceptions when the user passes in a unicode string because the
function can’t convert the string to bytes in the encoding that it has selected.
Do not put the user in the position of not being able to use your API without raising a UnicodeError with
certain values. If you can only safely take unicode strings, document that byte str is not allowed and
vice versa. If you have to convert internally, make sure to give the caller of your function parameters
to control the encoding and how to treat errors that may occur during the encoding/decoding process. If
your code will raise a UnicodeError with non-ASCII values no matter what, you should probably rethink
your API.
Knowing your data
If you’ve read all the way down to this section without skipping, you’ve seen several admonitions about
how the type of data you are processing affects the viability of the various API choices.
Here are a few things to consider in your data:
Do you need to operate on both bytes and unicode?
Much of the data in libraries, programs, and the general environment outside of python is written where
strings are sequences of bytes. So when we interact with data that comes from outside of python or data
that is about to leave python it may make sense to only operate on the data as a byte str. There’s two
times when this may make sense:
1. The user is intended to hand the data to the function and then the function takes care of sending the
data outside of python (to the filesystem, over the network, etc).
2. The data is not representable as text. For instance, writing a binary file format.
Even when your code is operating in this area you still need to think a little more about your data. For
instance, it might make sense for the person using your API to pass in unicode strings and let the
function convert that into the byte str that it then sends over the wire.
There are also times when it might make sense to operate only on unicode strings. unicode represents
text so anytime that you are working on textual data that isn’t going to leave python it has the
potential to be a unicode-only API. However, there are two things that you should consider when designing
a unicode-only API:
1. As your API gains popularity, people are going to use your API in places that you may not have thought
of. Corner cases in these other places may mean that processing bytes is desirable.
2. In python2, byte str and unicode are often used interchangeably with each other. That means that
people programming against your API may have received str from some other API and it would be most
convenient for their code if your API accepted it.
NOTE:
In python3, the separation between the text type and the byte type is clearer. So in python3,
there’s less need to have all APIs take both unicode and bytes.
Can you restrict the encodings?
If you determine that you have to deal with byte str you should realize that not all encodings are
created equal. Each has different properties that may make it possible to provide a simpler API provided
that you can reasonably tell the users of your API that they cannot use certain classes of encodings.
As one example, if you are required to find a comma (,) in a byte str you have different choices based on
what encodings are allowed. If you can reasonably restrict your API users to only giving ASCII
compatible encodings you can do this simply by searching for the literal comma character because that
character will be represented by the same byte sequence in all ASCII compatible encodings.
The following are some classes of encodings to be aware of as you decide how generic your code needs to
be.
Single byte encodings
Single byte encodings can only represent 256 total characters. They encode each character’s code point
as the single byte with the equivalent numeric value.
Most single byte encodings are ASCII compatible. ASCII compatible encodings are the most likely to be
usable without changes to code so this is good news. A notable exception to this is the EBCDIC family of
encodings.
Multibyte encodings
Multibyte encodings use more than one byte to encode some characters.
Fixed width
Fixed width encodings have a set number of bytes to represent all of the characters in the character set.
UTF-32 is an example of a fixed width encoding that uses four bytes per character and can express every
unicode character. There are a number of problems with writing APIs that need to operate on fixed
width, multibyte characters. To go back to our earlier example of finding a comma in a string, we have
to realize that even in UTF-32, where the code point for ASCII characters is the same as in ASCII, the
byte sequence for them is different. So you cannot search for the literal comma byte as it may pick
up false positives and may break a byte sequence in an odd place.
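To make that concrete, here’s the comma and another character (U+2C41) in UTF-32 big endian; the comma’s
byte, 0x2c, appears inside the other character’s sequence, so a naive byte search finds a false positive:
>>> u','.encode('utf-32-be')
'\x00\x00\x00,'
>>> u'\u2c41'.encode('utf-32-be')
'\x00\x00,A'
>>> ',' in u'\u2c41'.encode('utf-32-be')
True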
Variable Width
ASCII compatible
UTF-8 and the EUC family of encodings are examples of ASCII compatible multi-byte encodings. They
achieve this by adhering to two principles:
• All of the ASCII characters are represented by the byte that they are in the ASCII encoding.
• None of the ASCII byte sequences are reused in any other byte sequence for a different character.
Escaped
Some multibyte encodings work by using only bytes from the ASCII encoding but when a particular sequence
of those bytes is found, they are interpreted as meaning something other than their ASCII values. UTF-7
is one such encoding that can encode all of the unicode code points. For instance, here are some
Japanese characters encoded as UTF-7:
>>> a = u'\u304f\u3089\u3068\u307f'
>>> print a
くらとみ
>>> print a.encode('utf-7')
+ME8wiTBoMH8-
These encodings can be used when you need to encode unicode data that may contain non-ASCII characters
for inclusion in an ASCII only transport medium or file.
However, they are not ASCII compatible in the sense that we used earlier, as the bytes that represent an
ASCII character are being reused as part of other characters. If you were to search for a literal plus
sign in this encoded string, you would run across many false positives, for instance.
Other
There are many other popular variable width encodings, for instance UTF-16 and shift-JIS. Many of these
are not ASCII compatible so you cannot search for a literal ASCII character without danger of false
positives or false negatives.
Kitchen API
Kitchen is structured as a collection of modules. In its current configuration, Kitchen ships with the
following modules. Other addon modules that may drag in more dependencies can be found on the project
webpage.
Kitchen.i18n Module
I18N is an important piece of any modern program. Unfortunately, setting up i18n in your program is
often a confusing process. The functions provided here aim to make the programming side of that a little
easier.
Most projects will be able to do something like this when they startup:
# myprogram/__init__.py:
import os
import sys
from kitchen.i18n import easy_gettext_setup
_, N_ = easy_gettext_setup('myprogram', localedirs=(
    os.path.join(os.path.realpath(os.path.dirname(__file__)), 'locale'),
    os.path.join(sys.prefix, 'lib', 'locale')
))
Then, in other files that have strings that need translating:
# myprogram/commands.py:
from myprogram import _, N_
def print_usage():
    print _(u"""available commands are:
    --help              Display help
    --version           Display version of this program
    --bake-me-a-cake    as fast as you can
    """)

def print_invitations(age):
    print _('Please come to my party.')
    print N_('I will be turning %(age)s year old',
             'I will be turning %(age)s years old', age) % {'age': age}
See the documentation of easy_gettext_setup() and get_translation_object() for more details.
SEE ALSO:
gettext
for details of how the python gettext facilities work
babel The babel module for in depth information on gettext, message catalogs, and translating
your app. babel provides some nice features for i18n on top of gettext
Functions
easy_gettext_setup() should satisfy the needs of most users. get_translation_object() is designed to
ease the way for anyone that needs more control.
kitchen.i18n.easy_gettext_setup(domain, localedirs=(), use_unicode=True)
Setup translation functions for an application
Parameters
• domain – Name of the message domain. This should be a unique name that can be used to
lookup the message catalog for this app.
• localedirs – Iterator of directories to look for message catalogs under. The first
directory to exist is used regardless of whether messages for this domain are present.
If none of the directories exist, fallback on sys.prefix + /share/locale Default: No
directories to search so we just use the fallback.
• use_unicode – If True return the gettext functions for unicode strings else return the
functions for byte str for the translations. Default is True.
Returns
tuple of the gettext function and gettext function for plurals
Setting up gettext can be a little tricky because of a lack of documentation. This function will
setup gettext using the Class-based API for you. For the simple case, you can use the default
values for the optional arguments and call it like this:
_, N_ = easy_gettext_setup('myprogram')
This will get you two functions, _() and N_() that you can use to mark strings in your code for
translation. _() is used to mark strings that don’t need to worry about plural forms no matter
what the value of the variable is. N_() is used to mark strings that do need to have a different
form if a variable in the string is plural.
SEE ALSO:
api-i18n
This module’s documentation has examples of using _() and N_()
get_translation_object()
for information on how to use localedirs to get the proper message catalogs both when in
development and when installed to FHS compliant directories on Linux.
NOTE:
The gettext functions returned from this function should be superior to the ones returned from
gettext. The traits that make them better are described in the DummyTranslations and
NewGNUTranslations documentation.
Changed in version kitchen-0.2.4 (API kitchen.i18n 2.0.0): Changed easy_gettext_setup() to return
the lgettext functions instead of gettext functions when use_unicode=False.
kitchen.i18n.get_translation_object(domain, localedirs=(), languages=None, class_=None, fallback=True,
codeset=None, python2_api=True)
Get a translation object bound to the message catalogs
Parameters
• domain – Name of the message domain. This should be a unique name that can be used to
lookup the message catalog for this app or library.
• localedirs – Iterator of directories to look for message catalogs under. The directories
are searched in order for message catalogs. For each of the directories searched, we
check for message catalogs in any language specified in languages. The message
catalogs are used to create the Translation object that we return. The Translation
object will attempt to lookup the msgid in the first catalog that we found. If it’s not
in there, it will go through each subsequent catalog looking for a match. For this
reason, the order in which you specify the localedirs may be important. If no message
catalogs are found, either return a DummyTranslations object or raise an IOError
depending on the value of fallback. The default localedir from gettext, which is
os.path.join(sys.prefix, 'share', 'locale') on Unix, is implicitly appended to the
localedirs, making it the last directory searched.
• languages –
Iterator of language codes to check for message catalogs. If unspecified, the user’s
locale settings will be used.
SEE ALSO:
gettext.find() for information on what environment variables are used.
• class – The class to use to extract translations from the message catalogs. Defaults to
NewGNUTranslations.
• fallback – If set to False, raise an IOError if no message catalogs are found. If
True, the default, return a DummyTranslations object.
• codeset – Set the character encoding to use when returning byte str objects. This is
equivalent to calling output_charset() on the Translations object that is returned from
this function.
• python2_api – When True (default), return Translation objects that use the python2
gettext api (gettext() and lgettext() return byte str. ugettext() exists and returns
unicode strings). When False, return Translation objects that use the python3 gettext
api (gettext returns unicode strings and lgettext returns byte str. ugettext does not
exist.)
Returns
Translation object to get gettext methods from
If you need more flexibility than easy_gettext_setup(), use this function. It sets up a gettext
Translation object and returns it to you. Then you can access any of the methods of the object
that you need directly. For instance, if you specifically need to access lgettext():
translations = get_translation_object('foo')
translations.lgettext('My Message')
This function is similar to the python standard library gettext.translation() but makes it better
in two ways:
1. It returns NewGNUTranslations or DummyTranslations objects by default. These are superior to
the gettext.GNUTranslations and gettext.NullTranslations objects because they are consistent in
the string type they return and they fix several issues that can cause the python standard
library objects to throw UnicodeError.
2. This function takes multiple directories to search for message catalogs.
The latter is important when setting up gettext in a portable manner. There is not a common
directory for translations across operating systems so one needs to look in multiple directories
for the translations. get_translation_object() is able to handle that if you give it a list of
directories to search for catalogs:
translations = get_translation_object('foo', localedirs=(
    os.path.join(os.path.realpath(os.path.dirname(__file__)), 'locale'),
    os.path.join(sys.prefix, 'lib', 'locale')))
This will search for several different directories:
1. A directory named locale in the same directory as the module that called
get_translation_object(),
2. In /usr/lib/locale
3. In /usr/share/locale (the fallback directory)
This allows gettext to work on Windows and in development (where the message catalogs are
typically in the toplevel module directory) and also when installed under Linux (where the message
catalogs are installed in /usr/share/locale). You (or the system packager) just need to install
the message catalogs in /usr/share/locale and remove the locale directory from the module to make
this work. For example:
In development:
~/foo # Toplevel module directory
~/foo/__init__.py
~/foo/locale # With message catalogs below here:
~/foo/locale/es/LC_MESSAGES/foo.mo
Installed on Linux:
/usr/lib/python2.7/site-packages/foo
/usr/lib/python2.7/site-packages/foo/__init__.py
/usr/share/locale/ # With message catalogs below here:
/usr/share/locale/es/LC_MESSAGES/foo.mo
NOTE:
This function will setup Translation objects that attempt to lookup msgids in all of the found
message catalogs. This means if you have several versions of the message catalogs installed in
different directories that the function searches, you need to make sure that localedirs
specifies the directories so that newer message catalogs are searched first. It also means
that if a newer catalog does not contain a translation for a msgid but an older one that’s in
localedirs does, the translation from that older catalog will be returned.
Changed in version kitchen-1.1.0 (API kitchen.i18n 2.1.0): Added more parameters to
get_translation_object() so it can more easily be used as a replacement for gettext.translation().
Also changed the way localedirs is used: we cycle through them until we find a suitable locale
file rather than simply cycling through until we find a directory that exists. The new code is
based heavily on the python standard library gettext.translation() function.
Changed in version kitchen-1.2.0 (API kitchen.i18n 2.2.0): Added the python2_api parameter
Translation Objects
The standard translation objects from the gettext module suffer from several problems:
• They can throw UnicodeError
• They can’t find translations for non-ASCII byte str messages
• They may return either unicode string or byte str from the same function even though the functions say
they will only return unicode or only return byte str.
DummyTranslations and NewGNUTranslations were written to fix these issues.
class kitchen.i18n.DummyTranslations(fp=None, python2_api=True)
Safer version of gettext.NullTranslations
This Translations class doesn’t translate the strings and is intended to be used as a fallback
when there were errors setting up a real Translations object. It’s safer than
gettext.NullTranslations in its handling of byte str vs unicode strings.
Unlike NullTranslations, this Translation class will never throw a UnicodeError. The code that
you have around a call to DummyTranslations might throw a UnicodeError but at least that will be
in code you control and can fix. Also, unlike NullTranslations all of this Translation object’s
methods guarantee to return byte str except for ugettext() and ungettext() which guarantee to
return unicode strings.
When byte str are returned, the strings will be encoded according to this algorithm:
1. If a fallback has been added, the fallback will be called first. You’ll need to consult the
fallback to see whether it performs any encoding changes.
2. If a byte str was given, the same byte str will be returned.
3. If a unicode string was given and set_output_charset() has been called then we encode the
string using the output_charset
4. If a unicode string was given and this is gettext() or ngettext() and _charset was set, output
in that charset.
5. If a unicode string was given and this is gettext() or ngettext() we encode it using ‘utf-8’.
6. If a unicode string was given and this is lgettext() or lngettext() we encode using the value
of locale.getpreferredencoding()
For ugettext() and ungettext(), we go through the same set of steps with the following
differences:
• We transform byte str into unicode strings for these methods.
• The encoding used to decode the byte str is taken from input_charset if it’s set, otherwise we
decode using UTF-8.
input_charset
is an extension to the python standard library gettext that specifies what charset a
message is encoded in when decoding a message to unicode. This is used for two purposes:
1. If the message string is a byte str, this is used to decode the string to a unicode string
before looking it up in the message catalog.
2. In the ugettext() and ungettext() methods, if a byte str is given as the message and is
untranslated, this is used as the encoding when decoding to unicode. This is different from
_charset which may be set when a message catalog is loaded because input_charset is used to
describe an encoding used in a python source file while _charset describes the encoding used in
the message catalog file.
Any characters that aren’t able to be transformed from a byte str to unicode string or vice versa
will be replaced with a replacement character (ie: u'�' in unicode based encodings, '?' in other
ASCII compatible encodings).
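A minimal interactive sketch of these rules with a default-constructed DummyTranslations (the exact
byte values assume the documented utf-8 defaults):
>>> from kitchen.i18n import DummyTranslations
>>> translations = DummyTranslations()
>>> translations.gettext('caf\xc3\xa9')    # byte str in: same byte str out
'caf\xc3\xa9'
>>> translations.gettext(u'caf\xe9')       # unicode in: utf-8 encoded byte str out
'caf\xc3\xa9'
>>> translations.ugettext('caf\xc3\xa9')   # byte str in: decoded to unicode
u'caf\xe9'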
SEE ALSO:
gettext.NullTranslations
For information about what methods are available and what they do.
Changed in version kitchen-1.1.0 (API kitchen.i18n 2.1.0):
• Although we had adapted gettext(), ngettext(), lgettext(), and lngettext() to always return
byte str, we hadn't forced those byte str to always be in a specified charset. We now make sure
that gettext() and ngettext() return byte str encoded using output_charset if set, otherwise
charset, and if neither of those, UTF-8. With lgettext() and lngettext(), we use output_charset
if set, otherwise locale.getpreferredencoding().
• Setting input_charset and output_charset now also sets those attributes on any fallback
translation objects.
Changed in version kitchen-1.2.0 (API kitchen.i18n 2.2.0): Added the python2_api parameter to __init__()
set_output_charset(charset)
Set the output charset
This serves two purposes. The normal gettext.NullTranslations.set_output_charset() does
not set the output on fallback objects. On python-2.3, gettext.NullTranslations objects
don’t contain this method.
class kitchen.i18n.NewGNUTranslations(fp=None, python2_api=True)
Safer version of gettext.GNUTranslations
gettext.GNUTranslations suffers from two problems that this class fixes.
1. gettext.GNUTranslations can throw a UnicodeError in gettext.GNUTranslations.ugettext() if the
message being translated has non-ASCII characters and there is no translation for it.
2. gettext.GNUTranslations can return byte str from gettext.GNUTranslations.ugettext() and unicode
strings from the other gettext() methods if the message being translated is the wrong type.
When byte str are returned, the strings will be encoded according to this algorithm:
1. If a fallback has been added, the fallback will be called first. You’ll need to consult the
fallback to see whether it performs any encoding changes.
2. If a byte str was given, the same byte str will be returned.
3. If a unicode string was given and set_output_charset() has been called then we encode the
string using the output_charset
4. If a unicode string was given and this is gettext() or ngettext() and a charset was detected
when parsing the message catalog, output in that charset.
5. If a unicode string was given and this is gettext() or ngettext() we encode it using UTF-8.
6. If a unicode string was given and this is lgettext() or lngettext() we encode using the value
of locale.getpreferredencoding()
For ugettext() and ungettext(), we go through the same set of steps with the following
differences:
• We transform byte str into unicode strings for these methods.
• The encoding used to decode the byte str is taken from input_charset if it’s set, otherwise we
decode using UTF-8
input_charset
an extension to the python standard library gettext that specifies what charset a message
is encoded in when decoding a message to unicode. This is used for two purposes:
1. If the message string is a byte str, this is used to decode the string to a unicode string
before looking it up in the message catalog.
2. In the ugettext() and ungettext() methods, if a byte str is given as the message and is
untranslated, this is used as the encoding when decoding to unicode. This is different from the
_charset parameter that may be set when a message catalog is loaded because input_charset is
used to describe an encoding used in a python source file while _charset describes the encoding
used in the message catalog file.
Any characters that aren’t able to be transformed from a byte str to unicode string or vice versa
will be replaced with a replacement character (ie: u'�' in unicode based encodings, '?' in other
ASCII compatible encodings).
SEE ALSO:
gettext.GNUTranslations.gettext
For information about what methods this class has and what they do
Changed in version kitchen-1.1.0 (API kitchen.i18n 2.1.0): Although we had adapted gettext(),
ngettext(), lgettext(), and lngettext() to always return byte str, we hadn't forced those byte str
to always be in a specified charset. We now make sure that gettext() and ngettext() return byte
str encoded using output_charset if set, otherwise charset, and if neither of those, UTF-8. With
lgettext() and lngettext(), we use output_charset if set, otherwise locale.getpreferredencoding().
Kitchen.text: unicode and utf8 and xml oh my!
The kitchen.text module contains functions that deal with text manipulation.
Kitchen.text.converters
Functions to handle conversion of byte str and unicode strings.
Changed in version kitchen-0.2a2 (API kitchen.text 2.0.0): Added getwriter()
Changed in version kitchen-0.2.2 (API kitchen.text 2.1.0): Added exception_to_unicode(),
exception_to_bytes(), EXCEPTION_CONVERTERS, and BYTE_EXCEPTION_CONVERTERS
Changed in version kitchen-1.0.1 (API kitchen.text 2.1.1): Deprecated BYTE_EXCEPTION_CONVERTERS as
we've simplified exception_to_unicode() and exception_to_bytes() to make it unnecessary
Byte Strings and Unicode in Python2
Python2 has two string types, str and unicode. unicode represents an abstract sequence of text
characters. It can hold any character that is present in the unicode standard. str can hold any byte of
data. The operating system and python work together to display these bytes as characters in many cases
but you should always keep in mind that the information is really a sequence of bytes, not a sequence of
characters. In python2 these types are interchangeable much of the time. They are one of the
few pairs of types that automatically convert when used in equality:
>>> # string is converted to unicode and then compared
>>> "I am a string" == u"I am a string"
True
>>> # Other types, like int, don't have this special treatment
>>> 5 == "5"
False
However, this automatic conversion tends to lull people into a false sense of security. As long as
you’re dealing with ASCII characters the automatic conversion will save you from seeing any differences.
Once you start using characters that are not in ASCII, you will start getting UnicodeError and
UnicodeWarning as the automatic conversions between the types fail:
>>> "I am an ñ" == u"I am an ñ"
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
Why do these conversions fail? The reason is that the python2 unicode type represents an abstract
sequence of unicode text known as code points. str, on the other hand, really represents a sequence of
bytes. Those bytes are converted by your operating system to appear as characters on your screen using a
particular encoding (usually with a default defined by the operating system and customizable by the
individual user.) Although ASCII characters are fairly standard in what bytes represent each character,
the bytes outside of the ASCII range are not. In general, each encoding will map a different character
to a particular byte. Newer encodings map individual characters to multiple bytes (which the older
encodings will instead treat as multiple characters). In the face of these differences, python refuses
to guess at an encoding and instead issues a warning or exception and refuses to convert.
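For instance, the single character ñ maps to different byte sequences depending on the encoding, so
python cannot know which bytes you meant (an illustrative snippet, not part of kitchen):
>>> u'\xf1'.encode('utf-8')     # ñ as two bytes in utf-8
'\xc3\xb1'
>>> u'\xf1'.encode('latin-1')   # ñ as one byte in latin-1
'\xf1'
>>> '\xc3\xb1' == '\xf1'
False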
SEE ALSO:
overcoming-frustration
For a longer introduction on this subject.
Strategy for Explicit Conversion
So what is the best method of dealing with this weltering babble of incoherent encodings? The basic
strategy is to explicitly turn everything into unicode when it first enters your program. Then, when you
send it to output, you can transform the unicode back into bytes. Doing this allows you to control the
encodings that are used and avoid getting tracebacks due to UnicodeError. Using the functions defined in
this module, that looks something like this:
>>> from kitchen.text.converters import to_unicode, to_bytes
>>> name = raw_input('Enter your name: ')
Enter your name: Toshio くらとみ
>>> name
'Toshio \xe3\x81\x8f\xe3\x82\x89\xe3\x81\xa8\xe3\x81\xbf'
>>> type(name)
<type 'str'>
>>> unicode_name = to_unicode(name)
>>> type(unicode_name)
<type 'unicode'>
>>> unicode_name
u'Toshio \u304f\u3089\u3068\u307f'
>>> # Do a lot of other things before needing to save/output again:
>>> output = open('datafile', 'w')
>>> output.write(to_bytes(u'Name: %s\n' % unicode_name))
A few notes:
Looking at line 6, you’ll notice that the input we took from the user was a byte str. In general,
anytime we’re getting a value from outside of python (The filesystem, reading data from the network,
interacting with an external command, reading values from the environment) we are interacting with
something that will want to give us a byte str. Some python standard library modules and third party
libraries will automatically attempt to convert a byte str to unicode strings for you. This is both a
boon and a curse. If the library can guess correctly about the encoding that the data is in, it will
return unicode objects to you without you having to convert. However, if it can’t guess correctly, you
may end up with one of several problems:
UnicodeError
The library attempted to decode a byte str into a unicode string, failed, and raised an exception.
Garbled data
If the library returns the data after decoding it with the wrong encoding, the characters you see
in the unicode string won’t be the ones that you expect.
A byte str instead of unicode string
Some libraries will return a unicode string when they’re able to decode the data and a byte str
when they can’t. This is generally the hardest problem to debug when it occurs. Avoid it in your
own code and try to avoid or open bugs against upstreams that do this. See
DesigningUnicodeAwareAPIs for strategies to do this properly.
On line 8, we convert from a byte str to a unicode string. to_unicode() does this for us. It has some
error handling and sane defaults that make this a nicer function to use than calling str.decode()
directly:
• Instead of defaulting to the ASCII encoding which fails with all but the simple American English
characters, it defaults to UTF-8.
• Instead of raising an error if it cannot decode a value, it will replace the value with the unicode
“Replacement character” symbol (�).
• If you happen to call this function with something that is not a str or unicode, it will, by
default, return the object's simple string representation instead of raising an error.
All three of these can be overridden using different keyword arguments to the function. See the
to_unicode() documentation for more information.
On line 15 we push the data back out to a file. Two things you should note here:
1. We deal with the strings as unicode until the last instant. The string format that we’re using is
unicode and the variable also holds unicode. People sometimes get into trouble when they mix a byte
str format with a variable that holds a unicode string (or vice versa) at this stage.
2. to_bytes() does the reverse of to_unicode(). In this case, we're using the default values which turn
unicode into a byte str using UTF-8. Any errors are replaced with a � and nonstring objects are
converted to their simple string representations. Just like to_unicode(), you can look at the
documentation for to_bytes() to find out how to override any of these defaults.
When to use an alternate strategy
The default strategy of decoding to unicode strings when you take data in and encoding to a byte str when
you send the data back out works great for most problems but there are a few times when you shouldn’t:
• The values aren’t meant to be read as text
• The values need to be byte-for-byte when you send them back out – for instance if they are database
keys or filenames.
• You are transferring the data between several libraries that all expect byte str.
In each of these instances, there is a reason to keep around the byte str version of a value. Here are
a few hints to keep your sanity in these situations:
1. Keep your unicode and str values separate. Just like the pain caused when you have to use someone
else’s library that returns both unicode and str you can cause yourself pain if you have functions
that can return both types or variables that could hold either type of value.
2. Name your variables so that you can tell whether you're storing byte str or unicode string. One of
the first things you end up having to do when debugging is determine what type of string you have in a
variable and what type of string you are expecting. Naming your variables consistently so that you
can tell which type they are supposed to hold will save you from at least one of those steps (see the
sketch after this list).
3. When you get values initially, make sure that you’re dealing with the type of value that you expect as
you save it. You can use isinstance() or to_bytes() since to_bytes() doesn’t do any modifications of
the string if it’s already a str. When using to_bytes() for this purpose you might want to use:
try:
    b_input = to_bytes(input_should_be_bytes_already, errors='strict', nonstring='strict')
except (TypeError, UnicodeEncodeError):
    handle_errors_somehow()
The reason is that the default of to_bytes() will take characters that are illegal in the chosen
encoding and transform them to replacement characters. Since the point of keeping this data as a byte
str is to keep the exact same bytes when you send it outside of your code, changing things to
replacement characters should be raising red flags that something is wrong. Setting errors to strict
will raise an exception instead, which gives you an opportunity to fail gracefully.
4. Sometimes you will want to print out the values that you have in your byte str. When you do this you
will need to make sure that you transform unicode to str before combining them. Also be sure that any
other function calls (including gettext) are going to give you strings that are the same type. For
instance:
print to_bytes(_('Username: %(user)s'), 'utf-8') % {'user': b_username}
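As promised in hint #2, here is one hypothetical naming convention, matching the u_ and b_ prefixes
used elsewhere in this document:
from kitchen.text.converters import to_unicode, to_bytes

# Prefix variables so the expected string type is visible at a glance
b_name = raw_input('Enter your name: ')   # bytes arrive from outside of python
u_name = to_unicode(b_name)               # unicode while inside the program
b_output = to_bytes(u_name)               # bytes again at the output boundary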
Gotchas and how to avoid them
Even when you have a good conceptual understanding of how python2 treats unicode and str there are still
some things that can surprise you. In most cases this is because, as noted earlier, python or one of the
python libraries you depend on is trying to convert a value automatically and failing. Explicit
conversion at the appropriate place usually solves that.
str(obj)
One common idiom for getting a simple, string representation of an object is to use:
str(obj)
Unfortunately, this is not safe. Sometimes str(obj) will return unicode. Sometimes it will return a
byte str. Sometimes, it will attempt to convert from a unicode string to a byte str, fail, and throw a
UnicodeError. To be safe from all of these, first decide whether you need unicode or str to be returned.
Then use to_unicode() or to_bytes() to get the simple representation like this:
u_representation = to_unicode(obj, nonstring='simplerepr')
b_representation = to_bytes(obj, nonstring='simplerepr')
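For example, a hypothetical class whose __str__() returns a unicode string with non-ASCII characters
will blow up under str() but works with the simplerepr strategy (a sketch based on the documented
behavior):
>>> from kitchen.text.converters import to_unicode
>>> class Pastry(object):
...     def __str__(self):
...         return u'caf\xe9'
...
>>> str(Pastry())
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
>>> to_unicode(Pastry(), nonstring='simplerepr')
u'caf\xe9'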
print
python has a builtin print() statement that outputs strings to the terminal. This originated in a time
when python only dealt with byte str. When unicode strings came about, some enhancements were made to
the print() statement so that it could print those as well. The enhancements make print() work most of
the time. However, the times when it doesn’t work tend to make for cryptic debugging.
The basic issue is that print() has to figure out what encoding to use when it prints a unicode string to
the terminal. When python is attached to your terminal (ie, you’re running the interpreter or running a
script that prints to the screen) python is able to take the encoding value from your locale settings
LC_ALL or LC_CTYPE and print the characters allowed by that encoding. On most modern Unix systems, the
encoding is utf-8 which means that you can print any unicode character without problem.
There are two common cases of things going wrong:
1. Someone has a locale set that does not accept all valid unicode characters. For instance:
$ LC_ALL=C python
>>> print u'\ufffd'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
This often happens when a script that you’ve written and debugged from the terminal is run from an
automated environment like cron. It also occurs when you have written a script using a utf-8 aware
locale and released it for consumption by people all over the internet. Inevitably, someone is
running with a locale that can’t handle all unicode characters and you get a traceback reported.
2. You redirect output to a file. Python isn't using the values in LC_ALL unconditionally to decide what
encoding to use. Instead it is using the encoding set for the terminal you are printing to, which is
set to accept different encodings by LC_ALL. If you redirect to a file, you are no longer printing to
the terminal so LC_ALL won’t have any effect. At this point, python will decide it can’t find an
encoding and fallback to ASCII which will likely lead to UnicodeError being raised. You can see this
in a short script:
#! /usr/bin/python -tt
print u'\ufffd'
And then look at the difference between running it normally and redirecting to a file:
$ ./test.py
�
$ ./test.py > t
Traceback (most recent call last):
File "test.py", line 3, in <module>
print u'\ufffd'
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
The short answer to dealing with this is to always use bytes when writing output. You can do this by
explicitly converting to bytes like this:
from kitchen.text.converters import to_bytes
u_string = u'\ufffd'
print to_bytes(u_string)
or you can wrap stdout and stderr with a StreamWriter. A StreamWriter is convenient in that you can
assign it to encode for sys.stdout or sys.stderr and then have output automatically converted but it has
the drawback of still being able to throw UnicodeError if the writer can’t encode all possible unicode
codepoints. Kitchen provides an alternate version which can be retrieved with
kitchen.text.converters.getwriter() which will not traceback in its standard configuration.
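A minimal sketch of wrapping sys.stdout this way (it mirrors the getwriter() example later in this
document):
import sys
from kitchen.text.converters import getwriter

# Wrap stdout once at program startup; unicode strings printed afterwards
# are encoded with to_bytes() semantics and won't traceback when redirected
UTF8Writer = getwriter('utf-8')
sys.stdout = UTF8Writer(sys.stdout)
print u'\ufffd'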
Unicode, str, and dict keys
The hash() of the ASCII characters is the same for unicode and byte str. When you use them in dict keys,
they evaluate to the same dictionary slot:
>>> u_string = u'a'
>>> b_string = 'a'
>>> hash(u_string), hash(b_string)
(12416037344, 12416037344)
>>> d = {}
>>> d[u_string] = 'unicode'
>>> d[b_string] = 'bytes'
>>> d
{u'a': 'bytes'}
When you deal with key values outside of ASCII, unicode and byte str evaluate unequally no matter what
their character content or hash value:
>>> u_string = u'ñ'
>>> b_string = u_string.encode('utf-8')
>>> print u_string
ñ
>>> print b_string
ñ
>>> d = {}
>>> d[u_string] = 'unicode'
>>> d[b_string] = 'bytes'
>>> d
{u'\xf1': 'unicode', '\xc3\xb1': 'bytes'}
>>> b_string2 = '\xf1'
>>> hash(u_string), hash(b_string2)
(30848092528, 30848092528)
>>> d = {}
>>> d[u_string] = 'unicode'
>>> d[b_string2] = 'bytes'
>>> d
{u'\xf1': 'unicode', '\xf1': 'bytes'}
How do you work with this one? Remember rule #1: Keep your unicode and byte str values separate. That
goes for keys in a dictionary just like anything else.
• For any given dictionary, make sure that all your keys are either unicode or str. Do not mix the two.
If you’re being given both unicode and str but you don’t need to preserve separate keys for each, I
recommend using to_unicode() or to_bytes() to convert all keys to one type or the other like this:
>>> from kitchen.text.converters import to_unicode
>>> u_string = u'one'
>>> b_string = 'two'
>>> d = {}
>>> d[to_unicode(u_string)] = 1
>>> d[to_unicode(b_string)] = 2
>>> d
{u'two': 2, u'one': 1}
• These issues also apply to using dicts with tuple keys that contain a mixture of unicode and str. Once
again the best fix is to standardise on either str or unicode.
• If you absolutely need to store values in a dictionary where the keys could be either unicode or str
you can use StrictDict which has separate entries for all unicode and byte str and deals correctly with
any tuple containing mixed unicode and byte str.
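A brief interactive sketch, assuming StrictDict is imported from kitchen.collections:
>>> from kitchen.collections import StrictDict
>>> d = StrictDict()
>>> d[u'\xf1'] = 'unicode'
>>> d[u'\xf1'.encode('utf-8')] = 'bytes'
>>> len(d)   # unlike dict, the unicode and byte str keys stay separate
2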
Functions
Unicode and byte str conversion
kitchen.text.converters.to_unicode(obj, encoding='utf-8', errors='replace', nonstring=None,
non_string=None)
Convert an object into a unicode string
Parameters
• obj – Object to convert to a unicode string. This should normally be a byte str
• encoding – What encoding to try converting the byte str as. Defaults to utf-8
• errors – If errors are found while decoding, perform this action. Defaults to replace
which replaces the invalid bytes with a character that means the bytes were unable to be
decoded. Other values are the same as the error handling schemes in the codec base
classes. For instance strict which raises an exception and ignore which simply omits the
non-decodable characters.
• nonstring –
How to treat nonstring values. Possible values are:
simplerepr
Attempt to call the object’s “simple representation” method and return that value.
Python-2.3+ has two methods that try to return a simple representation:
object.__unicode__() and object.__str__(). We first try to get a usable value
from object.__unicode__(). If that fails we try the same with object.__str__().
empty Return an empty unicode string
strict Raise a TypeError
passthru
Return the object unchanged
repr Attempt to return a unicode string of the repr of the object
Default is simplerepr
• non_string – Deprecated Use nonstring instead
Raises
• TypeError – if nonstring is strict and a non-basestring object is passed in or if
nonstring is set to an unknown value
• UnicodeDecodeError – if errors is strict and obj is not decodable using the given
encoding
Returns
unicode string or the original object depending on the value of nonstring.
Usually this should be used on a byte str but it can take both byte str and unicode strings
intelligently. Nonstring objects are handled in different ways depending on the setting of the
nonstring parameter.
The default values of this function are set so as to always return a unicode string and never
raise an error when converting from a byte str to a unicode string. However, when you do not
pass validly encoded text (or a nonstring object), you may end up with output that you don't
expect. Be sure you understand the requirements of your data rather than simply ignoring errors
by passing it through this function.
Changed in version 0.2.1a2: Deprecated non_string in favor of nonstring parameter and changed
default value to simplerepr
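A few illustrative calls showing the defaults described above (the exact replacement behavior assumes
the documented utf-8 and replace defaults):
>>> from kitchen.text.converters import to_unicode
>>> to_unicode('caf\xc3\xa9')                  # valid utf-8 decodes cleanly
u'caf\xe9'
>>> to_unicode('caf\xe9')                      # latin-1 bytes are not valid utf-8
u'caf\ufffd'
>>> to_unicode('caf\xe9', encoding='latin-1')  # specify the right encoding
u'caf\xe9'
>>> to_unicode(5)                              # nonstring: simple representation
u'5'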
kitchen.text.converters.to_bytes(obj, encoding='utf-8', errors='replace', nonstring=None,
non_string=None)
Convert an object into a byte str
Parameters
• obj – Object to convert to a byte str. This should normally be a unicode string.
• encoding – Encoding to use to convert the unicode string into a byte str. Defaults to
utf-8.
• errors –
If errors are found while encoding, perform this action. Defaults to replace which
replaces the invalid bytes with a character that means the bytes were unable to be
encoded. Other values are the same as the error handling schemes in the codec base
classes. For instance strict which raises an exception and ignore which simply omits the
non-encodable characters.
• nonstring –
How to treat nonstring values. Possible values are:
simplerepr
Attempt to call the object’s “simple representation” method and return that value.
Python-2.3+ has two methods that try to return a simple representation:
object.__unicode__() and object.__str__(). We first try to get a usable value
from object.__str__(). If that fails we try the same with object.__unicode__().
empty Return an empty byte str
strict Raise a TypeError
passthru
Return the object unchanged
repr Attempt to return a byte str of the repr() of the object
Default is simplerepr.
• non_string – Deprecated Use nonstring instead.
Raises
• TypeError – if nonstring is strict and a non-basestring object is passed in or if
nonstring is set to an unknown value.
• UnicodeEncodeError – if errors is strict and obj contains characters that cannot be
encoded using encoding.
Returns
byte str or the original object depending on the value of nonstring.
WARNING:
If you pass a byte str into this function the byte str is returned unmodified. It is not
re-encoded with the specified encoding. The easiest way to achieve that is:
to_bytes(to_unicode(text), encoding='utf-8')
The initial to_unicode() call will ensure text is a unicode string. Then, to_bytes() will turn
that into a byte str with the specified encoding.
Usually, this should be used on a unicode string but it can take either a byte str or a unicode
string intelligently. Nonstring objects are handled in different ways depending on the setting of
the nonstring parameter.
The default values of this function are set so as to always return a byte str and never raise an
error when converting from unicode to bytes. However, when you do not pass an encoding that can
validly encode the object (or a nonstring object), you may end up with output that you don't
expect. Be sure you understand the requirements of your data rather than simply ignoring errors
by passing it through this function.
Changed in version 0.2.1a2: Deprecated non_string in favor of nonstring parameter and changed
default value to simplerepr
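A few illustrative calls showing these defaults (assuming the documented utf-8 and replace defaults):
>>> from kitchen.text.converters import to_bytes
>>> to_bytes(u'caf\xe9')                    # default: encode to utf-8
'caf\xc3\xa9'
>>> to_bytes(u'caf\xe9', encoding='ascii')  # unencodable character replaced
'caf?'
>>> to_bytes('caf\xe9')                     # byte str passed through unmodified
'caf\xe9'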
kitchen.text.converters.getwriter(encoding)
Return a codecs.StreamWriter that resists tracing back.
Parameters
encoding – Encoding to use for transforming unicode strings into byte str.
Return type
codecs.StreamWriter
Returns
StreamWriter that you can instantiate to wrap output streams to automatically translate
unicode strings into encoding.
This is a reimplementation of codecs.getwriter() that returns a StreamWriter that resists issuing
tracebacks. The StreamWriter that is returned uses kitchen.text.converters.to_bytes() to convert
unicode strings into byte str. The departures from codecs.getwriter() are:
1. The StreamWriter that is returned will take byte str as well as unicode strings. Any byte str
will be passed through unmodified.
2. The default error handler for unknown bytes is to replace the bytes with the unknown character
(? in most ascii-based encodings, � in the utf encodings) whereas codecs.getwriter() defaults
to strict. Like codecs.StreamWriter, the returned StreamWriter can have its error handler
changed in code by setting stream.errors = 'new_handler_name'
Example usage:
$ LC_ALL=C python
>>> import sys
>>> from kitchen.text.converters import getwriter
>>> UTF8Writer = getwriter('utf-8')
>>> unwrapped_stdout = sys.stdout
>>> sys.stdout = UTF8Writer(unwrapped_stdout)
>>> print 'caf\xc3\xa9'
café
>>> print u'caf\xe9'
café
>>> ASCIIWriter = getwriter('ascii')
>>> sys.stdout = ASCIIWriter(unwrapped_stdout)
>>> print 'caf\xc3\xa9'
café
>>> print u'caf\xe9'
caf?
SEE ALSO:
API docs for codecs.StreamWriter and codecs.getwriter() and Print Fails on the python wiki.
New in version kitchen: 0.2a2, API: kitchen.text 1.1.0
kitchen.text.converters.to_str(obj)
Deprecated
This function converts something to a byte str if it isn’t one. It’s used to call str() or
unicode() on the object to get its simple representation without danger of getting a UnicodeError.
You should be using to_unicode() or to_bytes() explicitly instead.
If you need unicode strings:
to_unicode(obj, nonstring='simplerepr')
If you need byte str:
to_bytes(obj, nonstring='simplerepr')
kitchen.text.converters.to_utf8(obj, errors='replace', non_string='passthru')
Deprecated
Convert unicode to an encoded utf-8 byte str. You should be using to_bytes() instead:
to_bytes(obj, encoding='utf-8', non_string='passthru')
Transformation to XML
kitchen.text.converters.unicode_to_xml(string, encoding='utf-8', attrib=False, control_chars='replace')
Take a unicode string and turn it into a byte str suitable for xml
Parameters
• string – unicode string to encode into an XML compatible byte str
• encoding – encoding to use for the returned byte str. Default is to encode to UTF-8. If
some of the characters in string are not encodable in this encoding, the unknown
characters will be entered into the output string using xml character references.
• attrib – If True, quote the string for use in an xml attribute. If False (default),
quote for use in an xml text field.
• control_chars –
control characters are not allowed in XML documents. When we encounter those we need to
know what to do. Valid options are:
replace
(default) Replace the control characters with ?
ignore Remove the characters altogether from the output
strict Raise an XmlEncodeError when we encounter a control character
Raises
• kitchen.text.exceptions.XmlEncodeError – If control_chars is set to strict and the string
to be made suitable for output to xml contains control characters or if string is not a
unicode string then we raise this exception.
• ValueError – If control_chars is set to something other than replace, ignore, or strict.
Return type
byte str
Returns
representation of the unicode string as a valid XML byte str
XML files consist mainly of text encoded using a particular charset. XML also denies the use of
certain bytes in the encoded text (example: ASCII Null). There are also special characters that
must be escaped if they are present in the input (example: <). This function takes care of all of
those issues for you.
There are a few different ways to use this function depending on your needs. The simplest
invocation is like this:
unicode_to_xml(u'String with non-ASCII characters: <"á と">')
This will return the following to you, encoded in utf-8:
'String with non-ASCII characters: &lt;"á と"&gt;'
Pretty straightforward. Now, what if you need to encode your document in something other than
utf-8? For instance, latin-1? Let's see:
unicode_to_xml(u'String with non-ASCII characters: <"á と">', encoding='latin-1')
'String with non-ASCII characters: &lt;"á &#12392;"&gt;'
Because the と character is not available in the latin-1 charset, it is replaced with &#12392; in
our output. This is an xml character reference which represents the character at unicode
codepoint 12392, the と character.
When you want to reverse this, use xml_to_unicode() which will turn a byte str into a unicode
string and replace the xml character references with the unicode characters.
XML also has the quirk of not allowing control characters in its output. The control_chars
parameter allows us to specify what to do with those. For use cases that don’t need absolute
character by character fidelity (example: holding strings that will just be used for display in a
GUI app later), the default value of replace works well:
unicode_to_xml(u'String with disallowed control chars: \u0000\u0007')
'String with disallowed control chars: ??'
If you do need to be able to reproduce all of the characters at a later date (examples: if the
string is a key value in a database or a path on a filesystem) you have many choices. Here are a
few that rely on utf-7, a verbose encoding that encodes control characters (as well as non-ASCII
unicode values) to characters from within the ASCII printable characters. The good thing about
doing this is that the code is pretty simple. You just need to use utf-7 both when encoding the
field for xml and when decoding it for use in your python program:
unicode_to_xml(u'String with unicode: と and control char: \u0007', encoding='utf7')
'String with unicode: +MGg- and control char: +AAc-'
# [...]
xml_to_unicode('String with unicode: +MGg- and control char: +AAc-', encoding='utf7')
u'String with unicode: と and control char: \u0007'
As you can see, the utf-7 encoding will transform even characters that would be representable in
utf-8. This can be a drawback if you want unicode characters in the file to be readable without
being decoded first. You can work around this with increased complexity in your application code:
encoding = 'utf-8'
u_string = u'String with unicode: と and control char: \u0007'
try:
    # First attempt to encode to utf-8, raising on control characters
    data = unicode_to_xml(u_string, encoding=encoding, control_chars='strict')
except XmlEncodeError:
    # Fallback to utf-7, which can encode the control characters
    encoding = 'utf-7'
    data = unicode_to_xml(u_string, encoding=encoding)
write_tag('<mytag encoding=%s>%s</mytag>' % (encoding, data))
# [...]
encoding = tag.attributes.encoding
u_string = xml_to_unicode(u_string, encoding=encoding)
Using code similar to that, you can have some fields encoded using your default encoding and
fall back to utf-7 if there are control characters present.
NOTE:
If your goal is to preserve the control characters, you cannot simply save the entire file as
utf-7 and set the xml encoding parameter to utf-7. Because XML doesn't allow control
characters, you have to encode those separately from any encoding work that the XML parser
itself knows about.
SEE ALSO:
bytes_to_xml()
if you’re dealing with bytes that are non-text or of an unknown encoding that you must
preserve on a byte for byte level.
guess_encoding_to_xml()
if you’re dealing with strings in unknown encodings that you don’t need to save with
char-for-char fidelity.
kitchen.text.converters.xml_to_unicode(byte_string, encoding='utf-8', errors='replace')
Transform a byte str from an xml file into a unicode string
Parameters
• byte_string – byte str to decode
• encoding – encoding that the byte str is in
• errors – What to do if not every character is valid in encoding. See the to_unicode()
documentation for legal values.
Return type
unicode string
Returns
string decoded from byte_string
This function attempts to reverse what unicode_to_xml() does. It takes a byte str (presumably
read in from an xml file) and expands all the html entities into unicode characters and decodes
the byte str into a unicode string. One thing it cannot do is restore any control characters that
were removed prior to inserting into the file. If you need to keep such characters you need to
use xml_to_bytes() and bytes_to_xml() or use one of the strategies documented in unicode_to_xml()
instead.
kitchen.text.converters.byte_string_to_xml(byte_string, input_encoding='utf-8', errors='replace',
output_encoding='utf-8', attrib=False, control_chars='replace')
Make sure a byte str is validly encoded for xml output
Parameters
• byte_string – Byte str to turn into valid xml output
• input_encoding – Encoding of byte_string. Default utf-8
• errors –
How to handle errors encountered while decoding the byte_string into unicode at the
beginning of the process. Values are:
replace
(default) Replace the invalid bytes with a ?
ignore Remove the characters altogether from the output
strict Raise a UnicodeDecodeError when we encounter a non-decodable character
• output_encoding – Encoding for the xml file that this string will go into. Default is
utf-8. If some of the characters in byte_string are not encodable in this encoding, the
unknown characters will be entered into the output string using xml character references.
• attrib – If True, quote the string for use in an xml attribute. If False (default),
quote for use in an xml text field.
• control_chars –
XML does not allow control characters. When we encounter those we need to know what to
do. Valid options are:
replace
(default) Replace the control characters with ?
ignore Remove the characters altogether from the output
strict Raise an error when we encounter a control character
Raises
• XmlEncodeError – If control_chars is set to strict and the string to be made suitable for
output to xml contains control characters then we raise this exception.
• UnicodeDecodeError – If errors is set to strict and the byte_string contains bytes that
are not decodable using input_encoding, this error is raised
Return type
byte str
Returns
representation of the byte str in the output encoding with any bytes that aren’t available
in xml taken care of.
Use this when you have a byte str representing text that you need to make suitable for output to
xml. There are several cases where this is the case. For instance, if you need to transform some
strings encoded in latin-1 to utf-8 for output:
utf8_string = byte_string_to_xml(latin1_string, input_encoding='latin-1')
If you already have strings in the proper encoding you may still want to use this function to
remove control characters:
cleaned_string = byte_string_to_xml(string, input_encoding='utf-8', output_encoding='utf-8')
SEE ALSO:
unicode_to_xml()
for other ideas on using this function
kitchen.text.converters.xml_to_byte_string(byte_string, input_encoding='utf-8', errors='replace',
output_encoding='utf-8')
Transform a byte str from an xml file into a byte str in a given encoding
Parameters
• byte_string – byte str to decode
• input_encoding – encoding that the byte str is in
• errors – What to do if not every character is valid in encoding. See the to_unicode()
docstring for legal values.
• output_encoding – Encoding for the output byte str
Returns
byte str decoded from byte_string and re-encoded using output_encoding
This function attempts to reverse what unicode_to_xml() does. It takes a byte str (presumably
read in from an xml file), expands all the html entities into unicode characters, decodes the
byte str into a unicode string, and then encodes the result using output_encoding. One thing it
cannot do is restore any control characters that were removed prior to inserting into the file.
If you need to keep such characters you need to use xml_to_bytes() and bytes_to_xml() or use one
of the strategies documented in unicode_to_xml() instead.
kitchen.text.converters.bytes_to_xml(byte_string, *args, **kwargs)
Return a byte str encoded so it is valid inside of any xml file
Parameters
• byte_string – byte str to transform
• *args, **kwargs – extra arguments to this function are passed on to the function
actually implementing the encoding. You can use this to tweak the output in some cases
but, as a general rule, you shouldn't because the underlying encoding function is not
guaranteed to remain the same.
Return type
byte str consisting of all ASCII characters
Returns
byte str representation of the input. This will be encoded using base64.
This function is made especially to put binary information into xml documents.
This function is intended for encoding things that must be preserved byte-for-byte. If you want
to encode a byte string that’s text and don’t mind losing the actual bytes you probably want to
try byte_string_to_xml() or guess_encoding_to_xml() instead.
NOTE:
Although the current implementation uses base64.b64encode() and there are no plans to change it,
that isn’t guaranteed. If you want to make sure that you can encode and decode these messages
it’s best to use xml_to_bytes() if you use this function to encode.
kitchen.text.converters.xml_to_bytes(byte_string, *args, **kwargs)
Decode a string encoded using bytes_to_xml()
Parameters
• byte_string – byte str to transform. This should be a base64 encoded sequence of bytes
originally generated by bytes_to_xml().
• *args, **kwargs – extra arguments to this function are passed on to the function
actually implementing the encoding. You can use this to tweak the output in some cases
but, as a general rule, you shouldn't because the underlying encoding function is not
guaranteed to remain the same.
Return type
byte str
Returns
byte str that’s the decoded input
If you’ve got fields in an xml document that were encoded with bytes_to_xml() then you want to use
this function to undecode them. It converts a base64 encoded string into a byte str.
NOTE:
Although the current implementation uses base64.b64decode() and there are no plans to change it,
that isn’t guaranteed. If you want to make sure that you can encode and decode these messages
it’s best to use bytes_to_xml() if you use this function to decode.
kitchen.text.converters.guess_encoding_to_xml(string, output_encoding='utf-8', attrib=False,
control_chars='replace')
Return a byte str suitable for inclusion in xml
Parameters
• string – unicode or byte str to be transformed into a byte str suitable for inclusion in
xml. If string is a byte str we attempt to guess the encoding. If we cannot guess, we
fall back to latin-1.
• output_encoding – Output encoding for the byte str. This should match the encoding of
your xml file.
• attrib – If True, escape the item for use in an xml attribute. If False (default) escape
the item for use in a text node.
Returns
byte str encoded using output_encoding
kitchen.text.converters.to_xml(string, encoding='utf-8', attrib=False, control_chars='ignore')
Deprecated: Use guess_encoding_to_xml() instead
Working with exception messages
kitchen.text.converters.EXCEPTION_CONVERTERS = (<function <lambda>>, <function <lambda>>)
Tuple of functions to try to use to convert an exception into a string
representation. Its main use is to extract a string (unicode or str) from an exception
object in exception_to_unicode() and exception_to_bytes(). The functions here will try the
exception’s args[0] and the exception itself (roughly equivalent to str(exception)) to
extract the message. This is only a default and can be easily overridden when calling those
functions. There are several reasons you might wish to do that. If you have exceptions
where the best string representing the exception is not returned by the default functions,
you can add another function to extract from a different field:
from kitchen.text.converters import (EXCEPTION_CONVERTERS,
                                     exception_to_unicode)
class MyError(Exception):
    def __init__(self, message):
        self.value = message

c = [lambda e: e.value]
c.extend(EXCEPTION_CONVERTERS)
try:
    raise MyError('An Exception message')
except MyError, e:
    print exception_to_unicode(e, converters=c)
Another reason would be if you're converting to a byte str and you know the str needs to be
in a non-utf-8 encoding. exception_to_bytes() defaults to utf-8 but if you convert into a
byte str explicitly using a converter then you can choose a different encoding:
from kitchen.text.converters import (EXCEPTION_CONVERTERS,
                                     exception_to_bytes, to_bytes)
c = [lambda e: to_bytes(e.args[0], encoding='euc_jp'),
     lambda e: to_bytes(e, encoding='euc_jp')]
c.extend(EXCEPTION_CONVERTERS)
try:
    do_something()
except Exception, e:
    log = open('logfile.euc_jp', 'a')
    log.write('%s\n' % exception_to_bytes(e, converters=c))
    log.close()
Each function in this list should take the exception as its sole argument and return a
string containing the message representing the exception. The functions may return the
message as a byte str, a unicode string, or even an object if you trust the object
to return a decent string representation. The exception_to_unicode() and
exception_to_bytes() functions will make sure to convert the string to the proper type
before returning.
New in version 0.2.2.
kitchen.text.converters.BYTE_EXCEPTION_CONVERTERS = (<function <lambda>>, <function to_bytes>)
Deprecated: Use EXCEPTION_CONVERTERS instead.
Tuple of functions to try to use to convert an exception into a string representation. This tuple
is similar to the one in EXCEPTION_CONVERTERS but it’s used with exception_to_bytes() instead.
Ideally, these functions should do their best to return the data as a byte str but the results
will be run through to_bytes() before being returned.
New in version 0.2.2.
Changed in version 1.0.1: Deprecated as simplifications allow EXCEPTION_CONVERTERS to perform the
same function.
kitchen.text.converters.exception_to_unicode(exc, converters=(<function <lambda>>, <function <lambda>>))
Convert an exception object into a unicode representation
Parameters
• exc – Exception object to convert
• converters – List of functions to use to convert the exception into a string. See
EXCEPTION_CONVERTERS for the default value and an example of adding other converters to
the defaults. The functions in the list are tried one at a time to see if they can
extract a string from the exception. The first one to do so without raising an exception
is used.
Returns
unicode string representation of the exception. The value extracted by the converters will
be converted into unicode before being returned, using the utf-8 encoding. If you know you
need to use an alternate encoding, add a function that does that to the list of functions
in converters.
New in version 0.2.2.
kitchen.text.converters.exception_to_bytes(exc, converters=(<function <lambda>>, <function <lambda>>))
Convert an exception object into a str representation
Parameters
• exc – Exception object to convert
• converters – List of functions to use to convert the exception into a string. See
EXCEPTION_CONVERTERS for the default value and an example of adding other converters to
the defaults. The functions in the list are tried one at a time to see if they can
extract a string from the exception. The first one to do so without raising an exception
is used.
Returns
byte str representation of the exception. The value extracted by the converters will be
converted into str before being returned, using the utf-8 encoding. If you know you need to
use an alternate encoding, add a function that does that to the list of functions in
converters.
New in version 0.2.2.
Changed in version 1.0.1: Code simplification allowed us to switch to using EXCEPTION_CONVERTERS
as the default value of converters.
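A short interactive sketch of converting an exception with the default converters (the byte str
message is assumed to be utf-8 encoded):
>>> from kitchen.text.converters import exception_to_unicode
>>> try:
...     raise ValueError('caf\xc3\xa9 not found')
... except ValueError, e:
...     msg = exception_to_unicode(e)
...
>>> msg
u'caf\xe9 not found'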
Format Text for Display
Functions related to displaying unicode text. Unicode characters don’t all have the same width so we
need helper functions for displaying them.
New in version 0.2: kitchen.display API 1.0.0
kitchen.text.display.textual_width(msg, control_chars='guess', encoding='utf-8', errors='replace')
Get the textual width of a string
Parameters
• msg – unicode string or byte str to get the width of
• control_chars –
specify how to deal with control characters. Possible values are:
guess (default) will take a guess for control character widths. Most codes will return
zero width. backspace, delete, and clear delete return -1. escape currently
returns -1 as well but this is not guaranteed as it’s not always correct
strict will raise kitchen.text.exceptions.ControlCharError if a control character is
encountered
• encoding – If we are given a byte str this is used to decode it into a unicode string.
Any characters that are not decodable in this encoding will get a value dependent on the
errors parameter.
• errors – How to treat errors decoding the byte str to a unicode string. Legal values are
the same as for kitchen.text.converters.to_unicode(). The default value of replace will
cause undecodable byte sequences to have a width of one. ignore will have a width of
zero.
Raises ControlCharError – if msg contains a control character and control_chars is strict.
Returns
Textual width of the msg. This is the amount of space that the string will consume on a
monospace display. It’s measured in the number of cell positions or columns it will take
up on a monospace display. This is not the number of glyphs that are in the string.
NOTE:
This function can be wrong sometimes because Unicode does not specify a strict width value for
all of the code points. In particular, we’ve found that some Tamil characters take up to four
character cells but we return a lesser amount.
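A quick illustration; wide CJK characters occupy two cells each, so character count and textual
width can differ:
>>> from kitchen.text.display import textual_width
>>> textual_width(u'abc')
3
>>> textual_width(u'一二三')
6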
kitchen.text.display.textual_width_chop(msg, chop, encoding='utf-8', errors='replace')
Given a string, return it chopped to a given textual width
Parameters
• msg – unicode string or byte str to chop
• chop – Chop msg if it exceeds this textual width
• encoding – If we are given a byte str, this is used to decode it into a unicode string.
Any characters that are not decodable in this encoding will be assigned a width of one.
• errors – How to treat errors decoding the byte str to unicode. Legal values are the same
as for kitchen.text.converters.to_unicode()
Return type
unicode string
Returns
unicode string of the msg chopped at the given textual width
This is what you want to use instead of %.*s, as it does the “right” thing with regard to UTF-8
sequences, control characters, and characters that take more than one cell position. Eg:
>>> # Wrong: only displays 8 characters because it is operating on bytes
>>> print "%.*s" % (10, 'café ñunru!')
café ñun
>>> # Properly operates on graphemes
>>> '%s' % (textual_width_chop('café ñunru!', 10))
café ñunru
>>> # takes too many columns because the kanji need two cell positions
>>> print '1234567890\n%.*s' % (10, u'一二三四五六七八九十')
1234567890
一二三四五六七八九十
>>> # Properly chops at 10 columns
>>> print '1234567890\n%s' % (textual_width_chop(u'一二三四五六七八九十', 10))
1234567890
一二三四五
kitchen.text.display.textual_width_fill(msg, fill, chop=None, left=True, prefix='', suffix='')
Expand a unicode string to a specified textual width or chop to same
Parameters
• msg – unicode string to format
• fill – pad string until the textual width of the string is this length
• chop – before doing anything else, chop the string to this length. Default: Don’t chop
the string at all
• left – If True (default) left justify the string and put the padding on the right. If
False, pad on the left side.
• prefix – Attach this string before the field we’re filling
• suffix – Append this string to the end of the field we’re filling
Return type
unicode string
Returns
msg formatted to fill the specified width. If no chop is specified, the string could
exceed the fill length when completed. If prefix or suffix are printable characters, the
string could be longer than the fill width.
NOTE:
prefix and suffix should be used for “invisible” characters like highlighting, color changing
escape codes, etc. The fill characters are appended outside of any prefix or suffix elements.
This allows you to only highlight msg inside of the field you’re filling.
WARNING:
msg, prefix, and suffix should all be representable as unicode characters. In particular, any
escape sequences in prefix and suffix need to be convertible to unicode. If you need to use
byte sequences here rather than unicode characters, use byte_string_textual_width_fill()
instead.
This function expands a string to fill a field of a particular textual width. Use it instead of
%*.*s, as it does the “right” thing with regard to UTF-8 sequences, control characters, and
characters that take more than one cell position in a display. Example usage:
>>> msg = u'一二三四五六七八九十'
>>> # Wrong: This uses 10 characters instead of 10 cells:
>>> u":%-*.*s:" % (10, 10, msg[:9])
:一二三四五六七八九 :
>>> # This uses 10 cells like we really want:
>>> u":%s:" % (textual_width_fill(msg[:9], 10, 10))
:一二三四五:
>>> # Wrong: Right aligned in the field, but too many cells
>>> u"%20.10s" % (msg)
          一二三四五六七八九十
>>> # Correct: Right aligned with proper number of cells
>>> u"%s" % (textual_width_fill(msg, 20, 10, left=False))
          一二三四五
>>> # prefix and suffix are assumed to hold terminal escape sequences, eg:
>>> prefix = u'\x1b[7m'  # reverse video
>>> suffix = u'\x1b[0m'  # reset
>>> # Wrong: Adding some escape characters to highlight the line but too many cells
>>> u"%s%20.10s%s" % (prefix, msg, suffix)
u'\x1b[7m          一二三四五六七八九十\x1b[0m'
>>> # Correct highlight of the line
>>> u"%s%s%s" % (prefix, textual_width_fill(msg, 20, 10, left=False), suffix)
u'\x1b[7m          一二三四五\x1b[0m'
>>> # Correct way to not highlight the fill
>>> u"%s" % (textual_width_fill(msg, 20, 10, left=False, prefix=prefix, suffix=suffix))
u'          \x1b[7m一二三四五\x1b[0m'
kitchen.text.display.wrap(text, width=70, initial_indent=u'', subsequent_indent=u'', encoding='utf-8',
errors='replace')
Works like we want textwrap.wrap() to work
Parameters
• text – unicode string or byte str to wrap
• width – textual width at which to wrap. Default: 70
• initial_indent – string to use to indent the first line. Default: do not indent.
• subsequent_indent – string to use to wrap subsequent lines. Default: do not indent
• encoding – Encoding to use if text is a byte str
• errors – error handler to use if text is a byte str and contains some undecodable
characters.
Return type
list of unicode strings
Returns
list of lines that have been text wrapped and indented.
textwrap.wrap() from the python standard library has two drawbacks that this attempts to fix:
1. It does not handle textual width. It only operates on bytes or characters which are both
inadequate (due to multi-byte and double width characters).
2. It malforms lists and blocks.
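For plain ASCII input the result should match textwrap.wrap(); a minimal sketch:
>>> from kitchen.text.display import wrap
>>> wrap(u'one two three four', width=10)
[u'one two', u'three four']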
kitchen.text.display.fill(text, *args, **kwargs)
Works like we want textwrap.fill() to work
Parameters
text – unicode string or byte str to process
Returns
unicode string with each line separated by a newline
SEE ALSO:
kitchen.text.display.wrap()
for other parameters that you can give this command.
This function is a light wrapper around kitchen.text.display.wrap(). Where that function returns
a list of lines, this function returns one string with each line separated by a newline.
kitchen.text.display.byte_string_textual_width_fill(msg, fill, chop=None, left=True, prefix='',
suffix='', encoding='utf-8', errors='replace')
Expand a byte str to a specified textual width or chop to same
Parameters
• msg – byte str encoded in UTF-8 that we want formatted
• fill – pad msg until the textual width is this long
• chop – before doing anything else, chop the string to this length. Default: Don’t chop
the string at all
• left – If True (default) left justify the string and put the padding on the right. If
False, pad on the left side.
• prefix – Attach this byte str before the field we’re filling
• suffix – Append this byte str to the end of the field we’re filling
Return type
byte str
Returns
msg formatted to fill the specified textual width. If no chop is specified, the string
could exceed the fill length when completed. If prefix or suffix are printable characters,
the string could be longer than fill width.
NOTE:
prefix and suffix should be used for “invisible” characters like highlighting, color changing
escape codes, etc. The fill characters are appended outside of any prefix or suffix elements.
This allows you to only highlight msg inside of the field you’re filling.
SEE ALSO:
textual_width_fill()
For example usage. This function has only two differences.
1. it takes byte str for prefix and suffix so you can pass in arbitrary sequences of
bytes, not just unicode characters.
2. it returns a byte str instead of a unicode string.
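A minimal illustration of the simple ASCII case (three cells of text padded with two spaces to
fill five):
>>> from kitchen.text.display import byte_string_textual_width_fill
>>> byte_string_textual_width_fill('abc', 5) == 'abc  '
True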
Internal Data
There are a few internal functions and variables in this module. Code outside of kitchen shouldn’t use
them but people coding on kitchen itself may find them useful.
kitchen.text.display._COMBINING = ((768, 879), (1155, 1161), (1425, 1469), (1471, 1471), (1473, 1474),
(1476, 1477), (1479, 1479), (1536, 1539), (1552, 1562), (1611, 1631), (1648, 1648), (1750, 1764), (1767,
1768), (1770, 1773), (1807, 1807), (1809, 1809), (1840, 1866), (1958, 1968), (2027, 2035), (2070, 2073),
(2075, 2083), (2085, 2087), (2089, 2093), (2137, 2139), (2260, 2273), (2275, 2303), (2305, 2306), (2364,
2364), (2369, 2376), (2381, 2381), (2385, 2388), (2402, 2403), (2433, 2433), (2492, 2492), (2497, 2500),
(2509, 2509), (2530, 2531), (2561, 2562), (2620, 2620), (2625, 2626), (2631, 2632), (2635, 2637), (2672,
2673), (2689, 2690), (2748, 2748), (2753, 2757), (2759, 2760), (2765, 2765), (2786, 2787), (2817, 2817),
(2876, 2876), (2879, 2879), (2881, 2883), (2893, 2893), (2902, 2902), (2946, 2946), (3008, 3008), (3021,
3021), (3134, 3136), (3142, 3144), (3146, 3149), (3157, 3158), (3260, 3260), (3263, 3263), (3270, 3270),
(3276, 3277), (3298, 3299), (3393, 3395), (3405, 3405), (3530, 3530), (3538, 3540), (3542, 3542), (3633,
3633), (3636, 3642), (3655, 3662), (3761, 3761), (3764, 3769), (3771, 3772), (3784, 3789), (3864, 3865),
(3893, 3893), (3895, 3895), (3897, 3897), (3953, 3966), (3968, 3972), (3974, 3975), (3984, 3991), (3993,
4028), (4038, 4038), (4141, 4144), (4146, 4146), (4150, 4151), (4153, 4154), (4184, 4185), (4237, 4237),
(4448, 4607), (4957, 4959), (5906, 5908), (5938, 5940), (5970, 5971), (6002, 6003), (6068, 6069), (6071,
6077), (6086, 6086), (6089, 6099), (6109, 6109), (6155, 6157), (6313, 6313), (6432, 6434), (6439, 6440),
(6450, 6450), (6457, 6459), (6679, 6680), (6752, 6752), (6773, 6780), (6783, 6783), (6832, 6845), (6912,
6915), (6964, 6964), (6966, 6970), (6972, 6972), (6978, 6978), (6980, 6980), (7019, 7027), (7082, 7083),
(7142, 7142), (7154, 7155), (7223, 7223), (7376, 7378), (7380, 7392), (7394, 7400), (7405, 7405), (7412,
7412), (7416, 7417), (7616, 7669), (7675, 7679), (8203, 8207), (8234, 8238), (8288, 8291), (8298, 8303),
(8400, 8432), (11503, 11505), (11647, 11647), (11744, 11775), (12330, 12335), (12441, 12442), (42607,
42607), (42612, 42621), (42654, 42655), (42736, 42737), (43014, 43014), (43019, 43019), (43045, 43046),
(43204, 43204), (43232, 43249), (43307, 43309), (43347, 43347), (43443, 43443), (43456, 43456), (43696,
43696), (43698, 43700), (43703, 43704), (43710, 43711), (43713, 43713), (43766, 43766), (44013, 44013),
(64286, 64286), (65024, 65039), (65056, 65071), (65279, 65279), (65529, 65531), (66045, 66045), (66272,
66272), (66422, 66426), (68097, 68099), (68101, 68102), (68108, 68111), (68152, 68154), (68159, 68159),
(68325, 68326), (69702, 69702), (69759, 69759), (69817, 69818), (69888, 69890), (69939, 69940), (70003,
70003), (70080, 70080), (70090, 70090), (70197, 70198), (70377, 70378), (70460, 70460), (70477, 70477),
(70502, 70508), (70512, 70516), (70722, 70722), (70726, 70726), (70850, 70851), (71103, 71104), (71231,
71231), (71350, 71351), (71467, 71467), (72767, 72767), (92912, 92916), (92976, 92982), (113822, 113822),
(119141, 119145), (119149, 119170), (119173, 119179), (119210, 119213), (119362, 119364), (122880,
122886), (122888, 122904), (122907, 122913), (122915, 122916), (122918, 122922), (125136, 125142),
(125252, 125258), (917505, 917505), (917536, 917631), (917760, 917999))
Internal table, provided by this module to list code points which combine with other characters
and therefore should have no textual width. This is a sorted tuple of non-overlapping intervals.
Each interval is a tuple listing a starting code point and ending code point. Every code point
between the two end points is a combining character.
SEE ALSO:
_generate_combining_table()
for how this table is generated
This table was last regenerated on python-3.6.0-rc1 with unicodedata.unidata_version 9.0.0
kitchen.text.display._generate_combining_table()
Combine Markus Kuhn’s data with unicodedata to make combining char list
Return type
tuple of tuples
Returns
tuple of intervals of code points that are combining character. Each interval is a 2-tuple
of the starting code point and the ending code point for the combining characters.
In normal use, this function serves to tell how we’re generating the combining char list. For
speed reasons, we use this to generate a static list and just use that later.
Markus Kuhn’s list of combining characters is more complete than what’s in the python unicodedata
library, but the python unicodedata is synced against later versions of the unicode database.
This is used to generate the _COMBINING table.
kitchen.text.display._print_combining_table()
Print out a new _COMBINING table
This will print a new _COMBINING table in the format used in kitchen/text/display.py. It’s useful
for updating the _COMBINING table with updated data from a new python as the format won’t change
from what’s already in the file.
kitchen.text.display._interval_bisearch(value, table)
Binary search in an interval table.
Parameters
• value – numeric value to search for
• table – Ordered list of intervals. This is a list of two-tuples. The elements of the
two-tuple define an interval’s start and end points.
Returns
If value is found within an interval in the table return True. Otherwise, False
This function checks whether a numeric value is present within a table of intervals. It checks
using a binary search algorithm, dividing the list of values in half and checking against the
values until it determines whether the value is in the table.
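For example, probing the _COMBINING table defined above (an internal API, so subject to change):
>>> from kitchen.text.display import _interval_bisearch, _COMBINING
>>> _interval_bisearch(0x301, _COMBINING)  # U+0301 COMBINING ACUTE ACCENT
True
>>> _interval_bisearch(0x41, _COMBINING)   # U+0041 LATIN CAPITAL LETTER A
False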
kitchen.text.display._ucp_width(ucs, control_chars='guess')
Get the textual width of a ucs character
Parameters
• ucs – integer representing a single unicode code point
• control_chars –
specify how to deal with control characters. Possible values are:
guess (default) will take a guess for control character widths. Most codes will return
zero width. backspace, delete, and clear delete return -1. escape currently
returns -1 as well but this is not guaranteed as it’s not always correct
strict will raise ControlCharError if a control character is encountered
Raises ControlCharError – if the code point is a unicode control character and control_chars is
set to ‘strict’
Returns
textual width of the character.
NOTE:
It’s important to remember this is textual width and not the number of characters or bytes.
kitchen.text.display._textual_width_le(width, *args)
Optimize the common case when deciding which textual width is larger
Parameters
• width – textual width to compare against.
• *args – unicode strings to check the total textual width of
Returns
True if the total textual width of args is less than or equal to width. Otherwise False.
We often want to know “does X fit in Y”. It takes a while to use textual_width() to calculate
this. However, we know that every canonically composed unicode character has a textual width of
either 1 or 2 cells. With this we can take the following shortcuts:
1. If the number of canonically composed characters is more than width, the true textual width
cannot be less than width.
2. If the number of canonically composed characters * 2 is less than the width then the textual
width must be ok.
The textual width of a canonically composed unicode string will always be between one and two
times the number of unicode characters. So we can first check whether twice the number of
composed unicode characters is less than or equal to the asked-for width. If it is, we can
return True immediately. If not, we must do a full textual width lookup.
Miscellaneous functions for manipulating text
Collection of text functions that don’t fit in another category.
Changed in version 1.2.0 (API: kitchen.text 2.2.0): Added isbasestring(), isbytestring(), and
isunicodestring() to help tell which string type is which on python2 and python3.
kitchen.text.misc.byte_string_valid_encoding(byte_string, encoding='utf-8')
Detect if a byte str is valid in a specific encoding
Parameters
• byte_string – Byte str to test for bytes not valid in this encoding
• encoding – encoding to test against. Defaults to UTF-8.
Returns
True if the byte str contains no byte sequences invalid in the given encoding. False if an
invalid sequence is detected.
NOTE:
This function checks whether the byte str is valid in the specified encoding. It does not
detect whether the byte str actually was encoded in that encoding. If you want that sort of
functionality, you probably want to use guess_encoding() instead.
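For example (the latin-1 encoding of é is a lone 0xe9 byte, which is not a valid UTF-8 sequence):
>>> from kitchen.text.misc import byte_string_valid_encoding
>>> byte_string_valid_encoding('abc')
True
>>> byte_string_valid_encoding(u'café'.encode('latin-1'))
False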
kitchen.text.misc.byte_string_valid_xml(byte_string, encoding='utf-8')
Check that a byte str would be valid in xml
Parameters
• byte_string – Byte str to check
• encoding – Encoding of the xml file. Default: UTF-8
Returns
True if the string is valid. False if it would be invalid in the xml file
In some cases you’ll have a whole bunch of byte strings and rather than transforming them to
unicode and back to byte str for output to xml, you will just want to make sure they work with the
xml file you’re constructing. This function will help you do that. Example:
ARRAY_OF_MOSTLY_UTF8_STRINGS = [...]
processed_array = []
for string in ARRAY_OF_MOSTLY_UTF8_STRINGS:
    if byte_string_valid_xml(string, 'utf-8'):
        processed_array.append(string)
    else:
        processed_array.append(guess_bytes_to_xml(string, encoding='utf-8'))
output_xml(processed_array)
kitchen.text.misc.guess_encoding(byte_string, disable_chardet=False)
Try to guess the encoding of a byte str
Parameters
• byte_string – byte str to guess the encoding of
• disable_chardet – If this is True, we never attempt to use chardet to guess the encoding.
This is useful if you need to have reproducibility whether chardet is installed or not.
Default: False.
Raises TypeError – if byte_string is not a byte str type
Returns
string containing a guess at the encoding of byte_string. This is appropriate to pass as
the encoding argument when encoding and decoding unicode strings.
We start by attempting to decode the byte str as UTF-8. If this succeeds we tell the world it’s
UTF-8 text. If it doesn’t and chardet is installed on the system and disable_chardet is False
this function will use it to try detecting the encoding of byte_string. If it is not installed or
chardet cannot determine the encoding with a high enough confidence then we rather arbitrarily
claim that it is latin-1. Since latin-1 maps every possible byte to a character, decoding from
latin-1 to unicode will not cause UnicodeError exceptions although the output might be mangled.
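A short illustration; the exact encoding names shown are an assumption based on the description
above:
>>> from kitchen.text.misc import guess_encoding
>>> guess_encoding(u'café'.encode('utf-8'))
'utf-8'
>>> guess_encoding(u'café'.encode('latin-1'), disable_chardet=True)
'latin-1'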
kitchen.text.misc.html_entities_unescape(string)
Substitute unicode characters for HTML entities
Parameters
string – unicode string to substitute out html entities
Raises TypeError – if something other than a unicode string is given
Return type
unicode string
Returns
The plain text without html entities
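For instance (a sketch of the expected behaviour):
>>> from kitchen.text.misc import html_entities_unescape
>>> html_entities_unescape(u'caf&eacute; &lt;b&gt;')
u'caf\xe9 <b>'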
kitchen.text.misc.isbasestring(obj)
Determine if obj is a byte str or unicode string
In python2 this is equivalent to isinstance(obj, basestring). In python3 it checks whether the
object is an instance of str, bytes, or bytearray. This is an aid to porting code that needed to
test whether an object was derived from basestring in python2 (commonly used in unicode-bytes
conversion functions)
Parameters
obj – Object to test
Returns
True if the object is a basestring. Otherwise False.
New in version 1.2.0 (API: kitchen.text 2.2.0).
kitchen.text.misc.isbytestring(obj)
Determine if obj is a byte str
In python2 this is equivalent to isinstance(obj, str). In python3 it checks whether the object is
an instance of bytes or bytearray.
Parameters
obj – Object to test
Returns
True if the object is a byte str. Otherwise, False.
New in version 1.2.0 (API: kitchen.text 2.2.0).
kitchen.text.misc.isunicodestring(obj)
Determine if obj is a unicode string
In python2 this is equivalent to isinstance(obj, unicode). In python3 it checks whether the
object is an instance of str.
Parameters
obj – Object to test
Returns
True if the object is a unicode string. Otherwise, False.
New in version 1.2.0 (API: kitchen.text 2.2.0).
kitchen.text.misc.process_control_chars(string, strategy='replace')
Look for and transform control characters in a string
Parameters
• string – string to search for and transform control characters within
• strategy –
XML does not allow ASCII control characters. When we encounter those we need to know
what to do. Valid options are:
replace (default) Replace the control characters with "?"
ignore Remove the characters altogether from the output
strict Raise a ControlCharError when we encounter a control character
Raises
• TypeError – if string is not a unicode string.
• ValueError – if the strategy is not one of replace, ignore, or strict.
• kitchen.text.exceptions.ControlCharError – if the strategy is strict and a control
character is present in the string
Returns
unicode string with no control characters in it.
Changed in version 1.2.0 (API: kitchen.text 2.2.0): Strip out the C1 control characters in
addition to the C0 control characters.
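For example:
>>> from kitchen.text.misc import process_control_chars
>>> process_control_chars(u'one\x02two')
u'one?two'
>>> process_control_chars(u'one\x02two', strategy='ignore')
u'onetwo'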
kitchen.text.misc.str_eq(str1, str2, encoding='utf-8', errors='replace')
Compare two strings, converting to byte str if one is unicode
Parameters
• str1 – First string to compare
• str2 – Second string to compare
• encoding – If we need to convert one string into a byte str to compare, the encoding to
use. Default is utf-8.
• errors – What to do if we encounter errors when encoding the string. See the
kitchen.text.converters.to_bytes() documentation for possible values. The default is
replace.
This function prevents UnicodeError (python-2.4 or less) and UnicodeWarning (python 2.5 and
higher) when we compare a unicode string to a byte str. The errors normally arise because the
conversion is done to ASCII. This function lets you convert to utf-8 or another encoding instead.
NOTE:
When we need to convert one of the strings from unicode in order to compare them we convert the
unicode string into a byte str. That means that strings can compare differently if you use
different encodings for each.
Note that str1 == str2 is faster than this function if you can accept the following limitations:
• Limited to python-2.5+ (otherwise a UnicodeDecodeError may be thrown)
• Will generate a UnicodeWarning if non-ASCII byte str is compared to unicode string.
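A short example using the default utf-8 encoding:
>>> from kitchen.text.misc import str_eq
>>> str_eq(u'café', u'café'.encode('utf-8'))
True
>>> str_eq(u'café', u'café'.encode('latin-1'))
False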
UTF-8
Functions for operating on byte str encoded as UTF-8
NOTE:
In many cases, it is better to convert to unicode, operate on the strings, then convert back to UTF-8.
The unicode type can handle many of these operations itself. For those that it doesn’t (removing
control characters from length calculations, for instance) the code to do so with a unicode type is
often simpler.
WARNING:
All of the functions in this module are deprecated. Most of them have been replaced with functions
that operate on unicode values in kitchen.text.display. kitchen.text.utf8.utf8_valid() has been
replaced with a function in kitchen.text.misc.
kitchen.text.utf8.utf8_text_fill(text, *args, **kwargs)
Deprecated Similar to textwrap.fill() but understands utf-8 strings and doesn’t screw up
lists/blocks/etc.
Use kitchen.text.display.fill() instead.
kitchen.text.utf8.utf8_text_wrap(text, width=70, initial_indent='', subsequent_indent='')
Deprecated Similar to textwrap.wrap() but understands utf-8 data and doesn’t screw up
lists/blocks/etc
Use kitchen.text.display.wrap() instead
kitchen.text.utf8.utf8_valid(msg)
Deprecated Detect if a string is valid utf-8
Use kitchen.text.misc.byte_string_valid_encoding() instead.
kitchen.text.utf8.utf8_width(msg)
Deprecated Get the textual width of a utf-8 string
Use kitchen.text.display.textual_width() instead.
kitchen.text.utf8.utf8_width_chop(msg, chop=None)
Deprecated Return a string chopped to a given textual width
Use textual_width_chop() and textual_width() instead:
>>> msg = 'く ku ら ra と to み mi'
>>> # Old way:
>>> utf8_width_chop(msg, 5)
(5, 'く ku')
>>> # New way
>>> from kitchen.text.converters import to_bytes
>>> from kitchen.text.display import textual_width, textual_width_chop
>>> (textual_width(msg), to_bytes(textual_width_chop(msg, 5)))
(5, 'く ku')
kitchen.text.utf8.utf8_width_fill(msg, fill, chop=None, left=True, prefix='', suffix='')
Deprecated Pad a utf-8 string to fill a specified width
Use byte_string_textual_width_fill() instead
converters
deals with converting text for different encodings and to and from XML
display
deals with issues with printing text to a screen
misc is a catchall for text manipulation functions that don’t seem to fit elsewhere
utf8 contains deprecated functions to manipulate utf8 byte strings
Kitchen.collections
StrictDict
kitchen.collections.StrictDict provides a dictionary that treats str and unicode as distinct key values.
class kitchen.collections.strictdict.StrictDict
Map class that considers unicode and str different keys
Ordinarily when you are dealing with a dict keyed on strings you want to have keys that have the
same characters end up in the same bucket even if one key is unicode and the other is a byte str.
The normal dict type does this for ASCII characters (but not for anything outside of the ASCII
range.)
Sometimes, however, you want to keep the two string classes strictly separate, for instance, if
you’re creating a single table that can map from unicode characters to str characters and vice
versa. This class will help you do that by making all unicode keys evaluate to a different key
than all str keys.
SEE ALSO:
dict for documentation on this class’s methods. This class implements all the standard dict
methods. Its treatment of unicode and str keys as separate is the only difference.
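A minimal sketch, assuming StrictDict is importable directly from kitchen.collections as the text
above implies:
>>> from kitchen.collections import StrictDict
>>> d = StrictDict()
>>> d[u'a'] = 'set with a unicode key'
>>> d['a'] = 'set with a byte str key'
>>> len(d)
2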
Kitchen.iterutils Module
Functions to manipulate iterables
New in version 0.2.1a1.
Module author: Toshio Kuratomi <toshio@fedoraproject.org>
Module author: Luke Macken <lmacken@redhat.com>
kitchen.iterutils.isiterable(obj, include_string=False)
Check whether an object is an iterable
Parameters
• obj – Object to test whether it is an iterable
• include_string – If True and obj is a byte str or unicode string this function will
return True. If set to False, byte str and unicode strings will cause this function to
return False. Default False.
Returns
True if obj is iterable, otherwise False.
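For example:
>>> from kitchen.iterutils import isiterable
>>> isiterable([1, 2, 3])
True
>>> isiterable('abc')
False
>>> isiterable('abc', include_string=True)
True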
kitchen.iterutils.iterate(obj, include_string=False)
Generator that can be used to iterate over anything
Parameters
• obj – The object to iterate over
• include_string – if True, treat strings as iterables. Otherwise treat them as a single
scalar value. Default False
This function will create an iterator out of any scalar or iterable. It is useful for making a
value given to you an iterable before operating on it. Iterables have their items returned.
Scalars are transformed into iterables. A string is treated as a scalar value unless the
include_string parameter is set to True. Example usage:
>>> list(iterate(None))
[None]
>>> list(iterate([None]))
[None]
>>> list(iterate([1, 2, 3]))
[1, 2, 3]
>>> list(iterate(set([1, 2, 3])))
[1, 2, 3]
>>> list(iterate(dict(a='1', b='2')))
['a', 'b']
>>> list(iterate(1))
[1]
>>> list(iterate(iter([1, 2, 3])))
[1, 2, 3]
>>> list(iterate('abc'))
['abc']
>>> list(iterate('abc', include_string=True))
['a', 'b', 'c']
Helpers for versioning software
PEP-386 compliant versioning
PEP 386 defines a standard format for version strings. This module contains a function for creating
strings in that format.
kitchen.versioning.version_tuple_to_string(version_info)
Return a PEP 386 version string from a PEP 386 style version tuple
Parameters
version_info – Nested set of tuples that describes the version. See below for an example.
Returns
a version string
This function implements just enough of PEP 386 to satisfy our needs. PEP 386 defines a standard
format for version strings and refers to a function that will be merged into the python standard
library that transforms a tuple of version information into a standard version string. This
function is an implementation of that function. Once that function becomes available in the
python standard library we will start using it and deprecate this function.
version_info takes the form that PEP 386’s NormalizedVersion.from_parts() uses:
((Major, Minor, [Micros]), [(Alpha/Beta/rc marker, version)],
[(post/dev marker, version)])
Ex: ((1, 0, 0), ('a', 2), ('dev', 3456))
It generates a PEP 386 compliant version string:
N.N[.N]+[{a|b|c|rc}N[.N]+][.postN][.devN]
Ex: 1.0.0a2.dev3456
WARNING:
This function does next to no error checking. It’s up to the person defining the version tuple
to make sure that the values make sense. If the PEP 386 compliant version parser doesn’t get
released soon we’ll look at making this function check that the version tuple makes sense
before transforming it into a string.
It’s recommended that you use this function to keep a __version_info__ tuple and __version__
string in your modules. Why do we need both a tuple and a string? The string is often useful for
putting into human readable locations like release announcements, version strings in tarballs,
etc. Meanwhile the tuple is very easy for a computer to compare. For example, kitchen sets up its
version information like this:
from kitchen.versioning import version_tuple_to_string
__version_info__ = ((0, 2, 1),)
__version__ = version_tuple_to_string(__version_info__)
Other programs that depend on a kitchen version between 0.2.1 and 0.3.0 can find whether the
present version is okay with code like this:
from kitchen import __version_info__, __version__
if __version_info__ < ((0, 2, 1),) or __version_info__ >= ((0, 3, 0),):
    print 'kitchen is present but not at the right version.'
    print 'We need at least version 0.2.1 and less than 0.3.0'
    print 'Currently found: kitchen-%s' % __version__
Exceptions
Kitchen has a hierarchy of exceptions that should make it easy to catch many errors emitted by kitchen
itself.
Base kitchen exceptions
Exception classes for kitchen and the root of the exception hierarchy for all kitchen modules.
exception kitchen.exceptions.KitchenError
Base exception class for any error thrown directly by kitchen.
Kitchen.text exceptions
Exception classes thrown by kitchen’s text processing routines.
exception kitchen.text.exceptions.XmlEncodeError
Exception thrown by error conditions when encoding an xml string.
exception kitchen.text.exceptions.ControlCharError
Exception thrown when an ascii control character is encountered.
1.0.0 Porting Guide
The 0.1 through 1.0.0 releases focused on bringing in functions from yum and python-fedora. This porting
guide tells how to port from those APIs to their kitchen replacements.
python-fedora
┌───────────────────────────────┬──────────────────────────────────────┐
│ python-fedora │ kitchen replacement │
├───────────────────────────────┼──────────────────────────────────────┤
│ fedora.iterutils.isiterable() │ kitchen.iterutils.isiterable() [1] │
├───────────────────────────────┼──────────────────────────────────────┤
│ fedora.textutils.to_unicode() │ kitchen.text.converters.to_unicode() │
├───────────────────────────────┼──────────────────────────────────────┤
│ fedora.textutils.to_bytes() │ kitchen.text.converters.to_bytes() │
└───────────────────────────────┴──────────────────────────────────────┘
[1] isiterable() has changed slightly in kitchen. The include_string attribute has switched its default
value from True to False. So you need to change code like:
>>> # Old code
>>> isiterable('abcdef')
True
>>> # New code
>>> isiterable('abcdef', include_string=True)
True
yum
┌─────────────────────────────┬────────────────────────────────────────────────┐
│ yum │ kitchen replacement │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.dummy_wrapper() │ kitchen.i18n.DummyTranslations.ugettext() │
│ │ [2] │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.dummyP_wrapper()   │ kitchen.i18n.DummyTranslations.ungettext()     │
│ │ [2] │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.utf8_width() │ kitchen.text.display.textual_width() │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.utf8_width_chop() │ kitchen.text.display.textual_width_chop() │
│ │ and kitchen.text.display.textual_width() │
│ │ [3] [5] │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.utf8_valid() │ kitchen.text.misc.byte_string_valid_encoding() │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.utf8_text_wrap() │ kitchen.text.display.wrap() [4] │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.utf8_text_fill() │ kitchen.text.display.fill() [4] │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.to_unicode() │ kitchen.text.converters.to_unicode() [6] │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.to_unicode_maybe() │ kitchen.text.converters.to_unicode() [6] │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.to_utf8() │ kitchen.text.converters.to_bytes() [6] │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.to_str() │ kitchen.text.converters.to_unicode() or │
│ │ kitchen.text.converters.to_bytes() [7] │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.str_eq() │ kitchen.text.misc.str_eq() │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.misc.to_xml() │ kitchen.text.converters.unicode_to_xml() or │
│ │ kitchen.text.converters.byte_string_to_xml() │
│ │ [8] │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n._() │ See: Initializing Yum i18n │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.P_() │ See: Initializing Yum i18n │
├─────────────────────────────┼────────────────────────────────────────────────┤
│ yum.i18n.exception2msg() │ kitchen.text.converters.exception_to_unicode() │
│ │ or kitchen.text.converter.exception_to_bytes() │
│ │ [9] │
└─────────────────────────────┴────────────────────────────────────────────────┘
[2] These yum methods provided fallback support for gettext functions in case either gaftonmode was set
or gettext failed to return an object. In kitchen, we can use the kitchen.i18n.DummyTranslations
object to fulfill that role. Please see Initializing Yum i18n for more suggestions on how to do
this.
[3] The yum version of these functions returned a byte str. The kitchen version listed here returns a
unicode string. If you need a byte str simply call kitchen.text.converters.to_bytes() on the
result.
[4] The yum version of these functions would return either a byte str or a unicode string depending on
what the input value was. The kitchen version always returns unicode strings.
[5] yum.i18n.utf8_width_chop() performed two functions. It returned the piece of the message that fit
in a specified width and the width of that message. In kitchen, you need to call two functions, one
for each action:
>>> # Old way
>>> utf8_width_chop(msg, 5)
(5, 'く ku')
>>> # New way
>>> from kitchen.text.display import textual_width, textual_width_chop
>>> (textual_width(msg), textual_width_chop(msg, 5))
(5, u'く ku')
[6] If the yum version of to_unicode() or to_utf8() is given an object that is not a string, it returns
the object itself. kitchen.text.converters.to_unicode() and kitchen.text.converters.to_bytes()
default to returning the simplerepr of the object instead. If you want the yum behaviour, set the
nonstring parameter to passthru:
>>> from kitchen.text.converters import to_unicode
>>> to_unicode(5)
u'5'
>>> to_unicode(5, nonstring='passthru')
5
[7] yum.i18n.to_str() could return either a byte str or a unicode string. In kitchen you can get the
same effect but you get to choose whether you want a byte str or a unicode string. Use to_bytes()
for str and to_unicode() for unicode.
[8] yum.misc.to_xml() was buggy as written. I think the intention was for you to be able to pass a byte
str or unicode string in and get out a byte str that was valid to use in an xml file. The two
kitchen functions byte_string_to_xml() and unicode_to_xml() do that for each string type.
[9] When porting yum.i18n.exception2msg() to use kitchen, you should setup two wrapper functions to aid
in your port. They’ll look like this:
from kitchen.text.converters import EXCEPTION_CONVERTERS, \
    BYTE_EXCEPTION_CONVERTERS, exception_to_unicode, \
    exception_to_bytes
def exception2umsg(e):
    '''Return a unicode representation of an exception'''
    c = [lambda e: e.value]
    c.extend(EXCEPTION_CONVERTERS)
    return exception_to_unicode(e, converters=c)
def exception2bmsg(e):
    '''Return a utf8 encoded str representation of an exception'''
    c = [lambda e: e.value]
    c.extend(BYTE_EXCEPTION_CONVERTERS)
    return exception_to_bytes(e, converters=c)
The reason to define this wrapper is that many of the exceptions in yum put the message in the value
attribute of the Exception instead of adding it to the args attribute. So the default
EXCEPTION_CONVERTERS don’t know where to find the message. The wrapper tells kitchen to check the value
attribute for the message. The reason to define two wrappers may be less obvious.
yum.i18n.exception2msg() can return a unicode string or a byte str depending on a combination of what
attributes are present on the Exception and what locale the function is being run in. By contrast,
kitchen.text.converters.exception_to_unicode() only returns unicode strings and
kitchen.text.converters.exception_to_bytes() only returns byte str. This is much safer as it keeps code
that can only handle unicode or only handle byte str correctly from getting the wrong type when an input
changes but it means you need to examine the calling code when porting from yum.i18n.exception2msg() and
use the appropriate wrapper.
Initializing Yum i18n
Previously, yum had several pieces of code to initialize i18n. From the toplevel of yum/i18n.py:
try:
    '''
    Setup the yum translation domain and make _() and P_() translation wrappers
    available.
    using ugettext to make sure translated strings are in Unicode.
    '''
    import gettext
    t = gettext.translation('yum', fallback=True)
    _ = t.ugettext
    P_ = t.ungettext
except:
    '''
    Something went wrong so we make a dummy _() wrapper that just
    returns the same text
    '''
    _ = dummy_wrapper
    P_ = dummyP_wrapper
With kitchen, this can be changed to this:
from kitchen.i18n import easy_gettext_setup, DummyTranslations
try:
    _, P_ = easy_gettext_setup('yum')
except:
    translations = DummyTranslations()
    _ = translations.ugettext
    P_ = translations.ungettext
NOTE:
In overcoming-frustration, it is mentioned that for some things (like exception messages), using the
byte str oriented functions is more appropriate. If this is desired, the only additional setup
needed is a second call to kitchen.i18n.easy_gettext_setup():
b_, bP_ = easy_gettext_setup('yum', use_unicode=False)
The second place where i18n is setup is in yum.YumBase._getConfig() in yum/__init_.py if gaftonmode is in
effect:
if startupconf.gaftonmode:
    global _
    _ = yum.i18n.dummy_wrapper
This can be changed to:
if startupconf.gaftonmode:
    global _
    _ = DummyTranslations().ugettext
Conventions for contributing to kitchen
Style
• Strive to be PEP 8 compliant
• Run pylint over the code and try to resolve most of its nitpicking
Python 2.4 compatibility
At the moment, we’re supporting python-2.4 and above. Understand that there’s a lot of python features
that we cannot use because of this.
Sometimes modules in the python standard library can be added to kitchen so that they’re available. When
we do that we need to be careful of several things:
1. Keep the module in sync with the version in the python-2.x trunk. Use
maintainers/sync-copied-files.py for this.
2. Sync the unittests as well as the module.
3. Be aware that not all modules are written to remain compatible with Python-2.4 and might use python
language features that were not present then (generator expressions, relative imports, decorators,
with, try: with both except: and finally:, etc) These are not good candidates for importing into
kitchen as they require more work to keep synced.
Unittests
• At least smoketest your code (make sure a function will return expected values for one set of inputs).
• Note that even 100% coverage is not a guarantee of working code! Good tests will realize that you need
to also give multiple inputs that test the code paths of called functions that are outside of your
code. Example:
def to_unicode(msg, encoding='utf8', errors='replace'):
    return unicode(msg, encoding, errors)
# Smoketest only. This will give 100% coverage for your code (it
# tests all of the code inside of to_unicode) but it leaves a lot of
# room for errors as it doesn't test all combinations of arguments
# that are then passed to the unicode() function.
tools.ok_(to_unicode('abc') == u'abc')
# Better -- tests now cover non-ascii characters and that error conditions
# occur properly. There's a lot of other permutations that can be
# added along these same lines.
tools.ok_(to_unicode('café', 'utf8', 'replace') == u'café')
tools.assert_raises(UnicodeError, to_unicode, u'cafè ñunru'.encode('latin1'), 'ascii', 'strict')
• We’re using nose for unittesting. Rather than depend on unittest2 functionality, use the functions
that nose provides.
• Remember to maintain python-2.4 compatibility even in unittests.
Docstrings and documentation
We use sphinx to build our documentation. We use the sphinx autodoc extension to pull docstrings out of
the modules for API documentation. This means that docstrings for subpackages and modules should follow
a certain pattern. The general structure is:
• Introductory material about a module in the module’s top level docstring.
• Introductory material should begin with a level two title: an overbar and underbar of ‘-‘.
• docstrings for every function.
• The first line is a short summary of what the function does
• This is followed by a blank line
• The next lines are a field list (see http://sphinx.pocoo.org/markup/desc.html#info-field-lists)
giving information about the function’s signature. We use the keywords: arg, kwarg, raises,
returns, and sometimes rtype. Use these to describe all arguments, keyword arguments, exceptions
raised, and return values.
• Parameters that are kwarg should specify what their default behaviour is.
Kitchen versioning
Currently the kitchen library is in early stages of development. While we’re in this state, the main
kitchen library uses the following pattern for version information:
• Versions look like this:
  __version_info__ = ((0, 1, 2),)
  __version__ = '0.1.2'
• The Major version number remains at 0 until we decide to make the first 1.0 release of kitchen. At
that point, we’re declaring that we have some confidence that we won’t need to break backwards
compatibility for a while.
• The Minor version increments for any backwards incompatible API changes. When this is updated, we
reset micro to zero.
• The Micro version increments for any other changes (backwards compatible API changes, pure bugfixes,
etc).
NOTE:
Versioning is only updated for releases that generate sdists and new uploads to the download
directory. Usually we update the version information for the library just before release. By
contrast, we update kitchen Versioning when an API change is made. When in doubt, look at the version
information in the last release.
I18N
All strings that are used as feedback for users need to be translated. kitchen sets up several functions
for this. _() is used for marking things that are shown to users via print, GUIs, or other “standard”
methods. Strings for exceptions are marked with b_(). This function returns a byte str which is needed
for use with exceptions:
from kitchen import _, b_
def print_message(msg, username):
    print _('%(user)s, your message of the day is: %(message)s') % {
        'message': msg, 'user': username}
    raise Exception(b_('Test message'))
This serves several purposes:
• It marks the strings to be extracted by an xgettext-like program.
• _() is a function that will substitute available translations at runtime.
NOTE:
By using the %()s with dict style of string formatting, we make this string friendly to translators
that may need to reorder the variables when they’re translating the string.
paver (http://www.blueskyonmars.com/projects/paver/) and babel (http://babel.edgewall.org/) are used
to extract the strings.
API updates
Kitchen strives to have a long deprecation cycle so that people have time to switch away from any APIs
that we decide to discard. Discarded APIs should raise a DeprecationWarning and clearly state in the
warning message and the docstring how to convert old code to use the new interface. An example of
deprecating a function:
import warnings
from kitchen import _
from kitchen.text.converters import to_bytes, to_unicode
from kitchen.text.new_module import new_function
def old_function(param):
    '''**Deprecated**
    This function is deprecated. Use
    :func:`kitchen.text.new_module.new_function` instead. If you want
    unicode strings as output, switch to::
        >>> from kitchen.text.new_module import new_function
        >>> output = new_function(param)
    If you want byte strings, use::
        >>> from kitchen.text.new_module import new_function
        >>> from kitchen.text.converters import to_bytes
        >>> output = to_bytes(new_function(param))
    '''
    warnings.warn(_('kitchen.text.old_function is deprecated. Use'
                    ' kitchen.text.new_module.new_function instead'),
                  DeprecationWarning, stacklevel=2)
    as_unicode = isinstance(param, unicode)
    message = new_function(to_unicode(param))
    if not as_unicode:
        message = to_bytes(message)
    return message
If a particular API change is very intrusive, it may be better to create a new version of the subpackage
and ship both the old version and the new version.
NEWS file
Update the NEWS file when you make a change that will be visible to the users. This is not a ChangeLog
file so we don’t need to list absolutely everything but it should give the user an idea of how this
version differs from prior versions. API changes should be listed here explicitly. bugfixes can be more
general:
-----
0.2.0
-----
* Relicense to LGPLv2+
* Add kitchen.text.format module with the following functions:
textual_width, textual_width_chop.
* Rename the kitchen.text.utils module to kitchen.text.misc. use of the
old names is deprecated but still available.
* bugfixes applied to kitchen.pycompat24.defaultdict that fixes some
tracebacks
Kitchen subpackages
Kitchen itself is a namespace. The kitchen sdist (tarball) provides certain useful subpackages.
SEE ALSO:
Kitchen addon packages
For information about subpackages not distributed in the kitchen sdist that install into the
kitchen namespace.
Versioning
Each subpackage should have its own version information which is independent of the other kitchen
subpackages and the main kitchen library version. This is used so that code that depends on kitchen APIs
can check the version information. The standard way to do this is to put something like this in the
subpackage’s __init__.py:
from kitchen.versioning import version_tuple_to_string
__version_info__ = ((1, 0, 0),)
__version__ = version_tuple_to_string(__version_info__)
__version_info__ is documented in kitchen.versioning. The values of the first tuple should describe API
changes to the module. There are at least three numbers present in the tuple: (Major, minor, micro).
The major version number is for backwards incompatible changes (For instance, removing a function, or
adding a new mandatory argument to a function). Whenever one of these occurs, you should increment the
major number and reset minor and micro to zero. The second number is the minor version. Anytime new but
backwards compatible changes are introduced this number should be incremented and the micro version
number reset to zero. The micro version should be incremented when a change is made that does not change
the API at all. This is a common case for bugfixes, for instance.
Version information beyond the first three parts of the first tuple may be useful for versioning but
semantically have similar meaning to the micro version.
NOTE:
We update the __version_info__ tuple when the API is updated. This way there’s less chance of
forgetting to update the API version when a new release is made. However, we try to only increment
the version numbers a single step for any release. So if kitchen-0.1.0 has kitchen.text.__version__
== ‘1.0.1’, kitchen-0.1.1 should have kitchen.text.__version__ == ‘1.0.2’ or ‘1.1.0’ or ‘2.0.0’.
Criteria for subpackages in kitchen
Subpackages within kitchen should meet these criteria:
• Generally useful or needed for other pieces of kitchen.
• No mandatory requirements outside of the python standard library.
• Optional requirements from outside the python standard library are allowed. Things with mandatory
requirements are better placed in kitchen addon packages
• Somewhat API stable – this is not a hard requirement. We can change the kitchen api. However, it is
better not to as people may come to depend on it.
SEE ALSO:
API Updates
Kitchen addon packages
Addon packages are very similar to subpackages integrated into the kitchen sdist. This section just
lists some of the differences to watch out for.
setup.py
Your setup.py should contain entries like this:
# It's suggested to use a dotted name like this so the package is easily
# findable on pypi:
setup(name='kitchen.config',
      # Include kitchen in the keywords, again, for searching on pypi
      keywords=['kitchen', 'configuration'],
      # This package lives in the directory kitchen/config
      packages=['kitchen.config'],
      # [...]
      )
Package directory layout
Create a kitchen directory in the toplevel. Place the addon subpackage in there. For example:
./ <== toplevel with README, setup.py, NEWS, etc
kitchen/
kitchen/__init__.py
kitchen/config/ <== subpackage directory
kitchen/config/__init__.py
Fake kitchen module
The __init__.py file in the kitchen directory is special. It won’t be installed. It just needs to
pull in the kitchen package from the system so that you are able to test your module. You should be
able to use this boilerplate:
# Fake module. This is not installed. It's just made to import the real
# kitchen modules for testing this module
import pkgutil
# Extend the __path__ with everything in the real kitchen module
__path__ = pkgutil.extend_path(__path__, __name__)
NOTE:
kitchen needs to be findable by python for this to work. Installed in the site-packages directory or
adding it to the PYTHONPATH will work.
Your unittests should now be able to find both your submodule and the main kitchen module.
Versioning
It is recommended that addon packages version similarly to Versioning. The __version_info__ and
__version__ strings can be changed independently of the version exposed by setup.py so that you have
both an API version (__version_info__) and release version that’s easier for people to parse. However,
you aren’t required to do this and you could follow a different methodology if you want (for instance,
Kitchen versioning)
Glossary
“Everything but the kitchen sink”
An English idiom meaning to include nearly everything that you can think of.
API version
Version that is meant for computer consumption. This version is parsable and comparable by
computers. It contains information about a library’s API so that computer software can decide
whether it works with the software.
ASCII A character encoding that maps numbers to characters essential to American English. It maps 128
characters using 7 bits.
SEE ALSO:
http://en.wikipedia.org/wiki/ASCII
ASCII compatible
An encoding in which the particular byte that maps to a character in the ASCII character set is
only used to map to that character. This excludes EBCDIC based encodings and many multi-byte fixed
and variable width encodings since they reuse the bytes that make up the ASCII encoding for other
purposes. UTF-8 is notable as a variable width encoding that is ASCII compatible.
SEE ALSO:
http://en.wikipedia.org/wiki/Variable-width_encoding
For another explanation of various ways bytes are mapped to characters in a possibly
incompatible manner.
code points
code point
A number that maps to a particular abstract character. Code points make it so that we have a
number pointing to a character without worrying about implementation details of how those numbers
are stored for the computer to read. Encodings define how the code points map to particular
sequences of bytes on disk and in memory.
control characters
control character
The set of characters in unicode that are used, not to display glyphs on the screen, but to tell
the display or program to do something.
SEE ALSO:
http://en.wikipedia.org/wiki/Control_character
grapheme
characters or pieces of characters that you might write on a page to make words, sentences, or
other pieces of text.
SEE ALSO:
http://en.wikipedia.org/wiki/Grapheme
I18N I18N is an abbreviation for internationalization. It’s often used to signify the need to
translate words, number and date formats, and other pieces of data in a computer program so that
it will work well for people who speak a language other than your own.
message catalogs
message catalog
Message catalogs contain translations for user-visible strings that are present in your code.
Normally, you need to mark the strings to be translated by wrapping them in one of several gettext
functions. The function serves two purposes:
1. It allows automated tools to find which strings are supposed to be extracted for translation.
2. The functions perform the translation when the program is running.
SEE ALSO:
babel’s documentation
for one method of extracting message catalogs from source code.
Murphy’s Law
“Anything that can go wrong, will go wrong.”
SEE ALSO:
http://en.wikipedia.org/wiki/Murphy%27s_Law
release version
Version that is meant for human consumption. This version is easy for a human to look at to
decide how a particular version relates to other versions of the software.
textual width
The amount of horizontal space a character takes up on a monospaced screen. The units are number
of character cells or columns that it takes the place of.
UTF-8 A character encoding that maps all unicode code points to a sequence of bytes. It is compatible
with ASCII. It uses a variable number of bytes to encode all of unicode. ASCII characters take
one byte. Characters from other parts of unicode take two to four bytes. It is widespread as an
encoding on the internet and in Linux.
INDICES AND TABLES
• genindex
• modindex
• search
PROJECT PAGES
More information about the project can be found on the project webpage
The latest published version of this documentation can be found on the documentation page
COPYRIGHT
2017 Red Hat, Inc. and others
0.2 Aug 28, 2017 KITCHEN(1)