Philip Guo (Phil Guo, Philip J. Guo, Philip Jia Guo, pgbovine)

Unicode strings in Python: A basic tutorial

Unicode Wars Episode IV: A New 文字化け
Summary
I explain the basics of Unicode strings and how to handle them properly in Python 2 and 3. 猫猫猫

Here is my attempt to explain the basics of Unicode strings and how to handle them in Python. (For a much, much more comprehensive introduction, I highly recommend reading what Joel Spolsky and Ned Batchelder have written on this topic.)

tl;dr: In Python 2, if you see a str object, convert it to a unicode object right away by calling .decode('utf-8'). Process all strings as unicode objects, not str objects. If you need to write a unicode object out to a file or database, first call .encode('utf-8') on it. If you don't follow this convention, then you'll likely see weird errors when processing strings with non-English characters. In Python 3, you can avoid all of this nastiness.

To skip ahead to the sequel, read Unicode errors in Python 2.

ASCII strings

Let's start simple. You're probably familiar with ASCII strings.

Download this file: test.txt

(I'm doing all of this on a Mac; everything should look identical on Linux. No guarantees for Windows, though!)

What does this file contain?

$ cat test.txt
hi

It contains two characters: h and i

How big is this file?

$ ls -l test.txt
-rw-r--r--@ 1 pgbovine  staff     2B Nov 30 11:56 test.txt

The 2B above means that it's 2 bytes.

What does each byte contain? Use hexdump to find out:

$ hexdump test.txt
0000000 68 69
0000002

The relevant output here is the pair of numbers 68 and 69. These are the hexadecimal values of the two bytes in this file: 0x68 and 0x69.

How do we interpret these two numbers 68 and 69? Look them up in an ASCII table under the "Hx" column for hexadecimal, and you'll see that 68 is the letter h and 69 is the letter i in ASCII.

Let's confirm this using Python. Start up a Python 2 interpreter (I used Python 2.7 here) and type the following statements:

>>> x = open('test.txt').read()
>>> x
'hi'
>>> type(x)
<type 'str'>
>>> len(x)
2
>>> x[0]
'h'
>>> x[1]
'i'

If you're a programmer, none of this should be surprising. You've probably been taught since Day One that every character in a string is a single byte, and that the ASCII table translates each byte value into a unique character. This file contains an ASCII string of two characters, so it's exactly two bytes in size. Our Python code confirms this: x[0] is the character h and x[1] is i.
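
The same ASCII lookup can be done in code rather than with a printed table. Here is a minimal sketch in Python 3 (my assumption for runnable examples; the session above uses Python 2.7), with illustrative variable names:

```python
# The two bytes of test.txt, written as a Python 3 bytes literal.
data = b"hi"

# Iterating over bytes yields integers; format each as two hex digits.
hex_values = [format(byte, "02x") for byte in data]
print(hex_values)  # ['68', '69']

# chr() maps each byte value back to its character (valid here: both are ASCII).
chars = [chr(byte) for byte in data]
print(chars)  # ['h', 'i']
```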

Forget ASCII, learn Unicode

OK, now forget everything you've learned about ASCII strings. We'll now learn about Unicode strings, which can represent any possible character in any language, not just ASCII characters that appear directly on your keyboard. If you want to develop software that works internationally (especially on the Web), then you need to understand Unicode.

People write in many languages besides English, so not every character in a string is necessarily a single byte. After all, a byte can represent at most 256 (2^8) different characters, and languages such as Chinese have far more than that. For instance, 猫 is the Chinese character that means “cat.” That's clearly not in the ASCII table. How many bytes is it? We'll find out soon. But first, repeat after me: characters and bytes are unrelated.

Now let's follow the same steps as above, but with this file: chinese.txt

What does this file contain?

$ cat chinese.txt
hi猫

It contains three characters: h, i, and 猫.

How big is this file?

$ ls -l chinese.txt
-rw-r--r--@ 1 pgbovine  staff     5B Nov 30 12:44 chinese.txt

The 5B above means that it's 5 bytes. So this file contains 3 characters, but 5 bytes.

What does each byte contain? Again, use hexdump to find out:

$ hexdump chinese.txt
0000000 68 69 e7 8c ab
0000005

The relevant output here is the sequence of five hexadecimal numbers 68, 69, e7, 8c, ab. Five numbers, five bytes. Good so far?

How do we interpret these five numbers as characters? Instead of using the ASCII table, we look them up in a far larger table called the Unicode UTF-8 table. (UTF-8 is currently the most popular “dialect” of Unicode.) In this table, 68 is the character h, 69 is the character i, and the three-byte sequence e7, 8c, ab is the character 猫. To recap, h is one byte, i is one byte, but 猫 is three bytes.

(You might have noticed that 68 and 69 also represented h and i, respectively, in the ASCII table. That's by design! The Unicode UTF-8 table is a superset of the ASCII table, so an old-school ASCII string is also a valid Unicode UTF-8 string. Thus, Unicode UTF-8 is backwards-compatible with ASCII.)

In the old ASCII view of the world, every character in a string is a single byte. But in the new Unicode UTF-8 view of the world, every character in a string is one or more bytes.
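
To make "one or more bytes" concrete, here is a small Python 3 sketch (an assumption on my part; Python 2 comes up later in this article). It uses the encode method, introduced further down, to turn text into UTF-8 bytes and counts the bytes for each character:

```python
text = "hi猫"

# Number of UTF-8 bytes that each character occupies.
byte_counts = [len(ch.encode("utf-8")) for ch in text]
print(byte_counts)  # [1, 1, 3]

# Pure-ASCII text produces identical bytes under ASCII and UTF-8.
same_bytes = "hi".encode("ascii") == "hi".encode("utf-8")
print(same_bytes)  # True
```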

Now you might be wondering why the five bytes were grouped together into three groups {68}, {69}, {e7, 8c, ab} to represent the three characters h, i, and 猫. Why couldn't they instead be grouped like this to represent two characters:

{68, 69, e7}, {8c, ab}

Or like this to represent a different three characters:

{68}, {69, e7}, {8c, ab}

Or even this to represent four characters:

{68}, {69, e7}, {8c}, {ab}

Or a bunch of other ways? Well, the people who invented Unicode UTF-8 did some clever things to ensure that there is only one unique grouping for every sequence of bytes. So when you see the five-byte sequence 68, 69, e7, 8c, ab, there is only one correct way to interpret it: {68}, {69}, {e7, 8c, ab}, which represents the three characters h, i, and 猫.
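
One of those clever things is that the high bits of the first byte in each group say how long the group is. The sketch below illustrates that idea in Python 3. The function names sequence_length and group_utf8 are hypothetical, and it is a simplification: unlike a real decoder, it does not check that the following bytes are valid continuation bytes.

```python
def sequence_length(lead_byte):
    """Return how many bytes the UTF-8 group starting with lead_byte spans."""
    if lead_byte < 0x80:             # 0xxxxxxx: one byte (plain ASCII)
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx: two-byte group
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx: three-byte group
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx: four-byte group
        return 4
    # 10xxxxxx is a continuation byte and can never start a group.
    raise ValueError("not a valid lead byte: 0x%02x" % lead_byte)


def group_utf8(data):
    """Split a bytes object into per-character groups (no validation)."""
    groups = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        groups.append(data[i:i + n])
        i += n
    return groups


groups = group_utf8(bytes([0x68, 0x69, 0xE7, 0x8C, 0xAB]))
print(groups)  # [b'h', b'i', b'\xe7\x8c\xab']
```

Because e7 starts with the bits 1110, it must begin a three-byte group, and 8c and ab (both starting with the bits 10) can only continue a group. That is why the other groupings above are ruled out.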

Handling Unicode strings in Python 2

Let's start up Python 2 and read our chinese.txt file:

>>> x = open('chinese.txt').read()
>>> x
'hi\xe7\x8c\xab'
>>> type(x)
<type 'str'>
>>> len(x)
5

x seems to be a string (type str) of length 5. But that doesn't make intuitive sense, since we know that this file contains only 3 characters: h, i, and 猫. So I want len(x) to return 3, not 5!

What's going on here? It turns out that the Python 2 str type doesn't hold a string; it holds a sequence of bytes. In the old ASCII world, a string was simply a sequence of bytes, so this distinction didn't matter. But in our new Unicode world, the str type is not appropriate for working with strings. In Python 2, str is not a string! It's just a sequence of bytes.

Let's continue the above example and print out all 5 bytes in x:

>>> x[0]
'h'
>>> x[1]
'i'
>>> x[2]
'\xe7'
>>> x[3]
'\x8c'
>>> x[4]
'\xab'

Note that the first two bytes print as h and i, respectively, since those represent printable ASCII characters. However, the last three bytes print as the hexadecimal values (prefixed with \x in Python) e7, 8c, and ab. Recall that those three bytes represent the character 猫 in Unicode UTF-8. But since the str type in Python 2 is simply a sequence of bytes, the Python interpreter has no way of knowing that those three bytes should be grouped together into one group to represent the Chinese character 猫. It just sees three separate bytes.

How do we create a proper Unicode UTF-8 string in Python 2? Call the (confusingly-named!) decode method on x:

>>> y = x.decode('utf-8')
>>> type(y)
<type 'unicode'>
>>> y
u'hi\u732b'
>>> len(y)
3
>>> y[0]
u'h'
>>> y[1]
u'i'
>>> y[2]
u'\u732b'

What just happened here? We called x.decode('utf-8') to tell Python to interpret the sequence of bytes in x as UTF-8 and convert (“decode”, yes it's confusing!) it into a Unicode string. Now y has type unicode and a length of 3, which is what we expect. y[0] and y[1] are still h and i, respectively (the u prefix means Unicode). Now look at y[2], which is the third character. Its value is \u732b, which is Python's way of representing U+732B, the unique location (called a “code point”) of 猫 in the giant Unicode table.
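
A code point and its UTF-8 bytes are two different things. The code point is the character's number in the Unicode table, while the bytes are how UTF-8 stores that number. Here is a short Python 3 sketch of the difference (Python 3 is my assumption here; the variable names are illustrative):

```python
cat = "\u732b"                    # the escape names code point U+732B
code_point = ord(cat)             # the character's number in the Unicode table
utf8_bytes = cat.encode("utf-8")  # how UTF-8 stores that number as bytes

print(hex(code_point))            # 0x732b
print(utf8_bytes)                 # b'\xe7\x8c\xab'
print(len(cat), len(utf8_bytes))  # 1 3
```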

If we print y, then (assuming the terminal supports Unicode) it will print out what we expect:

>>> print y
hi猫

A word of caution

Weird things happen in Python 2 if you think str is a string. To be safe, whenever you read a str into your program, convert it to Unicode right away using .decode('utf-8'). Then work only with unicode objects throughout your program. Don't ever work directly with str objects, or else you will be in a world of pain!

In Python 2, the unicode type represents a real string, whereas the str type is a sequence of bytes. Convert all string-like data to the unicode type before trying to process it.

Once you're done processing your Unicode strings, if you want to write them out to a file or database, first convert them back to a sequence of bytes (the str type) using the encode method:

>>> z = y.encode('utf-8')
>>> type(z)
<type 'str'>
>>> len(z)
5
>>> z
'hi\xe7\x8c\xab'
>>> z == x
True

Now it's safe to write z into a file or database since it's simply a sequence of bytes. If you try to write the Unicode string y directly into a file or database, weird errors may arise.
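
Put together, the convention is: decode once on the way in, process text, encode once on the way out. Below is a minimal sketch of that flow, written for Python 3 so it runs as-is. The function shout and the temporary output path are hypothetical stand-ins for your own processing step and destination.

```python
import os
import tempfile


def shout(text):
    """Hypothetical processing step that works on text, never on raw bytes."""
    return text.upper()


raw = b"hi\xe7\x8c\xab"             # bytes as they arrive from a file or database
text = raw.decode("utf-8")          # decode once, at the input boundary
result = shout(text)                # all processing happens on Unicode text
out_bytes = result.encode("utf-8")  # encode once, at the output boundary

# Write the bytes to a throwaway file and read them back to check.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "wb") as f:
    f.write(out_bytes)
with open(path, "rb") as f:
    written = f.read()

print(written)  # b'HI\xe7\x8c\xab'
```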

Handling Unicode strings in Python 3

Python 3 makes handling Unicode much simpler. The biggest change is that the str type actually holds Unicode strings, not a sequence of bytes.

Let's start up a Python 3 interpreter (I used Python 3.4 here) and read in our chinese.txt file:

>>> x = open('chinese.txt').read()
>>> x
'hi猫'
>>> type(x)
<class 'str'>
>>> len(x)
3
>>> x[0]
'h'
>>> x[1]
'i'
>>> x[2]
'猫'
>>> print(x)
hi猫

x is a string of 3 characters! Everything works like we expect out of the box without any type conversions. Life is much easier.
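
One caveat: when open() is called in text mode without an encoding argument, Python 3 picks a platform-dependent default. That default was UTF-8 on the Mac used here, but it may differ elsewhere. Passing the encoding explicitly removes the guesswork. This sketch recreates the file in a temporary directory (my assumption, so the example is self-contained) and reads it back:

```python
import os
import tempfile

# Recreate chinese.txt in a temporary directory so the example stands alone.
path = os.path.join(tempfile.mkdtemp(), "chinese.txt")
with open(path, "wb") as f:
    f.write(b"hi\xe7\x8c\xab")

# State the encoding explicitly instead of relying on the platform default.
with open(path, encoding="utf-8") as f:
    x = f.read()

print(len(x))           # 3
print(x == "hi\u732b")  # True
```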


P.S. What if you wanted to interpret the contents of chinese.txt as bytes? Then open the file in binary mode using the 'rb' option:

>>> x = open('chinese.txt', 'rb').read()
>>> type(x)
<class 'bytes'>
>>> len(x)
5
>>> x
b'hi\xe7\x8c\xab'

This emulates what happens by default in Python 2. The b prefix denotes bytes. To convert bytes to a Unicode string, use decode:

>>> y = x.decode('utf-8')
>>> type(y)
<class 'str'>
>>> len(y)
3
>>> print(y)
hi猫

In sum, in Python 3, str represents a Unicode string, while the bytes type is a sequence of bytes. This naming scheme makes a lot more intuitive sense.
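
Going the other way works just like in Python 2: call encode on a str to get bytes. Python 3 also refuses to mix the two types silently, as this short sketch with illustrative variable names shows:

```python
text = "hi\u732b"            # a str of 3 characters
data = text.encode("utf-8")  # bytes, ready to write to a file or database

print(type(data).__name__, len(data))  # bytes 5
print(data.decode("utf-8") == text)    # True

# Combining str and bytes directly raises an error instead of guessing.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False
```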


Now read the sequel: Unicode errors in Python 2.

Created: 2015-11-30
Last modified: 2015-11-30