Dmitry's Very Online Internet Site

Unicode, String Encodings, and Python

I’ve been working on the cryptopals challenges and avoiding stock libraries, to learn more about string encodings and the bytes type in Python 3. So here, we’ll learn way more than we need to know about Unicode.

Text as Unicode Strings

To work with text data, computers need to handle many different kinds of characters: the scripts of all the world’s languages, symbols, emoji, and more. Unicode[1] is a specification that aims to give every character used in human texts its own unique, standard code.

Strings in Python 3 are Unicode strings[2]. A Unicode string is a sequence of Unicode code points. A code point is an integer between 0 and 0x10FFFF (1,114,111 in decimal), and each assigned code point corresponds to a character in the Unicode specification.

Text data can be specified in Python with Unicode string literals[3]:

assert "A unicode \u265E \N{BLACK CHESS KNIGHT}" == "A unicode ♞ ♞"
assert "string \u00f0 \xf0" == "string ð ð"
assert ord("♞") == 9822    # ord() returns the code point as an integer
assert chr(9822) == "♞"    # chr() returns the one-character str for that code point

\u265E is an escape sequence corresponding to the Unicode code point U+265E, which corresponds to the character ♞.
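
As a rough sketch of the same ideas using only built-ins (the variable names are just for illustration): a string's length counts code points, the different escape forms can name the same character, and `chr()` rejects anything past 0x10FFFF.

```python
# Code points are plain integers; ord() and chr() convert in both directions.
knight = "\u265E"
assert len(knight) == 1                  # one code point, whatever its encoded size
assert ord(knight) == 0x265E == 9822
assert chr(0x265E) == knight

# \x takes 2 hex digits, \u takes 4, \U takes 8, and \N takes the character name.
assert "\xf0" == "\u00f0" == "\U000000f0" == "\N{LATIN SMALL LETTER ETH}"

# 0x10FFFF is the largest code point; chr() raises ValueError one past it.
assert ord(chr(0x10FFFF)) == 0x10FFFF
try:
    chr(0x110000)
except ValueError:
    out_of_range = True
else:
    out_of_range = False
assert out_of_range
```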

Encoded Data as bytes

When writing a string to a file or to memory, computers use an encoding to represent the characters as sequences of bits. One way to encode a string into binary in Python is to use str.encode. This method returns a bytes object, a type meant for working with binary data. For example:

# Str to Bytes
assert "10100\u265E".encode("utf-8") == b"10100\xe2\x99\x9e"

We can think of the bytes string above as a representation of the bit string:

# Bytes to Binary String Representation
assert "".join(bin(x) for x in b"10100\xe2\x99\x9e") == "0b1100010b1100000b1100010b1100000b1100000b111000100b100110010b10011110"

(I kept the “0b” prefixes so that each byte’s binary string is easy to tell apart. The bit strings vary in length only because bin() omits leading zeros; every byte is eight bits.)
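
Since bin() drops leading zeros, a fixed-width view can be easier to read. Here is a small sketch that pads each byte to eight binary digits with format(); the space separator is only a display choice.

```python
data = "10100\u265E".encode("utf-8")

# format(byte, "08b") pads each byte to eight binary digits.
bits = " ".join(format(byte, "08b") for byte in data)
assert bits == (
    "00110001 00110000 00110001 00110000 00110000 "
    "11100010 10011001 10011110"
)

# Eight bits per byte: five ASCII bytes plus three bytes for the knight.
assert len(data) == 8
assert len(bits.replace(" ", "")) == 8 * len(data)
```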

UTF-8, used above, is one of the encodings of Unicode.[4] In text mode, the built-in open() often uses UTF-8 by default, but the default depends on your platform and locale unless you pass encoding explicitly.
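
To make UTF-8's variable-width layout concrete, here is a hand-rolled sketch of how a single code point is packed into one to four bytes. The helper name utf8_encode_code_point is hypothetical and exists only for illustration; in practice you would just call str.encode("utf-8").

```python
def utf8_encode_code_point(code_point: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative helper only)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The euro sign U+20AC needs three bytes.
assert utf8_encode_code_point(0x20AC) == b"\xe2\x82\xac" == "€".encode("utf-8")

# The sketch agrees with the built-in encoder for 1-, 2-, 3- and 4-byte characters.
for char in ["A", "ð", "♞", "€", "\U0001F600"]:
    assert utf8_encode_code_point(ord(char)) == char.encode("utf-8")
```

The leading bits of the first byte say how many bytes follow, and every continuation byte starts with the bits 10, which is why ASCII text is unchanged under UTF-8.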

The built-in open() can also read data in its raw binary form when the mode includes 'b' (for example, mode='rb'). This skips decoding entirely and allows direct operations on the binary data.
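
A minimal sketch of the difference between text and binary mode, using a temporary directory so nothing is left behind. The file name knight.txt is made up for the example, and the encoding is passed explicitly so the result does not depend on the locale.

```python
import os
import tempfile

text = "knight: \u265E"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knight.txt")   # hypothetical file name

    # Text mode: open() encodes on write and decodes on read.
    with open(path, mode="w", encoding="utf-8") as f:
        f.write(text)
    with open(path, mode="r", encoding="utf-8") as f:
        decoded = f.read()    # str

    # Binary mode: no decoding, the raw bytes come back unchanged.
    with open(path, mode="rb") as f:
        raw = f.read()        # bytes

assert decoded == text
assert raw == b"knight: \xe2\x99\x9e" == text.encode("utf-8")
```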

Note that bytes can be specified with a literal such as b"L10", where b is followed by a string of ASCII characters or escape sequences (see the string literals specification for more[5]). This is also how a bytes object is displayed when printed, since print() falls back to its repr().

# Hex String to Bytes
assert bytes.fromhex("4c") == b"L"

# Bytes to Hex String
assert b"L".hex() == "4c"

# Integer to Bytes
assert (90).to_bytes(1, byteorder="big") == b"Z"

# Bytes to List of Integers (ASCII Int Codes)
assert list(b"L13") == [76, 49, 51]

# Bytes to Str
assert b"abcd".decode("utf-8") == b"abcd".decode("ascii") == str(b"abcd", "ascii") == "abcd"
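
The conversions above work because the bytes are valid in the chosen encoding. As a hedged sketch of what happens otherwise, decoding with the wrong codec either raises an error or quietly produces different text:

```python
encoded = "\u265E".encode("utf-8")
assert encoded == b"\xe2\x99\x9e"

# ASCII only covers bytes 0-127, so these bytes cannot be decoded as ASCII.
try:
    encoded.decode("ascii")
except UnicodeDecodeError:
    ascii_failed = True
else:
    ascii_failed = False
assert ascii_failed

# Latin-1 maps every byte to a code point, so decoding "succeeds" but
# yields three unrelated characters instead of the knight.
assert encoded.decode("latin-1") == "\xe2\x99\x9e"

# A truncated multi-byte sequence is invalid UTF-8; errors="replace"
# substitutes U+FFFD (the replacement character) instead of raising.
assert "\ufffd" in encoded[:2].decode("utf-8", errors="replace")
```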

The bytes type should not be confused with a string representation of an integer.[6]
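
To illustrate the distinction, here is a small sketch that represents the same number, 76, four different ways; the variable names are only for the example.

```python
n = 76

as_text = str(n)                               # "76": a str of two characters
as_hex_text = hex(n)                           # "0x4c": a str of four characters
as_raw_byte = n.to_bytes(1, byteorder="big")   # b"L": one byte whose value is 76
as_ascii_digits = as_text.encode("ascii")      # b"76": two bytes, 0x37 and 0x36

assert as_raw_byte == b"L" == bytes([76])
assert as_ascii_digits == b"76" == bytes([0x37, 0x36])

# Going back to an integer depends on which representation you hold.
assert int.from_bytes(as_raw_byte, byteorder="big") == 76
assert int(as_text) == int(as_hex_text, 16) == 76
assert int(as_ascii_digits) == 76              # int() also parses ASCII digit bytes

# Pitfall: bytes(n) creates n zero bytes rather than encoding the number n.
assert bytes(3) == b"\x00\x00\x00"
```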

References

Footnotes


  1. You can see the 1062 common English Unicode characters here. The full Unicode specification can be found here. See the wiki for some high-level info.

  2. Python changed its handling of Unicode in a big way when moving from 2 to 3.

  3. A Python literal is something that the parser interprets as syntax for writing an object directly.

     # Python Literals
     "abcd"                              # text string
     b"\x00104"                          # byte string
     42, 4_2, 0x2A, 0b101010             # integer
     1.2e-14                             # float
     1 + 2.0j                            # complex
     True                                # bool
     None                                # None
     (1, 2)                              # tuple
     [1, 2]                              # list
     {1, 2}                              # set
     {1: 1, 2: 2}                        # dict

  4. Here is a worked example of how to encode the 3-byte euro sign €. While the string encoding (e.g. when using str.encode) can be any from this list of standard encodings, UTF-8 is the one you’re most likely to encounter.

  5. See Python’s literal spec for more details. These specifications may also be useful: f-string Syntax and Format String Syntax.

  6. String representations of integers are just Python strings that represent integers using standard numerical systems. Here are a few examples:

     # String Representations of Integers
     assert bin(20) == "0b10100" == format(20, "#b") == "0b" + format(20, "b")
     assert hex(76) == "0x4c"
     assert int("0b10100", 2) == int("10100", 2) == 20
     assert int("0x4c", 16) == int("4c", 16) == 76
