A UTF-8 mini-refresher

Updated on Fri, 30 Dec 2016 01:08:29 GMT, tagged with ‘tech’.

Aficionados of information theory will recognize UTF-8 as a self-synchronizing prefix-free code, where “code” means a way to encode the letters of an alphabet with bits. Here the alphabet is the set of Unicode code points: each code point is one “letter”, and more than a hundred thousand of the 1,114,112 possible values have been assigned.
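To make “self-synchronizing” a bit more concrete, here is a minimal Rust sketch. It assumes only that continuation bytes always have the bit pattern 10xxxxxx, so a reader dropped at an arbitrary byte offset can step back to the start of the enclosing code point. The helper names `is_continuation` and `sync_to_boundary` are made up for this illustration.

```rust
/// True if `byte` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
fn is_continuation(byte: u8) -> bool {
    byte & 0b1100_0000 == 0b1000_0000
}

/// Starting at an arbitrary byte offset, step backward until we reach a byte
/// that is not a continuation byte, i.e. the start of a code point.
/// (Hypothetical helper; assumes `bytes` is valid UTF-8.)
fn sync_to_boundary(bytes: &[u8], mut index: usize) -> usize {
    while index > 0 && index < bytes.len() && is_continuation(bytes[index]) {
        index -= 1;
    }
    index
}

fn main() {
    // x (1 byte), Å (2 bytes), 月 (3 bytes), 😊 (4 bytes)
    let text = "x\u{C5}\u{6708}\u{1F60A}";
    let bytes = text.as_bytes();

    // Offset 5 falls inside the three-byte sequence for 月.
    let start = sync_to_boundary(bytes, 5);
    println!("offset 5 resynchronizes to offset {}", start);

    // The standard library agrees that this is a character boundary.
    assert!(text.is_char_boundary(start));
}
```

Because no valid sequence has more than three continuation bytes, the loop never needs to look back more than three bytes on well-formed input.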

Going off of this highly-voted StackOverflow answer and the Base-122 writeup, here’s how UTF-8 works. If the first half-byte of a byte (the first hexadecimal digit) is:

- 0–7: a one-byte code point; the byte carries 7 bits.
- 8–B: a continuation byte, i.e. the second, third, or fourth byte of a multi-byte code point.
- C–D: the first byte of a two-byte code point, carrying 5+6 bits.
- E: the first byte of a three-byte code point, carrying 4+6+6 bits.
- F: the first byte of a four-byte code point, carrying 3+6+6+6 bits.

(In the above, whenever I’ve said, e.g., “3+6+6+6” bits, each of those numbers represents how many bits of each byte combine to yield the code point. Those bits come from the least significant end of each byte; the most significant bits are taken up by a prelude that marks the byte’s role.)
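The first-hex-digit rule can be sketched in Rust roughly as follows. This is an illustration rather than a full decoder: `sequence_len` and `decode_first` are hypothetical names, and the sketch does not reject overlong encodings, surrogates, or values above U+10FFFF the way a conforming decoder must.

```rust
/// Length of the sequence that starts with `first`, judged only by its
/// leading hexadecimal digit. Returns None for continuation bytes (8-B).
fn sequence_len(first: u8) -> Option<usize> {
    match first >> 4 {
        0x0..=0x7 => Some(1), // 0xxxxxxx: 7 payload bits
        0x8..=0xB => None,    // 10xxxxxx: continuation byte, not a start
        0xC | 0xD => Some(2), // 110xxxxx: 5 + 6 payload bits
        0xE => Some(3),       // 1110xxxx: 4 + 6 + 6 payload bits
        _ => Some(4),         // 11110xxx: 3 + 6 + 6 + 6 payload bits
    }
}

/// Decode the first code point in `bytes` by stripping each byte's prelude
/// and concatenating the remaining low-order bits.
fn decode_first(bytes: &[u8]) -> Option<u32> {
    let first = *bytes.first()?;
    let len = sequence_len(first)?;
    if bytes.len() < len {
        return None; // truncated sequence
    }
    // Keep only the payload bits of the leading byte.
    let mut code_point = match len {
        1 => u32::from(first),
        2 => u32::from(first & 0b0001_1111),
        3 => u32::from(first & 0b0000_1111),
        _ => u32::from(first & 0b0000_0111),
    };
    // Each continuation byte contributes its low 6 bits.
    for &byte in &bytes[1..len] {
        if byte & 0b1100_0000 != 0b1000_0000 {
            return None; // expected a 10xxxxxx continuation byte
        }
        code_point = (code_point << 6) | u32::from(byte & 0b0011_1111);
    }
    Some(code_point)
}

fn main() {
    let samples: [&[u8]; 4] = [
        &[0x78],
        &[0xC3, 0x85],
        &[0xE6, 0x9C, 0x88],
        &[0xF0, 0x9F, 0x98, 0x8A],
    ];
    for bytes in samples.iter() {
        match decode_first(bytes) {
            Some(cp) => println!("{:02X?} decodes to U+{:04X}", bytes, cp),
            None => println!("{:02X?} is not a complete sequence", bytes),
        }
    }
}
```

In real code, `std::str::from_utf8` or `String::from_utf8` is the safer choice, since both perform full validation.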

So, for example:

| character | UTF-8 bytes | Unicode code point |
|---|---|---|
| x | 78 | \U{78} |
| Å | C3 85 | \U{C5} |
| 月 | E6 9C 88 | \U{6708} |
| 😊 | F0 9F 98 8A | \U{1F60A} |
| Ā̂ | C4 80 CC 82 | \U{100}\U{302} |
| 👍 | F0 9F 91 8D | \U{1F44D} |
| 👍🏽️ | F0 9F 91 8D F0 9F 8F BD EF B8 8F | \U{1F44D}\U{1F3FD}\U{FE0F} |

(Experiment with the code that generated this at the Rust Playground!)

The first four rows show code points that, in UTF-8, take up one, two, three, and four bytes, respectively. The rest show that Unicode is more complicated than can be expressed in this gist 😜: a single visible character can be built from several code points.
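For readers who want to reproduce rows like these locally, here is a small Rust sketch. It is not necessarily the code behind the Playground link; `utf8_hex` and `code_points` are names invented for this example, and the sample strings are written with `\u{...}` escapes so that the exact code points are unambiguous.

```rust
/// The UTF-8 bytes of `text` as space-separated uppercase hex pairs.
fn utf8_hex(text: &str) -> String {
    text.bytes()
        .map(|b| format!("{:02X}", b))
        .collect::<Vec<_>>()
        .join(" ")
}

/// The code points of `text`, each rendered in the table's \U{...} notation.
fn code_points(text: &str) -> String {
    text.chars()
        .map(|c| format!("\\U{{{:X}}}", c as u32))
        .collect()
}

fn main() {
    let samples = [
        "x",
        "\u{C5}",                       // Å
        "\u{6708}",                     // 月
        "\u{1F60A}",                    // 😊
        "\u{100}\u{302}",               // Ā plus a combining circumflex
        "\u{1F44D}",                    // 👍
        "\u{1F44D}\u{1F3FD}\u{FE0F}",   // 👍 plus skin tone plus variation selector
    ];

    println!("character\tUTF-8 bytes\tUnicode code point");
    for text in samples.iter() {
        println!("{}\t{}\t{}", text, utf8_hex(text), code_points(text));
    }
}
```

Note that `chars()` iterates over code points, not over what a reader would call characters, which is why the last sample yields three entries for one visible glyph.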

(Banner credit: Jeremy Keith, Linear B, tablet at the Ashmolean Museum, Oxford, England. Flickr.)
