Fredrb's Blog

Character encoding and UTF-8

Here are some quick facts I learned about character encoding, ASCII and UTF-8:

Why do we need encoding?

A text file is just bytes to a computer. What differentiates a file with text content from any other binary file is how the reader interprets it. In order to decode a file, the program must know its encoding. If it doesn’t, the file is a meaningless sequence of octets. However, if the reader knows the file is ASCII encoded, each octet in the file represents a single English letter or symbol according to the ASCII table.

In sum, a character encoding is a set of mappings between the bytes in the computer and the characters in the character set.

ASCII is one of the most popular character sets from the UNIX era. Unfortunately, ASCII has some shortcomings. It only uses 7 bits to represent a character, meaning that only byte values from 0x00 through 0x7F are in the mapping: a total of 128 characters. That is not enough to map non-English characters, making internationalization support very difficult.

Luckily, there is an alternative. The upper 128 byte values, unused by ASCII, can hold additional characters. New character sets came along that kept the same bottom 128 ASCII characters and appended their own on the top half. The problem now is that many different mappings exist, and interoperability between them becomes difficult. A receiver using a different code page than the sender would read an incorrect message. If someone from Brazil writes a message using the letter é to multiple people, it could read as ة in Arabic, и in Cyrillic, or as a corner pipe character if the reader is using IBM’s code page 850.
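
We can reproduce this code-page confusion in Python. The same bytes decode to different text depending on which encoding we claim they use (latin-1 and cp1251 are just two example code pages; any pair would do):

```python
data = b"\x41\xe9"  # two raw bytes

# 0xE9 maps to é in ISO-8859-1 (Latin-1)...
print(data.decode("latin-1"))  # Aé

# ...but the very same byte is й in the Cyrillic code page cp1251.
print(data.decode("cp1251"))   # Aй
```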

Unicode assigns each character a unique code point, written as a U+ prefixed hex number (for example, U+0041 for the letter “A” and U+2705 for ✅). In this experiment, we’ll focus on UTF-8, which seems to be the best-known and most widely used encoding of Unicode code points.
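
In Python, for instance, ord() and chr() expose code points directly, and .encode() shows how UTF-8 turns those numbers into bytes:

```python
# A code point is just a number; ord()/chr() convert in both directions.
print(hex(ord("A")))   # 0x41 -> U+0041
print(hex(ord("✅")))  # 0x2705 -> U+2705
print(chr(0x2705))     # ✅

# UTF-8 is one particular way of serialising code points into bytes.
print("é".encode("utf-8"))  # b'\xc3\xa9'
```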

Experiments

I can read all day about character encoding and how computers work under the hood. But it’s not until I start creating files and trying to display them in different encodings that I really understand what’s going on.

How a file gets its encoding

When creating a new file using touch, your computer will interpret that file as a binary file. It has 0 bytes and no content.

% touch myfile
% file -I myfile
myfile: inode/x-empty; charset=binary

As soon as some text gets added to it, your operating system can now determine what type of file that might be. If we simply add some English letters and no fancy symbols, we should get a character encoding assigned to our file.

% cat myfile
Hey, this is Fred!
% file -I myfile
myfile: text/plain; charset=us-ascii

I’m not in the US, but somehow that seems to be the default encoding. I assume it might have to do with my keyboard layout and the default settings of my machine. The file doesn’t hold any information about its own encoding; there’s a lot of guessing involved in figuring it out (thanks to jcranmer for pointing this out). Let’s try adding some different symbols and see if the encoding changes.

% cat myfile
Hey, this is Fred!
Here are some different characters: é ã ô ê ø π ∆ ï ∑ à ∫
% file -I myfile
myfile: text/plain; charset=utf-8

It did! We don’t need to specify which encoding we want the file to have. Based on the bytes of the file, the file command works out that it can’t be pure ASCII, since it contains bytes outside the 0x00–0x7F range.
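
A rough sketch of the kind of guess involved, in Python (this is my own simplification, not file’s actual algorithm): if every byte is below 0x80 the content could be plain ASCII; otherwise, see whether the bytes form valid UTF-8:

```python
def guess_charset(data: bytes) -> str:
    if not data:
        return "binary"      # like the empty touch-ed file above
    if all(b < 0x80 for b in data):
        return "us-ascii"    # every byte fits the 7-bit ASCII range
    try:
        data.decode("utf-8")
        return "utf-8"       # bytes above 0x7F form valid UTF-8 sequences
    except UnicodeDecodeError:
        return "binary"

print(guess_charset(b"Hey, this is Fred!"))   # us-ascii
print(guess_charset("é".encode("utf-8")))     # utf-8
```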

Let’s dig deeper and check the binary content of these files.

Exploring UTF-8 binary representation

The premise for this part of the exploration is that UTF-8 files use the same bytes to map ASCII characters. Let’s start by first writing a new file with only English letters:

% echo -n "ABC" > ascii_file
% file -I ascii_file
ascii_file: text/plain; charset=us-ascii
% hexdump -C ascii_file
00000000  41 42 43                 |ABC|

We use echo -n to omit the newline at the end of the file.

The 3 bytes in this file are: 0x41 0x42 0x43.

Now, let’s create a new file with an additional non-English letter:

% echo -n "ABCé" > utf8_file
% file -I utf8_file
utf8_file: text/plain; charset=utf-8
% hexdump -C utf8_file
00000000  41 42 43 c3 a9          |ABC..|

That looks great. We still only use 3 bytes for the ASCII letters (0x41 0x42 0x43). But now there are 2 additional bytes at the end that represent the Latin character é (0xc3 0xa9).

Why is the 0xC3 prefix used in this case, and how does that map to the Unicode code point for é (U+00E9)?

Each multi-byte character in UTF-8 relies on a prefix in its first byte. ASCII characters are encoded as a single byte, so they don’t have any. The table below shows the first byte range and the number of bytes for each character.

First byte range   Number of bytes   Example
0x00 - 0x7F        1                 A (0x41)
0xC2 - 0xDF        2                 é (0xC3 0xA9)
0xE0 - 0xEF        3                 ✅ (0xE2 0x9C 0x85)
0xF0 - 0xF4        4                 𐀏 (0xF0 0x90 0x80 0x8F)

Full UTF-8 table
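
A minimal sketch of how a decoder could derive a sequence’s length from its first byte, following the table above (utf8_seq_len is my own helper, not something from the spec):

```python
def utf8_seq_len(first_byte: int) -> int:
    """Length in bytes of a UTF-8 sequence, judging only by its first byte."""
    if 0x00 <= first_byte <= 0x7F:
        return 1   # plain ASCII
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    raise ValueError(f"{first_byte:#04x} cannot start a UTF-8 sequence")

# Check the helper against the examples from the table:
for ch in ("A", "é", "✅", "𐀏"):
    encoded = ch.encode("utf-8")
    assert utf8_seq_len(encoded[0]) == len(encoded)
```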

Let’s see what happens if all these bytes are added to a file:

% echo -n "\x41\xc3\xa9\xe2\x9c\x85\xf0\x90\x80\x8f" > utf8_length
% file -I utf8_length
utf8_length: text/plain; charset=utf-8
% cat utf8_length
Aé✅𐀏%
% hexdump utf8_length
0000000 41 c3 a9 e2 9c 85 f0 90 80 8f

This is how an interpreter would look at this file:

[41] [c3 a9] [e2 9c 85] [f0 90 80 8f]
A    é       ✅         𐀏
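
To answer the earlier question about the 0xC3 prefix, here is a sketch in Python of the bit arithmetic a decoder performs on the pair 0xC3 0xA9: the leading byte’s 110xxxxx pattern carries 5 payload bits, the continuation byte’s 10xxxxxx pattern carries 6, and concatenating them yields the code point U+00E9.

```python
first, cont = 0xC3, 0xA9
# 0xC3 = 0b110_00011 -> keep the low 5 bits
# 0xA9 = 0b10_101001 -> keep the low 6 bits
code_point = ((first & 0b0001_1111) << 6) | (cont & 0b0011_1111)
print(f"U+{code_point:04X}")  # U+00E9
print(chr(code_point))        # é
```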

What happens if the first byte is in the range 0x80 - 0xC1?

It turns out that these values can never start a character: 0x80–0xBF are reserved for continuation bytes, while 0xC0 and 0xC1 were left out of the UTF-8 spec entirely. There is this StackOverflow answer that points to the RFC, stating:

The octet values C0, C1, F5 to FF never appear.

Using a byte in this range at the start of a sequence would yield an invalid UTF-8 file.
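
Python’s UTF-8 decoder enforces this. The sequence 0xC0 0xAF could only be an “overlong” (non-shortest) encoding, so decoding it raises an error:

```python
try:
    b"\xc0\xaf".decode("utf-8")
except UnicodeDecodeError as e:
    # 0xC0 is one of the byte values that never appear in valid UTF-8.
    print(e)
```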

Exploring ASCII-like encodings

After some of the comments received on this blog, I’ve realised I used the term “code pages” in a misleading way to refer to encodings that use the bottom 128 ASCII characters and some custom characters on the other half. This means that the following section is riddled with the term “code pages”, but keep in mind that these are really “ASCII-like encodings”, for lack of a better term.

There are multiple encodings that map the upper 128 byte values to an extended character set. The iconv tool can be used to convert from a target code page to UTF-8.

But before getting into the actual conversion of files, let’s create a test file that contains all bytes in the extended range (0x80–0xFF). Here’s the Python script I’ve used to create it:

with open("outfile.txt", "wb") as out:
    for b in range(0x80, 0x100):  # 0x80 through 0xFF inclusive
        out.write(bytes([b]))     # bytes([b]) is the single byte b

This generates, as expected, a file that contains every byte value between 0x80 and 0xFF:

% hexdump outfile.txt
0000000 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f
0000010 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f
0000020 a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af
0000030 b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
0000040 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf
0000050 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df
0000060 e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef
0000070 f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff

We can use this test file to see how each code page maps characters to these values. I’ve written a wrapper around iconv that takes a list of targets; combined with the list of available code pages (iconv -l), we can iterate over all possible code pages against a single file:

% ./check_encodings.sh ./outfile.txt $(iconv -l | awk '{print $1}' | xargs)

Encoding: UTF-16
肁芃蒅蚇袉誋貍躏邑銓钕隗颙骛鲝麟ꂡꊣ꒥ꚧꢩꪫ겭꺯낱늳뒵뚷뢹못벽뺿상싃쓅웇죉쫋쳍컏탑틓퓕훗

Encoding: CP819
¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

Encoding: ISO-8859-2
Ą˘Ł¤ĽŚ§¨ŠŞŤŹŽŻ°ą˛ł´ľśˇ¸šşťź˝žżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ˙

Encoding: 850
ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø׃áíóúñѪº¿®¬½¼¡«»░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐└┴┬├─┼ãÃ╚╔╩╦╠═╬¤ðÐÊËÈıÍÎÏ┘┌█▄¦Ì▀ÓßÔÒõÕµþÞÚÛÙýݯ´±‗¾¶§÷¸°¨·¹³²■

Encoding: MAC
ÄÅÇÉÑÖÜáàâäãåçéèêëíìîïñóòôöõúùûü†°¢£§•¶ß®©™´¨≠ÆØ∞±≤≥¥µ∂∑∏π∫ªºΩæø¿¡¬√ƒ≈∆«»… ÀÃÕŒœ–—“”‘’÷◊ÿŸ⁄€‹›fifl‡·‚„‰ÂÊÁËÈÍÎÏÌÓÔÒÚÛÙıˆ˜¯˘˙˚¸˝˛ˇ

Encoding: MACICELAND
ÄÅÇÉÑÖÜáàâäãåçéèêëíìîïñóòôöõúùûüÝ°¢£§•¶ß®©™´¨≠ÆØ∞±≤≥¥µ∂∑∏π∫ªºΩæø¿¡¬√ƒ≈∆«»… ÀÃÕŒœ–—“”‘’÷◊ÿŸ⁄¤ÐðÞþý·‚„‰ÂÊÁËÈÍÎÏÌÓÔ

Encoding: VISCII
ẠẮẰẶẤẦẨẬẼẸẾỀỂỄỆỐỒỔỖỘỢỚỜỞỊỎỌỈỦŨỤỲÕắằặấầẩậẽẹếềểễệốồổỗỠƠộờởịỰỨỪỬơớƯÀÁÂÃẢĂẳẵÈÉÊẺÌÍĨỳĐứÒÓÔạỷừửÙÚỹỵÝỡưàáâãảăữẫèéêẻìíĩỉđựòóôõỏọụùúũủýợỮ

Encoding: CP1124
ЁЂҐЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя№ёђґєѕіїјљњћќ§ўџ

Encoding: CP1129
¡¢£¤¥¦§œ©ª«¬®¯°±²³Ÿµ¶·Œ¹º»¼½¾¿ÀÁÂĂÄÅÆÇÈÉÊË̀ÍÎÏĐÑ̉ÓÔƠÖ×ØÙÚÛÜỮßàáâăäåæçèéêë́íîïđṇ̃óôơö÷øùúûüư₫ÿ

Encoding: CP858
ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø׃áíóúñѪº¿®¬½¼¡«»░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐└┴┬├─┼ãÃ╚╔╩╦╠═╬¤ðÐÊËÈ€ÍÎÏ┘┌█▄¦Ì▀ÓßÔÒõÕµþÞÚÛÙýݯ´±‗¾¶§÷¸°¨·¹³²■

There are many more results, but I’ve removed most of them, since the point isn’t to see every code page but to prove that we can flip through different pages using the same set of bytes. This is really the neat part of this exercise: it proves the very first point of this post. You need to know which encoding is being used, otherwise the message could look like gibberish pipes instead of legitimate letters.
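
The same page-flipping can be sketched in Python (the codec names below are Python’s spellings, not iconv’s, and I skip 0x80–0x9F since those are mostly control codes in the ISO encodings):

```python
# Decode the same bytes with several ASCII-like encodings and
# watch the characters change, just like the iconv loop above.
data = bytes(range(0xA0, 0x100))
for codec in ("latin-1", "iso8859-2", "cp850", "mac-roman"):
    print(f"{codec}: {data.decode(codec)}")
```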


It was good to get a better understanding of how character encoding works and the differences between Unicode and ASCII. Honestly, I’ve never really paid attention to this, since it has always been a ubiquitous thing for me. I never had to care about the encoding of my files. I knew that Unicode and ASCII weren’t the same thing and that sometimes I’d see a weird � symbol when printing text in the terminal.

Takeaways from this fun exploration were: