I’ve been learning a new language lately and once again I have to contend with its way of representing strings.
We use strings all the time in programming languages. They are a fundamental type we expect all languages to have. Unfortunately we often need to adjust our mental model of them depending on the programming language we are using.
What is text?
Usually, when talking about text, we are referring to a collection of characters or glyphs, the smallest units of a written language. These letters you are reading are characters, this Latin capital ligature OE, Œ, is a character, and of course this poo emoji, 💩, can be considered a character.
Because computers are just storing a bunch of bytes, we need a way to represent these characters so we can store them. We could start by saying A is 0x01, B is 0x02, etc. But then a Thai speaker might start representing their language like ก is 0x01, ข is 0x02, etc. This would make it difficult to share documents, as all the Thai characters would come out as a seemingly random series of English characters on the other computer. And that is sort of what happened for a while. Then came Unicode. Unicode is pretty much a modern miracle in the world of computing. It’s a giant assignment of the characters making up most languages, emojis, and more, to numbers called code points. In Unicode A is code point 0x41, B is 0x42, 💩 is 0x1F4A9, and so on.
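To make the character-to-number mapping concrete, here is a small Python sketch using the built-in ord and chr functions. The helper name code_point is my own, made up for illustration.

```python
def code_point(ch: str) -> str:
    """Format a single character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


# ord() goes from character to code point, chr() goes back the other way.
for ch in ["A", "B", "ก", "💩"]:
    print(ch, code_point(ch))

print(chr(0x41))  # the character assigned to code point 0x41
```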
Finally, those numbers need to be converted to a series of bytes, the basic unit of storage in modern computers. Since a byte can only represent 256 distinct values, it isn’t possible to fit some of the characters, like the Javanese left rerenggan ꧁ (code point 0xA9C1), into a single byte. The solution to this is an encoding.
An encoding is simply a way to take those numbers (code points) and convert them into a series of bytes. There are multiple encodings out there and you’ve probably heard of some of them: ASCII, Latin-1, UTF-8, UTF-16, and many more.
This is what People with bunny ears, 👯 (code point 0x1F46F), looks like encoded in UTF-8 as a series of bytes: [0xF0, 0x9F, 0x91, 0xAF]. Here it is in UTF-16: [0xFF, 0xFE, 0x3D, 0xD8, 0x6F, 0xDC] (the leading 0xFF, 0xFE is a byte order mark signalling little-endian byte order). I don’t show the ASCII or Latin-1 encodings because they use at most one byte per character, so they can only encode a small subset of the characters that Unicode defines. (They actually came before Unicode, but Unicode is backwards compatible with them: its first 128 code points match ASCII and its first 256 match Latin-1. Just another part of the brilliance!)
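If you want to check those bytes yourself, something like the following Python sketch should do it. I use the explicit "utf-16-le" codec and add the byte order mark by hand so the result doesn’t depend on the machine; can_encode is a helper name I made up for this example.

```python
import codecs

bunnies = "\U0001F46F"  # 👯, People with bunny ears

utf8 = bunnies.encode("utf-8")
# Little-endian UTF-16 with the byte order mark (FF FE) prepended explicitly.
utf16 = codecs.BOM_UTF16_LE + bunnies.encode("utf-16-le")

print([f"0x{b:02X}" for b in utf8])
print([f"0x{b:02X}" for b in utf16])


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(bunnies, "ascii"), can_encode(bunnies, "latin-1"))
```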
So in summary you can think of text as having three stages.
- What you see (sometimes referred to as graphemes)
- Unicode code points. These are an array of integers.
- Encoded code points. These are an array of bytes.
Sometimes what you see as a single character actually consists of multiple code points. For example, this e with a combining diaeresis, ë, actually consists of two code points. The ̈ (code point 0x0308) is a nonspacing mark, meaning it shares physical space with the character that came before it.
This can become problematic, as the answer to a simple question like “how long is this string?” can vary depending on how a programming language chooses to represent and operate on a string.
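As a rough illustration of the last two stages, here is a Python sketch that takes a string with a combining diaeresis and lists its code points and its UTF-8 bytes. The variable names are just for this example, and UTF-8 is only one possible choice of encoding.

```python
s = "he\u0308llo"  # what you see: hëllo, five visible characters

code_points = [ord(ch) for ch in s]   # stage two: an array of integers
encoded = list(s.encode("utf-8"))     # stage three: an array of bytes

print([f"U+{cp:04X}" for cp in code_points])
print([f"0x{b:02X}" for b in encoded])

# Each stage gives a different answer to "how long is this string?"
print(len(code_points), len(encoded))
```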
Back to strings in programming languages
Let’s take the string "hëllo", written with the combining diaeresis from above, and see how different languages handle it, with a brief explanation after each.
Python 3.8
>>> s = "he\u0308llo"
>>> s
'hëllo'
>>> len(s)
6
>>> s[::-1]
'oll̈eh'
Python 3+ represents strings as a sequence of Unicode code points. That means len returns the number of code points. Because the nonspacing diaeresis is its own code point, reversing the string causes it to hang over the l. Python also has byte strings (bytes) for working with the encoded byte representation.
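One way to see why the diaeresis moves is to print the name of each code point after reversing. This is a small sketch using the standard unicodedata module; describe is a helper name I picked for the example. It also shows the byte string you get by encoding the same text as UTF-8.

```python
import unicodedata

s = "he\u0308llo"


def describe(text: str) -> list:
    """Return the official Unicode name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


# After reversing, the combining mark directly follows an l instead of the e.
for name in describe(s[::-1]):
    print(name)

# The byte string form: the diaeresis takes two bytes in UTF-8.
print(s.encode("utf-8"))
```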
Node.js 12.16
> s = "he\u0308llo"
'hëllo'
> s.length
6
> s.split("").reverse().join("")
'oll̈eh'
Strings in JavaScript are UTF-16 encoded, meaning they are a sequence of 16-bit code units. Properties and methods like length count those two-byte units rather than code points, and a code point above 0xFFFF takes two of them (a surrogate pair). You can see this by comparing "💩".length === 2 with Python’s len("💩") == 1.
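To get a feel for what JavaScript’s length is counting, you can imitate it in Python by encoding to UTF-16 and dividing the byte count by two. This is only a sketch of the idea, and utf16_code_units is a name I made up for it.

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to store text as UTF-16."""
    # "utf-16-le" adds no byte order mark, so every two bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


poo = "💩"
print(len(poo), utf16_code_units(poo))        # code points vs. code units
print(utf16_code_units("he\u0308llo"))        # matches s.length in Node.js
```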
Elixir 1.11
iex(1)> s = "he\u0308llo"
"hëllo"
iex(2)> String.length s
5
iex(3)> String.reverse s
"ollëh"
Strings in Elixir are UTF-8 encoded bytes, but many of the String functions, like length, operate on graphemes as opposed to bytes or individual code points. This results in exactly what we might expect! Elixir also has the charlist data type, which uses single quotes and gives you a list of Unicode code points.
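For comparison, here is a rough Python sketch of grapheme-style reversal. It simply glues each combining mark onto the character before it, which is a simplification of my own; proper grapheme segmentation follows the Unicode text segmentation rules and handles many more cases (emoji sequences, Hangul, and so on). The function names are made up for the example.

```python
import unicodedata


def clusters(text: str) -> list:
    """Group each combining mark with the base character that precedes it."""
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch      # attach the mark to the previous cluster
        else:
            groups.append(ch)     # start a new cluster
    return groups


def reverse_clusters(text: str) -> str:
    """Reverse text cluster by cluster so marks stay on their base character."""
    return "".join(reversed(clusters(text)))


s = "he\u0308llo"
print(len(clusters(s)))
print(reverse_clusters(s))
```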
Rustc 1.48
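fn main() {
    let s = "he\u{0308}llo";
    // chars() iterates over Unicode scalar values, not bytes or graphemes.
    let reversed: String = s.chars().rev().collect();
    println!("{}", s);
    // len() is the length in bytes of the UTF-8 encoding.
    println!("{}", s.len());
    println!("{}", reversed);
}
> hëllo
> 7
> oll̈eh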
Strings (str and String) in Rust are UTF-8 encoded bytes. len returns the number of bytes, not code points or graphemes. However, calling chars returns an iterator over the string’s char values, each of which is a Unicode scalar value (any code point except a surrogate). Reversing that iterator gives the same result as Python and JavaScript.
The painful reality
Unfortunately the reality is that strings are not as simple as we would probably like them to be. Programming languages try to hide the complexity and give us sane defaults, but that often leaves programmers to contend with surprising behavior that can be a source of some nuanced bugs. Ultimately I would say that strings are a bit of a leaky abstraction and we need to be aware of how each language represents them.