Unicode is basically just a standardized list of various symbols, which assigns a specific integer value to each of them. These are called code points (not to be confused with code units, which come later).
They are grouped into 17 planes, with 2^16 (65,536) code points each. The first one (plane 0) is called the Basic Multilingual Plane, or BMP.
To store text that follows the Unicode standard, you simply have to store a sequence of code points, which is just a sequence of integer numbers.
The specifics of how these numbers are stored can vary and are determined by the text's encoding format.
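To make that concrete, here is a small illustration (using Python here purely as a convenient way to poke at code points; any language with Unicode support would do). ord() and chr() convert between a character and its code point:

```python
# ord() gives the Unicode code point of a character, chr() goes the other way.
print(ord("A"))      # 65
print(ord("€"))      # 8364   (0x20AC, still inside the BMP)
print(ord("😀"))     # 128512 (0x1F600, plane 1)
print(chr(0x1F600))  # 😀
```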
UTF-32
All of the integer values currently defined in Unicode can be expressed in 21 bits or less.
Therefore the simplest approach would be to store each code point as a 24 bit (3 byte) integer.
While no official 24 bit encoding for Unicode exists (so far), there is UTF-32, in which each code point is stored as a 32 bit integer.
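As a rough sketch of what that means in practice, the following Python snippet (the function name encode_utf32_le is my own, not a standard API) packs each code point into a 32 bit little-endian integer and checks the result against the built-in codec:

```python
import struct

def encode_utf32_le(text: str) -> bytes:
    """Store every code point as a 32 bit little-endian integer."""
    return b"".join(struct.pack("<I", ord(ch)) for ch in text)

# Matches Python's built-in codec (which would prepend a byte order mark
# unless an explicit endianness is requested, as with "utf-32-le" here).
assert encode_utf32_le("A€😀") == "A€😀".encode("utf-32-le")
```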
UTF-8
Due to the order of symbols in Unicode, the code points of the most common characters all have rather small values. UTF-8 works a bit like a compression algorithm: more common characters take up less memory, while larger code points take up more.
Up to a value of 127, only a single byte is used; this range (not coincidentally) covers all ASCII characters.
Once you need more than 7 bits, you have to split them among two or more bytes. This is where the term code unit comes in. Every byte in UTF-8 represents a single code unit, and all code units combined give you the code point. Of course, a code point can also be represented by a single code unit; the already discussed UTF-32 is an encoding where that is always the case.
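A quick way to see the difference, again sketched in Python: the same single code point turns into a different number of code units depending on the encoding.

```python
ch = "😀"  # one code point: U+1F600
print(len(ch.encode("utf-8")))           # 4 -> four 8 bit code units
print(len(ch.encode("utf-32-le")) // 4)  # 1 -> one 32 bit code unit
```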
The way the data is split is rather simple:
In a multi-byte sequence, the first byte fills its most significant bits with as many 1s as there are bytes in total, followed by a single 0. The rest of its bits are space for your code point. (A byte on its own just starts with a 0.)
All other bytes set their most significant bit to 1, followed by a single 0, leaving 6 bits of space each.
| Code point bits | UTF-8 bytes |
|---|---|
| abcdefg | 0abcdefg |
| abc defghijk | 110abcde 10fghijk |
| abcdefgh ijklmnop | 1110abcd 10efghij 10klmnop |
| abcde fghijklm nopqrstu | 11110abc 10defghi 10jklmno 10pqrstu |
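Following the rules and the table above, a minimal encoder for a single code point could look roughly like the sketch below (in Python; encode_utf8 is just an illustrative name, and a real implementation would also reject surrogates and out-of-range values):

```python
def encode_utf8(code_point: int) -> bytes:
    """Split a code point into UTF-8 code units following the table above."""
    if code_point < 0x80:      # 7 bits  -> 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:     # 11 bits -> 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:   # 16 bits -> 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 21 bits -> 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])

# One example per sequence length, checked against the built-in codec.
for ch in "Aé€😀":
    assert encode_utf8(ord(ch)) == ch.encode("utf-8")
```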
UTF-16
UTF-16 uses 16 bit code units; just like UTF-8, it lets you use a single code unit for the more common characters but requires 2 units for larger code points.
Unlike UTF-8, it requires some math and cannot be done by just moving bits around.
Up to a value of 65,535 (0xFFFF), you can store the code point as a 16 bit integer directly.
If the value is larger than this, you first have to subtract 65,536 (0x10000) from it.
Then take the lowest 10 bits (0-9) of the result and put the bit sequence 110111 in front to form the so-called low surrogate, the less significant code unit.
Then take the next higher 10 bits (10-19) and prepend the bit sequence 110110 to get the high surrogate.
Combined they give you a surrogate pair.
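Sketched in Python (to_surrogate_pair is an illustrative name, not a standard function), those steps look like this:

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above 0xFFFF into a high/low surrogate pair."""
    value = code_point - 0x10000   # leaves a 20 bit number
    high = 0xD800 | (value >> 10)  # 110110 followed by the upper 10 bits
    low = 0xDC00 | (value & 0x3FF) # 110111 followed by the lower 10 bits
    return high, low

assert to_surrogate_pair(0x1F600) == (0xD83D, 0xDE00)  # 😀 in UTF-16
```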
Now you might be wondering: How do you distinguish the code units in a surrogate pair from two single-unit code points?
The answer lies in the possible integer values a surrogate can have. Since the most significant bits are fixed, the high surrogate can only represent values from 55,296 (0xD800) to 56,319 (0xDBFF), and the low only 56,320 (0xDC00) to 57,343 (0xDFFF).
The range from 55,296 to 57,343 in Unicode is reserved for surrogates and will never be assigned to an actual character. To decode a UTF-16 encoded text, you simply have to check whether a code unit falls within the surrogate range. If not, you can use it as is. Otherwise it will be part of a pair (assuming the data is valid), where you combine the lower 10 bits of each unit (high surrogate first) to form one 20 bit number, to which you add 65,536 (0x10000).
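A minimal decoder sketch along those lines, again in Python and assuming valid input (the function name is my own):

```python
def decode_utf16_units(units: list[int]) -> list[int]:
    """Turn a sequence of 16 bit code units back into code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # high surrogate: a low one must follow
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:                         # a plain BMP code point, used as is
            code_points.append(unit)
            i += 1
    return code_points

assert decode_utf16_units([0x0041, 0xD83D, 0xDE00]) == [0x41, 0x1F600]  # "A😀"
```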
A different way to look at UTF-16:
Each code unit can represent 2^16 different values, but 2,048 of those are not used to represent any symbols; they are reserved for surrogates (1,024 high and 1,024 low).
By combining a high and a low surrogate, you get 1,024*1,024 = 2^20 possible values they can represent together, which, combined with the code points a single unit can encode, is enough to cover all of the (current) Unicode code points.
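A quick sanity check of that arithmetic (in Python):

```python
# The 2**20 values a surrogate pair can express exactly cover the
# 16 supplementary planes of 65,536 code points each.
assert 1024 * 1024 == 2**20 == 16 * 0x10000
# Together with the BMP that accounts for all 17 * 65,536 = 1,114,112 code points.
assert 0x10000 + 2**20 == 17 * 0x10000 == 0x110000
```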
