Unicode: Character Encoding Deep Dive
Key Insights
- Unicode separates the concept of characters (code points) from their binary representation (encodings), and confusing these two concepts is the root cause of most encoding bugs.
- UTF-8’s variable-width design makes it backwards-compatible with ASCII and memory-efficient for most text, which is why it dominates the web at over 98% adoption.
- String length, comparison, and truncation all behave unexpectedly with Unicode—a single visible character can be multiple code points, and identical-looking strings can have different byte representations.
The Pre-Unicode Problem
Before Unicode, character encoding was a mess. ASCII gave us 128 characters—enough for English, but useless for the rest of the world. The solution? Everyone invented their own encoding.
ISO-8859-1 covered Western European languages. ISO-8859-5 handled Cyrillic. Windows-1252 was Microsoft’s “extended ASCII.” Japan had Shift-JIS, EUC-JP, and ISO-2022-JP. China had GB2312 and Big5. None of these were compatible with each other.
The result was mojibake—garbled text that appears when you decode bytes using the wrong encoding:
# Text encoded in Windows-1252
original = "café résumé"
encoded_bytes = original.encode('windows-1252')
print(f"Bytes: {encoded_bytes}")
# Bytes: b'caf\xe9 r\xe9sum\xe9'
# Decoded with the wrong encoding
wrong_decode = encoded_bytes.decode('utf-8', errors='replace')
print(f"Mojibake: {wrong_decode}")
# Mojibake: caf� r�sum�
# Or worse, silently wrong interpretation
latin2_decode = encoded_bytes.decode('iso-8859-2')
print(f"Silent corruption: {latin2_decode}")
# Silent corruption: café résumé (looks right here only because 0xE9 maps
# to é in both encodings; other Windows-1252 bytes, like curly quotes or
# dashes, would silently decode to the wrong characters)
This wasn’t just annoying—it corrupted data, broke systems, and made internationalization a nightmare. Unicode was created to solve this by assigning every character a unique number, regardless of platform, program, or language.
Unicode Fundamentals: Code Points and Characters
Unicode assigns each character a code point—a number written as U+ followed by hexadecimal digits. The letter “A” is U+0041. The emoji “😀” is U+1F600.
But here’s where it gets tricky: what you see as a single “character” on screen might be multiple code points.
# A single visible character can be multiple code points
family_emoji = "👨👩👧👦"
print(f"What you see: {family_emoji}")
print(f"Length in Python: {len(family_emoji)}")
# Length in Python: 11
# Let's see the actual code points
code_points = [f"U+{ord(c):04X}" for c in family_emoji]
print(f"Code points: {code_points}")
# Code points: ['U+1F468', 'U+200D', 'U+1F469', 'U+200D', 'U+1F467', 'U+200D', 'U+1F466']
// JavaScript has the same issue
const family = "👨\u200D👩\u200D👧\u200D👦";
console.log(`String length: ${family.length}`);
// String length: 11 (UTF-16 code units, not visible characters)
// To iterate actual grapheme clusters (visible characters), use Intl.Segmenter
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = [...segmenter.segment(family)].map(s => s.segment);
console.log(`Grapheme count: ${graphemes.length}`);
// Grapheme count: 1
Unicode organizes code points into 17 planes of 65,536 code points each. Plane 0 (U+0000 to U+FFFF) is the Basic Multilingual Plane (BMP), containing most common characters. Planes 1-16 are supplementary planes, home to emoji, historic scripts, and rare characters.
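Since each plane spans exactly 0x10000 code points, a code point's plane index is just its value shifted right by 16 bits. A quick check:

```python
# Plane index = code point >> 16 (each plane spans 0x10000 code points)
for ch in ("A", "中", "😀"):
    cp = ord(ch)
    print(f"{ch!r} is U+{cp:04X}, plane {cp >> 16}")
# 'A' is U+0041, plane 0
# '中' is U+4E2D, plane 0
# '😀' is U+1F600, plane 1
```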
UTF-8: The Dominant Encoding
Code points are abstract numbers. To store or transmit them, you need an encoding. UTF-8 is a variable-width encoding that uses 1 to 4 bytes per code point:
| Code Point Range | Byte Pattern | Bytes |
|---|---|---|
| U+0000 to U+007F | 0xxxxxxx | 1 |
| U+0080 to U+07FF | 110xxxxx 10xxxxxx | 2 |
| U+0800 to U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx | 3 |
| U+10000 to U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 4 |
The genius of UTF-8 is that ASCII text is valid UTF-8. Every byte below 128 means exactly what it meant in ASCII. This made migration painless.
def utf8_encode_manual(code_point: int) -> bytes:
"""Manually encode a code point to UTF-8 bytes."""
if code_point <= 0x7F:
return bytes([code_point])
elif code_point <= 0x7FF:
return bytes([
0xC0 | (code_point >> 6),
0x80 | (code_point & 0x3F)
])
elif code_point <= 0xFFFF:
return bytes([
0xE0 | (code_point >> 12),
0x80 | ((code_point >> 6) & 0x3F),
0x80 | (code_point & 0x3F)
])
else:
return bytes([
0xF0 | (code_point >> 18),
0x80 | ((code_point >> 12) & 0x3F),
0x80 | ((code_point >> 6) & 0x3F),
0x80 | (code_point & 0x3F)
])
# Examples
print(f"'A' (U+0041): {utf8_encode_manual(0x0041).hex()}") # 41
print(f"'é' (U+00E9): {utf8_encode_manual(0x00E9).hex()}") # c3a9
print(f"'中' (U+4E2D): {utf8_encode_manual(0x4E2D).hex()}") # e4b8ad
print(f"'😀' (U+1F600): {utf8_encode_manual(0x1F600).hex()}") # f09f9880
UTF-8 now represents over 98% of web pages. It won because it’s compact for ASCII-heavy text, self-synchronizing (you can find character boundaries by looking at byte patterns), and doesn’t have byte-order issues.
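Self-synchronization follows directly from the byte patterns in the table: continuation bytes always match `10xxxxxx`, so any byte that doesn't is the start of a character. A small sketch (the helper name is illustrative):

```python
data = "héllo 世界".encode("utf-8")

def char_starts(b: bytes) -> list[int]:
    """Offsets where a UTF-8 character begins (any byte not matching 10xxxxxx)."""
    return [i for i, byte in enumerate(b) if byte & 0xC0 != 0x80]

print(char_starts(data))  # [0, 1, 3, 4, 5, 6, 7, 10]
```

Starting from any byte offset, you can scan backward or forward to the nearest character boundary without decoding the whole stream.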
UTF-16 and UTF-32: Alternatives and Trade-offs
UTF-16 uses 2 bytes for BMP characters and 4 bytes (as surrogate pairs) for supplementary plane characters. It’s used internally by Java, JavaScript, and Windows.
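The surrogate-pair encoding itself is simple arithmetic: subtract 0x10000, then split the remaining 20 bits between a high surrogate (0xD800 base) and a low surrogate (0xDC00 base). A sketch in Python (the helper name is my own):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point (U+10000..U+10FFFF)
    into a UTF-16 high/low surrogate pair."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

high, low = to_surrogate_pair(0x1F680)  # 🚀
print(f"{high:04x} {low:04x}")  # d83d de80
```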
The surrogate pair mechanism is a constant source of bugs:
// JavaScript strings are UTF-16 internally
const rocket = "🚀";
console.log(rocket.length); // 2, not 1!
// The rocket emoji U+1F680 becomes surrogate pair
console.log(rocket.charCodeAt(0).toString(16)); // d83d (high surrogate)
console.log(rocket.charCodeAt(1).toString(16)); // de80 (low surrogate)
// Naive substring breaks the character
const broken = rocket.substring(0, 1);
console.log(broken); // � (unpaired surrogate)
// Safe iteration with for...of or Array.from
console.log([...rocket].length); // 1
console.log(Array.from(rocket).length); // 1
// Safe substring
function safeSubstring(str, start, end) {
return Array.from(str).slice(start, end).join('');
}
UTF-32 uses exactly 4 bytes per code point. It’s simple—array indexing works perfectly—but wastes memory for most text. It’s rarely used for storage or transmission, but sometimes used internally for processing.
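The trade-off is easy to see by encoding the same short string in all three forms (using the `-le` variants to avoid a byte-order mark):

```python
s = "Hello 世界"  # 8 code points
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(f"{enc}: {len(s.encode(enc))} bytes")
# utf-8: 12 bytes
# utf-16-le: 16 bytes
# utf-32-le: 32 bytes
```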
Normalization and Comparison
Here’s a puzzle: are these two strings equal?
s1 = "caf\u00e9"   # 4 code points, precomposed 'é' (U+00E9)
s2 = "cafe\u0301"  # 5 code points, 'e' plus combining acute accent (U+0301)
print(f"s1 length: {len(s1)}") # 4
print(f"s2 length: {len(s2)}") # 5
print(f"s1 == s2: {s1 == s2}") # False!
print(f"They look the same: '{s1}' vs '{s2}'") # café vs café
Both render identically, but they’re different byte sequences. This breaks dictionary lookups, database queries, and security checks.
Unicode defines four normalization forms:
- NFC (Composed): Combines characters where possible (é as single code point)
- NFD (Decomposed): Separates into base + combining marks (e + ́)
- NFKC/NFKD: Also normalize compatibility characters (the ligature ﬁ U+FB01 becomes the two letters "fi")
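The compatibility forms are easy to demonstrate: NFC leaves a ligature and a superscript digit untouched, while NFKC folds them to plain characters:

```python
import unicodedata

for s in ("\ufb01", "\u00b2"):  # ﬁ ligature, superscript two
    nfc = unicodedata.normalize("NFC", s)
    nfkc = unicodedata.normalize("NFKC", s)
    print(f"{s!r}: NFC={nfc!r}, NFKC={nfkc!r}")
# 'ﬁ': NFC='ﬁ', NFKC='fi'
# '²': NFC='²', NFKC='2'
```

Because NFKC is lossy (ﬁ and fi become indistinguishable), it is usually reserved for identifiers and search keys, not for stored text.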
import unicodedata
s1 = "café" # precomposed
s2 = "cafe\u0301" # decomposed
# Normalize before comparing
s1_nfc = unicodedata.normalize('NFC', s1)
s2_nfc = unicodedata.normalize('NFC', s2)
print(f"After NFC normalization: {s1_nfc == s2_nfc}") # True
# For user-facing comparison, also consider case folding
def unicode_equal(a: str, b: str) -> bool:
"""Compare strings with normalization and case folding."""
return unicodedata.normalize('NFC', a.casefold()) == \
unicodedata.normalize('NFC', b.casefold())
print(unicode_equal("Café", "CAFÉ")) # True
Common Pitfalls and Security Concerns
Unicode introduces attack vectors that didn’t exist with ASCII.
Homoglyph attacks use characters that look identical but have different code points:
import unicodedata

def detect_homoglyphs(text: str) -> list:
    """Detect Cyrillic or Greek characters that may be Latin homoglyphs.

    Simplified: real detection uses the Unicode script property (UTS #39).
    """
    suspicious = []
    for i, char in enumerate(text):
        try:
            name = unicodedata.name(char)
        except ValueError:  # unnamed code point
            continue
        if name.startswith(('CYRILLIC', 'GREEK')):
            suspicious.append((i, char, name))
    return suspicious
# Looks like "apple.com" but uses Cyrillic 'а' (U+0430)
fake_domain = "аpple.com"
real_domain = "apple.com"
print(f"Visually identical: {fake_domain} vs {real_domain}")
print(f"Actually equal: {fake_domain == real_domain}") # False
Truncation bugs occur when you cut a string in the middle of a multi-byte sequence:
def safe_truncate(text: str, max_bytes: int, encoding: str = 'utf-8') -> str:
"""Truncate string to max bytes without breaking characters."""
encoded = text.encode(encoding)
if len(encoded) <= max_bytes:
return text
# Decode with error handling to find valid truncation point
truncated = encoded[:max_bytes]
while truncated:
try:
return truncated.decode(encoding)
except UnicodeDecodeError:
truncated = truncated[:-1]
return ""
# Truncating "Hello 世界" to 10 bytes
text = "Hello 世界"
print(f"Original bytes: {len(text.encode('utf-8'))}") # 12
print(f"Safe truncate: '{safe_truncate(text, 10)}'") # "Hello 世" (9 bytes)
Best Practices for Engineers
Always declare encodings explicitly. Never assume.
# Python 3 source files are UTF-8 by default; the explicit declaration
# below is only needed for Python 2 or for non-UTF-8 source files
# -*- coding: utf-8 -*-
# Open files with explicit encoding
with open('data.txt', 'r', encoding='utf-8') as f:
content = f.read()
Configure databases correctly:
-- MySQL: Use utf8mb4, not utf8 (which only supports BMP)
CREATE DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- PostgreSQL: UTF-8 is the default and correct choice
CREATE DATABASE myapp ENCODING 'UTF8';
Use Unicode-aware regex:
import regex # Third-party library with better Unicode support
# Match any letter from any language
pattern = regex.compile(r'\p{L}+')
text = "Hello мир 世界"
print(pattern.findall(text)) # ['Hello', 'мир', '世界']
# Standard library alternative (less comprehensive)
import re
pattern = re.compile(r'\w+')  # \w is Unicode-aware by default in Python 3; re.UNICODE is redundant
Test with adversarial input. Include emoji, RTL text, combining characters, and zero-width characters in your test suite. If your application handles user input, assume users will paste the entire Unicode specification into every field.
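A starting point for such a suite, with a non-exhaustive list of adversarial strings (the constant name is my own):

```python
ADVERSARIAL_INPUTS = [
    "👨\u200d👩\u200d👧\u200d👦",  # ZWJ emoji sequence
    "cafe\u0301",                  # combining accent (decomposed form)
    "\u202eright-to-left\u202c",   # bidi override characters
    "zero\u200bwidth",             # zero-width space
    "\ufeffleading BOM",           # byte-order mark
    "\ufb01nancial",               # compatibility ligature
]

# Minimal sanity check: whatever your code does with these strings,
# they should at least survive a UTF-8 round trip unchanged
for s in ADVERSARIAL_INPUTS:
    assert s.encode("utf-8").decode("utf-8") == s
```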
Unicode isn’t optional knowledge anymore. Every string your code touches is Unicode, whether you realize it or not. Understanding how it works—and where it breaks—is fundamental to building robust software.