Unicode Encoding Rules

Unicode assigns each character a unique code point. To store text, a computer converts code points into byte sequences using an encoding scheme (UTF-8, UTF-16, etc.).
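
As a quick illustration of the code point versus byte sequence distinction, the following sketch uses Python's built-in codecs on the sample character U+4F60. The variable names are illustrative only, and the big-endian, BOM-free codec variants are an assumption chosen so the bytes match the hex shown on this page.

```python
# Encode one character three ways and show the resulting bytes.
ch = "\u4f60"  # the sample character U+4F60

code_point = ord(ch)                  # abstract number assigned by Unicode
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")  # big-endian, no BOM (assumed for display)
utf32_bytes = ch.encode("utf-32-be")  # big-endian, no BOM (assumed for display)

print(f"U+{code_point:04X}")           # U+4F60
print(utf8_bytes.hex(" ").upper())     # E4 BD A0
print(utf16_bytes.hex(" ").upper())    # 4F 60
print(utf32_bytes.hex(" ").upper())    # 00 00 4F 60
```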

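Planes: The Basic Multilingual Plane (BMP, 0000-FFFF) contains the vast majority of commonly used characters. The supplementary planes (SMP, SIP, etc.) contain Emoji and rare characters.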
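
Since each plane spans 0x10000 code points, the plane index can be read off the bits above the low 16. A minimal sketch, where `plane_of` is a hypothetical helper name:

```python
def plane_of(code_point: int) -> int:
    """Return the plane index: 0 = BMP, 1 = SMP, 2 = SIP, and so on."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    return code_point >> 16  # each plane holds 0x10000 code points

print(plane_of(0x4F60))   # 0 (BMP)
print(plane_of(0x1F600))  # 1 (SMP, an Emoji)
print(plane_of(0x20000))  # 2 (SIP, a rare CJK ideograph)
```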

Input character:

U+4F60

UTF-8 Encoding

3 Bytes

Hex Output

E4 BD A0

Binary Pattern

11100100 10111101 10100000

Variable-length encoding (1-4 bytes). ASCII characters take only 1 byte. The high bits of the first byte indicate the length of the sequence.
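
The length check on the first byte might look like the sketch below. `utf8_sequence_length` is a hypothetical name, and the function only matches the leading bit pattern; it does not reject lead bytes that are otherwise invalid in UTF-8 (such as C0 or F5).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Infer the UTF-8 sequence length from the high bits of the lead byte."""
    if (first_byte & 0b1000_0000) == 0b0000_0000:
        return 1  # 0xxxxxxx: ASCII
    if (first_byte & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx
    if (first_byte & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx
    if (first_byte & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte, never the start of a sequence.
    raise ValueError("not a UTF-8 lead byte")

print(utf8_sequence_length(0x41))  # 1
print(utf8_sequence_length(0xE4))  # 3 (first byte of E4 BD A0)
```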

UTF-16 Encoding

2 Bytes

Hex Output

4F60

Basic Multilingual Plane (BMP). No surrogates needed.

Variable-length encoding (2 or 4 bytes). BMP characters take 2 bytes. Supplementary-plane characters are represented with surrogate pairs.
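
The 2-or-4-byte rule can be checked against Python's built-in codec. In this sketch `utf16_byte_length` is a hypothetical helper, and big-endian output without a BOM is assumed.

```python
def utf16_byte_length(code_point: int) -> int:
    """BMP code points need one 16-bit unit, all others need a surrogate pair."""
    return 2 if code_point <= 0xFFFF else 4

for cp in (0x4F60, 0x1F600):
    encoded = chr(cp).encode("utf-16-be")  # big-endian, no BOM
    assert len(encoded) == utf16_byte_length(cp)
    print(f"U+{cp:04X} -> {encoded.hex(' ').upper()}")
# U+4F60 -> 4F 60
# U+1F600 -> D8 3D DE 00
```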

UTF-32 Encoding

4 Bytes

Hex Output

00004F60

Fixed-length encoding (4 bytes). Stores the code point value directly; simple to process but space-inefficient.
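
Because UTF-32 stores the code point value as-is, encoding reduces to writing a 4-byte integer. A minimal sketch, assuming big-endian order; `utf32_be` is a hypothetical name. The loop also prints the UTF-8 size for comparison, which shows the space cost.

```python
def utf32_be(code_point: int) -> bytes:
    """Store the code point value directly as 4 big-endian bytes."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    return code_point.to_bytes(4, "big")

for ch in ("A", "\u4f60"):
    cp = ord(ch)
    print(f"U+{cp:04X}: UTF-32 {utf32_be(cp).hex().upper()}, "
          f"UTF-8 takes {len(ch.encode('utf-8'))} byte(s)")
# U+0041: UTF-32 00000041, UTF-8 takes 1 byte(s)
# U+4F60: UTF-32 00004F60, UTF-8 takes 3 byte(s)
```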

Byte Order Mark (BOM)

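Used to identify byte order. UTF-8: EF BB BF (an encoding signature only, since UTF-8 has no byte-order variants). UTF-16 LE: FF FE (little-endian). UTF-16 BE: FE FF (big-endian).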
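
A BOM check on the first bytes of a buffer could be sketched as follows, using the constants from Python's standard `codecs` module. `detect_bom` and the `BOMS` table are hypothetical names, and only the three BOMs listed above are handled. UTF-32 BOMs are deliberately left out, so a UTF-32 LE stream (which begins FF FE 00 00) would be misreported as UTF-16 LE by this sketch.

```python
import codecs

# Only the three BOMs listed on this page; UTF-32 is not handled here.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(detect_bom(b"\xef\xbb\xbf\xe4\xbd\xa0"))  # ('utf-8', 3)
print(detect_bom(b"\xff\xfe\x60\x4f"))          # ('utf-16-le', 2)
print(detect_bom(b"\xe4\xbd\xa0"))              # (None, 0)
```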

UTF-8 Bit Pattern Rules

Bytes	Pattern (x = data bit)
1	0xxxxxxx
2	110xxxxx 10xxxxxx
3	1110xxxx 10xxxxxx 10xxxxxx
4	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
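
The table above can be applied by hand with shifts and masks. The sketch below is one way to do it, not a replacement for a library codec; `encode_utf8` is a hypothetical name, and the range boundaries (0x7F, 0x7FF, 0xFFFF) follow from the number of data bits available in each pattern.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by filling the x bits of the matching pattern."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp <= 0x7F:        # 7 data bits:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 11 data bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 16 data bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 data bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(encode_utf8(0x4F60).hex(" ").upper())   # E4 BD A0
print(encode_utf8(0x1F600).hex(" ").upper())  # F0 9F 98 80
```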

UTF-16 Surrogate Formula

Required when Code Point > 0xFFFF (e.g. Emoji).

Subtract 0x10000 from the code point, leaving a 20-bit value.
Take the top 10 bits and add them to 0xD800 (high surrogate).
Take the low 10 bits and add them to 0xDC00 (low surrogate).

High: D800-DBFF
Low: DC00-DFFF
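
The surrogate formula can be sketched in a few lines. `to_surrogate_pair` is a hypothetical name, and U+1F600 is used here only as an example of a code point above 0xFFFF.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("surrogates are only used above U+FFFF")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits  -> D800-DBFF
    low = 0xDC00 + (offset & 0x3FF)    # low 10 bits  -> DC00-DFFF
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```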