
Uni Code Image

Uploaded by

senbeth11
Copyright
© © All Rights Reserved

UTF-8 Encoding & Decoding — Zero-Knowledge Guide

Files and memory hold bytes, not characters. UTF-8 defines how to turn characters into bytes
(encoding) and bytes back into characters (decoding).

1) Bits, bytes, hex (what is C3?)


A byte is 8 bits. We write a byte as two hexadecimal (hex) digits. Each hex digit = 4 bits.
Example: hex C3 ⇒ C = 12 = 1100, 3 = 0011 ⇒ 1100 0011.
So C3 “becomes” 1100 0011 simply by replacing each hex digit with its 4 bits (hex → binary).

0=0000 1=0001 2=0010 3=0011 4=0100 5=0101 6=0110 7=0111
8=1000 9=1001 A=1010 B=1011 C=1100 D=1101 E=1110 F=1111
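
The hex → binary lookup above can be checked with a few lines of Python. This is a minimal sketch; the function name hex_byte_to_bits is made up for illustration and is not part of any library.

```python
def hex_byte_to_bits(hex_pair: str) -> str:
    """Turn a two-digit hex byte such as 'C3' into its 8 bits, shown as two groups of 4."""
    value = int(hex_pair, 16)        # 'C3' -> 195
    bits = format(value, "08b")      # 195 -> '11000011' (always padded to 8 bits)
    return bits[:4] + " " + bits[4:]


print(hex_byte_to_bits("C3"))  # 1100 0011
```

Each group of 4 bits in the output corresponds to one hex digit in the table.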

2) UTF-8 lead/continuation prefixes


Look at the first bits of the first byte:

• 0xxxxxxx → 1 byte total (ASCII range).

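• 110xxxxx → 2 bytes total (followed by one continuation byte starting with 10).

• 1110xxxx → 3 bytes total (followed by two continuation bytes, each starting with 10).

• 11110xxx → 4 bytes total (followed by three continuation bytes, each starting with 10).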

• Continuation bytes always start 10xxxxxx.
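
The prefix rules above can be sketched as a small Python function. The name utf8_sequence_length is hypothetical, and the sketch only inspects the prefix bits; it does not check every rule a full UTF-8 validator would apply.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a character uses, judged only from its first byte's prefix."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte, so it cannot start a character
    raise ValueError(f"0x{first_byte:02X} is not a valid first byte")


print(utf8_sequence_length(0x41))  # 1
print(utf8_sequence_length(0xC3))  # 2
```

Shifting right discards the x bits so that only the prefix is compared.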

3) ENCODING by hand (char ⇒ bytes)


Example: Encode ‘£’ (U+00A3).
Step 1: U+00A3 = hex A3 = binary 1010 0011.
Step 2: U+00A3 lies in the range U+0080–U+07FF ⇒ use the 2-byte template 110xxxxx 10xxxxxx.
Step 3: Fill the x’s from right to left. The last 6 bits (100011) → 2nd byte: 10 100011 = 1010 0011 (A3).
The remaining bits (10), left-padded with zeros to 5 bits → 1st byte: 00010 ⇒ 110 00010 = 1100 0010 (C2).
Answer: C2 A3.
Another quick one: ‘é’ (U+00E9 = 1110 1001) ⇒ 110 00011, 10 101001 ⇒ C3 A9.
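
The hand method above can be mirrored in Python for the 2-byte case. This is a sketch only: encode_two_byte is a made-up helper that covers just U+0080–U+07FF, and in practice Python's built-in str.encode("utf-8") does the whole job.

```python
def encode_two_byte(ch: str) -> bytes:
    """Encode one character from U+0080..U+07FF using the 110xxxxx 10xxxxxx template."""
    cp = ord(ch)                                # code point, e.g. 0xA3 for the pound sign
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("this sketch only handles the 2-byte range")
    second = 0b10000000 | (cp & 0b111111)       # prefix 10 + last 6 bits
    first = 0b11000000 | (cp >> 6)              # prefix 110 + remaining bits (up to 5)
    return bytes([first, second])


print(encode_two_byte("£").hex(" ").upper())    # C2 A3
print(encode_two_byte("é").hex(" ").upper())    # C3 A9
```

Comparing the result with "£".encode("utf-8") is a quick way to check hand-worked answers.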

4) DECODING by hand (bytes ⇒ char)


Example: Decode C3 A9.
Step 1: C3⇒1100 0011 (starts with 110 ⇒ 2-byte char). A9⇒1010 1001.
Step 2: Strip prefixes: from first drop 110 ⇒ 00011; from second drop 10 ⇒ 101001.
Step 3: Join bits: 00011 101001 = 000 1110 1001; dropping the leading zeros gives 1110 1001 = hex E9 = U+00E9 = ‘é’.
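
The three decoding steps can be sketched in Python for the 2-byte case. The helper decode_two_byte is hypothetical; it checks only the prefixes and does not reject every invalid sequence the way bytes.decode("utf-8") does.

```python
def decode_two_byte(data: bytes) -> str:
    """Decode a 2-byte UTF-8 sequence by stripping the 110 and 10 prefixes."""
    if len(data) != 2:
        raise ValueError("expected exactly 2 bytes")
    first, second = data
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError("bytes do not match 110xxxxx 10xxxxxx")
    high = first & 0b11111        # drop 110, keep 5 bits
    low = second & 0b111111       # drop 10, keep 6 bits
    return chr((high << 6) | low) # join the bits into one code point


ch = decode_two_byte(bytes.fromhex("C3 A9"))
print(f"U+{ord(ch):04X}")  # U+00E9
```

The masks keep exactly the x bits from the template, and the shift places the first byte's bits in front of the second byte's six.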

5) What to remember for exams


• ASCII characters stay 1 byte in UTF-8. All other characters use 2–4 bytes.

• A first byte starting with 0 is a 1-byte character. Otherwise, count the leading 1s in the first byte to know how many bytes long the character is (110 ⇒ 2, 1110 ⇒ 3, 11110 ⇒ 4).

• Show working and units (bytes) when asked for file sizes.
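
For file-size questions, a hand count can be checked in Python by encoding the text and counting bytes. This is a small sketch; the function name utf8_size_in_bytes and the sample string are made up for illustration.

```python
def utf8_size_in_bytes(text: str) -> int:
    """Number of bytes the text occupies when stored as UTF-8."""
    return len(text.encode("utf-8"))


sample = "caf\u00e9 \u00a35"   # the 7 characters: c a f é (space) £ 5
# 5 ASCII characters x 1 byte + 2 non-ASCII characters x 2 bytes = 9 bytes
print(len(sample), utf8_size_in_bytes(sample))  # 7 9
```

The character count and the byte count differ as soon as the text contains anything outside ASCII.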
