Unicode Explorer: binary search over HTTP Range requests by simonw · Pull Request #90 · simonw/research

simonw · 2026-02-27T15:06:08Z

Started with a spec I generated using Claude Opus chat: https://claude.ai/share/47860666-cb20-44b5-8cdb-d0ebe363384f

Build: Unicode Explorer — Binary Search Over HTTP

What This Is

A demo that performs binary search via HTTP Range requests against a single static file. No backend, no database, no dependencies. Every step of the binary search is a real network fetch — the browser reads one 256-byte record at a time, compares, narrows the range, and fetches again. The network log shows ~14 sequential requests zeroing in on the answer out of 150,000+ records.
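The request count follows from the record count. As a rough check (a sketch, assuming a round 150,000 records; the real total comes from the build), the worst case for binary search over n sorted records is floor(log2(n)) + 1 probes:

```python
def max_probes(total_records: int) -> int:
    """Worst-case probes for binary search over `total_records` sorted items.

    For n >= 1, floor(log2(n)) + 1 equals n.bit_length(), which avoids floats.
    """
    return total_records.bit_length()


TOTAL_RECORDS = 150_000  # assumed round figure, not the real build output
SKIPPED = 3              # steps answered locally instead of over the network

worst_case = max_probes(TOTAL_RECORDS)
print("worst-case probes:", worst_case)
print("worst-case network fetches:", worst_case - SKIPPED)
```

Most lookups finish a step or so before the worst case, which is consistent with the ~14 requests quoted above.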

Architecture

1. Build Script (build.py, Python)

A one-time data preparation step. Uses only the Python standard library (urllib, json, os, etc). No pip packages.

It downloads the source data:

UnicodeData.txt from https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt

Blocks.txt from https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt

And produces:

data/unicode-data.bin — Every assigned Unicode character as one fixed-width record (256 bytes, UTF-8, right-padded with spaces), one per line, sorted by codepoint. Each record is a JSON object:

cp — codepoint (integer)

name — official Unicode name (e.g. “BLACK HEART SUIT”)

cat — general category abbreviation (e.g. “Lu”, “Nd”, “So”)

block — block name (e.g. “Basic Latin”, “Miscellaneous Symbols”)

data/meta.json — A tiny file:

recordWidth — 256

totalRecords — the count

signposts — an array of 8 objects (see below)

totalBytes — total file size of unicode-data.bin (for the summary display)

If any record’s JSON exceeds 256 bytes, either truncate the name or increase the record width globally. Verify this during the build and report any overflow.
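A minimal sketch of the record-building step, under stated assumptions: the helper names (parse_unicode_data_line, parse_blocks, block_for, encode_record) are hypothetical, the "No_Block" fallback is a placeholder, and the final byte of each record is a newline as one way to read "one per line". Range entries in UnicodeData.txt (names such as `<CJK Ideograph, First>`) are not expanded here.

```python
import json

RECORD_WIDTH = 256  # bytes per record, from the spec


def parse_unicode_data_line(line):
    """Return (codepoint, name, category) from one UnicodeData.txt line."""
    fields = line.split(";")
    return int(fields[0], 16), fields[1], fields[2]


def parse_blocks(text):
    """Parse Blocks.txt lines of the form 'START..END; Block Name'."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, name = line.split(";", 1)
        start, end = span.strip().split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    return blocks


def block_for(cp, blocks):
    """Linear scan, fine for a one-time build. The fallback label is an assumption."""
    for start, end, name in blocks:
        if start <= cp <= end:
            return name
    return "No_Block"


def encode_record(cp, name, cat, block, width=RECORD_WIDTH):
    """Serialize one record as JSON, space-padded to exactly `width` bytes.

    Assumption: the last byte is a newline so the file still reads line by line.
    """
    raw = json.dumps({"cp": cp, "name": name, "cat": cat, "block": block}).encode("utf-8")
    if len(raw) > width - 1:
        # Report overflow instead of silently truncating.
        raise ValueError(f"U+{cp:04X}: {len(raw)} bytes of JSON does not fit in {width}")
    return raw.ljust(width - 1, b" ") + b"\n"


blocks = parse_blocks("# sample\n2600..26FF; Miscellaneous Symbols\n")
cp, name, cat = parse_unicode_data_line("2665;BLACK HEART SUIT;So;0;ON;;;;;N;;;;;")
record = encode_record(cp, name, cat, block_for(cp, blocks))
print(len(record), json.loads(record.decode("utf-8").strip()))
```

Raising on overflow keeps the "verify and report" requirement explicit; a real build could instead collect all offenders and then decide whether to truncate names or widen the record.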

Signposts

The first 3 steps of binary search over ~150k records always probe the same small set of indices: the ½ point, then the ¼ or ¾ point, then one of the ⅛, ⅜, ⅝, or ⅞ points. These are deterministic: they depend only on the total record count. The build script samples the codepoints at the 8 evenly-spaced ⅛-point positions and includes them in meta.json as:
"signposts": [
  { "idx": 0, "cp": 0 },
  { "idx": 18750, "cp": 4988 },
  { "idx": 37500, "cp": 8712 },
  ...
]
The client uses these to skip the first ~3 binary search steps without any network requests. It walks the signposts array to narrow to a ⅛th segment in microseconds, then begins the Range request loop from there. This saves 3 round trips per search.

The signposts should appear in the network log UI as greyed-out “cached” rows so the user understands what happened, before the live fetch rows begin.
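The signpost idea can be sketched in two small functions, one for the build side and one modeling what the client does. This is an illustration under assumptions: the function names are hypothetical, the demo data is synthetic, and the client itself is JavaScript rather than Python.

```python
def build_signposts(codepoints, count=8):
    """Sample (idx, cp) at `count` evenly spaced positions of a sorted codepoint list."""
    total = len(codepoints)
    return [
        {"idx": (i * total) // count, "cp": codepoints[(i * total) // count]}
        for i in range(count)
    ]


def narrow_with_signposts(target, signposts, total_records):
    """Return (lo, hi) record-index bounds that must contain `target` if it exists."""
    lo, hi = 0, total_records - 1
    for sp in signposts:
        if sp["cp"] <= target:
            lo = sp["idx"]          # target is at or after this signpost
        else:
            hi = sp["idx"] - 1      # target is strictly before this signpost
            break
    return lo, hi


# Synthetic stand-in for the sorted codepoints: 800 records, cp = 2 * idx.
codepoints = list(range(0, 1600, 2))
signposts = build_signposts(codepoints)
print(narrow_with_signposts(450, signposts, len(codepoints)))
```

Walking 8 entries in memory replaces the first three network probes; the Range request loop then starts from the returned bounds.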

2. Client (index.html, single file)

Vanilla HTML/CSS/JS. No build step, no framework, no external dependencies of any kind.

The core mechanic:

On page load, fetch data/meta.json to learn recordWidth, totalRecords, and signposts.

When the user searches for a character:

Determine the codepoint (from pasted character via codePointAt(0), or from hex input like 2665 or U+2665)

Walk the signposts to find the tightest lo/hi bounds without any network requests

Loop (the binary search over the network):

mid = Math.floor((lo + hi) / 2)

fetch("data/unicode-data.bin", { headers: { Range: "bytes=" + (mid * recordWidth) + "-" + ((mid + 1) * recordWidth - 1), "Accept-Encoding": "identity" } })

Parse the response text, trim padding, JSON.parse() it

Compare record.cp to the target codepoint

If equal → found it, display result

If record.cp < target → lo = mid + 1

If record.cp > target → hi = mid - 1

If lo > hi → not found

Each iteration of this loop is a real await fetch(). This is the whole point.

Log every step to the network log panel as it happens — signpost steps appear instantly (greyed out / marked as cached), then fetch rows appear one by one in real time.
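The steps above can be modeled in Python to check the logic before writing the client. In this sketch fetch_record stands in for the real `await fetch()` with a Range header and slices an in-memory byte string instead; make_demo_file, make_fetch_record, parse_codepoint, and search are hypothetical names, and the demo records are synthetic.

```python
import json

RECORD_WIDTH = 256


def parse_codepoint(text):
    """Accept a pasted character, or hex such as '2665' or 'U+2665'."""
    text = text.strip()
    if text.upper().startswith("U+"):
        return int(text[2:], 16)
    if len(text) == 1:
        return ord(text)
    try:
        return int(text, 16)
    except ValueError:
        return ord(text[0])  # multi-codepoint paste: use the first, like codePointAt(0)


def make_demo_file(codepoints):
    """Build an in-memory stand-in for unicode-data.bin (synthetic records)."""
    return b"".join(
        json.dumps({"cp": cp, "name": f"DEMO {cp:04X}"}).encode("utf-8").ljust(RECORD_WIDTH, b" ")
        for cp in codepoints
    )


def make_fetch_record(data):
    """Return a function that reads one record by index, mimicking a Range fetch."""
    def fetch_record(index):
        start = index * RECORD_WIDTH
        end = start + RECORD_WIDTH - 1  # inclusive, as in an HTTP Range header
        chunk = data[start:end + 1]
        return f"bytes={start}-{end}", json.loads(chunk.decode("utf-8").strip())
    return fetch_record


def search(target, lo, hi, fetch_record):
    """Binary search where every comparison costs one fetch. Returns (record, log)."""
    log = []
    while lo <= hi:
        mid = (lo + hi) // 2
        range_header, record = fetch_record(mid)
        log.append(range_header)
        if record["cp"] == target:
            return record, log
        if record["cp"] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, log  # lo > hi: not found


data = make_demo_file(range(0, 3000, 3))  # 1,000 records, cp = 3 * idx
fetch = make_fetch_record(data)
record, log = search(parse_codepoint("U+0735"), 0, 999, fetch)
print(record, len(log), "fetches")
```

In the real client the (lo, hi) pair passed to the loop would come from the signposts rather than spanning the whole file.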

UI:

A text input for character or hex codepoint input

A result area showing the character rendered large (~120px), its name, codepoint (hex and decimal), category, and block

A network log panel that is the visual centerpiece of the demo. It’s a table that grows row by row in real time as the binary search runs. Columns:

Step number (1, 2, 3…)

Source — “signpost” (greyed out) or the Range header (e.g. bytes=19200000-19200255)

Record — what was found (e.g. U+4E09 CJK UNIFIED IDEOGRAPH-4E09)

Comparison result (e.g. 128128 > 19977 → go right)

Response time — “cached” for signposts, ms for fetches

A summary line at the end: “Found in 14 requests · 3,584 bytes transferred · full file is 38MB · 3 steps skipped via signposts”
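The numbers in the summary line are simple arithmetic on values the client already has. A sketch, assuming totalBytes is 150,000 × 256, that "MB" means 10^6 bytes, and that only response bodies are counted (HTTP headers are ignored); format_summary is a hypothetical name:

```python
def format_summary(fetches, skipped, record_width, total_bytes):
    """Build the summary line from the search counters and meta.json values."""
    transferred = fetches * record_width  # response bodies only
    return (
        f"Found in {fetches} requests · {transferred:,} bytes transferred · "
        f"full file is {total_bytes / 1_000_000:.0f}MB · "
        f"{skipped} steps skipped via signposts"
    )


line = format_summary(fetches=14, skipped=3, record_width=256, total_bytes=150_000 * 256)
print(line)
```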

Design:

Clean, minimal, monochrome with one accent color

Mobile-friendly

The network log should feel alive — signpost rows appear instantly, then fetch rows appear one by one with real timing, making the binary search visible as a process

3. Static File Server

For local development, use npx serve . or any static server that supports Range requests.

Critical: Range requests for the data file must include Accept-Encoding: identity to prevent the server from compressing the response. Range requests and Content-Encoding don’t mix here: the client computes byte offsets against the uncompressed file, a server that compresses the response may apply the range to the compressed bytes instead, and a compressed stream can’t be decompressed from an arbitrary midpoint. Setting Accept-Encoding: identity tells the server to send raw bytes. The meta.json fetch doesn’t need this since it’s downloaded in full.
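One way to confirm that a local server honors single-record Range requests is a small standard-library check. This is a sketch under assumptions: the URL and port are placeholders for wherever the static server is listening, and the function names are hypothetical. A correct response has status 206, a matching Content-Range header, and exactly one record's worth of bytes; a 200 means the server ignored the Range header and sent the whole file.

```python
import urllib.request

# Assumed local URL; adjust to match the static server in use.
DATA_URL = "http://localhost:3000/data/unicode-data.bin"


def range_request(url, index, record_width=256):
    """Build a request for exactly one record, asking for uncompressed bytes."""
    start = index * record_width
    end = start + record_width - 1  # Range end is inclusive
    return urllib.request.Request(url, headers={
        "Range": f"bytes={start}-{end}",
        "Accept-Encoding": "identity",
    })


def looks_like_single_record(status, content_range, body, index, record_width=256):
    """True if the response is a 206 carrying exactly the requested record."""
    start = index * record_width
    end = start + record_width - 1
    return (
        status == 206
        and content_range is not None
        and content_range.startswith(f"bytes {start}-{end}/")
        and len(body) == record_width
    )


def check_server(index=0, url=DATA_URL):
    """Perform one real Range request (needs a running server; not called here)."""
    with urllib.request.urlopen(range_request(url, index)) as resp:
        body = resp.read()
        return looks_like_single_record(
            resp.status, resp.headers.get("Content-Range"), body, index
        )


req = range_request(DATA_URL, 75000)
print(req.header_items())
```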

Key Constraints

The client must NEVER download the full data file. Every access to it is a single-record Range request.

The binary search happens over the network, not in memory. Each comparison is a real HTTP round trip. The only exception is the signpost pre-narrowing.

Fixed-width records are essential — this is what makes byte offset computation possible (offset = recordIndex * recordWidth). No index file needed.

Zero external dependencies in the client. Vanilla HTML/CSS/JS only.

The build script uses Python standard library only. No pip packages.

Range fetches must use Accept-Encoding: identity to prevent compression from breaking byte offsets.

What Success Looks Like

Someone opens the page, pastes 💀 into the box, and watches the network log fill up:
Step  Source                   Record                        Comparison              Time
1     signpost                 U+0000 NULL                   128128 > 0 → right      cached
2     signpost                 U+4E09 (19977)                128128 > 19977 → right   cached
3     signpost                 U+9FA5 (40869)                128128 > 40869 → right   cached
4     bytes=28672000-28672255  U+E114 (57620)                128128 > 57620 → right   12ms
5     bytes=31334400-31334655  U+1147A (70778)               128128 > 70778 → right   8ms
...
17    bytes=32563200-32563455  U+1F480 SKULL                 ✓ Found!                 7ms
Summary: “Found SKULL in 17 steps (3 cached, 14 fetched) · 3,584 bytes transferred · full file is 38MB”

That table, filling in live, is the entire demo. It makes the binary search tangible.

Run "uvx rodney --help" and use that to test your work

A demo that performs binary search via HTTP Range requests against a
single static file. build.py downloads Unicode character data and
produces a 76MB fixed-width binary file (256 bytes/record). index.html
performs real network binary search — each comparison is an HTTP Range
fetch of one record. Signposts skip the first ~3 steps. A live network
log table shows each step happening in real time.

https://claude.ai/code/session_014MbuxXYimWKjqRo88xQuch

Excludes UnicodeData.txt, Blocks.txt (fetched source data) and unicode-data.bin (76MB generated binary). All are re-created by running build.py. https://claude.ai/code/session_014MbuxXYimWKjqRo88xQuch

Deployed demo for simonw/research#90 - see research report at https://github.com/simonw/research/tree/main/unicode-explorer-binary-search#readme

simonw · 2026-02-27T18:04:24Z

Wrote this up here: https://simonwillison.net/2026/Feb/27/unicode-explorer/

claude added 2 commits February 27, 2026 15:04

simonw merged commit 4ab48e0 into main Feb 27, 2026

simonw merged 2 commits into main from claude/unicode-explorer-binary-search-tyuUS
