
Unicode Explorer: binary search over HTTP Range requests #90

Merged: simonw merged 2 commits into main from claude/unicode-explorer-binary-search-tyuUS on Feb 27, 2026

@simonw simonw commented Feb 27, 2026

Started with a spec I generated using Claude Opus chat: https://claude.ai/share/47860666-cb20-44b5-8cdb-d0ebe363384f

Build: Unicode Explorer — Binary Search Over HTTP

What This Is

A demo that performs binary search via HTTP Range requests against a single static file. No backend, no database, no dependencies. Every step of the binary search is a real network fetch — the browser reads one 256-byte record at a time, compares, narrows the range, and fetches again. The network log shows ~14 sequential requests zeroing in on the answer out of 150,000+ records.

Architecture

1. Build Script (build.py, Python)

A one-time data preparation step. Uses only the Python standard library (urllib, json, os, etc). No pip packages.

It downloads the source data (UnicodeData.txt and Blocks.txt) and produces:

  • data/unicode-data.bin — Every assigned Unicode character as one fixed-width record (256 bytes, UTF-8, right-padded with spaces), one per line, sorted by codepoint. Each record is a JSON object:
    • cp — codepoint (integer)
    • name — official Unicode name (e.g. “BLACK HEART SUIT”)
    • cat — general category abbreviation (e.g. “Lu”, “Nd”, “So”)
    • block — block name (e.g. “Basic Latin”, “Miscellaneous Symbols”)
  • data/meta.json — A tiny file:
    • recordWidth — 256
    • totalRecords — the count
    • signposts — an array of 8 objects (see below)
    • totalBytes — total file size of unicode-data.bin (for the summary display)

If any record’s JSON exceeds 256 bytes, either truncate the name or increase the record width globally. Verify this during the build and report any overflow.
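As a rough illustration of the record format and overflow check described above, here is a minimal Python sketch. The function name `pack_record` is hypothetical, and the choice to count the trailing newline inside the 256 bytes is an assumption; the spec says "one per line" but does not say where the newline lives.

```python
import json

RECORD_WIDTH = 256  # bytes per record, as stated in the spec


def pack_record(cp, name, cat, block, width=RECORD_WIDTH):
    """Serialize one character as JSON, right-padded with spaces to `width` bytes.

    Assumption: the final byte of each record is a newline, so the JSON
    payload may use at most `width - 1` bytes.
    """
    payload = json.dumps(
        {"cp": cp, "name": name, "cat": cat, "block": block},
        separators=(",", ":"),  # compact form leaves more room for long names
    ).encode("utf-8")
    if len(payload) > width - 1:
        # Report the overflow instead of silently truncating.
        raise ValueError(f"record for U+{cp:04X} needs {len(payload)} bytes")
    return payload.ljust(width - 1, b" ") + b"\n"


record = pack_record(0x2665, "BLACK HEART SUIT", "So", "Miscellaneous Symbols")
print(len(record))  # 256
```

Raising on overflow matches the "verify and report" instruction; a real build might instead collect every oversized record and decide once whether to truncate names or widen all records.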

Signposts

The first 3 steps of binary search over ~150k records always land on the same small set of indices: the ½ point, then the ¼ or ¾ point, then one of the remaining ⅛ points. These are deterministic; they depend only on the total record count. The build script samples the codepoints at the 8 evenly-spaced ⅛-point positions (indices 0, N/8, 2N/8, ... 7N/8 for N records) and includes them in meta.json as:

"signposts": [
  { "idx": 0, "cp": 0 },
  { "idx": 18750, "cp": 4988 },
  { "idx": 37500, "cp": 8712 },
  ...
]

The client uses these to skip the first ~3 binary search steps without any network requests. It walks the signposts array to narrow to a ⅛th segment in microseconds, then begins the Range request loop from there. This saves 3 round trips per search.
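The signpost idea can be sketched in Python as follows. Both function names are hypothetical and the data is synthetic; the real client would do the narrowing in JavaScript, and this is only one plausible way to walk the array.

```python
def build_signposts(codepoints, segments=8):
    """Sample a sorted codepoint list at `segments` evenly spaced indices."""
    total = len(codepoints)
    posts = []
    for i in range(segments):
        idx = (i * total) // segments
        posts.append({"idx": idx, "cp": codepoints[idx]})
    return posts


def narrow_with_signposts(signposts, total_records, target_cp):
    """Return inclusive (lo, hi) index bounds that contain target_cp if it exists."""
    lo, hi = 0, total_records - 1
    for post in signposts:
        if post["cp"] <= target_cp:
            lo = post["idx"]      # target is at or after this signpost
        else:
            hi = post["idx"] - 1  # target is strictly before this signpost
            break
    return lo, hi


# Synthetic stand-in for the real data: 800 records holding even codepoints.
codepoints = list(range(0, 1600, 2))
signposts = build_signposts(codepoints)
print(narrow_with_signposts(signposts, len(codepoints), 500))  # (200, 299)
```

With 150,000 records the same sampling yields indices 0, 18750, 37500 and so on, matching the example array above.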

The signposts should appear in the network log UI as greyed-out “cached” rows so the user understands what happened, before the live fetch rows begin.

2. Client (index.html, single file)

Vanilla HTML/CSS/JS. No build step, no framework, no external dependencies of any kind.

The core mechanic:

  1. On page load, fetch data/meta.json to learn recordWidth, totalRecords, and signposts.
  2. When the user searches for a character:
    • Determine the codepoint (from pasted character via codePointAt(0), or from hex input like 2665 or U+2665)
    • Walk the signposts to find the tightest lo/hi bounds without any network requests
    • Loop (the binary search over the network):
      • mid = Math.floor((lo + hi) / 2)
      • fetch("data/unicode-data.bin", { headers: { Range: "bytes=" + (mid * recordWidth) + "-" + ((mid + 1) * recordWidth - 1), "Accept-Encoding": "identity" } })
      • Parse the response text, trim padding, JSON.parse() it
      • Compare record.cp to the target codepoint
      • If equal → found it, display result
      • If record.cp < target → lo = mid + 1
      • If record.cp > target → hi = mid - 1
      • If lo > hi → not found
    • Each iteration of this loop is a real await fetch(). This is the whole point.
  3. Log every step to the network log panel as it happens: signpost steps appear instantly (greyed out / marked as cached), then fetch rows appear one by one in real time.
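The loop above can be modelled in Python with a `read_range` callback standing in for the HTTP Range fetch. This is a sketch of the control flow only, not the client code (which the spec says is JavaScript); the helper names and the in-memory `blob` are assumptions made so the example runs without a server.

```python
import json


def range_header(index, record_width=256):
    """Build the Range header value covering exactly one record."""
    start = index * record_width
    return f"bytes={start}-{start + record_width - 1}"


def search(read_range, target_cp, lo, hi, record_width=256):
    """Binary search where every probe reads one record via read_range(start, end)."""
    log = []
    while lo <= hi:
        mid = (lo + hi) // 2
        start = mid * record_width
        raw = read_range(start, start + record_width - 1)  # inclusive byte range
        record = json.loads(raw.decode("utf-8").strip())   # trim padding, then parse
        log.append(range_header(mid, record_width))
        if record["cp"] == target_cp:
            return record, log
        if record["cp"] < target_cp:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, log  # lo > hi: not found


# In-memory stand-in for the data file: 1000 records with even codepoints.
blob = b"".join(json.dumps({"cp": cp}).encode("utf-8").ljust(256) for cp in range(0, 2000, 2))
read = lambda start, end: blob[start:end + 1]

found, steps = search(read, 1234, 0, 999)
print(found, len(steps))
print(range_header(75000))  # bytes=19200000-19200255
```

Swapping `read` for a function that issues a real Range request would turn each probe into a network round trip, which is the behaviour the spec asks for.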

UI:

  • A text input for character or hex codepoint input
  • A result area showing the character rendered large (~120px), its name, codepoint (hex and decimal), category, and block
  • A network log panel that is the visual centerpiece of the demo. It’s a table that grows row by row in real time as the binary search runs. Columns:
    • Step number (1, 2, 3…)
    • Source — “signpost” (greyed out) or the Range header (e.g. bytes=19200000-19200255)
    • Record — what was found (e.g. U+4E09 CJK UNIFIED IDEOGRAPH-4E09)
    • Comparison result (e.g. 128128 > 19977 → go right)
    • Response time — “cached” for signposts, ms for fetches
  • A summary line at the end: “Found in 14 requests · 3,584 bytes transferred · full file is 38MB · 3 steps skipped via signposts”
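The input handling for the text box could look roughly like this Python sketch. The function name `parse_target` is hypothetical, and treating any single-character input as a literal character (rather than a one-digit hex value) is an assumption the spec does not settle.

```python
def parse_target(text):
    """Turn user input into a codepoint: a pasted character, or hex like 2665 / U+2665."""
    text = text.strip()
    if not text:
        raise ValueError("empty input")
    if len(text) == 1:
        # Assumption: a lone character is the character itself, not a hex digit.
        return ord(text)
    hex_part = text[2:] if text.upper().startswith("U+") else text
    return int(hex_part, 16)  # raises ValueError on non-hex input


print(parse_target("💀"))      # 128128
print(parse_target("U+2665"))  # 9829
print(parse_target("2665"))    # 9829
```

In JavaScript a character such as 💀 occupies two UTF-16 code units, which is why the spec calls for codePointAt(0) rather than a length check like the one used here.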

Design:

  • Clean, minimal, monochrome with one accent color
  • Mobile-friendly
  • The network log should feel alive — signpost rows appear instantly, then fetch rows appear one by one with real timing, making the binary search visible as a process

3. Static File Server

For local development, use npx serve . or any static server that supports Range requests.

Critical: Range requests for the data file must include Accept-Encoding: identity to prevent the server from compressing the response. Range requests are incompatible with Content-Encoding because byte offsets refer to the uncompressed file, but a compressed stream can’t be decompressed from an arbitrary midpoint. Setting Accept-Encoding: identity tells the server to send raw bytes. The meta.json fetch doesn’t need this since it’s downloaded in full.
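For checking a local server outside the browser, the same request can be assembled with Python's standard library. This is a hedged sketch: the URL is a placeholder for wherever the static server happens to listen, and the function names are hypothetical.

```python
import urllib.request


def build_range_request(url, index, record_width=256):
    """Create a request for exactly one record, asking for uncompressed bytes."""
    start = index * record_width
    end = start + record_width - 1
    return urllib.request.Request(
        url,
        headers={
            "Range": f"bytes={start}-{end}",
            "Accept-Encoding": "identity",
        },
    )


def fetch_record_bytes(url, index, record_width=256):
    """Send the request; a server that honours Range replies 206 Partial Content."""
    req = build_range_request(url, index, record_width)
    with urllib.request.urlopen(req) as resp:
        if resp.status != 206:
            raise RuntimeError(f"expected 206 Partial Content, got {resp.status}")
        return resp.read()


# Placeholder URL (assumption); building the request does not touch the network.
req = build_range_request("http://localhost:3000/data/unicode-data.bin", 75000)
print(req.get_header("Range"))  # bytes=19200000-19200255
```

Checking for status 206 is a cheap way to notice a server that ignores the Range header and returns the whole file with a 200.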

Key Constraints

  • The client must NEVER download the full data file. Every access to it is a single-record Range request.
  • The binary search happens over the network, not in memory. Each comparison is a real HTTP round trip. The only exception is the signpost pre-narrowing.
  • Fixed-width records are essential — this is what makes byte offset computation possible (offset = recordIndex * recordWidth). No index file needed.
  • Zero external dependencies in the client. Vanilla HTML/CSS/JS only.
  • The build script uses Python standard library only. No pip packages.
  • Range fetches must use Accept-Encoding: identity to prevent compression from breaking byte offsets.

What Success Looks Like

Someone opens the page, pastes 💀 into the box, and watches the network log fill up:

Step  Source                   Record                        Comparison              Time
1     signpost                 U+0000 NULL                   128128 > 0 → right      cached
2     signpost                 U+4E09 (19977)                128128 > 19977 → right   cached
3     signpost                 U+9FA5 (40869)                128128 > 40869 → right   cached
4     bytes=28672000-28672255  U+E114 (57620)                128128 > 57620 → right   12ms
5     bytes=31334400-31334655  U+1147A (70778)               128128 > 70778 → right   8ms
...
17    bytes=32563200-32563455  U+1F480 SKULL                 ✓ Found!                 7ms

Summary: “Found SKULL in 17 steps (3 cached, 14 fetched) · 3,584 bytes transferred · full file is 38MB”

That table, filling in live, is the entire demo. It makes the binary search tangible.
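The figures in the summary line can be checked with a little arithmetic. The sketch below uses the spec's approximate count of 150,000 records (an assumption; the real count comes from the build) and a hypothetical helper `max_probes` for the worst-case number of binary search steps.

```python
RECORD_WIDTH = 256
TOTAL_RECORDS = 150_000  # approximate figure from the spec, not the real count


def max_probes(n):
    """Worst-case binary search probes over n sorted records: floor(log2(n)) + 1."""
    return n.bit_length() if n > 0 else 0


segment = TOTAL_RECORDS // 8            # records left after the signposts narrow to 1/8

print(max_probes(TOTAL_RECORDS))        # 18 probes at most without signposts
print(max_probes(segment))              # 15 fetches at most after signposts
print(14 * RECORD_WIDTH)                # 3584 bytes for 14 fetches
print(TOTAL_RECORDS * RECORD_WIDTH)     # 38400000 bytes, about 38MB
```

The example run of 14 fetches sits just under the 15-fetch upper bound, and 14 records of 256 bytes account for the 3,584 bytes quoted in the summary.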

Run "uvx rodney --help" and use that to test your work

A demo that performs binary search via HTTP Range requests against a
single static file. build.py downloads Unicode character data and
produces a 76MB fixed-width binary file (256 bytes/record). index.html
performs real network binary search — each comparison is an HTTP Range
fetch of one record. Signposts skip the first ~3 steps. A live network
log table shows each step happening in real time.

https://claude.ai/code/session_014MbuxXYimWKjqRo88xQuch

Excludes UnicodeData.txt, Blocks.txt (fetched source data) and
unicode-data.bin (76MB generated binary). All are re-created by
running build.py.

https://claude.ai/code/session_014MbuxXYimWKjqRo88xQuch