perf: optimize large EPUB indexing from O(n^2) to O(n) #458

daveallie merged 5 commits into crosspoint-reader:master
Conversation
Remove the call to loadAllFileStatSlims() which pre-loads all ZIP central directory entries into memory. For EPUBs with 2000+ chapters (like webnovels), this exhausts the ESP32-C3's ~380KB RAM and causes abort(). The existing loadFileStatSlim() function already handles individual lookups by scanning the central directory per-file when the cache is empty. This is O(n*m) instead of O(n), but prevents memory exhaustion. Fixes crosspoint-reader#134
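The per-file fallback described above can be sketched as follows. This is a minimal illustration, not the actual `loadFileStatSlim()` code: the struct and function names here are hypothetical, and the real implementation parses ZIP central directory records from SD rather than a vector.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for one ZIP central directory record.
struct CentralDirEntry {
  std::string name;
  uint32_t uncompressedSize;
};

// Lazy lookup: scan the central directory for one file on demand.
// Each call is O(m), so n lookups cost O(n*m), but nothing is cached,
// which avoids the memory exhaustion that pre-loading caused.
int64_t lookupUncompressedSize(const std::vector<CentralDirEntry>& centralDir,
                               const std::string& target) {
  for (const auto& entry : centralDir) {
    if (entry.name == target) return entry.uncompressedSize;
  }
  return -1;  // not found
}
```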
**Fix Pushed - Ready for Re-test**

The initial implementation caused an OOM crash at `beginTocPass()`.

**Changes in Latest Commit**
What This PR Now Does
**Expected Performance**

With this fix, Shadow Slave indexing should:
**Next Steps**

Please re-test with Shadow Slave and report:
Replace O(n²) lookups with O(n) preprocessing:

1. createTocEntry(): Build an href→spineIndex map once in beginTocPass() instead of scanning the spine file for every TOC entry
2. buildBookBin(): Build a spineIndex→tocIndex vector in a single pass instead of scanning the TOC file for every spine entry

For 2768-chapter EPUBs, this reduces:
- TOC pass: from ~7.6M file reads to ~5.5K reads
- buildBookBin: from ~7.6M file reads to ~5.5K reads

Memory impact: ~80KB for the href map (an acceptable trade-off for a 10x+ speedup)
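The map-building step from this commit might look like the sketch below. This is a simplified illustration (the function name and input type are assumptions; the real code reads hrefs while parsing the spine). Note that, per the follow-up commit, this `unordered_map` approach later proved too RAM-hungry on the ESP32-C3 and was reverted for the TOC pass.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Build the href -> spineIndex map once in beginTocPass(); afterwards each
// TOC entry resolves its spine index in O(1) instead of an O(n) file scan.
std::unordered_map<std::string, int> buildHrefToSpineIndex(
    const std::vector<std::string>& spineHrefs) {
  std::unordered_map<std::string, int> map;
  map.reserve(spineHrefs.size());
  for (int i = 0; i < static_cast<int>(spineHrefs.size()); ++i) {
    map.emplace(spineHrefs[i], i);
  }
  return map;
}
```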
The unordered_map with 2768 string keys (~100KB+) was causing OOM crashes at beginTocPass() on ESP32-C3's limited ~380KB RAM. Reverted createTocEntry() to use original O(n) spine file scan instead. Kept the safe spineToTocIndex vector in buildBookBin() (only ~5.5KB).
Three optimizations for EPUBs with many chapters (e.g. 2768 chapters):

1. OPF idref→href lookup: Build a sorted hash index during manifest parsing, use binary search during spine resolution. Reduces ~4min to ~30-60s.
2. TOC href→spineIndex lookup: Build a sorted hash index in beginTocPass(), use binary search in createTocEntry(). Reduces ~4min to ~30-60s.
3. ZIP central-dir cursor: Resume scanning from the last position instead of restarting from the beginning. Reduces ~8min to ~1-3min.

All optimizations only activate for large EPUBs (≥400 spine items). Small books use unchanged code paths.

Memory impact: ~33KB + ~39KB temporary during indexing, freed after.

Expected total: ~17min → ~3-5min for Shadow Slave (2768 chapters).

Also adds phase timing logs for performance measurement.
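The central-dir cursor idea (optimization 3) can be sketched as below. This is a hypothetical simplification, assuming the directory entries are already in memory; the real code resumes a scan over ZIP records on SD. The win comes from spine items typically being requested in roughly the same order they appear in the central directory, making each lookup ~O(1) amortized instead of an O(m) scan from entry 0.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct DirEntry {
  std::string name;
  uint32_t uncompressedSize;
};

class CentralDirCursor {
 public:
  explicit CentralDirCursor(std::vector<DirEntry> entries)
      : entries_(std::move(entries)) {}

  // Resume from the position after the last match, wrapping around at most
  // once, instead of restarting every lookup from the beginning.
  int64_t find(const std::string& target) {
    const size_t n = entries_.size();
    for (size_t step = 0; step < n; ++step) {
      const size_t i = (cursor_ + step) % n;
      if (entries_[i].name == target) {
        cursor_ = (i + 1) % n;  // next lookup starts just past this hit
        return entries_[i].uncompressedSize;
      }
    }
    return -1;  // not found after one full wrap
  }

 private:
  std::vector<DirEntry> entries_;
  size_t cursor_ = 0;
};
```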
Add ZipFile::fillUncompressedSizes() for a single-pass ZIP central directory scan with hash-based target matching. Also apply clang-format fixes for CI.

Shadow Slave results:
- buildBookBin: 506s → 35s
- Total indexing: 8.7min → 50s
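A minimal sketch of the batch-matching shape behind `fillUncompressedSizes()`, under simplifying assumptions: the struct names and in-memory directory are hypothetical, and the real API streams central directory records from the ZIP on SD, hashing filenames on the fly without allocating strings.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// 64-bit FNV-1a (standard offset basis and prime).
uint64_t fnv1a64(const char* s, size_t n) {
  uint64_t h = 1469598103934665603ULL;
  for (size_t i = 0; i < n; ++i) {
    h ^= static_cast<unsigned char>(s[i]);
    h *= 1099511628211ULL;
  }
  return h;
}

struct Target {
  uint64_t hash;
  uint32_t len;
  uint32_t index;  // slot in the caller's sizes array
};

struct DirEntry {
  std::string name;
  uint32_t uncompressedSize;
};

// One pass over the central directory; each entry is hashed and binary
// searched against the sorted targets, so total work is O(m log n).
void fillUncompressedSizes(const std::vector<DirEntry>& centralDir,
                           std::vector<Target> targets,
                           std::vector<uint32_t>& sizes) {
  auto less = [](const Target& a, const Target& b) {
    return a.hash != b.hash ? a.hash < b.hash : a.len < b.len;
  };
  std::sort(targets.begin(), targets.end(), less);
  for (const auto& e : centralDir) {
    Target probe{fnv1a64(e.name.data(), e.name.size()),
                 static_cast<uint32_t>(e.name.size()), 0};
    auto it = std::lower_bound(targets.begin(), targets.end(), probe, less);
    if (it != targets.end() && it->hash == probe.hash && it->len == probe.len) {
      sizes[it->index] = e.uncompressedSize;
    }
  }
}
```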
**Update: Full Optimization Complete**

Since the earlier comment, I've implemented all three optimizations using a RAM-safe hash-index approach (replacing the `unordered_map`).

Results on Shadow Slave (2768 chapters):
Normal books (87 chapters): 1.7 sec - no regression.

Technical approach:
All optimizations are gated by a threshold (≥400 items) but work for any book size. I've updated the PR description with full details and an algorithm explanation.
Implementing faster .epub indexing + fixing minor issue with hiding /fonts in the file browser
This reverts commit 0e0dd5b.
Thank you! Technically it will work fine for all books, not just large ones, but I didn't want to completely overhaul the existing system without review, so I had it only activate on books with a large enough spine. It should improve the speed of regular-sized books as well, although the difference will probably be negligible from a user perspective because typical books have relatively few chapters.
* master: (33 commits)
  - feat: add HalDisplay and HalGPIO (crosspoint-reader#522)
  - feat: Display epub metadata on Recents (crosspoint-reader#511)
  - chore: Cut release 0.16.0
  - fix: Correctly render italics on image alt placeholders (crosspoint-reader#569)
  - chore: .gitignore: add compile_commands.json & .cache (crosspoint-reader#568)
  - fix: Render keyboard entry over multiple lines (crosspoint-reader#567)
  - fix: missing front layout in mapLabels() (crosspoint-reader#564)
  - refactor: Re-work for OTA feature (crosspoint-reader#509)
  - perf: optimize large EPUB indexing from O(n^2) to O(n) (crosspoint-reader#458)
  - feat: Add Spanish hyphenation support (crosspoint-reader#558)
  - feat: Add support to B&W filters to image covers (crosspoint-reader#476)
  - feat(ux): page turning on button pressed if long-press chapter skip is disabled (crosspoint-reader#451)
  - feat: Add status bar option "Full w/ Progress Bar" (crosspoint-reader#438)
  - fix: Validate settings on read. (crosspoint-reader#492)
  - fix: rotate origin in drawImage (crosspoint-reader#557)
  - feat: Extract author from XTC/XTCH files (crosspoint-reader#563)
  - fix: add txt books to recent tab (crosspoint-reader#526)
  - docs: add font generation commands to builtin font headers (crosspoint-reader#547)
  - docs: Update README with supported languages for EPUB (crosspoint-reader#530)
  - fix: Fix KOReader document md5 calculation for binary matching progress sync (crosspoint-reader#529)
  - ...
I did think that while reading through it as part of the review. It would be great to follow up on this in a future release and apply the same logic everywhere to clean up the code.
Synced from upstream/master (da4d3b5):
- feat: Add HalDisplay and HalGPIO (crosspoint-reader#522)
- feat: Display epub metadata on Recents (crosspoint-reader#511)
- feat: Add Spanish hyphenation support (crosspoint-reader#558)
- fix: Render keyboard entry over multiple lines (crosspoint-reader#567)
- fix: missing front layout in mapLabels() (crosspoint-reader#564)
- perf: optimize large EPUB indexing from O(n^2) to O(n) (crosspoint-reader#458)
- chore: .gitignore: add compile_commands.json & .cache (crosspoint-reader#568)

Files synced:
- lib/hal/ (new)
- lib/Epub/ (hyphenation, metadata cache, parsers)
- lib/ZipFile/
- src/MappedInputManager.*
- src/activities/home/MyLibraryActivity.*
- src/activities/util/

Skipped conflicting files:
- lib/GfxRenderer/ (has CJK changes)
- src/activities/reader/ (has dark mode changes)
- src/main.cpp (has I18n changes)
## Summary

Optimizes EPUB metadata indexing for large books (2000+ chapters) from ~30 minutes to ~50 seconds by replacing O(n²) algorithms with O(n log n) hash-indexed lookups.

Fixes crosspoint-reader#134

## Problem

Three phases had O(n²) complexity due to nested loops:

| Phase | Operation | Before (2768 chapters) |
|-------|-----------|------------------------|
| OPF Pass | For each spine ref, scan all manifest items | ~25 min |
| TOC Pass | For each TOC entry, scan all spine items | ~5 min |
| buildBookBin | For each spine item, scan ZIP central directory | ~8.4 min |

Total: **~30+ minutes** for first-time indexing of large EPUBs.

## Solution

Replace linear scans with sorted hash indexes + binary search:

- **OPF Pass**: Build `{hash(id), len, offset}` index from the manifest, binary search for each spine ref
- **TOC Pass**: Build `{hash(href), len, spineIndex}` index from the spine, binary search for each TOC entry
- **buildBookBin**: New `ZipFile::fillUncompressedSizes()` API - a single ZIP central directory scan with batch hash matching

All indexes use FNV-1a hashing with length as a secondary key to minimize collisions. Indexes are freed immediately after each phase.

## Results

**Shadow Slave EPUB (2768 chapters):**

| Phase | Before | After | Speedup |
|-------|--------|-------|---------|
| OPF pass | ~25 min | 10.8 sec | ~140x |
| TOC pass | ~5 min | 4.7 sec | ~60x |
| buildBookBin | 506 sec | 34.6 sec | ~15x |
| **Total** | **~30+ min** | **~50 sec** | **~36x** |

**Normal EPUB (87 chapters):** 1.7 sec - no regression.

## Memory

Peak temporary memory during indexing:

- OPF index: ~33KB (2770 items × 12 bytes)
- TOC index: ~33KB (2768 items × 12 bytes)
- ZIP batch: ~44KB (targets + sizes arrays)

All indexes are cleared immediately after each phase. No OOM risk on the ESP32-C3.

## Note on Threshold

All optimizations are gated by `LARGE_SPINE_THRESHOLD = 400` to preserve existing behavior for small books. However, the algorithms work correctly for any book size and are faster even for small books:

| Book Size | Old O(n²) | New O(n log n) | Improvement |
|-----------|-----------|----------------|-------------|
| 10 ch | 100 ops | 50 ops | 2x |
| 100 ch | 10K ops | 800 ops | 12x |
| 400 ch | 160K ops | 4K ops | 40x |

If preferred, the threshold could be removed to use the optimized path universally.

## Testing

- [x] Shadow Slave (2768 chapters): 50s first-time indexing, loads and navigates correctly
- [x] Normal book (87 chapters): 1.7s indexing, no regression
- [x] Build passes
- [x] clang-format passes

## Files Changed

- `lib/Epub/Epub/parsers/ContentOpfParser.h/.cpp` - OPF manifest index
- `lib/Epub/Epub/BookMetadataCache.h/.cpp` - TOC index + batch size lookup
- `lib/ZipFile/ZipFile.h/.cpp` - New `fillUncompressedSizes()` API
- `lib/Epub/Epub.cpp` - Timing logs

<details>
<summary><b>Algorithm Details</b> (click to expand)</summary>

### Phase 1: OPF Pass - Manifest to Spine Lookup

**Problem**: Each `<itemref idref="ch001">` in the spine must find the matching `<item id="ch001" href="...">` in the manifest.

```
OLD: For each of 2768 spine refs, scan all 2770 manifest items
     = 7.6M string comparisons

NEW: While parsing the manifest, build an index:
       { hash("ch001"), len=5, file_offset=120 }
     Sort the index, then binary search for each spine ref:
       2768 × log₂(2770) ≈ 2768 × 11 = 30K comparisons
```

### Phase 2: TOC Pass - TOC Entry to Spine Index Lookup

**Problem**: Each TOC entry with `href="chapter0001.xhtml"` must find its spine index.

```
OLD: For each of 2768 TOC entries, scan all 2768 spine entries
     = 7.6M string comparisons

NEW: At beginTocPass(), read the spine once and build an index:
       { hash("OEBPS/chapter0001.xhtml"), len=25, spineIndex=0 }
     Sort the index, binary search for each TOC entry:
       2768 × log₂(2768) ≈ 30K comparisons
     Clear the index at endTocPass() to free memory.
```

### Phase 3: buildBookBin - ZIP Size Lookup

**Problem**: Need the uncompressed file size for each spine item (for reading progress). Sizes live in the ZIP central directory.

```
OLD: For each of 2768 spine items, scan the ZIP central directory (2773 entries)
     = 7.6M filename reads + string comparisons
     Time: 506 seconds

NEW: Step 1: Build targets from the spine
       { hash("OEBPS/chapter0001.xhtml"), len=25, index=0 }
       Sort by (hash, len)
     Step 2: Single pass through the ZIP central directory
       For each entry:
       - Compute the hash on the fly (no string allocation)
       - Binary search the targets
       - If match: sizes[target.index] = uncompressedSize
     Step 3: Use the sizes array directly (O(1) per spine item)
     Total: 2773 entries × log₂(2768) ≈ 33K comparisons
     Time: 35 seconds
```

### Why Hash + Length?

Using a 64-bit FNV-1a hash + string length as a composite key:

- Collision probability: ~1 in 2⁶⁴ × typical_path_lengths
- No string storage needed in the index (just 12-16 bytes per entry)
- Integer comparisons are faster than string comparisons
- Verification on match handles the rare collision case

</details>

---

_AI-assisted development. All changes tested on hardware._
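The sorted-index pattern and the collision verification described above can be sketched together as follows. This is a simplified illustration, not the shipped code: the index here verifies via a parallel key vector, whereas the PR stores a file offset and re-reads the string only on the rare (hash, len) match.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a, the hash named in the PR (standard basis/prime constants).
uint64_t fnv1a64(const std::string& s) {
  uint64_t h = 1469598103934665603ULL;
  for (unsigned char c : s) {
    h ^= c;
    h *= 1099511628211ULL;
  }
  return h;
}

// Index entry: composite key (hash, len) plus a small payload, 16 bytes total.
struct IndexEntry {
  uint64_t hash;
  uint32_t len;
  uint32_t value;
};

static bool keyLess(const IndexEntry& a, const IndexEntry& b) {
  return a.hash != b.hash ? a.hash < b.hash : a.len < b.len;
}

// Build once: O(n log n). No strings are stored in the index itself.
std::vector<IndexEntry> buildIndex(const std::vector<std::string>& keys) {
  std::vector<IndexEntry> index;
  index.reserve(keys.size());
  for (uint32_t i = 0; i < keys.size(); ++i) {
    index.push_back({fnv1a64(keys[i]), (uint32_t)keys[i].size(), i});
  }
  std::sort(index.begin(), index.end(), keyLess);
  return index;
}

// Lookup: O(log n) binary search, then verify the actual string on a key
// match so a hash collision can never return the wrong entry.
int64_t lookup(const std::vector<IndexEntry>& index,
               const std::vector<std::string>& keys, const std::string& q) {
  IndexEntry probe{fnv1a64(q), (uint32_t)q.size(), 0};
  auto it = std::lower_bound(index.begin(), index.end(), probe, keyLess);
  for (; it != index.end() && !keyLess(probe, *it); ++it) {
    if (keys[it->value] == q) return it->value;  // verified match
  }
  return -1;
}
```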