perf: Replace std::list with std::vector in text layout#1038
perf: Replace std::list with std::vector in text layout#1038daveallie merged 4 commits intocrosspoint-reader:masterfrom
Conversation
Replace std::list with std::vector for the words, wordStyles, wordXpos, and wordContinues containers in TextBlock and ParsedText. Vectors provide contiguous memory layout for better cache locality and O(1) random access, eliminating per-node heap allocation and the 16-byte prev/next pointer overhead of doubly-linked list nodes. The indexed access also removes the need for a separate continuesVec copy that was previously built from the list for O(1) layout access.
|
No actionable comments were generated in the recent review. 🎉 📜 Recent review detailsConfiguration used: Organization UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used🧠 Learnings (2)📓 Common learnings📚 Learning: 2026-02-19T12:17:05.421ZApplied to files:
🧬 Code graph analysis (1)lib/Epub/Epub/ParsedText.cpp (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
🔇 Additional comments (6)
📝 WalkthroughWalkthroughRefactors text layout internals: replaces std::list with std::vector for word storage, moves iterator-based logic to index-based access, and removes the separate Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
| // Remove consumed words so size() reflects only remaining words | ||
| if (lineCount > 0) { | ||
| const size_t consumed = lineBreakIndices[lineCount - 1]; | ||
| words.erase(words.begin(), words.begin() + consumed); | ||
| wordStyles.erase(wordStyles.begin(), wordStyles.begin() + consumed); | ||
| wordContinues.erase(wordContinues.begin(), wordContinues.begin() + consumed); |
There was a problem hiding this comment.
Here's the fix added upon changes from #802. I believe this keeps the vector sizes correct now.
This does cause the vectors to move the remaining elements to the front, but at most they contain the partial last line (when includeLastLine=false) from in-progress parsing, so only a handful of elements may need to be moved.
The existing contiguous vector storage can then be reused by ParsedText::addWord.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/Epub/Epub/ParsedText.cpp (1)
313-373:⚠️ Potential issue | 🟡 MinorPreserve the attach-to-previous flag when splitting words.
wordContinuesrepresents "word attaches to previous (no space before it)," used for inline element boundaries. When a word withattachToPrevious=trueis hyphenated, the current code incorrectly transfers this flag to the remainder and clears the prefix's flag. This breaks rendering: the prefix (which is part of the original word that should attach to previous) loses its attachment, and the remainder incorrectly tries to attach despite the intervening hyphen.The fix should preserve the prefix's attachment flag and insert
falsefor the remainder:Suggested fix
- const bool originalContinuedToNext = wordContinues[wordIndex]; - // The original word (now prefix) does NOT continue to remainder (hyphen separates them) - wordContinues[wordIndex] = false; - wordContinues.insert(wordContinues.begin() + wordIndex + 1, originalContinuedToNext); + // Preserve the prefix's attach-to-previous flag; allow a break between prefix and remainder. + wordContinues.insert(wordContinues.begin() + wordIndex + 1, false);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/Epub/Epub/ParsedText.cpp` around lines 313 - 373, ParsedText::hyphenateWordAtIndex is clearing the original word's attach-to-previous flag (wordContinues[wordIndex]) and incorrectly giving that flag to the remainder; instead capture the original attach flag, keep it on the prefix, and insert false for the remainder. Specifically, in hyphenateWordAtIndex, save a bool originalAttach = wordContinues[wordIndex]; after resizing/pushing the hyphen leave wordContinues[wordIndex] = originalAttach; then insert wordContinues.insert(wordContinues.begin() + wordIndex + 1, false); (remove the current logic that sets the prefix to false and reuses originalContinuedToNext).
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
lib/Epub/Epub/ParsedText.cpplib/Epub/Epub/ParsedText.hlib/Epub/Epub/blocks/TextBlock.cpplib/Epub/Epub/blocks/TextBlock.h
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-02-19T12:17:00.962Z
Learnt from: Uri-Tauber
Repo: crosspoint-reader/crosspoint-reader PR: 988
File: lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp:649-661
Timestamp: 2026-02-19T12:17:00.962Z
Learning: In ChapterHtmlSlimParser.cpp, when computing footnote word indices in endElement() for footnote links, the wordIndex must be cumulative across the 750-word mid-paragraph flush boundary. The correct calculation is: `wordIndex = wordsExtractedInBlock + currentTextBlock->size()`, not just `currentTextBlock->size()`. This ensures footnotes attach to the page containing their actual anchor word, even after layoutAndExtractLines(false) has extracted and removed earlier words from the block.
Applied to files:
lib/Epub/Epub/ParsedText.hlib/Epub/Epub/blocks/TextBlock.cpplib/Epub/Epub/ParsedText.cpp
🧬 Code graph analysis (1)
lib/Epub/Epub/blocks/TextBlock.h (2)
lib/Epub/Epub/blocks/Block.h (1)
Block(8-15)lib/EpdFont/EpdFontFamily.h (1)
EpdFontFamily(4-23)
🔇 Additional comments (9)
lib/Epub/Epub/ParsedText.h (1)
16-33: Vector-backed containers and helper signatures align with cpp changes.
The private storage and helper declarations are consistent with the index-based access patterns in the implementation.lib/Epub/Epub/blocks/TextBlock.cpp (2)
15-22: Index-based render loop is clean and bounds-checked.
The size mismatch guard plus direct indexing is safe and clear.
74-96: Vector-based deserialization with a word-count sanity cap looks solid.
The resize-then-fill flow is straightforward and avoids list overhead.lib/Epub/Epub/blocks/TextBlock.h (1)
15-26: Constructor and fields updated consistently for vectors.
The member layout and initializer list now match the vector-based data flow.lib/Epub/Epub/ParsedText.cpp (5)
70-103: State cleanup after line extraction keeps vectors aligned.
UsingwordContinuesdirectly and erasing consumed ranges from all three vectors keeps internal state consistent.
106-112: Pre-reserve and index-based width computation is efficient.
The reserve + indexed push reduces reallocations while matching the new storage model.
131-138: Fallback hyphenation call updated appropriately.
The new signature usage reads correctly within the oversized-word pre-pass.
278-285: Hyphenation hook respects available-width gate.
The guard plus updated call signature look correct in the hyphenated line builder.
430-449: Move-based line extraction and pre-sized X positions look good.
The move iterators plus reservedlineXPosvector are consistent with the vector refactor.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@lib/Epub/Epub/ParsedText.cpp`:
- Around line 313-373: ParsedText::hyphenateWordAtIndex is clearing the original
word's attach-to-previous flag (wordContinues[wordIndex]) and incorrectly giving
that flag to the remainder; instead capture the original attach flag, keep it on
the prefix, and insert false for the remainder. Specifically, in
hyphenateWordAtIndex, save a bool originalAttach = wordContinues[wordIndex];
after resizing/pushing the hyphen leave wordContinues[wordIndex] =
originalAttach; then insert wordContinues.insert(wordContinues.begin() +
wordIndex + 1, false); (remove the current logic that sets the prefix to false
and reuses originalContinuedToNext).
|
I added some logging for the various layout phases and had Claude put together a summary of the differences: List vs Vector Text Layout Performance ComparisonTest book: Intermezzo by Sally Rooney, chapter 14 (contains a 2,420-word paragraph). Headline numbers
Early flush behaviorBoth logs show exactly 3 early flushes for the 2,420-word paragraph, with size dropping to single digits after each: This confirms the erase change works correctly. Without it, you'd see The three early flush passes compared
The first flush is an outlier where list's extract was faster (possibly cold cache). On the 2nd and 3rd flushes, vector's contiguous memory wins decisively. Medium paragraphs
Vector is 10-14% faster across the board for medium paragraphs. Smallest paragraphs
These samples span the full 1-25 word range. At tiny sizes, fixed overhead dominates and results can be noisy, but vector is generally faster across this range. MemoryThe heap values during the early flushes tell the real story. Look at the heap before each flush (the low-water mark, when all the words are buffered but not yet extracted):
The list version dipped to 63KB free during the second flush. On an ESP32 with ~230KB total heap, that's only 27% remaining -- dangerously close to allocation failures if other subsystems (fonts, zip streams, CSS) need memory simultaneously. The vector version never went below 126KB free -- roughly double the headroom. The ~63KB difference directly matches the per-node overhead: 880 words x 3 lists x ~24 bytes (prev/next pointers + malloc bookkeeping) = ~63KB. Phase breakdownThe biggest wins come from
What this shows:
This aligns with expected access patterns: extracting 100+ lines repeatedly touches many word entries; contiguous storage keeps iteration cache-friendly while list traversal pays pointer-chasing costs on every pass. Erase costThe erase step added by the fix costs 90-133 microseconds per flush -- 0.1% of total flush time. Completely negligible. |
There was a problem hiding this comment.
Pull request overview
This PR updates EPUB text layout internals to use contiguous storage (std::vector) instead of node-based storage (std::list) for words, styles, and positioning, aiming to improve cache locality and iteration performance during layout/rendering.
Changes:
- Replace
std::listwithstd::vectorinTextBlockandParsedTextword/style storage. - Convert iterator-based traversal to index-based access throughout rendering/layout.
- Remove the extra “indexed continues” copy by using
std::vector<bool>directly.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lib/Epub/Epub/blocks/TextBlock.h | Switch TextBlock storage/constructor from std::list to std::vector. |
| lib/Epub/Epub/blocks/TextBlock.cpp | Update rendering and deserialization logic to use vector indexing and vector-based allocations. |
| lib/Epub/Epub/ParsedText.h | Switch internal storage to vectors; adjust hyphenation helper signature accordingly. |
| lib/Epub/Epub/ParsedText.cpp | Remove list/iterator patterns, use vector indexing/move ranges, and add end-of-layout consumption via vector erase. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
osteotek
left a comment
There was a problem hiding this comment.
looks good to me, works in my testing
Co-authored-by: Cursor <[email protected]> # Conflicts: # lib/Epub/Epub/ParsedText.cpp
Summary
Revision to @BlindBat's #802. Description comes from the original PR.
std::listwithstd::vectorfor word storage inTextBlockandParsedTextwords[i]) instead of iterator advancement (std::advance(it, n))continuesVeccopy that was built fromwordContinuesfor O(1) access — now unnecessary sincestd::vector<bool>already provides O(1) indexingWhy
std::listallocates each node individually on the heap with 16 bytes of prev/next pointer overhead per node. For text layout with many small words, this means:std::vectorstores elements contiguously, giving better cache performance during the tight rendering and layout loops. TheextractLinefunction also benefits: list splice was O(1) but required maintaining three parallel iterators, while vector range construction with move iterators is simpler and still efficient for the small line-sized chunks involved.Files changed
lib/Epub/Epub/blocks/TextBlock.h/.cpplib/Epub/Epub/ParsedText.h/.cppAI Usage
YES
Test plan