Skip to content

perf: Replace std::list with std::vector in text layout#1038

Merged
daveallie merged 4 commits intocrosspoint-reader:masterfrom
znelson:list-to-vector
Feb 22, 2026
Merged

perf: Replace std::list with std::vector in text layout#1038
daveallie merged 4 commits intocrosspoint-reader:masterfrom
znelson:list-to-vector

Conversation

@znelson
Copy link
Contributor

@znelson znelson commented Feb 20, 2026

Summary

Revision to @BlindBat's #802. Description comes from the original PR.

  • Replace std::list with std::vector for word storage in TextBlock and ParsedText
  • Use index-based access (words[i]) instead of iterator advancement (std::advance(it, n))
  • Remove the separate continuesVec copy that was built from wordContinues for O(1) access — now unnecessary since std::vector<bool> already provides O(1) indexing

Why

std::list allocates each node individually on the heap with 16 bytes of prev/next pointer overhead per node. For text layout with many small words, this means:

  • Scattered heap allocations instead of contiguous memory
  • Poor cache locality during iteration (each node can be anywhere in memory)
  • Per-node malloc/free overhead during construction and destruction

std::vector stores elements contiguously, giving better cache performance during the tight rendering and layout loops. The extractLine function also benefits: list splice was O(1) but required maintaining three parallel iterators, while vector range construction with move iterators is simpler and still efficient for the small line-sized chunks involved.

Files changed

  • lib/Epub/Epub/blocks/TextBlock.h / .cpp
  • lib/Epub/Epub/ParsedText.h / .cpp

AI Usage

YES

Test plan

  • Open an EPUB with mixed formatting (bold, italic, underline) — verify text renders correctly
  • Open a book with justified text — verify word spacing is correct
  • Open a book with hyphenation enabled — verify words break correctly at hyphens
  • Navigate through pages rapidly — verify no rendering glitches or crashes
  • Open a book with long paragraphs — verify text layout matches pre-change behavior

BlindBat and others added 2 commits February 20, 2026 09:12
Replace std::list with std::vector for the words, wordStyles,
wordXpos, and wordContinues containers in TextBlock and ParsedText.

Vectors provide contiguous memory layout for better cache locality
and O(1) random access, eliminating per-node heap allocation and
the 16-byte prev/next pointer overhead of doubly-linked list nodes.
The indexed access also removes the need for a separate continuesVec
copy that was previously built from the list for O(1) layout access.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 20, 2026

No actionable comments were generated in the recent review. 🎉

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee9aa18 and 011acaf.

📒 Files selected for processing (1)
  • lib/Epub/Epub/ParsedText.cpp
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Uri-Tauber
Repo: crosspoint-reader/crosspoint-reader PR: 988
File: lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp:649-661
Timestamp: 2026-02-19T12:17:05.421Z
Learning: In ChapterHtmlSlimParser.cpp, when computing footnote word indices in endElement() for footnote links, the wordIndex must be cumulative across the 750-word mid-paragraph flush boundary. The correct calculation is: `wordIndex = wordsExtractedInBlock + currentTextBlock->size()`, not just `currentTextBlock->size()`. This ensures footnotes attach to the page containing their actual anchor word, even after layoutAndExtractLines(false) has extracted and removed earlier words from the block.
📚 Learning: 2026-02-19T12:17:05.421Z
Learnt from: Uri-Tauber
Repo: crosspoint-reader/crosspoint-reader PR: 988
File: lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp:649-661
Timestamp: 2026-02-19T12:17:05.421Z
Learning: In ChapterHtmlSlimParser.cpp, when computing footnote word indices in endElement() for footnote links, the wordIndex must be cumulative across the 750-word mid-paragraph flush boundary. The correct calculation is: `wordIndex = wordsExtractedInBlock + currentTextBlock->size()`, not just `currentTextBlock->size()`. This ensures footnotes attach to the page containing their actual anchor word, even after layoutAndExtractLines(false) has extracted and removed earlier words from the block.

Applied to files:

  • lib/Epub/Epub/ParsedText.cpp
🧬 Code graph analysis (1)
lib/Epub/Epub/ParsedText.cpp (1)
src/main.cpp (1)
  • renderer (38-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: cppcheck
🔇 Additional comments (6)
lib/Epub/Epub/ParsedText.cpp (6)

84-103: Clean handoff to vector-based line extraction. Passing wordContinues directly and erasing consumed ranges after extraction keeps the remaining buffer consistent.


106-112: Width precompute loop is consistent with vector storage. The reserve + index-based access is a good fit here.


131-139: Oversized-word fallback still covered. The in-place hyphenation call aligns with the new vector-backed data flow.


265-288: Hyphenated greedy layout flow looks solid. The split-and-advance logic still guarantees forward progress.


313-394: In-place hyphenation updates are consistent. Updating words, wordStyles, wordContinues, and wordWidths together preserves alignment.


446-465: Line extraction via move iterators looks good. Preallocating lineXPos and range-moving words fits the new vector model.


📝 Walkthrough

Walkthrough

Refactors text layout internals: replaces std::list with std::vector for word storage, moves iterator-based logic to index-based access, and removes the separate continuesVec parameter—using the wordContinues member directly across hyphenation and line-breaking code paths.

Changes

Cohort / File(s) Summary
ParsedText — storage & layout logic
lib/Epub/Epub/ParsedText.h, lib/Epub/Epub/ParsedText.cpp
Switched internal containers from std::list to std::vector; removed continuesVec parameters from computeLineBreaks, computeHyphenatedLineBreaks, hyphenateWordAtIndex, and extractLine; moved from iterator-based to index-based access and now mutate primary word vectors in-place when extracting lines or inserting hyphenated remainders.
TextBlock — API and rendering
lib/Epub/Epub/blocks/TextBlock.h, lib/Epub/Epub/blocks/TextBlock.cpp
Public fields and constructor changed to accept std::vector instead of std::list; render and deserialization updated to index-based access; underline and width calculations now use vector indexing rather than iterator dereference.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • daveallie
  • ngxson
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and accurately summarizes the main change: replacing std::list with std::vector in text layout performance optimization.
Description check ✅ Passed The description is directly related to the changeset, providing clear context about the motivation, implementation details, files changed, and test plan for the std::list to std::vector migration.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Comment on lines +97 to +102
// Remove consumed words so size() reflects only remaining words
if (lineCount > 0) {
const size_t consumed = lineBreakIndices[lineCount - 1];
words.erase(words.begin(), words.begin() + consumed);
wordStyles.erase(wordStyles.begin(), wordStyles.begin() + consumed);
wordContinues.erase(wordContinues.begin(), wordContinues.begin() + consumed);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the fix added upon changes from #802. I believe this keeps the vector sizes correct now.

This does cause the vectors to move the remaining elements to the front, but at most they contain the partial last line (when includeLastLine=false) from in-progress parsing, so only a handful of elements may need to be moved.

The existing contiguous vector storage can then be reused by ParsedText::addWord.

@znelson znelson marked this pull request as ready for review February 20, 2026 15:50
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/Epub/Epub/ParsedText.cpp (1)

313-373: ⚠️ Potential issue | 🟡 Minor

Preserve the attach-to-previous flag when splitting words.

wordContinues represents "word attaches to previous (no space before it)," used for inline element boundaries. When a word with attachToPrevious=true is hyphenated, the current code incorrectly transfers this flag to the remainder and clears the prefix's flag. This breaks rendering: the prefix (which is part of the original word that should attach to previous) loses its attachment, and the remainder incorrectly tries to attach despite the intervening hyphen.

The fix should preserve the prefix's attachment flag and insert false for the remainder:

Suggested fix
-  const bool originalContinuedToNext = wordContinues[wordIndex];
-  // The original word (now prefix) does NOT continue to remainder (hyphen separates them)
-  wordContinues[wordIndex] = false;
-  wordContinues.insert(wordContinues.begin() + wordIndex + 1, originalContinuedToNext);
+  // Preserve the prefix's attach-to-previous flag; allow a break between prefix and remainder.
+  wordContinues.insert(wordContinues.begin() + wordIndex + 1, false);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/Epub/Epub/ParsedText.cpp` around lines 313 - 373,
ParsedText::hyphenateWordAtIndex is clearing the original word's
attach-to-previous flag (wordContinues[wordIndex]) and incorrectly giving that
flag to the remainder; instead capture the original attach flag, keep it on the
prefix, and insert false for the remainder. Specifically, in
hyphenateWordAtIndex, save a bool originalAttach = wordContinues[wordIndex];
after resizing/pushing the hyphen leave wordContinues[wordIndex] =
originalAttach; then insert wordContinues.insert(wordContinues.begin() +
wordIndex + 1, false); (remove the current logic that sets the prefix to false
and reuses originalContinuedToNext).
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ba9658 and ee9aa18.

📒 Files selected for processing (4)
  • lib/Epub/Epub/ParsedText.cpp
  • lib/Epub/Epub/ParsedText.h
  • lib/Epub/Epub/blocks/TextBlock.cpp
  • lib/Epub/Epub/blocks/TextBlock.h
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-02-19T12:17:00.962Z
Learnt from: Uri-Tauber
Repo: crosspoint-reader/crosspoint-reader PR: 988
File: lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp:649-661
Timestamp: 2026-02-19T12:17:00.962Z
Learning: In ChapterHtmlSlimParser.cpp, when computing footnote word indices in endElement() for footnote links, the wordIndex must be cumulative across the 750-word mid-paragraph flush boundary. The correct calculation is: `wordIndex = wordsExtractedInBlock + currentTextBlock->size()`, not just `currentTextBlock->size()`. This ensures footnotes attach to the page containing their actual anchor word, even after layoutAndExtractLines(false) has extracted and removed earlier words from the block.

Applied to files:

  • lib/Epub/Epub/ParsedText.h
  • lib/Epub/Epub/blocks/TextBlock.cpp
  • lib/Epub/Epub/ParsedText.cpp
🧬 Code graph analysis (1)
lib/Epub/Epub/blocks/TextBlock.h (2)
lib/Epub/Epub/blocks/Block.h (1)
  • Block (8-15)
lib/EpdFont/EpdFontFamily.h (1)
  • EpdFontFamily (4-23)
🔇 Additional comments (9)
lib/Epub/Epub/ParsedText.h (1)

16-33: Vector-backed containers and helper signatures align with cpp changes.
The private storage and helper declarations are consistent with the index-based access patterns in the implementation.

lib/Epub/Epub/blocks/TextBlock.cpp (2)

15-22: Index-based render loop is clean and bounds-checked.
The size mismatch guard plus direct indexing is safe and clear.


74-96: Vector-based deserialization with a word-count sanity cap looks solid.
The resize-then-fill flow is straightforward and avoids list overhead.

lib/Epub/Epub/blocks/TextBlock.h (1)

15-26: Constructor and fields updated consistently for vectors.
The member layout and initializer list now match the vector-based data flow.

lib/Epub/Epub/ParsedText.cpp (5)

70-103: State cleanup after line extraction keeps vectors aligned.
Using wordContinues directly and erasing consumed ranges from all three vectors keeps internal state consistent.


106-112: Pre-reserve and index-based width computation is efficient.
The reserve + indexed push reduces reallocations while matching the new storage model.


131-138: Fallback hyphenation call updated appropriately.
The new signature usage reads correctly within the oversized-word pre-pass.


278-285: Hyphenation hook respects available-width gate.
The guard plus updated call signature look correct in the hyphenated line builder.


430-449: Move-based line extraction and pre-sized X positions look good.
The move iterators plus reserved lineXPos vector are consistent with the vector refactor.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@lib/Epub/Epub/ParsedText.cpp`:
- Around line 313-373: ParsedText::hyphenateWordAtIndex is clearing the original
word's attach-to-previous flag (wordContinues[wordIndex]) and incorrectly giving
that flag to the remainder; instead capture the original attach flag, keep it on
the prefix, and insert false for the remainder. Specifically, in
hyphenateWordAtIndex, save a bool originalAttach = wordContinues[wordIndex];
after resizing/pushing the hyphen leave wordContinues[wordIndex] =
originalAttach; then insert wordContinues.insert(wordContinues.begin() +
wordIndex + 1, false); (remove the current logic that sets the prefix to false
and reuses originalContinuedToNext).

@znelson
Copy link
Contributor Author

znelson commented Feb 20, 2026

I added some logging for the various layout phases and had Claude put together a summary of the differences:

List vs Vector Text Layout Performance Comparison

Test book: Intermezzo by Sally Rooney, chapter 14 (contains a 2,420-word paragraph).
Device: ESP32-C3, ~230KB total heap.

Headline numbers

Metric List (before) Vector (after) Delta
Chapter parse time 1,657 ms 1,475 ms -182 ms (11% faster)
Min free heap 56,772 bytes 107,388 bytes +50,616 bytes (89% more headroom)

Early flush behavior

Both logs show exactly 3 early flushes for the 2,420-word paragraph, with size dropping to single digits after each:

early flush triggered: size=880  → size after=2
early flush triggered: size=793  → size after=4
early flush triggered: size=752  → size after=1

This confirms the erase change works correctly. Without it, you'd see size=751, 752, 753... firing on every word -- ~1,670 redundant flushes instead of 3.

The three early flush passes compared

Flush Words List total Vector total Delta
1st 880 121,519 us 134,275 us List 10% faster
2nd 793 117,151 us 101,200 us Vector 14% faster
3rd 752 113,726 us 96,350 us Vector 15% faster
Sum 352,396 us 331,825 us Vector 6% faster

The first flush is an outlier where list's extract was faster (possibly cold cache). On the 2nd and 3rd flushes, vector's contiguous memory wins decisively.

Medium paragraphs

Words List total Vector total Savings
550 87,236 us 76,261 us 12.6%
296 43,553 us 38,698 us 11.2%
271 27,017 us 24,129 us 10.7%
244 29,973 us 26,379 us 12.0%
228 38,437 us 33,341 us 13.3%
196 25,074 us 21,664 us 13.6%

Vector is 10-14% faster across the board for medium paragraphs.

Smallest paragraphs

Words List total Vector total Savings
1 466 us 473 us -1.5%
5 520 us 498 us 4.2%
10 1,329 us 1,248 us 6.1%
15 2,006 us 1,318 us 34.3%
20 1,638 us 1,509 us 7.9%
25 1,823 us 1,692 us 7.2%

These samples span the full 1-25 word range. At tiny sizes, fixed overhead dominates and results can be noisy, but vector is generally faster across this range.

Memory

The heap values during the early flushes tell the real story. Look at the heap before each flush (the low-water mark, when all the words are buffered but not yet extracted):

List Vector
Before 1st flush 75,688 free 142,556 free
Before 2nd flush 63,196 free 130,672 free
Before 3rd flush 68,420 free 131,256 free

The list version dipped to 63KB free during the second flush. On an ESP32 with ~230KB total heap, that's only 27% remaining -- dangerously close to allocation failures if other subsystems (fonts, zip streams, CSS) need memory simultaneously.

The vector version never went below 126KB free -- roughly double the headroom. The ~63KB difference directly matches the per-node overhead: 880 words x 3 lists x ~24 bytes (prev/next pointers + malloc bookkeeping) = ~63KB.

Phase breakdown

The biggest wins come from extract, but the phase-level breakdown shows where both small and large gains appear:

Case Impl measure breaks extract erase total
550 words List 4,846 us 15,243 us 67,112 us 87,236 us
550 words Vector 4,788 us 14,220 us 57,125 us 90 us 76,261 us
2nd flush (793) List 5,999 us 24,358 us 86,731 us 117,151 us
2nd flush (793) Vector 5,947 us 21,854 us 73,209 us 133 us 101,200 us
3rd flush (752) List 6,020 us 24,229 us 83,403 us 113,726 us
3rd flush (752) Vector 5,984 us 24,008 us 66,178 us 125 us 96,350 us

What this shows:

  • measure is almost identical between list and vector (typically <2% delta).
  • breaks improves modestly with vector (about 1-10% depending on paragraph).
  • extract carries almost all of the total win on medium/large paragraphs (about 15-21% faster in the examples above).
  • erase is present only in vector and adds just ~90-133 us, far too small to offset extract gains.

This aligns with expected access patterns: extracting 100+ lines repeatedly touches many word entries; contiguous storage keeps iteration cache-friendly while list traversal pays pointer-chasing costs on every pass.

Erase cost

The erase step added by the fix costs 90-133 microseconds per flush -- 0.1% of total flush time. Completely negligible.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates EPUB text layout internals to use contiguous storage (std::vector) instead of node-based storage (std::list) for words, styles, and positioning, aiming to improve cache locality and iteration performance during layout/rendering.

Changes:

  • Replace std::list with std::vector in TextBlock and ParsedText word/style storage.
  • Convert iterator-based traversal to index-based access throughout rendering/layout.
  • Remove the extra “indexed continues” copy by using std::vector<bool> directly.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
lib/Epub/Epub/blocks/TextBlock.h Switch TextBlock storage/constructor from std::list to std::vector.
lib/Epub/Epub/blocks/TextBlock.cpp Update rendering and deserialization logic to use vector indexing and vector-based allocations.
lib/Epub/Epub/ParsedText.h Switch internal storage to vectors; adjust hyphenation helper signature accordingly.
lib/Epub/Epub/ParsedText.cpp Remove list/iterator patterns, use vector indexing/move ranges, and add end-of-layout consumption via vector erase.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

osteotek
osteotek previously approved these changes Feb 20, 2026
Copy link
Member

@osteotek osteotek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, works in my testing

Co-authored-by: Cursor <[email protected]>

# Conflicts:
#	lib/Epub/Epub/ParsedText.cpp
@daveallie daveallie merged commit 3696794 into crosspoint-reader:master Feb 22, 2026
6 checks passed
@znelson znelson deleted the list-to-vector branch February 26, 2026 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants