perf: Replace std::list with std::vector in text layout by znelson · Pull Request #1038 · crosspoint-reader/crosspoint-reader

znelson · 2026-02-20T15:45:46Z

Summary

Revision to @BlindBat's #802. Description comes from the original PR.

Replace std::list with std::vector for word storage in TextBlock and ParsedText
Use index-based access (words[i]) instead of iterator advancement (std::advance(it, n))
Remove the separate continuesVec copy that was built from wordContinues for O(1) access — now unnecessary since std::vector<bool> already provides O(1) indexing

Why

std::list allocates each node individually on the heap with 16 bytes of prev/next pointer overhead per node. For text layout with many small words, this means:

Scattered heap allocations instead of contiguous memory
Poor cache locality during iteration (each node can be anywhere in memory)
Per-node malloc/free overhead during construction and destruction

std::vector stores elements contiguously, giving better cache performance during the tight rendering and layout loops. The extractLine function also benefits: list splice was O(1) but required maintaining three parallel iterators, while vector range construction with move iterators is simpler and still efficient for the small line-sized chunks involved.

Files changed

lib/Epub/Epub/blocks/TextBlock.h / .cpp
lib/Epub/Epub/ParsedText.h / .cpp

AI Usage

YES

Test plan

Open an EPUB with mixed formatting (bold, italic, underline) — verify text renders correctly
Open a book with justified text — verify word spacing is correct
Open a book with hyphenation enabled — verify words break correctly at hyphens
Navigate through pages rapidly — verify no rendering glitches or crashes
Open a book with long paragraphs — verify text layout matches pre-change behavior

Replace std::list with std::vector for the words, wordStyles, wordXpos, and wordContinues containers in TextBlock and ParsedText. Vectors provide contiguous memory layout for better cache locality and O(1) random access, eliminating per-node heap allocation and the 16-byte prev/next pointer overhead of doubly-linked list nodes. The indexed access also removes the need for a separate continuesVec copy that was previously built from the list for O(1) layout access.

coderabbitai · 2026-02-20T15:45:53Z

No actionable comments were generated in the recent review. 🎉

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee9aa18 and 011acaf.

📒 Files selected for processing (1)

lib/Epub/Epub/ParsedText.cpp

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: Uri-Tauber
Repo: crosspoint-reader/crosspoint-reader PR: 988
File: lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp:649-661
Timestamp: 2026-02-19T12:17:05.421Z
Learning: In ChapterHtmlSlimParser.cpp, when computing footnote word indices in endElement() for footnote links, the wordIndex must be cumulative across the 750-word mid-paragraph flush boundary. The correct calculation is: `wordIndex = wordsExtractedInBlock + currentTextBlock->size()`, not just `currentTextBlock->size()`. This ensures footnotes attach to the page containing their actual anchor word, even after layoutAndExtractLines(false) has extracted and removed earlier words from the block.

📚 Learning: 2026-02-19T12:17:05.421Z

Learnt from: Uri-Tauber
Repo: crosspoint-reader/crosspoint-reader PR: 988
File: lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp:649-661
Timestamp: 2026-02-19T12:17:05.421Z
Learning: In ChapterHtmlSlimParser.cpp, when computing footnote word indices in endElement() for footnote links, the wordIndex must be cumulative across the 750-word mid-paragraph flush boundary. The correct calculation is: `wordIndex = wordsExtractedInBlock + currentTextBlock->size()`, not just `currentTextBlock->size()`. This ensures footnotes attach to the page containing their actual anchor word, even after layoutAndExtractLines(false) has extracted and removed earlier words from the block.

Applied to files:

lib/Epub/Epub/ParsedText.cpp

🧬 Code graph analysis (1)

lib/Epub/Epub/ParsedText.cpp (1)

src/main.cpp (1)

renderer (38-38)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: build
GitHub Check: cppcheck

🔇 Additional comments (6)

lib/Epub/Epub/ParsedText.cpp (6)

84-103: Clean handoff to vector-based line extraction. Passing wordContinues directly and erasing consumed ranges after extraction keeps the remaining buffer consistent.

106-112: Width precompute loop is consistent with vector storage. The reserve + index-based access is a good fit here.

131-139: Oversized-word fallback still covered. The in-place hyphenation call aligns with the new vector-backed data flow.

265-288: Hyphenated greedy layout flow looks solid. The split-and-advance logic still guarantees forward progress.

313-394: In-place hyphenation updates are consistent. Updating words, wordStyles, wordContinues, and wordWidths together preserves alignment.

446-465: Line extraction via move iterators looks good. Preallocating lineXPos and range-moving words fits the new vector model.

📝 Walkthrough

Walkthrough

Refactors text layout internals: replaces std::list with std::vector for word storage, moves iterator-based logic to index-based access, and removes the separate continuesVec parameter—using the wordContinues member directly across hyphenation and line-breaking code paths.

Changes

Cohort / File(s)	Summary
ParsedText — storage & layout logic `lib/Epub/Epub/ParsedText.h`, `lib/Epub/Epub/ParsedText.cpp`	Switched internal containers from `std::list` to `std::vector`; removed `continuesVec` parameters from `computeLineBreaks`, `computeHyphenatedLineBreaks`, `hyphenateWordAtIndex`, and `extractLine`; moved from iterator-based to index-based access and now mutate primary word vectors in-place when extracting lines or inserting hyphenated remainders.
TextBlock — API and rendering `lib/Epub/Epub/blocks/TextBlock.h`, `lib/Epub/Epub/blocks/TextBlock.cpp`	Public fields and constructor changed to accept `std::vector` instead of `std::list`; `render` and deserialization updated to index-based access; underline and width calculations now use vector indexing rather than iterator dereference.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

perf: Skip constructing unnecessary std::string #932: Modifies TextBlock::render and related word rendering logic, overlapping with the render/index-access changes in this PR.

Suggested reviewers

daveallie
ngxson

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and accurately summarizes the main change: replacing std::list with std::vector in text layout performance optimization.
Description check	✅ Passed	The description is directly related to the changeset, providing clear context about the motivation, implementation details, files changed, and test plan for the std::list to std::vector migration.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

znelson · 2026-02-20T15:47:26Z

lib/Epub/Epub/ParsedText.cpp

+  // Remove consumed words so size() reflects only remaining words
+  if (lineCount > 0) {
+    const size_t consumed = lineBreakIndices[lineCount - 1];
+    words.erase(words.begin(), words.begin() + consumed);
+    wordStyles.erase(wordStyles.begin(), wordStyles.begin() + consumed);
+    wordContinues.erase(wordContinues.begin(), wordContinues.begin() + consumed);


Here's the fix added upon changes from #802. I believe this keeps the vector sizes correct now.

This does cause the vectors to move the remaining elements to the front, but at most they contain the partial last line (when includeLastLine=false) from in-progress parsing, so only a handful of elements may need to be moved.

The existing contiguous vector storage can then be reused by ParsedText::addWord.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/Epub/Epub/ParsedText.cpp (1)
313-373: ⚠️ Potential issue | 🟡 Minor

Preserve the attach-to-previous flag when splitting words.

wordContinues represents "word attaches to previous (no space before it)," used for inline element boundaries. When a word with attachToPrevious=true is hyphenated, the current code incorrectly transfers this flag to the remainder and clears the prefix's flag. This breaks rendering: the prefix (which is part of the original word that should attach to previous) loses its attachment, and the remainder incorrectly tries to attach despite the intervening hyphen.

The fix should preserve the prefix's attachment flag and insert false for the remainder:
Suggested fix
-  const bool originalContinuedToNext = wordContinues[wordIndex];
-  // The original word (now prefix) does NOT continue to remainder (hyphen separates them)
-  wordContinues[wordIndex] = false;
-  wordContinues.insert(wordContinues.begin() + wordIndex + 1, originalContinuedToNext);
+  // Preserve the prefix's attach-to-previous flag; allow a break between prefix and remainder.
+  wordContinues.insert(wordContinues.begin() + wordIndex + 1, false);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/Epub/Epub/ParsedText.cpp` around lines 313 - 373,
ParsedText::hyphenateWordAtIndex is clearing the original word's
attach-to-previous flag (wordContinues[wordIndex]) and incorrectly giving that
flag to the remainder; instead capture the original attach flag, keep it on the
prefix, and insert false for the remainder. Specifically, in
hyphenateWordAtIndex, save a bool originalAttach = wordContinues[wordIndex];
after resizing/pushing the hyphen leave wordContinues[wordIndex] =
originalAttach; then insert wordContinues.insert(wordContinues.begin() +
wordIndex + 1, false); (remove the current logic that sets the prefix to false
and reuses originalContinuedToNext).

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ba9658 and ee9aa18.

📒 Files selected for processing (4)

lib/Epub/Epub/ParsedText.cpp
lib/Epub/Epub/ParsedText.h
lib/Epub/Epub/blocks/TextBlock.cpp
lib/Epub/Epub/blocks/TextBlock.h

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2026-02-19T12:17:00.962Z

Learnt from: Uri-Tauber
Repo: crosspoint-reader/crosspoint-reader PR: 988
File: lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp:649-661
Timestamp: 2026-02-19T12:17:00.962Z
Learning: In ChapterHtmlSlimParser.cpp, when computing footnote word indices in endElement() for footnote links, the wordIndex must be cumulative across the 750-word mid-paragraph flush boundary. The correct calculation is: `wordIndex = wordsExtractedInBlock + currentTextBlock->size()`, not just `currentTextBlock->size()`. This ensures footnotes attach to the page containing their actual anchor word, even after layoutAndExtractLines(false) has extracted and removed earlier words from the block.

Applied to files:

lib/Epub/Epub/ParsedText.h
lib/Epub/Epub/blocks/TextBlock.cpp
lib/Epub/Epub/ParsedText.cpp

🧬 Code graph analysis (1)

lib/Epub/Epub/blocks/TextBlock.h (2)

lib/Epub/Epub/blocks/Block.h (1)

Block (8-15)

lib/EpdFont/EpdFontFamily.h (1)

EpdFontFamily (4-23)

🔇 Additional comments (9)

lib/Epub/Epub/ParsedText.h (1)

16-33: Vector-backed containers and helper signatures align with cpp changes.
The private storage and helper declarations are consistent with the index-based access patterns in the implementation.

lib/Epub/Epub/blocks/TextBlock.cpp (2)

15-22: Index-based render loop is clean and bounds-checked.
The size mismatch guard plus direct indexing is safe and clear.

74-96: Vector-based deserialization with a word-count sanity cap looks solid.
The resize-then-fill flow is straightforward and avoids list overhead.

lib/Epub/Epub/blocks/TextBlock.h (1)

15-26: Constructor and fields updated consistently for vectors.
The member layout and initializer list now match the vector-based data flow.

lib/Epub/Epub/ParsedText.cpp (5)

70-103: State cleanup after line extraction keeps vectors aligned.
Using wordContinues directly and erasing consumed ranges from all three vectors keeps internal state consistent.

106-112: Pre-reserve and index-based width computation is efficient.
The reserve + indexed push reduces reallocations while matching the new storage model.

131-138: Fallback hyphenation call updated appropriately.
The new signature usage reads correctly within the oversized-word pre-pass.

278-285: Hyphenation hook respects available-width gate.
The guard plus updated call signature look correct in the hyphenated line builder.

430-449: Move-based line extraction and pre-sized X positions look good.
The move iterators plus reserved lineXPos vector are consistent with the vector refactor.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@lib/Epub/Epub/ParsedText.cpp`:
- Around line 313-373: ParsedText::hyphenateWordAtIndex is clearing the original
word's attach-to-previous flag (wordContinues[wordIndex]) and incorrectly giving
that flag to the remainder; instead capture the original attach flag, keep it on
the prefix, and insert false for the remainder. Specifically, in
hyphenateWordAtIndex, save a bool originalAttach = wordContinues[wordIndex];
after resizing/pushing the hyphen leave wordContinues[wordIndex] =
originalAttach; then insert wordContinues.insert(wordContinues.begin() +
wordIndex + 1, false); (remove the current logic that sets the prefix to false
and reuses originalContinuedToNext).

znelson · 2026-02-20T16:54:12Z

I added some logging for the various layout phases and had Claude put together a summary of the differences:

List vs Vector Text Layout Performance Comparison

Test book: Intermezzo by Sally Rooney, chapter 14 (contains a 2,420-word paragraph).
Device: ESP32-C3, ~230KB total heap.

Headline numbers

Metric	List (before)	Vector (after)	Delta
Chapter parse time	1,657 ms	1,475 ms	-182 ms (11% faster)
Min free heap	56,772 bytes	107,388 bytes	+50,616 bytes (89% more headroom)

Early flush behavior

Both logs show exactly 3 early flushes for the 2,420-word paragraph, with size dropping to single digits after each:

early flush triggered: size=880  → size after=2
early flush triggered: size=793  → size after=4
early flush triggered: size=752  → size after=1

This confirms the erase change works correctly. Without it, you'd see size=751, 752, 753... firing on every word -- ~1,670 redundant flushes instead of 3.

The three early flush passes compared

Flush	Words	List total	Vector total	Delta
1st	880	121,519 us	134,275 us	List 10% faster
2nd	793	117,151 us	101,200 us	Vector 14% faster
3rd	752	113,726 us	96,350 us	Vector 15% faster
Sum		352,396 us	331,825 us	Vector 6% faster

The first flush is an outlier where list's extract was faster (possibly cold cache). On the 2nd and 3rd flushes, vector's contiguous memory wins decisively.

Medium paragraphs

Words	List total	Vector total	Savings
550	87,236 us	76,261 us	12.6%
296	43,553 us	38,698 us	11.2%
271	27,017 us	24,129 us	10.7%
244	29,973 us	26,379 us	12.0%
228	38,437 us	33,341 us	13.3%
196	25,074 us	21,664 us	13.6%

Vector is 10-14% faster across the board for medium paragraphs.

Smallest paragraphs

Words	List total	Vector total	Savings
1	466 us	473 us	-1.5%
5	520 us	498 us	4.2%
10	1,329 us	1,248 us	6.1%
15	2,006 us	1,318 us	34.3%
20	1,638 us	1,509 us	7.9%
25	1,823 us	1,692 us	7.2%

These samples span the full 1-25 word range. At tiny sizes, fixed overhead dominates and results can be noisy, but vector is generally faster across this range.

Memory

The heap values during the early flushes tell the real story. Look at the heap before each flush (the low-water mark, when all the words are buffered but not yet extracted):

	List	Vector
Before 1st flush	75,688 free	142,556 free
Before 2nd flush	63,196 free	130,672 free
Before 3rd flush	68,420 free	131,256 free

The list version dipped to 63KB free during the second flush. On an ESP32 with ~230KB total heap, that's only 27% remaining -- dangerously close to allocation failures if other subsystems (fonts, zip streams, CSS) need memory simultaneously.

The vector version never went below 126KB free -- roughly double the headroom. The ~63KB difference directly matches the per-node overhead: 880 words x 3 lists x ~24 bytes (prev/next pointers + malloc bookkeeping) = ~63KB.

Phase breakdown

The biggest wins come from extract, but the phase-level breakdown shows where both small and large gains appear:

Case	Impl	measure	breaks	extract	erase	total
550 words	List	4,846 us	15,243 us	67,112 us	—	87,236 us
550 words	Vector	4,788 us	14,220 us	57,125 us	90 us	76,261 us
2nd flush (793)	List	5,999 us	24,358 us	86,731 us	—	117,151 us
2nd flush (793)	Vector	5,947 us	21,854 us	73,209 us	133 us	101,200 us
3rd flush (752)	List	6,020 us	24,229 us	83,403 us	—	113,726 us
3rd flush (752)	Vector	5,984 us	24,008 us	66,178 us	125 us	96,350 us

What this shows:

measure is almost identical between list and vector (typically <2% delta).
breaks improves modestly with vector (about 1-10% depending on paragraph).
extract carries almost all of the total win on medium/large paragraphs (about 15-21% faster in the examples above).
erase is present only in vector and adds just ~90-133 us, far too small to offset extract gains.

This aligns with expected access patterns: extracting 100+ lines repeatedly touches many word entries; contiguous storage keeps iteration cache-friendly while list traversal pays pointer-chasing costs on every pass.

Erase cost

The erase step added by the fix costs 90-133 microseconds per flush -- 0.1% of total flush time. Completely negligible.

Copilot

Pull request overview

This PR updates EPUB text layout internals to use contiguous storage (std::vector) instead of node-based storage (std::list) for words, styles, and positioning, aiming to improve cache locality and iteration performance during layout/rendering.

Changes:

Replace std::list with std::vector in TextBlock and ParsedText word/style storage.
Convert iterator-based traversal to index-based access throughout rendering/layout.
Remove the extra “indexed continues” copy by using std::vector<bool> directly.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
lib/Epub/Epub/blocks/TextBlock.h	Switch `TextBlock` storage/constructor from `std::list` to `std::vector`.
lib/Epub/Epub/blocks/TextBlock.cpp	Update rendering and deserialization logic to use vector indexing and vector-based allocations.
lib/Epub/Epub/ParsedText.h	Switch internal storage to vectors; adjust hyphenation helper signature accordingly.
lib/Epub/Epub/ParsedText.cpp	Remove list/iterator patterns, use vector indexing/move ranges, and add end-of-layout consumption via vector erase.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

lib/Epub/Epub/ParsedText.cpp

osteotek

looks good to me, works in my testing

Co-authored-by: Cursor <[email protected]> # Conflicts: # lib/Epub/Epub/ParsedText.cpp

BlindBat and others added 2 commits February 20, 2026 09:12

Added missing .erase calls

ee9aa18

znelson commented Feb 20, 2026

View reviewed changes

znelson requested review from daveallie and osteotek February 20, 2026 15:50

znelson marked this pull request as ready for review February 20, 2026 15:50

coderabbitai bot reviewed Feb 20, 2026

View reviewed changes

thx @coderabbitai

9bbe994

osteotek requested a review from Copilot February 20, 2026 18:42

Copilot started reviewing on behalf of osteotek February 20, 2026 18:42 View session

Copilot AI reviewed Feb 20, 2026

View reviewed changes

lib/Epub/Epub/ParsedText.cpp Show resolved Hide resolved

lib/Epub/Epub/ParsedText.cpp Show resolved Hide resolved

osteotek previously approved these changes Feb 20, 2026

View reviewed changes

Merge branch 'master' into list-to-vector

011acaf

Co-authored-by: Cursor <[email protected]> # Conflicts: # lib/Epub/Epub/ParsedText.cpp

znelson dismissed osteotek’s stale review via 011acaf February 22, 2026 03:46

znelson requested a review from osteotek February 22, 2026 03:50

daveallie approved these changes Feb 22, 2026

View reviewed changes

daveallie merged commit 3696794 into crosspoint-reader:master Feb 22, 2026
6 checks passed

This was referenced Feb 22, 2026

perf: Replace std::list with std::vector in text layout #802

Closed

Regression after introduction of new framework #1092

Closed

znelson deleted the list-to-vector branch February 26, 2026 14:53

coderabbitai bot mentioned this pull request Feb 28, 2026

fix: Hanging indent (negative text-indent) and em-unit sizing #1229

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: Replace std::list with std::vector in text layout#1038

perf: Replace std::list with std::vector in text layout#1038
daveallie merged 4 commits intocrosspoint-reader:masterfrom
znelson:list-to-vector

znelson commented Feb 20, 2026

Uh oh!

coderabbitai bot commented Feb 20, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

znelson Feb 20, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

znelson commented Feb 20, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

osteotek left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

znelson commented Feb 20, 2026

Summary

Why

Files changed

AI Usage

Test plan

Uh oh!

coderabbitai bot commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

znelson Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

znelson commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

List vs Vector Text Layout Performance Comparison

Headline numbers

Early flush behavior

The three early flush passes compared

Medium paragraphs

Smallest paragraphs

Memory

Phase breakdown

Erase cost

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

osteotek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

coderabbitai bot commented Feb 20, 2026 •

edited

Loading

znelson commented Feb 20, 2026 •

edited

Loading