fix: Account for nbsp; character as non-breaking space#757
Conversation
- Added `WordAttach` enum with the 3 kinds of attachments: Normal, Continues (for when tags close and there shouldn't be an extra space), and NonBreaking for non-breaking space characters - Extended existing logic with `continuesVec` to support using `WordAttach` instead of boolean
|
Looks good on my books but I'd love if @IjonFryderyk can test or else share an ebook file that caused his problem so we can check. |
|
Is there a way to test this locally without flashing to my X4? Sorry, I'm a noob with embedded development - not sure if that's possible with ESP32 firmware. |
…king-space * master: feat: Add percentage support to CSS properties (crosspoint-reader#738) Use GITHUB_REF_NAME over GITHUB_HEAD_REF in release candidate workflow Add release candidate workflow fix: Allow OTA update from RC build to full release (crosspoint-reader#778) fix(ui): Add Back label in KOReader Sync screen (crosspoint-reader#770) fix: Add EPUB 3 cover image detection (crosspoint-reader#760) feat: A web editor for settings (crosspoint-reader#667) feat: add HalStorage (crosspoint-reader#656) perf: optimize drawPixel() (crosspoint-reader#748) feat: wakeup target detection (crosspoint-reader#731) fix: Scrolling page items calculation (crosspoint-reader#716) refactor: Rename "Embedded Style" to "Book's Embedded Style" (crosspoint-reader#746) feat: optimize fillRectDither (crosspoint-reader#737)
It's all good! Currently, there isn't a great way of testing CrossPoint without flashing on-device. There's some steps being taken in other PRs/forks to create an emulated testing environment, but they're still works in progress and a bit limited right now. If you're comfortable with flashing a new firmware, you can download the |
|
Here's an EPUB file that contains |
…king-space * master: feat: use natural sort in file browser (crosspoint-reader#722) fix: issue if book href are absolute url and not relative to server (crosspoint-reader#741) feat: unify navigation handling with system-wide continuous navigation (crosspoint-reader#600) feat: Add Italian hyphenation support (crosspoint-reader#584) feat: Add percentage support to CSS properties (crosspoint-reader#738) Use GITHUB_REF_NAME over GITHUB_HEAD_REF in release candidate workflow Move release candidate workflow to manual dispatch fix: Allow OTA update from RC build to full release (crosspoint-reader#778) refactor: Rename "Embedded Style" to "Book's Embedded Style" (crosspoint-reader#746) perf: optimize drawPixel() (crosspoint-reader#748) feat: wakeup target detection (crosspoint-reader#731) fix: Scrolling page items calculation (crosspoint-reader#716) feat: optimize fillRectDither (crosspoint-reader#737) fix: increase lyra sideButtonHintsWidth to 30 (crosspoint-reader#727) fix: Remove separations after style changes (crosspoint-reader#720) fix: Lag before displaying covers on home screen (crosspoint-reader#721) feat: Add Settings for toggling CSS on or off (crosspoint-reader#717) Use GITHUB_HEAD_REF release: 1.0.0
…king-space * master: fix: Reduce MIN_SIZE_FOR_POPUP to 10KB (crosspoint-reader#809) docs: Update USER_GUIDE.md (crosspoint-reader#817) fix: Prevent sleeping when in OPDS browser / downloading books (crosspoint-reader#818) feat: Extend python debugging monitor functionality (keyword filter / suppress) (crosspoint-reader#810) docs: Update USER_GUIDE.md (crosspoint-reader#808) feat: Connect to last wifi by default (crosspoint-reader#752)
This reverts commit b64860f.
- Brought back HTML entity table from previous commit and refactored it to use a static const char * table with linear lookup to reduce heap allocations. - Used `XML_SetDefaultHandlerExpand` in expat to parse out the entities correctly, without needing them defined in DOCTYPE - Added handling for ` ` so that the text stays together and doesn't break onto a new line with text separated by an ` `
|
@osteotek @IjonFryderyk I've just pushed up a new commit that I've thoroughly tested with the Polish EPUB and my own test EPUBs. Let me know if this does the trick to fix things! @daveallie I followed your suggestion of bringing back the HTML entity parsing code, and I made some minor changes to reduce heap allocations, since I was hitting heap issues when loading the Polish EPUB (chapters are ~200+ pages on my device, and it would hit an error when it was processing all the chapter's pages at the beginning of a cacheless load). |
There was a problem hiding this comment.
Pull request overview
This PR restores HTML entity handling in the EPUB chapter HTML parser to correctly interpret entities like and treat non-breaking spaces as unbreakable word joins in the line-breaking algorithm (addressing issue #743).
Changes:
- Added an Expat
XML_SetDefaultHandlerExpandhandler to intercept and expand HTML entity references during parsing. - Introduced an HTML entity lookup table (
lookupHtmlEntity) and integrated it into chapter parsing. - Implemented explicit handling of U+00A0 (NBSP) in text parsing and adjusted word-width measurement for standalone space tokens.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| lib/Epub/Epub/parsers/ChapterHtmlSlimParser.h | Declares the new Expat default handler callback. |
| lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp | Expands HTML entities via Expat default handler and adds NBSP-aware word continuation logic. |
| lib/Epub/Epub/htmlEntities.h | Declares entity replacement/lookup helpers. |
| lib/Epub/Epub/htmlEntities.cpp | Adds entity lookup table and helper implementations. |
| lib/Epub/Epub/ParsedText.cpp | Special-cases measuring width for a single-space “word” token. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@jdk2pq I've tested this on the polish epub, seems to be working good, thank you. Some of the Copilot suggestions seems valid to me, but ultimately up to if you want to address them |
…king-space * master: fix: chore: make all debug messages uniform (crosspoint-reader#825) fix: Show "Back" in file browser if not in root, "Home" otherwise. (crosspoint-reader#822) fix: Manually trigger GPIO update in File Browser mode (crosspoint-reader#819)
…p unused code in htmlEntities, and add proper em, en, and thin space instead of empty strings
| const size_t keyLen = strlen(key); | ||
| if (static_cast<size_t>(len) == keyLen && memcmp(entity, key, keyLen) == 0) { |
| const char* value; | ||
| }; | ||
|
|
||
| static const EntityPair ENTITY_LOOKUP[] = { |
There was a problem hiding this comment.
possible improvement for future: maybe compile this list as a tree structure, to avoid linear search and save space
|
Merged, as the open topics can be done in future PRs. |
* master: feat: use pre-compressed HTML pages (crosspoint-reader#861) docs: Add requirement device be on when flashing (crosspoint-reader#877) fix: Account for `nbsp;` character as non-breaking space (crosspoint-reader#757) feat: Add central logging pragma (crosspoint-reader#843)
…reader#757) ## Summary Closes crosspoint-reader#743. **What is the goal of this PR?** - Add back handling for HTML entities in expat. This was originally part of the code that got removed [here](crosspoint-reader#274) - Handle ` ` characters to resolve issue crosspoint-reader#743 **What changes are included?** - Brought back HTML entity table from previous commit and refactored it to use a static const char * table with linear lookup to reduce heap allocations. - Used `XML_SetDefaultHandlerExpand` in expat to parse out the entities correctly, without needing them defined in DOCTYPE - Added handling for ` ` so that the text stays together and doesn't break onto a new line with text separated by an ` ` ## Additional Context - This supersedes [this PR](crosspoint-reader#751) that simply handled `nbsp;` as whitespace. Instead, we want that character to serve its true purpose and affect the line-breaking algorithm. - Updated my test EPUB [here](https://github.com/jdk2pq/css-test-epub) with ` ` characters examples at the end of the book --- ### AI Usage While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it helps set the right context for reviewers. Did you use AI tools to help write this code? _**YES**_, Claude Code
…reader#757) ## Summary Closes crosspoint-reader#743. **What is the goal of this PR?** - Add back handling for HTML entities in expat. This was originally part of the code that got removed [here](crosspoint-reader#274) - Handle ` ` characters to resolve issue crosspoint-reader#743 **What changes are included?** - Brought back HTML entity table from previous commit and refactored it to use a static const char * table with linear lookup to reduce heap allocations. - Used `XML_SetDefaultHandlerExpand` in expat to parse out the entities correctly, without needing them defined in DOCTYPE - Added handling for ` ` so that the text stays together and doesn't break onto a new line with text separated by an ` ` ## Additional Context - This supersedes [this PR](crosspoint-reader#751) that simply handled `nbsp;` as whitespace. Instead, we want that character to serve its true purpose and affect the line-breaking algorithm. - Updated my test EPUB [here](https://github.com/jdk2pq/css-test-epub) with ` ` characters examples at the end of the book --- ### AI Usage While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it helps set the right context for reviewers. Did you use AI tools to help write this code? _**YES**_, Claude Code
…reader#757) ## Summary Closes crosspoint-reader#743. **What is the goal of this PR?** - Add back handling for HTML entities in expat. This was originally part of the code that got removed [here](crosspoint-reader#274) - Handle ` ` characters to resolve issue crosspoint-reader#743 **What changes are included?** - Brought back HTML entity table from previous commit and refactored it to use a static const char * table with linear lookup to reduce heap allocations. - Used `XML_SetDefaultHandlerExpand` in expat to parse out the entities correctly, without needing them defined in DOCTYPE - Added handling for ` ` so that the text stays together and doesn't break onto a new line with text separated by an ` ` ## Additional Context - This supersedes [this PR](crosspoint-reader#751) that simply handled `nbsp;` as whitespace. Instead, we want that character to serve its true purpose and affect the line-breaking algorithm. - Updated my test EPUB [here](https://github.com/jdk2pq/css-test-epub) with ` ` characters examples at the end of the book --- ### AI Usage While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it helps set the right context for reviewers. Did you use AI tools to help write this code? _**YES**_, Claude Code

Summary
Closes #743.
What is the goal of this PR?
characters to resolve issue Single-character prepositions merge with following words in some EPUB files #743What changes are included?
XML_SetDefaultHandlerExpandin expat to parse out the entities correctly, without needing them defined in DOCTYPE so that the text stays together and doesn't break onto a new line with text separated by an Additional Context
nbsp;as whitespace. Instead, we want that character to serve its true purpose and affect the line-breaking algorithm. characters examples at the end of the bookAI Usage
While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it
helps set the right context for reviewers.
Did you use AI tools to help write this code? YES, Claude Code