fix: Account for nbsp; character as whitespace #751
jdk2pq wants to merge 2 commits into crosspoint-reader:master
- Fixed an issue with `nbsp;` not being counted as whitespace, which caused words to incorrectly join together
One thing worth considering before merging this is whether this is the behavior we want for
I'm also going to tag in @osteotek since he and I both typically have opinions about line breaking and hyphenation-related stuff ;) My inclination is that a non-breaking space absolutely should be a non-breaking space. I know there are files that overuse them, but publishers are often (and should be) using them intentionally and it should be respected. And often when nbsp are used interword, the word on one side or the other is short so it doesn't screw up justification etc. too much, e.g.,
I think this is not as simple as treating nbsp as whitespace. Nbsp should be treated as literally non-breaking whitespace: a space that delimits words but does not allow a line break. We're not so limited in width that we need to break at them, as in the examples provided by @lukestein.
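To make the distinction above concrete, here is a minimal sketch (not CrossPoint's actual code) of a splitter in which ordinary whitespace offers a line break while U+00A0 is drawn as a space but keeps its neighbours on the same line. The names `splitUnbreakableUnits` and `wrapLines` are hypothetical, and width is counted in bytes only to keep the example short; a real renderer would measure glyph widths.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split UTF-8 text into units that must stay on one line.
// ASCII whitespace ends a unit (a break opportunity). U+00A0 (bytes 0xC2 0xA0)
// becomes a plain space inside the unit, so it shows as a gap but never
// offers a break.
std::vector<std::string> splitUnbreakableUnits(const std::string& text) {
  std::vector<std::string> units;
  std::string current;
  for (std::size_t i = 0; i < text.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(text[i]);
    if (c == 0xC2 && i + 1 < text.size() &&
        static_cast<unsigned char>(text[i + 1]) == 0xA0) {
      current += ' ';  // visible space, no break opportunity
      ++i;             // skip the second byte of the NBSP sequence
    } else if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
      if (!current.empty()) {
        units.push_back(current);
        current.clear();
      }
    } else {
      current += text[i];
    }
  }
  if (!current.empty()) units.push_back(current);
  return units;
}

// Greedy wrap over the units. maxWidth is in bytes (an assumption made for
// brevity); a unit longer than maxWidth simply gets its own line.
std::vector<std::string> wrapLines(const std::string& text, std::size_t maxWidth) {
  std::vector<std::string> lines;
  std::string line;
  for (const std::string& unit : splitUnbreakableUnits(text)) {
    if (!line.empty() && line.size() + 1 + unit.size() > maxWidth) {
      lines.push_back(line);
      line.clear();
    }
    if (!line.empty()) line += ' ';
    line += unit;
  }
  if (!line.empty()) lines.push_back(line);
  return lines;
}
```

With this approach, a phrase joined by nbsp moves to the next line as a whole when it does not fit, instead of being split at the nbsp.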
@lukestein @osteotek Makes sense to me! I'll close this and open a separate PR for that shortly. Thanks for the feedback!
## Summary

Closes #743.

**What is the goal of this PR?**

- Add back handling for HTML entities in expat. This was originally part of the code that got removed [here](#274)
- Handle `&nbsp;` characters to resolve issue #743

**What changes are included?**

- Brought back HTML entity table from previous commit and refactored it to use a static `const char *` table with linear lookup to reduce heap allocations.
- Used `XML_SetDefaultHandlerExpand` in expat to parse out the entities correctly, without needing them defined in DOCTYPE
- Added handling for `&nbsp;` so that the text stays together and doesn't break onto a new line with text separated by an `&nbsp;`

## Additional Context

- This supersedes [this PR](#751) that simply handled `nbsp;` as whitespace. Instead, we want that character to serve its true purpose and affect the line-breaking algorithm.
- Updated my test EPUB [here](https://github.com/jdk2pq/css-test-epub) with `&nbsp;` character examples at the end of the book

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it helps set the right context for reviewers.

Did you use AI tools to help write this code? _**YES**_, Claude Code
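The "static const char * table with linear lookup" mentioned above could look roughly like the sketch below. This is an illustration under assumptions, not the table from the PR: `EntityEntry`, `kEntities`, and `lookupEntity` are hypothetical names, and only four entities are listed. The lookup takes a pointer and a length because that is the shape in which an expat default handler receives unparsed text.

```cpp
#include <cstddef>
#include <cstring>

struct EntityEntry {
  const char* name;  // entity name without '&' and ';'
  const char* utf8;  // replacement text encoded as UTF-8
};

// Small illustrative subset. The array is static and const, so it lives in
// read-only storage and needs no heap allocation.
static const EntityEntry kEntities[] = {
    {"nbsp", "\xC2\xA0"},        // U+00A0 NO-BREAK SPACE
    {"mdash", "\xE2\x80\x94"},   // U+2014 EM DASH
    {"hellip", "\xE2\x80\xA6"},  // U+2026 HORIZONTAL ELLIPSIS
    {"copy", "\xC2\xA9"},        // U+00A9 COPYRIGHT SIGN
};

// Linear lookup of a reference such as "&nbsp;" given as (pointer, length).
// Returns the UTF-8 replacement, or nullptr if the text is not a known entity.
const char* lookupEntity(const char* s, std::size_t len) {
  if (len < 3 || s[0] != '&' || s[len - 1] != ';') return nullptr;
  const char* name = s + 1;
  const std::size_t nameLen = len - 2;
  for (const EntityEntry& e : kEntities) {
    if (std::strlen(e.name) == nameLen &&
        std::memcmp(e.name, name, nameLen) == 0) {
      return e.utf8;
    }
  }
  return nullptr;
}
```

A linear scan is slower per lookup than a hash map, but entity references are rare in running text, and the fixed table avoids the heap allocations a map would need.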
## Summary

**What is the goal of this PR?**

- Fixed an issue with `nbsp;` not being counted as whitespace, which caused words to incorrectly join together

**What changes are included?**

- `nbsp;` characters similar to how we already handle BOM characters

## Additional Context

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it helps set the right context for reviewers.

Did you use AI tools to help write this code? _**YES**_, Claude Code