
fix: Account for nbsp; character as whitespace #751

Closed
jdk2pq wants to merge 2 commits into crosspoint-reader:master from jdk2pq:fix/account-for-nbsp-character-as-whitespace

Conversation

Contributor

@jdk2pq jdk2pq commented Feb 7, 2026

Summary

What is the goal of this PR?

  • Fixed an issue with nbsp; not being counted as whitespace, which caused words to incorrectly join together

What changes are included?

  • Added separate handling for nbsp; characters similar to how we already handle BOM characters
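A minimal sketch of the approach described above, assuming the text arrives as UTF-8 bytes. The function name `splitWords` is hypothetical and is not taken from the CrossPoint source. It treats U+00A0 (bytes `0xC2 0xA0`) as an ordinary word separator and drops a U+FEFF BOM (bytes `0xEF 0xBB 0xBF`) without ending the current word:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits UTF-8 text into words.
// ASCII whitespace and U+00A0 both end the current word;
// a U+FEFF BOM is skipped without ending the word.
std::vector<std::string> splitWords(const std::string& text) {
  std::vector<std::string> words;
  std::string current;
  auto flush = [&]() {
    if (!current.empty()) {
      words.push_back(current);
      current.clear();
    }
  };
  for (std::size_t i = 0; i < text.size();) {
    const unsigned char c = static_cast<unsigned char>(text[i]);
    if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
      flush();
      i += 1;
    } else if (c == 0xC2 && i + 1 < text.size() &&
               static_cast<unsigned char>(text[i + 1]) == 0xA0) {
      flush();  // NBSP treated like a regular space
      i += 2;
    } else if (c == 0xEF && i + 2 < text.size() &&
               static_cast<unsigned char>(text[i + 1]) == 0xBB &&
               static_cast<unsigned char>(text[i + 2]) == 0xBF) {
      i += 3;  // BOM: drop it, keep building the current word
    } else {
      current.push_back(text[i]);
      i += 1;
    }
  }
  flush();
  return words;
}
```

With this handling, a single-letter preposition followed by an NBSP becomes its own word instead of merging with the next one, at the cost of allowing a line break at that position.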

Additional Context


AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? YES, Claude Code

@jdk2pq jdk2pq added the reader (Related to the core reader experience) and language (Related to language or character set) labels Feb 7, 2026
Contributor Author

jdk2pq commented Feb 7, 2026

One thing worth considering before merging this is if this is the behavior we want for nbsp; characters, because I definitely also see a case for using these characters as intended and not breaking lines on them in the line-breaking algorithm. It wouldn't be too difficult of an addition to add this behavior, but the reason I didn't do this initially is because we're already so limited in screen real estate that it felt like the better tradeoff to just treat nbsp; as regular ol' whitespace. Happy to consider other opinions on this though, and if there's consensus on the other solution, I can close this and put up a PR for that instead.

Contributor

lukestein commented Feb 7, 2026

I'm also going to tag in @osteotek since he and I both typically have opinions about line breaking and hyphenation-related stuff ;)

My inclination is that a non-breaking space absolutely should be a non-breaking space. I know there are files that overuse them, but publishers are often (and should be) using them intentionally and it should be respected. And often when nbsp are used interword, the word on one side or the other is short so it doesn't screw up justification etc. too much, e.g.,

  • 7:00~p.m.
  • Marcus~Jr.
  • C.~S. Lewis
  • see p.~27
  • JPY~4,000
  • 65°~C
  • or apparently single-letter prepositions in Polish as in OP's issue

Member

osteotek commented Feb 7, 2026

I think this is not as simple as treating nbsp as whitespace. Nbsp should be treated as literally non-breaking whitespace: a space that delimits words but never breaks a line. We're not so limited in width that we need to break at them, as in the examples provided by @lukestein.
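To illustrate the behavior proposed here, the following is a rough sketch of a greedy line wrapper that breaks only at regular spaces and keeps U+00A0 inside its chunk. The names `columns` and `wrapLines` are hypothetical, and counting code points stands in for real glyph-width measurement; this is not the CrossPoint line-breaking code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Counts UTF-8 code points; a stand-in for measuring rendered width.
std::size_t columns(const std::string& s) {
  std::size_t n = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) ++n;  // skip UTF-8 continuation bytes
  }
  return n;
}

// Greedy wrap: break only at ASCII spaces. U+00A0 (0xC2 0xA0) stays inside
// its chunk, so the words on either side always share a line.
std::vector<std::string> wrapLines(const std::string& text, std::size_t maxColumns) {
  std::vector<std::string> chunks;
  std::string chunk;
  for (char c : text) {
    if (c == ' ') {
      if (!chunk.empty()) chunks.push_back(chunk);
      chunk.clear();
    } else {
      chunk.push_back(c);
    }
  }
  if (!chunk.empty()) chunks.push_back(chunk);

  std::vector<std::string> lines;
  std::string line;
  for (const std::string& piece : chunks) {
    if (line.empty()) {
      line = piece;  // an over-long chunk still gets its own line
    } else if (columns(line) + 1 + columns(piece) <= maxColumns) {
      line += ' ';
      line += piece;
    } else {
      lines.push_back(line);
      line = piece;
    }
  }
  if (!line.empty()) lines.push_back(line);
  return lines;
}
```

In this sketch the NBSP still occupies one column, so it separates the words visually, but the wrapper never sees it as a break opportunity. A renderer would draw it with the same advance as a regular space.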

Contributor Author

jdk2pq commented Feb 7, 2026

@lukestein @osteotek Makes sense to me! I'll close this and open a separate PR for that shortly. Thanks for the feedback!

@jdk2pq jdk2pq closed this Feb 7, 2026
jonasdiemer pushed a commit that referenced this pull request Feb 13, 2026
## Summary

Closes #743.

**What is the goal of this PR?**

- Add back handling for HTML entities in expat. This was originally part
of the code that got removed
[here](#274)
- Handle `&nbsp;` characters to resolve issue #743

**What changes are included?**

- Brought back HTML entity table from previous commit and refactored it
to use a static const char * table with linear lookup to reduce heap
allocations.
- Used `XML_SetDefaultHandlerExpand` in expat to parse out the entities
correctly, without needing them defined in DOCTYPE
- Added handling for `&nbsp;` so that text joined by an `&nbsp;` stays
together and doesn't break onto a new line
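As an illustration of the table approach in the first bullet, here is a minimal sketch of a static entity table with linear lookup. The names `EntityEntry`, `kEntities`, and `lookupEntity` and the choice of entries are assumptions for illustration, not the actual table from the commit, and the wiring into the handler registered via `XML_SetDefaultHandlerExpand` is not shown.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical table entry: entity text and its UTF-8 replacement.
struct EntityEntry {
  const char* name;  // entity as it appears in the text, e.g. "&nbsp;"
  const char* utf8;  // UTF-8 bytes to substitute
};

// Static const data: no heap allocation to build or query the table.
static const EntityEntry kEntities[] = {
    {"&nbsp;", "\xC2\xA0"},         // U+00A0 no-break space
    {"&mdash;", "\xE2\x80\x94"},    // U+2014 em dash
    {"&hellip;", "\xE2\x80\xA6"},   // U+2026 horizontal ellipsis
    {"&copy;", "\xC2\xA9"},         // U+00A9 copyright sign
};

// Linear scan over the table. `s` need not be NUL-terminated;
// `len` is the number of bytes to compare.
// Returns nullptr when the entity is unknown.
const char* lookupEntity(const char* s, std::size_t len) {
  for (const EntityEntry& e : kEntities) {
    if (std::strlen(e.name) == len && std::memcmp(e.name, s, len) == 0) {
      return e.utf8;
    }
  }
  return nullptr;
}
```

A linear scan is slower than a hash map for large tables, but for a table of this kind it avoids per-entry allocations, which is the tradeoff the bullet describes.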

## Additional Context

- This supersedes [this
PR](#751)
that simply handled `nbsp;` as whitespace. Instead, we want that
character to serve its true purpose and affect the line-breaking
algorithm.
- Updated my test EPUB [here](https://github.com/jdk2pq/css-test-epub)
with `&nbsp;` character examples at the end of the book

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? _**YES**_, Claude Code
Unintendedsideeffects pushed a commit to Unintendedsideeffects/crosspoint-reader that referenced this pull request Feb 17, 2026
…reader#757)

saslv pushed a commit to saslv/crosspoint-reader that referenced this pull request Feb 19, 2026
…reader#757)

el pushed a commit to el/crosspoint-reader that referenced this pull request Feb 19, 2026
…reader#757)



Development

Successfully merging this pull request may close these issues.

Single-character prepositions merge with following words in some EPUB files

3 participants