Skip to content

Remove HTML entity parsing#274

Merged
daveallie merged 1 commit intomasterfrom
chore/remove-html-entity-parsing
Jan 7, 2026
Merged

Remove HTML entity parsing#274
daveallie merged 1 commit intomasterfrom
chore/remove-html-entity-parsing

Conversation

@daveallie
Copy link
Member

Summary

  • Remove HTML entity parsing
    • This has been completely useless since the introduction of expat
    • expat tries to parse all entities in the document, but only knows of HTML ones
    • Parsing will never end with HTML entities in the text, so the additional step to parse them that we had went completely unused
    • We should figure out the best way to parse that content in the future, but for now remove that module as it generates a lot of heap allocations with its map and strings

@daveallie daveallie merged commit 2b12a65 into master Jan 7, 2026
1 check passed
@daveallie daveallie deleted the chore/remove-html-entity-parsing branch January 7, 2026 12:09
yingirene pushed a commit to yingirene/crosspoint-reader that referenced this pull request Jan 16, 2026
## Summary

* Remove HTML entity parsing
  * This has been completely useless since the introduction of expat
* expat tries to parse all entities in the document, but only knows of
HTML ones
* Parsing will never end with HTML entities in the text, so the
additional step to parse them that we had went completely unused
* We should figure out the best way to parse that content in the future,
but for now remove that module as it generates a lot of heap allocations
with its map and strings
jonasdiemer pushed a commit that referenced this pull request Feb 13, 2026
## Summary

Closes #743.

**What is the goal of this PR?**

- Add back handling for HTML entities in expat. This was originally part
of the code that got removed
[here](#274)
- Handle ` ` characters to resolve issue #743 

**What changes are included?**

- Brought back HTML entity table from previous commit and refactored it
to use a static const char * table with linear lookup to reduce heap
allocations.
- Used `XML_SetDefaultHandlerExpand` in expat to parse out the entities
correctly, without needing them defined in DOCTYPE
- Added handling for ` ` so that the text stays together and
doesn't break onto a new line with text separated by an ` `

## Additional Context

- This supersedes [this
PR](#751)
that simply handled `nbsp;` as whitespace. Instead, we want that
character to serve its true purpose and affect the line-breaking
algorithm.
- Updated my test EPUB [here](https://github.com/jdk2pq/css-test-epub)
with ` ` characters examples at the end of the book

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? _**YES**_, Claude Code
Unintendedsideeffects pushed a commit to Unintendedsideeffects/crosspoint-reader that referenced this pull request Feb 17, 2026
## Summary

* Remove HTML entity parsing
  * This has been completely useless since the introduction of expat
* expat tries to parse all entities in the document, but only knows of
HTML ones
* Parsing will never end with HTML entities in the text, so the
additional step to parse them that we had went completely unused
* We should figure out the best way to parse that content in the future,
but for now remove that module as it generates a lot of heap allocations
with its map and strings
Unintendedsideeffects pushed a commit to Unintendedsideeffects/crosspoint-reader that referenced this pull request Feb 17, 2026
…reader#757)

## Summary

Closes crosspoint-reader#743.

**What is the goal of this PR?**

- Add back handling for HTML entities in expat. This was originally part
of the code that got removed
[here](crosspoint-reader#274)
- Handle ` ` characters to resolve issue crosspoint-reader#743 

**What changes are included?**

- Brought back HTML entity table from previous commit and refactored it
to use a static const char * table with linear lookup to reduce heap
allocations.
- Used `XML_SetDefaultHandlerExpand` in expat to parse out the entities
correctly, without needing them defined in DOCTYPE
- Added handling for ` ` so that the text stays together and
doesn't break onto a new line with text separated by an ` `

## Additional Context

- This supersedes [this
PR](crosspoint-reader#751)
that simply handled `nbsp;` as whitespace. Instead, we want that
character to serve its true purpose and affect the line-breaking
algorithm.
- Updated my test EPUB [here](https://github.com/jdk2pq/css-test-epub)
with ` ` characters examples at the end of the book

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? _**YES**_, Claude Code
saslv pushed a commit to saslv/crosspoint-reader that referenced this pull request Feb 19, 2026
…reader#757)

## Summary

Closes crosspoint-reader#743.

**What is the goal of this PR?**

- Add back handling for HTML entities in expat. This was originally part
of the code that got removed
[here](crosspoint-reader#274)
- Handle ` ` characters to resolve issue crosspoint-reader#743 

**What changes are included?**

- Brought back HTML entity table from previous commit and refactored it
to use a static const char * table with linear lookup to reduce heap
allocations.
- Used `XML_SetDefaultHandlerExpand` in expat to parse out the entities
correctly, without needing them defined in DOCTYPE
- Added handling for ` ` so that the text stays together and
doesn't break onto a new line with text separated by an ` `

## Additional Context

- This supersedes [this
PR](crosspoint-reader#751)
that simply handled `nbsp;` as whitespace. Instead, we want that
character to serve its true purpose and affect the line-breaking
algorithm.
- Updated my test EPUB [here](https://github.com/jdk2pq/css-test-epub)
with ` ` characters examples at the end of the book

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? _**YES**_, Claude Code
el pushed a commit to el/crosspoint-reader that referenced this pull request Feb 19, 2026
…reader#757)

## Summary

Closes crosspoint-reader#743.

**What is the goal of this PR?**

- Add back handling for HTML entities in expat. This was originally part
of the code that got removed
[here](crosspoint-reader#274)
- Handle ` ` characters to resolve issue crosspoint-reader#743 

**What changes are included?**

- Brought back HTML entity table from previous commit and refactored it
to use a static const char * table with linear lookup to reduce heap
allocations.
- Used `XML_SetDefaultHandlerExpand` in expat to parse out the entities
correctly, without needing them defined in DOCTYPE
- Added handling for ` ` so that the text stays together and
doesn't break onto a new line with text separated by an ` `

## Additional Context

- This supersedes [this
PR](crosspoint-reader#751)
that simply handled `nbsp;` as whitespace. Instead, we want that
character to serve its true purpose and affect the line-breaking
algorithm.
- Updated my test EPUB [here](https://github.com/jdk2pq/css-test-epub)
with ` ` characters examples at the end of the book

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? _**YES**_, Claude Code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant