feat: dict based Hyphenation #305

Merged
daveallie merged 83 commits into crosspoint-reader:master from osteotek:hyphenation-v3
Jan 19, 2026

@osteotek
Member

@osteotek osteotek commented Jan 9, 2026

Summary

  • Adds (optional) Hyphenation for English, French, German, Russian languages

Additional Context

- Added EnglishHyphenator and RussianHyphenator classes to handle language-specific hyphenation rules.
- Introduced HyphenationCommon for shared utilities and character classification functions.
- Updated ParsedText to utilize hyphenation when laying out text.
- Enhanced the hyphenation logic to consider word splitting based on available width and character properties.
- Refactored existing code to improve readability and maintainability, including the use of iterators and lambda functions for line processing.
- Added necessary includes and organized header files for better structure.
- Introduced hyphenationEnabled flag in ParsedText and Section classes.
- Updated constructors and methods to handle hyphenation settings.
- Modified settings file versioning to include hyphenationEnabled.
- Enhanced settings UI to allow toggling of hyphenation feature.
…ed performance in word splitting and line breaking
…h calculations for improved clarity and performance
…te logic to remove surrounding punctuation; add explicit hyphen handling in breakOffsets function.
@jlaunay

jlaunay commented Jan 15, 2026

@osteotek Same here, I tested it with French like @lukestein did with English and everything seems to work perfectly. In my opinion, it's an essential feature for proper epub rendering especially on a small screen. Thanks for your work.

@lukestein
Contributor

CI seems to want the PR title renamed from "feature: " to "feat: "

Available types:
 - feat: A new feature
 - fix: A bug fix
 - docs: Documentation only changes
 - style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
 - refactor: A code change that neither fixes a bug nor adds a feature
 - perf: A code change that improves performance
 - test: Adding missing tests or correcting existing tests
 - build: Changes that affect the build system or external dependencies (example scopes: gulp, broccoli, npm)
 - ci: Changes to our CI configuration files and scripts (example scopes: Travis, Circle, BrowserStack, SauceLabs)
 - chore: Other changes that don't modify src or test files
 - revert: Reverts a previous commit

@osteotek osteotek changed the title feature: dict based Hyphenation feat: dict based Hyphenation Jan 15, 2026
@lukestein
Contributor

Sorry to be repeatedly thanking @osteotek, but much appreciate that you keep merging the master commits into your branch. Makes using/testing your feature easy.

@fischer-hub
Contributor

> @osteotek Same here, I tested it with French like @lukestein did with English and everything seems to work perfectly. In my opinion, it's an essential feature for proper epub rendering especially on a small screen. Thanks for your work.

German also looked good to me while testing over the last days, great work really!

@lukestein
Contributor

@osteotek, I don't know if this is a reversion or something specific about the epub I'm reading now, but with the latest commits there may be a problem with failure to wrap after dashes:

image

@osteotek
Member Author

osteotek commented Jan 16, 2026

@lukestein yes, there was a change in recent commits on how we define explicit hyphen. We need to decide if we want to break on such dashes. Technically rules of hyphenation in English do not allow that, but it might make sense on such a small screen

@lukestein
Contributor

> @lukestein yes, there was a change in recent commits on how we define explicit hyphen. We need to decide if we want to break on such dashes. Technically rules of hyphenation in English do not allow that, but it might make sense on such a small screen

Continue to appreciate your work on this!

My screenshot here shows em dashes (—) and my understanding is that all major English style guides suggest allowing line breaks after an em (Chicago, APA, MLA, AP, Oxford, and NY Times; some of these also allow breaking before an em but I wouldn't recommend it).

I also think that especially on small screens, this is really important.

I'm less opinionated about en dashes. Style guides generally discourage breaking after these.

For explicit hyphens, I just know your earlier solution fixed #217 and I want to make sure the final code still does.

Once again, many thanks.

@lukestein
Contributor

lukestein commented Jan 16, 2026

@osteotek, with caveats that I don't entirely understand the code, I chatted a bit with AI about it (and my understanding of typography), where it eventually suggested this to me. You're the expert, but just in case useful:

Looking at the C++ diff in your image, it appears the developer removed several Unicode codepoints from the isExplicitHyphen function. By removing them, the engine no longer recognizes these characters as valid "hard" break points, which explains why your em dash wrapping recently broke.

To restore the behavior you want—where the engine breaks lines after these characters—you should selectively add specific cases back.

Based on the typographic standards we discussed, you should restore the following:

  • case 0x2014: (Em Dash): This is the most critical. Restoring this will allow the engine to break a line immediately following an em dash, preventing the "word—word" strings from being treated as a single unbreakable block.
  • case 0x2013: (En Dash): While style guides suggest avoiding breaks in simple ranges, an en dash is an explicit hyphenation point in complex compound adjectives (e.g., "pre–World War II"). Including it provides the engine with a necessary fallback for long strings.
  • case 0x2E3A: and case 0x2E3B: (Two-em and Three-em Dashes): These are used in bibliography work or to indicate missing text. If they aren't in the explicit hyphen list, the engine will likely overflow the line or create massive gaps because these characters are quite wide.
  • case 0x2012: (Figure Dash): This is a dash the width of a digit. Like the en dash, it is often used in phone numbers or numerical data where a break point might be needed on very narrow screens.
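The suggestion above can be sketched as a switch-based predicate. This is an illustration only: the function signature and the first three cases (hyphen-minus, soft hyphen, hyphen) are assumptions, and the list mirrors the suggestion rather than the code that was eventually merged.

```cpp
// Hypothetical sketch of an isExplicitHyphen-style predicate.
// Returns true when a line break may directly follow the code point.
bool isExplicitHyphen(char32_t cp) {
  switch (cp) {
    case 0x002D:  // Hyphen-minus (assumed to be present already)
    case 0x00AD:  // Soft hyphen (assumed to be present already)
    case 0x2010:  // Hyphen (assumed to be present already)
    case 0x2012:  // Figure dash
    case 0x2013:  // En dash
    case 0x2014:  // Em dash
    case 0x2E3A:  // Two-em dash
    case 0x2E3B:  // Three-em dash
      return true;
    default:
      return false;
  }
}
```

A switch over code points keeps the check branch-cheap and easy to extend, which fits the case-by-case discussion in this thread.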

@osteotek
Member Author

@lukestein agreed, restored breaking after en/em dashes. Should work the same as before

@lukestein
Contributor

> @lukestein agreed, restored breaking after en/em dashes. Should work the same as before

Looking great with this commit 🙌 Will keep testing (i.e., reading books)

IMG_2894

@lukestein
Contributor

lukestein commented Jan 18, 2026

Hi @osteotek.

Personally, I think from a user experience perspective the current PR could/should be merged as-is (pending @daveallie's review ofc) and then there is room for potential refinements later. Thank you!

However, one strong suggestion based on continued testing is to please add the ellipsis character (…, U+2026) to the isExplicitHyphen list.

Other likely good candidates for isExplicitHyphen (although these haven't actually come up in my reading) informed by conversations with LLM:

Character | Unicode | Usage | Why include it?
-- | -- | -- | --
Solidus (Slash) | U+002F | / | Essential for breaking URLs and "and/or" constructs.
Backslash | U+005C | \ | Critical for technical text, file paths, and coding documentation.
Underscore | U+005F | _ | Prevents "runaway" line lengths in usernames or code snippets.
Middle Dot | U+00B7 | · | Acts as a semantic separator in dictionaries or stylistic lists.
Ellipsis | U+2026 | … | Prevents justification failure when dialogue lacks following spaces.
Midline Horizontal Ellipsis | U+22EF | ⋯ | Useful for mathematical sequences and technical notation.

    case 0x002F:  // Solidus (slash)
    case 0x005C:  // Backslash
    case 0x005F:  // Underscore
    case 0x00B7:  // Middle dot
    case 0x2026:  // Ellipsis
    case 0x22EF:  // Midline horizontal ellipsis
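
To show how such cases could turn into break points, here is a small sketch. The names `allowsBreakAfter` and `explicitBreakOffsets` are hypothetical, and the predicate covers only the six code points proposed above; the project's own `breakOffsets` logic may differ.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical predicate limited to the six proposed code points.
bool allowsBreakAfter(char32_t cp) {
  switch (cp) {
    case 0x002F:  // Solidus (slash)
    case 0x005C:  // Backslash
    case 0x005F:  // Underscore
    case 0x00B7:  // Middle dot
    case 0x2026:  // Ellipsis
    case 0x22EF:  // Midline horizontal ellipsis
      return true;
    default:
      return false;
  }
}

// Returns offsets (in code points) where the word may be split.
// Offset i means "break between word[i - 1] and word[i]". A matching
// character at the very end of the word yields no offset, since there
// is nothing left to move to the next line.
std::vector<std::size_t> explicitBreakOffsets(const std::u32string& word) {
  std::vector<std::size_t> offsets;
  for (std::size_t i = 0; i + 1 < word.size(); ++i) {
    if (allowsBreakAfter(word[i])) offsets.push_back(i + 1);
  }
  return offsets;
}
```

For "and/or" this yields the single offset 4, so the line may end with "and/" and continue with "or".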

daveallie
daveallie previously approved these changes Jan 19, 2026
Member

@daveallie daveallie left a comment

Incredible work here @osteotek!

Thank you so much for adding such a widely requested community feature. If you're interested, I've also invited you to join the organisation.

@daveallie daveallie enabled auto-merge (squash) January 19, 2026 12:54
@daveallie daveallie merged commit 8824c87 into crosspoint-reader:master Jan 19, 2026
1 check passed
jdk2pq added a commit to jdk2pq/crosspoint-reader that referenced this pull request Jan 20, 2026
* origin:
  fix: truncate chapter names that are too long (crosspoint-reader#422)
  feat: dict based Hyphenation (crosspoint-reader#305)
  fix: render U+FFFD replacement character instead of ? (crosspoint-reader#366)
  fix: Invert colors on home screen cover overlay when recent book is selected (crosspoint-reader#390)
  Adds KOReader Sync support (crosspoint-reader#232)
  feat: Change keyboard "caps" to "shift" & Wrap Keyboard (crosspoint-reader#377)
  fix: XTC 1-bit thumb BMP polarity inversion (crosspoint-reader#373)
yingirene pushed a commit to yingirene/crosspoint-reader that referenced this pull request Jan 25, 2026
* Adds (optional) Hyphenation for English, French, German, Russian
languages

* Included hyphenation dictionaries add approximately 280kb to the flash
usage (German alone takes 200kb)
* Trie encoded dictionaries are adopted from hypher project
(https://github.com/typst/hypher)
* Soft hyphens (and other explicit hyphens) take precedence over
dict-based hyphenation. Overall, the hyphenation rules are quite
aggressive, as I believe it makes more sense on our smaller screen.
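
A rough sketch of the precedence rule described in the last bullet, under stated assumptions: `explicitOffsets`, `hyphenate`, and the `dictLookup` callback are hypothetical names, and the callback merely stands in for the trie-based dictionary lookup.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using Offsets = std::vector<std::size_t>;

// Collects break offsets after soft hyphens (U+00AD) and hyphen-minus
// (U+002D). Only these two characters are checked, to keep the sketch short.
Offsets explicitOffsets(const std::u32string& word) {
  Offsets out;
  for (std::size_t i = 0; i + 1 < word.size(); ++i) {
    if (word[i] == 0x00AD || word[i] == 0x002D) out.push_back(i + 1);
  }
  return out;
}

// Explicit break points win; the dictionary is consulted only when the
// word carries none of its own.
Offsets hyphenate(const std::u32string& word,
                  const std::function<Offsets(const std::u32string&)>& dictLookup) {
  Offsets fromText = explicitOffsets(word);
  if (!fromText.empty()) return fromText;
  return dictLookup(word);
}
```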

---------

Co-authored-by: Dave Allie <[email protected]>
daveallie pushed a commit that referenced this pull request Jan 27, 2026
## Summary

* Add additional punctuation marks to the list of characters that can be
immediately followed by a line break even where there is no explicit
space

## Additional Context

* Huge appreciation to @osteotek for his amazing work on hyphenation.
Reading on the device is so much better now.
* I am getting bad line breaks when ellipses (…) are between words and
book file does not explicitly include some kind of breaking space.
* Per
[discussion](#305 (comment)),
several new characters are added in this PR to the `isExplicitHyphen`
list to allow line breaks immediately after them:

Character | Unicode | Usage | Why include it?
-- | -- | -- | --
Solidus (Slash) | U+002F | / | Essential for breaking URLs and "and/or" constructs.
Backslash | U+005C | \ | Critical for technical text, file paths, and coding documentation.
Underscore | U+005F | _ | Prevents "runaway" line lengths in usernames or code snippets.
Middle Dot | U+00B7 | · | Acts as a semantic separator in dictionaries or stylistic lists.
Ellipsis | U+2026 | … | Prevents justification failure when dialogue lacks following spaces.
Midline Horizontal Ellipsis | U+22EF | ⋯ | Useful for mathematical sequences and technical notation.


### Example:

This shows an example of what line breaking looks like *with* this PR.
Note the line break after "matter…" (which would not previously have
been allowed). It's particularly important here because the book
includes non-breaking spaces in "Mr. Aldrich" and "Mr. Rockefeller."


![IMG_2917](https://github.com/user-attachments/assets/8fa610a9-91dd-407f-8526-0019a8a7195f)

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? **PARTIALLY**
Jessica765 pushed a commit to Jessica765/crosspoint-reader that referenced this pull request Feb 3, 2026
Unintendedsideeffects pushed a commit to Unintendedsideeffects/crosspoint-reader that referenced this pull request Feb 17, 2026
Unintendedsideeffects pushed a commit to Unintendedsideeffects/crosspoint-reader that referenced this pull request Feb 17, 2026
7 participants