feat: dict based Hyphenation by osteotek · Pull Request #305 · crosspoint-reader/crosspoint-reader

osteotek · 2026-01-09T18:34:46Z

Summary

Adds optional hyphenation for English, French, German, and Russian.

Additional Context

The included hyphenation dictionaries add approximately 280 KB to flash usage (German alone takes 200 KB).
Trie-encoded dictionaries are adopted from the hypher project (https://github.com/typst/hypher).
Soft hyphens (and other explicit hyphens) take precedence over dict-based hyphenation. Overall, the hyphenation rules are quite aggressive, as I believe it makes more sense on our smaller screen.
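To illustrate the precedence rule, here is a minimal sketch, not the code from this PR. It assumes Liang-style patterns (the scheme hypher's dictionaries are built on) and swaps the trie for a plain hash map to stay short. The names `PatternTable`, `dictionaryBreaks`, and `chooseBreakOffsets` are hypothetical, and the pattern in the usage note is a toy entry rather than real dictionary data.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical table: key = pattern letters ('.' marks a word edge),
// value = one level per gap, so letters.size() + 1 entries.
// A hash map stands in for the trie used by the real dictionaries.
using PatternTable = std::unordered_map<std::u32string, std::vector<int>>;

constexpr char32_t kSoftHyphen = 0x00AD;

// Liang-style lookup: every matching pattern raises the level of the gaps
// it covers; gaps that end up with an odd level are allowed break points.
std::vector<std::size_t> dictionaryBreaks(const std::u32string& word,
                                          const PatternTable& patterns) {
  const std::u32string padded = U"." + word + U".";
  std::vector<int> levels(padded.size() + 1, 0);
  for (std::size_t start = 0; start < padded.size(); ++start) {
    for (std::size_t len = 1; start + len <= padded.size(); ++len) {
      const auto it = patterns.find(padded.substr(start, len));
      if (it == patterns.end()) continue;
      for (std::size_t k = 0; k <= len && k < it->second.size(); ++k) {
        levels[start + k] = std::max(levels[start + k], it->second[k]);
      }
    }
  }
  std::vector<std::size_t> breaks;
  // Offset p splits the word into word[0..p) and word[p..);
  // that is gap p + 1 of the padded string.
  for (std::size_t p = 1; p < word.size(); ++p) {
    if (levels[p + 1] % 2 == 1) breaks.push_back(p);
  }
  return breaks;
}

// Soft hyphens already present in the word win; the dictionary is only
// consulted when the word carries no explicit break hints.
std::vector<std::size_t> chooseBreakOffsets(const std::u32string& word,
                                            const PatternTable& patterns) {
  std::vector<std::size_t> explicitBreaks;
  for (std::size_t i = 1; i + 1 < word.size(); ++i) {
    if (word[i] == kSoftHyphen) explicitBreaks.push_back(i + 1);
  }
  if (!explicitBreaks.empty()) return explicitBreaks;
  return dictionaryBreaks(word, patterns);
}
```

With a toy table containing only the pattern `hy3ph` (letters `hyph`, levels `{0, 0, 3, 0, 0}`), `chooseBreakOffsets(U"hyphen", table)` yields the single offset 2, i.e. "hy" + "phen". If the same word carries a soft hyphen, only the soft-hyphen position is returned.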

- Added EnglishHyphenator and RussianHyphenator classes to handle language-specific hyphenation rules.
- Introduced HyphenationCommon for shared utilities and character classification functions.
- Updated ParsedText to utilize hyphenation when laying out text.
- Enhanced the hyphenation logic to consider word splitting based on available width and character properties.
- Refactored existing code to improve readability and maintainability, including the use of iterators and lambda functions for line processing.
- Added necessary includes and organized header files for better structure.
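The "available width" step can be pictured as follows. This is a hedged sketch, not the PR's implementation. `bestBreakForWidth` and its parameters are hypothetical, and text measurement is abstracted behind a callback so the example stays independent of any renderer.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: among the candidate break offsets of a word, return
// the largest one whose prefix plus a trailing hyphen still fits in the
// remaining line width. Returns 0 when no candidate fits (offset 0 is never
// a valid break, so it doubles as "keep the word whole").
std::size_t bestBreakForWidth(const std::vector<std::size_t>& candidates,
                              const std::function<int(std::size_t)>& prefixWidth,
                              int hyphenWidth, int availableWidth) {
  std::size_t best = 0;
  for (const std::size_t offset : candidates) {
    const bool fits = prefixWidth(offset) + hyphenWidth <= availableWidth;
    if (fits && offset > best) best = offset;
  }
  return best;
}
```

For example, with candidates `{2, 4, 6}`, a measurement callback returning 10 pixels per character, a 5-pixel hyphen, and 47 pixels left on the line, the function picks offset 4 (45 pixels), because offset 6 would need 65.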


- Introduced hyphenationEnabled flag in ParsedText and Section classes.
- Updated constructors and methods to handle hyphenation settings.
- Modified settings file versioning to include hyphenationEnabled.
- Enhanced settings UI to allow toggling of hyphenation feature.
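One common way to add a flag to a versioned settings file without breaking older files is sketched below. The byte layout, the `ReaderSettings` struct, and `parseSettings` are all hypothetical and are not taken from CrossPointSettings. The point is only that older files fall back to a default for the new field.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout, for illustration only:
//   byte 0: format version
//   byte 1: font size
//   byte 2: hyphenationEnabled (present from version 2 onward)
struct ReaderSettings {
  std::uint8_t fontSize = 0;
  bool hyphenationEnabled = false;  // assumed default for pre-flag files
};

constexpr std::uint8_t kCurrentSettingsVersion = 2;

// Parses a settings blob; returns false if the blob is truncated or was
// written by a newer, unknown format version.
bool parseSettings(const std::vector<std::uint8_t>& bytes, ReaderSettings& out) {
  if (bytes.size() < 2) return false;
  const std::uint8_t version = bytes[0];
  if (version == 0 || version > kCurrentSettingsVersion) return false;

  ReaderSettings parsed;
  parsed.fontSize = bytes[1];
  if (version >= 2) {
    if (bytes.size() < 3) return false;
    parsed.hyphenationEnabled = bytes[2] != 0;
  }
  out = parsed;
  return true;
}
```

A version 1 blob such as `{1, 12}` loads with `hyphenationEnabled` left at its default, while `{2, 12, 1}` turns the feature on.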


jlaunay · 2026-01-15T15:23:28Z

@osteotek Same here, I tested it with French like @lukestein did with English and everything seems to work perfectly. In my opinion, it's an essential feature for proper epub rendering especially on a small screen. Thanks for your work.

lukestein · 2026-01-15T15:33:38Z

CI seems to want the PR title renamed from "feature: " to "feat: "

Available types:
 - feat: A new feature
 - fix: A bug fix
 - docs: Documentation only changes
 - style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
 - refactor: A code change that neither fixes a bug nor adds a feature
 - perf: A code change that improves performance
 - test: Adding missing tests or correcting existing tests
 - build: Changes that affect the build system or external dependencies (example scopes: gulp, broccoli, npm)
 - ci: Changes to our CI configuration files and scripts (example scopes: Travis, Circle, BrowserStack, SauceLabs)
 - chore: Other changes that don't modify src or test files
 - revert: Reverts a previous commit


lukestein · 2026-01-15T17:33:32Z

Sorry to be repeatedly thanking @osteotek, but much appreciate that you keep merging the master commits into your branch. Makes using/testing your feature easy.

fischer-hub · 2026-01-16T09:59:51Z

> @osteotek Same here, I tested it with French like @lukestein did with English and everything seems to work perfectly. In my opinion, it's an essential feature for proper epub rendering especially on a small screen. Thanks for your work.

German also looked good to me while testing over the last days, great work really!

lukestein · 2026-01-16T12:08:47Z

@osteotek, I don't know if this is a reversion or something specific about the epub I'm reading now, but with the latest commits there may be a problem with failure to wrap after dashes:

osteotek · 2026-01-16T12:19:00Z

@lukestein yes, there was a change in recent commits on how we define explicit hyphen. We need to decide if we want to break on such dashes. Technically rules of hyphenation in English do not allow that, but it might make sense on such a small screen

lukestein · 2026-01-16T13:04:00Z

> @lukestein yes, there was a change in recent commits on how we define explicit hyphen. We need to decide if we want to break on such dashes. Technically rules of hyphenation in English do not allow that, but it might make sense on such a small screen

Continue to appreciate your work on this!

My screenshot here shows em dashes (—) and my understanding is that all major English style guides suggest allowing line breaks after an em (Chicago, APA, MLA, AP, Oxford, and NY Times; some of these also allow breaking before an em but I wouldn't recommend it).

I also think that especially on small screens, this is really important.

I'm less opinionated about en dashes. Style guides generally discourage breaking after these.

For explicit hyphens, I just know your earlier solution fixed #217 and I want to make sure the final code still does.

Once again, many thanks.

lukestein · 2026-01-16T14:31:41Z

@osteotek, with caveats that I don't entirely understand the code, I chatted a bit with AI about it (and my understanding of typography), where it eventually suggested this to me. You're the expert, but just in case useful:

Looking at the C++ diff in your image, it appears the developer removed several Unicode codepoints from the isExplicitHyphen function. By removing them, the engine no longer recognizes these characters as valid "hard" break points, which explains why your em dash wrapping recently broke.

To restore the behavior you want—where the engine breaks lines after these characters—you should selectively add specific cases back.

Based on the typographic standards we discussed, you should restore the following:

case 0x2014: (Em Dash): This is the most critical. Restoring this will allow the engine to break a line immediately following an em dash, preventing the "word—word" strings from being treated as a single unbreakable block.

case 0x2013: (En Dash): While style guides suggest avoiding breaks in simple ranges, an en dash is an explicit hyphenation point in complex compound adjectives (e.g., "pre–World War II"). Including it provides the engine with a necessary fallback for long strings.

case 0x2E3A: and case 0x2E3B: (Two-em and Three-em Dashes): These are used in bibliography work or to indicate missing text. If they aren't in the explicit hyphen list, the engine will likely overflow the line or create massive gaps because these characters are quite wide.

case 0x2012: (Figure Dash): This is a dash the width of a digit. Like the en dash, it is often used in phone numbers or numerical data where a break point might be needed on very narrow screens.
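Put together, the suggested cases could look like the sketch below. The function name comes from the thread, but the signature and the full case list are assumptions. In particular, the hyphen-minus and soft hyphen entries are assumed to be handled already and are included only so the example is complete.

```cpp
#include <cstdint>

// Sketch only: the real isExplicitHyphen in the PR may take a different
// parameter type and cover a different set of code points.
bool isExplicitHyphen(std::uint32_t codePoint) {
  switch (codePoint) {
    case 0x002D:  // Hyphen-minus (assumed to be handled already)
    case 0x00AD:  // Soft hyphen (assumed to be handled already)
    case 0x2012:  // Figure dash
    case 0x2013:  // En dash
    case 0x2014:  // Em dash
    case 0x2E3A:  // Two-em dash
    case 0x2E3B:  // Three-em dash
      return true;
    default:
      return false;
  }
}
```

With this predicate, a token such as "word—word" exposes a break opportunity right after the em dash instead of being treated as one unbreakable block.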

osteotek · 2026-01-16T14:53:42Z

@lukestein agreed, restored breaking after en/em dashes. Should work the same as before

lukestein · 2026-01-16T15:28:31Z

> @lukestein agreed, restored breaking after en/em dashes. Should work the same as before

Looking great with this commit 🙌 Will keep testing (i.e., reading books)

lukestein · 2026-01-18T15:30:59Z

Hi @osteotek.

Personally, I think from a user experience perspective the current PR could/should be merged as-is (pending @daveallie's review ofc) and then there is room for potential refinements later. Thank you!

However, one strong suggestion based on continued testing is to please add the ellipsis character (…, U+2026) to the isExplicitHyphen list.

Other likely good candidates for isExplicitHyphen (although these haven't actually come up in my reading), informed by conversations with an LLM:

Character | Unicode | Usage | Why include it?
-- | -- | -- | --
Solidus (Slash) | U+002F | / | Essential for breaking URLs and "and/or" constructs.
Backslash | U+005C | \ | Critical for technical text, file paths, and coding documentation.
Underscore | U+005F | _ | Prevents "runaway" line lengths in usernames or code snippets.
Middle Dot | U+00B7 | · | Acts as a semantic separator in dictionaries or stylistic lists.
Ellipsis | U+2026 | … | Prevents justification failure when dialogue lacks following spaces.
Midline Horizontal Ellipsis | U+22EF | ⋯ | Useful for mathematical sequences and technical notation.

    case 0x002F:  // Solidus (slash)
    case 0x005C:  // Backslash
    case 0x005F:  // Underscore
    case 0x00B7:  // Middle dot
    case 0x2026:  // Ellipsis
    case 0x22EF:  // Midline horizontal ellipsis
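As a rough illustration of what these cases buy, the sketch below scans a token and reports every offset that directly follows one of the proposed characters. Both function names are hypothetical, and the real code may fold these cases into an existing predicate instead of a separate one.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical predicate wrapping the six proposed cases.
bool allowsBreakAfter(char32_t c) {
  switch (c) {
    case 0x002F:  // Solidus (slash)
    case 0x005C:  // Backslash
    case 0x005F:  // Underscore
    case 0x00B7:  // Middle dot
    case 0x2026:  // Ellipsis
    case 0x22EF:  // Midline horizontal ellipsis
      return true;
    default:
      return false;
  }
}

// Returns the offsets (in code points) at which the token may be split,
// i.e. the position right after each matching character. A match in the
// last position is skipped because nothing follows it.
std::vector<std::size_t> punctuationBreakOffsets(const std::u32string& token) {
  std::vector<std::size_t> offsets;
  for (std::size_t i = 0; i + 1 < token.size(); ++i) {
    if (allowsBreakAfter(token[i])) offsets.push_back(i + 1);
  }
  return offsets;
}
```

For "matter…of" this yields offset 7, so the line may end with "matter…" and continue with "of"; for "and/or" it yields offset 4.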

daveallie

Incredible work here @osteotek!

Thank you so much for adding such a widely requested community feature. If you're interested, I've also invited you to join the organisation.

* origin:
  fix: truncate chapter names that are too long (crosspoint-reader#422)
  feat: dict based Hyphenation (crosspoint-reader#305)
  fix: render U+FFFD replacement character instead of ? (crosspoint-reader#366)
  fix: Invert colors on home screen cover overlay when recent book is selected (crosspoint-reader#390)
  Adds KOReader Sync support (crosspoint-reader#232)
  feat: Change keyboard "caps" to "shift" & Wrap Keyboard (crosspoint-reader#377)
  fix: XTC 1-bit thumb BMP polarity inversion (crosspoint-reader#373)

* Adds (optional) Hyphenation for English, French, German, Russian languages
* Included hyphenation dictionaries add approximately 280kb to the flash usage (German alone takes 200kb)
* Trie encoded dictionaries are adopted from hypher project (https://github.com/typst/hypher)
* Soft hyphens (and other explicit hyphens) take precedence over dict-based hyphenation. Overall, the hyphenation rules are quite aggressive, as I believe it makes more sense on our smaller screen.

---------

Co-authored-by: Dave Allie <[email protected]>


## Summary

* Add additional punctuation marks to the list of characters that can be immediately followed by a line break even where there is no explicit space

## Additional Context

* Huge appreciation to @osteotek for his amazing work on hyphenation. Reading on the device is so much better now.
* I am getting bad line breaks when ellipses (…) are between words and book file does not explicitly include some kind of breaking space.
* Per [discussion](#305 (comment)), several new characters are added in this PR to the `isExplicitHyphen` list to allow line breaks immediately after them:

Character | Unicode | Usage | Why include it?
-- | -- | -- | --
Solidus (Slash) | U+002F | / | Essential for breaking URLs and "and/or" constructs.
Backslash | U+005C | \ | Critical for technical text, file paths, and coding documentation.
Underscore | U+005F | _ | Prevents "runaway" line lengths in usernames or code snippets.
Middle Dot | U+00B7 | · | Acts as a semantic separator in dictionaries or stylistic lists.
Ellipsis | U+2026 | … | Prevents justification failure when dialogue lacks following spaces.
Midline Horizontal Ellipsis | U+22EF | ⋯ | Useful for mathematical sequences and technical notation.

### Example:

This shows an example of what line breaking looks like *with* this PR. Note the line break after "matter…" (which would not previously have been allowed). It's particularly important here because the book includes non-breaking spaces in "Mr. Aldrich" and "Mr. Rockefeller."

![IMG_2917](https://github.com/user-attachments/assets/8fa610a9-91dd-407f-8526-0019a8a7195f)

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it helps set the right context for reviewers.

Did you use AI tools to help write this code? **PARTIALLY**


osteotek added 30 commits December 17, 2025 18:10

clang format fix

13a6c43

Remove fallback break index logic from Hyphenator

074bab8

Merge branch 'master' into hyphenation-v2

c813a2f

Add comments to clarify hyphenation logic and structure in Epub processing

6366870

comments

b768c4b

Merge branch 'master' into hyphenation-v2

26bea34

clang format fix

47b1409

Merge branch 'master' into hyphenation-v2

a1ef3b9

Update settings count for CrossPointSettings and SettingsActivity

27d84c8

Merge branch 'master' into hyphenation-v2

54d7a94

Implement hyphenation support in text layout by enhancing word splitting and line breaking logic

e7edcb6

Add comments to clarify pre-splitting of oversized tokens in layout processing

85df6d7

Refactor hyphenation logic to utilize a prefix width cache for improved performance in word splitting and line breaking

4acdad4

Refactor ParsedText to remove PrefixWidthCache and simplify word width calculations for improved clarity and performance

73c8b17

Enhance line breaking logic with detailed comments for clarity and maintainability

4ee9783

Fix spacing calculation for justified text in extractLine method

a1f8230

Refactor computeLineBreaks to simplify logic and improve hyphenation handling

a3dc96a

Add punctuation handling: implement isPunctuation and trimTrailingPunctuation functions

e156790

Remove additional punctuation cases from isPunctuation function

a0113b5

Refactor hyphenation logic: update isAlphabetic function and enhance punctuation checks

0fa5029

Enhance hyphenation logic: add morphology break handling and improve vowel detection

5d00e5a

format fix

247463a

Refactor breakOffsets function: simplify return statements and improve readability

3806f18

Disable hyphenation feature in CrossPointSettings

3cf52d8

format fix

23183a6

Rename trimTrailingPunctuation to trimSurroundingPunctuation and update logic to remove surrounding punctuation; add explicit hyphen handling in breakOffsets function.

f6767c8

Add explicit hyphen handling and improve hyphenation logic in ParsedText and Hyphenator

cb1ecdb

Update subproject reference in open-x4-sdk

ae71752

osteotek changed the title ~~feature: dict based Hyphenation~~ feat: dict based Hyphenation Jan 15, 2026

osteotek added 2 commits January 15, 2026 21:48

refactor: unify punctuation trimming to handle footnotes in hyphenation logic

f028725

revert isAlphabetic change

bb5fd0c

restore breaking after en/em dashes

80c5e99

Merge remote-tracking branch 'origin/master' into hyphenation-v3

389f697

daveallie previously approved these changes Jan 19, 2026


Update settings count

e187373

daveallie dismissed their stale review via e187373 January 19, 2026 12:53

daveallie approved these changes Jan 19, 2026


daveallie enabled auto-merge (squash) January 19, 2026 12:54

daveallie merged commit 8824c87 into crosspoint-reader:master Jan 19, 2026
1 check passed

lukestein mentioned this pull request Jan 19, 2026

fix: Allow line break after ellipsis and underscore #425

Merged

lukestein mentioned this pull request Feb 21, 2026

fix: Correct hyphenation of URLs #1068

Open


7 participants
