feat: dict based Hyphenation #305
- Added EnglishHyphenator and RussianHyphenator classes to handle language-specific hyphenation rules.
- Introduced HyphenationCommon for shared utilities and character classification functions.
- Updated ParsedText to utilize hyphenation when laying out text.
- Enhanced the hyphenation logic to consider word splitting based on available width and character properties.
- Refactored existing code to improve readability and maintainability, including the use of iterators and lambda functions for line processing.
- Added necessary includes and organized header files for better structure.
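For readers following along, the width-aware splitting mentioned in this commit could look roughly like the sketch below. This is not the PR's actual code: `pickBreakOffset` and the `measure` callback are hypothetical names, and the real layout code presumably asks the font renderer for widths instead. The sketch also assumes that a longer prefix never measures narrower than a shorter one.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): given candidate break offsets in
// ascending order, return the largest offset whose prefix, plus a trailing
// hyphen, still fits into availableWidth. `measure` stands in for the real
// font-metrics call.
std::optional<std::size_t> pickBreakOffset(
    const std::u32string& word,
    const std::vector<std::size_t>& offsets,
    int availableWidth,
    const std::function<int(const std::u32string&)>& measure) {
  std::optional<std::size_t> best;
  for (std::size_t offset : offsets) {
    if (offset == 0 || offset >= word.size()) continue;  // never split at the edges
    std::u32string prefix = word.substr(0, offset);
    prefix.push_back(U'-');  // the visible hyphen also takes up width
    if (measure(prefix) <= availableWidth) {
      best = offset;  // offsets are ascending, so keep the latest one that fits
    } else {
      break;  // assumes width grows with prefix length
    }
  }
  return best;
}
```

If no candidate fits, the function returns an empty optional and the caller would move the whole word to the next line.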
- Introduced hyphenationEnabled flag in ParsedText and Section classes.
- Updated constructors and methods to handle hyphenation settings.
- Modified settings file versioning to include hyphenationEnabled.
- Enhanced settings UI to allow toggling of hyphenation feature.
…ing and line breaking logic
…ed performance in word splitting and line breaking
…h calculations for improved clarity and performance
…ctuation functions
…punctuation checks
…te logic to remove surrounding punctuation; add explicit hyphen handling in breakOffsets function.
…ext and Hyphenator
@osteotek Same here, I tested it with French like @lukestein did with English and everything seems to work perfectly. In my opinion, it's an essential feature for proper epub rendering, especially on a small screen. Thanks for your work.
CI seems to want the PR title renamed from "feature: " to "feat: "
Sorry to be repeatedly thanking @osteotek, but I much appreciate that you keep merging the master commits into your branch. It makes using/testing your feature easy.
German also looked good to me while testing over the last few days. Great work, really!
@osteotek, I don't know if this is a regression or something specific about the epub I'm reading now, but with the latest commits there may be a problem with failure to wrap after dashes:
@lukestein yes, there was a change in recent commits in how we define an explicit hyphen. We need to decide if we want to break on such dashes. Technically, the rules of hyphenation in English do not allow that, but it might make sense on such a small screen.
Continue to appreciate your work on this! My screenshot here shows em dashes (—) and my understanding is that all major English style guides suggest allowing line breaks after an em dash (Chicago, APA, MLA, AP, Oxford, and NY Times; some of these also allow breaking before an em dash, but I wouldn't recommend it). I also think that, especially on small screens, this is really important. I'm less opinionated about en dashes. Style guides generally discourage breaking after these. For explicit hyphens, I just know your earlier solution fixed #217 and I want to make sure the final code still does. Once again, many thanks.
@osteotek, with caveats that I don't entirely understand the code, I chatted a bit with AI about it (and my understanding of typography), where it eventually suggested this to me. You're the expert, but just in case useful:
@lukestein agreed, restored breaking after en/em dashes. Should work the same as before
Looking great with this commit 🙌 Will keep testing (i.e., reading books)
Hi @osteotek. Personally, I think from a user experience perspective the current PR could/should be merged as-is (pending @daveallie's review ofc) and then there is room for potential refinements later. Thank you! However, one strong suggestion based on continued testing is to please add the ellipsis character ( Other likely good candidates for
* origin:
  fix: truncate chapter names that are too long (crosspoint-reader#422)
  feat: dict based Hyphenation (crosspoint-reader#305)
  fix: render U+FFFD replacement character instead of ? (crosspoint-reader#366)
  fix: Invert colors on home screen cover overlay when recent book is selected (crosspoint-reader#390)
  Adds KOReader Sync support (crosspoint-reader#232)
  feat: Change keyboard "caps" to "shift" & Wrap Keyboard (crosspoint-reader#377)
  fix: XTC 1-bit thumb BMP polarity inversion (crosspoint-reader#373)
* Adds (optional) Hyphenation for English, French, German, Russian languages
* Included hyphenation dictionaries add approximately 280kb to the flash usage (German alone takes 200kb)
* Trie encoded dictionaries are adopted from hypher project (https://github.com/typst/hypher)
* Soft hyphens (and other explicit hyphens) take precedence over dict-based hyphenation. Overall, the hyphenation rules are quite aggressive, as I believe it makes more sense on our smaller screen.

---------

Co-authored-by: Dave Allie <[email protected]>
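
The precedence rule described above (explicit hyphens win over dictionary lookups) can be sketched as follows. This is only an illustration under stated assumptions, not the merged implementation: `candidateBreaks`, `isExplicitBreakChar`, and `DictLookup` are hypothetical names, the explicit set is reduced to two characters, and the dictionary is passed in as a callback rather than read from the trie-encoded tables.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Assumption: only the soft hyphen (U+00AD) and hyphen-minus (U+002D) are
// treated as explicit here; the set used in the PR is larger.
inline bool isExplicitBreakChar(char32_t c) {
  return c == 0x00AD || c == 0x002D;
}

// Hypothetical stand-in for the dictionary-based hyphenator.
using DictLookup = std::function<std::vector<std::size_t>(const std::u32string&)>;

// Returns offsets at which the word may be split (the break falls before
// word[offset]). Explicit hyphens inside the word take precedence; the
// dictionary is consulted only when the word contains none.
std::vector<std::size_t> candidateBreaks(const std::u32string& word,
                                         const DictLookup& dictLookup) {
  std::vector<std::size_t> explicitBreaks;
  // Skip the first and last characters so neither fragment is empty
  // or consists of the hyphen alone.
  for (std::size_t i = 1; i + 1 < word.size(); ++i) {
    if (isExplicitBreakChar(word[i])) explicitBreaks.push_back(i + 1);
  }
  if (!explicitBreaks.empty()) return explicitBreaks;
  return dictLookup(word);
}
```

With this shape, a word such as "well-known" yields a single break after the hyphen and the dictionary is never consulted for it.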
## Summary

* Add additional punctuation marks to the list of characters that can be immediately followed by a line break even where there is no explicit space

## Additional Context

* Huge appreciation to @osteotek for his amazing work on hyphenation. Reading on the device is so much better now.
* I am getting bad line breaks when ellipses (…) are between words and book file does not explicitly include some kind of breaking space.
* Per [discussion](#305 (comment)), several new characters are added in this PR to the `isExplicitHyphen` list to allow line breaks immediately after them:

Character | Unicode | Usage | Why include it?
-- | -- | -- | --
Solidus (Slash) | U+002F | / | Essential for breaking URLs and "and/or" constructs.
Backslash | U+005C | \ | Critical for technical text, file paths, and coding documentation.
Underscore | U+005F | _ | Prevents "runaway" line lengths in usernames or code snippets.
Middle Dot | U+00B7 | · | Acts as a semantic separator in dictionaries or stylistic lists.
Ellipsis | U+2026 | … | Prevents justification failure when dialogue lacks following spaces.
Midline Horizontal Ellipsis | U+22EF | ⋯ | Useful for mathematical sequences and technical notation.

### Example:

This shows an example of what line breaking looks like *with* this PR. Note the line break after "matter…" (which would not previously have been allowed). It's particularly important here because the book includes non-breaking spaces in "Mr. Aldrich" and "Mr. Rockefeller."

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it helps set the right context for reviewers.

Did you use AI tools to help write this code? **PARTIALLY**
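
For context, a check like `isExplicitHyphen` covering the characters in the table above might look like the sketch below. The function name comes from the PR description, but the signature and the baseline entries (hyphen-minus, soft hyphen, en dash, em dash) are assumptions drawn from the discussion, not copied from the repository.

```cpp
// Sketch only: the real isExplicitHyphen may take a different argument type
// and may list more (or fewer) code points than shown here.
bool isExplicitHyphen(char32_t cp) {
  switch (cp) {
    // Assumed baseline, based on the earlier discussion in this PR:
    case 0x002D:  // hyphen-minus
    case 0x00AD:  // soft hyphen
    case 0x2013:  // en dash
    case 0x2014:  // em dash
    // Characters listed in the table above:
    case 0x002F:  // solidus (slash)
    case 0x005C:  // backslash
    case 0x005F:  // underscore
    case 0x00B7:  // middle dot
    case 0x2026:  // ellipsis
    case 0x22EF:  // midline horizontal ellipsis
      return true;
    default:
      return false;
  }
}
```

A `switch` over code points keeps the list easy to extend if further characters are proposed later.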
…r#425)

