Skip to content

fix: Fix hyphenation and rendering of decomposed characters#1037

Merged
daveallie merged 7 commits intocrosspoint-reader:masterfrom
jpirnay:fix-hyphenation
Feb 22, 2026
Merged

fix: Fix hyphenation and rendering of decomposed characters#1037
daveallie merged 7 commits intocrosspoint-reader:masterfrom
jpirnay:fix-hyphenation

Conversation

@jpirnay
Copy link
Contributor

@jpirnay jpirnay commented Feb 20, 2026

Summary

  • This PR fixes decomposed diacritic handling end-to-end:
    • Hyphenation: normalize common Latin base+combining sequences to precomposed codepoints before Liang pattern matching, so decomposed words hyphenate correctly
    • Rendering: correct combining-mark placement logic so non-spacing marks are attached to the preceding base glyph in normal and rotated text rendering paths, with corresponding text-bounds consistency updates.
    • Hyphenation around non breaking space variants have been fixed (and extended)
    • Hyphenation of terms that already included of hyphens were fixed to include Liang pattern application (eg "US-Satellitensystem" was exclusively broken at the existing hyphen)

Additional Context

  • Before
2
  • After
fix1 fix2
  • Note 1: the hyphenation fix is not a 100% bullet proof implementation. It adds composition of common base+combining sequences (e.g. O + U+0308 -> Ö) during codepoint collection. A complete solution would require implementing proper Unicode normalization (at least NFC, possibly NFKC in specific cases) before hyphenation and rendering, instead of hand-mapping a few combining marks. That was beyond the scope of this fix.

  • Note 2: the render fix should be universal and not limited to the constraints outlined above: it properly x-centers the compund glyph over the previous one, and it uses at least 1pt of visual distance in y.

Before:
Image

After:
Image


AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? PARTIALLY

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 20, 2026

📝 Walkthrough

Walkthrough

Adds base-glyph-aware combining-mark handling across codepoint collection, font bounds calculation, and rendering: common Western diacritics are precomposed when possible; combining marks are positioned relative to the preceding base glyph (raised) and do not advance cursor or base advance metrics.

Changes

Cohort / File(s) Summary
Font bounds
lib/EpdFont/EpdFont.cpp
Adds detection of combining marks and tracks last base glyph (lastBaseX/lastBaseAdvance/lastBaseTop/hasBaseGlyph). Computes glyph positions for combining marks relative to base glyph, updates min/max bounds using base-aware coordinates, and prevents cursor advance for combining marks.
Precomposition / Hyphenation
lib/Epub/Epub/hyphenation/HyphenationCommon.cpp
Performs in-place NFC-like composition for common Western European combining diacritics: if a combining mark follows a base and a precomposed scalar exists, replaces the previous base scalar and skips emitting the combining mark.
Text rendering / advances
lib/GfxRenderer/GfxRenderer.cpp
Adds combining-mark handling to drawText/drawTextRotated90CW/getTextAdvanceX: tracks last base glyph state, renders combining glyphs at adjusted (raised) coordinates without advancing cursor, and excludes combining marks from advance width.
Unicode utility
lib/Utf8/Utf8.h
Adds inline utf8IsCombiningMark(uint32_t) to identify combining diacritical marks by codepoint ranges for use across font/rendering/path logic.

Sequence Diagram(s)

sequenceDiagram
    participant Input as TextInput
    participant Collector as HyphenationCommon::collectCodepoints
    participant Font as EpdFont::getTextBounds
    participant Renderer as GfxRenderer::drawText
    participant Output as FrameBuffer

    Input->>Collector: stream codepoints
    alt combining follows base and precompose exists
        Collector-->>Collector: replace previous base with precomposed scalar
    else
        Collector->>Collector: emit codepoint as-is
    end
    Collector->>Font: request glyph metrics & bounds (per glyph)
    Font-->>Font: update lastBaseX/advance/top when base glyph seen
    opt combining mark with base present
        Font-->>Font: compute glyphBaseX/glyphBaseY (raiseBy, MIN_COMBINING_GAP_PX)
    end
    Font->>Renderer: glyph positions & metrics
    Renderer->>Renderer: draw base glyphs (advance cursor)
    opt combining mark
        Renderer->>Renderer: draw combining glyph at adjusted coord (no cursor advance)
    end
    Renderer->>Output: rendered pixels
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • daveallie
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: fixing hyphenation and rendering of decomposed characters, which aligns with the core objectives of this pull request.
Description check ✅ Passed The pull request description is directly related to the changeset, clearly outlining the fixes for decomposed diacritic handling in hyphenation and rendering across multiple files with before/after examples.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@jpirnay
Copy link
Contributor Author

jpirnay commented Feb 20, 2026

Testfile:
hyphenation_demo.zip

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cdd362 and 32134b5.

📒 Files selected for processing (2)
  • lib/EpdFont/EpdFont.cpp
  • lib/Utf8/Utf8.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (3)
lib/Utf8/Utf8.h (1)

13-19: LGTM!

The combining mark ranges are accurate per Unicode standards, and the inline placement is appropriate for this simple predicate. The intentional limitation to common Western diacritics (as noted in the PR) is a reasonable trade-off for the hyphenation use case.

lib/EpdFont/EpdFont.cpp (2)

31-35: LGTM!

Clear naming for the base-glyph tracking state, and constexpr for the gap constant is appropriate.


49-72: LGTM!

The combining mark positioning logic is sound:

  • Vertical gap enforcement ensures marks don't overlap base glyphs
  • Horizontal centering over the base glyph's advance width is a reasonable default
  • Cursor advancement correctly skipped for combining marks
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/EpdFont/EpdFont.cpp`:
- Around line 7-16: Remove the local anonymous-namespace helper bool
isCombiningMark(...) and replace all uses of isCombiningMark in EpdFont.cpp with
the already-included utf8IsCombiningMark from Utf8.h; ensure the file keeps the
`#include` "Utf8.h" and delete the duplicate function definition to avoid
divergence and rely solely on utf8IsCombiningMark wherever combining-mark checks
are performed (e.g., where isCombiningMark was previously called).

@lukestein
Copy link
Contributor

Thank you very much for this! Tagging in our hyphenation expert @osteotek. I've been really hoping we could get the fix that this promises:

Hyphenation of terms that already included of hypehns were fixed to include Liang pattern application (eg "US-Satellitensystem" was exclusively broken at the existing hyphen)

I..e., allowing additional hyphenation of words that also include an explicit hyphen so that the occasional multiply-hyphenated-phrases have more places to get broken. (I use large or extra-large fonts, so I'm particularly sensitive to hyphenation quality.)

@osteotek
Copy link
Member

Ok, since popular opinion seems to be in favor of breaking already hyphenated words, i'm OK with it too 😃

Copy link
Contributor

@cbix cbix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment much on the code (probably could with more time at my hands tho 👀), but this fixes the two issues (#998) from my original epubs:
Image
Image

@daveallie daveallie merged commit 5f5561b into crosspoint-reader:master Feb 22, 2026
7 checks passed
daveallie pushed a commit that referenced this pull request Feb 22, 2026
## Summary

* This PR fixes decomposed diacritic handling end-to-end:
- Hyphenation: normalize common Latin base+combining sequences to
precomposed codepoints before Liang pattern matching, so decomposed
words hyphenate correctly
- Rendering: correct combining-mark placement logic so non-spacing marks
are attached to the preceding base glyph in normal and rotated text
rendering paths, with corresponding text-bounds consistency updates.
- Hyphenation around non breaking space variants have been fixed (and
extended)
- Hyphenation of terms that already included of hyphens were fixed to
include Liang pattern application (eg "US-Satellitensystem" was
*exclusively* broken at the existing hyphen)

## Additional Context

* Before
<img width="800" height="480" alt="2"
src="https://github.com/user-attachments/assets/b9c515c4-ab75-45cc-8b52-f4d86bce519d"
/>


* After
<img width="480" height="800" alt="fix1"
src="https://github.com/user-attachments/assets/4999f6a8-f51c-4c0a-b144-f153f77ddb57"
/>
<img width="800" height="480" alt="fix2"
src="https://github.com/user-attachments/assets/7355126b-80c7-441f-b390-4e0897ee3fb6"
/>

* Note 1: the hyphenation fix is not a 100% bullet proof implementation.
It adds composition of *common* base+combining sequences (e.g. O +
U+0308 -> Ö) during codepoint collection. A complete solution would
require implementing proper Unicode normalization (at least NFC,
possibly NFKC in specific cases) before hyphenation and rendering,
instead of hand-mapping a few combining marks. That was beyond the scope
of this fix.

* Note 2: the render fix should be universal and not limited to the
constraints outlined above: it properly x-centers the compund glyph over
the previous one, and it uses at least 1pt of visual distance in y.

Before:
<img width="478" height="167" alt="Image"
src="https://github.com/user-attachments/assets/f8db60d5-35b1-4477-96d0-5003b4e4a2a1"
/>

After: 
<img width="479" height="180" alt="Image"
src="https://github.com/user-attachments/assets/1b48ef97-3a77-475a-8522-23f4aca8e904"
/>

* This should resolve the issues described in #998 
---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? _**PARTIALLY**_
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants