fix: add CJK Extension H/I to isCJK and fix pre-wrap fast path #33

Open
xr843 wants to merge 1 commit into chenglou:main from xr843:fix/cjk-extension-h-i-coverage

xr843 commented Mar 30, 2026

Summary

Two small, targeted fixes:

  1. Add missing CJK Extension H and I ranges to isCJK()

    CJK Unified Ideographs Extension H (U+31350–U+323AF, Unicode 15.0) and Extension I (U+2EBF0–U+2F7FF, Unicode 15.1) are not covered by the current range checks. Characters from these blocks fall through without CJK per-character line breaking or kinsoku attachment rules.

    This directly addresses the documented requirement in CLAUDE.md:

    "Astral CJK ideographs, compatibility ideographs, and the later extension blocks must still hit the CJK path"

    The new ranges are placed adjacent to their neighboring extensions (I after F, H after G) to keep the block ordering clear.

  2. Remove no-op .replace() on normalizeWhitespacePreWrap fast path

    When the guard !/[\r\f]/.test(text) is true, no \r exists in the text, so the subsequent .replace(/\r\n/g, '\n') can never match — it always returns the input unchanged. The fast path now returns text directly.
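
A minimal sketch of why the removed call could never match. The function name `normalizePreWrapSketch` and its slow path are assumptions for illustration only; the actual body of `normalizeWhitespacePreWrap` in src/analysis.ts may handle `\r` and `\f` differently.

```typescript
// Hypothetical stand-in for normalizeWhitespacePreWrap; only the guard
// and the removed .replace(/\r\n/g, '\n') come from the PR description.
function normalizePreWrapSketch(text: string): string {
  // Fast path: the guard proves there is no \r (and no \f) in the text,
  // so /\r\n/g has nothing to match and the input can be returned as is.
  if (!/[\r\f]/.test(text)) return text

  // Slow path (assumed behavior): collapse CRLF first, then map any
  // remaining lone \r or \f to \n.
  return text.replace(/\r\n/g, '\n').replace(/[\r\f]/g, '\n')
}

// On any input that passes the guard, the old replace was an identity.
const sample = 'line one\n  indented\ttabbed'
const replaceWasNoOp = sample.replace(/\r\n/g, '\n') === sample
```

Returning `text` directly also avoids a regex scan over the whole string on the common path.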

Changes

  • src/analysis.ts: Add Extension H/I ranges to isCJK(); simplify pre-wrap fast path
  • src/layout.test.ts: Add Extension H/I ranges to isWideCharacter() test helper; add test case verifying Extension H/I characters get CJK break behavior

Test plan

  • bun test — 61 tests pass (including new Extension H/I test)
  • tsc --noEmit — no type errors
  • Verified Extension H (U+31350) and Extension I (U+2EBF0) now return true from isCJK()
  • Verified the pre-wrap fast path is functionally equivalent (no \r → no \r\n → replace was always a no-op)
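
The range check described above can be sketched as a table lookup. This is an illustration, not the code in src/analysis.ts: `ASTRAL_CJK_BLOCKS` and `isAstralCJKSketch` are hypothetical names, and only the supplementary-plane extension blocks are listed. The boundaries follow the Unicode block list, so Extension I ends at U+2EE5F here, which is narrower than the U+2F7FF bound quoted in the summary.

```typescript
// [first, last, label] for each supplementary-plane CJK extension block.
// Hypothetical table for illustration; the real isCJK() uses inline range checks.
const ASTRAL_CJK_BLOCKS: ReadonlyArray<readonly [number, number, string]> = [
  [0x20000, 0x2A6DF, 'Extension B'],
  [0x2A700, 0x2B73F, 'Extension C'],
  [0x2B740, 0x2B81F, 'Extension D'],
  [0x2B820, 0x2CEAF, 'Extension E'],
  [0x2CEB0, 0x2EBEF, 'Extension F'],
  [0x2EBF0, 0x2EE5F, 'Extension I'], // Unicode 15.1
  [0x30000, 0x3134F, 'Extension G'],
  [0x31350, 0x323AF, 'Extension H'], // Unicode 15.0
]

// True when the code point falls inside any listed block (bounds inclusive).
function isAstralCJKSketch(codePoint: number): boolean {
  return ASTRAL_CJK_BLOCKS.some(([lo, hi]) => codePoint >= lo && codePoint <= hi)
}

const firstOfH = isAstralCJKSketch(0x31350) // first code point of Extension H
const firstOfI = isAstralCJKSketch(0x2EBF0) // first code point of Extension I
const pastI = isAstralCJKSketch(0x2EE60)    // just past the Extension I block
```

Keeping the block name next to each pair makes a mislabeled or overshooting bound easier to spot in review.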

Two small fixes:

1. Add missing CJK Unified Ideographs Extension H (U+31350–U+323AF,
   Unicode 15.0) and Extension I (U+2EBF0–U+2F7FF, Unicode 15.1) to
   `isCJK()`. Without these ranges, characters from these blocks do not
   get per-character CJK line breaking or kinsoku rules. This aligns
   with the documented requirement that "later extension blocks must
   still hit the CJK path."

2. Remove a no-op `.replace(/\r\n/g, '\n')` on the fast path of
   `normalizeWhitespacePreWrap()`. When the guard `!/[\r\f]/` is true,
   no `\r` is present, so `\r\n` cannot match either—the replace
   always returns the input unchanged.
(c >= 0x2B740 && c <= 0x2B81F) ||
(c >= 0x2B820 && c <= 0x2CEAF) ||
(c >= 0x2CEB0 && c <= 0x2EBEF) ||
(c >= 0x2EBF0 && c <= 0x2F7FF) ||

Hey, nice catch on the missing extensions! Two issues I spotted in the diff:

  1. Wrong upper bound for Extension I
    The added range (c >= 0x2EBF0 && c <= 0x2F7FF) overshoots. CJK Extension I (Unicode 15.1) ends at 0x2EE5F, not 0x2F7FF. The range 0x2EE60–0x2F7FF is unassigned and runs right up to the CJK Compatibility Ideographs Supplement block, which starts at 0x2F800. It should be:

(c >= 0x2EBF0 && c <= 0x2EE5F) // CJK Extension I

})

test('treats CJK Extension H and I ideographs as CJK break units', () => {
const extH = String.fromCodePoint(0x31350) // CJK Extension H (Unicode 15.0)

const extH = String.fromCodePoint(0x31350) // first code point of Extension H (U+31350–U+323AF, Unicode 15.0)
const extI = String.fromCodePoint(0x2EBF0) // first code point of Extension I (U+2EBF0–U+2EE5F, Unicode 15.1)
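
Both blocks sit outside the Basic Multilingual Plane, so each test character is a surrogate pair in a JavaScript string. A small sketch of what that means for any code that classifies characters; `codePointsOf` is a hypothetical helper, not something from the repository.

```typescript
const extH = String.fromCodePoint(0x31350) // Extension H, Unicode 15.0
const extI = String.fromCodePoint(0x2EBF0) // Extension I, Unicode 15.1

// Iterating a string yields whole code points, so a surrogate pair
// arrives as one element instead of two UTF-16 code units.
function codePointsOf(text: string): number[] {
  return Array.from(text, ch => ch.codePointAt(0) as number)
}

const units = (extH + 'a' + extI).length   // counts UTF-16 code units
const points = codePointsOf(extH + 'a' + extI)
const firstUnit = extH.charCodeAt(0)       // a high surrogate, not 0x31350
```

A range check fed with `charCodeAt` values would only ever see surrogates in the 0xD800–0xDFFF range, so the new ranges take effect only when the caller passes full code points.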
