
Update CMU pronunciation dictionary to latest version with deduplication #444

Merged
dhdaines merged 1 commit into main from update-cmudict
Dec 1, 2025

@lenzo-ka
Contributor

Description

This PR updates the English pronunciation dictionary (cmudict-en-us.dict) to the latest version from the CMUdict repository. Stress markers are removed, and pronunciation variants that become identical as a result are deduplicated automatically. The processing tool was also added to the cmudict repo in this PR.

Changes

Dictionary Processing:

  • Strip stress markers (0, 1, 2) from all phonemes for PocketSphinx compatibility
  • Remove duplicate pronunciations created by stress normalization
  • Renumber pronunciation variants sequentially after deduplication
  • Strip trailing comments from dictionary entries
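
The four processing steps above can be sketched roughly as follows. This is a minimal illustration, not the actual tool added to the cmudict repo. It assumes that each input line looks like `word PH1 PH2 ...`, that variants are written as `word(2)`, that comments start with ` #`, and that stress is a trailing `0`, `1` or `2` on a phone. The function name `convert` is hypothetical.

```python
import re

# Matches "word" or "word(2)"; the variant number is discarded and rebuilt later.
VARIANT_RE = re.compile(r"^(?P<word>\S+?)(?:\((?P<n>\d+)\))?$")


def convert(lines):
    """Strip comments and stress digits, drop duplicates, renumber variants."""
    prons = {}  # base word -> list of unique pronunciations, in input order
    for line in lines:
        # Assumption: a trailing comment is introduced by " #".
        line = line.split(" #", 1)[0].strip()
        if not line:
            continue
        head, *phones = line.split()
        word = VARIANT_RE.match(head).group("word")
        # Assumption: stress is a trailing 0/1/2 on vowel phones (AH0 -> AH).
        pron = tuple(p.rstrip("012") for p in phones)
        seen = prons.setdefault(word, [])
        if pron not in seen:  # skip variants that collapsed into an earlier one
            seen.append(pron)

    out = []
    for word in sorted(prons):  # assumption: plain code-point ordering
        for i, pron in enumerate(prons[word], start=1):
            head = word if i == 1 else f"{word}({i})"
            out.append(f"{head} {' '.join(pron)}")
    return out


if __name__ == "__main__":
    # Toy entries for illustration only, not taken from CMUdict.
    sample = [
        "demo D EH1 M OW0",
        "demo(2) D EH1 M OW2",
        "demo(3) D IY1 M OW0 # toy comment",
        "alpha AE1 L F AH0",
    ]
    print("\n".join(convert(sample)))
```

In the toy input, the second `demo` variant becomes identical to the first once stress is stripped, so the third variant is renumbered to `demo(2)`.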

Statistics:

  • Previous: 134,782 entries
  • Updated: 134,860 entries (net +78 entries)
  • Removed: 306 duplicate pronunciations after stress removal
  • Added: 384 new words and variants from updated CMUdict
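
As a quick sanity check, the figures above reconcile if the removed duplicates are subtracted from the previous count and the new entries are added to it. That reading of the numbers is an assumption; the variable names are illustrative.

```python
previous = 134_782  # entries before the update
removed = 306       # duplicates dropped after stress removal
added = 384         # new words and variants from upstream

updated = previous - removed + added
net_change = updated - previous

assert updated == 134_860
assert net_change == 78
```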

Notable Updates:

  • Many words updated with improved phonetic transcriptions
  • New place names and proper nouns added
  • Obsolete pronunciation variants removed
  • Words ending in -ism updated with correct schwa pronunciation (e.g., athleticism, realism)
  • Improved pronunciations for common words (e.g., gloucester, worcestershire)

Verification steps

The updated dictionary has been tested and verified to maintain full compatibility with PocketSphinx:

  1. Dictionary format validated (all entries well-formed)
  2. No duplicate pronunciations present
  3. Variant numbering sequential and correct
  4. Successfully used for speech recognition with PocketSphinx decoder
  5. Tested on real audio samples with multiple language models
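
Checks 1 to 3 could be automated along these lines. This is a hedged sketch, not the script used for this PR. It assumes stress-free phones are plain uppercase letters and that variants are numbered `word(2)`, `word(3)`, and so on. The name `check_dict` is hypothetical. The decoder tests in steps 4 and 5 are not covered.

```python
import re
from collections import defaultdict

HEAD_RE = re.compile(r"^(\S+?)(?:\((\d+)\))?$")  # "word" or "word(2)"
PHONE_RE = re.compile(r"^[A-Z]+$")               # assumption: no stress digits


def check_dict(lines):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    variants = defaultdict(list)  # word -> [(variant number, pronunciation)]
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 2:
            problems.append(f"line {lineno}: expected a word and at least one phone")
            continue
        head, phones = fields[0], tuple(fields[1:])
        word, n = HEAD_RE.match(head).groups()
        bad = [p for p in phones if not PHONE_RE.match(p)]
        if bad:
            problems.append(f"line {lineno}: unexpected phone(s) {bad}")
        variants[word].append((int(n) if n else 1, phones))

    for word, entries in variants.items():
        numbers = [n for n, _ in entries]
        # Variants must appear as 1 (unnumbered), 2, 3, ... with no gaps.
        if numbers != list(range(1, len(numbers) + 1)):
            problems.append(f"{word}: variant numbers {numbers} are not sequential")
        prons = [p for _, p in entries]
        if len(set(prons)) != len(prons):
            problems.append(f"{word}: duplicate pronunciation")
    return problems
```

It could be run as `check_dict(open("cmudict-en-us.dict", encoding="utf-8"))`; an empty result means the three format checks passed under the assumptions above.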

Source

The dictionary was generated from the CMUdict repository as of October 2025 using an automated conversion script that:

  • Strips stress markers for PocketSphinx compatibility
  • Deduplicates variants that become identical after stress removal
  • Preserves all unique pronunciation variants
  • Maintains alphabetical ordering

The conversion is deterministic and reproducible from the upstream CMUdict source.

Update cmudict-en-us.dict from latest CMUdict source with proper handling
of stress marker removal and duplicate elimination.

Source: https://github.com/cmusphinx/cmudict

Changes:
- Updated from older CMUdict snapshot to current version
- Strip stress markers (0, 1, 2) from all phonemes
- Remove duplicate pronunciations created by stress normalization
- Renumber pronunciation variants sequentially after deduplication
- Strip trailing comments from dictionary entries

Statistics:
- Previous: 134,782 entries
- Current: 134,860 entries (net +78 entries)
- Removed: 306 duplicate pronunciations after stress removal
- Added: 384 new words and variants from updated CMUdict

Notable changes:
- Many words updated with improved phonetic transcriptions
- New place names and proper nouns added
- Obsolete pronunciation variants removed
- Words ending in -ism updated with correct schwa pronunciation

The new dictionary maintains full compatibility with PocketSphinx while
incorporating the latest lexical improvements from the CMUdict project.

@lenzo-ka requested a review from dhdaines on October 24, 2025 17:45
Contributor

@dhdaines left a comment

Great! Glad to know we finally have a pronunciation for "Taoiseach" :-)

@dhdaines merged commit ecfea52 into main on Dec 1, 2025
21 checks passed
@dhdaines deleted the update-cmudict branch on December 1, 2025 02:23