Update CMU pronunciation dictionary to latest version with deduplication #444
Merged
Update cmudict-en-us.dict from latest CMUdict source with proper handling of stress marker removal and duplicate elimination.

Source: https://github.com/cmusphinx/cmudict

Changes:
- Updated from older CMUdict snapshot to current version
- Strip stress markers (0, 1, 2) from all phonemes
- Remove duplicate pronunciations created by stress normalization
- Renumber pronunciation variants sequentially after deduplication
- Strip trailing comments from dictionary entries

Statistics:
- Previous: 134,782 entries
- Current: 134,860 entries (net +78 entries)
- Removed: 306 duplicate pronunciations after stress removal
- Added: 384 new words and variants from updated CMUdict

Notable changes:
- Many words updated with improved phonetic transcriptions
- New place names and proper nouns added
- Obsolete pronunciation variants removed
- Words ending in -ism updated with correct schwa pronunciation

The new dictionary maintains full compatibility with PocketSphinx while incorporating the latest lexical improvements from the CMUdict project.
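The processing steps listed in the commit message (strip trailing comments, strip stress markers, drop duplicates, renumber variants) could look roughly like the sketch below. This is not the actual conversion tool from the cmudict repo; the function name `convert` and the assumed input format (one `word PHONE PHONE ...` entry per line, variants written as `word(2)`, comments introduced by whitespace followed by `#`) are assumptions made for illustration.

```python
import re

# Assumed variant syntax: "word(2)", "word(3)", ...
VARIANT_RE = re.compile(r"^(.+?)\((\d+)\)$")
# Stress markers are a single trailing digit 0, 1 or 2 on a phoneme.
STRESS_RE = re.compile(r"[012]$")


def convert(lines):
    """Return stress-free, deduplicated, renumbered dictionary lines."""
    prons = {}  # base word -> unique pronunciations, in first-seen order
    for raw in lines:
        # Strip a trailing comment (assumed to start at whitespace + '#').
        line = re.split(r"\s+#", raw, maxsplit=1)[0].strip()
        if not line:
            continue
        head, *phones = line.split()
        m = VARIANT_RE.match(head)
        word = m.group(1) if m else head
        # Remove stress digits; distinct variants may now become identical.
        stripped = tuple(STRESS_RE.sub("", p) for p in phones)
        seen = prons.setdefault(word, [])
        if stripped not in seen:
            seen.append(stripped)

    out = []
    for word, variants in prons.items():
        # Renumber sequentially: the first variant has no suffix.
        for i, phones in enumerate(variants, start=1):
            head = word if i == 1 else f"{word}({i})"
            out.append(f"{head} {' '.join(phones)}")
    return out
```

Because the output depends only on the order and content of the input lines, running it twice on the same upstream file yields the same result.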
dhdaines
approved these changes
Dec 1, 2025
Contributor
dhdaines
left a comment
Great! Glad to know we finally have a pronunciation for "Taoiseach" :-)
Description
This PR updates the English pronunciation dictionary (cmudict-en-us.dict) to the latest version from the CMUdict repository, with proper handling of stress marker removal and automatic deduplication of pronunciation variants. The processing tool was also added to the cmudict repo in this PR.
Changes
Dictionary Processing:
Statistics:
Notable Updates:
Verification steps
The updated dictionary has been tested and verified to maintain full compatibility with PocketSphinx:
Source
Dictionary generated from a CMUdict commit as of October 2025 using an automated conversion script that:
The conversion is deterministic and reproducible from the upstream CMUdict source.
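One way to sanity-check a generated dictionary against the properties this PR describes (no stress markers left, no duplicate pronunciations per word, sequential variant numbers) is sketched below. It is not the verification actually performed for this PR; the helper name `check_dict` and the assumed line format (`word PHONE ...`, variants as `word(2)`) are illustrative assumptions.

```python
import re
from collections import defaultdict

# Assumed variant syntax: "word(2)", "word(3)", ...
VARIANT_RE = re.compile(r"^(.+?)\((\d+)\)$")


def check_dict(lines):
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    variants = defaultdict(list)  # base word -> [(variant index, phones)]
    for n, raw in enumerate(lines, start=1):
        parts = raw.split()
        if not parts:
            continue  # ignore blank lines
        if len(parts) < 2:
            problems.append(f"line {n}: missing pronunciation")
            continue
        head, phones = parts[0], tuple(parts[1:])
        # A phoneme ending in a digit still carries a stress marker.
        if any(p[-1].isdigit() for p in phones):
            problems.append(f"line {n}: stress marker left in {head}")
        m = VARIANT_RE.match(head)
        word, idx = (m.group(1), int(m.group(2))) if m else (head, 1)
        variants[word].append((idx, phones))

    for word, entries in variants.items():
        idxs = [i for i, _ in entries]
        # Variants should be numbered 1 (implicit), 2, 3, ... with no gaps.
        if sorted(idxs) != list(range(1, len(idxs) + 1)):
            problems.append(f"{word}: variant numbers not sequential: {idxs}")
        prons = [p for _, p in entries]
        if len(set(prons)) != len(prons):
            problems.append(f"{word}: duplicate pronunciation")
    return problems
```

Calling `check_dict(open("cmudict-en-us.dict", encoding="utf-8"))` would then return an empty list for a file that satisfies all three properties.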