feat: language-aware filler word removal #971

Merged: cjpais merged 2 commits into main from remove-some-filler-words on Mar 6, 2026

cjpais (Owner) commented Mar 6, 2026

  • Filler words are now selected based on the app's UI language instead of using a hardcoded universal list
  • Words like "um", "eh", and "ha" are removed only for languages where they are truly fillers. They are preserved, for example, in Portuguese (where "um" means "a/an") and in Spanish (where "ha" means "has")
  • Adds custom_filler_words setting for users who want to override the defaults (set to [] to disable filtering entirely)
  • Covers all 17 supported UI languages, with a conservative fallback for unknown languages

Addresses the concerns raised in #943. Rather than simply removing the filler words that conflict with European languages (which regresses English), this takes the language-aware approach discussed in that thread.

Known limitation: custom_filler_words is a power-user setting with no UI. It can only be changed by editing settings_store.json directly, and changes to the field require an app restart to take effect.

The setting has three states:

  • null or absent (default): Uses the built-in filler word list for the user's selected app language. For English this includes words like "uh", "um", "hmm", "ah", etc. For other languages, the list is tailored to avoid removing real words in that language (e.g., "um" means "a/an" in Portuguese, "ha" means "has" in Spanish).
  • Empty list []: Disables filler word removal entirely.
  • Word list ["like", "okay", ...]: Completely overrides the language defaults. Only the words you specify will be filtered. The built-in filler words for your language will not be applied, so if you still want those removed, you need to include them in your list as well.
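
As a rough illustration of the three states above, here is a minimal sketch in Rust. It is not the code from this PR: the function name `effective_filler_words` and its signature are hypothetical, and it assumes the setting is modeled as an optional list, with `None` standing in for null or absent.

```rust
/// Resolve the filler list that will actually be applied.
/// Hypothetical helper, not taken from the PR's source.
///
/// `custom` models the `custom_filler_words` setting:
///   - `None`        => null or absent
///   - `Some(&[])`   => empty list
///   - `Some(words)` => user-supplied word list
fn effective_filler_words(custom: Option<&[String]>, language_defaults: &[&str]) -> Vec<String> {
    match custom {
        // null or absent: use the built-in list for the selected app language.
        None => language_defaults.iter().map(|w| w.to_string()).collect(),
        // [] or a word list: use exactly what the user supplied.
        // An empty list therefore filters nothing, and a non-empty list
        // replaces the language defaults instead of extending them.
        Some(words) => words.to_vec(),
    }
}
```

Note that the override case does not merge with the defaults, which matches the description above: anyone who wants "uh" removed alongside "like" has to list both.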

The filler word list is selected based on the app language setting (the UI language), not the transcription language. For unknown or unsupported languages, a conservative fallback list is used that avoids removing ambiguous words like "um", "eh", and "ha".
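
The selection logic could look roughly like the sketch below. This is an assumption-laden illustration rather than the PR's implementation: the function name `filler_words_for_language`, the handling of region subtags, and the word lists themselves are placeholders. Only the exclusions ("um" for Portuguese, "ha" for Spanish, and "um", "eh", "ha" for the fallback) come from the description above.

```rust
/// Pick a built-in filler list from the app (UI) language code.
/// Hypothetical helper; the lists are illustrative, not the shipped ones.
fn filler_words_for_language(app_language: &str) -> &'static [&'static str] {
    // Assumption: region variants such as "pt-BR" share the list of "pt".
    let primary = app_language
        .split(|c: char| c == '-' || c == '_')
        .next()
        .unwrap_or("")
        .to_ascii_lowercase();

    match primary.as_str() {
        // English: "um" is a genuine hesitation sound.
        "en" => &["uh", "um", "hmm", "ah"],
        // Portuguese: "um" means "a/an", so it is left out.
        "pt" => &["uh", "hmm", "ah"],
        // Spanish: "ha" means "has", so it is left out.
        "es" => &["uh", "um", "hmm", "ah"],
        // Unknown language: conservative list without "um", "eh", or "ha".
        _ => &["uh", "uhm", "hmm"],
    }
}
```

Because the key is the UI language and not the transcription language, a user running the app in English while dictating Portuguese would still get the English list under this scheme.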

AlexanderYastrebov and others added 2 commits March 2, 2026 22:19
Removed three filler words from the transcription filter that are actual
words in European languages:

- 'um' - Portuguese indefinite article meaning 'a/an' (masculine)
  Example: Portuguese 'Isto é um teste' (This is a test)
  Removing this was breaking Portuguese transcriptions

- 'ha' - Spanish/Italian/Norwegian/Swedish auxiliary verb meaning 'has/have'
  Example: Spanish 'Él ha comido' (He has eaten)
  This is a very common verb form in Romance and Scandinavian languages

- 'eh' - Italian interjection and Canadian English discourse marker
  While less critical, this can appear in legitimate Italian speech

The remaining filler words are primarily English
vocalized hesitations that don't conflict with common words in other
European languages.

Updated tests to use 'uhm' instead of 'um' where needed.

Fixes #941
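
The conflict described in this commit message can be reproduced with a small whole-word filter. The sketch below is illustrative only: `remove_fillers` is a hypothetical function, not the project's `filter_transcription_output`, and it assumes matching is case-insensitive and ignores punctuation attached to a word.

```rust
/// Drop every whitespace-separated token that matches a filler word.
/// Hypothetical sketch; matching is whole-word and case-insensitive,
/// and punctuation around a token is ignored for the comparison.
fn remove_fillers(text: &str, fillers: &[&str]) -> String {
    text.split_whitespace()
        .filter(|token| {
            // Strip leading/trailing punctuation so "uhm," compares as "uhm".
            let word = token
                .trim_matches(|c: char| !c.is_alphanumeric())
                .to_lowercase();
            // Keep the token unless it is one of the filler words.
            !fillers.iter().any(|f| *f == word)
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

With "um" in the list, `remove_fillers("Isto é um teste", &["uh", "um"])` drops the article and yields "Isto é teste". With "uhm" in its place, the sentence passes through unchanged.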
cjpais changed the title from "Remove some filler words" to "feat: language-aware filler word removal" on Mar 6, 2026
cjpais merged commit 615b3c9 into main on Mar 6, 2026
4 checks passed
thukabjj added a commit to thukabjj/Handy that referenced this pull request Mar 7, 2026
Upstream changes:
- Language-aware filler word removal (cjpais#971)
- Portable mode NSIS installer (cjpais#807)
- Italian translation fix (cjpais#973)
- Bun setup for Windows ARM64 (cjpais#965)
- tauri-plugin-dialog upgraded to 2.6
- Linux install docs and AppImage troubleshooting (cjpais#951)

Conflict resolution:
- package.json: take upstream dialog 2.6
- Cargo.toml: take upstream dialog 2.6, keep fork deps (symphonia, keyring)
- text.rs: take upstream's language-aware filter_transcription_output, keep fork test
- settings/mod.rs: keep fork fields + add upstream custom_filler_words
- bun.lock: take upstream version
fxbenard pushed a commit to fxbenard/Parler that referenced this pull request Mar 11, 2026
* Remove filler words that conflict with European languages


Fixes cjpais#941

* customize list

---------

Co-authored-by: Alexander Yastrebov