MatteoFasulo/silent_speech
Voicing Silent Speech

This repository contains code for synthesizing speech audio from silently mouthed words captured with electromyography (EMG). It is a fork of David Gaddy's original Silent Speech repository, updated to support the TinyMyo EMG foundation model for transduction and recognition tasks.

Key Changes in this Fork

  • TinyMyo Integration: Added support for pre-training and fine-tuning using the TinyMyo foundation model.
  • Improved Tooling: Integrated uv for faster dependency management and PyTorch setup.
  • Streamlined Dataset Pipeline: Added HDF5 dataset building scripts for much faster training initialization.
  • WandB Logging: Added experiment tracking via Weights & Biases.
  • Updated Decoding: Replaced DeepSpeech dependencies with SpeechBrain and torchaudio built-in CTC decoders for better compatibility with modern hardware.

Environment Setup

The code requires Python 3.10+. We recommend using uv to sync dependencies:

uv sync

You must also initialize submodules for Hifi-GAN and phoneme alignment data:

git submodule update --init
# Extract alignments if not already done
tar -xvzf text_alignments/text_alignments.tar.gz

Data & Preprocessing

The EMG and audio data can be downloaded from Zenodo; the download_data.py script automates this process. After downloading, you can optionally clean and resample the audio ahead of time so it does not need to be done on-the-fly during training.

An HDF5 dataset builder is also included to convert the raw data into a format that allows for much faster loading during training, suitable for HPC environments.

1. Download Data

Configure your $DATA_PATH in config/transduction_model.json, then run:

python download_data.py
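The config key shown below is illustrative, not the repo's exact schema; check config/transduction_model.json for the actual field names. The idea is simply that the download target directory lives in the JSON config:

```json
{
  "data_path": "/path/to/emg_data"
}
```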

2. Audio Cleaning (Optional)

Speeds up training by saving resampled audio files:

python data_collection/clean_audio.py

3. Build HDF5 Dataset

Required for performant training. This builds the dataset once rather than on-the-fly:

python build_hdf5.py
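The exact layout produced by build_hdf5.py is internal to this repo, but the general pattern it relies on — writing each utterance's feature arrays into one HDF5 file once, then slicing them lazily at train time — can be sketched as follows (group and dataset names here are hypothetical):

```python
import h5py
import numpy as np

# Write once: store each utterance's EMG and acoustic features as datasets.
utterances = [(np.zeros((500, 8)), np.zeros((100, 26)))]  # dummy data
with h5py.File("train.h5", "w") as f:
    for i, (emg, mfccs) in enumerate(utterances):
        grp = f.create_group(f"utterance_{i}")
        grp.create_dataset("emg", data=emg, compression="gzip")
        grp.create_dataset("mfccs", data=mfccs, compression="gzip")

# Read at train time: slices load lazily, with no per-file parsing overhead.
with h5py.File("train.h5", "r") as f:
    emg = f["utterance_0/emg"][:]
print(emg.shape)  # (500, 8)
```

This one-time conversion is what makes training initialization fast on HPC filesystems, where opening thousands of small raw files is expensive.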

EMG to Audio (Transduction)

This model synthesizes audio features (MFCCs) from EMG signals, which are then converted to a waveform by a vocoder.
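At inference time this is a two-stage composition: EMG frames go through the transduction model to produce acoustic feature frames, which the vocoder expands into audio samples. The sketch below uses stub functions with made-up shapes, not the repo's actual API:

```python
import numpy as np

def transduction_model(emg):
    """Stub: map EMG frames to MFCC frames (the real model is a neural network)."""
    n_frames = emg.shape[0] // 8     # EMG is sampled faster than acoustic features
    return np.zeros((n_frames, 26))  # 26 acoustic features per frame (assumed)

def vocoder(mfccs, hop_length=160):
    """Stub: expand feature frames to audio samples (the real model is HiFi-GAN)."""
    return np.zeros(mfccs.shape[0] * hop_length)

emg = np.random.randn(800, 8)        # 800 EMG frames, 8 electrode channels
wav = vocoder(transduction_model(emg))
print(wav.shape)                      # (16000,)
```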

Training

python transduction_model.py

Configuration (hyperparameters, paths, WandB) is managed via config/transduction_model.json.

Note: If the HiFi-GAN vocoder checkpoint is not found locally, the script will automatically download and extract it from Zenodo to ensure a seamless setup.
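The auto-download behaviour amounts to a check-then-fetch guard around the checkpoint path; the sketch below illustrates the pattern with a stubbed fetch function rather than the actual Zenodo URL:

```python
import tempfile
from pathlib import Path

def ensure_checkpoint(path, fetch):
    """Download a checkpoint only if it is not already present locally."""
    ckpt = Path(path)
    if not ckpt.exists():
        ckpt.parent.mkdir(parents=True, exist_ok=True)
        fetch(ckpt)  # in the real script: download and extract from Zenodo
    return ckpt

# Stubbed fetch for illustration: just create an empty file.
with tempfile.TemporaryDirectory() as d:
    ckpt = ensure_checkpoint(Path(d) / "hifigan" / "generator.pt",
                             lambda p: p.touch())
    assert ckpt.exists()
```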

Evaluation

To evaluate a saved model checkpoint and generate audio samples:

python transduction_model.py --evaluate_saved "./output/model_best.pt" --output_dir "./eval_results"

EMG to Text (Recognition)

Directly convert silent speech to text using a CTC decoder.

For recognition, we use SpeechBrain's built-in CTC decoder with a KenLM n-gram language model for improved performance. The get_lexicon.py script generates the necessary lexicon from the training data.
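SpeechBrain provides the full beam search with KenLM scoring; the core CTC decoding rule it builds on (collapse consecutive repeats, then drop blanks) fits in a few lines:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeated tokens, then remove blank tokens."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Frame-level argmax ids: 0 = blank, 1 = 'c', 2 = 'a', 3 = 't'
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0, 3, 0]))  # [1, 2, 3]
```

Beam search generalizes this by keeping multiple collapse hypotheses alive and rescoring them with the n-gram language model, which is why the KenLM model and lexicon below improve word error rate over greedy decoding.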

Setup Decoder

Download the pre-trained KenLM language model and generate the custom lexicon from your dataset in one step:

python get_lexicon.py

This script downloads the Librispeech 4-gram model into the KenLM/ directory and creates gaddy_lexicon.txt containing all unique words found in the local HDF5 datasets.
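The details of get_lexicon.py live in the repo, but the essence of lexicon generation is collecting the unique words from the transcripts and spelling each one out as the token sequence the decoder emits. A minimal sketch, assuming character-level lexicon entries:

```python
def build_lexicon(transcripts):
    """Collect unique lowercase words, each spelled as space-separated characters."""
    words = sorted({w for line in transcripts for w in line.lower().split()})
    return {w: " ".join(w) for w in words}

lex = build_lexicon(["hello world", "hello there"])
print(lex["hello"])  # h e l l o
print(sorted(lex))   # ['hello', 'there', 'world']
```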

Run

python recognition_model.py                                      # Train
python recognition_model.py --evaluate_saved "path/to/model.pt"  # Evaluate

Configuration (hyperparameters, paths, WandB) is managed via config/recognition_model.json.

Documentation

This project uses MkDocs with the Material theme and mkdocstrings for API documentation.

To build and view the documentation locally:

# Serve the documentation
mkdocs serve

The documentation includes a Quick Start guide, detailed project sections, and automatically generated API references for all core modules.
