A GUI application for preprocessing audio files for speech-to-text (STT) workflows. It optimizes audio before it is sent to ASR models or to multimodal APIs such as Google Gemini.
- PyQt6 GUI with drag-and-drop support
- Batch processing - process single files, multiple files, or entire folders
- Persistent settings - remembers your output folder preference
- Safe processing - outputs as <filename>_processed.mp3 so originals are never overwritten
- Convert to Mono - STT models don't need stereo
- Downsample to 16kHz - matches what most APIs use internally
- Speech EQ - 80Hz-8kHz bandpass filter for voice clarity
- Gentle Compression - evens out speech dynamics
- Truncate Silences - removes long pauses
- Normalize Audio - consistent levels without clipping
- Export as MP3 - compressed format suitable for API upload
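The steps listed above map naturally onto a single ffmpeg invocation. The following is a minimal sketch of how such a command could be assembled in Python. It is not taken from the project's scripts: the function name `build_ffmpeg_command` is hypothetical, and every filter parameter value (compressor threshold and ratio, silence threshold and duration, loudness target) is an assumption chosen for illustration.

```python
from pathlib import Path


def build_ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Assemble an ffmpeg command approximating the steps listed above.

    All filter values are illustrative assumptions, not the project's settings.
    """
    filters = [
        "highpass=f=80",    # speech EQ: attenuate rumble below 80 Hz
        "lowpass=f=8000",   # speech EQ: attenuate content above 8 kHz
        # gentle compression to even out speech dynamics (assumed values)
        "acompressor=threshold=-20dB:ratio=3:attack=5:release=100",
        # remove silent stretches of roughly 1 s or longer (assumed values)
        "silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-40dB",
        # loudness normalization with a true-peak ceiling to avoid clipping
        "loudnorm=I=-16:TP=-1.5:LRA=11",
    ]
    return [
        "ffmpeg",
        "-n",                     # refuse to overwrite an existing output file
        "-i", str(src),
        "-vn",                    # ignore any video stream (MP4, MKV, ...)
        "-af", ",".join(filters),
        "-ac", "1",               # convert to mono
        "-ar", "16000",           # downsample to 16 kHz
        "-codec:a", "libmp3lame",
        "-b:a", "64k",            # 64 kbps MP3
        str(dst),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.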
# Clone the repository
git clone https://github.com/danielrosehill/Voice-Prompt-Cleanup-Script.git
cd Voice-Prompt-Cleanup-Script
# Build the package
./build-deb.sh
# Install
sudo apt install ./build/voice-prompt-cleanup_1.0.0-1.deb
If installing manually, you need:
- Python 3
- PyQt6 (pip install PyQt6 or sudo apt install python3-pyqt6)
- ffmpeg (sudo apt install ffmpeg)
# Install dependencies
sudo apt install python3-pyqt6 ffmpeg
# Run directly
./voice_prompt_cleanup_gui.py
Launch from your application menu as "Voice Prompt Cleanup" or run:
voice-prompt-cleanup
- Add files by dragging them onto the window, or use "Add Files..." / "Add Folder..."
- Set output folder (optional) - enable custom output folder to save all processed files to one location
- Click Process Files
./process_audio.sh input.mp3 [output.mp3]
To update to the latest version:
cd Voice-Prompt-Cleanup-Script
./update-package.sh
This will pull the latest changes, rebuild, and reinstall the package.
Input: MP3, WAV, FLAC, OGG, M4A, AAC, WMA, OPUS, WEBM, MP4, MKV, AVI, MOV
Output: MP3 (64kbps, 16kHz mono)
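As an illustration of how the supported input formats and the `_processed.mp3` naming rule might be applied, here is a small sketch. The names `SUPPORTED_EXTENSIONS`, `is_supported`, and `output_path_for` are hypothetical, and writing next to the source file when no output folder is given is an assumption, not confirmed behavior of the application.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the supported input list above.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".flac", ".ogg", ".m4a", ".aac", ".wma",
    ".opus", ".webm", ".mp4", ".mkv", ".avi", ".mov",
}


def is_supported(path: Path) -> bool:
    """Return True if the file extension is a supported input format."""
    return path.suffix.lower() in SUPPORTED_EXTENSIONS


def output_path_for(src: Path, output_dir: Optional[Path] = None) -> Path:
    """Derive <filename>_processed.mp3 so the original is never overwritten.

    Assumption: without a custom output folder, the result is written
    next to the source file.
    """
    target_dir = output_dir if output_dir is not None else src.parent
    return target_dir / f"{src.stem}_processed.mp3"
```

For example, `output_path_for(Path("/tmp/talk.wav"))` yields `/tmp/talk_processed.mp3`, which cannot collide with the source file because the `_processed` suffix is always appended to the stem.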
Settings are stored in ~/.config/voice-prompt-cleanup/settings.json:
- Output folder path
- Whether to use custom output folder
- Last used input folder
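A settings file like this could be read and written along the following lines. This is a sketch only: the key names (`output_folder`, `use_custom_output`, `last_input_folder`) and the helper functions are assumptions for illustration, since the actual schema of settings.json is not documented here.

```python
import json
from pathlib import Path

SETTINGS_PATH = Path.home() / ".config" / "voice-prompt-cleanup" / "settings.json"

# Hypothetical key names; the real file may use different ones.
DEFAULTS = {
    "output_folder": "",
    "use_custom_output": False,
    "last_input_folder": "",
}


def load_settings(path: Path = SETTINGS_PATH) -> dict:
    """Load settings, falling back to defaults if the file is missing or invalid."""
    try:
        stored = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing file, unreadable file, or bad JSON
        stored = {}
    if not isinstance(stored, dict):
        stored = {}
    return {**DEFAULTS, **stored}


def save_settings(settings: dict, path: Path = SETTINGS_PATH) -> None:
    """Write settings as JSON, creating the config directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
```

Merging stored values over the defaults means a settings file written by an older version, with fewer keys, still loads cleanly.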
Primary target: Google Gemini Audio Understanding and similar multimodal APIs
- Accepts: MP3, WAV, FLAC, OGG, etc.
- Typically downsamples to 16kHz internally
- Often has file size limits (e.g., 20MB for Gemini)
The preprocessing optimizes for these constraints while maintaining speech quality.
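To make the file size constraint concrete, the arithmetic below estimates how much audio fits under a 20MB limit at the 64kbps output bitrate, compared with uncompressed 16kHz 16-bit mono WAV (256kbps). It is a rough sketch: it assumes 1MB = 1,000,000 bytes and a constant bitrate, ignores file headers and metadata, and the function name `max_duration_seconds` is hypothetical.

```python
def max_duration_seconds(size_limit_bytes: int, bitrate_bps: int) -> float:
    """Approximate audio duration that fits in a size limit at a constant bitrate."""
    return size_limit_bytes * 8 / bitrate_bps


limit = 20 * 1000 * 1000  # 20MB, assuming decimal megabytes

mp3_seconds = max_duration_seconds(limit, 64_000)    # 64kbps MP3 output
wav_seconds = max_duration_seconds(limit, 256_000)   # 16kHz * 16 bit * 1 channel

print(f"MP3 at 64kbps: {mp3_seconds / 60:.1f} minutes")   # 41.7 minutes
print(f"WAV at 256kbps: {wav_seconds / 60:.1f} minutes")  # 10.4 minutes
```

Truncating silences shortens the recording itself, so the amount of original speech that fits under the limit can be somewhat higher than this estimate.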
MIT License - see LICENSE for details.