Automated agent workflow to convert voice recordings into polished blog posts using Gemini AI.
This project provides a complete pipeline to:
- Preprocess raw audio files for optimal speech-to-text performance
- Transcribe audio using Gemini 2.5 Flash with light redaction (removes filler words, adds paragraphs)
- Generate formatted blog posts optimized for web presentation
input/audio-file/{folder}/raw.mp3
↓ [Step 1: Preprocess]
output/{folder}/processed.mp3
↓ [Step 2: Transcribe]
output/{folder}/transcript.txt
↓ [Step 3: Generate Blog]
output/{folder}/blog_post.md
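The path layout above can be derived from the folder name alone. The following is a minimal sketch of that mapping; `pipeline_paths` is a hypothetical helper used for illustration and is not necessarily how `workflow.py` resolves its paths.

```python
from pathlib import Path


def pipeline_paths(folder: str, root: Path = Path(".")) -> dict[str, Path]:
    """Map a folder name to the four files the pipeline reads and writes."""
    out_dir = root / "output" / folder
    return {
        "raw": root / "input" / "audio-file" / folder / "raw.mp3",
        "processed": out_dir / "processed.mp3",    # written by step 1
        "transcript": out_dir / "transcript.txt",  # written by step 2
        "blog_post": out_dir / "blog_post.md",     # written by step 3
    }


if __name__ == "__main__":
    for name, path in pipeline_paths("1").items():
        print(f"{name:>10}: {path}")
```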
- Converts stereo to mono
- Removes silence while keeping natural pauses
- Reduces background noise
- Normalizes audio levels
- Applies dynamic range compression
- Optimizes for STT (16kHz sample rate)
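The preprocessing steps above could be expressed as a single ffmpeg invocation. This is a sketch under assumptions: the filter chain (`afftdn`, `silenceremove`, `acompressor`, `loudnorm`) and its thresholds are illustrative choices, and `build_ffmpeg_command` and `preprocess` are hypothetical names. The filters and settings actually used by `preprocess_audio.py` may differ.

```python
import subprocess
from pathlib import Path

# Assumed filter chain; the project's real filters and thresholds may differ.
AUDIO_FILTERS = [
    "afftdn",  # FFT-based background noise reduction
    # Trim silences lasting 1 s or more anywhere in the file; shorter pauses are kept.
    "silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-50dB",
    "acompressor",  # dynamic range compression with default settings
    "loudnorm",     # loudness normalization with default targets
]


def build_ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Build the ffmpeg argument list without running anything."""
    return [
        "ffmpeg", "-y",                   # overwrite the output if it exists
        "-i", str(src),
        "-af", ",".join(AUDIO_FILTERS),   # apply the filters in order
        "-ac", "1",                       # stereo to mono
        "-ar", "16000",                   # 16 kHz sample rate for STT
        str(dst),
    ]


def preprocess(src: Path, dst: Path) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg fails."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Building the command separately from running it makes the chain easy to inspect or log before any audio is touched.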
- Sends processed audio to Gemini API
- Light redaction:
  - Removes filler words (um, uh, like, you know, etc.)
  - Organizes into paragraphs based on topic changes
  - Adds proper spacing
  - Maintains original meaning and speaker's voice
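The transcription step above might look roughly like the sketch below. It assumes the `google-genai` Python SDK and a `GEMINI_API_KEY` environment variable. The prompt wording and the names `build_transcription_prompt` and `transcribe` are illustrative assumptions, not the contents of `gemini_transcribe.py`.

```python
import os
from pathlib import Path

FILLER_WORDS = ["um", "uh", "like", "you know"]


def build_transcription_prompt(fillers: list[str] = FILLER_WORDS) -> str:
    """Assemble the light-redaction instructions as one prompt string."""
    rules = [
        "Remove filler words such as: " + ", ".join(fillers) + ".",
        "Start a new paragraph whenever the topic changes.",
        "Separate paragraphs with a blank line.",
        "Keep the original meaning and the speaker's voice; do not summarize.",
    ]
    return "Transcribe this audio recording.\n" + "\n".join(f"- {r}" for r in rules)


def transcribe(audio_path: Path, model: str = "gemini-2.5-flash") -> str:
    """Upload the audio and ask Gemini for a lightly redacted transcript."""
    from google import genai  # lazy import; requires the google-genai package

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    uploaded = client.files.upload(file=str(audio_path))
    response = client.models.generate_content(
        model=model,
        contents=[build_transcription_prompt(), uploaded],
    )
    return response.text
```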
- Converts transcript to formatted blog post
- Creates compelling title
- Adds introduction and conclusion
- Organizes with subheadings (## and ###)
- Optimizes for web readability
- Uses proper markdown formatting
./setup.sh
This will:
- Create virtual environment
- Install dependencies
- Create .env file template
- Set up directory structure
Edit .env and add your Gemini API key:
GEMINI_API_KEY=your_api_key_here
Place your raw audio file:
input/audio-file/1/raw.mp3
Run the workflow:
./run.sh 1
This will generate:
- output/1/processed.mp3 - Preprocessed audio
- output/1/transcript.txt - Lightly redacted transcript
- output/1/blog_post.md - Formatted blog post
Voice-Blog-Creator/
├── input/ # Input files
│ └── audio-file/
│ └── {folder}/
│ └── raw.mp3 # Your original audio
│
├── output/ # Output files
│ └── {folder}/
│ ├── processed.mp3 # Preprocessed audio
│ ├── transcript.txt # Lightly redacted transcript
│ └── blog_post.md # Formatted blog post
│
├── scripts/ # Processing scripts
│ ├── preprocess_audio.py
│ ├── gemini_transcribe.py
│ ├── gemini_blog_post.py
│ └── workflow.py # Orchestration
│
├── setup.sh # Setup script
├── run.sh # Run workflow
├── requirements.txt # Python dependencies
└── .env # API keys (not in git)
./run.sh 1 # Process folder 1
./run.sh 1 --force # Force regenerate all files
./run.sh 1 --verbose # Show detailed processing
./run.sh 1 --steps 2 3 # Run only steps 2 and 3
# Step 1: Preprocess audio
.venv/bin/python scripts/preprocess_audio.py \
--input input/audio-file/1/raw.mp3 \
--output output/1/processed.mp3
# Step 2: Transcribe
.venv/bin/python scripts/gemini_transcribe.py \
--input output/1/processed.mp3 \
--output output/1/transcript.txt
# Step 3: Generate blog
.venv/bin/python scripts/gemini_blog_post.py \
--input output/1/transcript.txt \
--output output/1/blog_post.md
- Python 3.12+
- UV package manager
- Gemini API key
- ffmpeg (for audio processing)
- Clear Separation: Input and output files are clearly separated
- Smart Caching: Skips steps if output already exists (use --force to override)
- Flexible Pipeline: Run individual steps or full workflow
- Verbose Logging: Debug with -v flag
- Quality Optimization: Each step optimized for its specific purpose
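The caching and step-selection behavior described above can be sketched as follows. The flag names match the usage examples in this README, but the parsing and the `should_run` helper are hypothetical illustrations of the idea, not the code in `workflow.py`.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Voice-to-blog workflow (sketch)")
    parser.add_argument("folder", help="folder name under input/audio-file/")
    parser.add_argument("--force", action="store_true",
                        help="regenerate outputs even if they already exist")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="show detailed processing")
    parser.add_argument("--steps", type=int, nargs="+", choices=[1, 2, 3],
                        default=[1, 2, 3], help="which steps to run")
    return parser.parse_args(argv)


def should_run(step: int, output: Path, args: argparse.Namespace) -> bool:
    """Run a step only if it was selected and its output is missing or forced."""
    if step not in args.steps:
        return False
    return args.force or not output.exists()


if __name__ == "__main__":
    args = parse_args(["1", "--steps", "2", "3"])
    print(should_run(1, Path("output/1/processed.mp3"), args))  # step 1 not selected
```

Checking only for the existence of the output file keeps the cache simple, at the cost of not noticing when an input has changed. That is the case `--force` covers.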
- Gemini 2.5 Flash is used for both transcription and blog generation
- Estimated cost: ~$0.01-0.05 per hour of audio (varies by length and complexity)
Run ./setup.sh to create the environment and install dependencies.
Make sure .env contains your valid Gemini API key:
GEMINI_API_KEY=your_actual_api_key_here
Ensure your audio file is placed at:
input/audio-file/{folder_number}/raw.mp3
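The checks above can be automated before running the workflow. This is a minimal sketch: `read_env_file` and `preflight` are hypothetical helpers, and the `.env` parsing handles only simple `KEY=VALUE` lines.

```python
from pathlib import Path

# Values that indicate the template was never filled in.
PLACEHOLDERS = {"", "your_api_key_here", "your_actual_api_key_here"}


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; comments and blank lines are skipped."""
    values: dict[str, str] = {}
    if not path.is_file():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def preflight(folder: str, root: Path = Path(".")) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    key = read_env_file(root / ".env").get("GEMINI_API_KEY", "")
    if key in PLACEHOLDERS:
        problems.append("GEMINI_API_KEY is missing or still a placeholder in .env")
    audio = root / "input" / "audio-file" / folder / "raw.mp3"
    if not audio.is_file():
        problems.append(f"Audio file not found: {audio}")
    return problems


if __name__ == "__main__":
    for problem in preflight("1"):
        print("Problem:", problem)
```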