Automatic Speech Recognition with Gemma
I've created a complete ASR (Automatic Speech Recognition) demo using Docker Compose with the following architecture:
🏗️ Architecture Overview
3 Microservices:
- Ollama Service - Runs Gemma 2:2B model for text enhancement
- ASR Service - FastAPI backend with Whisper for transcription
- Web UI - Nginx-served interactive frontend
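A minimal `docker-compose.yml` wiring the three services might look like the sketch below. Image names, build contexts, and port mappings are assumptions based on the defaults described in this post (the web UI on port 3000, Ollama on its standard port 11434), not the exact file from the repo:

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama   # persist pulled models across restarts

  asr-service:
    build: ./asr-service            # FastAPI + Whisper backend
    ports:
      - "8000:8000"
    depends_on:
      - ollama

  web-ui:
    build: ./web-ui                 # Nginx-served frontend
    ports:
      - "3000:80"
    depends_on:
      - asr-service

volumes:
  ollama-data:
```

Keeping Ollama's model directory in a named volume avoids re-downloading Gemma on every `docker-compose up`.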
🚀 Key Features
Audio Input:
- ✅ Browser-based recording with microphone
- ✅ File upload with drag & drop (MP3, WAV, M4A, OGG)
Processing Pipeline:
- ✅ Whisper (tiny model) for fast speech-to-text
- ✅ Ollama Gemma 2:2B for text enhancement and correction
- ✅ Processing time tracking
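The two-stage pipeline can be sketched as follows. The Ollama hostname assumes the Compose service name above; `gemma2:2b` is the standard Ollama tag for Gemma 2 2B, and the exact enhancement prompt is my own illustration, not the repo's:

```python
# Sketch of the pipeline: Whisper speech-to-text, then Gemma 2:2B
# cleanup via Ollama's /api/generate endpoint.
import json
import urllib.request

OLLAMA_URL = "http://ollama:11434/api/generate"  # assumed Compose hostname

def build_enhance_prompt(raw_text: str) -> str:
    """Ask the model to fix punctuation/casing without changing meaning."""
    return (
        "Correct the punctuation, casing, and obvious transcription errors "
        "in the following text. Return only the corrected text.\n\n"
        f"{raw_text}"
    )

def enhance(raw_text: str) -> str:
    """Send the raw transcript to Gemma 2:2B through Ollama."""
    payload = json.dumps({
        "model": "gemma2:2b",
        "prompt": build_enhance_prompt(raw_text),
        "stream": False,  # return one JSON object instead of a stream
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def transcribe_and_enhance(audio_path: str) -> dict:
    import whisper  # requires the openai-whisper package
    model = whisper.load_model("tiny")  # smallest/fastest Whisper model
    raw = model.transcribe(audio_path)["text"].strip()
    return {"raw": raw, "enhanced": enhance(raw)}
```

Setting `"stream": False` keeps the backend simple: Ollama returns a single JSON body rather than newline-delimited chunks.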
User Experience:
- ✅ Real-time recording with timer
- ✅ Health status monitoring
- ✅ Side-by-side comparison of raw vs enhanced text
- ✅ Responsive modern UI
📁 Quick Setup
1. Create the project structure:

   ```bash
   mkdir asr-demo && cd asr-demo
   ```

2. Save all files to their respective directories: `docker-compose.yml` in the root, ASR service files in `asr-service/`, web UI files in `web-ui/`.

3. Start the services:

   ```bash
   chmod +x startup.sh
   ./startup.sh start
   # OR
   docker-compose up --build -d
   ```

4. Access the demo at http://localhost:3000
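Besides the web UI, the backend can be exercised directly. The client below is a hypothetical sketch: the `/transcribe` route, port 8000, and the `file` field name are assumptions for illustration; the FastAPI service's auto-generated docs (`/docs`) show the actual contract:

```python
# Minimal stdlib-only client for the ASR backend (endpoint details assumed).
import mimetypes
import urllib.request
import uuid

def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def transcribe(path: str, url: str = "http://localhost:8000/transcribe") -> str:
    """Upload an audio file and return the service's JSON response as text."""
    with open(path, "rb") as f:
        body, ctype = build_multipart("file", path, f.read())
    req = urllib.request.Request(url, data=body, headers={"Content-Type": ctype})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```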
🎯 Demo Optimizations
- Small footprint - Uses the Whisper `tiny` model and Gemma 2:2B
- Fast startup - Optimized Docker layers
- Resource efficient - ~4 GB RAM requirement
- Development friendly - Hot-reload support
The demo showcases a complete speech-to-text pipeline with AI enhancement, perfect for understanding how modern ASR systems work with LLMs for text improvement!
You can find the code here.