Think Different: Automatic Speech Recognition with Gemma

Monday


I've created a complete ASR (Automatic Speech Recognition) demo using Docker Compose with the following architecture:

🏗️ Architecture Overview

3 Microservices:

Ollama Service - Runs the Gemma 2 (2B) model for text enhancement
ASR Service - FastAPI backend with Whisper for transcription
Web UI - Nginx-served interactive frontend
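
To make the split between services concrete, here is a minimal sketch of how the ASR service might ask the Ollama service to clean up a transcript. It is an illustration, not the demo's actual code. The host name "ollama" (a Compose service name), the model tag "gemma2:2b", the prompt wording, and the function names are all assumptions. The /api/generate endpoint and port 11434 are Ollama defaults.

```python
import json
import urllib.request

# Assumptions: "ollama" is the Compose service name and "gemma2:2b" the pulled tag.
OLLAMA_URL = "http://ollama:11434"  # 11434 is Ollama's default port
MODEL = "gemma2:2b"


def build_enhance_request(
    raw_text: str, base_url: str = OLLAMA_URL, model: str = MODEL
) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    # Hypothetical prompt; the real demo may word this differently.
    prompt = (
        "Fix punctuation, casing, and obvious transcription errors in the "
        "text below. Return only the corrected text.\n\n" + raw_text
    )
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def enhance(raw_text: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text."""
    request = build_enhance_request(raw_text)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        # With stream=False, Ollama replies with one JSON object whose
        # "response" field holds the generated text.
        return json.loads(resp.read())["response"].strip()
```

Keeping the request builder separate from the network call makes the payload easy to inspect and test without a running Ollama container.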

🚀 Key Features

Audio Input:

✅ Browser-based recording with microphone
✅ File upload with drag & drop (MP3, WAV, M4A, OGG)
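
As a rough illustration of the upload side, the backend could reject unsupported files before they reach Whisper. This is a minimal sketch that assumes a simple extension check. The helper name is hypothetical, and the real service may inspect content types or file headers instead.

```python
from pathlib import Path

# The four formats listed above; matching is case-insensitive.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file name ends in one of the supported extensions."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

An extension check is cheap but easy to fool, so a production service would likely also validate the decoded audio.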

Processing Pipeline:

✅ Whisper (tiny model) for fast speech-to-text
✅ Gemma 2 (2B) via Ollama for text enhancement and correction
✅ Processing time tracking
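
The three steps above can be sketched as a single function: transcribe, enhance, and time each stage. This is a minimal sketch under stated assumptions, not the demo's implementation. The names run_pipeline and PipelineResult are hypothetical, and the two callables stand in for the real Whisper and Ollama calls.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    raw_text: str
    enhanced_text: str
    transcribe_seconds: float
    enhance_seconds: float


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],
    enhance: Callable[[str], str],
) -> PipelineResult:
    """Transcribe audio, then enhance the transcript, timing each stage."""
    t0 = time.perf_counter()
    raw = transcribe(audio_path).strip()
    t1 = time.perf_counter()
    # Skip the LLM call entirely when the transcript is empty.
    enhanced = enhance(raw).strip() if raw else ""
    t2 = time.perf_counter()
    return PipelineResult(raw, enhanced, t1 - t0, t2 - t1)
```

If the service uses the openai-whisper package, transcribe could wrap whisper.load_model("tiny").transcribe(path)["text"]. Passing both stages in as callables keeps the timing logic testable without loading either model.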

User Experience:

✅ Real-time recording with timer
✅ Health status monitoring
✅ Side-by-side comparison of raw vs enhanced text
✅ Responsive modern UI
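
Health status monitoring could be backed by a small aggregator like the one below. This is a sketch with assumed names and an assumed response shape, not the demo's actual endpoint. Each probe is a hypothetical zero-argument callable that returns True when its dependency is reachable.

```python
from typing import Callable, Dict


def check_health(probes: Dict[str, Callable[[], bool]]) -> Dict[str, object]:
    """Run each probe and summarize the results for the UI."""
    services: Dict[str, str] = {}
    for name, probe in probes.items():
        try:
            services[name] = "up" if probe() else "down"
        except Exception:
            # A probe that raises (timeout, refused connection) counts as down.
            services[name] = "down"
    all_up = all(state == "up" for state in services.values())
    return {"status": "healthy" if all_up else "degraded", "services": services}
```

A probe for Ollama might, for example, issue a GET to its /api/tags endpoint and return True on a 200 response. The UI can then poll the aggregated result and show one indicator per service.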

📁 Quick Setup

Create project structure:

mkdir asr-demo && cd asr-demo

Save all files to their respective directories:

- docker-compose.yml in root
- ASR service files in asr-service/
- Web UI files in web-ui/

Start services:

chmod +x startup.sh
./startup.sh start
# OR
docker-compose up --build -d

Access demo: http://localhost:3000

🎯 Demo Optimizations

Small footprint - Uses the Whisper tiny model and Gemma 2 (2B)
Fast startup - Optimized Docker layers
Resource efficient - ~4GB RAM requirement
Development friendly - Hot reload support

The demo showcases a complete speech-to-text pipeline in which an LLM cleans up the raw transcript, which makes it a handy way to see how modern ASR systems can be paired with LLMs for text improvement!

You can find the code here.