Automatic Speech Recognition with Gemma



I've created a complete ASR (Automatic Speech Recognition) demo using Docker Compose with the following architecture:

🏗️ Architecture Overview

3 Microservices:

  1. Ollama Service - Runs the Gemma 2 2B model (Ollama tag gemma2:2b) for text enhancement
  2. ASR Service - FastAPI backend with Whisper for transcription
  3. Web UI - Nginx-served interactive frontend

🚀 Key Features

Audio Input:

  • ✅ Browser-based recording with microphone
  • ✅ File upload with drag & drop (MP3, WAV, M4A, OGG)
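
The upload feature implies some format check on the server side. Here is a minimal sketch, assuming the ASR service filters uploads by file extension. The names `SUPPORTED_EXTENSIONS` and `is_supported_audio` are hypothetical and not taken from the demo code:

```python
from pathlib import Path

# Formats listed for the upload feature; the constant name is an assumption.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the filename ends in one of the listed audio extensions."""
    # Lower-case the suffix so "talk.MP3" and "talk.mp3" are treated alike.
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

An extension check like this is only a first filter: it says nothing about whether the file contents can actually be decoded.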

Processing Pipeline:

  • ✅ Whisper (tiny model) for fast speech-to-text
  • ✅ Gemma 2 2B, served by Ollama, for text enhancement and correction
  • ✅ Processing time tracking
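
The three steps above (transcribe, enhance, time the run) can be sketched as follows. This is an illustration under assumptions, not the demo's actual backend: it assumes the openai-whisper package, Ollama's `/api/generate` endpoint on its default port 11434, and a prompt wording made up for this example. The function names and the shape of the result dictionary are hypothetical.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed host; 11434 is Ollama's default port
MODEL_TAG = "gemma2:2b"  # Ollama tag for Gemma 2 2B


def build_enhance_payload(raw_text: str, model: str = MODEL_TAG) -> dict:
    """Build a non-streaming /api/generate request body (prompt wording is an assumption)."""
    prompt = (
        "Fix punctuation, casing, and obvious transcription errors in the "
        "following text. Return only the corrected text.\n\n" + raw_text
    )
    return {"model": model, "prompt": prompt, "stream": False}


def enhance_with_ollama(raw_text: str, url: str = OLLAMA_URL) -> str:
    """Send the transcript to Ollama and return the generated text."""
    body = json.dumps(build_enhance_payload(raw_text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        # With "stream": False, Ollama returns one JSON object with a "response" field.
        return json.loads(response.read())["response"].strip()


def transcribe_with_whisper(audio_path: str) -> str:
    """Transcribe an audio file with the Whisper tiny model."""
    import whisper  # openai-whisper; imported lazily so the rest runs without it

    model = whisper.load_model("tiny")
    return model.transcribe(audio_path)["text"].strip()


def run_pipeline(audio_path, transcribe=transcribe_with_whisper,
                 enhance=enhance_with_ollama) -> dict:
    """Run transcription then enhancement, and record the elapsed time in seconds."""
    started = time.perf_counter()
    raw_text = transcribe(audio_path)
    enhanced_text = enhance(raw_text)
    return {
        "raw_text": raw_text,
        "enhanced_text": enhanced_text,
        "processing_time": round(time.perf_counter() - started, 3),
    }
```

Passing the two stages in as arguments keeps the timing logic testable without a model or a running Ollama instance. Inside a Docker Compose network the URL would normally use the Ollama service's name instead of `localhost`; that name depends on the compose file.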

User Experience:

  • ✅ Real-time recording with timer
  • ✅ Health status monitoring
  • ✅ Side-by-side comparison of raw vs enhanced text
  • ✅ Responsive modern UI
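
Health status monitoring could be backed by a small aggregator like the one below. It is a sketch under assumptions: the ASR service URL and its `/health` route are hypothetical, the three status labels are invented for this example, and only the Ollama root path (which answers with a plain "Ollama is running" message) reflects known behavior.

```python
import urllib.error
import urllib.request

# Hypothetical endpoints; real hostnames, ports, and paths depend on docker-compose.yml.
SERVICE_URLS = {
    "ollama": "http://localhost:11434/",    # Ollama's root path on its default port
    "asr": "http://localhost:8000/health",  # assumed FastAPI health route
}


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


def summarize_health(results: dict) -> str:
    """Collapse per-service booleans into one label (labels are an assumption)."""
    if results and all(results.values()):
        return "healthy"
    if any(results.values()):
        return "degraded"
    return "unhealthy"


def check_services(urls=SERVICE_URLS, probe_fn=probe) -> dict:
    """Probe every service and return the overall status plus per-service results."""
    results = {name: probe_fn(url) for name, url in urls.items()}
    return {"status": summarize_health(results), "services": results}
```

A frontend could poll an endpoint that returns `check_services()` and show the `status` label next to the per-service results.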

📁 Quick Setup

  1. Create project structure:

    mkdir asr-demo && cd asr-demo

  2. Save all files to their respective directories:

    • docker-compose.yml in root
    • ASR service files in asr-service/
    • Web UI files in web-ui/

  3. Start services:

    chmod +x startup.sh
    ./startup.sh start
    # OR
    docker-compose up --build -d

  4. Access demo: http://localhost:3000

🎯 Demo Optimizations

  • Small footprint - Uses the Whisper tiny model and Gemma 2 2B
  • Fast startup - Optimized Docker layers
  • Resource efficient - Needs about 4 GB of RAM
  • Development friendly - Hot reload support

The demo showcases a complete speech-to-text pipeline with LLM-based enhancement, which makes it a practical way to see how an ASR system can be paired with an LLM to clean up transcripts.

You can find the code here.
