Local-Only Deployment
Run VoiceGateway entirely on local hardware with zero cloud dependencies. Uses Ollama for LLM, Whisper for STT, and Kokoro for TTS. Ideal for air-gapped environments, development without API keys, or privacy-sensitive deployments.
Prerequisites
Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull qwen2.5:3b

Install VoiceGateway with Local Providers
pip install "voicegateway[whisper,kokoro]"
Whisper requires torch and will download model weights on first use. Kokoro requires the kokoro package.
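Before moving on to configuration, it can help to confirm that the Ollama daemon is reachable and that the models you plan to reference have been pulled. The sketch below is not part of VoiceGateway. It assumes Ollama is listening on its default address and queries the /api/tags endpoint, which lists locally available models. The helper names (missing_models, fetch_tags) are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama address


def missing_models(tags_response: dict, required: list[str]) -> list[str]:
    """Return the required model names absent from an /api/tags response."""
    available = {m.get("name", "") for m in tags_response.get("models", [])}
    return [name for name in required if name not in available]


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> dict:
    """Fetch the list of locally pulled models from a running Ollama daemon."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        tags = fetch_tags()
    except (OSError, ValueError) as exc:  # connection errors or a non-JSON reply
        print(f"Ollama is not reachable at {OLLAMA_URL}: {exc}")
    else:
        missing = missing_models(tags, ["qwen2.5:3b", "llama3.2:1b"])
        if missing:
            print(f"Run `ollama pull` for: {missing}")
        else:
            print("All required models are present")
```

The second model name, llama3.2:1b, is included because the configuration on this page references it in the fast stack and in the LLM fallback chain.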
Configuration
Create voicegw.yaml:
providers:
  ollama:
    base_url: http://localhost:11434
  whisper: {}
  kokoro: {}
models:
  stt:
    local/whisper-large-v3:
      provider: whisper
      model: large-v3
    local/whisper-base:
      provider: whisper
      model: base
  llm:
    ollama/qwen2.5:3b:
      provider: ollama
      model: qwen2.5:3b
    ollama/llama3.2:1b:
      provider: ollama
      model: llama3.2:1b
  tts:
    local/kokoro:
      provider: kokoro
stacks:
  local:
    stt: local/whisper-large-v3
    llm: ollama/qwen2.5:3b
    tts: local/kokoro
  fast:
    stt: local/whisper-base
    llm: ollama/llama3.2:1b
    tts: local/kokoro
fallbacks:
  stt:
    - local/whisper-large-v3
    - local/whisper-base
  llm:
    - ollama/qwen2.5:3b
    - ollama/llama3.2:1b
projects:
  local-dev:
    name: Local Development
    daily_budget: 0 # Unlimited (local models are free)
    tags: [development, local]
default_project: local-dev
cost_tracking:
  enabled: true # Still tracks requests, costs will be $0.00
observability:
  latency_tracking: true

Basic Usage
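The model IDs passed to the inference factories have to match keys defined under models in voicegw.yaml, and the same holds for entries in stacks and fallbacks. A minimal sketch of a pre-flight consistency check follows. It is an illustration only: VoiceGateway may perform its own validation, the function name undefined_references is hypothetical, and the dictionary is a hand-written mirror of part of the configuration rather than a parsed file.

```python
# Hand-written mirror of part of voicegw.yaml. In practice you would load the
# file with a YAML parser (for example yaml.safe_load from PyYAML).
config = {
    "models": {
        "stt": {"local/whisper-large-v3": {}, "local/whisper-base": {}},
        "llm": {"ollama/qwen2.5:3b": {}, "ollama/llama3.2:1b": {}},
        "tts": {"local/kokoro": {}},
    },
    "stacks": {
        "local": {
            "stt": "local/whisper-large-v3",
            "llm": "ollama/qwen2.5:3b",
            "tts": "local/kokoro",
        },
        "fast": {
            "stt": "local/whisper-base",
            "llm": "ollama/llama3.2:1b",
            "tts": "local/kokoro",
        },
    },
    "fallbacks": {
        "stt": ["local/whisper-large-v3", "local/whisper-base"],
        "llm": ["ollama/qwen2.5:3b", "ollama/llama3.2:1b"],
    },
}


def undefined_references(cfg: dict) -> list[str]:
    """List stack and fallback entries naming a model not defined under models."""
    models = cfg.get("models", {})
    problems = []
    for stack_name, stack in cfg.get("stacks", {}).items():
        for modality, model_id in stack.items():
            if model_id not in models.get(modality, {}):
                problems.append(f"stacks.{stack_name}.{modality}: {model_id}")
    for modality, chain in cfg.get("fallbacks", {}).items():
        for model_id in chain:
            if model_id not in models.get(modality, {}):
                problems.append(f"fallbacks.{modality}: {model_id}")
    return problems


print(undefined_references(config) or "all references resolve")
```

Once the references resolve, the factories can be constructed directly: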
from voicegateway import inference
# default_project: local-dev in voicegw.yaml means the inference
# factories pick up local-dev automatically. All local, no API keys.
stt = inference.STT("local/whisper-large-v3")
llm = inference.LLM("ollama/qwen2.5:3b")
tts = inference.TTS("local/kokoro")

LiveKit Agent with Local Models
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
from livekit.plugins import silero
from voicegateway import inference


async def entrypoint(ctx: JobContext):
    await ctx.connect()
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=inference.STT("local/whisper-large-v3"),
        llm=inference.LLM("ollama/qwen2.5:3b"),
        tts=inference.TTS("local/kokoro"),
    )
    await session.start(
        agent=Agent(
            instructions=(
                "You are a helpful voice assistant running entirely on local hardware. "
                "Be concise: local models work best with shorter responses."
            ),
        ),
        room=ctx.room,
    )


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

Docker Compose with Ollama
For a containerized local-only setup:
version: "3.8"
services:
  voicegateway:
    build:
      context: .
      dockerfile: src/voicegateway/Dockerfile
    container_name: voicegateway
    ports:
      - "8080:8080"
    volumes:
      - voicegw-data:/data
      - ./voicegw.yaml:/app/voicegw.yaml:ro
    environment:
      - VOICEGW_CONFIG=/app/voicegw.yaml
      - VOICEGW_DB_PATH=/data/voicegw.db
    depends_on:
      - ollama
    networks:
      - voicegw-net
  ollama:
    image: ollama/ollama:latest
    container_name: voicegateway-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama
    networks:
      - voicegw-net
  # The dashboard runs inside the voicegateway service: the daemon
  # mounts the React SPA at / and the dashboard API at /api/* on
  # the same port as the public HTTP API. No second service needed.
volumes:
  voicegw-data:
  ollama-models:
networks:
  voicegw-net:

Update voicegw.yaml to point Ollama at the container:
providers:
  ollama:
    base_url: http://ollama:11434

Then start and pull the model:
docker compose up -d
docker exec voicegateway-ollama ollama pull qwen2.5:3b

Using Piper TTS as an Alternative
If Kokoro is not available, Piper is another local TTS option:
providers:
  piper: {}
models:
  tts:
    local/piper:
      provider: piper
      default_voice: en_US-lessac-medium

Install the Piper extra:
pip install "voicegateway[piper]"

Performance Considerations
Local models have different performance characteristics than cloud APIs:
| Metric | Cloud (Deepgram + GPT-4.1) | Local (Whisper + Qwen2.5) |
|---|---|---|
| STT TTFB | ~100-200ms | ~500-2000ms (depends on GPU) |
| LLM TTFB | ~200-500ms | ~300-3000ms (depends on model size) |
| TTS TTFB | ~100-300ms | ~200-1000ms |
| Cost | ~$0.01-0.05/request | $0.00 |
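To get a rough feel for what these ranges mean for a full voice turn, the per-stage figures can be added up. The sketch below copies the TTFB ranges from the table and sums them, under the simplifying assumption that STT, LLM, and TTS run strictly one after another. Real pipelines stream and overlap stages, so treat the result as a coarse estimate rather than a measurement.

```python
# TTFB ranges in milliseconds, copied from the table above as (low, high).
CLOUD = {"stt": (100, 200), "llm": (200, 500), "tts": (100, 300)}
LOCAL = {"stt": (500, 2000), "llm": (300, 3000), "tts": (200, 1000)}


def pipeline_ttfb(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum per-stage TTFB bounds, assuming stages run strictly in sequence."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


print("cloud:", pipeline_ttfb(CLOUD))  # (400, 1000)
print("local:", pipeline_ttfb(LOCAL))  # (1000, 6000)
```

Under this assumption the local stack lands at roughly 1 to 6 seconds to first audio, against 0.4 to 1 second for the cloud stack, which is why the tuning tips matter for interactive use.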
Tips for optimizing local performance:
- GPU acceleration: ensure CUDA/Metal is available for Whisper and Ollama
- Smaller models: use local/whisper-base instead of local/whisper-large-v3 for faster STT
- Quantized LLMs: Ollama automatically uses quantized models (Q4_0, Q4_K_M)
- Keep models warm: Ollama keeps the most recent model in memory; avoid switching frequently
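For the last tip, one option is to preload the model before the first call arrives. The sketch below is an assumption about how you might do this outside VoiceGateway: it sends a prompt-less request to Ollama's /api/generate endpoint, which loads the model, and sets the keep_alive field to control how long the model stays in memory afterwards. The function name preload_request and the 30m default are illustrative choices.

```python
import json
import urllib.request


def preload_request(model: str, keep_alive: str = "30m",
                    base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a prompt-less /api/generate request that loads `model` into memory."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = preload_request("qwen2.5:3b")
    try:
        # Loading weights can take a while on first use, hence the long timeout.
        with urllib.request.urlopen(req, timeout=60) as resp:
            print("preload status:", resp.status)
    except OSError as exc:
        print(f"could not reach Ollama: {exc}")
```

Running this once at worker startup means the first caller does not pay the model load time. In the Docker Compose setup the base URL would be http://ollama:11434 from inside the network.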
Hybrid: Local Fallback for Cloud
A common pattern is to use cloud providers normally but fall back to local models when the cloud providers are unavailable or the project's budget is exceeded:
fallbacks:
  stt:
    - deepgram/nova-3
    - local/whisper-large-v3
  llm:
    - openai/gpt-4.1-mini
    - ollama/qwen2.5:3b
  tts:
    - cartesia/sonic-3
    - local/kokoro
projects:
  prod:
    daily_budget: 50.00
    budget_action: throttle # Falls back to local on exceed

See Fallback Chains and Budget Enforcement for more details.
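The hybrid pattern can be pictured as walking an ordered chain and taking the first entry that is both available and affordable. The sketch below illustrates that idea only. It is not VoiceGateway's resolver, and the Candidate class, the pick_model function, and the way the budget is checked are all hypothetical. It follows the convention from the local configuration that a daily_budget of 0 means unlimited.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    model_id: str
    is_local: bool
    available: bool = True


def pick_model(chain: list[Candidate], spent_today: float, daily_budget: float) -> str:
    """Return the first usable model in the chain.

    Cloud entries are skipped once the daily budget is used up; local entries
    stay eligible because they cost nothing. A budget of 0 means unlimited.
    """
    over_budget = daily_budget > 0 and spent_today >= daily_budget
    for candidate in chain:
        if not candidate.available:
            continue
        if over_budget and not candidate.is_local:
            continue
        return candidate.model_id
    raise RuntimeError("no usable model in fallback chain")


llm_chain = [
    Candidate("openai/gpt-4.1-mini", is_local=False),
    Candidate("ollama/qwen2.5:3b", is_local=True),
]
print(pick_model(llm_chain, spent_today=12.0, daily_budget=50.0))  # openai/gpt-4.1-mini
print(pick_model(llm_chain, spent_today=50.0, daily_budget=50.0))  # ollama/qwen2.5:3b
```

The same walk covers outages: marking the cloud entry as unavailable makes the function return the local model even while budget remains.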