Voice Platform
Your voice. Your hardware. Nobody else’s business.
halo-ai’s voice platform handles synthesis, recognition, and cloning – all running locally, all private by default. Thirty seconds of reference audio is enough to build a voice model. That model never leaves your machine.
Core Stack
| Component | Engine | Purpose |
|---|---|---|
| Text-to-Speech | Kokoro TTS | High-fidelity voice synthesis |
| Speech-to-Text | Whisper | Transcription and real-time recognition |
| Voice Understanding | Voxtral (Mistral) | Multimodal voice + text reasoning |
| Voice Cloning | 30-second capture | Custom voice model from minimal audio |
Voxtral — Mistral Voice Weights
We run Voxtral, Mistral’s multimodal voice model. It understands audio natively: not just transcription, but comprehension. Feed it a voice recording and it reasons about what was said, the tone, the intent. Combined with Whisper for raw STT, Qwen3-30B for reasoning, and Kokoro for output, the full pipeline is:
Voice in (Whisper) → Understanding (Voxtral) → Reasoning (Qwen3-30B) → Voice out (Kokoro)
All local. All on AMD Strix Halo. No cloud API. The weights run on the same hardware as everything else: with 128 GB of unified memory, the voice model and the LLM coexist without swapping.
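The four-stage flow above can be sketched as a chain of plain callables. This is a minimal illustration of the data flow only, not halo-ai's actual code: `VoicePipeline`, `Understanding`, and the `fake_*` stage functions are hypothetical names, and the stubs stand in for Whisper, Voxtral, Qwen3-30B, and Kokoro without calling any of their real APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Understanding:
    """What the understanding stage hands to the reasoning stage (assumed shape)."""
    transcript: str
    tone: str
    intent: str


@dataclass
class VoicePipeline:
    # Each field is a hypothetical stand-in for one stage of the pipeline.
    transcribe: Callable[[bytes], str]                 # Whisper: audio -> text
    understand: Callable[[bytes, str], Understanding]  # Voxtral: audio + text -> tone, intent
    reason: Callable[[Understanding], str]             # Qwen3-30B: understanding -> reply text
    synthesize: Callable[[str], bytes]                 # Kokoro: reply text -> audio

    def run(self, audio_in: bytes) -> bytes:
        """Pass one utterance through all four stages in order."""
        transcript = self.transcribe(audio_in)
        understanding = self.understand(audio_in, transcript)
        reply = self.reason(understanding)
        return self.synthesize(reply)


# Stub stages so the sketch runs with no model loaded.
# "Audio" here is just UTF-8 bytes, purely for illustration.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_understand(audio: bytes, transcript: str) -> Understanding:
    intent = "question" if transcript.endswith("?") else "statement"
    return Understanding(transcript=transcript, tone="neutral", intent=intent)


def fake_reason(u: Understanding) -> str:
    return f"Heard a {u.intent}: {u.transcript}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_transcribe, fake_understand, fake_reason, fake_synthesize)
audio_out = pipeline.run(b"what time is it?")
print(audio_out)
```

Because every stage is an ordinary function, any one of them could be swapped for a real local model wrapper without touching the others, and nothing in the chain requires a network call.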
Voice-as-a-Service
The voice platform powers multiple services across the halo-ai ecosystem:
Audiobooks
Feed it a manuscript. Pick a voice – yours, a clone, or one of the built-in models. Get a produced audiobook with chapter markers, consistent pacing, and natural inflection. Amp handles the mastering.
Music Production
Voice models integrated into the music pipeline. Sing lead vocals without singing. Layer harmonies from a single voice source. The Downcomers – halo-ai’s resident band – use cloned vocals for every track.
Game Voices
Dynamic character dialog generated in real time. Dealer writes the lines, the voice platform speaks them. Every NPC has a voice. Every run sounds different.
Live Streaming Co-Host
A real-time voice companion for streams. Responds to chat, comments on gameplay, and maintains a consistent persona throughout the broadcast. Low latency. Natural cadence.
The Downcomers
halo-ai’s AI band. Heavy blues, bagpipes meeting electric guitar, AC/DC crossed with Led Zeppelin. All vocals are cloned. All instruments are synthesized or sampled. The music is real. The band is not.
Memorial Voice Cloning
Preserve a voice that matters to you. Thirty seconds of audio from a phone call, a voicemail, a home video. The model captures tone, cadence, and character. It stays on your hardware, encrypted, for as long as you want it there.
Privacy
The voice model never uploads. Never phones home. Never trains on your data for someone else’s benefit. What you record stays on your drives and nowhere else. This is not negotiable.
Designed and built by the architect.