Building Voice AI That Doesn't Suck: Real-Time Conversational Interfaces in Production
Everyone wants to build "the next Jarvis." Call a number, talk naturally, and get intelligent responses. It sounds simple — until you try to build it.
I've built voice AI systems for multiple projects: an AI administrative assistant that takes phone calls, schedules appointments, and responds to inquiries; voice interfaces for SOA Assist Pro to help agents navigate Medicare forms; and experimental interfaces for customer support automation.
Here's what I learned the hard way: Most voice AI feels clunky because latency, conversation design, and interruption handling are harder problems than prompt engineering.
Why Most Voice AI Feels Terrible
Think about the last time you called an automated customer service line. You probably experienced:
- Long pauses where you're not sure if the system heard you
- The bot talking over you when you try to interrupt
- Misunderstanding what you said, even when you spoke clearly
- Robotic, unnatural pacing and tone
- Getting stuck in loops where you can't reach a human
These aren't AI problems — they're engineering problems. The underlying LLM is perfectly capable of understanding and responding. But the infrastructure between "user speaks" and "bot responds" is where everything falls apart.
The Stack That Actually Works
After trying multiple approaches, here's the stack I've converged on for production voice AI:
Twilio for Telephony
Twilio handles the phone infrastructure: receiving calls, managing sessions, and streaming audio. It's rock-solid, well-documented, and handles edge cases (dropped calls, poor connections) better than anything I've tried to build myself.
Key features:
- WebSocket streaming: Real-time bidirectional audio with low latency
- TwiML: Simple XML-based call control (greetings, transfers, recordings)
- Call recording: Built-in, compliant, useful for debugging and training
- Global infrastructure: Low-latency phone numbers in 100+ countries
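To make this concrete, here's a minimal sketch of the entry point: a webhook that answers the call and connects it to a bidirectional media stream over WebSocket. It uses Twilio's Python helper library with Flask; the wss:// URL is a placeholder for your own streaming server.

```python
# Minimal sketch: answer an inbound call and connect it to a bidirectional
# media stream over WebSocket. Uses Twilio's Python helper library with Flask;
# the wss:// URL is a placeholder for your own streaming server.
from flask import Flask
from twilio.twiml.voice_response import Connect, VoiceResponse

app = Flask(__name__)

@app.route("/voice", methods=["POST"])
def voice():
    response = VoiceResponse()
    response.say("Hi! How can I help you today?")
    connect = Connect()
    connect.stream(url="wss://example.com/media")   # your WebSocket audio endpoint
    response.append(connect)
    return str(response), 200, {"Content-Type": "text/xml"}
```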
Azure Speech for TTS/STT
For Text-to-Speech (TTS) and Speech-to-Text (STT), I use Azure Speech Services. I've tried OpenAI's Whisper, Google Cloud Speech, and AWS Transcribe. Azure wins on the combination of:
- Latency: Sub-500ms for both STT and TTS in most regions
- Natural voices: Neural TTS voices sound genuinely human, not robotic
- Customization: SSML support for pacing, emphasis, pauses
- Streaming: Both STT and TTS support streaming, critical for real-time feel
- Cost: More affordable than OpenAI's TTS for production volume
Note: OpenAI's Whisper is excellent for accuracy, but it's not real-time. For voice calls, you need streaming STT, and Azure handles this better.
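Here's roughly what the streaming STT side looks like with the Azure Speech SDK for Python, fed from a push stream. This sketch assumes the caller's audio has already been decoded to 16 kHz, 16-bit mono PCM (the push-stream default); the key and region are placeholders.

```python
# Sketch of streaming (continuous) recognition with the Azure Speech SDK,
# fed from a push stream. Assumes caller audio has already been decoded to
# 16 kHz, 16-bit mono PCM; key and region are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastus")
push_stream = speechsdk.audio.PushAudioInputStream()
audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

# Partial results arrive while the caller is still talking...
recognizer.recognizing.connect(lambda evt: print("partial:", evt.result.text))
# ...and a final result arrives once Azure decides the utterance is finished.
recognizer.recognized.connect(lambda evt: print("final:", evt.result.text))

recognizer.start_continuous_recognition()
# As audio chunks come off the Twilio WebSocket: push_stream.write(pcm_chunk)
# When the call ends: push_stream.close(); recognizer.stop_continuous_recognition()
```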
OpenAI (or Claude) for Conversational Intelligence
For the conversational logic — understanding intent, generating responses, maintaining context — I use OpenAI's GPT-4 or Anthropic's Claude, depending on the use case.
Key considerations:
- Streaming responses: GPT-4 Turbo with streaming gives you token-by-token output, which you can start converting to speech before the full response is done. This dramatically reduces perceived latency.
- Short prompts: Voice conversations need concise responses. A 300-word response that reads well on paper takes roughly two minutes to speak, far too long for a phone turn.
- Function calling: For actions like "schedule an appointment" or "look up an order," use function calling to trigger backend APIs instead of making the LLM do everything.
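A hedged sketch of both points together: a streamed completion with one tool the model can call. The schedule_appointment tool and the way you'd dispatch it to a backend are hypothetical; the system prompt enforces the short-response rule.

```python
# Sketch: a streamed completion with one tool the model can call. The
# schedule_appointment tool and its backend dispatch are hypothetical.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "schedule_appointment",
        "description": "Book an appointment on the business calendar.",
        "parameters": {
            "type": "object",
            "properties": {
                "day":  {"type": "string"},
                "time": {"type": "string"},
            },
            "required": ["day", "time"],
        },
    },
}]

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "You are a phone assistant. Answer in one or two short sentences."},
        {"role": "user", "content": "Can I get something on Tuesday afternoon?"},
    ],
    tools=tools,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:        # text tokens: hand them to the TTS pipeline as they arrive
        print(delta.content, end="", flush=True)
    if delta.tool_calls:     # tool-call fragments: accumulate args, then hit your backend
        pass
```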
The Architecture: How It All Connects
Here's the flow for a typical voice AI call:
1. User calls the Twilio number
2. Twilio receives the call and opens a WebSocket connection to your server
3. Your server streams audio chunks to Azure Speech STT
4. Azure returns transcribed text in real time (partial results, then final)
5. When the user stops speaking (detected by silence), send the transcription to OpenAI
6. OpenAI streams back a response, token by token
7. As tokens arrive, batch them and send to Azure TTS for conversion to speech
8. Stream the generated audio back to Twilio
9. Twilio plays the audio to the user over the phone
Repeat steps 3-9 for each subsequent turn in the conversation.
Simple in theory. Complex in practice.
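To show how the pieces hang together without drowning in plumbing, here's a structural sketch of the turn loop. The four helper coroutines are hypothetical stand-ins for the Twilio, Azure, and OpenAI code covered in the rest of this post; this is the shape of the glue, not a runnable implementation.

```python
# Structural sketch of the loop above. transcribe_until_silence,
# stream_llm_sentences, synthesize, and send_audio are hypothetical
# stand-ins for the STT, LLM, TTS, and Twilio pieces shown elsewhere.
async def handle_call(ws, history):
    while True:
        # Steps 3-5: pump caller audio into streaming STT until silence
        # produces a final transcript (None once the caller hangs up).
        user_text = await transcribe_until_silence(ws)
        if user_text is None:
            break
        history.append({"role": "user", "content": user_text})

        # Steps 6-7: stream the LLM reply and synthesize it sentence by sentence.
        spoken = []
        async for sentence in stream_llm_sentences(history):
            audio = await synthesize(sentence)    # Azure TTS
            await send_audio(ws, audio)           # Steps 8-9: back out via Twilio
            spoken.append(sentence)

        history.append({"role": "assistant", "content": " ".join(spoken)})
```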
The Hard Parts: Latency, Interruptions, and Conversation Design
1. Latency Is Everything
In a phone conversation, anything over 2 seconds of silence feels broken. Users start repeating themselves or hang up.
Your latency budget:
- STT: 200-500ms to get a final transcription
- LLM inference: 500-1500ms for GPT-4 to start streaming (first token)
- TTS: 200-400ms to convert the first sentence to audio
- Network overhead: 100-300ms across hops
Total: about 1 second in the best case and up to 2.7 seconds in the worst. You're already at the edge of acceptable.
How to optimize:
- Stream everything: Use streaming STT, streaming LLM responses, and streaming TTS. Don't wait for complete outputs.
- Start speaking early: As soon as you have the first sentence from the LLM, convert it to speech and start playing. The user hears a response while the rest is still generating (see the sketch after this list).
- Pre-cache common responses: For frequent queries ("What are your hours?"), pre-generate and cache the audio. Serve it instantly.
- Use faster models when possible: GPT-4o-mini has 3-5x lower latency than GPT-4 for first token. Use it for simple queries.
- Run servers close to users: Deploy in the same region as your Twilio numbers and Azure Speech endpoints. Every 100ms matters.
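Here's what "start speaking early" looks like in practice: buffer streamed tokens and hand each completed sentence to TTS immediately. The synthesize_and_play callback stands in for whatever your TTS path is, and the sentence-boundary regex is deliberately naive (it will also split on abbreviations like "Dr."), so treat this as a starting point.

```python
# Sketch of "start speaking early": flush each completed sentence to TTS as
# soon as it appears in the token stream, instead of waiting for the full reply.
# synthesize_and_play is passed in as a callable; the regex is deliberately naive.
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")

def speak_as_tokens_arrive(token_stream, synthesize_and_play):
    buffer = ""
    for token in token_stream:          # e.g. text deltas from a streaming completion
        buffer += token
        if SENTENCE_END.search(buffer):
            synthesize_and_play(buffer.strip())   # caller starts hearing this sentence now
            buffer = ""
    if buffer.strip():                  # flush whatever remains when the stream ends
        synthesize_and_play(buffer.strip())
```

Even this naive version cuts perceived latency sharply, because the first sentence usually finishes streaming well before the rest of the reply.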
2. Handling Interruptions
In human conversation, we interrupt each other all the time. "Can I schedule an appoint—" "Sure, what day works for you?"
Most voice AI doesn't handle this. The bot keeps talking even when you start speaking. It's infuriating.
The solution:
- Barge-in detection: Monitor incoming audio while the bot is speaking. If speech is detected above a threshold, immediately stop the TTS playback.
- Clear the buffers: Cancel any pending TTS and LLM streaming. Don't let old content leak into the new turn.
- Acknowledge the interruption: Optional, but human-like: "Oh, sorry—go ahead."
- Context preservation: Keep the conversation history so the bot knows what it was about to say, in case it's relevant.
This is technically tricky. You need real-time audio level monitoring, WebSocket control messages, and careful state management to avoid race conditions.
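Here's a sketch of the barge-in path, assuming Twilio's standard Media Streams message format (base64 mu-law frames in, and a "clear" control message to drop bot audio Twilio has already buffered) and a websockets-style async connection. The crude RMS threshold stands in for a proper voice activity detector, and state.cancel_pending_turn() is a hypothetical hook into your own turn management.

```python
# Sketch of barge-in against a Twilio Media Stream. Assumes the standard
# message format and the "clear" control message; the RMS threshold is a
# crude stand-in for a real VAD. audioop ships with Python up to 3.12.
import audioop
import base64
import json

BARGE_IN_RMS = 500   # tune against real calls; 16-bit PCM RMS units

async def on_media_message(ws, message, state):
    event = json.loads(message)
    if event.get("event") != "media":
        return
    mulaw = base64.b64decode(event["media"]["payload"])
    pcm = audioop.ulaw2lin(mulaw, 2)                 # mu-law -> 16-bit linear PCM
    if state.bot_is_speaking and audioop.rms(pcm, 2) > BARGE_IN_RMS:
        # 1. Drop whatever bot audio Twilio is still holding.
        await ws.send(json.dumps({"event": "clear",
                                  "streamSid": event["streamSid"]}))
        # 2. Cancel in-flight LLM and TTS work for the old turn
        #    (hypothetical hook into your own state management).
        state.cancel_pending_turn()
        state.bot_is_speaking = False
```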
3. Conversation Design (Not Just Prompt Engineering)
Designing for voice is fundamentally different from designing for chat. Here's what I learned:
Keep Responses Short
A paragraph that reads quickly takes forever to speak. Aim for 1-2 sentences per turn. If you need to convey more, break it into back-and-forth.
Bad: "Thanks for calling! I can help you schedule an appointment, check your order status, update your account information, or answer general questions about our products and services. What would you like to do today?"
Good: "Hi! How can I help you today?"
Clarify Ambiguity Early
STT isn't perfect, especially with background noise, accents, or domain-specific terms. If you're not sure what the user said, ask for confirmation.
Example: "Did you say Tuesday at 2pm, or Thursday at 2pm?"
Provide Escape Hatches
Always give users a way out. "If you'd like to speak to a human, just say 'agent' or press 0."
Nothing destroys trust faster than trapping someone in a bot loop with no way to escalate.
Use Filler Words Strategically
Humans say "um," "let me see," "just a moment" to fill silence while thinking. Bots should too. If you need to query a database or call an API, have the bot say "Let me check that for you" while the request is in flight. It makes latency feel intentional, not broken.
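One way to implement that, sketched with asyncio: fire the filler line and the backend lookup concurrently, then deliver the real answer once both finish. The say and check_calendar callables are hypothetical stand-ins for your TTS path and calendar API.

```python
# Sketch of masking backend latency with a filler line. say() and
# check_calendar() are hypothetical stand-ins passed in by the caller;
# asyncio.gather runs the filler speech and the lookup concurrently.
import asyncio

async def answer_availability(say, check_calendar, day):
    filler = say("Let me check that for you.")   # starts speaking immediately
    lookup = check_calendar(day)                 # backend call runs in parallel
    _, slots = await asyncio.gather(filler, lookup)
    if slots:
        await say(f"I have {slots[0]} open on {day}. Does that work?")
    else:
        await say(f"I don't see anything open on {day}. Want to try another day?")
```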
Test With Real Phone Calls
Don't just test in a quiet room with a clear microphone. Call from your car, from a coffee shop, with background noise. Real-world audio quality is much worse than you expect.
Handling Background Noise and Accents
Phone calls happen in noisy environments: cars, offices, streets. And callers have accents, speak quickly, or mumble.
Strategies that help:
- Noise suppression: Azure Speech has built-in noise reduction. It's not perfect, but it helps with background chatter, traffic, etc.
- Confidence scores: Azure STT returns confidence scores per word. If confidence is low (<0.6), ask the user to repeat.
- Contextual hints: If you know the domain (e.g., scheduling appointments), provide a phrase list to Azure STT to boost recognition of specific terms like "Tuesday," "2pm," doctor names, etc. (see the sketch after this list)
- Fallback to spelling: For critical info (names, email addresses), ask users to spell it out. "Can you spell your last name for me?"
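Putting the confidence and phrase-list ideas together, here's a rough sketch with the Azure Speech SDK: add a phrase list to bias recognition toward your domain terms, and read a confidence value out of the detailed JSON result. For simplicity this checks utterance-level confidence (per-word confidence needs word-level timestamps enabled); the phrases, the 0.6 threshold, and the two handler functions are examples, not a drop-in implementation.

```python
# Sketch: phrase-list hints plus a confidence check with the Azure Speech SDK.
# Uses utterance-level confidence from the detailed JSON result for simplicity;
# the phrases, the 0.6 threshold, and the two handlers are examples.
import json
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastus")
speech_config.output_format = speechsdk.OutputFormat.Detailed   # include confidence
# In a real call, reuse the push-stream audio_config from the STT example above.
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

# Bias recognition toward scheduling vocabulary and proper nouns.
phrases = speechsdk.PhraseListGrammar.from_recognizer(recognizer)
for term in ["Tuesday", "Thursday", "2 PM", "Dr. Alvarez"]:     # example terms
    phrases.addPhrase(term)

def on_recognized(evt):
    detail = json.loads(evt.result.properties.get_property(
        speechsdk.PropertyId.SpeechServiceResponse_JsonResult))
    confidence = detail["NBest"][0]["Confidence"]
    if confidence < 0.6:
        ask_to_repeat()                      # hypothetical: re-prompt the caller
    else:
        handle_transcript(evt.result.text)   # hypothetical: continue the turn

recognizer.recognized.connect(on_recognized)
```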
Cost Considerations: Voice AI Is Expensive
Let's talk money. Voice AI isn't cheap, especially at scale.
Typical costs for a 5-minute call:
- Twilio: $0.013/min for inbound calls in the US → $0.065
- Azure STT: $1/hour for standard recognition → $0.083
- Azure TTS: $16/1M characters (speech runs roughly 1,000 characters per minute) → $0.08
- OpenAI GPT-4: ~10K tokens for a 5-min conversation → $0.30
- Total: ~$0.53 per 5-minute call
That's $6.36 per hour of conversation, or $636 for 100 hours. If you're running a customer support line with hundreds of calls per day, it adds up fast.
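If you want to sanity-check or tweak these assumptions, the arithmetic fits in a few lines. Real bills vary with region, model choice, and how much of the call the bot actually speaks.

```python
# Back-of-the-envelope cost model using the per-unit prices listed above.
def call_cost(minutes=5, llm_tokens=10_000):
    twilio = 0.013 * minutes                        # $/min, US inbound
    stt    = (1.00 / 60) * minutes                  # $1 per hour of recognition
    tts    = (16 / 1_000_000) * 1_000 * minutes     # ~1,000 characters of speech per minute
    llm    = 0.30 * (llm_tokens / 10_000)           # ~$0.30 per 10K GPT-4 tokens
    return twilio + stt + tts + llm

print(f"${call_cost():.2f} per 5-minute call")      # ≈ $0.53
```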
How to control costs:
- Use cheaper models: GPT-4o-mini is 10x cheaper for many use cases.
- Cache responses: For FAQs, pre-generate and cache audio.
- Tier by complexity: Route simple queries to cheaper models, complex ones to GPT-4.
- Set time limits: Cap calls at 10 minutes, then offer to transfer to a human.
- Monitor usage per customer: Flag and investigate outliers (someone making 50 calls/day probably isn't legitimate use).
Real-World Example: AI Administrative Assistant
In one of my projects, I built an AI assistant that answers calls for a small business. It handles appointment scheduling, basic questions, and routes urgent calls to humans.
What worked:
- Streaming everything: latency stayed under 1.5 seconds for most turns
- Pre-cached greetings and FAQs: instant responses for common questions
- Clear escalation path: "If you need immediate help, I'll transfer you now"
- Conversational function calling: "Let me check the calendar" → query Google Calendar API
- Call recording + transcript logging: useful for debugging and training
What was hard:
- Handling interruptions reliably (took multiple iterations to get barge-in working smoothly)
- Accents and background noise (confidence score thresholds required tuning)
- Keeping conversations on track (users ramble, change topics, or get confused)
- Edge cases: multiple speakers on one line, speakerphone echo, poor cell connections
When to Use Voice AI (and When Not To)
Voice AI isn't always the right solution. Here's when it makes sense:
Good Use Cases:
- High-volume, low-complexity queries: "What are your hours?" "Where's my order?"
- After-hours support: Handle calls when humans aren't available
- Appointment scheduling: Voice is more natural than form-filling
- Triage and routing: Figure out what the caller needs, then route appropriately
Bad Use Cases:
- Complex, sensitive issues: Healthcare diagnosis, legal advice, financial planning — these need humans
- High-emotion situations: Angry customers, emergencies — escalate immediately
- Tasks requiring visual elements: "Fill out this form" works better on web/app
- Long-form content: Don't read a 10-minute policy over the phone
Key Takeaways
- Latency is everything: Keep total response time under 2 seconds by streaming STT, LLM, and TTS.
- The stack that works: Twilio + Azure Speech + OpenAI, with careful orchestration between them.
- Interruption handling is critical: Implement barge-in detection so users can cut off the bot naturally.
- Conversation design ≠ prompt engineering: Keep responses short, clarify ambiguity, provide escape hatches.
- Voice AI is expensive: ~$0.50+ per 5-minute call. Cache aggressively, use cheaper models when possible.
- Test with real phone calls: Background noise, accents, and poor connections break systems that work perfectly in dev.
- Not every problem needs voice AI: use it for high-volume, low-complexity tasks where it genuinely improves UX.
Building voice AI for your product? I'd love to hear about your challenges with latency, conversation design, or production deployment. Reach out at adamdugan6@gmail.com or connect with me on LinkedIn.