Building Voice AI That Doesn't Suck: Real-Time Conversational Interfaces in Production

Everyone wants to build "the next Jarvis." Call a number, talk naturally, and get intelligent responses. It sounds simple — until you try to build it.

I've built voice AI systems for multiple projects: an AI administrative assistant that takes phone calls, schedules appointments, and responds to inquiries; voice interfaces for SOA Assist Pro to help agents navigate Medicare forms; and experimental interfaces for customer support automation.

Here's what I learned the hard way: Most voice AI feels clunky because latency, conversation design, and interruption handling are harder problems than prompt engineering.

Why Most Voice AI Feels Terrible

Think about the last time you called an automated customer service line. You probably experienced:

- Long, awkward pauses before every response
- The system mishearing you and barreling ahead with the wrong answer
- Being talked over the moment you tried to interrupt

These aren't AI problems — they're engineering problems. The underlying LLM is perfectly capable of understanding and responding. But the infrastructure between "user speaks" and "bot responds" is where everything falls apart.

The Stack That Actually Works

After trying multiple approaches, here's the stack I've converged on for production voice AI:

Twilio for Telephony

Twilio handles the phone infrastructure: receiving calls, managing sessions, and streaming audio. It's rock-solid, well-documented, and handles edge cases (dropped calls, poor connections) better than anything I've tried to build myself.

Key features:

- Programmable call control: answer, route, transfer, hang up
- Media Streams: real-time call audio over WebSockets to your own server
- Built-in resilience for dropped calls and flaky connections
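
For reference, the webhook response that starts streaming call audio to your server is just a few lines of TwiML. This is a minimal sketch, and the WebSocket URL is a placeholder:

```xml
<!-- Returned from your Twilio voice webhook. <Connect><Stream> opens a
     bidirectional Media Stream to the given WebSocket endpoint. -->
<Response>
  <Connect>
    <Stream url="wss://your-server.example.com/media" />
  </Connect>
</Response>
```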

Azure Speech for TTS/STT

For Text-to-Speech (TTS) and Speech-to-Text (STT), I use Azure Speech Services. I've tried OpenAI's Whisper, Google Cloud Speech, and AWS Transcribe. Azure wins on the combination of streaming support, latency, and voice quality.

Note: OpenAI's Whisper is excellent for accuracy, but it's not real-time. For voice calls, you need streaming STT, and Azure handles this better.

OpenAI (or Claude) for Conversational Intelligence

For the conversational logic — understanding intent, generating responses, maintaining context — I use OpenAI's GPT-4 or Anthropic's Claude, depending on the use case.

Key considerations:

- Stream responses token by token so TTS can start before the full reply is done
- Keep replies short; a voice turn should be 1-2 sentences
- Manage conversation history explicitly, since long calls can outgrow the context window
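
The context-management piece can be sketched in a few lines. This is a minimal, hypothetical approach: keep the system prompt plus a sliding window of recent turns so long calls don't blow past the context window (the cap of 10 turns is an arbitrary placeholder):

```python
MAX_TURNS = 10  # arbitrary placeholder; tune to your model's context window and cost budget

def append_turn(history, role, text, max_turns=MAX_TURNS):
    """Add a turn, then keep the system prompt plus only the most recent turns."""
    history = history + [{"role": role, "content": text}]
    system, turns = history[:1], history[1:]
    return system + turns[-max_turns:]
```

The trimmed list is what you'd pass as the messages array on each LLM call; summarizing dropped turns instead of discarding them outright is a common refinement.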

The Architecture: How It All Connects

Here's the flow for a typical voice AI call:

  1. User calls the Twilio number
  2. Twilio receives the call and opens a WebSocket connection to your server
  3. Your server streams audio chunks to Azure Speech STT
  4. Azure returns transcribed text in real time (partial results first, then a final transcript)
  5. When the user stops speaking (detected by silence), send the transcription to OpenAI
  6. OpenAI streams back a response, token by token
  7. As tokens arrive, batch them and send to Azure TTS for conversion to speech
  8. Stream the generated audio back to Twilio
  9. Twilio plays the audio to the user over the phone
  10. Repeat steps 3-9 for the next turn in the conversation
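
Step 5, detecting that the user has stopped speaking, deserves a closer look. Here's a deliberately simple energy-based sketch. Production systems usually lean on the STT service's built-in endpointing or a proper VAD model; the thresholds below are made-up starting points, and the code assumes audio already decoded to 16-bit linear PCM (Twilio streams μ-law by default, so frames need converting first):

```python
import struct

SILENCE_THRESHOLD = 500  # mean absolute amplitude below this counts as silence (tunable)
SILENCE_CHUNKS = 25      # ~0.5s of silence at 20ms chunks before the turn is considered done

def chunk_is_silent(pcm_chunk: bytes, threshold: int = SILENCE_THRESHOLD) -> bool:
    """Return True if a 16-bit little-endian PCM chunk is below the energy threshold."""
    samples = struct.unpack(f"<{len(pcm_chunk) // 2}h", pcm_chunk)
    if not samples:
        return True
    mean_abs = sum(abs(s) for s in samples) / len(samples)
    return mean_abs < threshold

class EndOfSpeechDetector:
    """Tracks consecutive silent chunks; fires once speech has been heard and then stops."""

    def __init__(self, silence_chunks: int = SILENCE_CHUNKS):
        self.silence_chunks = silence_chunks
        self.heard_speech = False
        self.silent_run = 0

    def feed(self, pcm_chunk: bytes) -> bool:
        """Feed one audio chunk; return True once the user has finished speaking."""
        if chunk_is_silent(pcm_chunk):
            self.silent_run += 1
        else:
            self.heard_speech = True
            self.silent_run = 0
        return self.heard_speech and self.silent_run >= self.silence_chunks
```

A fixed threshold struggles in noisy environments; adapting it to a rolling noise floor is the usual next step.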

Simple in theory. Complex in practice.

The Hard Parts: Latency, Interruptions, and Conversation Design

1. Latency Is Everything

In a phone conversation, anything over 2 seconds of silence feels broken. Users start repeating themselves or hang up.

Your latency budget spans four stages: speech-to-text, LLM inference, text-to-speech, and network transport.

Total: 1-2.7 seconds when everything goes right. You're already at the edge of acceptable.
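
To make the arithmetic concrete, here's an illustrative breakdown. The per-stage numbers are hypothetical placeholders chosen to land in the 1-2.7 second range, not measurements:

```python
# Illustrative latency budget: (best, worst) case per stage, in seconds.
BUDGET = {
    "STT (final transcript after silence)": (0.3, 0.8),
    "LLM first tokens for a short reply":   (0.4, 1.2),
    "TTS synthesis of the first sentence":  (0.2, 0.5),
    "network + telephony transport":        (0.1, 0.2),
}

best = sum(lo for lo, hi in BUDGET.values())
worst = sum(hi for lo, hi in BUDGET.values())
print(f"best case: {best:.1f}s, worst case: {worst:.1f}s")  # → best case: 1.0s, worst case: 2.7s
```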

How to optimize:

- Stream at every stage; never wait for a complete transcript or a complete LLM response
- Start TTS as soon as the first sentence of the reply is ready
- Keep responses short, which cuts both LLM and TTS time
- Cover unavoidable waits with filler phrases ("Let me check that for you")
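
Step 7 of the flow above, batching tokens for TTS, is where the biggest win hides: send the LLM's reply to TTS one sentence at a time instead of waiting for the whole response. A naive sketch, where the token list stands in for a real streaming LLM client and the regex will misfire on abbreviations like "Dr.":

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")  # naive: treats any ./!/? at buffer end as a boundary

def sentences_from_tokens(token_stream):
    """Yield complete sentences as soon as they close, instead of waiting for the full reply.

    Each yielded sentence can go to TTS immediately, cutting time-to-first-audio.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()

# Stand-in for a streaming LLM response:
tokens = ["Sure", ",", " I can", " help.", " What", " day", " works", " for you?"]
print(list(sentences_from_tokens(tokens)))  # → ['Sure, I can help.', 'What day works for you?']
```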

2. Handling Interruptions

In human conversation, we interrupt each other all the time. "Can I schedule an appoint—" "Sure, what day works for you?"

Most voice AI doesn't handle this. The bot keeps talking even when you start speaking. It's infuriating.

The solution: monitor the inbound audio stream for caller speech, stop playback the instant it's detected, and treat what follows as the next user turn.

This is technically tricky. You need real-time audio level monitoring, WebSocket control messages, and careful state management to avoid race conditions.
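
A sketch of that state management, with hypothetical names. The `stop_playback` callback is where you'd send the control message that flushes queued audio (Twilio's bidirectional Media Streams expose a `clear` message for this):

```python
from enum import Enum, auto

class CallState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class BargeInController:
    """Stops bot playback as soon as caller speech is detected mid-utterance."""

    def __init__(self, stop_playback):
        self.state = CallState.LISTENING
        self.stop_playback = stop_playback  # callback that flushes queued outbound audio

    def bot_started_speaking(self):
        self.state = CallState.SPEAKING

    def bot_finished_speaking(self):
        self.state = CallState.LISTENING

    def user_audio_detected(self):
        """Call this whenever inbound audio crosses the speech threshold."""
        if self.state is CallState.SPEAKING:
            self.stop_playback()  # barge-in: cut the bot off immediately
            self.state = CallState.LISTENING
```

Funneling every state change through one object like this is what keeps the race conditions tractable.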

3. Conversation Design (Not Just Prompt Engineering)

Designing for voice is fundamentally different from designing for chat. Here's what I learned:

Keep Responses Short

A paragraph that reads quickly takes forever to speak. Aim for 1-2 sentences per turn. If you need to convey more, break it into back-and-forth.

Bad: "Thanks for calling! I can help you schedule an appointment, check your order status, update your account information, or answer general questions about our products and services. What would you like to do today?"

Good: "Hi! How can I help you today?"

Clarify Ambiguity Early

STT isn't perfect, especially with background noise, accents, or domain-specific terms. If you're not sure what the user said, ask for confirmation.

Example: "Did you say Tuesday at 2pm, or Thursday at 2pm?"

Provide Escape Hatches

Always give users a way out. "If you'd like to speak to a human, just say 'agent' or press 0."

Nothing destroys trust faster than trapping someone in a bot loop with no way to escalate.
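
Detecting the escape hatch doesn't need the LLM at all; a plain keyword check on each final transcript is cheaper and more reliable. A minimal sketch (the trigger list is illustrative):

```python
ESCALATION_TRIGGERS = {"agent", "human", "representative", "operator"}

def wants_human(transcript: str) -> bool:
    """Check a final transcript for any escalation keyword, ignoring case and punctuation."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return bool(words & ESCALATION_TRIGGERS)
```

Run this before sending the transcript to the LLM, so escalation works even when the model is confused.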

Use Filler Words Strategically

Humans say "um," "let me see," "just a moment" to fill silence while thinking. Bots should too. If you need to query a database or call an API, have the bot say "Let me check that for you" while the request is in flight. It makes latency feel intentional, not broken.
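
The timing trick is to kick off the slow work and the filler phrase concurrently rather than sequentially. A toy asyncio sketch, where `slow_lookup` and `say` are stand-ins for a real backend call and real TTS:

```python
import asyncio

spoken = []  # records what the "bot" said, in order

async def slow_lookup():
    """Stand-in for a database query or external API call."""
    await asyncio.sleep(0.2)
    return "Tuesday at 2pm is available."

async def say(text):
    """Stand-in for sending text to TTS."""
    spoken.append(text)

async def handle_turn():
    # Start the slow work first, then speak the filler while it's in flight.
    lookup = asyncio.create_task(slow_lookup())
    await say("Let me check that for you.")
    result = await lookup
    await say(result)

asyncio.run(handle_turn())
print(spoken)
```

The caller hears "Let me check that for you" almost immediately, and the answer lands as soon as the lookup resolves.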

Test With Real Phone Calls

Don't just test in a quiet room with a clear microphone. Call from your car, from a coffee shop, with background noise. Real-world audio quality is much worse than you expect.

Handling Background Noise and Accents

Phone calls happen in noisy environments: cars, offices, streets. And callers have accents, speak quickly, or mumble.

Strategies that help:

- Confirm anything ambiguous before acting on it
- Give the recognizer your domain vocabulary (Azure Speech accepts phrase lists as hints)
- Treat low-confidence transcripts as a cue to ask again, not a best guess to run with
- Test with real-world audio: cars, coffee shops, speakerphones
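
Most streaming STT services return a confidence score alongside each final result, which makes the "ask again" strategy mechanical. A sketch with a made-up threshold:

```python
CONFIDENCE_FLOOR = 0.70  # made-up threshold; tune against recordings of real calls

def needs_confirmation(transcript: str, confidence: float) -> bool:
    """Ask the caller to confirm when the transcript is shaky or suspiciously short."""
    return confidence < CONFIDENCE_FLOOR or len(transcript.split()) <= 1

print(needs_confirmation("Tuesday at 2pm", 0.55))  # → True: low confidence, confirm first
```

When this returns True, the bot asks a confirmation question ("Did you say Tuesday at 2pm?") instead of acting on the transcript.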

Cost Considerations: Voice AI Is Expensive

Let's talk money. Voice AI isn't cheap, especially at scale.

A typical 5-minute call, counting telephony minutes, STT, LLM tokens, and TTS, comes out to roughly $0.53.

That's $6.36 per hour of conversation, or $636 for 100 hours. If you're running a customer support line with hundreds of calls per day, it adds up fast.
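
The arithmetic behind those numbers, using the $6.36/hour figure above (the per-service breakdown is omitted here):

```python
COST_PER_HOUR = 6.36  # blended hourly cost from the estimate above

def call_cost(minutes: float) -> float:
    """Blended cost of a single call of the given length, in dollars."""
    return round(COST_PER_HOUR * minutes / 60, 2)

# A support line taking 200 five-minute calls a day:
daily = call_cost(5) * 200
print(f"per call: ${call_cost(5):.2f}, per day: ${daily:.2f}, per month: ${daily * 30:.2f}")
```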

How to control costs:

- Keep responses short; LLM and TTS pricing both scale with length
- Cache TTS audio for phrases you say on every call (greetings, confirmations)
- Route simple intents to a cheaper, faster model and save the big model for hard turns
- Trim conversation history so you aren't re-sending the whole call on every turn

Real-World Example: AI Administrative Assistant

In one of my projects, I built an AI assistant that answers calls for a small business. It handles appointment scheduling, basic questions, and routes urgent calls to humans.

What worked:

What was hard:

When to Use Voice AI (and When Not To)

Voice AI isn't always the right solution. Here's when it makes sense:

Good Use Cases:

- High-volume, repetitive interactions: appointment scheduling, status checks, call routing
- Narrow domains where the set of likely intents is small and predictable
- Overflow and after-hours coverage where the alternative is voicemail

Bad Use Cases:

- Emotionally charged or high-stakes conversations: complaints, medical guidance, emergencies
- Conveying long, detailed information; voice is a poor medium for anything you'd rather read
- Workflows where a single misheard word has expensive consequences

Key Takeaways

- Latency, interruption handling, and conversation design are the hard parts; the LLM is the easy part
- Stream at every stage, and design the conversation so unavoidable waits feel intentional
- Voice needs its own design discipline: short turns, early confirmation, and always an escape hatch to a human
- Test with real phone calls in real environments, not a quiet room and a good microphone
- Budget realistically; at roughly $6.36 per hour of conversation, costs scale fast with call volume

Building voice AI for your product? I'd love to hear about your challenges with latency, conversation design, or production deployment. Reach out at adamdugan6@gmail.com or connect with me on LinkedIn.