Review · Last updated: April 22, 2026 · By Roman Stanek · ~1,600 words

VAPI Review: Honest Assessment After Running It in Production

VAPI is the voice AI infrastructure platform I use to run Amy, an AI cold caller making 75 calls a day to Australian tradespeople. I've been running it in production since late 2025. This is an honest review — what works, what breaks, how the pricing actually stacks up, and when you should use Bland.ai or Retell instead.

~$0.05: Per-minute VAPI infrastructure cost (Source: VAPI pricing page, 2026)
600ms: Typical end-to-end response latency (Source: Production logs, 2026)
107M+: Minutes of voice AI processed by VAPI in 2025 (Source: VAPI company blog, 2025)

✓ Verdict: Best-in-class for developers who want full control. Not the fastest to get running, but the most powerful at scale.

What VAPI Is (and Isn't)

VAPI is infrastructure, not a finished product. You don't sign up and get a working AI caller. You get an API that lets you assemble one from best-of-breed components.

What VAPI handles:

Outbound and inbound call management (via Twilio, Vonage, or your own SIP trunk)
Real-time audio streaming to/from your STT, LLM, and TTS providers
Turn-taking logic — detecting when the human has finished speaking
Call recording, transcription storage, and webhook events
Concurrent call scaling — run hundreds of simultaneous calls without managing infrastructure

What VAPI doesn't handle:

Your script or conversation logic — that's your LLM system prompt
Lead list management or CRM — you build that
Compliance (DNC scrubbing, call hour rules) — entirely your responsibility
The voice itself — you bring ElevenLabs, PlayHT, or similar

The modularity is both the strength and the complexity. You can swap Deepgram for Whisper, GPT-4o for Claude, ElevenLabs for PlayHT — and tune each layer independently. But that also means more configuration surface area where things can go wrong.

Pricing: What You Actually Pay

VAPI's pricing page shows $0.05/min for infrastructure. Here's the full picture for a production setup:

Component	Cost (per minute unless noted)	Notes
VAPI infrastructure	~$0.050	Base platform fee
Deepgram Nova-2 (STT)	~$0.004	Real-time streaming model
GPT-4o (LLM)	~$0.015	Varies with token usage per turn
ElevenLabs (TTS)	~$0.010	Creator plan; scales with char count
Twilio (carrier)	~$0.008	Per-minute outbound; varies by country
Total (90-second call)	~$0.065–0.09	Answered calls only
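
To check the breakdown against your own stack, a small calculator helps. This is a minimal sketch using the approximate per-minute figures listed above; the dictionary keys are made-up labels, and per-second pro-rating is an assumption, since providers may round billed time differently.

```python
# Approximate per-minute rates from the breakdown above (USD).
RATES_PER_MINUTE = {
    "vapi": 0.050,
    "deepgram_nova_2": 0.004,
    "gpt_4o": 0.015,
    "elevenlabs": 0.010,
    "twilio": 0.008,
}


def per_minute_total(rates=RATES_PER_MINUTE):
    """Sum every layer's per-minute rate."""
    return sum(rates.values())


def call_cost(seconds, rates=RATES_PER_MINUTE):
    """Cost of one answered call, assuming billing is pro-rated by the second."""
    return per_minute_total(rates) * seconds / 60


print(f"all-in per minute: ${per_minute_total():.3f}")
print(f"90-second call:    ${call_cost(90):.3f}")
```

On these rates a fully billed 90 seconds comes out higher than the per-call range quoted above, so check the result against your own invoices rather than treating either number as exact.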

Unanswered calls (voicemail, no answer) cost almost nothing: VAPI detects the voicemail greeting and hangs up in 4–6 seconds, so you pay roughly $0.005 per unanswered dial.

For 75 calls/day at 50% answer rate and 90-second average call duration: ~$5.25/day total. At a 1/20 conversion to booked meeting, that's $105 per booked meeting. For a service selling at $3,000–5,000 AUD, the ROI is obvious.

Pros and Cons: Production Reality

What works well

Full control over every layer of the stack
Cheapest at scale when you optimise each component
Excellent webhook system — every call event fires reliably
Good concurrent call handling — 50+ simultaneous calls without issues
Strong Discord community with real engineers answering questions
Swap any provider without rebuilding everything
Dashboard is clear; logs are detailed enough to debug

What's painful

Endpointing tuning takes time — default settings cause interruptions
Occasional latency spikes to 1.5–2s, which sound unnatural on a live call
Documentation lags behind features — some things only discoverable via Discord
No built-in analytics — you log everything yourself via webhooks
ElevenLabs + VAPI integration has occasional audio glitches under high load
Lacks a no-code interface, so non-developers will struggle

Configuration That Actually Works

These are the settings I run in production after months of tuning. Copy this as a starting point:

// VAPI assistant config -- production settings for outbound cold calling
{
  "transcriber": {
    "provider": "deepgram",
    "model": "nova-2",
    "language": "en-AU",        // AU English accuracy boost
    "smartFormat": true
  },
  "voice": {
    "provider": "11labs",
    "voiceId": "<your-elevenlabs-voice-id>", // Use an AU-friendly voice; test 2-3 before committing
    "stability": 0.5,
    "similarityBoost": 0.75
  },
  "model": {
    "provider": "openai",
    "model": "gpt-4o",
    "temperature": 0.4          // Lower = more consistent responses
  },
  "endpointingConfig": {
    "vadThreshold": 0.6,           // Higher = less sensitive, fewer false positives
    "silenceDurationMs": 700       // Wait 700ms of silence before responding
  },
  "backchannelingEnabled": false,    // Disable "mm-hmm" filler -- sounds weird at scale
  "backgroundDenoisingEnabled": true // Helps with tradie background noise
}
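
The block above is annotated with // comments, which a strict JSON parser rejects. Here is a minimal sketch for turning annotated text like this into a plain dictionary before sending it anywhere. The sample text and its `note` key are hypothetical; only the comment handling is the point.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that sit outside JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line, keeping the newline itself.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def load_commented_config(text: str) -> dict:
    """Parse JSON that has been annotated with // comments."""
    return json.loads(strip_line_comments(text))


EXAMPLE = '''
// header comment
{
  "transcriber": {"provider": "deepgram", "language": "en-AU"}, // AU English
  "endpointingConfig": {"silenceDurationMs": 700},
  "note": "https://example.com//path"
}
'''

config = load_commented_config(EXAMPLE)
print(config["endpointingConfig"]["silenceDurationMs"])
```

Tracking whether the scanner is inside a string matters because URLs contain `//`. A plain regex over each line would truncate them.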
    

The single most important setting: silenceDurationMs. The default is too low — the agent cuts in while the human is still mid-sentence. Set it to 700ms and you eliminate 80% of the interruption problem. Go to 900ms if your audience speaks slowly or tends to pause mid-thought.
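
To make the effect of silenceDurationMs concrete, here is a toy model of silence-based endpointing. It is not VAPI's implementation. The 20 ms frame size, the per-frame speech probabilities, and the 400 ms comparison value are assumptions chosen for illustration.

```python
def end_of_turn_ms(speech_probs, frame_ms=20, vad_threshold=0.6, silence_duration_ms=700):
    """Return the time (ms) at which the turn is declared finished, or None.

    speech_probs holds one voice-activity probability per audio frame.
    """
    silent_ms = 0
    heard_speech = False
    for index, prob in enumerate(speech_probs):
        if prob >= vad_threshold:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= silence_duration_ms:
                return (index + 1) * frame_ms
    return None


# 1 s of speech, a 500 ms mid-sentence pause, 1 s of speech, then 1 s of silence.
utterance = [0.9] * 50 + [0.1] * 25 + [0.9] * 50 + [0.1] * 50

print(end_of_turn_ms(utterance, silence_duration_ms=400))  # 1400: fires inside the pause
print(end_of_turn_ms(utterance, silence_duration_ms=700))  # 3200: waits for the real end
```

With the shorter window the agent would start talking 1.4 seconds in, halfway through the caller's sentence. The cost of the longer window is that every reply starts at least that much later, which is why 700–900ms is a trade-off and not a free fix.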

VAPI vs Bland.ai vs Retell: Which to Choose

Criteria	VAPI	Bland.ai	Retell AI
Time to first call	2–4 hours (dev setup)	30 minutes (no-code)	1–2 hours
Pricing model	~$0.05/min + providers	$0.09/min flat	~$0.07/min + providers
Cost at 10K min/month	~$650 (optimised)	$900 flat	~$750
Custom LLM	✓ Full control	✗ Limited	✓ Via webhook
Voice providers	ElevenLabs, PlayHT, OpenAI, Deepgram	Built-in + cloning	ElevenLabs, OpenAI, custom
Analytics dashboard	~ Basic logs	~ Moderate	✓ Best
Non-developer friendly	✗ No	✓ Yes	~ Moderate
Best for	Developers, high volume, custom stack	Fast setup, non-technical teams	Agencies managing multiple clients

My routing rule: VAPI if you're technical and running volume. Bland.ai if you need something live today without touching an API. Retell if you're an agency managing calling campaigns for multiple clients and need a clean reporting UI.

How I Score VAPI

Voice quality

4.5/5

Latency

4/5

Pricing

4.5/5

Ease of setup

3/5

Documentation

3/5

Reliability

4/5

Overall: 4/5 for developers, 2/5 for non-technical users. The platform has gotten meaningfully better in the 6 months I've been running it. The team ships fast. The latency issues are less frequent than they were in Q3 2025. The documentation still needs work.

Three Things I Wish I'd Known Before Starting

Test endpointing with real background noise before going live. The tradies I'm calling are sometimes on a job site — machinery, traffic, wind. Test your VAD threshold with audio that has ambient noise, not just a quiet office recording.
Build your logging layer before your first real campaign. VAPI fires webhooks for every event. If you don't have a system to capture them from day one, you'll lose call data you can't recover. I log everything to a Google Sheet via a simple FastAPI endpoint.
ElevenLabs latency varies significantly by voice. The Matilda voice I use is faster than most custom clones. If you clone a voice, test the actual latency before assuming it'll match what you've read in benchmarks.
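
The logging advice above can be sketched with the standard library alone. This swaps the FastAPI-to-Google-Sheet setup for a plain CSV file so it runs without dependencies. The payload shape (a `message` object with `type`, `call.id`, `durationSeconds`, and `endedReason`) is an assumption for illustration, so check the fields your webhooks actually carry.

```python
import csv
import json
from datetime import datetime, timezone
from pathlib import Path

FIELDS = ["received_at", "event_type", "call_id", "duration_seconds", "ended_reason", "raw"]


def event_to_row(payload: dict) -> dict:
    """Flatten one webhook payload into a log row, keeping the raw JSON."""
    message = payload.get("message", {})
    call = message.get("call") or {}
    return {
        "received_at": datetime.now(timezone.utc).isoformat(),
        "event_type": message.get("type", "unknown"),
        "call_id": call.get("id", ""),
        "duration_seconds": message.get("durationSeconds", ""),
        "ended_reason": message.get("endedReason", ""),
        "raw": json.dumps(payload, sort_keys=True),
    }


def append_event(log_path: Path, payload: dict) -> None:
    """Append one event to a CSV log, writing the header on first use."""
    is_new = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(event_to_row(payload))


# Hypothetical payload for illustration only.
sample = {
    "message": {
        "type": "end-of-call-report",
        "call": {"id": "call_123"},
        "durationSeconds": 92,
        "endedReason": "customer-ended-call",
    }
}
print(event_to_row(sample)["event_type"])
```

Storing the raw payload next to the flattened columns means a field you did not think to extract on day one is still recoverable later.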

When VAPI Doesn't Apply

You're not technical (or don't have a developer). VAPI requires API configuration, webhook handling, and debugging of voice pipeline settings. Without technical capability, use Bland.ai instead — it's designed for non-developers.
You need analytics without building them. VAPI's built-in dashboard is minimal. If you need conversion rates, call durations, and booking rates without building your own logging system, Retell has this out of the box.
You're running under 1,000 minutes a month. At low volume, the cost savings of VAPI over Bland.ai are negligible, but the setup complexity is the same. Bland.ai's simplicity is worth the $0.04/min premium at low volumes.
You need guaranteed SLA uptime. VAPI is robust but doesn't publish enterprise SLAs. For mission-critical calling operations (financial services, high-stakes healthcare scheduling), verify their current SLA terms before committing.
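
The low-volume point above comes down to simple arithmetic. A minimal sketch, taking the $0.04/min Bland.ai premium quoted above as given; the monthly volumes are illustrative.

```python
def monthly_premium(minutes_per_month, premium_per_minute=0.04):
    """Extra monthly spend from paying a per-minute premium for simplicity."""
    return minutes_per_month * premium_per_minute


for minutes in (500, 1_000, 10_000):
    print(f"{minutes:>6} min/month -> ${monthly_premium(minutes):,.2f} extra")
```

At 1,000 minutes the premium is about $40 a month, which is small next to the developer hours needed to tune a custom stack. It only starts to matter as volume climbs into the tens of thousands of minutes.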

FAQ

What is VAPI?

VAPI is a developer platform for building real-time voice AI agents. It manages the full call pipeline — dialling, STT, LLM inference, TTS — and lets you plug in your own providers at each layer. You pay per minute of call time, with no monthly platform fee.

How much does VAPI cost in practice?

VAPI infrastructure runs ~$0.05/min. Add Deepgram Nova-2 (~$0.004/min), GPT-4o (~$0.015/min), ElevenLabs (~$0.010/min), and Twilio carrier (~$0.008/min). Total for a 90-second answered call: $0.065–0.09. Unanswered calls cost almost nothing (~$0.005).

What are the main problems with VAPI?

Three main issues: endpointing (agent cuts in too early — fix with silenceDurationMs: 700), occasional latency spikes to 1.5–2s, and documentation gaps where features are only discoverable through the Discord community.

Is VAPI better than Bland.ai or Retell?

VAPI is best for developers running high volume who want full stack control and lowest cost. Bland.ai is fastest to set up (no-code). Retell has the best analytics dashboard for agencies. There's no universally "better" option; it depends on your technical level and volume.

What STT model works best for Australian English?

Deepgram Nova-2 with language: "en-AU". It handles Australian accents better than Whisper, has lower latency (~150ms vs ~300ms), and supports real-time streaming rather than batch processing. This is a meaningful quality difference on a live call.
