OpenAI launches GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper in its API

On 7 May 2026, OpenAI added three new models to its Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. Each one is built for a specific job: reasoning-heavy voice agents, live speech translation, and streaming transcription, respectively. Together they change what developers can reasonably build with voice AI today.

These are API-only releases: none of the models is available in ChatGPT yet. The launch is aimed squarely at developers and businesses building voice-first products.

Why three separate models?

Building voice AI has historically meant stitching together a pipeline: speech-to-text, then a language model, then text-to-speech. That chain introduces latency at every join, and the complexity of managing three separate systems adds up fast.

OpenAI’s approach here is to split responsibilities more cleanly across purpose-built models rather than using one model for everything. A model optimised for real-time translation doesn’t need the same architecture as one handling multi-turn reasoning. Keeping them separate means each one can be faster and cheaper for its specific task. You only pull out the full reasoning engine when you actually need it.
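One way to picture "only pull out the full reasoning engine when you need it" is a small routing table on the application side. This is a minimal sketch: the model IDs are the ones listed later in this article, while the task categories and the `pick_model` helper are hypothetical names invented for illustration.

```python
from enum import Enum


class VoiceTask(Enum):
    """Hypothetical task categories, for illustration only."""
    MULTI_STEP_AGENT = "multi_step_agent"
    LIVE_TRANSLATION = "live_translation"
    TRANSCRIPTION = "transcription"


# Model IDs as given in the article; the task-to-model mapping is an assumption.
MODEL_FOR_TASK = {
    VoiceTask.MULTI_STEP_AGENT: "gpt-realtime-2",
    VoiceTask.LIVE_TRANSLATION: "gpt-realtime-translate",
    VoiceTask.TRANSCRIPTION: "gpt-realtime-whisper",
}


def pick_model(task: VoiceTask) -> str:
    """Return the purpose-built model ID for a given task category."""
    return MODEL_FOR_TASK[task]


print(pick_model(VoiceTask.TRANSCRIPTION))
```

Keeping the mapping in one place means the more expensive reasoning model is only selected for the task category that actually needs it.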

GPT-Realtime-2: the reasoning voice model

GPT-Realtime-2 is the flagship. It brings GPT-5-class reasoning to voice, which is a meaningful step up from the previous GPT-Realtime-1.5.

The context window has grown from 32K to 128K tokens, so longer conversations and more complex tasks stay coherent without the model losing track of earlier context. Reasoning effort is now adjustable, with five levels from minimal to xhigh, defaulting to low. That gives developers control over the cost-versus-capability trade-off depending on what a given interaction actually needs.
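As a rough illustration of that trade-off, an application could choose an effort level per turn instead of fixing one for the whole session. The sketch below is an assumption throughout: the article names only "minimal", "low", "high", and "xhigh", so "medium" is a guessed name for the fifth level, and the scoring heuristic is hypothetical rather than anything OpenAI prescribes.

```python
# "minimal", "low", "high" and "xhigh" appear in the article;
# "medium" is an assumed name for the remaining level.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")
DEFAULT_EFFORT = "low"  # the default stated in the article


def effort_for_turn(planned_steps: int, uses_tools: bool) -> str:
    """Pick a reasoning effort for one turn using a hypothetical heuristic.

    Each reasoning step beyond the first, and any tool use, raises the
    level by one, capped at the highest available level.
    """
    score = max(planned_steps - 1, 0) + (1 if uses_tools else 0)
    return EFFORT_LEVELS[min(score, len(EFFORT_LEVELS) - 1)]


print(effort_for_turn(planned_steps=1, uses_tools=False))
print(effort_for_turn(planned_steps=3, uses_tools=True))
```

How the chosen level is actually passed to the API is not covered in this article, so the sketch stops at selecting a value.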

There are also some practical improvements to how the model handles the messy realities of voice conversation. It can say things like “let me check that” before responding, call multiple tools simultaneously, make those actions audible with phrases like “checking your calendar”, and recover gracefully when something goes wrong rather than going silent.

The benchmark numbers back it up: on Big Bench Audio, GPT-Realtime-2 at high reasoning hit 96.6% accuracy compared to 81.4% for GPT-Realtime-1.5. On Audio MultiChallenge, the xhigh setting reached a 48.5% average pass rate, up from 34.7%.

What this means for you: If you’re building a voice agent that handles anything more than simple commands, such as booking workflows, customer support, or anything requiring multi-step reasoning, GPT-Realtime-2 is the model to reach for. The adjustable reasoning effort also means you’re not paying for heavy computation on every single turn.

Pricing sits at $32 per million audio input tokens and $64 per million audio output tokens, with cached input tokens at $0.40 per million.
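To see how those rates combine, here is a small cost estimator. It is a sketch under stated assumptions: the three prices come from the paragraph above, but the function name is hypothetical, the token counts in the usage line are made up for illustration, and it assumes cached tokens are billed separately from the regular input count.

```python
from decimal import Decimal

# USD per million tokens, as quoted in the article.
AUDIO_INPUT_PER_M = Decimal("32")
AUDIO_OUTPUT_PER_M = Decimal("64")
CACHED_INPUT_PER_M = Decimal("0.40")
MILLION = Decimal(1_000_000)


def session_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> Decimal:
    """Estimate session cost in USD.

    Assumption: input_tokens excludes cached tokens, which are billed
    at the cached rate instead.
    """
    total = (
        input_tokens * AUDIO_INPUT_PER_M
        + output_tokens * AUDIO_OUTPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
    )
    return total / MILLION


# Illustrative token counts only, not measurements.
print(session_cost(input_tokens=10_000, output_tokens=5_000, cached_tokens=20_000))
```

Because output tokens cost twice as much as input tokens, a session where the model talks more than the user is billed mostly for output.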

GPT-Realtime-Translate: live speech translation

GPT-Realtime-Translate handles speech-to-speech translation in real time. It processes spoken input continuously across 70+ languages and outputs translated speech in 13 languages, without requiring speakers to pause or complete full sentences.

It can also provide real-time transcription alongside the translated audio, which makes it useful for scenarios where you need both a written record and spoken output.

Early adopters give a good sense of the use cases. Vimeo demoed it translating a product education video live as it played, so global customers could hear updates in their preferred language without waiting for a separate dubbed version. Deutsche Telekom is testing it for multilingual customer interactions. BolnaAI, building for Indian language markets, reports 12.5% lower word error rates on Hindi, Tamil, and Telugu compared to its previous approach. Priceline is working toward voice-driven travel management where real-time translation is part of the on-the-ground experience.

What this means for you: If your product serves users across languages, this is the most direct path to live voice translation without building your own pipeline. At $0.034 per minute, it’s priced to be usable at scale.

GPT-Realtime-Whisper: streaming transcription

GPT-Realtime-Whisper is a streaming speech-to-text model. It transcribes audio as people speak rather than waiting for a full utterance, which makes real-time products feel more responsive.

The obvious applications are live captioning, meeting notes that update mid-conversation, and voice assistants that need to act on partial input. It also fits post-call workflows in sectors like customer support, healthcare, and sales where a transcript needs to be ready the moment a call ends.
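Acting on partial input could look roughly like this on the client side. This is a minimal sketch, not the API's actual event format: the `{"type": "delta" | "done", "text": ...}` event shape and the `LiveTranscript` class are hypothetical stand-ins for whatever the streaming interface delivers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LiveTranscript:
    """Accumulates partial transcript chunks as they stream in."""
    parts: List[str] = field(default_factory=list)

    @property
    def partial(self) -> str:
        """Text heard so far in the current utterance."""
        return "".join(self.parts)

    def on_event(self, event: dict) -> Optional[str]:
        """Handle one event; return the final text when the utterance ends.

        Hypothetical event shape: {"type": "delta", "text": "..."} for a
        partial chunk and {"type": "done"} at the end of an utterance.
        """
        if event["type"] == "delta":
            self.parts.append(event["text"])
            return None
        if event["type"] == "done":
            final = self.partial
            self.parts.clear()
            return final
        return None  # ignore unknown event types


transcript = LiveTranscript()
for ev in ({"type": "delta", "text": "book a "}, {"type": "delta", "text": "table"}):
    transcript.on_event(ev)
    print(transcript.partial)  # captions can update after every chunk
```

The `partial` property is what a live caption or an eager voice assistant would read from, while the return value of the final event suits post-call workflows that need the complete transcript.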

What this means for you: If your product already relies on Whisper for transcription and latency matters, Realtime-Whisper is the upgrade. At $0.017 per minute, it’s the most affordable of the three models and a low-friction swap for transcription-heavy applications.
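The per-minute prices for the translation and transcription models make cost estimates simple arithmetic. The sketch below uses the two rates quoted in this article; the function name is hypothetical, and it assumes billing scales linearly with audio duration, which the article does not confirm.

```python
from decimal import Decimal

# USD per minute of audio, as quoted in the article.
PER_MINUTE_USD = {
    "gpt-realtime-translate": Decimal("0.034"),
    "gpt-realtime-whisper": Decimal("0.017"),
}


def audio_cost(model: str, seconds: int) -> Decimal:
    """Estimate cost in USD, assuming linear billing by duration."""
    return PER_MINUTE_USD[model] * Decimal(seconds) / Decimal(60)


# A 30-minute call transcribed, and a 60-minute session translated.
print(audio_cost("gpt-realtime-whisper", 30 * 60))
print(audio_cost("gpt-realtime-translate", 60 * 60))
```

At these rates, translation costs exactly twice as much per minute as transcription.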

Two new voices and a graduation from beta

Two new voices, Cedar and Marin, are available exclusively through this API release.

More significantly, the Realtime API officially exits beta and is now generally available. For developers who were holding off on building production systems on it, general availability is the signal they were waiting for.

The architecture underneath

GPT-Realtime-2 and GPT-Realtime-Translate use an audio-in, audio-out approach. The model receives raw audio and responds with raw audio, with no intermediate conversion to text. That matters because the conversion step is where tone, pacing, and emotional nuance tend to get lost. Keeping everything in the audio domain preserves those qualities. GPT-Realtime-Whisper is the exception by design: as a speech-to-text model, it takes audio in and produces text.

The API uses persistent WebSocket connections, and all three model IDs are now live in the OpenAI API docs: gpt-realtime-2, gpt-realtime-translate, and gpt-realtime-whisper. You can test them in the Playground today.
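Opening a session comes down to a WebSocket URL, an auth header, and JSON events sent over the connection. The sketch below only prepares those pieces, using the standard library. It assumes the new models are served from the same `wss://api.openai.com/v1/realtime` endpoint pattern and accept the same `session.update` client event as earlier Realtime API versions; both the helper names and the payload fields shown are assumptions to check against the current API docs.

```python
import json
from typing import Dict, Tuple
from urllib.parse import urlencode

# Assumption: same endpoint pattern as earlier Realtime API versions.
REALTIME_URL = "wss://api.openai.com/v1/realtime"


def connection_params(model: str, api_key: str) -> Tuple[str, Dict[str, str]]:
    """Build the WebSocket URL and headers for a given model ID."""
    url = f"{REALTIME_URL}?{urlencode({'model': model})}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


def session_update(instructions: str) -> str:
    """Serialize a session.update client event.

    Assumption: the payload fields mirror earlier Realtime API versions.
    """
    event = {"type": "session.update", "session": {"instructions": instructions}}
    return json.dumps(event)


url, headers = connection_params("gpt-realtime-2", "YOUR_API_KEY")
print(url)
print(session_update("You are a concise booking assistant."))
```

The returned URL and headers can be handed to whichever WebSocket client library the application already uses, and the serialized event is sent as a text frame once the connection is open.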

OpenAI has also published cookbook guides for combining the models, including how to pair Realtime-Translate with Realtime-Whisper for end-to-end pipelines, such as a live interpretation app that uses Whisper for transcription and Translate for speech output.

Safety and compliance

The Realtime API includes active classifiers that monitor live sessions and can stop conversations that violate safety rules. Developers can layer on additional guardrails via the Agents SDK. Policies prohibit spam, deception, and harmful redistribution of outputs, and developers are required to disclose when users are interacting with AI unless it’s contextually obvious. EU Data Residency is supported for regional compliance needs.

For most developers, the practical takeaway is straightforward. OpenAI’s realtime voice offering went from one general-purpose model to three models with clear purposes, reasonable pricing, and a production-ready API. The question is no longer whether you can build a usable voice product; it’s which of these models fits the specific job you’re trying to do.