OpenAI brings GPT-5-level reasoning to its speech models
OpenAI launched three new speech-focused models on Thursday: GPT-Realtime-2, its first voice model with what the company calls “GPT-5-class reasoning”; GPT-Realtime-Translate for live translations; and GPT-Realtime-Whisper for fast transcriptions.

GPT-Realtime-2

The first GPT-Realtime model launched in the summer of 2025, focused on providing a voice-native model that could interact with users far more naturally than previous models. OpenAI last updated GPT-Realtime in February with the launch of version 1.5. Now, with GPT-Realtime-2, the company promises an 11% performance improvement over GPT-Realtime-1.5.

OpenAI also extended the context window from 32,000 tokens, which was surely a pain point for developers, to 128,000 tokens. This lets the model sustain longer sessions and handle more complex interactions, which is especially important in the voice-agent workflows OpenAI is targeting.

But what really matters here is that OpenAI is bringing far more powerful reasoning to this class of models. As OpenAI notes in its announcement, “building useful voice products takes more than fast turn-taking and a natural-sounding voice. A voice agent needs to understand what someone means, keep track of context, recover when a request changes, use tools while the conversation continues, and respond in a way that feels appropriate to the moment.”

With this update, developers can, for example, have Realtime-2 open responses with short preambles such as “let me check that,” so users know the agent is working. The model can also make parallel tool calls, just like most modern agentic systems, and tell the user what it is doing.

By default, the model’s reasoning effort is set to low; developers can choose between minimal, low, medium, high, and xhigh. Developers will pay $32 per 1 million audio input tokens and $64 per 1 million output tokens.
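Those per-token prices make it easy to estimate what a session would cost. Here is a minimal sketch using the figures quoted above; the token counts in the example are hypothetical, not measurements of actual usage.

```python
# Rough cost estimator for GPT-Realtime-2 audio usage, based on the
# announced prices: $32 per 1M audio input tokens, $64 per 1M output tokens.

INPUT_PRICE_PER_M = 32.0   # USD per 1 million audio input tokens
OUTPUT_PRICE_PER_M = 64.0  # USD per 1 million output tokens

def session_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one session from its token counts."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Hypothetical session: 50,000 audio input tokens, 20,000 output tokens.
print(f"${session_cost_usd(50_000, 20_000):.2f}")  # $2.88
```

Note that because output tokens cost twice as much as input tokens, chatty agents that speak at length will be disproportionately expensive.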
That’s the same price the company charged for GPT-Realtime-1.5.

GPT-Realtime-Translate

As the name implies, GPT-Realtime-Translate is OpenAI’s model for live translations. It can handle more than 70 input languages and translate them into 13 output languages. While OpenAI’s speech models could already handle some translation tasks, this is the first time the company is offering a dedicated model for this use case. In the API, developers will pay $0.034 per minute to use this capability.

GPT-Realtime-Whisper

GPT-Realtime-Whisper, meanwhile, is OpenAI’s latest streaming transcription model. Whisper has long been the company’s brand for speech-to-text models, and it has remained one of the most popular open-weight models for this task since the first version launched back in 2022. The open model hasn’t seen an update in quite a while, though OpenAI has long offered transcription models through its API with gpt-4o-transcribe and gpt-4o-mini-transcribe. This model is priced at $0.017 per minute.

Building new kinds of apps

In its announcement, OpenAI stresses that it sees three patterns in how developers use voice AI: voice-to-action, which lets users describe what they need and then has the system perform the task; system-to-voice, for having the AI provide voice-based guidance (“Your inbound flight is delayed, but you can still make your connection.”); and voice-to-voice, arguably the most complex of the three, for building live, interactive conversations across tasks and changing contexts.
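Since both of the new specialized models are billed per minute of audio rather than per token, comparing their costs is simple arithmetic. A minimal sketch using the per-minute prices quoted above (the model-name keys here are just labels for this example, not official API identifiers):

```python
# Per-minute prices quoted in the announcement (USD).
PRICE_PER_MINUTE = {
    "realtime-translate": 0.034,  # live translation
    "realtime-whisper": 0.017,    # streaming transcription
}

def minutes_cost_usd(model: str, minutes: float) -> float:
    """Cost of running `model` for a given number of audio minutes."""
    return round(PRICE_PER_MINUTE[model] * minutes, 2)

# Example: one hour of live translation vs. one hour of transcription.
print(minutes_cost_usd("realtime-translate", 60))  # 2.04
print(minutes_cost_usd("realtime-whisper", 60))    # 1.02
```

At these rates, translation costs exactly twice as much as transcription, mirroring the 2:1 output-to-input price ratio on GPT-Realtime-2’s token pricing.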