
OpenAI Introduces WebSocket-Based Execution Mode to Reduce Latency in Agentic Workflows

InfoQ
Leela Kumili
May 7, 2026
OpenAI has introduced a WebSocket-based execution mode for its Responses API to improve the performance of agentic workflows used in coding agents and real-time AI systems. The change replaces the traditional HTTP request-response pattern with a persistent, bidirectional connection between client and server, targeting latency and coordination overhead in multi-step reasoning workflows. According to OpenAI, early production use shows up to 40% latency reduction and improved throughput in high-concurrency scenarios.

The update addresses a growing bottleneck in agentic systems: each step in a workflow, such as a tool call, intermediate reasoning, or a follow-up query, previously required a separate HTTP request. As inference speeds improved, these repeated network round trips became a dominant source of latency and operational complexity.

Traditional HTTP Flow (Source: OpenAI Blog Post)

The WebSocket-based execution mode uses a long-lived, bidirectional connection to enable continuous data exchange without repeated handshakes. This supports streaming responses, faster tool execution, and more efficient coordination of multi-step workflows. It aligns with event-driven design patterns in distributed systems, where maintaining state across interactions improves responsiveness and throughput. The change reflects a broader focus on the transport layer in agentic systems, where communication patterns and connection management influence overall performance, as discussed in analyses of the AI agent transport layer.

Ofek Shaked, a vibe coder, described the change:

"WebSockets for agent state is such an obvious but huge win. No more cold starts killing your multi-tool chains."

OpenAI reported up to 40% latency reduction in early production use, along with sustained throughput of around 1,000 transactions per second and bursts of up to 4,000 TPS.
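The round-trip argument above can be made concrete with a back-of-envelope model. The timing figures below (handshake, round trip, inference) are illustrative assumptions, not numbers from OpenAI: over plain HTTP every step pays a fresh connection handshake, while a persistent WebSocket pays it once.

```python
def http_total_ms(steps, handshake_ms=100, rtt_ms=50, inference_ms=200):
    """Per-request HTTP: every step opens a new connection,
    so each pays handshake + round trip + inference."""
    return steps * (handshake_ms + rtt_ms + inference_ms)

def websocket_total_ms(steps, handshake_ms=100, rtt_ms=50, inference_ms=200):
    """Persistent WebSocket: one upgrade handshake up front,
    then each step pays only a round trip + inference."""
    return handshake_ms + steps * (rtt_ms + inference_ms)

for steps in (1, 5, 20):
    http_t = http_total_ms(steps)
    ws_t = websocket_total_ms(steps)
    saving = 100 * (http_t - ws_t) / http_t
    print(f"{steps:>2} steps: HTTP {http_t} ms, WebSocket {ws_t} ms, saving {saving:.0f}%")
```

The model also shows why the gain grows with workflow depth: the handshake cost is amortized over the whole session, so the relative saving increases as multi-step chains get longer, and shrinks further once inference itself speeds up.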
These results indicate that transport-level optimizations can significantly affect end-to-end AI system performance alongside model-level improvements. Gabriel Chua, a DX engineer at OpenAI, stated:

"You can warm up the connection by sending your system prompt and tool definitions first. It's Zero Data Retention (ZDR) compatible."

Adoption has been immediate among developer tooling and coding agent platforms. Vercel integrated the WebSocket mode into its AI SDK and reported up to 40% latency reduction. Cline observed a 39% improvement in multi-file workflows, while Cursor reported gains of up to 30%. These results highlight how system-level optimizations outside the model itself are increasingly shaping real-world AI performance.

Agent Workflow Evolution with Persistent Sessions (Source: OpenAI Blog Post)

From an implementation perspective, developers integrate the WebSocket mode by replacing multiple HTTP calls with a single persistent session. This reduces repeated connection setup and simplifies orchestration logic across multi-step workflows. It also improves support for streaming use cases such as incremental code generation and interactive reasoning, where partial outputs can be consumed as they are produced. Kevin Cho, an engineer at Microsoft, noted that the approach reflects "going back to the original software stack problems: websockets and stateful connections."

The shift introduces new system design considerations, including connection lifecycle management, backpressure under high concurrency, and reliability in distributed systems, aligning with established stateful system patterns. OpenAI released the feature in alpha after a two-month cycle to selected partners, including Codex. Codex has since migrated most Responses API traffic to WebSocket mode, indicating production readiness.
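The streaming-consumption pattern described above can be sketched generically: the client iterates over events as they arrive on the persistent session and acts on partial output immediately, instead of blocking until a full HTTP response completes. The event shapes and the stand-in generator below are hypothetical; a real client would read these events off a live WebSocket rather than from a simulated source.

```python
import asyncio

async def fake_session_events():
    """Stand-in for events arriving over a persistent session
    (hypothetical event shapes, simulated instead of a live socket)."""
    for chunk in ["def add(a, b):", "\n    return a + b", "\n"]:
        await asyncio.sleep(0)  # yield control, as real network reads would
        yield {"type": "output_text.delta", "delta": chunk}
    yield {"type": "response.completed"}

async def consume(events):
    """Consume partial outputs as they are produced, rather than
    waiting for the whole response as in a request-response flow."""
    parts = []
    async for event in events:
        if event["type"] == "output_text.delta":
            parts.append(event["delta"])  # partial code is usable right away
        elif event["type"] == "response.completed":
            break
    return "".join(parts)

print(asyncio.run(consume(fake_session_events())))
```

For incremental code generation, this shape lets an editor render or lint each delta as it lands; the same loop is where the lifecycle and backpressure concerns mentioned above would live, since a slow consumer must not let unread events pile up on the connection.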