Beyond Simple API Requests: How OpenAI’s WebSocket Mode Changes the Game for Low Latency Voice Powered AI Experiences

In the world of Generative AI, latency is the ultimate killer of immersion. Until recently, building a voice-enabled AI agent felt like assembling a Rube Goldberg machine: you’d pipe audio to a Speech-to-Text (STT) model, send the transcript to a Large Language Model (LLM), and finally shuttle text to a Text-to-Speech (TTS) engine. Each hop added hundreds of milliseconds of lag.

OpenAI has collapsed this stack with the Realtime API. By offering a dedicated WebSocket mode, the platform provides a direct, persistent pipe into GPT-4o’s native multimodal capabilities. This represents a fundamental shift from stateless request-response cycles to stateful, event-driven streaming.

The Protocol Shift: Why WebSockets?

The industry has long relied on standard HTTP POST requests. While streaming text via Server-Sent Events (SSE) made LLMs feel faster, it remained a one-way street once initiated. The Realtime API utilizes the WebSocket protocol (wss://), providing a full-duplex communication channel.

For a developer building a voice assistant, this means the model can ‘listen’ and ‘talk’ simultaneously over a single connection. To connect, clients point to:

wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview
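As a minimal sketch, the connection is an authenticated WebSocket upgrade against that URL. The helper below only builds the URL and headers; the commented-out portion shows one common way to open the socket with the third-party `websockets` package (the keyword for passing headers varies between library versions), and `OPENAI_API_KEY` is assumed to be set in your environment.

```python
import os

# Endpoint from the article; the model is selected via the query string.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

def build_headers(api_key: str) -> dict:
    """Headers that authenticate the WebSocket upgrade request."""
    return {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "realtime=v1",  # beta header used by the preview API
    }

# Sketch of actually connecting (requires `pip install websockets`):
# async def main():
#     import websockets
#     async with websockets.connect(
#         REALTIME_URL,
#         additional_headers=build_headers(os.environ["OPENAI_API_KEY"]),
#     ) as ws:
#         ...  # exchange JSON event frames here
```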

The Core Architecture: Sessions, Responses, and Items

Understanding the Realtime API requires mastering three specific entities:

The Session: The global configuration. Through a session.update event, engineers define the system prompt, voice (e.g., alloy, ash, coral), and audio formats.

The Item: Every conversation element—a user’s speech, a model’s output, or a tool call—is an item stored in the server-side conversation state.

The Response: A command to act. Sending a response.create event tells the server to examine the conversation state and generate an answer.
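The session and response entities above map directly to JSON events sent over the socket as text frames. The sketch below shows the shape of a `session.update` and a `response.create` event; the instructions string and voice choice are illustrative, not prescribed values.

```python
import json

# Configure the session: system prompt, voice, and audio formats.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": "You are a concise voice assistant.",  # example prompt
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
    },
}

# Ask the server to examine conversation state and generate an answer.
response_create = {"type": "response.create"}

# Events travel over the WebSocket as serialized JSON:
wire_frame = json.dumps(session_update)
```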

Audio Engineering: PCM16 and G.711

OpenAI’s WebSocket mode operates on raw audio frames encoded in Base64. It supports two primary formats:

PCM16: 16-bit Pulse Code Modulation at 24kHz (ideal for high-fidelity apps).

G.711: The 8kHz telephony standard (μ-law and A-law), perfect for VoIP and SIP integrations.

Developers must stream audio in small chunks (typically 20–100 ms) via input_audio_buffer.append events. The model then streams back response.output_audio.delta events for immediate playback.
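The chunk sizing follows directly from the format: at 24 kHz mono PCM16, each millisecond of audio is 48 bytes. A sketch of the chunk math and the append event wrapper (the raw PCM bytes here stand in for real microphone capture):

```python
import base64

SAMPLE_RATE = 24_000   # PCM16 at 24 kHz
BYTES_PER_SAMPLE = 2   # 16-bit mono

def chunk_bytes(duration_ms: int) -> int:
    """Size in bytes of one audio chunk of the given duration."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * duration_ms // 1000

def append_event(pcm_chunk: bytes) -> dict:
    """Wrap a raw PCM16 chunk in an input_audio_buffer.append event."""
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    }

# A 50 ms chunk is 24_000 samples/s * 2 bytes * 0.05 s = 2,400 bytes.
```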

VAD: From Silence to Semantics

A major update is the expansion of Voice Activity Detection (VAD). While standard server_vad uses silence thresholds, the new semantic_vad uses a classifier to understand if a user is truly finished or just pausing for thought. This prevents the AI from awkwardly interrupting a user who is mid-sentence, a common ‘uncanny valley’ issue in earlier voice AI.
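Turn detection is itself part of the session configuration. The fragments below sketch both modes via `session.update`; the `threshold` and `silence_duration_ms` values shown for `server_vad` are illustrative, not recommended settings.

```python
# Silence-based turn detection: a turn ends after a fixed quiet period.
server_vad_config = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,            # illustrative energy threshold
            "silence_duration_ms": 500,  # illustrative pause length
        }
    },
}

# Semantic turn detection: a classifier judges whether the user is done.
semantic_vad_config = {
    "type": "session.update",
    "session": {
        "turn_detection": {"type": "semantic_vad"},
    },
}
```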

The Event-Driven Workflow

Working with WebSockets is inherently asynchronous. Instead of waiting for a single response, you listen for a cascade of server events:

input_audio_buffer.speech_started: The model hears the user.

response.output_audio.delta: Audio snippets are ready to play.

response.output_audio_transcript.delta: Text transcripts arrive in real-time.

conversation.item.truncate: Used when a user interrupts, allowing the client to tell the server exactly where to “cut” the model’s memory to match what the user actually heard.
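The event cascade above naturally maps to a dispatch loop on the client. A minimal sketch, where `play_audio` and the incoming `event` dicts are placeholders for your audio layer and the JSON frames read off the socket:

```python
import base64

def handle_event(event: dict, play_audio, transcript: list) -> None:
    """Route one server event to the appropriate client-side action."""
    etype = event.get("type")
    if etype == "input_audio_buffer.speech_started":
        # User started talking: a good point to halt local playback
        # (and, on interruption, send conversation.item.truncate).
        pass
    elif etype == "response.output_audio.delta":
        # Audio deltas arrive Base64-encoded; decode before playback.
        play_audio(base64.b64decode(event["delta"]))
    elif etype == "response.output_audio_transcript.delta":
        # Transcript text streams in parallel with the audio.
        transcript.append(event["delta"])
```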

Key Takeaways

Full-Duplex, State-Based Communication: Unlike traditional stateless REST APIs, the WebSocket protocol (wss://) enables a persistent, bidirectional connection. This allows the model to ‘listen’ and ‘speak’ simultaneously while maintaining a live Session state, eliminating the need to resend the entire conversation history with every turn.

Native Multimodal Processing: The API bypasses the STT → LLM → TTS pipeline. By processing audio natively, GPT-4o reduces latency and can perceive and generate nuanced paralinguistic features like tone, emotion, and inflection that are typically lost in text transcription.

Granular Event Control: The architecture relies on specific JSON events for real-time interaction. Key events include the client-sent input_audio_buffer.append for streaming audio chunks to the model and the server-sent response.output_audio.delta for receiving audio snippets, allowing for immediate, low-latency playback.

Advanced Voice Activity Detection (VAD): The transition from simple silence-based server_vad to semantic_vad allows the model to distinguish between a user pausing for thought and a user finishing their sentence. This prevents awkward interruptions and creates a more natural conversational flow.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.


