Realtime
For voice assistants you can ship today, see Speech → Voice loop. The composed STT → LLM → TTS pipeline covers the common case and runs on the shipped speech primitives.
This page is about the other archetype: one long-lived duplex session where the model owns turn-taking, can interrupt itself on detected user speech, and can take camera frames alongside audio. That primitive isn’t shipped yet.
What Realtime adds
A pipeline of Transcriber → LanguageModel → SpeechSynthesizer gives
you most of a voice agent — but it has four properties a native
duplex API will improve on:
- Model-native barge-in. Voice loop’s interrupt fires on stop-words detected client-side from finals. A native session lets the model decide it’s been interrupted (server-side VAD on the input audio) and trim its own response — no keyword list, no client detection.
- Mid-utterance tool calls. In the pipeline, a turn is atomic: STT → LLM → TTS, then the next turn. A native session can interleave tool calls into the same continuous audio stream.
- Sub-200 ms turn-taking. The pipeline pays for every boundary (STT-final, LLM TTFT, TTS first-byte). A native session amortizes some of that overhead by keeping a single WebSocket open.
- Camera-in streams. When a provider ships realtime vision, pointing a phone camera at something and getting a spoken answer becomes one session — not a pipeline of “snapshot every N seconds → multimodal LM → TTS.”
If you don’t need any of those, the voice-loop pipeline is the simpler answer and exercises the same primitives you already use.
Coming soon
When this lands, @effect-uai/core will ship a RealtimeSession
service. Likely shape: a session value carrying a Queue<RealtimeInput>
you push into (audio frames, video frames, text, control) and a
Stream<RealtimeEvent> you consume from (audio chunks, transcript
deltas, tool calls, turn boundaries).
Provider candidates:
- OpenAI Realtime — WebSocket and WebRTC transports,
gpt-realtimefamily. Audio in / audio out today; vision input on the roadmap. - Google Gemini Live — WebSocket, audio + video in / audio + text out. The closest thing to “point your camera, get an answer” today.
The right primitives — backpressure, cancellation, interrupt semantics — get designed alongside the first integration, not in advance.
See also
- Voice loop — the ship-today voice agent.
- Speech — the one-direction primitives the voice loop is built from.