Streaming transcription
Live captions arrive as the user speaks, not after they finish.
This recipe connects a browser microphone to a realtime transcription
provider. The browser sends small PCM frames to a Bun server; the
server turns those frames into a Stream<Uint8Array> and gets
transcript events back.
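The PCM frames mentioned above usually come from converting the Float32 samples an AudioWorklet delivers into 16-bit little-endian integers. Here is a minimal sketch of that conversion, assuming mono input. `floatToPcm16` is a hypothetical helper name and is not taken from the recipe's source.

```typescript
// Hypothetical helper: convert Web Audio Float32 samples (range -1..1)
// into raw pcm_s16le bytes that can be sent over a WebSocket.
function floatToPcm16(samples: Float32Array): Uint8Array {
  const bytes = new Uint8Array(samples.length * 2)
  const view = new DataView(bytes.buffer)
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0))
    // Scale asymmetrically: -1 maps to -32768, +1 maps to 32767.
    const int16 = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff)
    view.setInt16(i * 2, int16, true) // true = little-endian
  }
  return bytes
}
```

Each sample becomes two bytes, so a 20 ms frame at 16 kHz mono is 320 samples, or 640 bytes.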
Scenario. You’re building a captioning UI, a voice-search box, or the front half of a voice assistant. You want partial guesses to appear dimmed while the user is mid-sentence and finals to commit once they pause.
The Shape
streamTranscriptionFrom is live STT as a stream transformation:
    import { Stream } from "effect"
    import * as Transcriber from "@effect-uai/core/Transcriber"

    const transcripts = micFrames.pipe(
      Transcriber.streamTranscriptionFrom({
        model: "scribe_v2_realtime",
        inputFormat: { container: "raw", encoding: "pcm_s16le", sampleRate: 16000, channels: 1 },
        interimResults: true,
      }),
    )

    // transcripts : Stream<TranscriptEvent, AiError>
    // each event is "partial" | "final" | "speech-started" | ...

The recipe UI renders partial events as tentative text and final
events as committed transcript lines. index.ts is provider-agnostic;
run-bun.ts chooses OpenAI Realtime or ElevenLabs.
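One way to picture the partial/final rendering rule is as a small reducer over transcript events. The sketch below is illustrative only. The `CaptionEvent` shape (a `type` tag plus `text`) is an assumption made for this example and may not match the library's actual `TranscriptEvent` fields. `reduceCaption` is a hypothetical name.

```typescript
// Assumed event shape for illustration; the real TranscriptEvent may differ.
type CaptionEvent =
  | { type: "partial"; text: string }
  | { type: "final"; text: string }
  | { type: "speech-started" }

interface CaptionState {
  committed: string[] // final lines, rendered as committed text
  tentative: string // latest partial guess, rendered dimmed
}

const emptyCaptions: CaptionState = { committed: [], tentative: "" }

function reduceCaption(state: CaptionState, event: CaptionEvent): CaptionState {
  switch (event.type) {
    case "partial":
      // A newer partial replaces the previous guess.
      return { ...state, tentative: event.text }
    case "final":
      // A final commits a line and clears the tentative text.
      return { committed: [...state.committed, event.text], tentative: "" }
    default:
      // Other events (such as speech-started) leave the captions unchanged.
      return state
  }
}
```

Keeping the reducer pure means the UI can re-render from `CaptionState` on every WebSocket message without tracking DOM state separately.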
Run it
    # Default: OpenAI Realtime (24 kHz pcm16)
    OPENAI_API_KEY=sk-... bun recipes/streaming-transcription/run-bun.ts

    # ElevenLabs Scribe v2 Realtime (16 kHz pcm16)
    ELEVENLABS_API_KEY=... bun recipes/streaming-transcription/run-bun.ts --provider elevenlabs

Open http://localhost:3000, click Start, allow mic access, and talk. Partial transcripts appear dimmed; finals commit and stay bold.
Run with bun, not pnpm tsx: the runner uses the Bun.serve and Bun.build globals.
Env vars: OPENAI_API_KEY / ELEVENLABS_API_KEY depending on
provider; PORT optional (defaults to 3000).
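The environment rules above could be resolved roughly as follows. This is a sketch, not the recipe's run-bun.ts. `resolveRunConfig` and its return shape are hypothetical. Only the `--provider elevenlabs` flag, the variable names, and the port default come from the text.

```typescript
type Provider = "openai" | "elevenlabs"

interface RunConfig {
  provider: Provider
  apiKey: string
  port: number
}

// Hypothetical helper; call it with process.argv.slice(2) and process.env.
function resolveRunConfig(
  argv: readonly string[],
  env: Record<string, string | undefined>,
): RunConfig {
  const flagIndex = argv.indexOf("--provider")
  const requested = flagIndex >= 0 ? argv[flagIndex + 1] : undefined
  // OpenAI Realtime is the default when no provider flag is given.
  const provider: Provider = requested === "elevenlabs" ? "elevenlabs" : "openai"
  const keyName = provider === "openai" ? "OPENAI_API_KEY" : "ELEVENLABS_API_KEY"
  const apiKey = env[keyName]
  if (!apiKey) throw new Error(`Missing ${keyName} for provider "${provider}"`)
  const port = env.PORT ? Number(env.PORT) : 3000
  if (!Number.isInteger(port) || port <= 0) throw new Error(`Invalid PORT: ${env.PORT}`)
  return { provider, apiKey, port }
}
```

Failing fast on a missing key gives a clearer error than letting the provider connection reject the first audio frame.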
How The Demo Flows
    [Browser]     getUserMedia → AudioWorklet → WebSocket
                      ↕
    [Bun server]  Stream<Uint8Array> → Transcriber.streamTranscriptionFrom → TranscriptEvent JSON
    [Browser]     renders partial / final transcripts live

The server owns the Effect Layer and the provider connection; the
browser is just a mic-to-WS adapter. Sample rates differ per provider
(OpenAI wants 24 kHz, ElevenLabs wants 16 kHz), so the client fetches
/config before it starts recording.
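Here is a sketch of what the /config exchange might carry. The response shape (`sampleRate`, `encoding`, `channels`) and the helper names are assumptions for illustration. Only the two sample rates and the /config path come from the text.

```typescript
interface AudioConfig {
  sampleRate: number
  encoding: "pcm_s16le"
  channels: 1
}

// Sample rates per provider, as stated above.
function audioConfigFor(provider: "openai" | "elevenlabs"): AudioConfig {
  return {
    sampleRate: provider === "openai" ? 24000 : 16000,
    encoding: "pcm_s16le",
    channels: 1,
  }
}

// Hypothetical body for the /config response: the config serialized as JSON.
function configResponseBody(provider: "openai" | "elevenlabs"): string {
  return JSON.stringify(audioConfigFor(provider))
}

// Bytes in one frame of the given duration, useful when sizing worklet buffers.
function frameBytes(config: AudioConfig, frameMs: number): number {
  const samples = Math.round((config.sampleRate * frameMs) / 1000)
  return samples * config.channels * 2 // 2 bytes per 16-bit sample
}
```

On the client, the fetched `sampleRate` would typically be passed to `new AudioContext({ sampleRate })` so the browser resamples the microphone input before the worklet sees it.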
Provider Fit
Use a provider layer that registers the SttStreaming marker. That is
what keeps a sync-only provider from accidentally being used in a live
mic pipeline.
OpenAI Realtime and ElevenLabs both work here. Gemini’s transcription is sync-only, so it belongs in Basic transcription, not this recipe.
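The general idea behind a capability marker can be sketched in plain TypeScript. This is not how the library implements SttStreaming, which is registered through an Effect Layer. Every name below (`ProviderHandle`, `startLive`, the marker interfaces) is hypothetical and only shows how a type-level marker can reject a sync-only provider at compile time.

```typescript
// Hypothetical marker types; each capability contributes a distinct key.
interface SttSync {
  readonly sttSync: true
}
interface SttStreaming {
  readonly sttStreaming: true
}

interface ProviderHandle<Caps> {
  readonly name: string
  readonly caps: Caps
}

const syncOnly: ProviderHandle<SttSync> = { name: "sync-only", caps: { sttSync: true } }
const realtime: ProviderHandle<SttStreaming> = { name: "realtime", caps: { sttStreaming: true } }

// Only handles whose caps include the streaming marker are accepted.
function startLive(provider: ProviderHandle<SttStreaming>): string {
  return `live transcription via ${provider.name}`
}

const started = startLive(realtime)
// startLive(syncOnly) would be rejected by the compiler,
// because the sttStreaming property is missing from its caps.
```

The mistake surfaces while wiring the pipeline, not on the first audio frame at runtime.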
What This Generalizes To
Live transcription is usually the first half of a larger flow. Pipe
final events into search, commands, meeting notes, or an LLM. For the
full STT → LLM → TTS composition, see Voice loop.
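A minimal sketch of forwarding only final events to a downstream consumer. It uses a plain callback to stay dependency-free. The `CaptionEvent` shape and `forwardFinals` are assumptions for illustration. In the Effect pipeline the equivalent would likely be a `Stream.filter` on the event type.

```typescript
// Assumed event shape for illustration; the real TranscriptEvent may differ.
type CaptionEvent = { type: "partial" | "final" | "speech-started"; text?: string }

// Returns a per-event callback that forwards only non-empty final text.
function forwardFinals(onFinal: (text: string) => void): (event: CaptionEvent) => void {
  return (event) => {
    if (event.type !== "final") return
    const text = (event.text ?? "").trim()
    if (text !== "") onFinal(text)
  }
}

// Usage: collect finals as meeting-note lines.
const notes: string[] = []
const handleEvent = forwardFinals((text) => notes.push(text))
```

Partials are dropped on purpose: downstream consumers such as search or an LLM generally should not act on text the provider may still revise.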
The full source lives next to this README at
index.ts.