Transcription
Caption a podcast, transcribe a meeting, or stream a live mic — they’re the same call with different inputs.
Use transcribe when the audio already exists. Use
streamTranscriptionFrom when audio is arriving over time and the UI
should update while the person is still speaking.
Two Paths
import { transcribe, streamTranscriptionFrom } from "@effect-uai/core/Transcriber"
transcribe(request) // finished audio -> TranscriptResultstreamTranscriptionFrom(audioIn, request) // live audio stream -> TranscriptEvent streamThe sync path only needs the generic Transcriber tag. The streaming
path also needs the SttStreaming
marker, so sync-only providers fail at Effect.provide, not at
runtime.
The Shape
interface TranscriberService { readonly transcribe: (req: CommonTranscribeRequest) => Effect<TranscriptResult, AiError> readonly streamTranscriptionFrom: <E, R>( audioIn: Stream<Uint8Array, E, R>, req: CommonStreamTranscribeRequest, ) => Stream<TranscriptEvent, AiError | E, R>}The cross-provider request bag:
type CommonTranscribeRequest = { readonly audio: AudioSource readonly model: string readonly language?: string // ISO-639-1 / BCP-47 readonly prompt?: string | { readonly terms: ReadonlyArray<string> } readonly diarization?: boolean readonly wordTimestamps?: boolean}
type CommonStreamTranscribeRequest = Omit<CommonTranscribeRequest, "audio"> & { readonly inputFormat: AudioFormat readonly interimResults?: boolean readonly vadEvents?: boolean}Providers ignore options they don’t support, or reject them up front
with AiError.Unsupported / AiError.InvalidRequest — see the
per-provider pages.
Sync — transcribe
Use this for uploads, recordings, podcasts, and batch jobs. audio is
an AudioSource (URL, base64, or bytes); the adapter handles the
provider’s preferred wire form.
import { transcribe } from "@effect-uai/core/Transcriber"
const program = Effect.gen(function* () { const file = yield* readFile("meeting.mp3") return yield* transcribe({ audio: { _tag: "bytes", bytes: file, mimeType: "audio/mpeg" }, model: "gpt-4o-transcribe", language: "en", })})Returns TranscriptResult:
type TranscriptResult = { readonly text: string readonly languageCode?: string readonly durationSeconds?: number readonly words?: ReadonlyArray<WordTimestamp> readonly raw?: unknown}words only appears when wordTimestamps: true was requested and
the provider+model combination supports it. Today that means
OpenAI whisper-1 only — gpt-4o-transcribe and gpt-4o-mini-transcribe
return text-only; Gemini’s prompt-driven STT has no structured timing.
Passing wordTimestamps: true to an unsupporting provider fails with
AiError.Unsupported.
Streaming — streamTranscriptionFrom
Use this for live captions, voice search, or the front half of a voice
assistant. Audio bytes go in as a Stream<Uint8Array>. Transcript
events come out as partials, finals, and optional VAD events.
import { Stream } from "effect"import * as Transcriber from "@effect-uai/core/Transcriber"
const transcripts = micFrames.pipe( Transcriber.streamTranscriptionFrom({ model: "scribe_v2_realtime", inputFormat: { container: "raw", encoding: "pcm_s16le", sampleRate: 16000, channels: 1 }, interimResults: true, }), Stream.filter((e) => e._tag === "final"),)Both data-first (streamTranscriptionFrom(audioIn, req)) and
data-last forms work.
The underlying WebSocket is acquired on first pull and released when the output stream finalizes. You do not manage the scope by hand.
What You Get Back
partial updates are speculative. final is the committed text you
usually append to a transcript, send to search, or feed to an LLM.
type TranscriptEvent = | { readonly _tag: "partial"; readonly text: string; readonly words?: ...; readonly stability?: number } | { readonly _tag: "final"; readonly text: string; readonly words?: ...; readonly languageCode?: string } | { readonly _tag: "speech-started"; readonly atSeconds: number } | { readonly _tag: "utterance-ended"; readonly atSeconds: number } | { readonly _tag: "audio-event"; readonly label: string; ...} // (laughter), (music) — ElevenLabs | { readonly _tag: "metadata"; readonly raw: unknown } | { readonly _tag: "error"; readonly code?: string; readonly message: string }Type guards live next to the union:
import * as Transcript from "@effect-uai/core/Transcript"
if (Transcript.isFinal(event)) { // event.text is the committed transcript}Audio Format
inputFormat declares what’s in the byte stream:
type AudioFormat = { readonly container: "mp3" | "wav" | "ogg" | "opus" | "flac" | "aac" | "webm" | "raw" readonly encoding: "pcm_s16le" | "pcm_f32le" | "pcm_mulaw" | "pcm_alaw" | "mp3" | "opus" | ... readonly sampleRate: 8000 | 16000 | 22050 | 24000 | 32000 | 44100 | 48000 readonly channels?: 1 | 2}Mismatches with what the provider’s wire expects fail up front with
AiError.InvalidRequest. Common targets:
- ElevenLabs Scribe v2 Realtime — 16 kHz pcm s16le mono.
- OpenAI Realtime — 24 kHz pcm s16le mono.
- Inworld realtime — 16 kHz pcm s16le mono.
The browser side of a recipe typically uses an AudioWorklet to
decimate mic audio to the right rate before posting frames over a
WebSocket — see streaming-transcription
for the worklet.
VAD events (speech-started, utterance-ended) require
vadEvents: true and are not emitted by every provider. Non-fatal
provider issues can arrive as _tag: "error" events; fatal failures
still use the Stream error channel.
Next step
- Voice loop — live mic → LLM → TTS, the
flagship use of
streamTranscriptionFrom. - Basic transcription — sync
transcribeagainst OpenAI or Gemini. - Streaming transcription — Bun server bridging a browser mic to the realtime endpoint.
See also
- Synthesis — the other direction.
- Provider specifics: OpenAI, ElevenLabs, Gemini, Inworld.