Skip to content

ElevenLabs

ElevenLabs is the most complete speech surface today: streaming STT over WebSocket, streaming TTS with both chunked-HTTP and incremental-text-in WS, voice cloning supported by ID, all browser-friendly (no special peer deps).

Install

Terminal window
pnpm add @effect-uai/core @effect-uai/elevenlabs effect

Layers

LayerRegistersCapability markers
@effect-uai/elevenlabs/ElevenLabsTranscriberElevenLabsTranscriber + TranscriberSttStreaming
@effect-uai/elevenlabs/ElevenLabsSynthesizerElevenLabsSynthesizer + SpeechSynthesizerTtsIncrementalText
import { Config, Effect, Layer } from "effect"
import { FetchHttpClient } from "effect/unstable/http"
import * as Socket from "effect/unstable/socket/Socket"
import { layer as transcriberLayer } from "@effect-uai/elevenlabs/ElevenLabsTranscriber"
import { layer as synthLayer } from "@effect-uai/elevenlabs/ElevenLabsSynthesizer"
const eleven = Layer.unwrap(
Effect.gen(function* () {
const apiKey = yield* Config.redacted("ELEVENLABS_API_KEY")
return Layer.mergeAll(transcriberLayer({ apiKey }), synthLayer({ apiKey }))
}),
)
const mainLayer = eleven.pipe(
Layer.provide(FetchHttpClient.layer),
Layer.provide(Socket.layerWebSocketConstructorGlobal),
)

The synthesizer’s streamSynthesisFrom (incremental text-in) requires a WebSocketConstructorSocket.layerWebSocketConstructorGlobal binds globalThis.WebSocket, which works in Bun, Node 22+, and browsers.

Models

STT

ModelStreamingNotes
scribe_v2_realtime✓ (WS)The realtime variant — used by streamTranscriptionFrom
scribe_v2Sync variant; not exposed as a separate sync helper today
scribe_v1Legacy

@effect-uai/elevenlabs/ElevenLabsTranscriber registers SttStreaming and is the realtime path; sync transcription via the ElevenLabs REST endpoint isn’t wired up yet — for sync STT today, reach for OpenAITranscriber or GeminiTranscriber.

TTS

ModelStreamingIncremental-text-inNotes
eleven_flash_v2_5Sub-100 ms first-byte. Default for the voice-loop recipe.
eleven_turbo_v2_5Low latency, 32 languages
eleven_multilingual_v2Production-grade multilingual (29 languages)
eleven_v3Most expressive; inline audio-tag emotion (<laugh>, <whisper>, …); 70+ languages

Voice IDs are 20-character opaque slugs (e.g. JBFqnCBsd6RMkjVDRZzb). Same shape for stock and cloned voices — ElevenLabsVoiceId is just string. Browse the catalog via the ElevenLabs portal or GET /v1/voices.

Request shape

// STT streaming
type ElevenLabsStreamTranscribeRequest = {
readonly model: ElevenLabsSttModel // typically "scribe_v2_realtime"
readonly inputFormat: AudioFormat // 16 kHz pcm s16le mono
readonly language?: string
readonly interimResults?: boolean
readonly vadEvents?: boolean
readonly diarization?: boolean
}
// TTS sync + chunked + incremental
type ElevenLabsSynthesizeRequest = {
readonly model: ElevenLabsTtsModel
readonly voiceId: ElevenLabsVoiceId
readonly text: string // omitted on streamSynthesisFrom
readonly outputFormat?: AudioFormat
readonly voiceSettings?: VoiceSettings // stability, similarity_boost, style, use_speaker_boost
readonly seed?: number // deterministic generation
readonly previousText?: string // prosody context
readonly nextText?: string
}

voiceSettings exposes ElevenLabs’ prosody controls:

type VoiceSettings = {
readonly stability?: number // 0..1 — higher = more consistent
readonly similarityBoost?: number // 0..1 — clone fidelity
readonly style?: number // 0..1 — emotion intensity (v3+)
readonly useSpeakerBoost?: boolean
}

previousText / nextText thread context across sequential synthesis calls so prosody flows naturally between chunks — useful when you’re synthesizing one paragraph at a time.

Wire / auth notes

Realtime STT mints a single-use token via REST (POST /v1/single-use-token/realtime_scribe) and carries it as a ?token=… query param on the WS upgrade. No headers needed → works from the browser directly with globalThis.WebSocket. Expects PCM s16le at 16 kHz mono.

Realtime TTS (/stream-input) takes incoming text as JSON frames on a single WS, returns base64 PCM audio frames. The provider closes with code 1000 on a clean end — the adapter whitelists 1000 / 1001 / 1005 via closeCodeIsError so a graceful close doesn’t surface as a stream failure.

Output formats: PCM s16le at 16 / 22.05 / 24 / 44.1 kHz, plus MP3 at several bitrates and µ-law / A-law for telephony. PCM at 24 kHz is the sweet spot for browser AudioWorklet playback.

Errors

Standard HTTP → AiError mapping. Non-fatal mid-stream errors arrive as TranscriptEvents with _tag: "error" and don’t end the stream; fatal failures surface on the Stream’s error channel.

See also