Mistral
Mistral’s Voxtral family covers the whole speech surface: transcription
(one-shot and live) and text-to-speech with streaming output and
zero-shot voice cloning. It plugs into the generic Transcriber and
SpeechSynthesizer tags, so it drops straight into the
voice loop or any STT/TTS flow. The same package
also ships Mistral’s chat language model.
Install

    pnpm add @effect-uai/core @effect-uai/mistral effect

Live transcription runs over a WebSocket and needs the ws peer dependency
(Node / Bun):

    pnpm add ws

Batch transcription and TTS don’t need it, so edge / browser builds stay slim.
Layers
Pick the transcriber that matches how you feed audio, a recorded file (batch) or a live microphone (realtime), and add the synthesizer for speech output:
| Layer | Use it for | Provides |
|---|---|---|
| @effect-uai/mistral/MistralTranscriber | transcribe a finished audio file | Transcriber |
| @effect-uai/mistral/MistralRealtimeTranscriber | transcribe a live audio stream | Transcriber + live streaming |
| @effect-uai/mistral/MistralSynthesizer | turn text into speech | SpeechSynthesizer + streaming text input |

    import { Config, Effect, Layer } from "effect"
    import { FetchHttpClient } from "effect/unstable/http"
    import { layer as realtimeLayer } from "@effect-uai/mistral/MistralRealtimeTranscriber"
    import { layer as synthLayer } from "@effect-uai/mistral/MistralSynthesizer"

    const mistral = Layer.unwrap(
      Effect.gen(function* () {
        const apiKey = yield* Config.redacted("MISTRAL_API_KEY")
        return Layer.mergeAll(realtimeLayer({ apiKey }), synthLayer({ apiKey }))
      }),
    )

    const mainLayer = mistral.pipe(Layer.provide(FetchHttpClient.layer))

Transcription
Batch transcription takes an AudioSource and returns the text plus
optional word timestamps and speaker labels:

    type MistralTranscribeRequest = {
      readonly model: MistralTranscribeModel // "voxtral-mini-latest"
      readonly audio: AudioSource
      readonly language?: string
      readonly diarization?: boolean // label speakers
      readonly wordTimestamps?: boolean // per-word timings
      readonly biasingTerms?: ReadonlyArray<string> // boost names / jargon
    }

Voxtral transcribes 13 languages with diarization, word timestamps, and vocabulary biasing.
Live transcription (streamTranscriptionFrom) consumes a stream of
microphone audio (PCM s16le at 16 kHz mono) and emits partial
transcripts as you speak. Voxtral streams continuously rather than
chopping speech into turns, so the transcriber commits a final transcript
once the speaker pauses briefly; tune the pause length with utteranceSilence
(a Duration, default 700 ms) on the layer config. That’s what gives a voice
assistant its per-utterance turns.
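Browser and Web Audio capture usually hands you Float32 samples, while the live transcriber expects PCM s16le. The following is a minimal sketch of that conversion, assuming the audio has already been resampled to 16 kHz mono. `floatToPcm16le` and `pcmBytesFor` are hypothetical helper names, not part of @effect-uai/mistral.

```typescript
// Convert Float32 samples (range -1..1) into PCM s16le bytes.
// ASSUMPTION: the input is already 16 kHz mono; resampling is out of scope.
function floatToPcm16le(samples: Float32Array): Uint8Array {
  const bytes = new Uint8Array(samples.length * 2)
  const view = new DataView(bytes.buffer)
  for (let i = 0; i < samples.length; i++) {
    // Clamp so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0))
    // Scale asymmetrically: -1 maps to -32768, +1 maps to 32767.
    const int16 = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff)
    view.setInt16(i * 2, int16, true) // true = little-endian
  }
  return bytes
}

// Bytes needed for a duration of mono 16-bit audio at the given sample rate.
function pcmBytesFor(ms: number, sampleRate = 16_000): number {
  return Math.floor((sampleRate * ms) / 1000) * 2
}
```

At 16 kHz mono, the default 700 ms silence window corresponds to 11,200 samples, or 22,400 bytes of s16le audio.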
Text-to-speech

    type MistralSynthesizeRequest = {
      readonly model: MistralTtsModel // "voxtral-mini-tts-2603"
      readonly voiceId: string // a preset voice id
      readonly refAudio?: string // base64 clip for instant voice cloning
      readonly text: string // omit when streaming text in
      readonly outputFormat?: AudioFormat
    }

- synthesize: full text in, one audio blob out.
- streamSynthesis / streamSynthesisFrom: stream audio out as it’s generated, the latter taking the text itself as a stream (it speaks one full utterance per call). Time-to-first-audio is around 0.8 s with the pcm format.

Voices. Mistral ships preset voices; list the ones available to your
account with GET /v1/audio/voices?type=presets (ids look like
gb_jane_neutral, us_paul_neutral) and pass one as voiceId. For
zero-shot cloning, set refAudio to a 2-3 s base64 reference clip and it
takes over from voiceId.
Output formats. pcm (lowest latency, ideal for AudioWorklet
playback), wav, mp3, flac, and opus. Voxtral has no phoneme
controls, so inline pronunciation overrides aren’t supported.
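An AudioWorklet consumes Float32 samples, so streamed pcm chunks need decoding before playback. The sketch below assumes the pcm output is 16-bit signed little-endian, which this page does not state; check the encoding and sample rate against the API before relying on it. `makePcm16Decoder` is a hypothetical helper. It carries an odd trailing byte over to the next chunk, because network chunk boundaries need not align with sample boundaries.

```typescript
// Stateful decoder: turns streamed PCM chunks into Float32 samples (-1..1).
// ASSUMPTION: chunks are 16-bit signed little-endian; verify against the API.
function makePcm16Decoder(): (chunk: Uint8Array) => Float32Array {
  let carry: number | undefined // odd trailing byte from the previous chunk
  return (chunk) => {
    let data = chunk
    if (carry !== undefined) {
      // Prepend the carried byte so samples stay aligned across chunks.
      const joined = new Uint8Array(chunk.length + 1)
      joined[0] = carry
      joined.set(chunk, 1)
      data = joined
    }
    const sampleCount = data.length >> 1
    // Keep a dangling byte, if any, for the next call.
    carry = data.length % 2 === 1 ? data[data.length - 1] : undefined
    const view = new DataView(data.buffer, data.byteOffset, sampleCount * 2)
    const out = new Float32Array(sampleCount)
    for (let i = 0; i < sampleCount; i++) {
      out[i] = view.getInt16(i * 2, true) / 0x8000 // true = little-endian
    }
    return out
  }
}
```

Create one decoder per synthesis stream and pass every chunk through it in order; the returned samples can then be posted to the worklet.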
Errors
Failures surface as typed AiError variants:
| Status | Error |
|---|---|
| 429 | AiError.RateLimited |
| 408/504 | AiError.Timeout |
| 401 | AiError.AuthFailed (auth) |
| >= 500 | AiError.Unavailable |
| other 4xx | AiError.InvalidRequest |
Mid-stream transcription hiccups arrive as TranscriptEvents tagged
"error" without ending the stream.
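The status table above can be read as an ordered set of rules. The sketch below is one way to express it, not the library's implementation: `AiErrorTag` and `classifyStatus` are hypothetical names, and it assumes the specific 504 rule takes precedence over the generic >= 500 rule.

```typescript
// Hypothetical tag names mirroring the table; not the library's own types.
type AiErrorTag =
  | "RateLimited"
  | "Timeout"
  | "AuthFailed"
  | "Unavailable"
  | "InvalidRequest"

// Map an HTTP status to an error tag; undefined means "not an error status".
function classifyStatus(status: number): AiErrorTag | undefined {
  if (status === 429) return "RateLimited"
  // ASSUMPTION: 504 is checked before the >= 500 rule, so it is a Timeout.
  if (status === 408 || status === 504) return "Timeout"
  if (status === 401) return "AuthFailed"
  if (status >= 500) return "Unavailable"
  if (status >= 400) return "InvalidRequest"
  return undefined
}
```

Ordering matters here: if the >= 500 check ran first, a 504 would be reported as Unavailable rather than Timeout.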
See also
- Voice loop: runnable end-to-end with --provider mistral (Voxtral STT + Mistral LLM + Voxtral TTS).
- Speech overview: the generic tags and capability markers.
- Mistral language model: the chat side of the same package.