
Mistral

Mistral’s Voxtral family covers the whole speech surface: transcription (one-shot and live) and text-to-speech with streaming output and zero-shot voice cloning. It plugs into the generic Transcriber and SpeechSynthesizer tags, so it drops straight into the voice loop or any STT/TTS flow. The same package also ships Mistral’s chat language model.

Install

pnpm add @effect-uai/core @effect-uai/mistral effect

Live transcription runs over a WebSocket and needs the ws peer dep (Node / Bun):

pnpm add ws

Batch transcription and TTS don’t need it, so edge / browser builds stay slim.

Layers

Pick the transcriber that matches how you feed audio: batch for a recorded file, realtime for a live microphone. Add the synthesizer for speech output:

| Layer | Use it for | Provides |
| --- | --- | --- |
| @effect-uai/mistral/MistralTranscriber | transcribe a finished audio file | Transcriber |
| @effect-uai/mistral/MistralRealtimeTranscriber | transcribe a live audio stream | Transcriber + live streaming |
| @effect-uai/mistral/MistralSynthesizer | turn text into speech | SpeechSynthesizer + streaming text input |

import { Config, Effect, Layer } from "effect"
import { FetchHttpClient } from "effect/unstable/http"
import { layer as realtimeLayer } from "@effect-uai/mistral/MistralRealtimeTranscriber"
import { layer as synthLayer } from "@effect-uai/mistral/MistralSynthesizer"

const mistral = Layer.unwrap(
  Effect.gen(function* () {
    const apiKey = yield* Config.redacted("MISTRAL_API_KEY")
    return Layer.mergeAll(realtimeLayer({ apiKey }), synthLayer({ apiKey }))
  }),
)

const mainLayer = mistral.pipe(Layer.provide(FetchHttpClient.layer))

Transcription

Batch transcription takes an AudioSource and returns the text plus optional word timestamps and speaker labels:

type MistralTranscribeRequest = {
  readonly model: MistralTranscribeModel // "voxtral-mini-latest"
  readonly audio: AudioSource
  readonly language?: string
  readonly diarization?: boolean // label speakers
  readonly wordTimestamps?: boolean // per-word timings
  readonly biasingTerms?: ReadonlyArray<string> // boost names / jargon
}

Voxtral transcribes 13 languages with diarization, word timestamps, and vocabulary biasing.

Live transcription (streamTranscriptionFrom) consumes a stream of microphone audio (PCM s16le, 16 kHz, mono) and emits partial transcripts as you speak. Voxtral streams continuously rather than chopping speech into turns, so the transcriber commits a final transcript after a brief pause; tune that pause with utteranceSilence (a Duration, default 700 ms) on the layer config. This is what gives a voice assistant its per-utterance turns.
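
Browser capture usually hands you Float32 samples, while the live transcriber expects PCM s16le. A minimal sketch of that conversion follows. The function name floatToPcmS16le is hypothetical (it is not part of @effect-uai/mistral), and the code assumes the input is already mono at 16 kHz, so it does no resampling or channel mixing.

```typescript
// Hypothetical helper, not part of @effect-uai/mistral.
// Converts Float32 samples in the range -1..1 into PCM s16le bytes.
// Assumes the input is already mono and sampled at 16 kHz.
function floatToPcmS16le(samples: Float32Array): Uint8Array {
  const bytes = new Uint8Array(samples.length * 2) // 2 bytes per sample
  const view = new DataView(bytes.buffer)
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0))
    // The int16 range is asymmetric: -32768..32767.
    const value = clamped < 0 ? Math.round(clamped * 0x8000) : Math.round(clamped * 0x7fff)
    view.setInt16(i * 2, value, true) // true = little-endian
  }
  return bytes
}
```

At 16 kHz mono with 2 bytes per sample, one second of audio is 32,000 bytes, so a 20 ms frame is 640 bytes.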

Text-to-speech

type MistralSynthesizeRequest = {
  readonly model: MistralTtsModel // "voxtral-mini-tts-2603"
  readonly voiceId: string // a preset voice id
  readonly refAudio?: string // base64 clip for instant voice cloning
  readonly text: string // omit when streaming text in
  readonly outputFormat?: AudioFormat
}

  • synthesize: full text in, one audio blob out.
  • streamSynthesis / streamSynthesisFrom: stream audio out as it’s generated, the latter taking the text itself as a stream (it speaks one full utterance per call). Time-to-first-audio is around 0.8 s with the pcm format.

Voices. Mistral ships preset voices; list the ones available to your account with GET /v1/audio/voices?type=presets (ids look like gb_jane_neutral, us_paul_neutral) and pass one as voiceId. For zero-shot cloning, set refAudio to a 2-3 s base64 reference clip and it takes over from voiceId.
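
As a rough sketch of the two steps above, the helpers below build the preset-voices request and base64-encode a reference clip for refAudio. Both function names and the HttpRequestSpec type are hypothetical, and the host https://api.mistral.ai with Bearer authentication is an assumption; only the path and query string come from the text. The request is built but not sent, and the response shape is not modelled here.

```typescript
// Hypothetical helpers, not part of @effect-uai/mistral.

// Assumption: the public Mistral API host.
const MISTRAL_BASE_URL = "https://api.mistral.ai"

interface HttpRequestSpec {
  readonly method: "GET"
  readonly url: string
  readonly headers: Readonly<Record<string, string>>
}

// Describe (but do not send) the request that lists preset voices.
function presetVoicesRequest(apiKey: string, baseUrl: string = MISTRAL_BASE_URL): HttpRequestSpec {
  const url = new URL("/v1/audio/voices", baseUrl)
  url.searchParams.set("type", "presets")
  return {
    method: "GET",
    url: url.toString(),
    // Assumption: the API key is sent as a Bearer token.
    headers: { Authorization: `Bearer ${apiKey}`, Accept: "application/json" },
  }
}

// Encode raw clip bytes as base64, the form refAudio expects.
function toBase64(bytes: Uint8Array): string {
  let binary = ""
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i] ?? 0)
  }
  return btoa(binary)
}
```

To send the request, pass the spec to fetch(spec.url, { method: spec.method, headers: spec.headers }) and pick an id from the result as voiceId.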

Output formats. pcm (lowest latency, ideal for AudioWorklet playback), wav, mp3, flac, and opus. Voxtral has no phoneme controls, so inline pronunciation overrides aren’t supported.

Errors

Failures surface as typed AiError variants:

| Status | Error |
| --- | --- |
| 429 | AiError.RateLimited |
| 408/504 | AiError.Timeout |
| 401 | AiError.AuthFailed (auth) |
| >= 500 | AiError.Unavailable |
| other 4xx | AiError.InvalidRequest |

Mid-stream transcription hiccups arrive as TranscriptEvents tagged "error" without ending the stream.
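
The status table above can be read as a small pure function. The sketch below is illustrative only: classifyStatus and AiErrorTag are hypothetical names, and this is not the library's actual implementation. It mainly shows that the 504 check has to come before the generic >= 500 rule.

```typescript
// Hypothetical illustration of the status table; not the library's code.
type AiErrorTag = "RateLimited" | "Timeout" | "AuthFailed" | "Unavailable" | "InvalidRequest"

function classifyStatus(status: number): AiErrorTag | undefined {
  if (status === 429) return "RateLimited"
  // 504 must be checked before the generic >= 500 rule below.
  if (status === 408 || status === 504) return "Timeout"
  if (status === 401) return "AuthFailed"
  if (status >= 500) return "Unavailable"
  if (status >= 400) return "InvalidRequest"
  // Anything below 400 is not an error status.
  return undefined
}
```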

See also