Multimodal embedding

You have a corpus of product photos and product descriptions. A user uploads an image. You want matching photos and matching descriptions back in a single list, ranked by relevance.

This is what a multimodal embedding model is for. Images and text land in the same vector space, so cosine similarity works across the boundary (image vs. text, image vs. image, text vs. text) without a separate captioning pass.
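For reference, cosine similarity itself is only a few lines. The sketch below is a standalone illustration, not the library's own helper; the name cosineSimilarity and the choice to return 0 for a zero vector are assumptions made for the example.

```typescript
// Cosine similarity between two equal-length vectors.
// The result lies in [-1, 1]; higher means the vectors point in more similar directions.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`)
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  // Assumption: a zero vector has no direction, so report 0 instead of NaN.
  return denom === 0 ? 0 : dot / denom
}
```

Nothing in the function knows whether a vector came from an image or from text, which is why one scoring routine can serve every modality pair.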

Scenario. One image as the query, a corpus mixing images and text. Rank everything against the query in one cosine sweep.

One batch, mixed modalities

embedMany accepts a ReadonlyArray<EmbedInput> where each entry can be text, image, or a mixed content[]:

import { embedMany } from "@effect-uai/core/EmbeddingModel"
import * as Image from "@effect-uai/core/Image"

const inputs: ReadonlyArray<EmbedInput> = [
  { image: Image.imageBytes(doughBytes, "image/jpeg") },
  { image: Image.imageBytes(dragonBytes, "image/jpeg") },
  { text: "A photo of artisan sourdough bread" },
  { text: "A delicious croissant on a plate" },
]

// Inside an Effect.gen generator: one request embeds all four inputs.
const result = yield* embedMany({ model: "gemini-embedding-2", inputs })

One HTTP call covers the whole batch. The provider returns a Float32Embedding per input in the same order you sent them.
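Because results come back in input order, pairing them up is an index-wise zip. A minimal sketch, assuming you want a length guard in front of any index-based lookup; zipInOrder and Paired are hypothetical names, not part of the library:

```typescript
type Paired<I, E> = { input: I; embedding: E }

// Pair each input with the embedding at the same index.
// Throws when the lengths differ, so a short response cannot silently misalign results.
function zipInOrder<I, E>(
  inputs: ReadonlyArray<I>,
  embeddings: ReadonlyArray<E>,
): Array<Paired<I, E>> {
  if (inputs.length !== embeddings.length) {
    throw new Error(`expected ${inputs.length} embeddings, got ${embeddings.length}`)
  }
  return inputs.map((input, i) => ({ input, embedding: embeddings[i] as E }))
}
```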

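Because every vector lives in the same space, you can compare any pair. Embed the query image alongside the corpus, then score every corpus entry against it: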

// Embed the query image and the mixed corpus concurrently.
const [queryResult, docsResult] = yield* Effect.all(
  [
    embed({ model, input: { image: Image.imageBytes(queryBytes, "image/jpeg") } }),
    embedMany({ model, inputs }),
  ],
  { concurrency: "unbounded" },
)

// Score each corpus entry against the query, highest similarity first.
const ranked = inputs
  .map((input, i) => ({
    input,
    score: Vector.cosine(queryResult.embedding.vector, docsResult.embeddings[i].vector),
  }))
  .sort((a, b) => b.score - a.score)

Same shape as basic embedding: the only difference is what’s in input and inputs.
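For larger corpora, a common variation is to normalize vectors to unit length so that the cosine sweep reduces to dot products, then keep only the top K. The sketch below works on plain number arrays and is independent of the library; normalize, dot, and rankTopK are hypothetical helper names.

```typescript
type Scored<T> = { item: T; score: number }

// Scale a vector to unit length; a zero vector stays all zeros.
function normalize(v: ArrayLike<number>): number[] {
  const values = Array.from(v)
  const norm = Math.sqrt(values.reduce((sum, x) => sum + x * x, 0))
  return norm === 0 ? values.map(() => 0) : values.map((x) => x / norm)
}

// Dot product of two equal-length arrays.
function dot(a: number[], b: number[]): number {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

// Rank documents by cosine similarity to the query and keep the best k.
// On unit vectors, cosine similarity equals the dot product.
function rankTopK<T>(
  query: ArrayLike<number>,
  docs: ReadonlyArray<{ item: T; vector: ArrayLike<number> }>,
  k: number,
): Scored<T>[] {
  const q = normalize(query)
  return docs
    .map(({ item, vector }) => ({ item, score: dot(q, normalize(vector)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, k))
}
```

If the corpus is static, the normalized document vectors can be computed once and stored, leaving only the query to normalize per request.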

A note on the modality gap

Cross-modal scores are noisier than same-modality scores. In practice you’ll often see image-image cosines clustered higher than image-text cosines, even when the image-text pairs are semantically closer. This is the modality gap: joint embedding spaces tend to cluster by modality before clustering by content.

Two practical takeaways:

Cosine thresholds don’t transfer between modality pairs. A 0.65 image-text score might mean strong relevance; a 0.65 image-image score might mean unrelated photos that share an aesthetic. Calibrate thresholds per pair.
Rerank for cross-modal precision. When the modality gap dominates your top-K, a cross-encoder reranker that takes both modalities (for example, Jina’s jina-reranker-m0) can recover the ordering.
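Per-pair calibration can be as simple as a lookup keyed by the modality pair. The sketch below is illustrative only: the threshold values are placeholders rather than measurements, and Modality, pairKey, and isRelevant are hypothetical names.

```typescript
type Modality = "image" | "text"
type PairKey = "image-image" | "image-text" | "text-text"

// Placeholder values; calibrate each one on labelled pairs from your own corpus.
const thresholds: Record<PairKey, number> = {
  "image-image": 0.8,
  "image-text": 0.6,
  "text-text": 0.75,
}

// Order does not matter: image vs. text and text vs. image share one threshold.
function pairKey(a: Modality, b: Modality): PairKey {
  if (a === b) return a === "image" ? "image-image" : "text-text"
  return "image-text"
}

// Decide relevance using the threshold for this specific modality pair.
function isRelevant(query: Modality, doc: Modality, score: number): boolean {
  return score >= thresholds[pairKey(query, doc)]
}
```

With these placeholder values, a 0.65 score passes for an image-text pair and fails for an image-image pair, mirroring the example above.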

Provider support

Today, multimodal embedding lives on:

gemini-embedding-2: text, image, audio, video, and PDF in one vector space. It does not honour task; prepend a task instruction to the input text instead.
jina-embeddings-v4: text + image, retrieval-tuned, also supports multivector and sparse output.
jina-clip-v2: CLIP-style image/text only.

OpenAI’s embedding line is text-only. Cohere v4 and Voyage multimodal are on the embedding plan but not yet implemented.
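For gemini-embedding-2, where task is not honoured, prepending the instruction can be wrapped in a small helper. This is a sketch under assumptions: withTaskInstruction is a hypothetical name, the newline separator is a guess, and the instruction wording should follow the provider's own guidance.

```typescript
type TextInput = { text: string }

// Prefix the input text with a task instruction.
// Assumption: a newline separates the instruction from the content.
function withTaskInstruction(instruction: string, text: string): TextInput {
  const trimmed = instruction.trim()
  return { text: trimmed.length === 0 ? text : `${trimmed}\n${text}` }
}

// Example wording only; not taken from any provider documentation.
const describedQuery = withTaskInstruction(
  "Find product photos that match this description:",
  "A photo of artisan sourdough bread",
)
```

Whatever wording you settle on, apply it consistently on both the query and the document side as the provider recommends, since the instruction becomes part of what is embedded.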

Image input shapes

ImageSource is url / base64 / bytes: the same primitives language model image inputs use. Provider acceptance varies:

Provider	URL	Base64	Bytes
Gemini	rejected (no Files-API upload)	yes	yes (auto base64)
Jina v4	yes	yes	yes (auto base64)

If a layer can’t encode the shape you passed, it fails the request with AiError.InvalidRequest. No silent fallback.
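The acceptance rules in the table can be pictured as a conversion step that inlines what it can and fails loudly on the rest. The sketch below uses a hypothetical SketchImageSource union as a stand-in for the library's ImageSource, and throws a plain Error where the real layers report AiError.InvalidRequest; it models a provider that, like Gemini in the table, rejects URLs.

```typescript
import { Buffer } from "node:buffer"

// Hypothetical stand-in for the library's ImageSource; the real type may differ.
type SketchImageSource =
  | { kind: "url"; url: string }
  | { kind: "base64"; data: string; mediaType: string }
  | { kind: "bytes"; data: Uint8Array; mediaType: string }

type InlineImage = { data: string; mediaType: string }

// Convert a source to inline base64 for a provider that cannot fetch URLs.
function toInlineBase64(source: SketchImageSource): InlineImage {
  switch (source.kind) {
    case "base64":
      // Already encoded; pass through unchanged.
      return { data: source.data, mediaType: source.mediaType }
    case "bytes":
      // Raw bytes are base64-encoded automatically.
      return { data: Buffer.from(source.data).toString("base64"), mediaType: source.mediaType }
    case "url":
      // No silent fallback: reject instead of downloading behind the caller's back.
      throw new Error("URL image sources are not supported by this provider")
  }
}
```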

Run it

GOOGLE_API_KEY=... pnpm tsx recipes/multimodal-embedding/index.ts

The full source is at recipes/multimodal-embedding/index.ts. The recipe fetches three Unsplash images, mixes them with text in one batch, and ranks against an image query.