Multimodal

Models that see, hear, and read — and what that costs
What is multimodal?

A multimodal model processes more than one type of input — text, images, audio, or video. GPT-4 was text-only. GPT-4V added vision. GPT-4o added audio. Each modality added cost and capability.

The term entered common use in 2023 when OpenAI released GPT-4V and Google released Gemini with native multimodal support. Before this, separate models handled separate modalities — one for text, another for images, a third for speech.

Why it matters

Multimodal capability changes what applications are possible. A text-only model cannot read a screenshot, transcribe a meeting, or describe a product photo. A multimodal model can. But each additional modality carries a token cost: images are converted to tokens at one rate, audio at another, and those tokens are billed and counted against the context window just like text. sourc.dev tracks vision and audio support as capability flags on every model.
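To make the image token cost concrete, here is a minimal sketch of a tile-based estimate, loosely modelled on OpenAI's published GPT-4V "high detail" scheme (a base charge plus a per-512px-tile charge). The constants and the `image_tokens` helper are illustrative assumptions, not sourc.dev data, and real providers also rescale images before tiling:

```python
import math

# Assumed rates for this sketch (roughly OpenAI's GPT-4V "high detail"
# scheme: 85 base tokens plus 170 tokens per 512x512 tile). Other
# providers use different schemes entirely.
BASE_TOKENS = 85
TOKENS_PER_TILE = 170
TILE = 512

def image_tokens(width: int, height: int) -> int:
    """Estimate vision tokens for an image of the given pixel size."""
    tiles = math.ceil(width / TILE) * math.ceil(height / TILE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles

# A 1024x768 screenshot covers 2 x 2 = 4 tiles: 85 + 4*170 = 765 tokens
print(image_tokens(1024, 768))
```

Under these assumed rates, a single screenshot costs as much context as several paragraphs of text, which is why per-modality pricing matters when comparing models.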