Not all models work with the same kinds of input and output. This is where modality comes in.

Modality describes the form of information a model can accept or produce, such as:

  • text
  • images
  • audio
  • video
  • combinations of the above

Common categories

A text-only LLM works with language tokens alone. A vision-language model can take text plus images. Speech models work with audio. Some systems are genuinely multimodal, meaning they can interpret or generate across several kinds of media.

Why modality matters

Modality changes the design space. A model that can see screenshots can help with UI review. A model that can hear audio can support transcription or voice workflows. A text-only model may still be excellent for planning, coding, or document work.

So when choosing a model, do not just ask “which one is smartest?” Also ask “which one works with the kinds of information this task actually uses?”
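That matching question can be made concrete as a filter over a model catalog. The sketch below is illustrative only: the model names and modality sets are hypothetical, not real products, and a real catalog would come from provider documentation.

```python
# Hypothetical catalog mapping model names to the modalities they accept.
# Names and capabilities here are made up for illustration.
CATALOG = {
    "text-only-model": {"text"},
    "vision-language-model": {"text", "image"},
    "speech-model": {"audio"},
    "multimodal-model": {"text", "image", "audio", "video"},
}

def models_for(required_modalities):
    """Return models whose supported modalities cover everything the task needs."""
    required = set(required_modalities)
    return sorted(
        name for name, supported in CATALOG.items() if required <= supported
    )

# A UI-review task needs both text and images:
print(models_for({"text", "image"}))
# A planning task needs only text, so more models qualify:
print(models_for({"text"}))
```

The point of the subset check (`required <= supported`) is that "smartest" never enters the comparison: a model is only a candidate if it covers every modality the task actually uses.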