What are vision-language models?

A vision-language model is like a super-smart kid who can read books and also look at pictures, and understand what’s going on in both.

Imagine you have a friend who loves stories. This friend can not only read the words of a story but also look at a picture from that story and tell you what's happening. That's kind of how vision-language models work: they understand text and images, and can connect them together.

How They Work

Think of it like having two friends working together, one who reads, and one who looks at pictures. When they talk to each other, they learn to match what they see with what they read. Over time, they get really good at figuring out what a picture means just by reading the words next to it, or even guessing what words might go with a picture.

These models are used in things like apps that can describe what’s in a photo, or help you find the right book based on an image. They’re not magical, they're just really smart at combining two kinds of information: what we see and what we read.

Take the quiz →

Examples

  1. A child sees a picture of a cat and says, 'That's my favorite animal!'
  2. A robot looks at a photo and tells you what it sees.
  3. An app lets you describe an image with words.

Ask a question

See also

Discussion

Recent activity