CLIP

Contrastive Language-Image Pre-training.

What is it?

CLIP is a model introduced by OpenAI in 2021 that learns to map images and text into the same vector space. A photo of a dog and the text "a photo of a dog" end up as nearby vectors. This means you can search images with text queries, and search text with image queries, without any task-specific training.

How it's trained

CLIP is trained on roughly 400 million image-text pairs collected from the internet. The training objective is contrastive: given a batch of image-text pairs, the model learns to maximize the similarity between each matching pair and minimize the similarity between every other image-text combination in the batch.
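
The contrastive objective can be sketched in a few lines of NumPy. This is an illustrative toy, not CLIP's training code: the function names, the random stand-in embeddings, and the fixed temperature of 0.07 are assumptions made for the example (CLIP treats the temperature as a learned parameter).

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products equal cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Row i of image_emb is assumed to be paired with row i of text_emb.
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    diag = np.arange(logits.shape[0])
    # Image-to-text: each image should pick out its own caption (softmax over rows).
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each caption should pick out its own image (softmax over columns).
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

# Toy data: random vectors stand in for encoder outputs (hypothetical values).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 8))

matched_loss = contrastive_loss(aligned_images, text_emb)
shuffled_loss = contrastive_loss(aligned_images[::-1], text_emb)
print(matched_loss, shuffled_loss)
```

When the pairs are aligned the loss is low; reversing the image order so that no image sits next to its own caption makes the loss much larger, and that gap is the signal the encoders are trained to reduce.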

The result is two encoders (one for images, one for text) that produce vectors in a shared space. The cosine similarity between an image vector and a text vector indicates how well they match: values closer to 1 mean a closer match.
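
As a minimal sketch of how cosine similarity ranks candidates, the example below scores three image vectors against one text vector. The three-dimensional vectors and their labels are made up for illustration; real CLIP embeddings have hundreds of dimensions and come from the encoders.

```python
import numpy as np

def cosine_similarity(a, b):
    # Normalize rows to unit length, then take all pairwise dot products.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Hypothetical image embeddings (one row per image).
image_vectors = np.array([
    [0.9, 0.1, 0.0],   # stand-in for a dog photo
    [0.0, 1.0, 0.2],   # stand-in for a cat photo
    [0.1, 0.0, 1.0],   # stand-in for a car photo
])
# Hypothetical text embedding for "a photo of a dog".
text_vector = np.array([[1.0, 0.2, 0.0]])

scores = cosine_similarity(text_vector, image_vectors)[0]  # one score per image
best_match = int(np.argmax(scores))
print(scores, best_match)
```

Text-to-image search is this ranking step applied to a whole image collection: embed the query once, score it against every stored image vector, and return the highest-scoring images.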

In Membot

CLIP provides the image modality for cross-modal memory. When building a cartridge with images:

  1. Each image is embedded using CLIP's image encoder
  2. Each associated text is embedded using Nomic Embed (Membot's text encoder) rather than CLIP's own text encoder
  3. Hebbian learning binds the image and text representations together

After this Hebbian training, the neuromorphic layer can retrieve images from text queries or find text associated with similar images. Combined with CLAP (audio), this enables any-to-any modality search.
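
The binding step can be pictured with a classic outer-product Hebbian associative memory. The sketch below is an assumption about the general idea, not Membot's actual implementation: the dimensions, the random stand-in embeddings, and the helper names retrieve_image and retrieve_text are all hypothetical.

```python
import numpy as np

def normalize(x):
    # Scale vectors to unit length along the last axis.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
image_dim, text_dim, n_pairs = 128, 96, 5  # arbitrary toy sizes

# Random unit vectors stand in for image-encoder and text-encoder outputs.
image_vecs = normalize(rng.normal(size=(n_pairs, image_dim)))
text_vecs = normalize(rng.normal(size=(n_pairs, text_dim)))

# Hebbian rule: strengthen the weight between units that are active together.
# Summing outer products stores every (text, image) pair in one matrix.
weights = np.zeros((text_dim, image_dim))
for img, txt in zip(image_vecs, text_vecs):
    weights += np.outer(txt, img)

def retrieve_image(text_query):
    # Project the text vector into image space, then find the nearest stored image.
    recalled = normalize(text_query @ weights)
    return int(np.argmax(image_vecs @ recalled))

def retrieve_text(image_query):
    # Project the image vector into text space, then find the nearest stored text.
    recalled = normalize(weights @ image_query)
    return int(np.argmax(text_vecs @ recalled))

recalled_images = [retrieve_image(t) for t in text_vecs]
recalled_texts = [retrieve_text(i) for i in image_vecs]
print(recalled_images, recalled_texts)
```

Because the two encoders produce vectors in different spaces, the weight matrix acts as the bridge between them: the same matrix is read in one direction for text-to-image lookup and in the other for image-to-text lookup.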

Further reading

CLAP — the audio equivalent
Hebbian Learning — how modalities are bound
Embeddings — the text embedding model