
CLAP

Contrastive Language-Audio Pre-training.

What is it?

CLAP is a model developed by Microsoft and collaborators (2022-2023) that maps audio and text into the same vector space. A recording of birdsong and the text "birds singing in a forest" produce similar vectors. This enables text-to-audio search and audio-to-text retrieval without task-specific fine-tuning.
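Once audio and text live in one vector space, text-to-audio search reduces to a nearest-neighbour lookup. The following is a minimal sketch of that idea, not CLAP itself: the three-dimensional vectors are made-up stand-ins for real embeddings, and the names `cosine`, `search`, and `audio_index` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query_vec, index):
    """Return item names sorted from most to least similar to the query."""
    return sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)

# Toy vectors standing in for audio embeddings (assumed values, not model output).
audio_index = {
    "birdsong.wav": [0.9, 0.1, 0.0],
    "traffic.wav": [0.0, 0.2, 0.9],
    "rain.wav": [0.1, 0.9, 0.2],
}

# Pretend this is the text embedding of "birds singing in a forest".
text_query = [0.8, 0.2, 0.1]

ranking = search(text_query, audio_index)
print(ranking)
```

Swapping the roles (an audio vector as the query, text vectors in the index) gives audio-to-text retrieval with the same function.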

How it works

Like CLIP for images, CLAP uses contrastive learning on paired audio-text data. The model learns two encoders, one for audio and one for text, each projecting its input into the shared vector space.

Training pulls the embeddings of matching audio-text pairs together and pushes non-matching pairs apart in vector space.
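The objective can be sketched as a symmetric cross-entropy over a batch similarity matrix, the form commonly used by CLIP-style models. This is an illustration under assumptions rather than the exact training code: the function names are hypothetical, the temperature of 0.07 is an assumed fixed value (such models typically learn it), and the two-dimensional embeddings are toy data.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; item i of each list forms a matching pair."""
    a = [normalize(v) for v in audio_embs]
    t = [normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between audio i and text j
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t] for ai in a]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]])   # pairs match
shuffled = contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]])  # pairs swapped
print(aligned < shuffled)
```

The loss is low when each audio vector is closest to its own caption and high when the pairing is wrong, which is the gradient signal that moves matching pairs together.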

In Membot

CLAP provides the audio modality for cross-modal memory. Combined with CLIP (images) and Nomic Embed (text), it enables retrieval across any combination of audio, image, and text.

The Hebbian weight matrix binds all three modalities together, so cross-modal retrieval works through attractor dynamics in the neuromorphic layer.
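To make the binding idea concrete, here is a toy Hopfield-style sketch. It is an assumption-laden stand-in, not Membot's implementation: the 4-element bipolar codes replace real CLAP, CLIP, and Nomic Embed vectors, and `hebbian_weights` and `recall` are hypothetical names. Each memory concatenates an audio, image, and text code; the outer-product rule stores the memories, and iterating the network from an audio-only cue settles on the full stored pattern.

```python
def hebbian_weights(patterns):
    """Outer-product (Hebbian) rule with a zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / len(patterns)
    return w

def recall(w, cue, steps=5):
    """Synchronous sign updates until the state stops changing."""
    state = list(cue)
    for _ in range(steps):
        new_state = []
        for i in range(len(state)):
            h = sum(w[i][j] * state[j] for j in range(len(state)))
            new_state.append(1 if h > 0 else -1 if h < 0 else state[i])
        if new_state == state:
            break
        state = new_state
    return state

# Toy memories: audio (4 units) + image (4 units) + text (4 units).
bird = [1, 1, -1, -1] + [1, -1, -1, 1] + [1, 1, 1, 1]
car = [1, -1, 1, -1] + [-1, -1, 1, 1] + [1, -1, -1, 1]
weights = hebbian_weights([bird, car])

# Cue with only the audio part known; image and text units start at 0.
audio_cue = bird[:4] + [0] * 8
recalled = recall(weights, audio_cue)
print(recalled[4:8], recalled[8:12])  # recovered image and text codes
```

The same weights also complete a pattern from a text-only or image-only cue, which is the sense in which a single matrix binds the modalities.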

Further reading

CLIP — the image equivalent
Hebbian Learning — how modalities are bound
Neuromorphic Substrate — the physics layer