CLAP
Contrastive Language-Audio Pre-training.
What is it?
CLAP is a model developed by Microsoft and collaborators (2022-2023) that maps audio and text into the same vector space, so that a sound and a description of it land close together. A recording of birdsong and the text "birds singing in a forest" produce similar vectors. This enables text-to-audio search and audio-to-text retrieval without task-specific fine-tuning.
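Once audio and text share a vector space, retrieval reduces to a nearest-neighbour search. The sketch below illustrates the idea with cosine similarity. The three-dimensional vectors, the file names, and the helper `rank_audio_by_text` are made up for illustration; real CLAP embeddings have hundreds of dimensions and come from the trained encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_audio_by_text(text_vec, audio_index):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(text_vec, vec))
              for name, vec in audio_index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy embeddings, NOT produced by a real CLAP model.
audio_index = {
    "birdsong.wav": [0.9, 0.1, 0.0],
    "traffic.wav": [0.0, 0.2, 0.9],
    "rain.wav": [0.1, 0.9, 0.2],
}
# Stand-in for the embedding of the text "birds singing in a forest".
query_vec = [0.8, 0.2, 0.1]

ranking = rank_audio_by_text(query_vec, audio_index)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Audio-to-text retrieval is the same search run in the other direction: embed the recording and rank stored text vectors against it.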
How it works
Like CLIP for images, CLAP uses contrastive learning on paired audio-text data. The model learns two encoders:
- Audio encoder — processes spectrograms or raw waveforms into vectors
- Text encoder — processes descriptions into vectors in the same space
Training pulls matching audio-text pairs together and pushes non-matching pairs apart in vector space.
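The training objective can be sketched as a symmetric contrastive loss in the style of CLIP. This is a minimal illustration, not CLAP's actual training code: the function names are invented here, and the fixed temperature of 0.07 is an assumption (CLIP-style models typically learn it during training). Within a batch, item i of the audio list is assumed to be paired with item i of the text list.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio_vecs, text_vecs, temperature=0.07):
    """Average of the audio-to-text and text-to-audio losses."""
    audio = [l2_normalize(v) for v in audio_vecs]
    text = [l2_normalize(v) for v in text_vecs]
    # logits[i][j] = scaled similarity between audio i and text j
    logits = [[sum(a * t for a, t in zip(av, tv)) / temperature
               for tv in text] for av in audio]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits_transposed))

# Toy batch: correctly paired embeddings give a low loss,
# swapped pairings give a high one.
aligned = symmetric_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = symmetric_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(f"aligned: {aligned:.6f}  shuffled: {shuffled:.3f}")
```

Minimizing this loss raises the similarity on the diagonal (matching pairs) relative to every off-diagonal entry (non-matching pairs), which is the pull-together, push-apart behaviour described above.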
In Membot
CLAP provides the audio modality for cross-modal memory. Combined with CLIP (images) and Nomic Embed (text), it enables retrieval between any pair of modalities:
- Text → Audio: "find sounds of rain"
- Audio → Text: "what documents mention this sound?"
- Audio → Image: via Hebbian binding through shared text space
The Hebbian weight matrix binds all three modalities together, so cross-modal retrieval works through attractor dynamics in the neuromorphic layer.
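As a rough illustration of Hebbian binding, the sketch below uses a textbook outer-product associative memory to route an audio cue to an image through text. It is an assumption-heavy toy, not Membot's implementation: the helper names, the tiny hand-written vectors, and the single-pass recall (in place of iterated attractor dynamics) are all chosen here for clarity.

```python
def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def matvec(matrix, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

def hebbian_bind(pairs, out_dim, in_dim):
    """Accumulate W += target * source^T for each (source, target) pair."""
    weights = [[0.0] * in_dim for _ in range(out_dim)]
    for source, target in pairs:
        update = outer(target, source)
        weights = [[w + u for w, u in zip(w_row, u_row)]
                   for w_row, u_row in zip(weights, update)]
    return weights

# Hypothetical toy embeddings for two concepts in three modalities.
audio = {"rain": [1.0, 0.0], "birds": [0.0, 1.0]}
text = {"rain": [1.0, 0.0, 0.0], "birds": [0.0, 1.0, 0.0]}
image = {"rain": [0.0, 1.0], "birds": [1.0, 0.0]}

# Bind audio -> text and text -> image; text is the shared space.
audio_to_text = hebbian_bind([(audio[k], text[k]) for k in audio], 3, 2)
text_to_image = hebbian_bind([(text[k], image[k]) for k in text], 2, 3)

# A noisy rain-like audio cue is mapped to text, then to image space.
cue = [0.9, 0.2]
recalled_image = matvec(text_to_image, matvec(audio_to_text, cue))

# Pick the stored image whose vector best matches the recalled pattern.
best = max(image, key=lambda k: sum(a * b for a, b in
                                    zip(image[k], recalled_image)))
print(best)
```

With orthogonal stored patterns a single pass recovers the right association. An attractor network would instead iterate the update until the state settles, which helps when the stored patterns overlap or the cue is noisier.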
Further reading
CLIP — the image equivalent
Hebbian Learning — how modalities are bound
Neuromorphic Substrate — the physics layer