Cosine Similarity

Measuring meaning by measuring angles.

The intuition

Imagine two arrows in a high-dimensional space. Each arrow represents the meaning of a piece of text — its embedding vector. Cosine similarity measures the angle between those arrows, ignoring their length. Two texts about the same topic point in roughly the same direction, so the angle between them is small and the cosine is close to 1.

It doesn't matter how long the vectors are (how many words, how much emphasis). What matters is the direction — the shape of meaning.

The math

# Cosine similarity formula
cos(A, B) = (A · B) / (|A| × |B|)

# Where:
#   A · B    = dot product (sum of element-wise products)
#   |A|, |B| = magnitudes (Euclidean norms)
#
# Range: -1 (opposite direction) to 0 (orthogonal, unrelated) to 1 (same direction)
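As a minimal sketch, the formula can be written in plain Python as follows. The function name `cosine_similarity` and the toy three-dimensional vectors are illustrative assumptions, not taken from Membot's code; real embeddings would have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return cos(A, B) = (A . B) / (|A| * |B|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))        # A . B
    norm_a = math.sqrt(sum(x * x for x in a))     # |A|
    norm_b = math.sqrt(sum(y * y for y in b))     # |B|
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(a, [2.0, 4.0, 6.0])    # a scaled copy of a
opposite = cosine_similarity(a, [-1.0, -2.0, -3.0])       # a reversed copy of a
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # perpendicular vectors
print(same_direction, opposite, orthogonal)
```

The three results land at (up to floating-point rounding) the top, bottom, and middle of the range. Doubling the length of a vector leaves the score unchanged, which is the "direction, not length" property described in the intuition.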

In practice, many embedding models produce unit-length (normalized) vectors, or their output can be normalized once up front. Both magnitudes are then 1, so cosine similarity reduces to a plain dot product.
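A short sketch of that shortcut, assuming the vectors are normalized once (for example at indexing time) so that each later comparison is just a dot product. The helper names `normalize` and `dot` are hypothetical and chosen for illustration.

```python
import math

def normalize(v):
    """Scale v to unit length so that |v| == 1."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

# Normalize once, then every comparison is a plain dot product.
a_unit = normalize([3.0, 4.0])
b_unit = normalize([4.0, 3.0])
similarity = dot(a_unit, b_unit)   # equals the cosine of the original vectors
print(similarity)
```

The saving is small for a single pair, but it removes two square roots and a division from every comparison when one query is scored against many stored vectors.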

In Membot

Cosine similarity carries 70% of the search weight. It is the semantic backbone — it understands that "canine" and "dog" are related even though they share no characters. The embeddings come from Nomic Embed, a local 768-dimensional model that runs without API calls.

Why not cosine alone?

Cosine captures meaning but can miss exact keywords. A query for "error code 4012" might match texts about errors in general but miss the one that contains the exact code. That's why Membot blends cosine with Hamming distance (which provides a fast, coarse semantic check on binary signatures) and keyword boosting (which catches exact term matches).
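To make the blending idea concrete, here is a rough sketch of a weighted score. Only the 70% cosine weight comes from this page; the 20% and 10% weights, the function names, the way Hamming distance is converted to a similarity, and the whitespace-based keyword match are all assumptions made for illustration, not Membot's actual implementation.

```python
def hamming_similarity(sig_a: int, sig_b: int, n_bits: int) -> float:
    """Fraction of matching bits between two binary signatures (assumed form)."""
    distance = bin(sig_a ^ sig_b).count("1")   # number of differing bits
    return 1.0 - distance / n_bits

def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear verbatim in the text (assumed form)."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for t in terms if t in words) / len(terms)

def blended_score(cosine: float, hamming_sim: float, keyword: float,
                  w_cos: float = 0.7,    # from this page
                  w_ham: float = 0.2,    # hypothetical
                  w_kw: float = 0.1) -> float:  # hypothetical
    """Weighted sum of the three signals."""
    return w_cos * cosine + w_ham * hamming_sim + w_kw * keyword

query = "error code 4012"
exact_text = "request failed with error code 4012"
generic_text = "a generic error occurred"

# Suppose both texts look equally close to the query semantically.
cosine = 0.8
ham = hamming_similarity(0b1011, 0b1001, n_bits=4)

exact = blended_score(cosine, ham, keyword_score(query, exact_text))
generic = blended_score(cosine, ham, keyword_score(query, generic_text))
print(exact > generic)
```

With the semantic signals tied, the keyword term breaks the tie in favor of the text that contains the exact code, which is the failure case that cosine alone would leave unresolved.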

Further reading

Embeddings — how text becomes vectors
Hamming Distance — the binary complement
Sign-Zero Encoding — compressing vectors to bits