Sign-Zero Encoding
33x compression that preserves semantic fingerprints.
The idea
Take a float vector. For each dimension: if the value is positive, write 1. Otherwise, write 0. Then pack 8 bits into one byte.
A 768-dimensional float32 vector occupies 3,072 bytes. After sign-zero encoding and packing: 96 bytes. That's a 32x reduction. For Nomic's 768-dim model, the full index at 2.4M entries drops from gigabytes to a fraction.
Why it works
This is based on SimHash, introduced by Moses Charikar in 2002. The key insight: for vectors produced by contrastive training (like embedding models), the sign pattern captures most of the directional information. The Hamming distance between two sign-zero vectors correlates with the cosine distance between the original floats at r = 0.891.
Contrastive training spreads information across all dimensions roughly equally, so no single dimension dominates. Discarding magnitude loses surprisingly little.
Implementation
In NumPy, the entire encoding is one line:
That's it. No learned quantization, no calibration, no training. Just the sign of each dimension.
Scale impact
| Format | Size at 2.4M entries | Search speed |
|---|---|---|
| Full float32 | ~7 GB | Baseline |
| Sign-zero only | ~220 MB | ~30x faster (XOR + popcount) |
Further reading
Hamming Distance — comparing the binary vectors
Embeddings — where the float vectors come from
Cosine Similarity — the full-precision alternative