← Back to overview

Sign-Zero Encoding

33x compression that preserves semantic fingerprints.

The idea

Take a float vector. For each dimension: if the value is positive, write 1. Otherwise, write 0. Then pack 8 bits into one byte.

# Float vector (768 dimensions) [ 0.23, -0.41, 0.08, -0.15, 0.67, -0.02, 0.31, 0.0, ... ] # Sign-zero encoded [ 1, 0, 1, 0, 1, 0, 1, 0, ... ] # Packed into bytes [ 0xAA, ... ] # 768 dims → 96 bytes (was 3072 bytes as float32)

A 768-dimensional float32 vector occupies 3,072 bytes. After sign-zero encoding and packing: 96 bytes. That's a 32x reduction. For Nomic's 768-dim model, the full index at 2.4M entries drops from gigabytes to a fraction.

Why it works

This is based on SimHash, introduced by Moses Charikar in 2002. The key insight: for vectors produced by contrastive training (like embedding models), the sign pattern captures most of the directional information. The Hamming distance between two sign-zero vectors correlates with the cosine distance between the original floats at r = 0.891.

Contrastive training spreads information across all dimensions roughly equally, so no single dimension dominates. Discarding magnitude loses surprisingly little.

Implementation

In NumPy, the entire encoding is one line:

binary = np.packbits((vectors > 0).astype(np.uint8), axis=1)

That's it. No learned quantization, no calibration, no training. Just the sign of each dimension.

Scale impact

FormatSize at 2.4M entriesSearch speed
Full float32~7 GBBaseline
Sign-zero only~220 MB~30x faster (XOR + popcount)

Further reading

Hamming Distance — comparing the binary vectors
Embeddings — where the float vectors come from
Cosine Similarity — the full-precision alternative