Zaxy 2.2 ANN Path Engineering Plan

Purpose

The 2.1 vector-scale lane measured the embedded ANN (Kuzu HNSW) path below exact brute-force search on every axis that matters, so 2.1 shipped with VECTOR_ANN_THRESHOLD raised to keep ANN opt-in. This plan makes the ANN path genuinely better than exact search where exact search stops being viable, and lowers the threshold only when the lane proves it.

The guiding rule is unchanged from 2.1:

Defaults move on lane evidence, not on assertion.

Baseline (measured 2026-06-11, master @ post-2.1.0)

Full record: /tmp/zaxy-ann-baseline/BASELINE.md (lane runs at dim 64 and dim 1536; 100k dim-64 reference from the identical released code).

Metric                  | dim 64, 10k    | dim 64, 100k     | dim 1536, 10k
ANN recall@10           | 0.9062         | 0.8969           | 0.5156
ANN p50 vs exact        | 9.8ms vs 6.1ms | 37.9ms vs 17.0ms | 26.5ms vs 8.9ms
ANN build (first query) | 53s            | ~20.6 min        | 196s
int8 recall@10          | pass (0.99+)   | 0.9938           | 0.6094
int8 p50 vs exact       | 1.1ms vs 6.1ms | 9.5ms vs 17.0ms  | 35.6ms vs 8.9ms

Three reframing findings versus the 2.1 understanding:

  1. Dimension is the dominant variable. At production-scale dimension (1536), ANN recall collapses to 0.52 and int8 quantization — the 2.1 promotion candidate — collapses to 0.61. The dim-64 lane was the easy case.
  2. At high dimension the fight is memory, not latency. Exact float64 is 8.9ms p50 at 10k/1536, which is fine, but 100k × 1536 × 8B ≈ 1.2GB against a 256MB vector-cache budget. Approximate methods must win on resident bytes first, recall second, latency third.
  3. int8's collapse is candidate selection, not rerank. The float rerank is exact; at high dimension the true top-10 falls outside the fixed top-k×4 int8 candidate set. Fix is adaptive oversampling and/or quantization with better high-dim behavior — measured against realistic embedding distributions, since hash-embedding value distributions may be adversarial for int8 specifically.
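
The memory arithmetic behind finding 2 can be checked with a few lines. This is a minimal sketch: the helper name is hypothetical, the 256MB budget is read as 256 MiB (an assumption), and the int8 figure counts only the code matrix, not scales or any float vectors kept for rerank.

```python
def resident_bytes(rows: int, dim: int, bytes_per_value: int) -> int:
    """Bytes needed to keep a dense rows x dim matrix resident."""
    return rows * dim * bytes_per_value


# Assumption: the 256MB vector-cache budget is interpreted as 256 MiB.
CACHE_BUDGET = 256 * 1024 * 1024

float64_bytes = resident_bytes(100_000, 1536, 8)  # exact float64 matrix
int8_bytes = resident_bytes(100_000, 1536, 1)     # int8 codes only

for label, size in (("float64", float64_bytes), ("int8", int8_bytes)):
    verdict = "fits" if size <= CACHE_BUDGET else "exceeds budget"
    print(f"{label}: {size / 1e9:.2f} GB -> {verdict}")
```

Under these assumptions the float64 matrix is roughly 4.6 times the budget while the int8 codes fit, which is why resident bytes are ranked ahead of recall and latency.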

Research Findings

Kuzu reality (external research, fully sourced in /tmp/zaxy-ann-research/)

Decisive experiments (10k, dim 64, this machine; raw JSON: /tmp/zaxy-ann-exp/results.json)

Design

Workstream A — Query path (latency + recall)

Lever order revised by E2/E3: rerank first, filter-scan profiling second, projected-graph hygiene third, efs last.

  1. Oversample + exact float64 rerank on the ANN path (mirroring the int8 design): fetch k×oversample from HNSW, rerank with exact float64 scores from the already-resident entity vectors. E3 says this eliminates the measured recall deficit class (float32-boundary tie flips) entirely and makes recall robust to build variance — the determinism mitigation that costs nothing at build time.
  2. Profile and fix the filtered-query cost at 10^5. E2 shows the latency gap is not graph create/drop; the per-query prefilter mask scan over the session/version predicate is the suspect. Options in order: avoid the predicate entirely when one (session, version) group owns the whole shadow generation (make the shadow table per-group so unfiltered direct-table queries are the common case); otherwise one long-lived projected graph per (session, version) on the store's derived-cache pattern.
  3. Drop per-query projected graphs regardless (hygiene + the design intent of connection-scoped graphs), with lifecycle tied to _clear_read_caches.
  4. Expose efs as setting VECTOR_ANN_EFS (default from the lane sweep; secondary lever now — E3 hit 1.0 at the 200 default under clean conditions). Capabilities reports it.
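
Item 1 (oversample, then exact rerank) might look roughly like the sketch below. It is not the store's implementation: `rerank_top_k`, `fake_ann_search`, and the `VECTORS` dict are hypothetical stand-ins, and dot-product scoring is an assumption about the metric.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def exact_score(a: Vector, b: Vector) -> float:
    """Dot product in Python floats, which are float64."""
    return sum(x * y for x, y in zip(a, b))


def rerank_top_k(
    query: Vector,
    k: int,
    oversample: int,
    ann_search: Callable[[Vector, int], List[int]],
    vectors: Dict[int, Vector],
) -> List[Tuple[int, float]]:
    """Fetch k * oversample candidates, then rerank them with exact scores."""
    candidate_ids = ann_search(query, k * oversample)
    scored = [(cid, exact_score(query, vectors[cid])) for cid in candidate_ids]
    # Highest score first; ties broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical resident vectors and a stand-in for the HNSW call that
# returns candidates in a deliberately poor order.
VECTORS: Dict[int, Vector] = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [0.5, 0.5],
}


def fake_ann_search(query: Vector, limit: int) -> List[int]:
    return [2, 1, 0, 3][:limit]


top = rerank_top_k([1.0, 0.0], k=2, oversample=2,
                   ann_search=fake_ann_search, vectors=VECTORS)
print(top)
```

With `oversample=1` the same toy index would miss id 0 entirely, which is the candidate-set failure the oversample factor is meant to absorb. The rerank can only reorder what the index returns.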

Workstream B — Build path (sync time + reproducibility)

  1. Replace batched UNWIND sync with COPY FROM an in-memory Arrow table for full rebuilds; create the HNSW index after the copy completes.
  2. Rebuild without DROP_VECTOR_INDEX (#6040): rebuild into a fresh shadow table generation (e.g. ZaxyVectorAnnShadow{dim}_g{n}), swap the active generation atomically in store state, and drop the old table (not the index) afterward. Test the full cycle hard.
  3. Incremental small deltas stay on the existing insert path (inserts are reflected in queries — verified in 2.1); full COPY rebuilds trigger on the same lazy signature change that rebuilds the dense matrix today, with a delta-threshold to choose between incremental insert and generation swap.
  4. Cold-start guard (#6047): measure index-load cost at 10^5 on DB open; if it blocks, document and gate index existence behind the threshold so default-path users never pay it.
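
The generation swap in item 2 and the delta-threshold choice in item 3 can be sketched in plain Python. Everything here is illustrative: `ShadowGenerations`, `choose_sync_path`, the in-memory `tables` dict standing in for database tables, and the 10% default threshold are assumptions, not the store's API or a measured value.

```python
import threading
from typing import Dict, Iterable, List, Optional, Sequence


class ShadowGenerations:
    """Tracks which shadow-table generation is active for one dimension."""

    def __init__(self, dim: int) -> None:
        self._dim = dim
        self._lock = threading.Lock()
        self._active: Optional[int] = None
        self._next = 0
        # In-memory stand-in for database tables (hypothetical).
        self.tables: Dict[str, List[Sequence[float]]] = {}

    def table_name(self, generation: int) -> str:
        return f"ZaxyVectorAnnShadow{self._dim}_g{generation}"

    def active_table(self) -> Optional[str]:
        with self._lock:
            if self._active is None:
                return None
            return self.table_name(self._active)

    def rebuild(self, rows: Iterable[Sequence[float]]) -> str:
        """Build a fresh generation, swap it in, then drop the old table."""
        with self._lock:
            generation = self._next
            self._next += 1
        name = self.table_name(generation)
        # Stand-in for the bulk copy followed by index creation.
        self.tables[name] = list(rows)
        with self._lock:
            previous, self._active = self._active, generation
        if previous is not None:
            # Drop the old table as a whole, never the index on its own.
            del self.tables[self.table_name(previous)]
        return name


def choose_sync_path(delta_rows: int, total_rows: int,
                     delta_threshold: float = 0.1) -> str:
    """Pick incremental insert for small deltas, generation swap otherwise."""
    if total_rows == 0 or delta_rows / total_rows > delta_threshold:
        return "generation_swap"
    return "incremental"
```

Readers only ever see a fully built generation because the active pointer moves after the build finishes; the old table is removed only once nothing points at it.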

Post-ann.1 diagnostic findings (dim-1536 root cause)

The 2.2-ann.1 gate runs failed recall at dim 1536 (0.55/0.61) even with the float64 rerank. Follow-up diagnostics isolated the cause and exonerate the index:

0.9875 at efs 400, 1.0 at efs 800. With realistic distributions the index is healthy; efs 400 is the evidence-backed high-dim default.

Consequences (folded into Workstream C below): the lane gains a tie-aware recall metric (a retrieved item counts as a hit when its exact score is at least the k-th true score, so boundary ties are not penalized; this is standard ANN-benchmarking practice), reported ALONGSIDE strict recall so nothing is hidden; the realistic-distribution variant (C3) becomes the high-dim gate corpus; and the VECTOR_ANN_EFS default moves to 400 on the sweep evidence.
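
A minimal sketch of the two recall metrics side by side, assuming higher scores are better and the true scores are sorted in descending order. The function names and the tolerance are hypothetical; the tie-aware variant uses the threshold form common in ANN benchmarking, where a retrieved item counts when its exact score reaches the k-th true score.

```python
from typing import Sequence


def strict_recall(retrieved_ids: Sequence[int], true_ids: Sequence[int],
                  k: int) -> float:
    """Fraction of the true top-k ids present in the retrieved top-k."""
    return len(set(retrieved_ids[:k]) & set(true_ids[:k])) / k


def tie_aware_recall(retrieved_scores: Sequence[float],
                     true_scores: Sequence[float],
                     k: int, tol: float = 1e-9) -> float:
    """Fraction of retrieved items scoring at least the k-th true score."""
    kth_true = true_scores[k - 1]  # true_scores sorted descending
    hits = sum(1 for s in retrieved_scores[:k] if s >= kth_true - tol)
    return hits / k


# Ids 1 and 2 tie at 0.8, so either is a legitimate second result.
true_ids = [0, 1, 2, 3]
true_scores = [0.9, 0.8, 0.8, 0.5]
retrieved_ids = [0, 2]
retrieved_scores = [0.9, 0.8]

print(strict_recall(retrieved_ids, true_ids, k=2))           # 0.5
print(tie_aware_recall(retrieved_scores, true_scores, k=2))  # 1.0
```

Reporting both keeps the strict number visible while showing how much of the deficit is boundary ties rather than genuinely missed neighbors.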

Workstream C — Quantized path at high dimension

  1. Adaptive oversampling: scale the int8 candidate multiplier with dimension (the fixed k×4 is the measured failure at 1536); sweep on the lane to find the recall/latency frontier.
  2. Evaluate int8 asymmetric scoring (float query × int8 corpus, per-dim or per-block scales) if oversampling alone cannot reach 0.95 at 1536 within latency budget.
  3. Realistic-distribution check: add an optional lane variant using a realistic embedding distribution (e.g. normalized Gaussian mixture or downloadable real vectors kept out of the default path) so int8 conclusions are not artifacts of hash-embedding value distributions.
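
For item 2, asymmetric scoring keeps the query in float and dequantizes the corpus on the fly. The sketch below uses per-dimension scales derived from the corpus maximum; the function names and the max-abs scale rule are assumptions for illustration, not the store's quantizer.

```python
from typing import List, Sequence

Vector = Sequence[float]


def fit_scales(corpus: Sequence[Vector]) -> List[float]:
    """One scale per dimension so the largest magnitude maps to 127."""
    scales = []
    for d in range(len(corpus[0])):
        peak = max(abs(row[d]) for row in corpus)
        scales.append(peak / 127.0 if peak > 0 else 1.0)
    return scales


def quantize(vec: Vector, scales: Sequence[float]) -> List[int]:
    """Round each value to its int8 code, clamped to [-127, 127]."""
    return [max(-127, min(127, round(x / s))) for x, s in zip(vec, scales)]


def asymmetric_score(query: Vector, codes: Sequence[int],
                     scales: Sequence[float]) -> float:
    """Float query times dequantized int8 codes (query is not quantized)."""
    return sum(q * s * c for q, s, c in zip(query, scales, codes))


corpus = [[0.5, -1.0], [0.25, 2.0]]
scales = fit_scales(corpus)
codes = [quantize(row, scales) for row in corpus]

query = [1.0, 1.0]
for row, row_codes in zip(corpus, codes):
    exact = sum(q * x for q, x in zip(query, row))
    approx = asymmetric_score(query, row_codes, scales)
    print(f"exact={exact:.4f} approx={approx:.4f}")
```

Because only the corpus side is rounded, the per-dimension error is bounded by half a scale step times the query magnitude. Whether that is enough to reach 0.95 at dim 1536 is exactly what the lane sweep has to show.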

Workstream D — Consolidation

The store currently carries three parallel search paths (dense float64, _AnnVectorGroup, _QuantizedVectorGroup) with duplicated selection, scoring, and cache-accounting logic. Consolidate behind one internal strategy interface (selection → candidates → exact rerank → results with exact flag), so A/B/C land as strategy implementations rather than more branching. This is the "consolidate where needed" mandate — done as part of the work, not as a separate refactor pass.
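
One possible shape for that strategy interface is sketched below. All names (`SearchStrategy`, `SearchResult`, `DenseStrategy`, `FixedCandidateStrategy`, `search`) are hypothetical and dot-product scoring is an assumption; the point is only that each path supplies candidates while rerank and result assembly live in one place.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Sequence

Vector = Sequence[float]


@dataclass(frozen=True)
class SearchResult:
    entity_id: int
    score: float
    exact: bool  # True only when the candidate set was exhaustive


class SearchStrategy(Protocol):
    exhaustive: bool

    def candidates(self, query: Vector, k: int) -> List[int]: ...


class DenseStrategy:
    """Exact path: every resident vector is a candidate."""
    exhaustive = True

    def __init__(self, vectors: Dict[int, Vector]) -> None:
        self._vectors = vectors

    def candidates(self, query: Vector, k: int) -> List[int]:
        return list(self._vectors)


class FixedCandidateStrategy:
    """Stand-in for an approximate path (ANN or int8 selection)."""
    exhaustive = False

    def __init__(self, ids: List[int]) -> None:
        self._ids = ids

    def candidates(self, query: Vector, k: int) -> List[int]:
        return list(self._ids)


def search(strategy: SearchStrategy, query: Vector, k: int,
           vectors: Dict[int, Vector]) -> List[SearchResult]:
    """Shared pipeline: candidates, exact rerank, results with exact flag."""
    scored = [
        (cid, sum(q * v for q, v in zip(query, vectors[cid])))
        for cid in strategy.candidates(query, k)
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [SearchResult(cid, s, strategy.exhaustive) for cid, s in scored[:k]]
```

With this split, the scoring and tie-breaking code exists once, and a new path only has to answer "which ids are worth reranking".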

Decision Gates

Gate | Evidence required | Action on pass | Action on fail
G1 query path | Lane at 10^5 dim 64: ANN recall ≥0.95 AND p50 < exact | proceed to G2 | ANN stays opt-in; record findings
G2 build path | Full rebuild at 10^5 in single-digit minutes; rebuild cycle survives generation-swap stress test | proceed to G3 | threshold stays; incremental-only posture documented
G3 high-dim | Lane at 10^4–10^5 dim 1536: at least one approximate mode (ANN or int8) recall ≥0.95 with bytes < exact | mode becomes the documented high-dim recommendation | exact remains the only recommendation at high dim; memory ceiling documented
G4 threshold | G1+G2 pass with margin on two consecutive lane runs | lower VECTOR_ANN_THRESHOLD with migration note | threshold unchanged

No gate is judged on a single lane run; HNSW build variance means each gate needs two consecutive passing runs (the lane's documented nondeterminism posture).
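
The two-consecutive-runs rule can be stated as a small check. This sketch uses hypothetical function names and reads "two consecutive passing runs" as the two most recent runs both passing, which is an interpretation rather than something the plan spells out; the G1 criteria are the ones listed in the gate table.

```python
from typing import Sequence


def g1_run_passes(ann_recall: float, ann_p50_ms: float,
                  exact_p50_ms: float) -> bool:
    """G1 criteria for a single lane run: recall >= 0.95 and p50 below exact."""
    return ann_recall >= 0.95 and ann_p50_ms < exact_p50_ms


def gate_passes(run_outcomes: Sequence[bool],
                required_consecutive: int = 2) -> bool:
    """A gate passes only if the most recent N runs all passed."""
    if len(run_outcomes) < required_consecutive:
        return False
    return all(run_outcomes[-required_consecutive:])


# The 100k dim-64 baseline row fails G1 on both recall and latency.
baseline_ok = g1_run_passes(ann_recall=0.8969, ann_p50_ms=37.9,
                            exact_p50_ms=17.0)
print(baseline_ok)                       # False
print(gate_passes([True]))               # False: a single run never decides
print(gate_passes([False, True, True]))  # True
```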

Non-Goals

Increment Plan

  1. 2.2-ann.1 (Workstreams A + D): strategy consolidation, direct-table / reused projected graph, VECTOR_ANN_EFS, ANN oversample+rerank. Lane evidence against G1.
  2. 2.2-ann.2 (Workstream B): COPY-based generation-swap rebuilds, delta-threshold incremental policy, cold-start measurement. Evidence against G2.
  3. 2.2-ann.3 (Workstream C): adaptive int8 oversampling, high-dim sweep, realistic-distribution lane variant. Evidence against G3.
  4. 2.2-ann.4: G4 threshold decision, docs (configuration/embeddings/migration), capabilities reporting, release notes.

Each increment lands green (ruff, mypy strict, full pytest with coverage, site freshness) and updates the lane before the next starts.

G4 Outcome (2026-06-11)

G4 passed and the threshold moved, scoped to the dimension the evidence covers. Decision as shipped in 2.2-ann.4: