Cupertino · v1.2.0

Does the search find the right answer?

Cupertino is an Apple-platform documentation index built to keep AI coding assistants from hallucinating. This page summarises 10 measurements of how well it does that on the v1.2.0 build.

10 measurements · Auto-derived from audit markdown · Re-rendered when audits change

v1.2.0 release · search-quality leap

From 52% to 92% rank-1 accuracy.

On the queries an AI coding agent actually issues (Hashable, URLSession, Observable, SwiftUI), cupertino's MCP server now lands the right Apple documentation page at rank 1 roughly nine times out of ten, up from roughly five out of ten in v1.1.0. Across 110 queries spanning three independent test corpora, about 30 queries are newly answered correctly and none regressed.

+40pp rank-1 accuracy lift: 52% → 92% on the 50-query canonical lookup corpus
+30 queries newly correct: across 110 queries on three independent corpora
0 regressions: McNemar two-sided p ≤ 10⁻⁵ on the largest corpus

One architectural pattern surfaces twice

The two weak tests (abbreviations and symbol attributes) both reveal the same gap: cupertino has rich relational metadata (synonym tables, symbol kinds, attributes, conformances) that the default search path doesn't consult. A consolidated fix likely closes both. Tracking issues are open with three candidate approaches each.

Every measurement, with a path to the full audit

Each card below summarises one audit. The bar shows how close each test came to a perfect score. Open any card to see the full audit dashboard.

Version-diff comparisons — how v1.2.0 stacks against earlier releases

3 paired measurements covering 110+ queries: 2 strong, 1 mixed, 0 regressions. All show v1.2.0 improving over its predecessor.

Strong
Δ Version diff

Search-quality version diff: v1.1.0 → v1.2.0 (deprecation pairs)

Version-to-version comparison

modern-wins rate 90.00% → 100.00%

For each (query, modern_uri, legacy_uri) triple: run cupertino search "<query>" --limit 10, classify the outcome as modern_wins (modern…

Mixed
Δ Version diff

Search-quality version diff: v1.1.0 → v1.2.0 (canonical-lookup-V2 (independent corpus))

Version-to-version comparison

+8 / 30 queries newly rank-1

This is a cross-validation corpus. The v1.1.0 → v1.2.0 claim ("v1.2.0 is better") is being independently re-verified with a different fixed…


Absolute baselines — how v1.2.0 stands on its own terms

7 single-system measurements (no comparison to any prior release). 3 pass strongly, 1 mixed, 3 weak. The weak entries below are standing weaknesses — query classes where v1.2.0's ranker has known gaps; each has an open issue tracking the candidate fix. These are not v1.1.0 / v1.0.2 regressions.

Room for improvement

Query classes where cupertino's ranker has known standing weaknesses. Each has an open issue tracking a candidate fix. None of these is a v1.1.0 / v1.0.2 regression — they're pre-existing weak spots that v1.2.0 inherited and v1.3+ targets.

Weak
Standing baseline · all-time

Search-quality baseline: acronym / synonym recall (Phase 1.4, v1.2.0 candidate)

Acronym / synonym baseline

4 / 22 (18.2%)

This audit tests framework_aliases.synonyms — the cupertino-specific table that maps colloquial / abbreviated names to canonical framework…

Coming next

Does an AI agent then write correct Swift?

Test 1.7 · Anti-hallucination, end-to-end

All seven other tests measure whether the right document is findable. This test measures whether an AI agent, given cupertino's top results, produces Swift that compiles, calls real APIs, and respects platform availability. That is the actual success measure for cupertino.

Design written (PR #815)
Implementation issue filed (#816)
Build task corpus (~30 Swift tasks)
Wire agent driver (LLM + MCP)
Wire scoring pipeline (compile + symbol + availability + deprecation)
First formal run; publish baseline

Every metric, with its scientific source

Every number on the cards above traces back to a peer-reviewed source. The full citation list with paper / book / standard links is at sources.html. Highlights:

MRR

Voorhees (1999), TREC-8 QA Report. The reference for evaluating "the first relevant answer" rank.
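
As an illustration of the metric (not cupertino's own scoring code), mean reciprocal rank can be sketched in a few lines. The function names and the sample ranks are made up for this example; a query whose right answer never appears in the captured results is assumed to contribute 0.

```python
from typing import Optional, Sequence


def reciprocal_rank(rank: Optional[int]) -> float:
    """Return 1/rank for the first relevant result, or 0.0 if none was found."""
    return 0.0 if rank is None else 1.0 / rank


def mean_reciprocal_rank(first_relevant_ranks: Sequence[Optional[int]]) -> float:
    """Average the reciprocal ranks over all queries in a corpus."""
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    total = sum(reciprocal_rank(r) for r in first_relevant_ranks)
    return total / len(first_relevant_ranks)


# Illustrative ranks only: two rank-1 hits, one rank-3 hit, one miss.
sample_ranks = [1, 1, 3, None]
sample_mrr = mean_reciprocal_rank(sample_ranks)
print(f"MRR = {sample_mrr:.4f}")
```

A rank-1 hit contributes 1.0, while a rank-3 hit contributes only a third, so MRR rewards exactly the "first relevant answer" behaviour the headline rank-1 figure is about.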


BM25F

Robertson, Zaragoza, Taylor (2004). The field-weighted ranking formula cupertino tunes per-column.


Reciprocal Rank Fusion

Cormack, Clarke, Büttcher (2009). The cross-source fusion formula cupertino uses with k=60.
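
A minimal sketch of reciprocal rank fusion with k=60 follows. It shows the published formula, where each source adds 1 / (k + rank) to a document's score; it is not cupertino's actual implementation, and the function name and document URIs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists into one, best document first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            # Each source contributes 1 / (k + rank) for every document it returns.
            scores[doc] += 1.0 / (k + rank)
    # Sort by descending score; break ties by name so the output is deterministic.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical result lists from two sources.
source_a = ["swift/hashable", "swift/equatable", "swift/identifiable"]
source_b = ["swift/equatable", "swift/identifiable", "swift/hashable"]
fused = reciprocal_rank_fusion([source_a, source_b])
print(fused)
```

Because only ranks enter the formula, sources with incomparable raw scores can be fused without any score normalisation.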


Wilcoxon + McNemar

The two paired statistical tests cupertino uses to compare two builds. Wilcoxon (1945) for rank metrics, McNemar (1947) for binary outcomes.
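
To make the McNemar figure concrete, here is a sketch of the exact (binomial) two-sided form of the test. The audits may use a different variant, so treat this as an illustration. The counts assume the 50-query canonical lookup corpus is the one meant: 52% of 50 is 26 rank-1 hits and 92% is 46, which with zero regressions gives 20 improved queries and 0 regressed.

```python
from math import comb


def mcnemar_exact_two_sided(improved: int, regressed: int) -> float:
    """Exact two-sided McNemar p-value from the two discordant-pair counts."""
    n = improved + regressed
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    smaller = min(improved, regressed)
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Assumed counts (see the paragraph above): 20 improved, 0 regressed.
p_value = mcnemar_exact_two_sided(improved=20, regressed=0)
print(f"p = {p_value:.2e}")
```

Queries that both builds get right, or both get wrong, do not enter the test at all; only the queries whose outcome changed carry information about which build is better.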



How a cupertino search-quality measurement works

For each test I run a fixed list of queries against cupertino, capture the top-10 results, score them against pre-defined right-answer patterns, and report the headline number plus a per-query breakdown. No anecdotes; everything is reproducible.
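
The scoring step described above can be sketched roughly as follows. This is an assumption-laden illustration, not the audit tooling itself: it takes already-captured result lists instead of invoking cupertino, treats the right-answer patterns as regular expressions, and uses made-up function names and URIs.

```python
import re
from typing import Dict, List, Optional


def first_match_rank(results: List[str], pattern: str, limit: int = 10) -> Optional[int]:
    """Return the 1-based rank of the first result matching the pattern, if any."""
    regex = re.compile(pattern)
    for rank, uri in enumerate(results[:limit], start=1):
        if regex.search(uri):
            return rank
    return None


def score_run(results_by_query: Dict[str, List[str]],
              pattern_by_query: Dict[str, str]) -> dict:
    """Produce the headline rank-1 accuracy plus a per-query breakdown."""
    if not pattern_by_query:
        raise ValueError("need at least one query")
    per_query = {
        query: first_match_rank(results_by_query.get(query, []), pattern)
        for query, pattern in pattern_by_query.items()
    }
    rank1_hits = sum(1 for rank in per_query.values() if rank == 1)
    return {"rank1_accuracy": rank1_hits / len(per_query), "per_query": per_query}


# Hypothetical patterns and captured results, for illustration only.
patterns = {
    "Hashable": r"/swift/hashable$",
    "URLSession": r"/foundation/urlsession$",
}
captured = {
    "Hashable": ["apple-docs://swift/hashable", "apple-docs://swift/hasher"],
    "URLSession": ["apple-docs://foundation/urlsessiontask",
                   "apple-docs://foundation/urlsession"],
}
report = score_run(captured, patterns)
print(report)
```

Keeping the query list and the patterns fixed is what makes two runs comparable: only the build under test changes between them.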

Full methodology in design/search-quality-eval. The universal rule lives at mihaela-agents/Rules/universal/search-quality-eval.md.