Search-quality version diff: v1.1.0 → v1.2.0 (deprecation pairs)
Cupertino is an Apple-platform documentation index built to keep AI coding assistants from hallucinating. This page summarises 10 measurements of how well it does that on the v1.2.0 build.
On the queries an AI coding agent actually issues (Hashable, URLSession, Observable, SwiftUI), cupertino's MCP server now lands the right Apple documentation page at rank 1 nine times out of ten, up from roughly five out of ten in v1.1.0. Across 110 queries spanning three independent test corpora, about 30 queries newly answer correctly and none regressed.
Read the full release write-up — what changed in v1.2.0, why the numbers moved →
Two of the weak tests (abbreviations and symbol attributes) reveal the same gap: cupertino has rich relational metadata (synonym tables, symbol kinds, attributes, conformances) that the default search path doesn't consult. A consolidated fix would likely close both. Tracking issues are open, each with three candidate approaches.
Related issues: #818 acronym routing; #819 attribute filters; #820 design-vocab intent; #821 prose profile
Each card below summarises one audit. The bar shows how close the test came to a perfect score. Open any card to see the full audit dashboard.
3 paired measurements covering 110+ queries. 2 strong, 1 mixed, 0 regressions. All show v1.2.0 improving over its predecessor.
Version-to-version comparison
For each (query, modern_uri, legacy_uri) triple: run cupertino search "<query>" --limit 10, classify the outcome as modern_wins (modern…
Version-to-version comparison
This is the Phase 1.8 version-to-version comparison KPI specified in issue #830, applied to the v1.1.0 → v1.2.0 jump.
Version-to-version comparison
This is a cross-validation corpus. The v1.1.0 → v1.2.0 claim ("v1.2.0 is better") is being independently re-verified with a different fixed…
7 single-system measurements (no comparison to any prior release). 3 pass strongly, 1 mixed, 3 weak. The weak entries below are standing weaknesses: query classes where v1.2.0's ranker has known gaps, each with an open issue tracking the candidate fix. These are not v1.1.0 / v1.0.2 regressions.
Deprecation-aware baseline
This audit answers a focused question: when a developer queries a concept that exists in both modern Swift form (value type / stdlib) and th…
Search-quality baseline
This audit records the v1.2.0 candidate database's standing on Criterion 1 (good search) restricted to query classes **A (canonical lookup)*…
CamelCase fragment baseline
This audit tests Search.Index.CamelCaseSplitter (#77), the cupertino-specific mechanism that expands CamelCase identifiers like `LazyVGrid…
Cross-source canonical baseline
This audit tests Search.SmartQuery.sourceWeights (apple-docs=3.0, swift-evolution=1.5, packages=1.5, swift-book=1.0, swift-org=1.0, sample…
Query classes where cupertino's ranker has known standing weaknesses. Each has an open issue tracking a candidate fix. None of these is a v1.1.0 / v1.0.2 regression — they're pre-existing weak spots that v1.2.0 inherited and v1.3+ targets.
Prose / conceptual baseline
This audit tests multi-word natural-language queries — the kind a developer or AI agent would actually issue in a coding session, like "how…
Symbol-attribute baseline
This audit tests symbol-attribute queries — queries that conceptually describe symbols by attribute (@MainActor, @Observable), by signat…
Acronym / synonym baseline
This audit tests framework_aliases.synonyms — the cupertino-specific table that maps colloquial / abbreviated names to canonical framework…
Test 1.7 · Anti-hallucination, end-to-end
All seven other tests measure whether the right document is findable. This test measures whether an AI agent, given cupertino's top results, produces Swift that compiles, calls real APIs, and respects platform availability. That is the actual success measure for cupertino.
Every number on the cards above traces back to a peer-reviewed source. The full citation list with paper / book / standard links is at sources.html. Highlights:
Voorhees (1999), TREC-8 QA Report. The reference for evaluating "the first relevant answer" rank.
Robertson, Zaragoza, Taylor (2004). The field-weighted ranking formula (BM25F) that cupertino tunes per column.
Cormack, Clarke, Büttcher (2009). The cross-source fusion formula (reciprocal rank fusion) that cupertino uses with k=60.
The two paired statistical tests cupertino uses to compare two builds. Wilcoxon (1945) for rank metrics, McNemar (1947) for binary outcomes.
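The fusion formula from Cormack, Clarke, Büttcher (2009) is usually called reciprocal rank fusion: each document scores the sum of 1 / (k + rank) over every list it appears in. The following is a minimal sketch of that formula with k=60, not cupertino's actual implementation; the function name, the sample URIs, and the unweighted sum are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical per-source rankings for one query (made-up URIs).
apple_docs = ["doc://hashable", "doc://equatable", "doc://identifiable"]
swift_book = ["doc://hashable", "doc://equatable"]

fused = reciprocal_rank_fusion([apple_docs, swift_book])
print(fused)
```

A document ranked well by several sources accumulates score from each, so agreement between sources outweighs a single high rank in one list.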
For each test I run a fixed list of queries against cupertino, capture the top-10 results, score them against pre-defined right-answer patterns, and report the headline number plus a per-query breakdown. No anecdotes; everything is reproducible.
~30-50 queries per test, hand-curated to cover breadth (types, protocols, methods, framework concepts). Each query carries a right-answer pattern.
A Python harness invokes cupertino search via subprocess for each query and extracts the top-10 URIs from stdout. The harness is read-only against the database.
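The harness step can be sketched roughly as below. The command line (cupertino search "<query>" --limit 10) comes from the audit descriptions; everything else is an assumption for illustration: that the cupertino binary is on PATH, that result URIs appear in stdout as scheme://... tokens, and the helper names extract_uris and run_query.

```python
import re
import subprocess

# Assumption: result URIs show up in stdout as scheme://... tokens.
URI_PATTERN = re.compile(r"[a-z][a-z0-9+.-]*://[^\s)>\]]+", re.IGNORECASE)

def extract_uris(stdout_text, limit=10):
    """Return the first `limit` distinct URIs in order of appearance."""
    seen = []
    for match in URI_PATTERN.finditer(stdout_text):
        uri = match.group(0).rstrip(".,;")  # drop trailing punctuation
        if uri not in seen:
            seen.append(uri)
        if len(seen) == limit:
            break
    return seen

def run_query(query, limit=10):
    """Invoke `cupertino search` as a subprocess and return the top URIs."""
    completed = subprocess.run(
        ["cupertino", "search", query, "--limit", str(limit)],
        capture_output=True,
        text=True,
        check=True,  # raise if the CLI exits with a non-zero status
    )
    return extract_uris(completed.stdout, limit)

# Parsing can be exercised without the CLI, using made-up output text.
sample = "1. Hashable (apple-docs://swift/hashable)\n2. See apple-docs://swift/equatable."
print(extract_uris(sample))
```

Keeping the parsing separate from the subprocess call means the extraction logic can be checked against captured output without a database present.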
Per-query MRR / P@k / NDCG / per-class custom metric. Auto-extracted, no human in the loop for Phase 1.
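A minimal sketch of the three standard per-query metrics, using their textbook definitions with binary relevance. Modelling the right-answer pattern as a regular expression is an assumption, as are the sample URIs and function names. Because a pattern cannot say how many relevant documents exist overall, this NDCG normalises against the best reordering of the retrieved results only.

```python
import math
import re

def reciprocal_rank(results, is_relevant):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, uri in enumerate(results, start=1):
        if is_relevant(uri):
            return 1.0 / rank
    return 0.0

def precision_at_k(results, is_relevant, k):
    """Fraction of the top-k slots filled by relevant results."""
    return sum(1 for uri in results[:k] if is_relevant(uri)) / k

def ndcg_at_k(results, is_relevant, k):
    """Binary-gain NDCG, normalised by the ideal order of retrieved hits."""
    gains = [1.0 if is_relevant(uri) else 0.0 for uri in results[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical right-answer pattern and top results for one query.
pattern = re.compile(r"/hashable\b")

def is_hit(uri):
    return bool(pattern.search(uri))

top = [
    "apple-docs://swift/equatable",
    "apple-docs://swift/hashable",
    "apple-docs://swift/hasher",
    "apple-docs://swift/hashable/hash(into:)",
]

rr = reciprocal_rank(top, is_hit)    # first hit is at rank 2
p_at_3 = precision_at_k(top, is_hit, 3)
ndcg = ndcg_at_k(top, is_hit, 4)
print(rr, p_at_3, ndcg)
```

MRR for a test is then the mean of reciprocal_rank over all queries in its corpus.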
Markdown audit at docs/audits/, raw JSON data at /tmp/, and this dashboard auto-derived from the markdown. Future ranking changes are paired against the baseline using Wilcoxon / McNemar significance tests.
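For the binary case (did a query hit at rank 1 or not, on each build), the paired comparison can be sketched as an exact McNemar test. This uses the standard exact binomial form of the test and only the Python standard library; the function name and the hit/miss vectors are synthetic illustrations, not data from these audits.

```python
from math import comb

def mcnemar_exact(baseline_hits, candidate_hits):
    """Two-sided exact McNemar test on paired binary outcomes.

    Returns (improved, regressed, p_value). Only discordant pairs count:
    queries where exactly one of the two builds scored a hit.
    """
    if len(baseline_hits) != len(candidate_hits):
        raise ValueError("both builds must be scored on the same queries")
    pairs = list(zip(baseline_hits, candidate_hits))
    improved = sum(1 for b, c in pairs if not b and c)
    regressed = sum(1 for b, c in pairs if b and not c)
    n = improved + regressed
    if n == 0:
        return improved, regressed, 1.0
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(improved, regressed) + 1)) / 2 ** n
    return improved, regressed, min(1.0, 2 * tail)

# Synthetic example: rank-1 hit or miss for 10 queries on two builds.
baseline = [True, False, False, False, True, False, False, True, False, False]
candidate = [True, True, True, True, True, True, True, True, False, True]

improved, regressed, p_value = mcnemar_exact(baseline, candidate)
print(improved, regressed, p_value)
```

Rank-valued metrics such as per-query reciprocal rank would go through the Wilcoxon signed-rank test instead, which is not sketched here.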
Full methodology in design/search-quality-eval.
The universal rule lives at mihaela-agents/Rules/universal/search-quality-eval.md.