- The paper's headline holds and is now stronger: Centaur (CMA-ES + LLM) with Claude Opus 4.6 beats TPE at p=0.055 (n=8), compared with p=0.06 (n=5) in the paper.
- Two pure-LLM agent methods (KA Code [Opus 4.6] and KA HPs [Opus 4.7]) now numerically beat TPE. The differences are within noise (p≈0.41), but the ordering of the means has flipped relative to the paper.
- Two new Anthropic generations tested. Opus 4.7 and Sonnet 4.6 do not reproduce the Opus 4.6 advantage in our setup at n=5. We do not have enough seeds to conclude whether this is a model difference or sampling variance.
- TPE's mean shifted ~0.0004 lower when we extended its seeds from n=3 to n=8. Part of the apparent "closing gap" is the classical baseline being measured more accurately, not LLMs improving.
- We keep a live tracker. Each new Claude release runs here.
What we already wrote
In the original paper we benchmarked 9 HPO methods (classical, LLM-based, hybrid) on Karpathy's autoresearch task, optimizing a 50M-parameter GPT-2-class transformer on Climbmix-400B. Each method got an identical 24h training-time budget per seed. We tested two LLM optimizer "shapes":
- Karpathy Agent (14 HPs): the LLM suggests the next config inside a fixed 14-HP search space (DEPTH, head dim, batch sizes, learning rates, etc.).
- Karpathy Agent (Code): the LLM edits train.py directly. There is no fixed search space, so OOM and "off-spec" trials are possible.
We also introduced Centaur (CMA-ES + LLM): a CMA-ES inner loop where, on a fraction of trials, the LLM overrides the proposal with a config informed by CMA-ES's internal state (its current mean and covariance estimate). The paper concluded that classical methods consistently outperform pure LLM-based agents within a fixed search space, that code-editing freedom narrows but does not close the gap with frontier models such as Opus 4.6 and Gemini 3.1 Pro Preview, and that Centaur was the only LLM-flavored setup that landed near the classical baselines.
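The override scheme described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: a diagonal Gaussian search distribution stands in for CMA-ES (real CMA-ES adapts a full covariance matrix and a step size), `llm_propose` is a hypothetical placeholder for the model call, `toy_loss` is a made-up objective, and the 0.3 override fraction is the ratio quoted later in this post.

```python
import random
import statistics


def llm_propose(mean, sigma, history):
    """Hypothetical stand-in for the LLM call.

    A real system would serialize the optimizer state (mean, spread, past
    trials) into a prompt and parse a config from the reply. This placeholder
    returns the midpoint between the current mean and the best trial so far.
    """
    if not history:
        return list(mean)
    best_cfg, _ = min(history, key=lambda t: t[1])
    return [(m + b) / 2 for m, b in zip(mean, best_cfg)]


def toy_loss(cfg):
    # Made-up objective standing in for validation loss; minimum at (0.3, -0.2).
    return (cfg[0] - 0.3) ** 2 + (cfg[1] + 0.2) ** 2


def centaur_sketch(n_generations=30, popsize=8, override_frac=0.3, seed=0):
    rng = random.Random(seed)
    mean, sigma = [0.0, 0.0], [1.0, 1.0]
    history = []  # (config, loss) for every evaluated trial
    n_llm = 0
    for _ in range(n_generations):
        population = []
        for _ in range(popsize):
            if rng.random() < override_frac:
                # LLM override: proposal informed by the optimizer's state.
                cfg = llm_propose(mean, sigma, history)
                n_llm += 1
            else:
                # Ordinary proposal sampled from the search distribution.
                cfg = [rng.gauss(m, s) for m, s in zip(mean, sigma)]
            loss = toy_loss(cfg)
            population.append((cfg, loss))
            history.append((cfg, loss))
        # Refit the distribution to the best half, LLM trials included.
        population.sort(key=lambda t: t[1])
        elite = [cfg for cfg, _ in population[: popsize // 2]]
        for d in range(len(mean)):
            column = [cfg[d] for cfg in elite]
            mean[d] = statistics.fmean(column)
            sigma[d] = max(statistics.pstdev(column), 1e-3)
    best_cfg, best_loss = min(history, key=lambda t: t[1])
    return best_cfg, best_loss, n_llm / len(history)


best_cfg, best_loss, llm_fraction = centaur_sketch()
print(best_cfg, best_loss, llm_fraction)
```

The point of the sketch is the control flow: the LLM never replaces the optimizer, it only substitutes its own proposal on a fraction of trials, and those trials feed back into the same distribution update as the sampled ones.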
That was the picture at n=3 seeds per method.
What changed
Three things shifted since publication:
- More seeds on the headline methods. TPE: n=3 → n=8. Centaur (CMA-ES + LLM) [Opus 4.6]: n=3 → n=8. The other methods went from n=3 to n=5 (with extras still in flight).
- Two new Anthropic generations. Claude Opus 4.7 (April 2026) and Claude Sonnet 4.6 (May 2026), both tested via the Claude Code SDK with thinking=False, identical SUGGEST_PROMPT to the paper's Opus 4.6 runs, identical VRAM cap.
- A live tracker. Auto-updates the comparison and the paired Wilcoxon p-values whenever new seeds clear the 95% training-budget threshold.
Updated results
The current state of the 11 method × generation cells in the tracker:
| Method | val_bpb, mean ± std (lower is better) | n (paired vs TPE) | p vs TPE (one-sided) |
|---|---|---|---|
| Centaur (CMA-ES + LLM) [Opus 4.6] | 0.9738 ± 0.0013 | 8 | 0.055 ** |
| Karpathy Agent (14 HPs) [Opus 4.7] | 0.9752 ± 0.0029 | 5 | 0.406 |
| Karpathy Agent (Code) [Opus 4.6] | 0.9753 ± 0.0032 | 5 | 0.406 |
| TPE (classical, baseline) | 0.9755 ± 0.0019 | — | — |
| Karpathy Agent (14 HPs) [Opus 4.6] | 0.9757 ± 0.0029 | 5 | 0.500 |
| Centaur (CMA-ES + LLM) [Opus 4.7] | 0.9764 ± 0.0006 | 5 | 0.781 |
| Karpathy Agent (14 HPs) [Sonnet 4.6] | 0.9764 ± 0.0016 | 5 | 0.594 |
| CMA-ES (classical) | 0.9774 ± 0.0024 | — | — |
| Centaur (CMA-ES + LLM) [Sonnet 4.6] | 0.9780 ± 0.0028 | 5 | 0.969 |
| Karpathy Agent (Code) [Opus 4.7] | 0.9790 ± 0.0021 | 5 | 0.906 |
| Karpathy Agent (Code) [Sonnet 4.6] | 0.9800 ± 0.0047 | 5 | 0.969 |
Reading the table:
- Three of nine LLM × generation variants numerically beat TPE (Centaur Opus 4.6, KA HPs Opus 4.7, KA Code Opus 4.6).
- Only one of the nine comes close to conventional significance: Centaur (CMA-ES + LLM) [Opus 4.6], with p=0.055 at n=8 paired observations, just above the usual 0.05 cutoff.
- Sonnet 4.6 variants cluster near the bottom across all three method families. CMA-ES alone (no LLM) is mid-pack.
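To make the p-values in the table easier to interpret, here is a small sketch of an exact one-sided paired Wilcoxon signed-rank test using only the standard library. It is an illustration of the general test, not the tracker's code; the function name and the example differences are hypothetical (arbitrary units, not data from this study). It assumes differences are computed as method minus baseline, with lower being better.

```python
from itertools import product


def wilcoxon_one_sided_exact(diffs):
    """Exact one-sided p-value for H1: paired differences tend to be negative.

    Zero differences are dropped; tied absolute values get average ranks.
    """
    d = [x for x in diffs if x != 0]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    # Test statistic: sum of ranks belonging to positive differences.
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    # Under H0 every sign pattern is equally likely, so enumerate all 2^n.
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) <= w_plus
    )
    return hits / 2 ** n


# Hypothetical differences in arbitrary units.
print(wilcoxon_one_sided_exact([-1, -2, -3, -4, -5]))           # 0.03125
print(wilcoxon_one_sided_exact([1, 2, 3, -4, -5, -6, -7, -8]))  # 0.0546875
```

One consequence worth keeping in mind: with n=5 pairs the exact p-value can only take multiples of 1/32, so the smallest achievable one-sided value is about 0.031, and that is reached only when every seed favors the method. The n=5 rows therefore have very little room to reach a low p-value.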
Surprises worth a paragraph
1. Centaur with Opus 4.6 holds up; the picture across models is mixed
The same recipe (same prompt, same CMA-ES state hand-off, same VRAM cap, same 30% LLM-override ratio) gives p=0.055 with Opus 4.6 but lands in the noise for Opus 4.7 (p=0.78) and Sonnet 4.6 (p=0.97); both of those means are actually above TPE's. At n=5 for the Opus 4.7 and Sonnet 4.6 runs we cannot distinguish a real model effect from seed-level variance. For now: Centaur's lift is observed for the configuration we tested in the paper; whether it generalizes across LLMs is an open question.
2. KA Code [Opus 4.6] closes the "code-editing" gap the paper said it couldn't
The paper wrote: "Allowing the LLM to directly edit source code narrows the gap but does not close it, even with frontier models such as Claude Opus 4.6." Updated data: KA Code [Opus 4.6] mean = 0.9753, TPE mean = 0.9755. The mean is now below TPE by 0.0002. Statistical test: paired Wilcoxon p=0.41, well inside noise. So "doesn't significantly close" is still defensible; "doesn't close" was overstated at the paper's n=3.
3. Part of the closing is TPE getting better data
TPE n=3 → n=8 moved its mean from approximately 0.9760 to 0.9755. That is roughly the same magnitude as the KA Code [Opus 4.6] gap closure. At n=3, paired tests on a 0.001 effect at this noise level have very wide confidence intervals; the paper's headline numbers were on thin statistical ice and we should have noted that more clearly. This is the most important methodological lesson from the update.
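A back-of-envelope check makes the point concrete. The sketch below computes the standard error of the mean at n=3 and n=8, assuming independent seeds and assuming that the seed-to-seed standard deviation reported for TPE in the table (0.0019, measured at n=8) also applied at n=3. Both are simplifying assumptions, not results from the study.

```python
import math


def standard_error(std, n):
    """Standard error of the mean for n independent seeds."""
    return std / math.sqrt(n)


tpe_std = 0.0019  # TPE std from the results table (measured at n=8)

se_n3 = standard_error(tpe_std, 3)
se_n8 = standard_error(tpe_std, 8)

print(f"SE at n=3: {se_n3:.4f}")  # SE at n=3: 0.0011
print(f"SE at n=8: {se_n8:.4f}")  # SE at n=8: 0.0007
```

Under these assumptions, the shift in TPE's mean of roughly 0.0005 is less than one standard error at n=3, and the 0.0002 gap between KA Code [Opus 4.6] and TPE is smaller than the standard error at either sample size.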
4. CMA-ES alone is the worst classical method on this benchmark
CMA-ES standalone: 0.9774. TPE: 0.9755. But hand CMA-ES's internal state to an LLM (Centaur with Opus 4.6) and the combined system is the best of everything tested. The hybrid's lift is not simply "the LLM rescues a bad classical method": exposing the CMA-ES internal state gives the LLM a foothold that neither pure-LLM agents nor CMA-ES alone have.
What we cannot claim yet
- No GPT-5 in the comparison. OpenAI's flagship is conspicuously absent. Adding GPT-5 is the next item on the queue; results will land in the live tracker as they complete.
- One task. Climbmix-400B language modeling is one downstream. We have not tested whether the Centaur-Opus-4.6 effect carries to other tasks.
- n=8 is still small. The Wilcoxon p-value at n=8 sits at 0.055; with additional seeds it will move and we cannot predict the direction.
- We do not have a mechanistic explanation for why Opus 4.6 + Centaur outperforms the other configurations.
The live tracker
https://ferreirafabio.github.io/autoresearch-automl/#tab=tracker
Four sections:
- A. Convergence. All method × generation cells, val_bpb vs cumulative wall-time, mean ± std band. Filter by Claude generation.
- B. Slopegraph. Per-method progression across the Claude generations we've tested.
- C. Wilcoxon forest. Δ vs TPE with paired p-values.
- D. Per-generation summary cards. One card per Anthropic release; last-updated timestamp; current leader.
Each new Anthropic release lands here automatically once the 5-seed campaign for that release clears the 95% training-budget threshold.
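The gating rule can be sketched as below. This is a guess at the logic, not the tracker's actual code: `SeedRun`, `completed`, and `campaign_ready` are hypothetical names, and "clears the threshold" is assumed to mean that a seed has consumed at least 95% of the 24h per-seed training budget. The run values in the usage example are placeholders.

```python
from dataclasses import dataclass

BUDGET_HOURS = 24.0   # per-seed training-time budget stated in the post
THRESHOLD = 0.95      # fraction of the budget a seed must have consumed
SEEDS_REQUIRED = 5    # seeds per release campaign


@dataclass
class SeedRun:        # hypothetical record layout
    seed: int
    train_hours: float


def completed(runs):
    """Seeds that have used at least THRESHOLD of the training budget."""
    return [r for r in runs if r.train_hours >= THRESHOLD * BUDGET_HOURS]


def campaign_ready(runs):
    """True once enough seeds have cleared the threshold to publish a cell."""
    return len(completed(runs)) >= SEEDS_REQUIRED


# Placeholder runs: four finished seeds and one still in flight.
runs = [SeedRun(s, 24.0) for s in range(4)] + [SeedRun(4, 12.0)]
print(campaign_ready(runs))  # False
```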
What's next
- GPT-5. Backend scaffolding in progress; will fold into the live tracker once seed runs complete.
- More seeds on the lagging methods to settle the "is it Opus 4.6 specifically or just noise?" question.
- arXiv v2 after the NeurIPS 2026 review window closes, with the updated numbers and a more cautious framing of the n=3 results in the original.