2026-05-19 · ~10 min read

Classical HPO vs frontier LLMs: an update

Companion post to "Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch" (Ferreira, Wobbe, Krishnakumar, Hutter, Zela, 2026)

TL;DR.

What we already wrote

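In the original paper we benchmarked 9 HPO methods (classical, LLM-based, hybrid) on Karpathy's autoresearch task, optimizing a 50M-parameter GPT-2-class transformer on Climbmix-400B. Each method got an identical 24h training-time budget per seed. We tested two LLM optimizer "shapes": Karpathy Agent (14 HPs), which proposes values within a fixed search space of 14 hyperparameters, and Karpathy Agent (Code), which edits the training source code directly.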

We also introduced Centaur (CMA-ES + LLM): a CMA-ES inner loop where, on a fraction of trials, the LLM overrides the proposal with a config informed by CMA-ES's internal state (its current mean and covariance estimate). The paper concluded that classical methods consistently outperform pure LLM-based agents within a fixed search space, that code-editing freedom narrows but does not close the gap with frontier models such as Opus 4.6 and Gemini 3.1 Pro Preview, and that Centaur was the only LLM-flavored setup that landed near the classical baselines.
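To make the override mechanism concrete, here is a minimal sketch of the loop shape. It is not the paper's implementation: the sampler is a simplified diagonal-Gaussian evolution strategy standing in for real CMA-ES (no covariance or step-size adaptation), `llm_propose` is a hypothetical stub that just echoes the current mean so the sketch runs offline, and the default `override_ratio` of 0.3 reflects the 30% LLM-override ratio mentioned later in this post.

```python
import random


def llm_propose(mean, sigma):
    """Hypothetical stand-in for the LLM call (an assumption, not a real API).

    A real version would put the optimizer state (mean, spread) into a
    prompt and parse a config from the reply. Here it returns the mean.
    """
    return list(mean)


def centaur_ask(mean, sigma, override_ratio=0.3, rng=random):
    """Propose one config; on a fraction of trials the LLM overrides the sampler."""
    if rng.random() < override_ratio:
        return llm_propose(mean, sigma), "llm"
    # Simplified stand-in for the CMA-ES proposal: independent Gaussians.
    return [rng.gauss(m, s) for m, s in zip(mean, sigma)], "sampler"


def run(objective, mean, sigma, trials=60, popsize=6, override_ratio=0.3, seed=0):
    """Minimize `objective`; lower is better. Returns (best_loss, best_config)."""
    rng = random.Random(seed)
    best = (float("inf"), None)
    for _ in range(trials // popsize):
        batch = []
        for _ in range(popsize):
            config, _source = centaur_ask(mean, sigma, override_ratio, rng)
            batch.append((objective(config), config))
        batch.sort(key=lambda item: item[0])
        best = min(best, batch[0], key=lambda item: item[0])
        # Move the mean toward the better half of the batch.
        # (Real CMA-ES would also adapt the covariance and step size.)
        elite = [config for _, config in batch[: popsize // 2]]
        mean = [sum(column) / len(column) for column in zip(*elite)]
    return best


if __name__ == "__main__":
    loss, config = run(lambda c: sum(x * x for x in c), [2.0, -1.0], [0.5, 0.5])
    print(loss, config)
```

The design point the sketch tries to capture is that LLM proposals and sampler proposals are evaluated and ranked in the same batch, so the optimizer's update treats an override like any other candidate.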

That was the picture at n=3 seeds per method.

What changed

Three things shifted since publication:

  1. More seeds on the headline methods. TPE: n=3 → n=8. Centaur (CMA-ES + LLM) [Opus 4.6]: n=3 → n=8. The other methods went from n=3 to n=5 (with extras still in flight).
  2. Two new Anthropic generations. Claude Opus 4.7 (April 2026) and Claude Sonnet 4.6 (May 2026), both tested via the Claude Code SDK with thinking=False, identical SUGGEST_PROMPT to the paper's Opus 4.6 runs, identical VRAM cap.
  3. A live tracker. It auto-updates the comparison and the paired Wilcoxon p-values whenever new seeds clear the 95% training-budget threshold.

A caveat to flag up front: TPE's mean shifted from ~0.9760 (n=3) to 0.9755 (n=8) as the continuation seeds came in. The classical baseline's point estimate improved by about 0.0005 with more data, so part of what looks like LLM improvement vs the paper is, more honestly, TPE getting a better point estimate. We come back to this below.

Updated results

The current state of the 11 method × generation cells in the tracker:

Method                                  Mean ± std        n vs TPE   p (one-sided)
Centaur (CMA-ES + LLM) [Opus 4.6]       0.9738 ± 0.0013   8          0.055 **
Karpathy Agent (14 HPs) [Opus 4.7]      0.9752 ± 0.0029   5          0.406
Karpathy Agent (Code) [Opus 4.6]        0.9753 ± 0.0032   5          0.406
TPE (classical, baseline)               0.9755 ± 0.0019   -          -
Karpathy Agent (14 HPs) [Opus 4.6]      0.9757 ± 0.0029   5          0.500
Centaur (CMA-ES + LLM) [Opus 4.7]       0.9764 ± 0.0006   5          0.781
Karpathy Agent (14 HPs) [Sonnet 4.6]    0.9764 ± 0.0016   5          0.594
CMA-ES (classical)                      0.9774 ± 0.0024   -          -
Centaur (CMA-ES + LLM) [Sonnet 4.6]     0.9780 ± 0.0028   5          0.969
Karpathy Agent (Code) [Opus 4.7]        0.9790 ± 0.0021   5          0.906
Karpathy Agent (Code) [Sonnet 4.6]      0.9800 ± 0.0047   5          0.969

Reading the table:

Surprises worth a paragraph

1. Centaur with Opus 4.6 holds up; the picture across models is mixed

The same recipe (same prompt, same CMA-ES state hand-off, same VRAM cap, same 30% LLM-override ratio) gives p=0.055 with Opus 4.6 (n=8) but lands in the noise for Opus 4.7 (p=0.78) and Sonnet 4.6 (p=0.97); both of those means are actually above TPE's (0.9764 and 0.9780 vs 0.9755). At n=5 for the Opus 4.7 and Sonnet 4.6 runs we cannot distinguish a real model effect from seed-level variance. For now, Centaur's lift is observed only for the configuration we tested in the paper; whether it generalizes across LLMs is an open question.

2. KA Code [Opus 4.6] closes the "code-editing" gap the paper said it couldn't

The paper wrote: "Allowing the LLM to directly edit source code narrows the gap but does not close it, even with frontier models such as Claude Opus 4.6." Updated data: KA Code [Opus 4.6] mean = 0.9753, TPE mean = 0.9755. The mean is now below TPE by 0.0002. Statistical test: paired Wilcoxon p=0.41, well inside noise. So "doesn't significantly close" is still defensible; "doesn't close" was overstated at the paper's n=3.

3. Part of the closing is TPE getting better data

TPE n=3 → n=8 moved its mean from approximately 0.9760 to 0.9755. That shift of about 0.0005 is roughly the same magnitude as the KA Code [Opus 4.6] gap closure. At n=3, a paired test has almost no power to detect a 0.001 effect at this noise level (a one-sided paired Wilcoxon test on 3 pairs cannot return a p-value below 1/8 = 0.125); the paper's headline numbers were on thin statistical ice and we should have noted that more clearly. This is the most important methodological lesson from the update.
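The small-sample limits of a paired Wilcoxon test can be checked directly. The sketch below computes an exact one-sided signed-rank p-value by enumerating all sign patterns, which is feasible at these seed counts. It is an illustration written for this post, under two assumptions: lower scores are better, and seeds are paired by index. The tracker's actual implementation may differ.

```python
from itertools import product


def wilcoxon_one_sided_exact(method_scores, baseline_scores):
    """Exact one-sided paired Wilcoxon signed-rank p-value.

    Alternative hypothesis: the method's scores are lower (better) than
    the baseline's. Zero differences are dropped; tied magnitudes share
    their average rank.
    """
    diffs = [m - b for m, b in zip(method_scores, baseline_scores) if m != b]
    n = len(diffs)
    if n == 0:
        return 1.0

    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # W+ is the rank sum of pairs where the method was worse (diff > 0).
    # A small W+ is evidence that the method is better.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under the null every sign pattern is equally likely: count the
    # patterns whose W+ is at least as extreme as the observed one.
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) <= w_plus
    )
    return hits / 2 ** n


if __name__ == "__main__":
    # Synthetic best case: the method beats the baseline on every seed.
    baseline = [1.0] * 8
    method = [1.0 - 0.001 * (i + 1) for i in range(8)]
    for n in (3, 5, 8):
        print(n, wilcoxon_one_sided_exact(method[:n], baseline[:n]))
```

Even when a method wins on every seed, the smallest attainable p-value is 1/2^n: 0.125 at n=3, 0.03125 at n=5, and about 0.0039 at n=8. At n=5, a single seed where the method loses already pushes the exact p-value to at least 0.0625.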

4. CMA-ES alone is the worst classical method on this benchmark

CMA-ES standalone: 0.9774. TPE: 0.9755. But hand CMA-ES's internal state to an LLM (Centaur with Opus 4.6) and the combined system has the best mean of everything tested (0.9738). The hybrid's lift is not "the LLM rescues a bad classical method"; rather, exposing the CMA-ES internal state gives the LLM a foothold that neither pure-LLM agents nor CMA-ES alone have.

What we cannot claim yet

The live tracker

https://ferreirafabio.github.io/autoresearch-automl/#tab=tracker

Four sections:

Each new Anthropic release lands here automatically once the 5-seed campaign for that release clears the 95% training-budget threshold.
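As a rough illustration of that gating rule, the sketch below counts a seed once it has used at least 95% of the 24h training budget and admits a release once 5 seeds have cleared. The function names, the per-seed "training seconds used" input, and the reading of "clears the threshold" as "used at least 95% of the budget" are all assumptions made for illustration, not the tracker's actual code.

```python
BUDGET_SECONDS = 24 * 3600  # 24h training-time budget per seed
THRESHOLD_PERCENT = 95      # seeds count once they reach 95% of the budget


def cleared_seeds(runs, budget_seconds=BUDGET_SECONDS, percent=THRESHOLD_PERCENT):
    """Return the sorted seed ids whose training time reached the threshold.

    `runs` maps seed id -> training seconds used (hypothetical input shape).
    Integer arithmetic avoids floating-point surprises at the boundary.
    """
    return sorted(
        seed for seed, used in runs.items() if used * 100 >= percent * budget_seconds
    )


def campaign_ready(runs, required_seeds=5):
    """True once enough seeds have cleared the training-budget threshold."""
    return len(cleared_seeds(runs)) >= required_seeds


if __name__ == "__main__":
    runs = {0: 86400, 1: 83000, 2: 82079, 3: 40000, 4: 82080}
    print(cleared_seeds(runs), campaign_ready(runs))
```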

What's next