Can LLMs Beat Classical Hyperparameter Optimization Algorithms?

A Study on autoresearch

Fabio Ferreira · Lucca Wobbe · Arjun Krishnakumar · Frank Hutter · Arber Zela

Abstract
We benchmark 9 HPO methods (classical, LLM-based, and hybrid) on Karpathy's autoresearch task under identical 24-hour budgets. Classical methods consistently outperform pure LLM-based agents within a fixed search space, where avoiding out-of-memory failures matters more than search diversity. Allowing the LLM to directly edit source code narrows the gap but does not close it, even with frontier models such as Claude Opus 4.6 and Gemini 3.1 Pro Preview. Our hybrid Centaur, which shares CMA-ES's interpretable internal state with the LLM, achieves the best result, and a 0.8B LLM already suffices to outperform all classical and pure LLM methods. All in all, our results suggest that LLMs are most effective as a complement to classical optimizers, not as a replacement.
What's new
Date : Change
2026-05-19 : Blog post: "Classical HPO vs frontier LLMs: an update" (companion to the paper; new seeds, Opus 4.7 + Sonnet 4.6, honest caveat on the original n=3 results)
2026-05-18 : Opus 4.6 Karpathy Agent (KA) Code + KA HPs extra seeds (5 → 8 in progress); Sonnet 4.6 KA Code + KA HPs at 5/5; Centaur Sonnet 4.6 at 5/5; CMA-ES added as second classical baseline
2026-05-17 : Opus 4.7 KA Code closes at 5/5
2026-05-16 : Live Benchmark tab launched (rolling Centaur / KA Code / KA HPs across Claude generations)
2026-05-15 : 5-seed extension across all Qwen 27B methods; result dirs de-aliased
Data access policy
Publishing per-trial HP configs and val_bpb values carries a risk of training-data leakage: future LLMs trained on this page could score artificially well when we later benchmark them on Climbmix-400B HPO. To reduce that risk, the raw per-trial LLM call logs (full prompts, responses, chain-of-thought, error traces; ~12 GB across ~7k files) have been removed from this repository. They remain available on request for academic reproducibility: open an issue on the GitHub repo or email the authors.


Method details & references
TPE : Tree-structured Parzen Estimator. Classical Bayesian HPO with density estimation. Optuna
CMA-ES : Covariance Matrix Adaptation Evolution Strategy. Classical evolutionary HPO. Optuna
SMAC : Sequential Model-based Algorithm Configuration with random-forest surrogate. SMAC3
Random : Uniform random sampling baseline. Optuna
LLAMBO (Optuna) : LLAMBO via OptunaHub port. Binary surrogate labels, categoricals random, failed trials hidden. OptunaHub · Ozaki et al. 2025
LLAMBO (Paper) : Faithful reimplementation of LLAMBO paper: continuous labels, all HPs visible, failed trials included. Liu et al. 2024 · our impl
Karpathy Agent (14 HPs) : LLM sees trial history and suggests next config within the fixed 14-HP search space. our impl
Karpathy Agent (Code) : LLM directly edits train.py source each trial (unconstrained). Karpathy autoresearch
Centaur (CMA-ES+LLM) : Hybrid: CMA-ES runs every trial; on 30% of trials the LLM overrides with a config informed by CMA-ES internal state. our paper · algorithm
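
The Centaur control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: `ToySearch` is a deliberately simple stand-in for CMA-ES (it is not CMA-ES), `fake_llm` is a hypothetical placeholder for a real LLM call, and feeding the overridden trial's result back to the optimizer is an assumption that the page does not confirm.

```python
import random

class ToySearch:
    """Stand-in for CMA-ES (NOT CMA-ES itself): Gaussian sampling around the best point."""
    def __init__(self, x0, sigma, rng):
        self.mean, self.sigma, self.rng = list(x0), sigma, rng
        self.best = float("inf")

    def ask(self):
        return [m + self.rng.gauss(0.0, self.sigma) for m in self.mean]

    def tell(self, x, loss):
        if loss < self.best:
            self.best, self.mean = loss, list(x)

    def state(self):
        # The "interpretable internal state" that would be shown to the LLM.
        return {"mean": list(self.mean), "sigma": self.sigma}

def fake_llm(state, history):
    """Hypothetical LLM call: here it just proposes a point halfway to the origin."""
    return [0.5 * m for m in state["mean"]]

def centaur_sketch(objective, n_trials=50, r=0.3, seed=0):
    rng = random.Random(seed)
    opt = ToySearch([2.0, -2.0], 0.5, rng)
    history = []  # (source, config, loss) for every trial
    for _ in range(n_trials):
        x, source = opt.ask(), "cma-es"   # the optimizer proposes on every trial
        if rng.random() < r:              # on roughly a fraction r of trials ...
            x, source = fake_llm(opt.state(), history), "llm"  # ... the LLM overrides
        loss = objective(x)
        opt.tell(x, loss)                 # assumption: the optimizer also sees LLM trials
        history.append((source, x, loss))
    return history

history = centaur_sketch(lambda x: sum(v * v for v in x))
llm_share = sum(s == "llm" for s, _, _ in history) / len(history)
```

Because the override is drawn independently per trial, `llm_share` only approximates `r` over a finite run; setting `r=0.0` recovers the plain optimizer loop.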

1. Classical vs LLM-based HPO

Best val_bpb found so far, plotted against cumulative training time (mean ± std across 3 seeds). The plot covers all 9 methods, with Qwen3.5-27B as the LLM wherever a method uses one, plus frontier-model variants of Centaur and Karpathy Agent (Code): Claude Opus 4.6, Gemini 3.1 Pro Preview, Gemini 3.1 Flash-Lite, and Gemini 2.5 Flash. Click the filter buttons to group methods by type, or click legend entries to toggle individual curves.
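
As a rough illustration of how such a curve can be computed, the sketch below evaluates each seed's best-so-far val_bpb on a shared time grid and then averages across seeds. The data layout, the numbers, and the use of the sample standard deviation are assumptions made for this example, not details taken from the benchmark code.

```python
from statistics import mean, stdev

def best_so_far_at(trials, t):
    """Best (lowest) val_bpb among trials finished by cumulative time t, or None."""
    done = [bpb for end, bpb in trials if end <= t and bpb is not None]
    return min(done) if done else None

def aggregate(seeds, grid):
    """Mean and sample std of the best-so-far value across seeds at each grid time."""
    out = []
    for t in grid:
        vals = [v for v in (best_so_far_at(s, t) for s in seeds) if v is not None]
        if len(vals) >= 2:  # a std needs at least two seeds with a result
            out.append((t, mean(vals), stdev(vals)))
    return out

# Made-up numbers: (cumulative hours when the trial ended, val_bpb or None if it failed)
seeds = [
    [(1.0, 1.10), (2.0, 1.05), (3.0, None)],
    [(1.0, 1.20), (2.0, 1.00), (3.0, 1.02)],
    [(1.0, 1.15), (2.0, 1.12), (3.0, 0.98)],
]
curve = aggregate(seeds, grid=[1.0, 2.0, 3.0])
```

Failed trials are stored as `None` and simply never improve the running minimum, so the mean curve cannot increase over time.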

Plot note: marker positions are offset per line for visual clarity.

Key OOM finding: Karpathy Agent (Code)'s failure rate drops sharply with LLM capability (19% for Qwen 0.8B, 12% for Qwen 27B, 3% for Gemini 3.1 Pro, 5% for Opus 4.6), whereas Centaur's stays in a narrow range (13-20%) across model choices. This suggests CMA-ES dominates OOM avoidance in the hybrid method, while pure code editing is directly sensitive to LLM capability.
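
For reference, a failure rate of this kind is just the share of trials that did not finish successfully. The snippet below is a sketch with made-up statuses; the status labels and the choice to count every non-"ok" trial (OOM or any other crash) as a failure are assumptions.

```python
def failure_rate(statuses):
    """Fraction of trials whose status is not 'ok' (0.0 for an empty run)."""
    if not statuses:
        return 0.0
    return sum(s != "ok" for s in statuses) / len(statuses)

statuses = ["ok"] * 22 + ["oom"] * 3   # made-up run of 25 trials
rate = failure_rate(statuses)          # 3 failures out of 25 trials
```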

2. Scaling the LLM optimizer (Qwen3.5: 0.8B vs 27B)

Scaling Qwen3.5 from 0.8B to 27B is essential for unconstrained code editing but provides no advantage for fixed-HP methods. Centaur even shows a slight edge with 0.8B.

Line styles: solid = 27B · dashed = 0.8B

3. Centaur LLM ratio ablation

Centaur lets the LLM override CMA-ES on a fraction r of trials. Too much LLM control (r=0.8) degrades performance, which suggests that CMA-ES should retain majority control. Filter by model size or LLM ratio. The CMA-ES baseline is always shown as a reference.

Line styles: solid = 27B · dashed = 0.8B · dotted = CMA-ES baseline

4. Centaur: LLM vs CMA-ES trial contributions (Qwen3.5-27B, r=0.3)

Centaur uses the LLM on ~30% of trials, CMA-ES on the rest. This plot shows which trials were proposed by which source (from Centaur's LLM call logs), with stars marking new incumbents.


5. Incumbent trace explorer

Grey dots = all trials. Colored stars = new incumbents (moments where the method found a new best). Staircase line = best-so-far trajectory. Reveals when each method found its improvements.
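
The incumbents and the staircase line can be derived from a method's trial sequence in a single pass. The sketch below assumes lower val_bpb is better and that failed trials are recorded as `None`; both the data layout and the values are made up for illustration.

```python
def incumbent_trace(values):
    """Return (incumbent indices, best-so-far staircase) for a list of val_bpb values."""
    incumbents, staircase, best = [], [], float("inf")
    for i, v in enumerate(values):
        if v is not None and v < best:   # strictly better than every earlier trial
            best = v
            incumbents.append(i)
        staircase.append(None if best == float("inf") else best)
    return incumbents, staircase

vals = [1.20, None, 1.10, 1.15, 1.05]    # None marks a failed trial
stars, stairs = incumbent_trace(vals)
# stars  -> [0, 2, 4]
# stairs -> [1.2, 1.2, 1.1, 1.1, 1.05]
```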


6. Hyperparameter evolution

One panel per hyperparameter (14 total). Each dot is a trial, colored by its val_bpb (darker = better). Shows how each method explores each HP dimension over time.


7. 2D HP scatter (pick two HPs)

Pick two hyperparameters to see how a method explored that 2D slice of the search space. Red × marks failed (OOM/crashed) trials.


8. LLM prompt explorer

Each LLM-based method receives a prompt containing the optimization goal, model class, training stack, hardware constraints, search space, and trial history. Select a method to view its prompt template. Braces like {name} are Python format placeholders filled in at runtime.
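
To make the placeholder mechanism concrete, here is a small sketch that fills a template with `str.format`. The template wording, the placeholder names, the hyperparameter names, and the hardware string are all hypothetical; the actual templates are the ones shown in the explorer.

```python
TEMPLATE = (
    "Goal: minimize {metric} on the validation set.\n"
    "Hardware: {hardware}\n"
    "Search space:\n{search_space}\n"
    "Trial history:\n{history}\n"
    "Reply with the next configuration as JSON."
)

def fmt_result(bpb):
    """Render a trial outcome; None stands for a failed (OOM/crashed) trial."""
    return "FAILED" if bpb is None else f"{bpb:.4f}"

def render_prompt(space, trials, hardware):
    space_txt = "\n".join(f"- {k}: {lo} to {hi}" for k, (lo, hi) in space.items())
    hist_txt = "\n".join(
        f"- trial {i}: {cfg} -> {fmt_result(bpb)}"
        for i, (cfg, bpb) in enumerate(trials)
    ) or "(no trials yet)"
    return TEMPLATE.format(metric="val_bpb", hardware=hardware,
                           search_space=space_txt, history=hist_txt)

# Hypothetical two-HP search space and a short made-up history.
space = {"learning_rate": (1e-4, 1e-2), "batch_size": (8, 64)}
trials = [({"learning_rate": 1e-3, "batch_size": 32}, 1.05),
          ({"learning_rate": 1e-2, "batch_size": 64}, None)]
prompt = render_prompt(space, trials, hardware="single GPU (hypothetical)")
```

A practical detail with this approach: any literal brace in the template itself (for example an inline JSON example) has to be doubled as `{{` and `}}`, whereas braces inside substituted values are inserted as-is.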
