A Study on autoresearch
| Date | Change |
|---|---|
| 2026-05-19 | Blog post: Classical HPO vs frontier LLMs — an update (companion to the paper; new seeds, Opus 4.7 + Sonnet 4.6, honest caveat on the original n=3 results) |
| 2026-05-18 | Opus 4.6 KA Code + KA HPs extra seeds (5 → 8 in progress); Sonnet 4.6 KA Code + KA HPs at 5/5; Centaur Sonnet 4.6 at 5/5; CMA-ES added as second classical baseline |
| 2026-05-17 | Opus 4.7 KA Code closes at 5/5 |
| 2026-05-16 | Live Benchmark tab launched (rolling Centaur / KA Code / KA HPs across Claude generations) |
| 2026-05-15 | 5-seed extension across all Qwen 27B methods; result dirs de-aliased |
train.py source each trial (unconstrained). Karpathy autoresearch
Best val_bpb against cumulative training time (mean ± std across 3 seeds). All 9 Qwen3.5-27B methods plus frontier-model variants of Centaur and Karpathy Agent (Code): Claude Opus 4.6, Gemini 3.1 Pro Preview, Gemini 3.1 Flash-Lite, and Gemini 2.5 Flash. Click the filter buttons to group methods by type, or click legend entries to toggle individual curves.
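As a rough illustration of how a mean ± std curve over seeds can be built, the sketch below treats each seed's best-so-far val_bpb as a step function of cumulative training time and samples all seeds on a shared time grid. The function names, the grid, and the numbers are made up for the example, and the use of the sample standard deviation is an assumption; the page does not say which estimator the plot uses.

```python
import statistics
from bisect import bisect_right

def best_at(times, best_so_far, t):
    """Best-so-far value of one seed at time t, or None before its first trial."""
    i = bisect_right(times, t)
    return best_so_far[i - 1] if i else None

def aggregate(seeds, grid):
    """Return (t, mean, std) for every grid point where all seeds have a value."""
    curve = []
    for t in grid:
        vals = [best_at(times, best, t) for times, best in seeds]
        if all(v is not None for v in vals):
            # Assumption: sample standard deviation (n - 1 in the denominator).
            curve.append((t, statistics.mean(vals), statistics.stdev(vals)))
    return curve

# Hypothetical data: (cumulative hours, best-so-far val_bpb) for three seeds.
seeds = [
    ([1, 2, 4], [1.00, 0.99, 0.98]),
    ([1, 3, 4], [1.01, 0.98, 0.98]),
    ([2, 3, 5], [1.02, 1.00, 0.97]),
]
curve = aggregate(seeds, grid=[0, 2, 4])
for t, mean, std in curve:
    print(f"t={t}h  {mean:.4f} ± {std:.4f}")
```

Grid points that precede a seed's first completed trial are dropped here instead of being averaged over fewer seeds, which keeps the band comparable along the whole curve.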
Key out-of-memory (OOM) finding: the failure rate of Karpathy Agent (Code) drops sharply with LLM capability (19% for Qwen 0.8B, 12% for Qwen 27B, 3% for Gemini 3.1 Pro, 5% for Opus 4.6), whereas Centaur's failure rate stays in a narrow range (13–20%) across model choices. This suggests that CMA-ES dominates OOM avoidance in the hybrid method, while pure code editing is directly sensitive to LLM capability.
Scaling Qwen3.5 from 0.8B to 27B is essential for unconstrained code editing but provides no advantage for fixed-HP methods. Centaur even shows a slight edge with 0.8B.
Centaur lets the LLM override CMA-ES on a fraction r of trials. Too much LLM control (r=0.8) degrades performance, confirming CMA-ES should retain majority control. Filter by model size or LLM ratio. CMA-ES baseline is always shown as reference.
Centaur uses the LLM on ~30% of trials, CMA-ES on the rest. This plot shows which trials were proposed by which source (from Centaur's LLM call logs), with stars marking new incumbents.
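The split between the two proposers can be pictured with a minimal sketch. It assumes each trial is handed to the LLM independently with probability r; this is an illustrative guess, since the actual Centaur scheduler may use a fixed pattern instead. The name `assign_sources` is hypothetical.

```python
import random

def assign_sources(n_trials, llm_ratio, seed=0):
    """Label each trial with the proposer that generates its configuration.

    Assumption: independent coin flip per trial with probability llm_ratio.
    """
    rng = random.Random(seed)
    return ["llm" if rng.random() < llm_ratio else "cma_es" for _ in range(n_trials)]

sources = assign_sources(n_trials=1000, llm_ratio=0.3)
llm_share = sources.count("llm") / len(sources)
print(f"LLM proposed {llm_share:.0%} of trials, CMA-ES the rest")
```

Under this assumption the realized LLM share fluctuates around r from run to run, which would be one reason a log-derived figure reads "~30%" instead of exactly 30%.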
Grey dots = all trials. Colored stars = new incumbents (moments where the method found a new best). Staircase line = best-so-far trajectory. Reveals when each method found its improvements.
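The staircase and the stars both follow from a running minimum over the trial sequence. The sketch below shows one way to compute them; the trial values are made up, and representing failed trials as `None` is an assumption for the example.

```python
import math

def incumbent_trace(val_bpbs):
    """Return (best_so_far, incumbent_indices) for a sequence of trial results.

    Failed trials are given as None and never become incumbents.
    """
    best = math.inf
    best_so_far = []
    incumbents = []
    for i, value in enumerate(val_bpbs):
        if value is not None and value < best:
            best = value          # strictly better: new incumbent (a star)
            incumbents.append(i)
        best_so_far.append(best)  # staircase height after trial i
    return best_so_far, incumbents

trials = [1.02, 0.99, None, 1.00, 0.98, 0.985]  # hypothetical val_bpb values
trace, stars = incumbent_trace(trials)
print(trace)
print(stars)
```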
One panel per hyperparameter (14 total). Each dot is a trial, colored by its val_bpb (darker = better). Shows how each method explores each HP dimension over time.
Pick two hyperparameters to see how a method explored that 2D slice of the search space. Red × marks failed (OOM/crashed) trials.
Each LLM-based method receives a prompt containing the optimization goal, model class, training stack, hardware constraints, search space, and trial history. Select a method to view its prompt template. Braces like {name} are Python format placeholders filled in at runtime.
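A small sketch of how such a template might be filled in with `str.format`. The template text, field names, and values are invented for illustration and are not the prompts used in the study.

```python
# Hypothetical template; doubled braces render as literal braces in the output.
PROMPT_TEMPLATE = (
    "Goal: {goal}\n"
    "Hardware: {hardware}\n"
    "Search space: {search_space}\n"
    "Trial history:\n{history}\n"
    'Reply with JSON such as {{"lr": 0.001}}.'
)

def render_history(history):
    """Format (config, val_bpb) pairs as one line per trial."""
    return "\n".join(f"- {config} -> val_bpb={score}" for config, score in history)

prompt = PROMPT_TEMPLATE.format(
    goal="minimize val_bpb",
    hardware="single GPU, limited memory",
    search_space="lr in [1e-4, 1e-2]",
    history=render_history([({"lr": 0.003}, 0.99), ({"lr": 0.001}, 0.98)]),
)
print(prompt)
```

One practical detail: any literal brace in a template, for example in a JSON reply example, has to be doubled, otherwise `format` treats it as a placeholder and raises an error.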
Each new Claude release is evaluated over multiple seeds on Climbmix-400B with a 24h training budget. We compare against TPE (the best classical baseline). Results refresh automatically as new jobs complete.
Last updated: 2026-05-20 20:08 UTC
| Method | Seeds | Mean ± std | One-sided p vs TPE |
|---|---|---|---|
| Centaur (CMA-ES + LLM) [Opus 4.6] ★ | 8 | 0.9738 ± 0.0013 | n=8, p=0.055 * |
| Karpathy Agent (14 HPs) [Opus 4.6] | 7 | 0.9751 ± 0.0027 | n=7, p=0.406 |
| Karpathy Agent (14 HPs) [Opus 4.7] | 5 | 0.9752 ± 0.0029 | n=5, p=0.406 |
| Karpathy Agent (Code) [Opus 4.6] | 5 | 0.9753 ± 0.0032 | n=5, p=0.406 |
| TPE (classical) | 8 | 0.9755 ± 0.0018 | (baseline) |
| Centaur (CMA-ES + LLM) [Opus 4.7] | 5 | 0.9764 ± 0.0006 | n=5, p=0.781 |
| Karpathy Agent (14 HPs) [Sonnet 4.6] | 5 | 0.9764 ± 0.0016 | n=5, p=0.594 |
| Centaur (CMA-ES + LLM) [Sonnet 4.6] | 5 | 0.9780 ± 0.0020 | n=5, p=0.969 |
| Karpathy Agent (Code) [Opus 4.7] | 5 | 0.9790 ± 0.0021 | n=5, p=0.906 |
| Karpathy Agent (Code) [Sonnet 4.6] | 5 | 0.9800 ± 0.0047 | n=5, p=0.969 |
Significance: * p<0.10 (one-sided Wilcoxon vs TPE).
Best val_bpb against cumulative training time, all Opus generations overlaid. TPE always visible as the classical reference. Use the filter buttons to narrow to a single generation. Default-visible state: TPE plus the two most recent Opus generations.
For each method (Centaur, Karpathy Agent (Code), Karpathy Agent (14 HPs)), a slopegraph showing per-seed final val_bpb across Claude generations. Steep downward slopes indicate that the latest Opus release improved on the previous one for that method.
Paired Wilcoxon signed-rank test for each Claude generation × method versus TPE. Bars show Δ = mean(method) − mean(TPE) across paired seeds; one-sided p-value annotated.
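For readers who want to see the mechanics, here is a self-contained sketch of an exact one-sided paired Wilcoxon signed-rank test, with the alternative that the method's val_bpb is lower than the baseline's. It assumes no zero differences and no tied absolute differences, and the per-seed numbers are made up; whether the page uses the exact or a normal-approximation variant is not stated.

```python
import statistics
from itertools import product

def wilcoxon_one_sided_exact(method, baseline):
    """Exact p-value for H1: paired method scores are lower than baseline scores."""
    diffs = [m - b for m, b in zip(method, baseline)]
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    # W+ is the rank sum of the positive differences; small W+ favors H1.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 every sign pattern is equally likely, so enumerate all 2^n.
    at_most = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r * s for r, s in zip(range(1, n + 1), signs)) <= w_plus
    )
    return at_most / 2 ** n

# Hypothetical paired final val_bpb values (same seeds for both methods).
method = [0.974, 0.972, 0.975, 0.973, 0.976]
baseline = [0.976, 0.975, 0.974, 0.977, 0.9785]
delta = statistics.mean(method) - statistics.mean(baseline)
p_value = wilcoxon_one_sided_exact(method, baseline)
print(f"delta={delta:+.4f}, one-sided p={p_value:.4f}")
```

With five paired seeds the exact test has only 32 equally likely sign patterns, so the smallest attainable one-sided p-value is 1/32 ≈ 0.031, reached only when the method wins on every seed.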
One card per Claude release: mean ± std across all completed seeds, paired Wilcoxon p-value vs TPE, number of seeds completed, and last-updated date.
Opus 4.6

| Method | Seeds | Mean ± std | p vs TPE |
|---|---|---|---|
| Centaur (CMA-ES + LLM) | 8 | 0.9738 ± 0.0013 | 0.055 |
| Karpathy Agent (Code) | 5 | 0.9753 ± 0.0032 | 0.406 |
| Karpathy Agent (14 HPs) | 7 | 0.9751 ± 0.0027 | 0.406 |

Opus 4.7

| Method | Seeds | Mean ± std | p vs TPE |
|---|---|---|---|
| Centaur (CMA-ES + LLM) | 5 | 0.9764 ± 0.0006 | 0.781 |
| Karpathy Agent (Code) | 5 | 0.9790 ± 0.0021 | 0.906 |
| Karpathy Agent (14 HPs) | 5 | 0.9752 ± 0.0029 | 0.406 |

Sonnet 4.6

| Method | Seeds | Mean ± std | p vs TPE |
|---|---|---|---|
| Centaur (CMA-ES + LLM) | 5 | 0.9780 ± 0.0020 | 0.969 |
| Karpathy Agent (Code) | 5 | 0.9800 ± 0.0047 | 0.969 |
| Karpathy Agent (14 HPs) | 5 | 0.9764 ± 0.0016 | 0.594 |

Classical reference (TPE): 0.9755 ± 0.0018 across 8 seeds.