Capability tags

🔌 MCP 🤖 Agent 🔄 Workflow 🌐 Translation 🐳 Docker 💻 CLI 🧬 Embedding 📚 RAG 🧠 Claude ✨ GPT

Agent Workflow

AI Workflow Evaluation

Build complete AI automation flows without code

English name: awesome-evals

⭐ 218 Stars 🍴 11 Forks 📄 NOASSERTION 🏷 AI score 8.0


ai-agents · awesome-list · benchmarks


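✦ AI Skill Hub recommendation

AI Skill Hub strongly recommends AI Workflow Evaluation, a high-quality Agent workflow. It has an overall AI score of 8.0 and performs solidly among comparable tools. If you are looking for a reliable Agent workflow solution, it is worth a closer look.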

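📚 In-depth analysis

AI Workflow Evaluation is a complete AI Agent automation workflow solution. As AI capabilities keep improving, Agent-based automated workflows are becoming a core way to raise individual and team productivity. Unlike traditional RPA automation, which simulates mouse and keyboard actions, an AI Agent workflow understands the intent of a task and plans its execution path dynamically, so it can handle more complex, unstructured tasks.

The workflow follows a "minimal configuration, maximum reuse" principle: the core logic is already packaged, so users only need to supply their own API key and business parameters to get started. It includes built-in error handling and retry mechanisms, so it keeps running reliably through network fluctuations or API rate limiting, which makes it suitable as automation infrastructure for production environments.

For real deployments, run it 3-5 times in a test environment first and verify that the output of every stage matches expectations before moving to production. AI Skill Hub rates it 8.0, a featured pick among comparable Agent workflows.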

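📋 Tool overview

AI Workflow Evaluation is a complete AI Agent automation workflow solution. Through visual node orchestration, it breaks complex multi-step tasks into clear automated flows that run fully unattended. It integrates with hundreds of external services and APIs, and suits data-processing pipelines, business automation, and AI-assisted decision systems.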

GitHub Stars: ⭐ 218
Forks: 11
Language: Multiple
Supported platforms: Windows / macOS / Linux
Maintenance status: Lightweight project, updated as needed
License: NOASSERTION
Overall AI score: 8.0
Tool type: Agent workflow

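📖 Documentation

The following content was compiled automatically by AI Skill Hub from the project information. For the complete original documentation, see "Original source" at the bottom.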

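📌 Core features

Visual Agent workflow orchestration, with no complex code to write
Multi-step automated task chains that run the whole flow unattended
Integration with external APIs, databases, and third-party services
Built-in error handling and automatic retries for stable operation
Reusable automation templates for fast deployment in similar scenarios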

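🎯 Main use cases

Automate repetitive day-to-day work and free up attention for creative tasks
Build complete automated pipelines: data collection → processing → output
Move data and coordinate business processes across platforms and systems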

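The install commands below were generated automatically from the project's language and type; the official README takes precedence.

Install commands

# Clone the repository
git clone https://github.com/benchflow-ai/awesome-evals
cd awesome-evals

# Read the installation instructions
cat README.md

# Once the dependencies are installed as described in the README, it is ready to use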

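📋 Installation steps

Get the workflow file from the GitHub repository
In the target platform (Dify / Flowise / Make, etc.), find the "Import workflow" feature
Upload the workflow file
Configure the required environment variables and API key as prompted
Run a test to confirm the flow works, then put it into use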

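The usage examples below were compiled by AI Skill Hub and cover the most common scenarios.

Common commands / code examples

# Show help
awesome-evals --help

# Basic run
awesome-evals [options] <input>

# For detailed usage, see the documentation
# https://github.com/benchflow-ai/awesome-evals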

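The configuration example below was generated from typical usage scenarios; adjust specific parameters according to the official documentation.

Configuration example

# awesome-evals configuration
# Show configuration options
awesome-evals --config-example > config.yml

# Common settings
# output_dir: ./output
# log_level: info
# workers: 4

# Environment variable (overrides the config file)
export AWESOME_EVALS_CONFIG="/path/to/config.yml"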

📑 README deep dive · Documentation completeness 25/100

The following content was parsed by the system directly from the GitHub README, preserving code blocks, tables, and list structure.

Awesome Agent Evals

A curated, opinionated, non-BS library of the best resources for building and evaluating AI agents — papers, blog posts, talks, courses, tools, and benchmarks.

Maintained by BenchFlow · "Environments are the new data."

Most "awesome" lists are link dumps. This one is annotated and verified: every entry says what it is and why it belongs, URLs are checked, quotes are verbatim, and dead/abandoned tools are pruned (not silently listed). It was assembled by:

a depth-4 recursive citation crawl (11.6k papers, ranked by in-degree) to surface the academic canon,
targeted practitioner-web discovery for the industry sources citation graphs miss (Eugene Yan, Han-Chung Lee, Hamel Husain, Shreya Shankar, Nathan Lambert, …),
47 talks & podcasts transcribed and deep-noted (verbatim + timestamps), and
per-section gap audits with adversarial verification.

443+ curated links · 146 deep reading notes (see notes/). Markers: 🆕 = released/updated 2025–2026 · ⚠️ = caveat. Contributions welcome — see CONTRIBUTING.

📘 Playbook: PATTERNS.md — real, runnable code + worked examples for LLM-as-judge (aligned to humans), pass@k/pass^k, error analysis, trajectory & world-state grading, CI gating, verifiable rewards, and more.
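As a quick illustration of the pass@k / pass^k pair named above, here is a minimal sketch. It is not taken from PATTERNS.md; it uses the standard combinatorial estimators, and the function names `pass_at_k` and `pass_hat_k` are illustrative. Given `n` independent trials of one task with `c` successes, pass@k estimates the chance that at least one of `k` sampled trials succeeds, while pass^k estimates the chance that all `k` succeed.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials passes) from c passes in n trials."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets that contain no passing trial.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials pass) from c passes in n trials."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made up entirely of passing trials.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    n, c = 10, 3
    for k in (1, 2):
        print(k, round(pass_at_k(n, c, k), 4), round(pass_hat_k(n, c, k), 4))
```

With 3 passes in 10 trials, pass@2 is about 0.53 while pass^2 is about 0.07: the same agent looks capable under one metric and unreliable under the other, which is why the two are usually reported together.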

2 · "If you can eval it, you have built it" — eval ⇄ capability ⇄ RL environment

Asymmetry of Verification and Verifier's Law — Jason Wei — <https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law> · blog — Trainability tracks verifiability; verifying = creating an RL environment.
A Taxonomy of RL Environments for LLM Agents — Han-Chung Lee — <https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/> · blog — A benchmark is a frozen RL environment; the E = {T,H,V,S,C} decomposition; "verifiable beats judgeable."
The Life Cycle of an RL Environment — Kanav Garg (Core Automation; ex-DeepMind) — talk; summary at <https://muratbuffalo.blogspot.com/2026/06/acm-cais-conference-on-ai-and-agentic.html> · talk — Difficulty calibration (the 1–4/16 Goldilocks band), RL as variance reduction, reward hacking under training pressure. (local notes: research/notes/kanav-garg-rl-environment-lifecycle.md)
Welcome to the Era of Experience — David Silver & Richard Sutton — <https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf> · paper — Human-data value approaching its ceiling; the frontier is agents learning from experience / synthetic environments.
RLHF Book, Ch. 16 — Evaluation — Nathan Lambert — <https://rlhfbook.com/c/16-evaluation> · book — Evaluation as a reflection of training goals; prompt-format sensitivity (60%→~0%).
What Comes Next with Reinforcement Learning — Nathan Lambert — <https://www.interconnects.ai/p/what-comes-next-with-reinforcement> · blog — Long-horizon credit assignment; where RL is and isn't ready.
verifiers — Prime Intellect — <https://github.com/PrimeIntellect-ai/verifiers> (docs: .../blob/main/docs/environments.md) · tool/repo — One environment package shared by eval and prime-rl — the eval-is-an-RL-env thesis as code.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI (Guo et al.) — <https://arxiv.org/abs/2501.12948> · paper — The proof-of-thesis: pure RL with rule-based verifiable rewards (no SFT) makes reasoning emerge — the canonical 'if you can verify it, RL builds it' result; also published in Nature 2025. 🆕
Tülu 3: Pushing Frontiers in Open Language Model Post-Training — Lambert et al. (Allen Institute for AI) — <https://arxiv.org/abs/2411.15124> · paper — Coined/popularized RLVR and open-sourced the recipe + code (open-instruct): swap the reward model for a verifier on tasks with checkable answers. The foundational citation behind every 'verifiable beats judgeable' claim in this section. 🆕
Natural Emergent Misalignment from Reward Hacking in Production RL — Anthropic — <https://www.anthropic.com/research/emergent-misalignment-reward-hacking> · paper — Empirical receipt for the section's 'reward hacking under training pressure' theme: learning to cheat on real coding environments generalizes to sabotage/alignment-faking; introduces inoculation prompting as mitigation (arXiv 2511.18397). 🆕
Environments Hub: A Community Hub To Scale RL To Open AGI — Prime Intellect — <https://www.primeintellect.ai/blog/environments> · blog — The launch post for the verifiers-spec marketplace (2,500+ shared eval/RL environments) — the eval-is-an-RL-env thesis as an actual ecosystem, the natural companion to the already-listed verifiers repo. 🆕
How to fully automate software engineering — Ege Erdil, Matthew Barnett, Tamay Besiroglu (Mechanize) — <https://www.mechanize.work/blog/how-to-fully-automate-software-engineering/> · blog — Sharpest statement of the inverse thesis: today's RL environments are rudimentary, so capability is gated on building richer/more diverse environments — 'you only get the capability you can build an environment for.' 🆕
Cheap RL tasks will waste compute — Mechanize (Erdil, Barnett, Besiroglu) — <https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute/> · blog — The economics of environment quality: data and compute are complementary, so low-quality (cheaply-bought) tasks waste expensive RL compute — directly informs difficulty calibration / why environment design matters. 🆕
An FAQ on Reinforcement Learning Environments — Jean-Stanislas Denain & Chris Barber (Epoch AI) — <https://epoch.ai/gradient-updates/state-of-rl-envs> · blog — Practitioner-interview survey (18 pros) on how RL environments are actually built, the reward-hacking failure modes, and the production-scaling bottleneck: an empirical state-of-the-field map. 🆕
RL Environments and RL for Science: Data Foundries and Multi-Agent Architectures — AJ Kourabi & Dylan Patel (SemiAnalysis) — <https://newsletter.semianalysis.com/p/rl-environments-and-rl-for-science> · newsletter — Market-structure view: 35+ companies now sell RL environments; capability gains are coming from ramping RL compute, not pretraining. Grounds the 'benchmark = frozen RL environment' thesis in who's actually building/buying them. 🆕
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces — Harbor / Stanford / Laude Institute — <https://github.com/harbor-framework/terminal-bench> · benchmark — A concrete instance of the thesis: each task ships a Docker environment + programmatic verification test suite + oracle — i.e. a benchmark that IS an RL environment (and is used as one). 2.4k stars, active. 🆕
tau2-bench (τ²-Bench): A Benchmark for Tool-Agent-User Interaction in Real-World Domains — Sierra Research (Barres et al.) — <https://github.com/sierra-research/tau2-bench> · benchmark — Dual-control, multi-turn, policy-following eval with a simulated user and verifiable DB-state checks — the canonical example of a verifiable conversational/agentic environment beyond math/code (paper arXiv 2506.07982). 🆕

Must-reads: Wei · Lee (RL-env taxonomy)
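To make "rule-based verifiable rewards" concrete, here is a minimal sketch of a verifier for tasks with a checkable final answer. It is an illustration only, not the reward code of any project listed above. The `\boxed{...}` answer convention, the normalization rules, and the function names are all assumptions.

```python
import re
from typing import Optional

# Assumed convention: the model wraps its final answer in \boxed{...}.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the content of the last boxed answer, or None if there is none."""
    matches = _BOXED.findall(completion)
    return matches[-1] if matches else None


def normalize(answer: str) -> str:
    """Light normalization so '1,000' and ' 1000 ' compare equal."""
    return answer.strip().replace(",", "").lower()


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parseable answer counts as a failure
    return 1.0 if normalize(answer) == normalize(reference) else 0.0


if __name__ == "__main__":
    print(verifiable_reward("So the total is \\boxed{1,000}.", "1000"))
    print(verifiable_reward("I believe the answer is 1000.", "1000"))
```

The same function can score an eval run or supply the reward signal during training, which is the sense in which verifying a task amounts to creating an RL environment for it.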

5e · RL-environment / verifiable-reward toolkits (eval ⇄ training)

verifiers — Prime Intellect — <https://github.com/PrimeIntellect-ai/verifiers> — Environment = dataset + harness + rubric; one package for eval, RL, synthetic data. (MUST)
Environments Hub — Prime Intellect — <https://github.com/PrimeIntellect-ai/community-environments> (app.primeintellect.ai) — 🆕 crowdsourced verifiers-based RL/eval envs.
prime-rl — Prime Intellect — <https://github.com/PrimeIntellect-ai/prime-rl> — 🆕 async RL trainer consuming verifiers envs (INTELLECT-3).
BenchFlow — <https://github.com/benchflow-ai/benchflow> · <https://benchflow.ai> — 🆕 environment lab: builds & runs RL/eval environments (SkillsBench, ClawsBench, runtime). "Environments are the new data." (also §5a)
HUD — <https://github.com/hud-evals/hud-python> — 🆕 SDK to build/run agent eval environments (computer-use, browser, MCP) with telemetry.
Atropos — Nous Research — <https://github.com/NousResearch/atropos> — 🆕 async "environment microservice" framework for rollouts/verifiable rewards.
verl — <https://github.com/volcengine/verl> (now verl-project/verl) — de-facto industry RLVR trainer (PPO/GRPO). ~22k★.
OpenRLHF — <https://github.com/OpenRLHF/OpenRLHF> · SkyRL — <https://github.com/NovaSky-AI/SkyRL> · AReaL — <https://github.com/areal-project/AReaL> · ROLL — <https://github.com/alibaba/ROLL> · rLLM — <https://github.com/agentica-project/rllm> · TRL — <https://github.com/huggingface/trl> — the RL-training stack agents are post-trained + eval'd in.
Open Reward Standard (ORS) — General Reasoning — <https://docs.openreward.ai/> (PyPI openreward) — 🆕 MCP-extending spec adding RL primitives (episodes, rewards, curriculum). ⚠️ no single canonical repo confirmed.
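The "environment = dataset + harness + rubric" framing can be sketched in a few lines. This is a generic illustration and deliberately does not use the verifiers API or any other toolkit listed here; the `Environment` class, its methods, and the `Task`, `Policy`, and `Rubric` aliases are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Task = Dict[str, str]                  # e.g. {"prompt": ..., "answer": ...}
Policy = Callable[[str], str]          # hypothetical: maps a prompt to a completion
Rubric = Callable[[Task, str], float]  # scores one completion for one task


@dataclass
class Environment:
    dataset: List[Task]
    rubric: Rubric

    def rollout(self, policy: Policy, task: Task) -> float:
        """Harness: run the policy on one task and score the result."""
        completion = policy(task["prompt"])
        return self.rubric(task, completion)

    def rewards(self, policy: Policy) -> List[float]:
        """Training view: per-task rewards a trainer could consume."""
        return [self.rollout(policy, task) for task in self.dataset]

    def evaluate(self, policy: Policy) -> float:
        """Eval view: mean reward over the same dataset."""
        scores = self.rewards(policy)
        return sum(scores) / len(scores)


def exact_match(task: Task, completion: str) -> float:
    return 1.0 if completion.strip() == task["answer"] else 0.0


if __name__ == "__main__":
    env = Environment(
        dataset=[{"prompt": "2+2", "answer": "4"}, {"prompt": "3+3", "answer": "6"}],
        rubric=exact_match,
    )
    always_four: Policy = lambda prompt: "4"
    print(env.rewards(always_four), env.evaluate(always_four))
```

Both `rewards` and `evaluate` go through the same `rollout` path, so the eval score and the training signal cannot drift apart. That is one plausible reason to share a single environment package between evaluation and RL.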

7 · Evals & RL environments (verifiers, reward design, difficulty calibration, lifecycle)

(See also T2 — verifiers library, Lee's RL-env taxonomy, Garg's lifecycle, Wei's verifier's law.)

RewardBench — Nathan Lambert et al. — <https://arxiv.org/abs/2403.13787> · paper — Evaluating reward models (the verifier you train against).
The New RL Scaling Laws — Nathan Lambert — <https://www.interconnects.ai/p/the-new-rl-scaling-laws> · blog — Where RLVR scaling is heading. (interview: <https://www.latent.space/p/the-rlvr-revolution-with-nathan-lambert>)
Spurious Rewards: Rethinking Training Signals in RLVR — <https://arxiv.org/abs/2506.10947> · paper — Random/spurious rewards rival ground truth on Qwen2.5 (Qwen-specific). (cite arXiv figures, not the blog gloss — see research/notes/reference-audit.md)
The State of Post-Training 2025 — Nathan Lambert — <https://www.interconnects.ai/p/the-state-of-post-training-2025> · blog — Context for where evals feed training.

Reward Hacking in Reinforcement Learning — Lilian Weng — <https://lilianweng.github.io/posts/2024-11-28-reward-hacking/> · blog — The canonical survey of reward hacking — taxonomy, RLHF-specific failure modes, mitigations; the foundational reference any reward-design section needs.
Specification gaming: the flip side of AI ingenuity — Victoria Krakovna et al. (Google DeepMind) — <https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/> · blog — Canonical specification-gaming post (+the running examples list); origin story of why verifiers/reward functions get gamed, predating the LLM-RL wave.
Multi-Turn RL for Multi-Hour Agents — with Will Brown (Prime Intellect) — Latent Space / Will Brown — <https://www.latent.space/p/willccbb> · talk — The verifiers author on building multi-turn RL environments, turn-level credit assignment and reward design in practice — the practitioner voice behind the verifiers library already cited here. 🆕
Position: The Hidden Costs and Measurement Gaps of RLVR — various (arXiv 2509.21882) — <https://arxiv.org/abs/2509.21882> · paper — RLVR gains overstated via budget mismatch, calibration drift, contamination; proposes a tax-aware minimum standard — the rigor counterweight to Lambert's RL-scaling optimism. 🆕
RewardBench 2: Advancing Reward Model Evaluation — Saumya Malik, Nathan Lambert et al. (Ai2) — <https://arxiv.org/abs/2506.01937> · benchmark — The 2025 successor to RewardBench (already listed) — harder, less saturated, ICLR 2026; the current bar for evaluating the verifier you train against. 🆕
Reward Modeling (RLHF Book, ch. 5) — Nathan Lambert — <https://rlhfbook.com/c/05-reward-models> · docs — Canonical free reference chapter on reward models — the standing explainer for the 'verifier you train against' framing this section uses. 🆕
Curriculum RL from Easy to Hard Tasks Improves LLM Reasoning (E2H Reasoner) — Shubham Parashar et al. (Texas A&M) — <https://arxiv.org/abs/2506.06632> · paper — Difficulty-calibration primary source: easy-to-hard scheduling with convergence guarantees and the 'fade out easy tasks' result — directly fills the section's difficulty-calibration theme. 🆕
GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators — Jiacheng Guo, Ling Yang, Mengdi Wang et al. (Princeton) — <https://arxiv.org/abs/2512.19682> · paper — Generative environment simulator with an alpha-Curriculum Reward that keeps tasks in the zone of proximal development — recent take on auto-calibrating env difficulty to the agent. 🆕

Must-reads: Lee (RL-env taxonomy) · Garg (lifecycle) · verifiers (repo)
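Difficulty calibration can be sketched as a filter over rollout results. The example below reads the "1–4/16 Goldilocks band" mentioned earlier as "keep tasks the current policy solves in 1 to 4 of 16 attempts"; that reading, the band edges as defaults, and the function name are assumptions rather than a quoted procedure.

```python
from fractions import Fraction
from typing import Dict, List

# Assumed band: a pass rate between 1/16 and 4/16 inclusive.
LOW = Fraction(1, 16)
HIGH = Fraction(4, 16)


def select_calibrated(
    results: Dict[str, List[bool]],
    low: Fraction = LOW,
    high: Fraction = HIGH,
) -> List[str]:
    """Return ids of tasks whose observed pass rate falls inside [low, high]."""
    keep = []
    for task_id, outcomes in results.items():
        if not outcomes:
            continue  # no rollouts yet, so the difficulty is unknown
        rate = Fraction(sum(outcomes), len(outcomes))
        if low <= rate <= high:
            keep.append(task_id)
    return keep


if __name__ == "__main__":
    results = {
        "too_easy": [True] * 16,
        "too_hard": [False] * 16,
        "just_right": [True] * 3 + [False] * 13,
    }
    print(select_calibrated(results))
```

Tasks that always pass or always fail give the policy no contrast between good and bad rollouts, so dropping them concentrates compute on tasks that can still teach something.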

6 · Benchmark vs. eval (and benchmark integrity: contamination, saturation, label errors, leaderboard gaming)

How to Build Good Language Modeling Benchmarks — Ofir Press — <https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/> · blog — The benchmark-author's checklist; difficulty target; one-number reporting; 150–500 task sizing.
AI Agents That Matter — Kapoor et al. — <https://arxiv.org/abs/2407.01502> · paper — Cost-controlled evaluation; model-dev vs downstream-dev needs; holdouts.
Why We No Longer Evaluate SWE-bench Verified — OpenAI — <https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/> · blog — ~59% of audited failures were broken tests. (mirror: <https://decrypt.co/359012/...>)
The Leaderboard Illusion — Shivalika Singh et al. (Cohere/Princeton/Stanford/MIT/AI2) — <https://arxiv.org/abs/2504.20879> · paper — Private testing, selective disclosure, and data-access asymmetry on Chatbot Arena. (notes: research/notes/leaderboard-illusion.md)
The SWE-bench Illusion: When SOTA LLMs Remember Instead of Reason — <https://arxiv.org/abs/2506.12286> · paper — Memorization inflates SWE-bench scores.
Establishing Best Practices for Building Rigorous Agentic Benchmarks (ABC) — <https://arxiv.org/abs/2507.02825> · paper — SWE-bench Verified weak tests; τ-bench rewards empty responses. (verified high)
FrontierMath Tiers 1–3 v2 (corrected) — Epoch AI — <https://epoch.ai/benchmarks/frontiermath-tiers-1-3-v2> (changelog: .../frontiermath-tier-4-v2) · page — ~42% of problems corrected after AI-assisted review. (also T8: the operator-as-rot-detector tale)
About 30% of Humanity's Last Exam Answers Are Wrong — FutureHouse / Andrew White — <https://www.futurehouse.org/research-announcements/hle-exam> · blog — 29 ± 3.7% of text-only chem/bio answers contradicted by the literature. (LessWrong writeup: <https://www.lesswrong.com/posts/JANqfGrMyBgcKtGgK/>)
Building on Evaluation Quicksand — Nathan Lambert — <https://www.interconnects.ai/p/building-on-evaluation-quicksand> · blog — No hard source of truth; synthetic-data contamination.
Lost in Simulation — <https://arxiv.org/abs/2601.17087> · paper — Simulated users are unreliable proxies (~9pp swings by simulator choice; demographic miscalibration).
SWE-bench: Can LMs Resolve Real-World GitHub Issues? — Jimenez, Yang, … Press, Narasimhan — <https://arxiv.org/abs/2310.06770> · <https://www.swebench.com> (Verified: .../verified.html) · paper/site.
Task-Specific LLM Evals that Do & Don't Work — Eugene Yan — <https://eugeneyan.com/writing/evals/> · blog — Off-the-shelf evals rarely transfer; accuracy is too coarse.
Andrej Karpathy on evals — <https://x.com/karpathy/status/1896266683301659068> · post — "We make a number of specific recommendations…" (the eval-as-narrow critique).

A Careful Examination of LLM Performance on Grade School Arithmetic (GSM1k) — Hugh Zhang et al. (Scale AI) — <https://arxiv.org/abs/2405.00332> · paper — Held-out GSM1k replica of GSM8k exposes up to 8% accuracy drop and partial memorization (Mistral/Phi) — the canonical method for measuring benchmark overfitting/contamination via a matched holdout.
Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks — Curtis Northcutt, Anish Athalye, Jonas Mueller — <https://arxiv.org/abs/2103.14749> · paper — NeurIPS 2021 foundational result: ~3.3% avg label errors across 10 famous test sets (ImageNet, MNIST, etc.); corrections flip model rankings. The canonical 'label errors' citation this section's theme rests on (labelerrors.com / cleanlab).
Are We Done with MMLU? (MMLU-Redux) — Aryo Pradipta Gema et al. (Edinburgh) — <https://arxiv.org/abs/2406.04127> · paper — ~6.5% of MMLU questions contain errors (57% in Virology); MMLU-Redux re-annotation shifts rankings — directly demonstrates label-error impact on the most-cited LLM benchmark.
LiveCodeBench: Holistic and Contamination-Free Evaluation of LLMs for Code — Naman Jain et al. (UC Berkeley) — <https://arxiv.org/abs/2403.07974> · benchmark — Time-windowed problem collection (post-cutoff scoring) as the leading contamination-resistant design pattern: an exemplar of how to engineer around contamination.
LiveBench: A Challenging, Contamination-Limited LLM Benchmark — White, Dohan, LeCun, Goldblum et al. — <https://github.com/LiveBench/LiveBench> · benchmark — Monthly-refreshed questions from new arXiv/news/competitions with objective ground truth — the canonical 'dynamic refresh' answer to saturation and contamination.
The LLM Evaluation Guidebook (Open LLM Leaderboard team) — Clémentine Fourrier / Hugging Face — <https://github.com/huggingface/evaluation-guidebook> · docs — Practitioner reference from running the Open LLM Leaderboard; explicit sections on contamination, reproducibility, and leaderboard design — the hands-on 'how to not get fooled' companion to this section (updated version: hf.co/spaces/OpenEvals/evaluation-guidebook).
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation — Kapoor, Stroebl, Kirgis et al. (Princeton) — <https://arxiv.org/abs/2510.11977> · paper — 21,000+ standardized agent runs surfacing leaderboard unreliability and unreported misbehaviors (agents searching HuggingFace for benchmark answers) — extends 'AI Agents That Matter' to leaderboard integrity for agents specifically. 🆕
Gaming the System: Goodhart's Law Exemplified in the AI Leaderboard Controversy — Jambholkar, Rajani, Bakshi (Collinear AI) — <https://blog.collinear.ai/p/gaming-the-system-goodharts-law-exemplified-in-ai-leaderboard-controversy> · blog — Practitioner framing of the Llama 4 / Chatbot Arena gaming episode through Goodhart's Law — the accessible blog companion to The Leaderboard Illusion paper. 🆕
A Shared Playbook for Trustworthy Third-Party Evaluations — OpenAI — <https://openai.com/index/trustworthy-third-party-evaluations-foundations/> · blog (Safety, May 29 2026) — What makes independent evals of frontier-model safeguards & capabilities trustworthy: selecting the right harness, checking for validity hazards that distort results, and the standards third-party evaluators need. (also T10) 🆕

Must-reads: Press · Kapoor et al. · OpenAI (SWE-bench Verified) · Leaderboard Illusion
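The time-windowed, post-cutoff scoring pattern described for LiveCodeBench can be sketched as a date filter. This is not that benchmark's code; the `Problem` record, its fields, and the cutoff date are hypothetical, and it assumes each problem carries a reliable release date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date  # when the problem first became public


def post_cutoff(problems: List[Problem], model_cutoff: date) -> List[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > model_cutoff]


if __name__ == "__main__":
    pool = [
        Problem("old-1", date(2023, 5, 1)),
        Problem("new-1", date(2024, 9, 15)),
    ]
    # Hypothetical cutoff for the model under test.
    scored = post_cutoff(pool, model_cutoff=date(2024, 1, 1))
    print([p.problem_id for p in scored])
```

Scoring each model only on problems newer than its own cutoff means different models may be graded on different subsets, so reported numbers should state the window that was used.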

8 · LLM-as-judge & verifiers (alignment, biases, verifiable vs judgeable)

Evaluating the Effectiveness of LLM-Evaluators — Eugene Yan — <https://eugeneyan.com/writing/llm-evaluators/> · blog — Position/verbosity/self-enhancement bias; direct vs pairwise; prefer binary + classification metrics.
Creating an LLM-as-a-Judge That Drives Business Results — Hamel Husain — <https://hamel.dev/blog/posts/llm-judge/> · blog — Critique-shadowing; validate against ONE benevolent-dictator expert; precision/recall over raw agreement.
Who Validates the Validators? (EvalGen) — Shankar et al. (UIST '24) — <https://arxiv.org/abs/2404.12272> (pdf: .../pdf/2404.12272; UIST: <https://people.eecs.berkeley.edu/~bjoern/papers/shankar-validators-uist2024.pdf>) · paper — Criteria drift; the coverage-vs-false-failure judge-alignment loop.
LLM Evals FAQ — Hamel Husain & Shreya Shankar — <https://hamel.dev/blog/posts/evals-faq/> (error-analysis section: .../why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html) · blog — Binary over Likert; review ≥100 traces; the first-failure transition matrix for agents.
LLM-as-a-Judge: Rethinking Model-Based Evaluations — Han-Chung Lee — <https://leehanchung.github.io/blogs/2024/08/11/llm-as-a-judge/> · blog — Avoid [0,1] continuous scales; manage judges like junior annotators.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al. — <https://arxiv.org/abs/2306.05685> · paper — Source of the 10%/25% self-favoring & position-bias numbers — which the authors themselves hedge ("cannot determine"); GPT-3.5 doesn't self-favor.
LLMs Instead of Human Judges? A Large-Scale Study — Bavaresco et al. — <https://arxiv.org/abs/2406.