⚙️
Agent Workflow

ClawBench

Built with Python · Build complete AI automation flows without code
⭐ 319 Stars 🍴 20 Forks 💻 Python 📄 Apache-2.0 🏷 AI score 7.5
Topics: workflow · agent-evaluation · agentic-ai · ai-agent-benchmark · ai-agents · benchmark
✦ Recommended by AI Skill Hub

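AI Skill Hub recommends ClawBench as a solid agent workflow tool. Its AI composite score of 7.5 is a steady result among comparable tools. If you are looking for a dependable agent workflow solution, it is worth a closer look.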

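📚 In-Depth Analysis
ClawBench is a complete AI agent automation workflow solution. As AI capabilities keep improving, agent-based automated workflows are becoming a core way to raise individual and team productivity. Unlike traditional RPA automation, which simulates mouse and keyboard input, AI agent workflows interpret the intent of a task and plan the execution path dynamically, so they can handle more complex, unstructured tasks.

ClawBench workflows follow a "minimal configuration, maximum reuse" principle: the core logic is already packaged, so users only need to supply their own API key and business parameters to get started. The workflow has built-in error handling and retry logic, so it keeps running reliably through network instability or API rate limiting, which makes it suitable as automation infrastructure in production.

For real deployments, run it 3 to 5 times in a test environment first and confirm that each stage produces the expected output before moving to production. AI Skill Hub rates it 7.5, a curated pick among comparable agent workflows.
📋 Tool Overview

ClawBench is a complete AI agent automation workflow solution. Through visual node orchestration, it breaks complex multi-step tasks into clear automated flows that run fully unattended. It integrates with hundreds of external services and APIs, and suits data-processing pipelines, business automation, and AI-assisted decision systems.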

GitHub Stars: ⭐ 319
Language: Python
Supported platforms: Windows / macOS / Linux
Maintenance status: Lightweight project, updated as needed
License: Apache-2.0
AI composite score: 7.5
Tool type: Agent Workflow
Forks: 20
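📖 Documentation
The content below was compiled automatically by AI Skill Hub from project information. For the complete original documentation, see "Original Source" at the bottom.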

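📌 Key Features
  • Visual agent workflow orchestration, with no complex code to write
  • Multi-step automated task chains that run fully unattended
  • Integration with external APIs, databases, and third-party services
  • Built-in error handling and automatic retries for stable operation
  • Reusable automation templates for quick deployment in similar scenarios
🎯 Main Use Cases
  • Automating routine repetitive work so effort can go to creative tasks
  • Building complete data collection → processing → output pipelines
  • Moving data and coordinating business processes across platforms and systems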
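The install commands below were generated automatically from the project's language and type; the official README is authoritative.
Install commands
# Option 1: install with pip (recommended)
pip install clawbench-eval

# Option 2: install into a virtual environment (recommended for production)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install clawbench-eval

# Option 3: install from source (latest features)
git clone https://github.com/TIGER-AI-Lab/ClawBench
cd ClawBench
pip install -e .

# Verify the installation
python -c "import clawbench; print('Installed successfully')"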
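📋 Installation Steps
  1. Visit the GitHub repository to obtain the workflow files
  2. In the target platform (Dify / Flowise / Make, etc.), find the "Import workflow" feature
  3. Upload the workflow file
  4. Follow the prompts to configure the required environment variables and API key
  5. Run a test to confirm the flow works, then put it into use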
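The usage examples below were compiled by AI Skill Hub and cover the most common scenarios.
Common commands / code examples
# Command-line usage
clawbench --help

# Basic usage: run one task against one model (form taken from the README)
clawbench-run test-cases/v1/001-daily-life-food-uber-eats claude-sonnet-4-6

# Calling from Python code
import clawbench

# Example (placeholder call; the README documents only the CLI entry points)
result = clawbench.process("input")
print(result)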
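The configuration example below was generated from typical usage; adjust the parameters according to the official documentation.
Configuration example
# Example clawbench configuration file (config.yml)
app:
  name: "clawbench"
  debug: false
  log_level: "INFO"

# Specify the configuration file at run time
clawbench --config config.yml

# Or configure through environment variables
export CLAWBENCH_API_KEY="your-key"
export CLAWBENCH_OUTPUT_DIR="./output"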
📑 README Deep Dive · Source documentation · Completeness 62/100
The content below was parsed directly from the GitHub README, preserving code blocks, tables, and list structure.

Introduction

Featured in: [awesome-harness-engineering](https://github.com/walkinglabs/awesome-harness-engineering), [Awesome-AI-Agents](https://github.com/Jenqyang/Awesome-AI-Agents), [awesome-computer-use](https://github.com/ranpox/awesome-computer-use), [Awesome-GUI-Agents](https://github.com/ZJU-REAL/Awesome-GUI-Agents), [LLM-Agent-Benchmark-List](https://github.com/zhangxjohn/LLM-Agent-Benchmark-List)

[#3 Paper of the Day](https://huggingface.co/papers/2604.08523) on Hugging Face

[Ask DeepWiki](https://deepwiki.com/reacher-z/ClawBench)

**New:** Check out our sister project [**HarnessBench**](https://github.com/reacher-z/HarnessBench), which fixes the base model and varies the harness. Same scoring pipeline, orthogonal axis.

git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench && ./run.sh

Clone → configure → run. Root uv package. Docker-isolated harnesses.

Demos

Each ClawBench run produces a full MP4 session recording. See the project page for V1 task recordings.

LLM Quick Start

Point your coding agent (Claude Code, Cursor, Copilot, etc.) at `AGENTS.md` and prompt away.

Human Quick Start

Install ClawBench from PyPI for normal use:

```bash
uv tool install clawbench-eval
```

You can also use `pipx install clawbench-eval` or `python -m pip install clawbench-eval`. The installed commands are still `clawbench`, `clawbench-run`, and `clawbench-batch`.

For those who want more granular control or plan to contribute, clone the repo and run the root uv package entrypoint:

```bash
git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench && ./run.sh
```

Prerequisites: Python 3.11+, uv, and a container engine (Docker or Podman). ClawBench auto-detects whichever is installed; force one with `export CONTAINER_ENGINE=docker` or `export CONTAINER_ENGINE=podman`.

<details> <summary><b>Install Docker or Podman</b> (macOS / Linux / Windows)</summary>

#### macOS

```bash
# Option A: Docker Desktop (easiest, includes GUI)
brew install --cask docker
open -a Docker          # launch and wait for the whale icon to settle

# Option B: Podman (rootless, no daemon, CLI only)
brew install podman
podman machine init     # one-time: downloads the Linux VM image
podman machine start    # must be running before any podman command
```

> **macOS Podman needs a VM.** `brew install podman` alone is not enough: Podman on macOS runs containers inside a small Linux VM, so you must `podman machine init && podman machine start` once after install or `podman info` will fail with `Cannot connect to Podman`.

#### Linux (Ubuntu / Debian)

```bash
# Option B: Docker
sudo apt install -y docker.io
sudo usermod -aG docker $USER   # log out / back in so your shell picks up the group
```

> **Rootful Docker ownership note:** with classic `sudo`-docker, files extracted from containers land owned by `root` on the host. ClawBench's driver detects this after each run and chowns `test-output/` back to your user automatically, but if you run other container tooling alongside, rootless Podman (or rootless Docker) avoids the issue entirely.

#### Windows

```powershell
# Option A: Docker Desktop (WSL2 backend)
winget install Docker.DockerDesktop
# then launch Docker Desktop from the Start menu and wait for it to be ready

# Option B: Podman
winget install RedHat.Podman
podman machine init
podman machine start
```

> Run the `uv run …` commands below from **PowerShell**, **WSL2**, or **Git Bash**. Like macOS, Windows Podman requires `podman machine init && podman machine start` before its first use.

</details>

**1. Configure models** (one-time setup).

If you installed from PyPI, run `clawbench` from the directory where you want results and editable config to live. On first launch it creates local templates under `models/`; use the TUI to add a model or edit the file directly:

```bash
clawbench
$EDITOR models/models.yaml
```

If you are working from a source checkout:

```bash
cp models/models.example.yaml models/models.yaml
$EDITOR models/models.yaml
```

PurelyMail credentials for disposable run emails are provided in the committed `.env`.
You only need to edit `.env` if you want to use your own PurelyMail account or enable optional HuggingFace upload.

> [!NOTE]
> **First run builds a container image** (Chromium + ffmpeg + noVNC + the selected agent harness dependencies). You'll see a live progress spinner with the current build step. Subsequent runs reuse the cached layers and finish in seconds.

**2. Run your first task** (pick one):

> [!TIP]
> **Recommended → Interactive TUI**: guided model + test case selection
>
> ```bash
> clawbench           # PyPI install
> uv run clawbench    # source checkout
> ```
> If installed from PyPI, run `clawbench` directly. Needs an interactive terminal.
> For pipes / CI / non-TTY, use `clawbench-run` or `clawbench-batch` directly;
> from a source checkout, prefix commands with `uv run`.

**(b) Run one specific task against a specific model:**

```bash
uv run clawbench-run test-cases/v1/001-daily-life-food-uber-eats claude-sonnet-4-6
```

Once the container starts, the script prints a **noVNC URL** (e.g. `http://localhost:6080/vnc.html`); open it in your browser to watch the agent operate in real time. If port 6080 is already in use, an alternative port is chosen automatically.

Results land in `./test-output/<model>/<harness>-<case>-<model>-<timestamp>/` with the full five-layer recording. The default harness is `openclaw`; pass `--harness <name>` to select another:

  • `opencode`: [opencode](https://opencode.ai)
  • `claude-code`: [Claude Code](https://docs.anthropic.com/en/docs/claude-code)
  • `claude-code-chrome-extension`: Claude Code + the [Claude in Chrome](https://code.claude.com/docs/en/chrome) extension (Microsoft Edge + local bridge, bypass stack so any LiteLLM-routed provider works)
  • `codex`: [OpenAI Codex CLI](https://github.com/openai/codex)
  • `claw-code`: [claw-code](https://github.com/ultraworkers/claw-code)
  • `browser-use`: [browser-use](https://github.com/browser-use/browser-use) (Python framework, routed via LiteLLM)
  • `hermes`: [Hermes Agent](https://github.com/NousResearch/hermes-agent) with native browser tools attached to ClawBench Chrome via CDP
  • `pi`: [Pi](https://pi.dev/) with pinned [pi-browser-harness](https://pi.dev/packages/pi-browser-harness) browser tools attached to the same ClawBench Chrome CDP endpoint

**(c) Drive the browser yourself via noVNC** (produces a human reference run):

```bash
uv run clawbench-run test-cases/v1/001-daily-life-food-uber-eats --human
```

Open the noVNC URL the script prints, complete the task by hand, then close the tab. Port is auto-assigned if 6080 is busy.

**(d) Pair with an external browser agent**: run in Human mode, open the noVNC URL, and let an external browser agent control that browser session while ClawBench records and intercepts it.

<details>
<summary><b>Develop from source</b>: clone + `./run.sh` for contributors</summary>

Prefer the repo checkout if you want to modify the driver, the bundled V1/V2 test cases, or the container build itself.

```bash
git clone https://github.com/reacher-z/ClawBench.git && cd ClawBench
cp models/models.example.yaml models/models.yaml   # edit: add your model API keys
# .env is already provided for PurelyMail; edit only for your own creds or HF upload
./run.sh                                           # interactive TUI
uv run clawbench-run \
  test-cases/v1/001-daily-life-food-uber-eats claude-sonnet-4-6   # single run
uv run clawbench-run \
  test-cases/v1/001-daily-life-food-uber-eats --human             # human mode
```

This path gives you live-reload on `src/`, `src/clawbench/runtime/chrome-extension/`, and all suites under `test-cases/`, which is useful when iterating on the harness itself.

</details>
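
The single-task invocations in the quick start share one shape: a test-case directory, then either a model name or `--human`, plus an optional `--harness`. As a rough Python sketch, that command line can be assembled programmatically. `build_run_command` and `HARNESSES` are illustrative names, not part of ClawBench, and placing `--harness` after the positional arguments is an assumption.

```python
import shlex

# Harness names as listed in the quick start; "openclaw" is the stated default.
HARNESSES = {
    "openclaw", "opencode", "claude-code", "claude-code-chrome-extension",
    "codex", "claw-code", "browser-use", "hermes", "pi",
}

def build_run_command(case_dir, model=None, harness="openclaw",
                      human=False, from_source=True):
    """Return the argv list for one clawbench-run call (hypothetical helper)."""
    if harness not in HARNESSES:
        raise ValueError(f"unknown harness: {harness}")
    if human == (model is not None):
        raise ValueError("pass exactly one of: a model name, or human=True")
    # From a source checkout the README prefixes commands with `uv run`.
    argv = ["uv", "run"] if from_source else []
    argv += ["clawbench-run", case_dir]
    argv += ["--human"] if human else [model]
    if harness != "openclaw":
        argv += ["--harness", harness]  # flag position is an assumption
    return argv

cmd = build_run_command(
    "test-cases/v1/001-daily-life-food-uber-eats", "claude-sonnet-4-6"
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the run; building a list rather than a string avoids shell quoting problems.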


How ClawBench compares

| Benchmark | Domain | Environment | Task count | ClawBench difference |
| --- | --- | --- | --- | --- |
| [WebArena](https://webarena.dev) | Synthetic web apps | Self-hosted replicas | 812 | Live consumer sites, not admin UIs on hosted replicas |
| [GAIA](https://huggingface.co/datasets/gaia-benchmark/GAIA) | General assistants | Closed-book text + tools | 466 | Browser-centric; end-to-end task execution |
| [SWE-bench](https://www.swebench.com) | Software engineering | GitHub repos | 2,294 | Non-code; everyday consumer workflows |
| [BrowserGym](https://github.com/ServiceNow/BrowserGym) | Web agents | Headless sandbox | | Cloud-parity; records real user journeys |
| [Mind2Web](https://github.com/OSU-NLP-Group/Mind2Web) | Web navigation | Static traces | 2,350 | Dynamic live websites, not replayed traces |
| [Online-Mind2Web](https://github.com/OSU-NLP-Group/Online-Mind2Web) | Live web navigation | Real websites | 300 | Comparable task count (V1+V2: 283 vs 300), with full 5-layer recordings |
| [VisualWebArena](https://jykoh.com/vwa) | Visual web tasks | Self-hosted (3 sites) | 910 | Real websites with full visual layer (vs 3 hosted apps) |
| [WebVoyager](https://github.com/MinorJerry/WebVoyager) | Real-website nav | Real websites (15) | 643 | Interception-graded vs LLM-judge-only, 144 sites covered |
| [TheAgentCompany](https://the-agent-company.com) | Office workflows | Self-hosted (6 platforms) | 175 | Consumer everyday tasks instead of enterprise sandbox |

ClawBench's niche: live consumer websites, everyday tasks, end-to-end recording. If you want a controlled sandbox or replayed traces, the projects above are excellent. If you want to know whether your agent can actually order food or book a flight today, this is the benchmark for that.

What are you looking for?

  • 🏆 See scores: Live leaderboard (pick a corpus: v1 / v2)
  • 🚀 Run it on your model: Quick start (`pip install clawbench-eval`)
  • 📊 Browse 283 tasks: Task explorer (search · filter · category)
  • 📄 Read the paper: arXiv:2604.08523 (methodology · evaluator · results)
  • 🎬 Re-grade old runs: V1 · V2 raw traces (5 layers per task × model)
  • 📦 Download the data: `hf download NAIL-Group/ClawBench` (tasks · rubrics · metadata)
  • 🌱 Add a task / model: How to contribute (YAML spec + rubric)
  • Have a question: FAQ · Discord (or open an issue)

Example Walkthrough

Curious what one task actually looks like, start to finish? Here's task 001 end to end.

The task — from test-cases/v1/001-daily-life-food-uber-eats/task.json:

{
  "instruction": "On Uber Eats, order delivery: one Pad Thai, deliver to home address, note \"no peanuts\"",
  "time_limit": 30,
  "eval_schema": {
    "url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__",
    "method": "POST"
  }
}

The agent gets this instruction verbatim, plus read-only access to /my-info/alex_green_personal_info.json (the dummy user's name, home address, phone, date of birth) and a disposable email account for any sign-in prompt. It has 30 minutes to reach a POST request — any longer and the container is killed.
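
To make the fields concrete, here is a small Python sketch that parses the task definition shown above and derives the time budget in seconds. The JSON is inlined so the snippet is self-contained. `load_task` is an illustrative helper, and reading `time_limit` as minutes follows the 30-minute description above rather than a documented unit.

```python
import json

# Task definition copied from the walkthrough (raw string keeps the \" escapes).
TASK_JSON = r'''
{
  "instruction": "On Uber Eats, order delivery: one Pad Thai, deliver to home address, note \"no peanuts\"",
  "time_limit": 30,
  "eval_schema": {
    "url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__",
    "method": "POST"
  }
}
'''

def load_task(text):
    """Flatten the fields a runner would need (hypothetical helper)."""
    task = json.loads(text)
    return {
        "instruction": task["instruction"],
        # Assumption: time_limit is expressed in minutes.
        "timeout_seconds": task["time_limit"] * 60,
        "url_pattern": task["eval_schema"]["url_pattern"],
        "method": task["eval_schema"]["method"],
    }

task = load_task(TASK_JSON)
print(task["method"], task["timeout_seconds"])
```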

What the agent does (the happy path):

  1. Navigates to ubereats.com
  2. Reads the dummy user's home address from /my-info/alex_green_personal_info.json and enters it in the delivery-address box
  3. Searches for "Pad Thai" in the food search
  4. Picks a restaurant that has Pad Thai available for delivery to that address
  5. Opens the item detail page, finds the customization or special-instructions field, enters "no peanuts"
  6. Adds one to cart, opens the cart, and handles any sign-in prompt using the disposable email credentials
  7. Reaches checkout, taps Place Order

What the interceptor catches — that final Place Order tap fires a POST request. ClawBench's request interceptor sits in front of the browser and captures the outbound request before it reaches Uber Eats's servers, so the dummy user is never actually charged. At the exact moment of interception, all five recording layers (MP4 video, PNG screenshots, HTTP traffic, browser actions, agent messages) are frozen into /data/.

How the judge decides PASS / FAIL — task 001's url_pattern is the intentional sentinel __PLACEHOLDER_WILL_NOT_MATCH__, which means no request path can mechanically match. The verdict comes from the agentic judge in eval/agentic_eval.md, which replays the five-layer recording against a human reference run and checks four things:

  • Did the agent actually reach the final checkout step?
  • Is the cart exactly one Pad Thai (not two, not a combo)?
  • Is the delivery address the user's home address from alex_green_personal_info.json?
  • Does the order carry the "no peanuts" note in the instructions field?

All four must hold for a PASS. Miss any one and it's a FAIL with evidence from the recording pinned to the failing criterion. This per-task rubric is what makes ClawBench judge-sensitive rather than URL-regex-sensitive — see eval/README.md for the full rubric format and eval/agentic_eval.md for the judge prompt.
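
The all-or-nothing rule can be written down in a few lines. This sketches only the aggregation step: the criterion keys are shorthand invented here, and in ClawBench the individual judgments come from the agentic judge, not from code like this.

```python
# Shorthand names for the four rubric questions above (illustrative only).
CRITERIA = (
    "reached_checkout",
    "cart_is_one_pad_thai",
    "address_is_home",
    "note_no_peanuts",
)

def verdict(checks):
    """Return ("PASS", []) if every criterion holds, else ("FAIL", failed_names)."""
    missing = [name for name in CRITERIA if name not in checks]
    if missing:
        raise KeyError(f"unjudged criteria: {missing}")
    failed = [name for name in CRITERIA if not checks[name]]
    return ("PASS" if not failed else "FAIL", failed)

all_ok = dict.fromkeys(CRITERIA, True)
print(verdict(all_ok))
print(verdict({**all_ok, "note_no_peanuts": False}))
```

Returning the failed names alongside the verdict mirrors the idea of pinning evidence to the failing criterion.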

FAQ

<details> <summary><b>What data does each run produce?</b></summary>

Each session records five layers of synchronized data under /data/:

| Layer | File | Description |
| --- | --- | --- |
| Session replay | `recording.mp4` | Full session video (H.264, 15fps) |
| Action screenshots | `screenshots/*.png` | Timestamped PNG per browser action |
| Browser actions | `actions.jsonl` | Every DOM event (click, keydown, input, pageLoad, scroll, etc.) |
| HTTP traffic | `requests.jsonl` | Every HTTP request with headers, body, and query params |
| Agent messages | `agent-messages.jsonl` | Full agent conversation transcript (thinking, text, tool calls) |

For the Pi harness, agent-messages.jsonl is filtered Pi JSON mode output, including message_start/message_end events, tool_execution_* events, tool-call content blocks, and thinking blocks when the selected model emits reasoning. Streaming message_update fragments, including *_delta rows, are omitted because complete assistant messages are already preserved in message_end events.

Harness diagnostic logs such as Pi's agent.log and proxy.log are not copied into the final data/ directory.

The interceptor result is saved to interception.json.
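
Because three of the layers are JSON Lines files, a finished run can be inspected with the Python standard library alone. The sketch below counts records per layer. `summarize_run` is a hypothetical helper; it assumes only the file names listed above and that each non-empty line is one JSON value, not any particular field layout.

```python
import json
from pathlib import Path

JSONL_LAYERS = ("actions.jsonl", "requests.jsonl", "agent-messages.jsonl")

def count_jsonl_records(path):
    """Count non-empty lines, parsing each one to confirm it is valid JSON."""
    count = 0
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                json.loads(line)  # raises ValueError on a corrupt record
                count += 1
    return count

def summarize_run(data_dir):
    """Summarize which recording layers exist in a run's data directory."""
    data_dir = Path(data_dir)
    summary = {}
    for name in JSONL_LAYERS:
        path = data_dir / name
        summary[name] = count_jsonl_records(path) if path.is_file() else None
    shots = data_dir / "screenshots"
    summary["screenshots"] = len(list(shots.glob("*.png"))) if shots.is_dir() else 0
    summary["has_video"] = (data_dir / "recording.mp4").is_file()
    return summary
```

A layer reported as `None` is missing entirely, which is different from a file that exists but holds zero records.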

</details>

<details> <summary><b>How does the request interceptor work?</b></summary>

The interceptor blocks critical, irreversible HTTP requests (checkout, form submit, email send) to prevent real-world side effects. It connects to Chrome via CDP's Fetch domain and matches requests against the eval schema (url_pattern regex + method + optional body/params). When triggered, it saves the blocked request to interception.json, kills the agent, and stops recording.

The interceptor does not validate task completion -- evaluation is handled separately by evaluators post-session.

For tasks behind payment walls (agent has no valid credit card), the eval schema uses a placeholder pattern that never matches, so the session runs until timeout.
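
The matching rule described above (URL regex plus HTTP method) can be illustrated offline. This is a hedged sketch, not ClawBench's interceptor: the real one hooks Chrome through CDP's Fetch domain, while `matches_schema`, the sample checkout schema, and the choice of `re.search` are assumptions made for illustration. Optional body/params matching is left out.

```python
import re

def matches_schema(url, method, schema):
    """True if a request satisfies an eval schema's method and URL regex."""
    if method.upper() != schema["method"].upper():
        return False
    return re.search(schema["url_pattern"], url) is not None

# Hypothetical schema for a shop whose order endpoint ends in /api/checkout.
checkout = {"url_pattern": r"/api/checkout$", "method": "POST"}
# Sentinel used by payment-walled tasks: no real URL contains this string.
placeholder = {"url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__", "method": "POST"}

url = "https://shop.example/api/checkout"
print(matches_schema(url, "POST", checkout))
print(matches_schema(url, "GET", checkout))
print(matches_schema(url, "POST", placeholder))
```

With the placeholder schema nothing ever matches, which is why such sessions run until the time limit instead of ending on interception.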

</details>

<details> <summary><b>What is the synthetic user profile?</b></summary>

Each container gets a /my-info/ directory with a dummy user identity (Alex Green): personal info JSON, email credentials, and a resume PDF. The email is a fresh disposable PurelyMail address generated per run. The agent reads these files when it needs to fill forms, register accounts, etc.

Source templates: src/clawbench/runtime/shared/alex_green_personal_info.json (profile) and src/clawbench/runner/run_support/resume_template.json (resume).

</details>

<details> <summary><b>Can I use Podman instead of Docker?</b></summary>

Yes. Set export CONTAINER_ENGINE=podman. The framework auto-detects whichever is available. Podman works without root privileges.
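
One way to picture the override-then-detect behavior is the Python sketch below. It is not ClawBench's code: `pick_container_engine` is a hypothetical helper, and probing `docker` before `podman` is an assumed order.

```python
import os
import shutil

SUPPORTED = ("docker", "podman")

def pick_container_engine(env=None, which=shutil.which):
    """Honor CONTAINER_ENGINE if set, else return the first engine on PATH."""
    env = os.environ if env is None else env
    forced = env.get("CONTAINER_ENGINE")
    if forced:
        if forced not in SUPPORTED:
            raise ValueError(f"unsupported CONTAINER_ENGINE: {forced}")
        return forced
    for engine in SUPPORTED:  # probe order is an assumption
        if which(engine):
            return engine
    raise RuntimeError("no container engine found; install Docker or Podman")
```

Accepting `env` and `which` as parameters keeps the function testable without touching the real environment.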

</details>

<details> <summary><b>What tools can the agent use?</b></summary>

All supported harnesses run inside the same container recording and interception environment. CLI/MCP harnesses expose the browser tool plus a restricted set of read-only shell commands (ls, cat, find, grep, head, tail, jq, wc, etc.); commands that could bypass the browser (curl, python, node, wget) are blocked. Hermes and Pi use native browser/file tools attached to the same ClawBench Chrome CDP endpoint. The Pi harness intentionally allowlists only read-only file tools and browser interaction tools; bash, write, edit, browser_http_get, and browser_run_script are not enabled. The agent instruction also explicitly requires browser-only task completion.
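
The read-only shell policy amounts to an allowlist on the command name. The Python sketch below shows the idea under simplifying assumptions: `is_allowed` is a hypothetical helper, the set contains only the commands named above (the README's list ends in "etc."), and any shell metacharacter is rejected outright instead of being parsed.

```python
import shlex

# Commands named in the FAQ answer; the real allowlist may be longer.
READ_ONLY_COMMANDS = {"ls", "cat", "find", "grep", "head", "tail", "jq", "wc"}
# Conservative: refuse pipes, chaining, substitution, and redirection.
SHELL_METACHARACTERS = set(";|&`$<>")

def is_allowed(command_line):
    """True only for a single allowlisted command with no shell operators."""
    if any(ch in SHELL_METACHARACTERS for ch in command_line):
        return False
    try:
        tokens = shlex.split(command_line)
    except ValueError:  # unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in READ_ONLY_COMMANDS

print(is_allowed("ls -la /my-info"))
print(is_allowed("curl https://example.com"))
print(is_allowed("cat notes.txt; wget https://example.com"))
```

A production policy would also need to consider options such as `find -exec`, which can launch other programs from an otherwise read-only command.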

</details>

<details> <summary><b>How do I add a new test case?</b></summary>

See CONTRIBUTING.md. In short: create a directory under the target corpus (test-cases/v1/ for V1 or test-cases/v2/ for V2) with a task.json conforming to test-cases/task.schema.json, define the eval schema, test with human mode, and submit a PR.
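
Before opening a PR, a quick local sanity check on a new `task.json` can catch obvious slips. The Python sketch below checks only the fields visible in the task 001 example (`instruction`, `time_limit`, `eval_schema.url_pattern`, `eval_schema.method`). The authoritative definition is `test-cases/task.schema.json`, which may require more; `validate_task` is a hypothetical helper.

```python
import re

def validate_task(task):
    """Return a list of problems found; an empty list means the basics look fine."""
    errors = []
    instruction = task.get("instruction")
    if not isinstance(instruction, str) or not instruction.strip():
        errors.append("instruction must be a non-empty string")
    limit = task.get("time_limit")
    if isinstance(limit, bool) or not isinstance(limit, (int, float)) or limit <= 0:
        errors.append("time_limit must be a positive number")
    schema = task.get("eval_schema")
    if not isinstance(schema, dict):
        errors.append("eval_schema must be an object")
        return errors
    pattern = schema.get("url_pattern")
    if not isinstance(pattern, str):
        errors.append("eval_schema.url_pattern must be a string")
    else:
        try:
            re.compile(pattern)  # the pattern is used as a regex
        except re.error as exc:
            errors.append(f"eval_schema.url_pattern is not a valid regex: {exc}")
    if not isinstance(schema.get("method"), str):
        errors.append("eval_schema.method must be a string")
    return errors
```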

</details>


Frequently Asked Questions

What is ClawBench? ClawBench is an open-source benchmark for AI browser agents — the systems (GPT-based, Claude-based, or open) that drive a real web browser to complete a user's task. V1 measures whether the agent actually finishes 153 everyday online tasks across 144 live websites; V2 adds a 130-task corpus in test-cases/v2/. It measures completion, not whether the agent produces the right-looking text.

What kinds of tasks does ClawBench cover? Fifteen life categories: food delivery, travel booking, job applications, shopping, housing search, email and calendar management, academic research, software development, learning platforms, and more. Every task is something a normal person might do in a normal week, on a real website.

Are 153 tasks enough for evaluation? Yes for a V1 benchmark signal: the 153 tasks span 144 live websites and 15 life categories, and each full run is expensive because it uses isolated containers, real websites, five-layer recording, and post-session judgment against human references. V2 adds another 130 tasks in test-cases/v2/. For cheaper iteration, start with the 20-task test-cases/v1-lite/ subset.

How is a task judged successful? Each task runs in an isolated browser container with a five-layer recording: video, screenshots, network requests, browser actions, and agent messages. For the original V1 results, an evaluator compares the agent trajectory against human reference runs and assigns PASS/FAIL with evidence from the recording. For V2 and newer leaderboard rows, scoring is two-stage: first, the request interceptor checks whether the final blocked HTTP request matches the task's URL/method schema; second, an LLM judge checks whether the captured request payload fulfills the natural-language instruction.
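
The two-stage scoring used for V2 and newer rows can be sketched as a short function. This illustrates the control flow only: `score_run`, the dictionary keys on the intercepted request, and the toy judge are assumptions, and in ClawBench the second stage is an LLM judge, not a string check.

```python
import re

def score_run(intercepted, schema, instruction, judge):
    """Return (verdict, reason); judge is any callable(instruction, payload) -> bool."""
    if intercepted is None:
        return "FAIL", "no request was intercepted"
    # Stage 1: mechanical URL/method check against the task schema.
    same_method = intercepted["method"].upper() == schema["method"].upper()
    if not (same_method and re.search(schema["url_pattern"], intercepted["url"])):
        return "FAIL", "intercepted request does not match the schema"
    # Stage 2: semantic check of the captured payload, delegated to the judge.
    if not judge(instruction, intercepted.get("body")):
        return "FAIL", "payload does not fulfill the instruction"
    return "PASS", "schema matched and judge accepted the payload"

# Toy stand-in for the LLM judge, for demonstration only.
def toy_judge(instruction, payload):
    return "no peanuts" in (payload or "")

schema = {"url_pattern": r"/orders$", "method": "POST"}  # hypothetical schema
request = {"url": "https://food.example/orders", "method": "POST",
           "body": '{"item": "Pad Thai", "note": "no peanuts"}'}
print(score_run(request, schema, "one Pad Thai, no peanuts", toy_judge))
```

Keeping the judge as an injected callable separates the cheap deterministic check from the model-dependent one, so the first stage can be re-run on old traces without calling a model.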

How do account login, registration, and initial task state work? Each run receives a synthetic user profile plus a fresh disposable PurelyMail address. If a task requires sign-up, the agent normally starts from scratch and registers during the run, using the provided identity and email. If a task needs starting files or workspace context, those files live under the task's extra_info/ directory and are mounted for the agent at runtime.

What happens when live websites change? Live-site change is part of the benchmark's target: ClawBench measures whether agents can handle production websites rather than frozen snapshots. That also means some runs can be affected by layout changes, availability, anti-bot systems, or alternate flows. Reproducibility comes from publishing task definitions, eval schemas, run metadata, and five-layer traces; repeated runs over time are still useful for measuring site drift.

Do CAPTCHA or bot checks dominate failures? If an agent encounters a CAPTCHA, it must attempt it. We have seen cases where frontier models are able to solve some CAPTCHAs. CAPTCHA failures can reflect model behavior, browser-control stack limits, or site defenses. The trace datasets make these failures inspectable.

What's the current top score? 33.3% — roughly one task in three — from the strongest frontier model we evaluated. The majority of tasks still defeat every model we've tested; the headroom is real, and the benchmark is not saturated.

Which harness are the published model results based on? The repo default is openclaw, but leaderboard rows include their harness explicitly. V1 results used OpenClaw; newer runs may use Hermes or other supported harnesses. Use the harness column when comparing models, because model and harness changes are separate experimental axes.

Is ClawBench tightly coupled to OpenClaw? No. OpenClaw is the default harness, but ClawBench supports interchangeable harnesses listed in src/clawbench/runtime/harnesses/.

Can ClawBench evaluate CLI agents? Yes. ClawBench is a browser-task benchmark, but CLI and coding-agent harnesses can drive the same instrumented Chromium session using native tools or MCPs.

How do I reproduce a published score? From a source checkout, configure models/models.yaml, then run uv run clawbench. The TUI builds the container image and runs local tasks against your model of choice. For batch runs, use --all-cases for the default V1 suite, --cases-suite v2 --all-cases for V2, or --cases-suite v1-lite --all-cases for Lite.

Will newer models be added? Yes. New model runs can be submitted or requested through the contribution flow and issues. Public rows are added as complete or clearly marked partial runs, depending on what has finished.

Is ClawBench safe to run against live websites? The runner uses a hardened container with a request interceptor that blocks purchases, account creation, outbound email sends, and similar irreversible actions by default. Tasks that need to simulate those actions (e.g., "add to cart and checkout") terminate at the last reversible step. You can relax the interceptor per-task if your research requires it.

Can I contribute new tasks or harnesses? Yes. V1 tasks live in test-cases/v1/; V2 tasks live in test-cases/v2/; Lite tasks live in test-cases/v1-lite/. See CONTRIBUTING.md for the task schema and validation flow.

How does ClawBench relate to HarnessBench? Same scoring pipeline, orthogonal axis. ClawBench fixes the harness and varies the model; HarnessBench fixes the model and varies the harness. They share the V1 153-task corpus, the five-layer recording, and the agentic evaluator — so numbers are directly comparable.

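🎯 aiskill88 AI Review · Grade A · 2026-05-23

ClawBench is an open-source AI workflow for evaluating the performance and reliability of browser AI agents. It provides test cases for 153 everyday online tasks and suits AI developers and researchers. It is a useful tool, but it still needs further development and refinement.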

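👥 Intended Audience
  • Automation engineers and operations staff
  • Project managers and business analysts
  • Professionals who want to reduce repetitive work
  • Digital transformation teams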
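⚖️ Pros and Cons
✅ Pros
  • Apache-2.0 license, free for commercial use
  • Greatly reduces repetitive manual work
  • Visual flows that are clear and intuitive
  • Highly extensible, supports complex scenarios
⚠️ Cons
  • Initial configuration and debugging take some time
  • Strongly dependent on the stability of external services
  • Complex scenarios require some technical background
⚠️ Usage Notice

AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute any legal endorsement of the tool's functionality or quality.

Test thoroughly in a sandbox or test environment before deploying to production, and carry out the necessary security assessment.

📄 License Notes

✅ Apache 2.0: a permissive open-source license that allows commercial use, requires retaining the copyright notice and NOTICE file, and includes a patent grant.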

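💡 AI Skill Hub Review

Overall, ClawBench is a good-quality agent workflow tool that is reasonably competitive among comparable tools. AI Skill Hub will keep tracking its updates. Consider bookmarking it and adopting it when the timing suits your own use case.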

🌐 Original Information
Original name: ClawBench
Topics: workflow, agent-evaluation, agentic-ai, ai-agent-benchmark, ai-agents, benchmark
GitHub: https://github.com/TIGER-AI-Lab/ClawBench
License: Apache-2.0
Language: Python
🔗 Original Source
🐙 GitHub repository: https://github.com/TIGER-AI-Lab/ClawBench
🌐 Official website: https://claw-bench.com

Indexed: 2026-05-22 · Updated: 2026-05-22 · License: Apache-2.0 · AI Skill Hub does not legally endorse the accuracy of third-party content.