📄 工具详情 ⚙️ 安装教程 📚 使用教程

能力标签

🔌 MCP 🤖 Agent 🔄 工作流 🐳 Docker 💻 CLI 🔗 REST API 📚 RAG 🖼 视觉 🎙 STT 🧠 Claude

⚙️

Agent工作流

古安事数权模式

Q: evalmonkey 如何安装和开始使用？

访问 evalmonkey 的 GitHub 仓库或官方网站，按照 README 文档中的步骤安装依赖并运行。通常需要 Python 3.8+ 或 Node.js 16+ 基础环境。

Q: evalmonkey 是否免费？许可证是什么？

evalmonkey 完全免费，采用 Apache-2.0 许可证开源发布，任何人都可以免费使用、修改和分发。

Q: evalmonkey 适合哪些用户使用？

evalmonkey 主要面向有一定技术基础的用户，包括开发者、数据分析师、AI 工程师等专业人士。

Q: evalmonkey 的社区活跃度和项目维护状况如何？

evalmonkey 在 GitHub 上已获得 33 个 Star，处于积极发展阶段，社区在持续扩大。

基于 Python · 无代码搭建完整 AI 自动化流程

英文名：evalmonkey

⭐ 33 Stars 🍴 4 Forks 💻 Python 📄 Apache-2.0 🏷 AI 7.5分

7.5AI 综合评分

workflowagentai-agentai-agentsai-toolsbenchmarkpython

⬇ 下载源码 ZIP ⚙️ 配置说明 📺 TG 频道

✦ AI Skill Hub 推荐

AI Skill Hub 推荐使用：古安事数权模式是一款优质的Agent工作流。AI 综合评分 7.5 分，在同类工具中表现稳健。如果你正在寻找可靠的Agent工作流解决方案，这是一个值得深入了解的选择。

📚 深度解析

古安事数权模式是一套完整的 AI Agent 自动化工作流方案。随着 AI 能力的不断提升，基于 Agent 的自动化工作流正在成为提升个人和团队效率的核心方式。区别于传统的 RPA 自动化（模拟鼠标键盘操作），AI Agent 工作流通过理解任务意图、动态规划执行路径，能够处理更复杂的非结构化任务。

古安事数权模式工作流的设计遵循"最小配置，最大复用"原则：核心逻辑已经封装好，用户只需配置自己的 API Key 和业务参数即可快速上手。工作流内置错误处理和重试机制，在网络波动或 API 限速等情况下仍能稳定运行，适合作为生产环境的自动化基础设施。

在实际部署时，建议先在测试环境中运行 3-5 次，验证各个环节的输出结果符合预期，再部署到生产环境。AI Skill Hub 评分 7.5 分，是同类 Agent 工作流中的精选推荐。

📋 工具概览

古安事数权模式为常用的系统会模式。当前古安事数权模式为常用的系统会模式。

古安事数权模式是一套完整的 AI Agent 自动化工作流方案。通过可视化的节点编排，将复杂的多步骤任务拆解为清晰的自动化流程，实现全程无人值守的智能处理。支持与数百种外部服务和 API 无缝集成，适合构建数据处理管线、业务自动化和 AI 辅助决策系统。

GitHub Stars

⭐ 33

开发语言

Python

支持平台

Windows / macOS / Linux

维护状态

轻量级项目，按需更新

开源协议

Apache-2.0

AI 综合评分

7.5 分

工具类型

Agent工作流

Forks

📖 中文文档

以下内容由 AI Skill Hub 根据项目信息自动整理，如需查看完整原始文档请访问底部「原始来源」。

古安事数权模式为常用的系统会模式。当前古安事数权模式为常用的系统会模式。

📌 核心特色

可视化 Agent 工作流编排，无需编写复杂代码
支持多步骤自动化任务链，实现全流程无人值守
与外部 API、数据库和第三方服务无缝集成
内置错误处理与自动重试机制，保障稳定运行
提供可复用的自动化模板，快速在同类场景部署

🎯 主要使用场景

自动化日常重复性工作，将精力集中于创造性任务
构建数据采集 → 处理 → 输出的完整自动化管线
实现跨平台、跨系统的数据流转和业务协同

以下安装命令基于项目开发语言和类型自动生成，实际以官方 README 为准。

安装命令

# 方式一：pip 安装（推荐）
pip install evalmonkey

# 方式二：虚拟环境安装（推荐生产环境）
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install evalmonkey

# 方式三：从源码安装（获取最新功能）
git clone https://github.com/Corbell-AI/evalmonkey
cd evalmonkey
pip install -e .

# 验证安装
python -c "import evalmonkey; print('安装成功')"

📋 安装步骤说明

访问 GitHub 仓库获取工作流文件
在对应平台（Dify / Flowise / Make 等）中找到「导入工作流」功能
上传工作流文件
按照提示配置必要的环境变量和 API Key
运行测试确认流程正常后投入使用

以下用法示例由 AI Skill Hub 整理，涵盖最常见的使用场景。

常用命令 / 代码示例

# 命令行使用
evalmonkey --help

# 基本用法
evalmonkey input_file -o output_file

# Python 代码中调用
import evalmonkey

# 示例
result = evalmonkey.process("input")
print(result)

以下配置示例基于典型使用场景生成，具体参数请参照官方文档调整。

配置示例

# evalmonkey 配置文件示例（config.yml）
app:
  name: "evalmonkey"
  debug: false
  log_level: "INFO"

# 运行时指定配置文件
evalmonkey --config config.yml

# 或通过环境变量配置
export EVALMONKEY_API_KEY="your-key"
export EVALMONKEY_OUTPUT_DIR="./output"

📑 README 深度解析真实文档完整度 69/100 查看 GitHub 原文 →

以下内容由系统直接从 GitHub README 解析整理，保留代码块、表格与列表结构。

简介

Overview

Agents are fundamentally non-deterministic. They rely on external APIs, tool loops, and massive context windows. EvalMonkey is the ultimate, strictly local, open-source execution harness that enables developers to: 1. 🎯 Benchmark Capabilities: Run standard Agent benchmark datasets against your agent endpoints natively! 2. 🔥 Inject Chaos: Mutate headers, spike latency, and corrupt schemas dynamically to prove true resilience. 3. 📈 Track Production Reliability: Locally store all scores to visualize a single Production Reliability metric over time! 4. 🛠 Generate Improvement Evals: When scores are poor, automatically synthesise targeted test cases using your LLM — then hand them to Claude Code or Cursor to fix your agent.

EvalMonkey natively supports evaluating ANY LLM: AWS Bedrock, Azure, GCP, OpenAI, and Ollama.

Note on API Keys: If you have special setups that generate long-lived, static API keys for Bedrock, Azure, or GCP, simply supply them in the .env! EvalMonkey seamlessly supports both standard IAM / Service Account credential flows and long-term stateless authentication strings.

Option B — Manual Setup (5 minutes)

1. Install

git clone https://github.com/Corbell-AI/evalmonkey
cd evalmonkey
pip install -e .

2. Configure your LLM key (used only as the evaluation judge — never for your agent)

cp .env.example .env

Open .env and set one of these depending on your LLM provider: ```bash EVAL_MODEL=gpt-4o OPENAI_API_KEY=sk-... # OpenAI

First time setup:

cp .env.example .env # fill in EVAL_MODEL + your LLM provider key pip install -e .

⚡️ Quick Start

1. CSV Example (`evals.csv`)

If using a CSV, just make sure you have the columns id and expected_behavior_rubric. Any other column you add (like question, topic, image_url) will be automatically gathered and sent in the JSON payload directly to your agent!

id	expected_behavior_rubric	question
get_benefits	Must return the URL linking to the company hr portal	Where do I sign up for medical benefits?
time_off	Provide the exact number of standard vacation days (15)	How many days of PTO do I get?

evalmonkey run-benchmark --scenario get_benefits --eval-file evals.csv

2. JSON / YAML Example (`evals.json`)

If you use JSON or YAML, you must nest the agent payload keys explicitly under an input_payload dict object:

[
  {
    "id": "onboarding_query",
    "description": "Test HR agent's ability to return the onboarding link.",
    "expected_behavior_rubric": "Must contain exactly the URL https://hr.example.com/benefits",
    "input_payload": {
      "question": "Where do I sign up for benefits?"
    }
  }
]

evalmonkey run-benchmark --scenario onboarding_query --eval-file evals.json

</details>

---

Experience 1: Local Sample Agents (Single Command Start)

Easiest Experience: Test our built-in sample agents with a single command! EvalMonkey will spawn the sample agent in the background automatically and run the benchmark. ```bash

⚠️ 3 sample(s) scored below threshold — eval assets saved.

Experience 6: One-Command End-to-End Demo (RAG App)

Run the full benchmark + chaos + eval-generation pipeline against the built-in rag_app sample agent:

```bash

Option A — Let Claude Code or Cursor set it up for you (30 seconds)

Open Claude Code, Cursor, or any AI coding assistant and paste this prompt:

Set up EvalMonkey in my project so I can benchmark my AI agent.

1. Clone https://github.com/Corbell-AI/evalmonkey into a sibling folder
2. Run: pip install -e . inside that folder
3. Copy .env.example to .env and ask me which LLM provider I want to use as the benchmark judge (OpenAI, Anthropic, Bedrock, or Ollama) — then fill in the correct key
4. Run: evalmonkey init --framework <my_framework> --name "My Agent" --port <my_port>
   Use the framework my agent is built with (crewai / langchain / openai / bedrock / autogen / ollama / strands / custom)
5. Show me the generated evalmonkey.yaml and ask me to confirm the agent URL and response path are correct
6. Run a quick smoke test: evalmonkey run-benchmark --scenario gsm8k --sample-agent rag_app --limit 2
   to confirm everything is wired up correctly
7. Then run the real benchmark against my agent: evalmonkey run-benchmark --scenario mmlu --limit 5
8. Show me the score and explain what it means

The agent will handle cloning, installing, configuring your .env, and running the first benchmark — all without you typing a single command.

---

Setting Up in Claude Desktop / Cursor

Add the following to your MCP configuration file (e.g. claude_desktop_config.json):

{
  "mcpServers": {
    "evalmonkey": {
      "command": "evalmonkey",
      "args": ["serve-mcp"]
    }
  }
}

Once connected, your AI assistant will gain the ability to list benchmarks, trigger full evaluation runs, inject chaos payload mutators, pull historical trends, and generate improvement eval assets — entirely autonomously while helping you build your agent!

OpenAI-compatible endpoint returning {"choices":[{"message":{"content":""}}]}

evalmonkey run-benchmark --scenario arc \ --target-url http://localhost:8000/v1/chat/completions \ --request-key content \ --response-path choices.0.message.content ```

🤖 MCP Server (Cursor & Claude Integration)

EvalMonkey natively ships with a Model Context Protocol (MCP) server! This allows AI IDEs (like Cursor) or external agents (like Claude Desktop) to invoke EvalMonkey tools automatically while they build your agent.

1. run_full_pipeline(scenario="gsm8k", target_url="...", chaos_profiles="client_prompt_injection,client_payload_bloat")

🇨🇳 中文文档镜像 AI 翻译 2026-07-04

英文原文章节由系统翻译为中文摘要，便于快速理解。完整原文见上方 "📑 README 深度解析"。

📌 简介

EvalMonkey 是一个专为 Agent 设计的开源、严格本地化的执行测试框架。由于 Agent 在调用外部 API、工具循环及处理大规模 Context Window 时具有非确定性，EvalMonkey 为开发者提供了强大的基准测试能力。你可以直接在本地运行标准的 Agent Benchmark 数据集，并利用其独特的“混沌注入”功能（如修改 Header、增加 Latency 或损坏 Schema），通过模拟各种异常场景来全面评估 Agent 的鲁棒性。

🛠 安装步骤（Docker/pip/源码）

支持通过源码手动安装。首先通过 git clone 下载 EvalMonkey 仓库并进入目录，随后使用 `pip install -e .` 进行安装。安装完成后，需要进行首次环境配置：复制 `.env.example` 为 `.env` 文件，并在其中配置用于执行评估任务的 LLM 密钥（如 OpenAI API Key）。请注意，该密钥仅用于充当评估裁判（Judge），不会直接用于你的 Agent 运行。

🚀 使用教程

EvalMonkey 支持通过 CSV、JSON 或 YAML 文件进行测试。若使用 CSV 格式，请确保包含 `id` 和 `expected_behavior_rubric` 列，其他自定义列（如 `question`）将自动封装进 JSON Payload 发送给 Agent。若使用 JSON/YAML，则需将 Agent 的输入参数显式嵌套在 `input_payload` 字典对象下。通过命令行工具，你可以轻松触发针对特定场景的评估流程。

⚙️ 配置说明（含 MCP / env）

EvalMonkey 提供了极高的配置灵活性。你可以通过修改 `.env` 文件来指定评估用的 LLM 模型（如 `EVAL_MODEL=gpt-4o`）。此外，本项目深度集成 MCP 协议，支持通过 Claude Code、Cursor 或 Claude Desktop 进行快速配置。只需在 MCP 配置文件（如 `claude_desktop_config.json`）中添加相应的 `mcpServers` 配置，即可让 AI 助手直接调用 EvalMonkey 的工具进行基准测试和趋势分析。

🔌 API 说明

EvalMonkey 提供与 OpenAI 兼容的 API 端点，能够返回符合 OpenAI 标准格式的 JSON 响应（包含 `choices` 字段）。开发者可以通过命令行指定 `target-url`、`request-key` 以及 `response-path`，从而实现对不同 Agent 接口的自动化评估与结果解析。

🔄 工作流/模块

EvalMonkey 内置了 MCP Server，实现了与 Cursor 和 Claude Desktop 的原生集成。这种工作流允许 AI IDE 或外部 Agent 在构建 Agent 的过程中，自动调用 EvalMonkey 的工具链。通过 `run_full_pipeline` 等函数，开发者可以自动化地执行从场景加载、目标 URL 调用到混沌注入测试的全流程，并实时获取评估反馈。

🎯 aiskill88 AI 点评 A 级 2026-05-23

古安事数权模式为常用的系统会模式。古安事数权模式为常用的系统会模式。为常用的系统会模式。

📚 实用指南（长尾问题）

适合谁

需要让 Claude / Cursor 操作本地工具的 AI 工程师
构建多智能体协作系统的 Agent 开发者
构建企业知识库 / RAG 检索应用的团队

最佳实践

配置 MCP 服务器时建议使用 stdio 传输 + JSON-RPC，避免暴露公网
生产部署优先使用 Docker Compose 隔离依赖，并挂载 volume 持久化数据
本地部署优先选 GGUF 量化模型，节省显存并保持响应速度
分块大小建议 256-512 tokens，向量库优选 pgvector 或 Qdrant
Agent 任务先做 dry-run 验证工具调用链，再开启自主执行

常见错误

API key 直接提交到 git 仓库（请用 .env 并加入 .gitignore）
MCP 配置路径拼错或权限不足，重启 Claude Desktop 才生效
容器内无法访问宿主机 localhost — 使用 host.docker.internal
embedding 模型与查询模型不一致导致检索失效
显存不足直接 OOM — 优先降低 context 或换更小的量化模型
Python 依赖冲突：建议用 venv / uv 隔离环境

部署方案

Docker：evalmonkey 提供官方镜像，docker compose up 一键启动
CLI：直接 npm install -g / pip install，命令行调用
本地部署：CPU 8GB 起，GPU 推荐 16GB+ 显存
云端托管：可放在 Vercel / Railway / Fly.io 等 PaaS 平台

⚡ 核心功能

可视化 Agent 工作流编排，无需编写复杂代码
支持多步骤自动化任务链，实现全流程无人值守
与外部 API、数据库和第三方服务无缝集成
内置错误处理与自动重试机制，保障稳定运行
提供可复用的自动化模板，快速在同类场景部署

👥 适合谁

需要让 Claude / Cursor 操作本地工具的 AI 工程师
构建多智能体协作系统的 Agent 开发者
构建企业知识库 / RAG 检索应用的团队

⭐ 最佳实践

配置 MCP 服务器时建议使用 stdio 传输 + JSON-RPC，避免暴露公网
生产部署优先使用 Docker Compose 隔离依赖，并挂载 volume 持久化数据
本地部署优先选 GGUF 量化模型，节省显存并保持响应速度
分块大小建议 256-512 tokens，向量库优选 pgvector 或 Qdrant

⚠️ 常见错误

API key 直接提交到 git 仓库（请用 .env 并加入 .gitignore）
MCP 配置路径拼错或权限不足，重启 Claude Desktop 才生效
容器内无法访问宿主机 localhost — 使用 host.docker.internal
embedding 模型与查询模型不一致导致检索失效

👥 适合人群

自动化工程师和运维人员项目经理和业务分析师希望减少重复性工作的专业人士数字化转型团队

🎯 使用场景

自动化日常重复性工作，将精力集中于创造性任务
构建数据采集 → 处理 → 输出的完整自动化管线
实现跨平台、跨系统的数据流转和业务协同

⚖️ 优点与不足

✅ 优点

+Apache-2.0 协议，可免费商用
+大幅减少重复性人工操作
+可视化流程，清晰直观
+可扩展性强，支持复杂场景

⚠️ 不足

−初始配置和调试需投入一定时间
−强依赖外部服务的稳定性
−复杂场景需具备一定技术基础

⚠️ 使用须知

AI Skill Hub 为第三方内容聚合平台，本页面信息基于公开数据整理，不对工具功能和质量作任何法律背书。

建议在沙箱或测试环境中充分验证后，再部署至生产环境，并做好必要的安全评估。

📄 License 说明

🔗 相关工具推荐

LLM资源合集（精选）

精选100+可直接运行的AI Agent和RAG应用集合。包含完整工作流示例、智能代理框架和检索增强生成系统。适合AI开

LangChain AI开发框架

Agent工作流

ai-agents-for-beginners Agent工作流

微软官方开源项目，提供12堂系统课程学习AI智能体框架。涵盖工作流设计、RAG检索增强、多智能体协作等核心技能。适合AI

📚 相关教程推荐

AI 工具链资讯精选：2026-05-16 MCP / Agent / 自动化工具最新动态

帮助中心 · AI Skill Hub

AI 工具链资讯精选：2026-05-15 MCP / Agent / 自动化工具最新动态

📰 相关 AI 新闻

🍿 AI 圈相关吃瓜

AutoGPT 自主完成了任务：把我的文件夹全部重命名了

AI 圈观察

配了5个 MCP 工具，Claude 一个都没用

AI 圈观察

Filesystem MCP 帮 Claude 找文件，找了整个 node_modules

🗺️ 相关解决方案

ai-workflow-templates

docker

docker-deploy

🧩 你可能还需要

基于当前 Skill 的能力图谱，自动补全的工具组合

natively-cluely-ai-assistant — Claude Skill 中文使用文档

免费开源的AI面试助手，实时转录，隐蔽模式，局部RAG，BYOK。无订阅，防止数据泄露。

❓ 常见问题 FAQ

evalmonkey 是什么工具？−

evalmonkey 是一款Python开发的AI辅助工具。开源AI工作流：CLI for coding agents to benchmark & chaos test your AI Agents。⭐33 · Python 主要应用场景包括：古安事数权模式为常用的系统会模式。古安事数权模式为常用的系统会模式。。

evalmonkey 如何安装和开始使用？+

evalmonkey 是否免费？许可证是什么？+

evalmonkey 适合哪些用户使用？+

evalmonkey 的社区活跃度和项目维护状况如何？+

什么是 Agent 工作流？和普通自动化有什么区别？+

导入工作流后，我需要修改哪些配置？+

工作流运行失败了，如何排查问题？+

💡 AI Skill Hub 点评

总体来看，古安事数权模式是一款质量良好的Agent工作流，在同类工具中具备一定竞争力。AI Skill Hub 将持续追踪其更新动态，建议收藏备用，结合自身场景选择合适时机引入使用。

⬇️ 获取与下载

⬇ 下载源码 ZIP

✅ Apache-2.0 协议 · 可免费商用 · 直接从 aiskill88 服务器下载，无需跳转 GitHub

📚 深入学习古安事数权模式

查看分步骤安装教程和完整使用指南，快速上手这款工具

⚙️ 安装教程 📚 使用教程

🌐 原始信息

原始名称	`evalmonkey`
原始描述	开源AI工作流：CLI for coding agents to benchmark & chaos test your AI Agents。⭐33 · Python
Topics	`workflowagentai-agentai-agentsai-toolsbenchmarkpython`
GitHub	https://github.com/Corbell-AI/evalmonkey
License	Apache-2.0
语言	Python

🔗 原始来源

🐙 GitHub 仓库 https://github.com/Corbell-AI/evalmonkey

收录时间：2026-05-23 · 更新时间：2026-05-30 · License：Apache-2.0 · AI Skill Hub 不对第三方内容的准确性作法律背书。

📺 订阅 AI Skill Hub Daily Telegram 频道

每天 8 条精选 AI Skill、MCP、Agent 与自动化工具推送

加入频道 →

古安事数权模式

📚 深度解析

📋 工具概览

📖 中文文档

简介

Overview

Option B — Manual Setup (5 minutes)

First time setup:

⚡️ Quick Start

1. CSV Example (`evals.csv`)

2. JSON / YAML Example (`evals.json`)

Experience 1: Local Sample Agents (Single Command Start)

⚠️ 3 sample(s) scored below threshold — eval assets saved.

Experience 6: One-Command End-to-End Demo (RAG App)

Option A — Let Claude Code or Cursor set it up for you (30 seconds)

Setting Up in Claude Desktop / Cursor

OpenAI-compatible endpoint returning {"choices":[{"message":{"content":""}}]}

🤖 MCP Server (Cursor & Claude Integration)

1. run_full_pipeline(scenario="gsm8k", target_url="...", chaos_profiles="client_prompt_injection,client_payload_bloat")

⚡ 核心功能

👥 适合人群

🎯 使用场景

⚖️ 优点与不足

🔗 相关工具推荐

❓ 常见问题 FAQ

🤖 交给 Agent 安装 · 古安事数权模式