AI Skill Hub 推荐使用:古安事数权模式 是一款优质的Agent工作流。AI 综合评分 7.5 分,在同类工具中表现稳健。如果你正在寻找可靠的Agent工作流解决方案,这是一个值得深入了解的选择。
古安事数权模式为常用的系统会模式。当前古安事数权模式为常用的系统会模式。
古安事数权模式 是一套完整的 AI Agent 自动化工作流方案。通过可视化的节点编排,将复杂的多步骤任务拆解为清晰的自动化流程,实现全程无人值守的智能处理。支持与数百种外部服务和 API 无缝集成,适合构建数据处理管线、业务自动化和 AI 辅助决策系统。
古安事数权模式为常用的系统会模式。当前古安事数权模式为常用的系统会模式。
古安事数权模式 是一套完整的 AI Agent 自动化工作流方案。通过可视化的节点编排,将复杂的多步骤任务拆解为清晰的自动化流程,实现全程无人值守的智能处理。支持与数百种外部服务和 API 无缝集成,适合构建数据处理管线、业务自动化和 AI 辅助决策系统。
# 方式一:pip 安装(推荐)
pip install evalmonkey
# 方式二:虚拟环境安装(推荐生产环境)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install evalmonkey
# 方式三:从源码安装(获取最新功能)
git clone https://github.com/Corbell-AI/evalmonkey
cd evalmonkey
pip install -e .
# 验证安装
python -c "import evalmonkey; print('安装成功')"
# 命令行使用
evalmonkey --help
# 基本用法
evalmonkey input_file -o output_file
# Python 代码中调用
import evalmonkey
# 示例
result = evalmonkey.process("input")
print(result)
# evalmonkey 配置文件示例(config.yml) app: name: "evalmonkey" debug: false log_level: "INFO" # 运行时指定配置文件 evalmonkey --config config.yml # 或通过环境变量配置 export EVALMONKEY_API_KEY="your-key" export EVALMONKEY_OUTPUT_DIR="./output"
<p align="center"> <img src="assets/evalmonkey-logo.png" alt="EvalMonkey Logo" width="400"/> </p>
Agents are fundamentally non-deterministic. They rely on external APIs, tool loops, and massive context windows. EvalMonkey is the ultimate, strictly local, open-source execution harness that enables developers to: 1. 🎯 Benchmark Capabilities: Run standard Agent benchmark datasets against your agent endpoints natively! 2. 🔥 Inject Chaos: Mutate headers, spike latency, and corrupt schemas dynamically to prove true resilience. 3. 📈 Track Production Reliability: Locally store all scores to visualize a single Production Reliability metric over time! 4. 🛠 Generate Improvement Evals: When scores are poor, automatically synthesise targeted test cases using your LLM — then hand them to Claude Code or Cursor to fix your agent.
EvalMonkey natively supports evaluating ANY LLM: AWS Bedrock, Azure, GCP, OpenAI, and Ollama.
Note on API Keys: If you have special setups that generate long-lived, static API keys for Bedrock, Azure, or GCP, simply supply them in the .env! EvalMonkey seamlessly supports both standard IAM / Service Account credential flows and long-term stateless authentication strings.
1. Install
git clone https://github.com/Corbell-AI/evalmonkey
cd evalmonkey
pip install -e .
2. Configure your LLM key (used only as the evaluation judge — never for your agent)
cp .env.example .env Open .env and set one of these depending on your LLM provider: ```bash EVAL_MODEL=gpt-4o OPENAI_API_KEY=sk-... # OpenAI
cp .env.example .env # fill in EVAL_MODEL + your LLM provider key pip install -e .
If using a CSV, just make sure you have the columns id and expected_behavior_rubric. Any other column you add (like question, topic, image_url) will be automatically gathered and sent in the JSON payload directly to your agent!
| id | expected_behavior_rubric | question |
|---|---|---|
| get_benefits | Must return the URL linking to the company hr portal | Where do I sign up for medical benefits? |
| time_off | Provide the exact number of standard vacation days (15) | How many days of PTO do I get? |
evalmonkey run-benchmark --scenario get_benefits --eval-file evals.csv
If you use JSON or YAML, you must nest the agent payload keys explicitly under an input_payload dict object:
[
{
"id": "onboarding_query",
"description": "Test HR agent's ability to return the onboarding link.",
"expected_behavior_rubric": "Must contain exactly the URL https://hr.example.com/benefits",
"input_payload": {
"question": "Where do I sign up for benefits?"
}
}
]
evalmonkey run-benchmark --scenario onboarding_query --eval-file evals.json </details>
---
Easiest Experience: Test our built-in sample agents with a single command! EvalMonkey will spawn the sample agent in the background automatically and run the benchmark. ```bash
Run the full benchmark + chaos + eval-generation pipeline against the built-in rag_app sample agent:
```bash
Open Claude Code, Cursor, or any AI coding assistant and paste this prompt:
Set up EvalMonkey in my project so I can benchmark my AI agent.
1. Clone https://github.com/Corbell-AI/evalmonkey into a sibling folder
2. Run: pip install -e . inside that folder
3. Copy .env.example to .env and ask me which LLM provider I want to use as the benchmark judge (OpenAI, Anthropic, Bedrock, or Ollama) — then fill in the correct key
4. Run: evalmonkey init --framework <my_framework> --name "My Agent" --port <my_port>
Use the framework my agent is built with (crewai / langchain / openai / bedrock / autogen / ollama / strands / custom)
5. Show me the generated evalmonkey.yaml and ask me to confirm the agent URL and response path are correct
6. Run a quick smoke test: evalmonkey run-benchmark --scenario gsm8k --sample-agent rag_app --limit 2
to confirm everything is wired up correctly
7. Then run the real benchmark against my agent: evalmonkey run-benchmark --scenario mmlu --limit 5
8. Show me the score and explain what it means
The agent will handle cloning, installing, configuring your .env, and running the first benchmark — all without you typing a single command.
---
Add the following to your MCP configuration file (e.g. claude_desktop_config.json):
{
"mcpServers": {
"evalmonkey": {
"command": "evalmonkey",
"args": ["serve-mcp"]
}
}
}
Once connected, your AI assistant will gain the ability to list benchmarks, trigger full evaluation runs, inject chaos payload mutators, pull historical trends, and generate improvement eval assets — entirely autonomously while helping you build your agent!
evalmonkey run-benchmark --scenario arc \ --target-url http://localhost:8000/v1/chat/completions \ --request-key content \ --response-path choices.0.message.content ```
EvalMonkey natively ships with a Model Context Protocol (MCP) server! This allows AI IDEs (like Cursor) or external agents (like Claude Desktop) to invoke EvalMonkey tools automatically while they build your agent.
古安事数权模式为常用的系统会模式。古安事数权模式为常用的系统会模式。为常用的系统会模式。
AI Skill Hub 为第三方内容聚合平台,本页面信息基于公开数据整理,不对工具功能和质量作任何法律背书。
建议在沙箱或测试环境中充分验证后,再部署至生产环境,并做好必要的安全评估。
✅ Apache 2.0 — 宽松开源协议,可商用,需保留版权声明和 NOTICE 文件,含专利授权条款。
总体来看,古安事数权模式 是一款质量良好的Agent工作流,在同类工具中具备一定竞争力。AI Skill Hub 将持续追踪其更新动态,建议收藏备用,结合自身场景选择合适时机引入使用。
| 原始名称 | evalmonkey |
| Topics | workflowagentai-agentai-agentsai-toolsbenchmarkpython |
| GitHub | https://github.com/Corbell-AI/evalmonkey |
| License | Apache-2.0 |
| 语言 | Python |
收录时间:2026-05-23 · 更新时间:2026-05-23 · License:Apache-2.0 · AI Skill Hub 不对第三方内容的准确性作法律背书。
选择 Agent 类型,复制安装指令后粘贴到对应客户端