AI Tools

NASDE AI Coding Agent Evaluation Tool

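Q: Is the evaluation dataset open source?

The tool uses its own benchmark data and supports user-defined evaluation tasks. The README's quick start builds a benchmark from your own repository's git history.

Q: Does it provide visual reports?

The CLI produces structured output that can be fed into a visualization tool. The README also lists optional Opik tracing through the --with-opik flag.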

Python-based · open-source AI tool, a GitHub community pick

English name: nasde-toolkit

⭐ 9 stars · 💻 Python · 📄 MIT · 🏷 AI score 7.5

Tags: agent evaluation, benchmarking, AI coding, MCP tools, Claude, performance evaluation

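✦ Recommended by AI Skill Hub

AI Skill Hub recommends the NASDE AI Coding Agent Evaluation Tool as a solid AI tool. With an AI overall score of 7.5, it holds up well among similar tools and is worth a closer look if you are looking for a dependable AI tool.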

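📚 In-depth analysis

NASDE AI Coding Agent Evaluation Tool is an open-source, Python-based tool with 9 stars on GitHub, covering agent evaluation, benchmarking, AI coding, and MCP tooling. The main advantage of open source is transparency: you can audit the code for security issues and adapt it to your own needs.

**Why use an open-source tool instead of a commercial SaaS?**
For individual developers and privacy-conscious users, a locally deployed open-source tool keeps the tool's own data on your machine and outside a vendor's data policy. Open-source tools also usually have no usage caps or monthly fees, so for frequent use the total cost of ownership (TCO) is typically lower than a subscription product. Note that NASDE drives coding agents such as Claude Code, Codex, and Gemini CLI through API keys or OAuth, so task content still reaches those providers.

**Installation and environment setup**
NASDE requires Python 3.12 or 3.13 (see Prerequisites). Use a version manager such as pyenv to avoid polluting the global environment. New users should create a virtual environment first (python -m venv venv && source venv/bin/activate) and then install dependencies; if something goes wrong, the environment can be deleted and recreated without affecting the system.

**Community and maintenance**
GitHub Issues and Discussions are the fastest way to get help. Before asking, check the closed issues, since many common problems already have answers there. When reporting a bug, include the output of pip list, the full error traceback, and a minimal reproducible example; this noticeably speeds up the maintainers' response. AI Skill Hub will keep tracking NASDE releases and report important feature changes.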

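📋 Tool overview

A benchmarking and evaluation framework for AI coding agents, listed here as an MCP tool. It evaluates agents on tasks you already have, helping developers understand the strengths and limits of AI coding assistants such as Claude Code. It suits AI researchers, evaluators, and developers of coding tools.

NASDE AI Coding Agent Evaluation Tool is an open-source tool written in Python that focuses on agent evaluation, benchmarking, and AI coding. As a GitHub open-source project it is community-supported and updated as needed, its code is fully transparent and auditable, and it can be deployed locally. It can be used individually or integrated into a team workflow.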

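Field	Value
GitHub stars	⭐ 9
Language	Python
Supported platforms	Windows / macOS / Linux
Maintenance status	Lightweight project, updated as needed
License	MIT
AI overall score	7.5
Tool type	AI tool
Forks	not listed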

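📖 Documentation

The following content was compiled automatically by AI Skill Hub from project information. For the complete original documentation, see "Original sources" at the bottom.

📌 Key features

Free and open source, with local deployment and full control over your data
Open-source development on GitHub with ongoing updates
Documentation and usage examples that are friendly to newcomers
Custom configuration to fit different environments
Usable as a base component in an existing stack or for further development

🎯 Main use cases

Running locally to protect data privacy and meet compliance requirements
Custom integration into existing systems to extend the stack
Serving as an open-source base component for commercial derivative work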

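The install commands below are auto-generated from the project's language and type; the official README takes precedence.

Install commands

# Option 1: install with pip
pip install nasde-toolkit

# Option 2: install into a virtual environment (recommended for production)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install nasde-toolkit

# Option 3: install from source (latest features)
git clone https://github.com/NoesisVision/nasde-toolkit
cd nasde-toolkit
pip install -e .

# Verify the installation (the README's own check)
nasde --version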

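📋 Installation steps

Visit the GitHub repository page
Install the dependencies as described in the README
Complete the initial configuration for your system
Start with the official examples or documentation
If you run into problems, search GitHub Issues for answers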

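The usage examples below were compiled by AI Skill Hub and cover the most common scenarios.

Common commands / code examples

# Check the installed version
nasde --version

# Run every variant of a benchmark project
nasde run --all-variants -C path/to/generated-benchmark

# Run a single variant
nasde run --variant NAME -C path/to/generated-benchmark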

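The configuration example below is based on a typical scenario; adjust the parameters according to the official documentation.

Configuration example

# nasde.toml: evaluator settings, as documented in the README
[evaluation]
backend = "claude"  # "claude" (default) | "codex"
model = "claude-opus-4-7"
dimensions_file = "assessment_dimensions.json"

# Agent credentials are supplied through environment variables, for example:
export ANTHROPIC_API_KEY="your-key"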

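📑 README deep dive · documentation completeness 70/100 · includes a workflow diagram

The following content was parsed directly from the GitHub README, preserving code blocks, tables, and list structure.

Introduction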

Noesis Agentic Software Development Evals Toolkit

Run an AI coding agent on a task you already know the answer to. Score the result. Compare configurations.

---

Prerequisites

Python 3.12+
Docker (default) or a cloud sandbox provider — Harbor runs agents in isolated environments
uv — Package manager
npm — Required for Gemini CLI (@google/gemini-cli is installed automatically by Harbor)
Agent credentials (at least one):
  Claude Code: ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN
  OpenAI Codex: CODEX_API_KEY (API key) or codex login (ChatGPT subscription OAuth)
  Gemini CLI: GEMINI_API_KEY (API key), GOOGLE_API_KEY (Vertex AI), or gemini login (Google account OAuth)
Evaluator CLI — the assessment evaluator spawns the claude CLI by default (or codex if [evaluation] backend = "codex"). That CLI must be installed and authenticated (OAuth subscription or API key — whichever you already use interactively)
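
A quick preflight check can confirm that the command-line tools named in the prerequisites are on PATH before a first run. The sketch below is not part of NASDE: the function name `missing_tools` and the exact list of executables are illustrative assumptions drawn from the list above.

```python
import shutil

# Executables named in the prerequisites above; adjust to your setup
# (for example, swap "claude" for "codex" if that is your evaluator backend).
REQUIRED_TOOLS = ["docker", "uv", "npm", "claude"]


def missing_tools(tools, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough. The check says nothing about whether a tool is authenticated.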

1. Install the CLI

uv tool install nasde-toolkit --python 3.13
nasde --version

This installs the latest stable release from PyPI.

Python version: We recommend --python 3.13 (latest stable, broadest wheel availability). --python 3.12 is also supported and tested if your environment standardizes on it. Python 3.14 is not currently supported — a transitive dependency (pyiceberg via supabase) hasn't yet released wheels for cp314. The cap will be lifted once upstream wheels land.
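
The supported range described above (3.12 and 3.13, with 3.14 excluded for now) can be checked with a few lines of Python. This is a minimal sketch, not NASDE's own check; `is_supported` and the two constants are names invented for illustration, and the upper bound will be stale once upstream wheels land.

```python
import sys

# Supported range stated in the README: 3.12 and 3.13 (3.14 not yet supported).
MIN_VERSION = (3, 12)
MAX_EXCLUSIVE = (3, 14)


def is_supported(version_info):
    """Return True if (major, minor) falls inside the supported range."""
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor < MAX_EXCLUSIVE


if __name__ == "__main__":
    status = "supported" if is_supported(sys.version_info) else "not supported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```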

2. Install the authoring skills for Claude Code

nasde install-skills

This copies the bundled nasde-benchmark-* skills into ~/.claude/skills/ so they're available in every Claude Code session. Use --scope project to install into the current project's .claude/skills/ instead, or --force to overwrite after a nasde upgrade.

Note: the authoring helpers are Claude Code skills. Codex and Gemini users can still run NASDE from the CLI — the skills just speed up creating benchmarks; they are not required to run them.

3. From inside your own repo, ask the agent to build a benchmark from git history

Open your own project in Claude Code and say something like:

"Create a NASDE benchmark with a single task, based on a recent piece of work from this repo — a commit, a range of commits, or a merged PR."

Start with one task. Point the skill at whatever unit of work feels self-contained in your workflow — a single commit, a range, a merged MR/PR, or an issue that was closed by a set of commits. The nasde-benchmark-from-history skill proposes a good candidate, and generates one task directory with instruction.md, a Dockerfile, test.sh, and a starter assessment_criteria.md. You review each file before it's written.

Then run it:

nasde run --all-variants -C path/to/generated-benchmark

--all-variants runs every variant the skill scaffolded, so you don't need to know their names yet. If you'd rather burn fewer tokens on the first run, pick just one with --variant NAME — you can run the others later.
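
Before spending tokens on a run, it can help to confirm that the generated task directory contains the four files named in step 3. The sketch below is a hypothetical helper, not a NASDE command. Where each file sits inside the task directory is not stated here, so the search is recursive by file name.

```python
from pathlib import Path

# File names listed in step 3. Their exact location inside the task directory
# is an assumption left open, so the lookup matches on name at any depth.
EXPECTED_FILES = ["instruction.md", "Dockerfile", "test.sh", "assessment_criteria.md"]


def missing_task_files(task_dir):
    """Return the expected file names not found anywhere under task_dir."""
    root = Path(task_dir)
    found = {p.name for p in root.rglob("*") if p.is_file()}
    return [name for name in EXPECTED_FILES if name not in found]


if __name__ == "__main__":
    print(missing_task_files("."))
```

An empty list means every expected file was found; it does not validate the contents of those files.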

Installation reference

The Quick start above uses uv tool install — recommended because it isolates nasde in its own environment and puts only the nasde command on PATH. Alternatives:

```bash
# Latest unreleased changes from main (for testing PRs and dev builds)
uv tool install git+https://github.com/NoesisVision/nasde-toolkit.git --python 3.13
```

Configuration fragments from nasde.toml:

harbor_env = "daytona"  # Optional: cloud sandbox provider (default: docker)

[docker]
base_image = "ubuntu:22.04"
build_commands = []

[evaluation]
backend = "claude"  # "claude" (default) | "codex"
model = "claude-opus-4-7"
dimensions_file = "assessment_dimensions.json"

Quick start (three steps)

The fastest path from zero to a working benchmark built from your own git history:

Results — four agent configurations scored against the same criteria

Variant	Pass	Domain (/25)	Encaps. (/20)	Arch. (/20)	Ext. (/15)	Tests (/20)	Total (/100)
`claude-vanilla`	75%	17.1	11.2	16.1	9.5	7.7	61.6
`claude-guided` (with a DDD skill)	75%	17.4	12.4	16.6	10.0	8.7	65.1
`codex-vanilla`	89%	18.8	13.8	16.8	11.4	8.7	69.4
`codex-guided` (same skill)	50%	11.5	9.6	12.9	7.4	6.0	47.4

The insight: the same "DDD guidance" skill helps Claude a little (+3.5) and badly hurts Codex (-22). The per-dimension breakdown pinpoints where Codex regresses — domain modeling, encapsulation, extensibility — which would be invisible without this assessment. Skill optimization is agent-specific.
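
The +3.5 and -22 figures follow directly from the table. The sketch below recomputes the per-dimension change between each agent's vanilla and guided variants; the scores are copied from the table, while `skill_delta` and the data layout are illustrative choices, not NASDE output.

```python
# Scores copied from the results table above (one row per variant).
DIMENSIONS = ["Domain", "Encaps.", "Arch.", "Ext.", "Tests", "Total"]
SCORES = {
    "claude-vanilla": [17.1, 11.2, 16.1, 9.5, 7.7, 61.6],
    "claude-guided": [17.4, 12.4, 16.6, 10.0, 8.7, 65.1],
    "codex-vanilla": [18.8, 13.8, 16.8, 11.4, 8.7, 69.4],
    "codex-guided": [11.5, 9.6, 12.9, 7.4, 6.0, 47.4],
}


def skill_delta(agent):
    """Per-dimension change from '<agent>-vanilla' to '<agent>-guided'."""
    vanilla = SCORES[f"{agent}-vanilla"]
    guided = SCORES[f"{agent}-guided"]
    return {
        dim: round(g - v, 1)  # round away floating-point noise
        for dim, v, g in zip(DIMENSIONS, vanilla, guided)
    }


if __name__ == "__main__":
    for agent in ("claude", "codex"):
        print(agent, skill_delta(agent))
```

For Codex, the largest drops are in Domain (-7.3), Encaps. (-4.2), and Ext. (-4.0), which matches the dimensions called out in the text.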

# Inside an existing virtual environment (3.12 or 3.13)
pip install nasde-toolkit

Configuring the reviewer agent

The reviewer agent (assessment evaluator) is configurable via the [evaluation] section in nasde.toml. By default it uses claude-opus-4-7 with read-only tools (Read, Glob, Grep).

All options

Setting	Default	Purpose
`backend`	`claude`	Subprocess backend: `claude` or `codex`
`model`	`claude-opus-4-7`	Evaluator model
`dimensions_file`	`assessment_dimensions.json`	Scoring dimensions file
`max_turns`	`30`	Max conversation turns
`allowed_tools`	`["Read", "Glob", "Grep"]`	Tool whitelist
`mcp_config`	—	Path to MCP server config JSON
`skills_dir`	—	Path to evaluator skills directory
`append_system_prompt`	—	Extra system prompt text
`include_trajectory`	`false`	Include ATIF trajectory in evaluation

When include_trajectory is enabled, the evaluator can read the agent's full execution trajectory (agent/trajectory.json) — tool calls, timestamps, token usage, errors. This enables assessment dimensions that evaluate the agent's process (efficiency, verification discipline, decision-making) alongside the final output. See examples/nasde-dev-skill for a working example with trajectory-aware dimensions.

`nasde run` options

Flag	Description
`--variant`	Variant to run (defaults to config default)
`--tasks`	Comma-separated task names to run
`--model`	Model override (e.g. `claude-sonnet-4-6`, `o3`, `google/gemini-3-flash-preview`)
`--timeout`	Agent timeout in seconds
`--with-opik`	Enable Opik tracing
`--without-eval`	Skip assessment evaluation
`--harbor-env`	Harbor execution environment (`docker`, `daytona`, `modal`, `e2b`, `runloop`, `gke`)
`--project-dir`, `-C`	Path to evaluation project
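
When runs are launched from a script, the flags in the table can be assembled into an argument list rather than a hand-built shell string. This is a minimal sketch: `build_run_command` is a hypothetical helper, it covers only flags listed above, and it does not execute anything by itself.

```python
import shlex


def build_run_command(project_dir, variant=None, tasks=None, model=None,
                      timeout=None, harbor_env=None, without_eval=False,
                      with_opik=False):
    """Assemble an argument list for `nasde run` from the documented flags."""
    cmd = ["nasde", "run"]
    if variant:
        cmd += ["--variant", variant]
    if tasks:
        cmd += ["--tasks", ",".join(tasks)]  # comma-separated task names
    if model:
        cmd += ["--model", model]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]  # seconds
    if harbor_env:
        cmd += ["--harbor-env", harbor_env]
    if without_eval:
        cmd.append("--without-eval")
    if with_opik:
        cmd.append("--with-opik")
    cmd += ["-C", str(project_dir)]
    return cmd


if __name__ == "__main__":
    cmd = build_run_command("my-benchmark", variant="gemini-baseline",
                            model="google/gemini-3-flash-preview")
    # Print a shell-safe preview; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with model names that contain slashes.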

mcp_config = "./evaluator_mcp.json" # MCP server config for evaluator

CLI cheatsheet

Most users only need nasde run — everything else is occasional. See Commands below for the full reference.

```bash
# Gemini CLI variant
nasde run --variant gemini-baseline --model google/gemini-3-flash-preview -C my-benchmark

# or: export OPENAI_API_KEY=sk-...
```

API key always takes priority over OAuth when both are present.
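
The priority rule can be illustrated for the Claude Code credentials named in the prerequisites. The sketch below shows the rule only; how NASDE or Harbor resolves credentials internally is not documented here, and `claude_credential_source` is a hypothetical name.

```python
import os


def claude_credential_source(env=None):
    """Return the name of the variable that would be used, or None."""
    env = os.environ if env is None else env
    if env.get("ANTHROPIC_API_KEY"):
        return "ANTHROPIC_API_KEY"  # API key wins when both are set
    if env.get("CLAUDE_CODE_OAUTH_TOKEN"):
        return "CLAUDE_CODE_OAUTH_TOKEN"  # OAuth token is the fallback
    return None


if __name__ == "__main__":
    print(claude_credential_source())
```

Running this before a benchmark shows which credential a run would pick up, which matters when an API key left in the shell would otherwise override a subscription login.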

The evaluation pipeline, end to end

flowchart LR
    A["Task: instruction.md + test.sh + assessment_criteria.md"] --> B["Coding agent solves task in an isolated container (Docker or cloud sandbox)"]
    B --> C["test.sh: initial rough tests"]
    C --> D["Binary reward 0 or 1"]
    D --> E["Reviewer agent reads the produced workspace + trajectory"]
    E --> F["Per-dimension scores vs. your criteria"]
    F --> G["Results logged (locally + optional experiment tracker)"]
    style E fill:#c0392b,color:#fff

Stage 1 (the agent does the work in a sandbox) comes from Harbor. The optional experiment-tracking stage at the end uses Opik. NASDE is the glue that connects them and adds the reviewer stage in between — plus the CLI, the benchmark project layout, and the authoring skills (see below).
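
The pipeline yields two kinds of signal per run: a binary reward from test.sh and per-dimension scores from the reviewer. The sketch below shows one way such results could be rolled up into a pass rate and mean scores. The `TrialResult` record and the averaging rule are assumptions for illustration; NASDE's actual result format and aggregation are not described in this excerpt.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TrialResult:
    """One run of one task: test.sh outcome plus reviewer scores."""
    reward: int                 # binary reward from test.sh: 0 or 1
    scores: dict[str, float]    # per-dimension scores from the reviewer agent


def summarize(trials):
    """Return the pass rate and the mean score per dimension across trials."""
    if not trials:
        raise ValueError("no trials to summarize")
    dimensions = sorted({dim for t in trials for dim in t.scores})
    return {
        "pass_rate": mean(t.reward for t in trials),
        "mean_scores": {
            dim: mean(t.scores[dim] for t in trials if dim in t.scores)
            for dim in dimensions
        },
    }


if __name__ == "__main__":
    demo = [
        TrialResult(1, {"Domain": 20.0, "Tests": 10.0}),
        TrialResult(0, {"Domain": 10.0, "Tests": 5.0}),
    ]
    print(summarize(demo))
```

Keeping the reward and the scores separate reflects the design of the pipeline: a run can pass its tests and still score poorly on a quality dimension, or the reverse.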

Benchmarking a Claude Code plugin (`[nasde.plugin]`)

If your task exercises a local Claude Code plugin (a directory with .claude-plugin/plugin.json, skills/, and an MCP server in .mcp.json), declare it once in task.toml — no vendored snapshot, no hand-wired Dockerfile COPY, no hand-written [environment.mcp_servers], no copying the plugin's skills into a variant:

[nasde.plugin]
path = "../../../src/plugins/my-plugin"   # dir containing .claude-plugin/plugin.json
ref = "abc1234"                           # optional git ref, same semantics as [nasde.source]
install_root = "/opt/my-plugin"           # optional, default /opt/<plugin-name>
build = "bun install --frozen-lockfile"   # optional, run at image-build time

[nasde.plugin.env]                        # optional, exported in the MCP server wrapper
CLAUDE_PLUGIN_DATA = "/opt/my-plugin-data"

One declaration ships the whole plugin into the sandbox image (at ref, via a temporary git worktree, for reproducibility), registers the plugin's own skills for the agent (whole skill dir, including references/), and wires its MCP server into the task automatically. Works with or without [nasde.source] and with or without a hand-written environment/Dockerfile. This removes the frozen-snapshot workaround entirely. See ADR-009.

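🎯 aiskill88 AI review · Grade B · 2026-05-22

An innovative tool that fills a gap in evaluating AI coding agents; its approach of evaluating on real tasks is commendable. Maintenance is still at an early stage and bears watching, and community adoption is limited.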

👥 Who it is for

AI enthusiasts, researchers and students, developers and engineers, tech entrepreneurs

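⚖️ Pros and cons

✅ Pros

+ MIT license, free for commercial use
+ Fully open source and free, with no licensing fees
+ Local deployment keeps data under your own control
+ Developer community support for looking up and asking about problems

⚠️ Cons

− Installation and initial setup may require some technical background
− Feature completeness is often behind mature commercial products
− Support depends mainly on the open-source community, so response times vary

⚠️ Usage notes

AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and is not a legal endorsement of the tool's functionality or quality.

Validate the tool thoroughly in a sandbox or test environment before deploying it to production, and carry out any necessary security assessment.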

❓ FAQ

Which AI coding agents can be evaluated?

The README lists Claude Code, OpenAI Codex, and Gemini CLI as supported agents.

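💡 AI Skill Hub review

Overall, NASDE AI Coding Agent Evaluation Tool is a good-quality AI tool that is reasonably competitive among similar tools. AI Skill Hub will keep tracking its updates; consider bookmarking it and adopting it when it fits your own use case.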

🌐 Original information

Original name	`nasde-toolkit`
Original description	Open-source MCP tool: CLI for benchmarks & evals of AI coding agents, on tasks you already understand. ⭐ 9 · Python
Topics	agent evaluation, benchmarking, AI coding, MCP tools, Claude, performance evaluation
GitHub	https://github.com/NoesisVision/nasde-toolkit
License	MIT
Language	Python

🔗 Original sources

🐙 GitHub repository: https://github.com/NoesisVision/nasde-toolkit
🌐 Official website: https://noesis.vision/nasde/

Listed: 2026-05-21 · Updated: 2026-05-22 · License: MIT · AI Skill Hub makes no legal endorsement of the accuracy of third-party content.