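Recommended by AI Skill Hub: Smart Prompt Evaluation (智能提示评估) is a high-quality prompt template. Its AI composite score is 7.5, a solid showing among comparable tools. If you are looking for a reliable prompt template solution, it is worth a closer look.
A framework for evaluating LLM knowledge inputs, providing prompt templates and benchmarks.
Smart Prompt Evaluation is a collection of professional prompt templates that has been carefully designed and repeatedly validated. These prompt frameworks are intended to draw out the deeper capabilities of large language models such as Claude and ChatGPT, so that the AI produces more accurate and more useful output. No installation is needed: copy the template content into an AI chat box and use it directly.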
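The prompt requires no installation; copy it and use it directly. Supported models: Claude, ChatGPT, Gemini, Tongyi Qianwen (通义千问), and other mainstream models.
Usage steps:
1. Copy the prompt template content.
2. Paste it into the AI chat box.
3. Replace each [placeholder] with your actual content.
4. Send it to get structured output.
To get the original files:
git clone https://github.com/lizhiyao/oh-my-knowledge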
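Paste into Claude or ChatGPT to use. Example prompt structure:
You are a [role] who specializes in [domain].
Complete the task according to the following requirements:
Task background: [describe the background]
Specific requirements: [detailed description]
Output format: [expected format]
Replace the content inside [] with your actual needs.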
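The placeholder substitution described above can also be scripted rather than done by hand. Below is a minimal TypeScript sketch of that idea. The `fillTemplate` helper, the English placeholder names, and the sample values are all hypothetical and written for illustration; they are not part of oh-my-knowledge.

```typescript
// Hypothetical template mirroring the example prompt structure above.
// The placeholder names are illustrative assumptions.
const TEMPLATE = [
  "You are a [role] who specializes in [domain].",
  "Complete the task according to the following requirements:",
  "Task background: [background]",
  "Specific requirements: [requirements]",
  "Output format: [format]",
].join("\n");

// Replace every [placeholder] with the matching entry in `values`.
// Throws if a placeholder has no value, so unfilled slots are not sent to the model.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\[([^\[\]]+)\]/g, (_match: string, key: string): string => {
    const value = values[key];
    if (value === undefined) {
      throw new Error(`No value supplied for placeholder [${key}]`);
    }
    return value;
  });
}

// Illustrative values only.
const prompt = fillTemplate(TEMPLATE, {
  role: "senior code reviewer",
  domain: "TypeScript services",
  background: "a pull request that adds request retries",
  requirements: "list correctness risks first, then style issues",
  format: "a numbered list",
});

console.log(prompt);
```

Failing loudly on a missing value is a deliberate choice in this sketch: a prompt that still contains a literal `[placeholder]` tends to produce confusing model output that is hard to trace back to its cause.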
oh-my-knowledge configuration notes
View the configuration options:
oh-my-knowledge --config-example > config.yml
Common configuration items:
# output_dir: ./output
# log_level: info
# workers: 4
Environment variable (overrides the config file):
export OH_MY_KNOWLEDGE_CONFIG="/path/to/config.yml"
Did your prompt actually get better? A/B test your prompts and skills with statistical rigor — bootstrap CI and length-debias on by default, Krippendorff α the moment you add a gold set.

| Feature | What it does |
|---|---|
| **One-line verdict** | omk eval six-tier verdict + ship recommendation + exit-code routing; HTML pill shares the same rules |
| **Six-dim evaluation** | Fact / Behavior / LLM-judge / Cost / Efficiency / Stability shown independently |
| **Multi-executor** | Claude CLI / Claude SDK / Codex CLI / Codex SDK / OpenAI / Gemini / Anthropic API / any custom command |
| **30+ assertion types** | substring, regex, JSON Schema, ROUGE/BLEU/Levenshtein similarity, agent tool-call assertions, semantic similarity, custom JS |
| **Statistical rigor** | Bootstrap CI / length-debias / saturation curve on by default; Krippendorff α auto-computed with a gold set. [Details →](docs/explanation/statistical-rigor.md) |
| **RAG metrics** | faithfulness / answer_relevancy / context_recall — anti-hallucination + answer relevance + context coverage |
| **LLM health audit** | omk doctor grades 7 builtin dimensions; --static-only runs offline without an LLM |
| **Production observability** | parse Claude Code session JSONL traces; measure per-skill failure rate / latency / cost / knowledge-gap signals |
| **Knowledge-gap detection** | severity-weighted signals quantify risk exposure instead of claiming completeness |
| **Construct-validity isolation** | --strict-baseline (default ON) cuts three contamination channels so baseline doesn't silently see the skill it's being compared against |
| **Sample design science** | sample schema with capability / difficulty / construct / provenance metadata (HF Dataset Cards style); studio surfaces coverage breakdown plus rubric_clarity_low / capability_thin flags. [docs/specs/sample-design-spec.md](docs/specs/sample-design-spec.md) |
| **Multi-judge ensemble** | --judge-models claude:opus,openai:gpt-4o cross-vendor scoring + agreement metrics |
| **Blind A/B** | --blind hides variant names; HTML report has a reveal button |
| **Multi-run variance** | --repeat N repeats the eval and computes mean / SD / CI / t-test |
| **MCP URL fetching** | pull content from private-doc URLs via an MCP server (SSO-protected knowledge bases, etc.) |
| **Auto analysis** | detects low-discrimination assertions, flat scores, all-pass / all-fail, expensive samples |
| **Traceability** | reports carry CLI version, Node version, artifact version fingerprint, judge prompt hash |
| **EN / ZH switch** | one-click language toggle in the HTML report |
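The feature table lists Levenshtein similarity among the assertion types. As a rough illustration of what such an assertion might check, here is a minimal TypeScript sketch. The function names, the normalization by the longer string's length, and the 0.8 threshold are assumptions made for this example; they are not omk's actual implementation or API.

```typescript
// Classic dynamic-programming edit distance, keeping only two rows in memory.
// Note: strings are compared by UTF-16 code unit, which is fine for a sketch.
function levenshtein(a: string, b: string): number {
  // prev[j] = distance between the first (i - 1) chars of a and the first j chars of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 means identical, 0 means nothing in common.
function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Hypothetical assertion: passes when the output is close enough to the reference.
function assertSimilar(actual: string, expected: string, threshold = 0.8): boolean {
  return similarity(actual, expected) >= threshold;
}

console.log(levenshtein("kitten", "sitting")); // 3
console.log(assertSimilar("The fix adds a retry.", "The fix adds a retry"));
```

A fuzzy assertion like this tolerates small wording differences that a strict substring or regex check would reject, at the cost of needing a threshold that has to be tuned per task.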
claude CLI (for the default executor and LLM judge; see Claude Code)
--no-judge
npm i -g oh-my-knowledge
omk init demo && cd demo
omk eval --control code-review-v1 --treatment code-review-v2
That's it — no editing required. omk init scaffolds two skill variants and three sample cases; omk eval runs the controlled A/B and opens an HTML report with a one-line verdict in about five minutes.
Walkthrough: 5-minute quickstart guide (recommended for first-time users).
Deeper: CLI reference · how it works · eval sample format · executors & artifact layout
| Variable | Description |
|---|---|
| CCV_PROXY_URL | proxy requests through cc-viewer for live eval-traffic visualization |
| OMK_REPORT_PORT | report server port (default: 7799) |

| | omk | promptfoo | DeepEval | LangSmith |
|---|---|---|---|---|
| Bootstrap CI | ✓ default | ✗ | ✗ | ✗ |
| Krippendorff α (judge ↔ human) | ✓ with gold set | ✗ | ✗ | ✗ |
| Length-debias judge prompt | ✓ default | ✗ | ✗ | ✗ |
| Saturation curve | ✓ | ✗ | ✗ | ✗ |
| Three-layer scoring isolation | ✓ | ✗ | partial | ✗ |
| Per-variant skill isolation (construct validity) | ✓ default | ✗ | ✗ | ✗ |
| Native Claude Code skill | ✓ | ✗ | ✗ | ✗ |
| Hosted SaaS dashboard | ✗ | ✗ | ✓ | ✓ |
omk's moat is its default-on safety net: Bootstrap CI and length-debias aren't advanced flags, they're the defaults, and judge ↔ human α comes free the moment you add a gold set. Other tools let you opt into confidence intervals; omk makes them unavoidable. Need a hosted SaaS dashboard? Choose LangSmith. Want quick local prompt iteration without statistics? Choose promptfoo. Shipping to production, where someone will ask "why should I trust this number?" Choose omk.
RAG-specific evals: see RAGAS (separate niche, complementary to omk). Full comparison with 7 tools across 25+ dimensions: docs/reference/comparison.md.
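To make the bootstrap confidence interval in the comparison concrete, here is a minimal TypeScript sketch of a percentile bootstrap for the difference in mean score between a treatment and a control variant. This is a generic textbook technique written under stated assumptions, not omk's own code: the function names, the 2000 resamples, the 95% level, the fixed seed, and the sample scores are all illustrative.

```typescript
// Small seeded PRNG (mulberry32-style) so the sketch gives reproducible results.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Draw xs.length values from xs with replacement.
function resample(xs: number[], rand: () => number): number[] {
  return xs.map(() => xs[Math.floor(rand() * xs.length)]);
}

interface DiffInterval {
  diff: number;  // observed mean(treatment) - mean(control)
  lower: number; // lower percentile bound
  upper: number; // upper percentile bound
}

function bootstrapDiffCI(
  control: number[],
  treatment: number[],
  iterations = 2000,
  alpha = 0.05,
  seed = 42,
): DiffInterval {
  if (control.length === 0 || treatment.length === 0) {
    throw new Error("Both samples must be non-empty");
  }
  const rand = seededRandom(seed);
  const diffs: number[] = [];
  for (let i = 0; i < iterations; i++) {
    diffs.push(mean(resample(treatment, rand)) - mean(resample(control, rand)));
  }
  diffs.sort((x, y) => x - y);
  const lowerIndex = Math.floor((alpha / 2) * iterations);
  const upperIndex = Math.min(iterations - 1, Math.ceil((1 - alpha / 2) * iterations) - 1);
  return {
    diff: mean(treatment) - mean(control),
    lower: diffs[lowerIndex],
    upper: diffs[upperIndex],
  };
}

// Illustrative per-sample scores, not real eval results.
const controlScores = [1, 2, 3, 2, 1, 2, 3, 2];
const treatmentScores = [4, 5, 4, 5, 4, 5, 4, 5];
console.log(bootstrapDiffCI(controlScores, treatmentScores));
```

If the resulting interval excludes zero, the observed improvement is unlikely to be resampling noise alone. An interval that straddles zero is a signal to collect more samples before shipping the change.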
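A useful framework for evaluating LLM knowledge inputs.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute a legal endorsement of the tool's functionality or quality.
We recommend validating the tool thoroughly in a sandbox or test environment, and completing the necessary security assessment, before deploying it to production.
✅ MIT License: one of the most permissive open-source licenses. You may use, modify, and distribute the code freely, including commercially, provided the copyright and license notice is retained.
Overall, Smart Prompt Evaluation (智能提示评估) is a good-quality prompt template that is reasonably competitive among comparable tools. AI Skill Hub will continue to track its updates; consider bookmarking it and adopting it when it fits your own scenario.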
| Field | Value |
|---|---|
| Original name | oh-my-knowledge |
| Original description | Open-source prompt template: Evaluation framework for LLM knowledge inputs - prompts, RAG corpora, skills, ag. ⭐6 · TypeScript |
| Topics | AI, LLM, benchmark, prompt |
| GitHub | https://github.com/lizhiyao/oh-my-knowledge |
| License | MIT |
| Language | TypeScript |
Listed: 2026-06-01 · Updated: 2026-06-01 · License: MIT · AI Skill Hub does not legally endorse the accuracy of third-party content.