skill · prompt · RAG · agent
Manage, evaluate, improve, and observe your skills, prompts, RAG, and agent context on one rigorous statistical foundation across the whole journey. Bootstrap confidence intervals and length-debiasing are on by default: a built-in safety net, not an advanced flag.
doctor, eval, and observe aren't three tools — they're one measurement discipline at three points in a skill's lifecycle, each answering a different question.
Is the skill itself well written? Seven built-in dimensions are scored independently: static rules at zero cost, an LLM audit for depth, and endpoint-driven custom dimensions.
$ omk doctor my-skill --dimensions audit.yaml
Is v2 really better than v1? A controlled A/B — same model, same cases, only the knowledge changes. Six dimensions scored independently, one-line verdict with a ship recommendation.
$ omk eval --control v1 --treatment v2
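As an illustration of what a bootstrap confidence interval adds to an A/B verdict, here is a minimal Python sketch. It is not omk's implementation: the function names, the percentile method, and the decision rule are assumptions chosen for clarity. It resamples the per-case score differences (same cases, two versions) and only calls a winner when the interval excludes zero.

```python
import random
from statistics import mean


def paired_bootstrap_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-case difference (treatment - control).

    Hypothetical helper for illustration; omk's own statistics may differ.
    """
    if len(control) != len(treatment) or not control:
        raise ValueError("need two equal-length, non-empty score lists")
    # Pair scores case by case, since both versions ran on the same cases.
    diffs = [t - c for c, t in zip(control, treatment)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


def verdict(lo, hi):
    """Assumed decision rule: only call a winner when the CI excludes zero."""
    if lo > 0:
        return "treatment better"
    if hi < 0:
        return "control better"
    return "inconclusive"
```

With only a handful of cases the interval is usually wide, which is the point: a raw difference in means can look like a win when the data cannot actually support one.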
How is it doing in production? Parse real session JSONL to measure each skill's failure rate, latency, and token cost, and surface severity-weighted knowledge-gap signals.
$ omk observe ~/.claude/sessions
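To make the idea of measuring from session logs concrete, the sketch below aggregates per-skill failure rate, latency, and token cost from JSONL lines. The record schema (`skill`, `ok`, `latency_ms`, `tokens`) is hypothetical and chosen only for illustration; it is not the real session format, and this is not how omk's parser is written.

```python
import json
from collections import defaultdict


def summarize_sessions(lines):
    """Aggregate per-skill metrics from JSONL lines (hypothetical schema)."""
    totals = defaultdict(lambda: {"calls": 0, "failures": 0, "latency_ms": 0.0, "tokens": 0})
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the whole run
        if not isinstance(event, dict) or "skill" not in event:
            continue  # ignore records that are not skill invocations
        t = totals[event["skill"]]
        t["calls"] += 1
        t["failures"] += 0 if event.get("ok", True) else 1
        t["latency_ms"] += float(event.get("latency_ms", 0))
        t["tokens"] += int(event.get("tokens", 0))
    return {
        skill: {
            "calls": t["calls"],
            "failure_rate": t["failures"] / t["calls"],
            "mean_latency_ms": t["latency_ms"] / t["calls"],
            "mean_tokens": t["tokens"] / t["calls"],
        }
        for skill, t in totals.items()
    }
```

In practice the lines would come from iterating over an open session file; passing any iterable of strings keeps the function easy to test.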
Five often-overlooked distortions decide whether a comparison is trustworthy. omk builds every defense into the foundation, so you don't have to enable them one by one.
Peer tools typically cover only one or two of these. omk's choice: build credibility into the foundation rather than leave it optional.
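Length bias is one such distortion: LLM judges often reward longer answers regardless of quality. One common correction, shown here as a hedged sketch and not necessarily the method omk uses, fits a least-squares line of score against response length and subtracts the length-explained part. The function name and the linear model are assumptions.

```python
from statistics import mean


def length_debias(scores, lengths):
    """Remove the linear effect of response length from judge scores.

    Illustrative only: a simple OLS residualization that preserves the mean score.
    """
    if len(scores) != len(lengths) or len(scores) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    mean_len, mean_score = mean(lengths), mean(scores)
    var_len = sum((x - mean_len) ** 2 for x in lengths)
    if var_len == 0:
        return list(scores)  # all responses have equal length, nothing to correct
    # OLS slope of score on length.
    slope = sum((x - mean_len) * (y - mean_score) for x, y in zip(lengths, scores)) / var_len
    # Subtract the part of each score that length alone predicts.
    return [y - slope * (x - mean_len) for x, y in zip(lengths, scores)]
```

If scores rise purely with length, the adjusted scores collapse to the same value, so a version that merely writes more no longer looks better.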
Criteria come from common LLMOps selection axes (metric library / judge / CI / observability / collaboration) plus measurement validity & reliability — not rules tailored to omk. On several axes omk doesn't win, and we mark that honestly.
| Capability | omk | promptfoo | DeepEval | LangSmith |
|---|---|---|---|---|
| **Measurement credibility · validity / reliability** | | | | |
| Statistical significance (CI / tests) | ✓ Bootstrap | — | — | — |
| Judge ↔ human reliability (agreement) | ✓ Krippendorff α | — | — | — |
| Evaluation bias control (length-debias) | ✓ default | — | — | — |
| **Evaluation capability** | | | | |
| Assertion / metric library breadth | ✓ 30+ | ✓ | ✓ | ◑ |
| RAG-specific metrics | ◑ 3 | ◑ | ✓ rich | ◑ |
| LLM-as-judge | ✓ | ✓ | ✓ | ✓ |
| **Engineering & collaboration** | | | | |
| CI/CD integration (exit-code routing) | ✓ | ✓ | ✓ | ◑ |
| Onboarding speed / config simplicity | ◑ | ✓ very fast | ◑ | ◑ |
| Experiment tracking / tracing | — | — | ◑ | ✓ strong |
| Hosted SaaS dashboard / team collab | — | — | ✓ | ✓ |
| **Ecosystem & integration** | | | | |
| Community size (GitHub stars, 2026-04) | nascent | 9k+ | 12k+ | commercial |
| Native Claude Code skill | ✓ | — | — | — |
The full comparison (8 tools × 30+ dimensions, including RAGAS, OpenAI Evals, lm-eval-harness, and inspect-ai) is in the comparison doc, current as of 2026-04. Spot something stale? Send a PR. Takeaway: there is no silver bullet. omk's tradeoff is "statistical credibility by default"; if you want a SaaS dashboard, pick LangSmith; if you want academic benchmarks, pick lm-eval-harness.
One command, omk install omk-agent-skill, installs the official Agent Skill into whichever of Claude Code and Codex it detects locally (--to all writes to all). After that, /omk works out of the box in Claude Code; in Codex and other agents, just run the omk CLI.
No commands to memorize — state your goal in plain words, and the agent locates the skill from context and picks the right command.
Before your next release, let the data speak first.
No files to touch — omk init scaffolds two skill versions and three cases, and omk eval produces an HTML report plus a one-line verdict in under 5 minutes.