Full-lifecycle evaluation across skill · prompt · RAG · agent

Make every change
backed by statistical evidence

Manage, evaluate, improve, and observe your skills, prompts, RAG, and agent context — one rigorous statistical foundation across the whole journey. Bootstrap confidence intervals and length-debiasing are on by default — not an advanced flag, but a safety net you can't ignore.

$ npm i -g oh-my-knowledge
CI passing · MIT · Node ≥ 22 · Same model, same cases, only the knowledge changes
omk — eval · knowledge-carrier evaluation lifecycle live

eval Controlled A/B: same model, same cases, only the knowledge changes — 95% CI on the composite-score lift
[Chart: lift axis from 0% to +30%, with a "Not significant (CI crosses 0)" region; point estimate +18.3%, 95% CI from +11.2 to +25.4]
v2 clearly beats v1: ship it · CI [+11.2, +25.4] · α = 0.81 · length-debiased
doctor Pre-ship checkup: 7 built-in dimensions scored independently (static rules + LLM audit, offline-capable)
Triggers: Good
Clarity: Good
Precision: Fair
Deps: Good
Tools: Good
Safety: Fair
Examples: Good
🩺 Health: Good · 5 healthy / 2 at risk · 3 suggestions
observe In-prod observation: parse Claude Code sessions to quantify failure rate, cost, and knowledge gaps
Failure rate: 4.2% (▼ −1.6pt vs last week)
P50 latency: 18.4s (▼ −2.1s)
Cost / session: $0.012 (▲ +$0.003)
Knowledge-gap signal: "missing environment probe upfront" recurs across 12 sessions — ranked #1 by severity weight ×12
Three evaluation capabilities

One pipeline across a skill's whole life

doctor, eval, and observe aren't three tools — they're one measurement discipline at three points in a skill's lifecycle, each answering a different question.

🩺

doctor pre-ship

Is the skill itself well written? 7 built-in dimensions are scored independently: static rules at zero cost, LLM audit for depth, plus endpoint-driven custom dimensions.

$ omk doctor my-skill --dimensions audit.yaml
  • Triggers / docs / instructions / deps / tools / safety / examples
  • --static-only: offline, no LLM calls
  • Custom endpoint dimensions: call an API for deep review
📊

eval on release

Is v2 really better than v1? A controlled A/B — same model, same cases, only the knowledge changes. Six dimensions scored independently, one-line verdict with a ship recommendation.

$ omk eval --control v1 --treatment v2
  • Bootstrap CI / length-debias / saturation curves on by default
  • Krippendorff α: judge↔human agreement on a gold set
  • Blind A/B · judge ensemble · multi-run variance
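
To make the "CI crosses 0" reading concrete, here is a minimal sketch of a percentile bootstrap over per-case score lifts. It is an illustration under stated assumptions, not omk's implementation: the function names, the resample count, and the sample lifts are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(lifts, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-case score lifts."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(lifts)
    # Resample the cases with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(lifts, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lo = means[int(tail * n_resamples)]
    hi = means[int((1.0 - tail) * n_resamples) - 1]
    return lo, hi


def verdict(lo, hi):
    """Read significance directly off the interval."""
    if lo > 0:
        return "treatment better"
    if hi < 0:
        return "control better"
    return "not significant (CI crosses 0)"


# Hypothetical lifts: treatment score minus control score on the same cases.
lifts = [12.0, 21.5, 18.0, 9.5, 25.0, 17.5, 22.0, 14.5, 19.0, 23.5]
lo, hi = bootstrap_ci(lifts)
print(f"mean lift {statistics.fmean(lifts):+.1f}, "
      f"95% CI [{lo:+.1f}, {hi:+.1f}] -> {verdict(lo, hi)}")
```

Because both versions run on the same cases, the sketch resamples paired differences rather than the two score lists separately, which keeps case difficulty from inflating the interval.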
🔭

observe in prod

How is it doing in production? Parse real session JSONL to measure each skill's failure rate, latency, and token cost, and surface severity-weighted knowledge-gap signals.

$ omk observe ~/.claude/sessions
  • Failure rate / latency / cost broken down per skill
  • Knowledge-gap detection: quantify risk exposure
  • Feeds production signal into sample / evolve iteration
Moat · measurement credibility

Rigor is the foundation, not an add-on

Five often-overlooked distortions decide whether a comparison is trustworthy. omk builds every defense into the foundation, so you don't have to enable them one by one.

01
A point estimate mistakes sampling noise for a real gain
Bootstrap confidence intervals built-in
Reports an interval, not a point — significance is read off directly.
02
A composite average hides a regression in a single dimension
Three-layer independent scoring · pass-all gate built-in
Fail any of fact / behavior / judge and it doesn't pass.
03
The control group reads the very carrier under test, so construct validity breaks down
strict-baseline isolation built-in
Closes three leaks: skill self-discovery, the Skill tool, and the cwd bypass.
04
Judges systematically prefer longer answers
Length-debiased judging built-in
Scoring removes the length covariate — verbosity no longer buys points.
05
The reliability of the judge's own scoring goes unmeasured
Krippendorff α · on when a gold set is provided
Anchored on human labels, it quantifies judge↔human agreement.
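
For readers unfamiliar with the statistic in item 05, the following sketch computes Krippendorff's α for the simplest case: nominal labels, exactly two raters (one judge, one human), and no missing values. These simplifications, the function name, and the sample labels are assumptions made for illustration; the page does not say which variant omk uses.

```python
from collections import Counter


def krippendorff_alpha_nominal(judge, human):
    """Krippendorff's alpha for two raters, nominal labels, complete data."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equally long, non-empty label lists")
    n_values = 2 * len(judge)  # every unit contributes two pairable values
    # Each disagreeing unit adds two ordered off-diagonal coincidences.
    disagreements = 2 * sum(j != h for j, h in zip(judge, human))
    observed = disagreements / n_values
    # Label totals across both raters (the coincidence-matrix marginals).
    counts = Counter(judge) + Counter(human)
    expected = (n_values**2 - sum(c * c for c in counts.values())) / (
        n_values * (n_values - 1)
    )
    if expected == 0:
        raise ValueError("only one label present; alpha is undefined")
    return 1.0 - observed / expected


# Hypothetical gold set: ten cases, the raters disagree on one of them.
judge = ["pass", "pass", "fail", "pass", "fail",
         "fail", "pass", "pass", "fail", "pass"]
human = ["pass", "pass", "fail", "pass", "fail",
         "pass", "pass", "pass", "fail", "pass"]
print(f"alpha = {krippendorff_alpha_nominal(judge, human):.3f}")
```

An α of 1 means perfect agreement and 0 means agreement no better than chance, so a single disagreement in ten cases already pulls the value noticeably below 1.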

Peer tools typically cover only one or two of these. omk's choice: build credibility into the foundation rather than leave it optional.

Tool comparison

A side-by-side under one set of criteria

Criteria come from common LLMOps selection axes (metric library / judge / CI / observability / collaboration) plus measurement validity & reliability — not rules tailored to omk. On several axes omk doesn't win, and we mark that honestly.

Capability · omk · promptfoo · DeepEval · LangSmith
Measurement credibility · validity / reliability
Statistical significance (CI / tests): Bootstrap
Judge ↔ human reliability (agreement): Krippendorff α
Evaluation bias control (length-debias): default
Evaluation capability
Assertion / metric library breadth: 30+
RAG-specific metrics: 3 · rich
LLM-as-judge
Engineering & collaboration
CI/CD integration (exit-code routing)
Onboarding speed / config simplicity: very fast
Experiment tracking / tracing: strong
Hosted SaaS dashboard / team collab
Ecosystem & integration
Community size (GitHub stars, 2026-04): omk nascent · promptfoo 9k+ · DeepEval 12k+ · LangSmith commercial
Native Claude Code skill
Legend: Native · Partial / needs config · Not supported

Full comparison (8 tools × 30+ dimensions, incl. RAGAS / OpenAI Evals / lm-eval-harness / inspect-ai) is in the comparison doc, current as of 2026-04 — spot something stale? send a PR. Takeaway: no silver bullet — omk's tradeoff is "statistical credibility by default"; want a SaaS dashboard, pick LangSmith; want academic benchmarks, pick lm-eval-harness.

$ omk install omk-agent-skill # one-time install
✓ Wrote to detected Claude Code / Codex
/omk eval # evaluate this project's artifact
/omk evolve # one shot: checkup → samples → self-iterate
/omk sample # generate or top up eval cases
> Compare v1 and v2 for me
↳ infer intent → omk eval --control v1 --treatment v2 …
Agent integration

Use it right inside your coding agent

One line — omk install omk-agent-skill — installs the official Agent Skill into the Claude Code / Codex it detects locally (--to all writes to all). After that, /omk works out of the box in Claude Code; in Codex and others, just run the omk CLI.

No commands to memorize — state your goal in plain words, and the agent locates the skill from context and picks the right command.

You
Make this skill more robust, then compare it against the last version

Before your next release, let the data speak first.

Run your first eval in 5 minutes

From a bare number to a conclusion that holds up

No files to touch — omk init scaffolds two skill versions and three cases, and omk eval produces an HTML report plus a one-line verdict in under 5 minutes.

$ omk init demo && cd demo && omk eval