Eval Pipeline — Implementation Plan

4 phases · ~8h · ~350 LOC · M92 54%→91%

Eval Pipeline — Close the Feedback Loop

The Problem

Eval infrastructure is 85% built (88 skill evals, 31 agent evals, 7 scripts, CI pipeline). But the feedback loop is broken — no single command to eval one skill after editing, no automated description improvement, no CI regression detection. api-design has 0% trigger precision and nobody noticed.

#1045 eval:skill
#1044 desc optimizer
#1046 CI gate
#1050 cost tracking

Change Manifest

[A] tests/evals/scripts/run-skill-eval.sh +80
[A] tests/evals/scripts/optimize-description.sh +150
[A] tests/evals/scripts/check-eval-regression.sh +60
[M] package.json +3
[M] .github/workflows/plugin-validation.yml +20
[M] tests/evals/scripts/run-trigger-eval.sh +15
[M] tests/evals/scripts/run-quality-eval.sh +15
3 new files · 4 modified · ~343 LOC net

Dependency Chain

Time ──────────────────────────────────────────────────────────►
      1h      3h                      2h              2h
#1045 ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ eval:skill
#1044 ░░░░░░░░████████████████████████░░░░░░░░░░░░░░░░░ desc optim
#1046 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░████████████████ CI gate
#1050 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ cost (can do anytime, optional)

████ active   ░░░░ waiting   ║▼ dependency

Why This Order

Phase               | Depends On | Reason
#1045 eval:skill    | Nothing    | Foundation: single command to eval one skill
#1044 desc optimizer | #1045     | Uses eval:skill --trigger-only in its iteration loop
#1046 CI gate       | #1044      | Needs optimized baselines committed to compare against
#1050 cost tracking | Nothing    | Additive: can be done anytime, nice-to-have

Phase 1: eval:skill wrapper

#1045 — Unified per-skill evaluation
~1h +80 LOC #1045
$ npm run eval:skill -- commit

  EVAL: commit
  ┌───────────┬────────────┬──────────┐
  │ Category  │ Score      │ Status   │
  ├───────────┼────────────┼──────────┤
  │ Trigger   │ P:100 R:100│ PASS     │
  │ Quality   │ 87%        │ PASS     │
  │ Overall   │            │ PASS     │
  └───────────┴────────────┴──────────┘

$ npm run eval:skill -- commit --trigger-only   # fast, 2min
$ npm run eval:skill -- commit --quality-only   # skip trigger
$ npm run eval:skill -- --all                   # all 88 skills
$ npm run eval:skill -- --all --dry-run         # validate only
Implementation: Bash script that calls run-trigger-eval.sh then run-quality-eval.sh for the specified skill, merges results into unified JSON, prints summary table.
[A] tests/evals/scripts/run-skill-eval.sh +80
[M] package.json +1
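
The flag handling shown above can be sketched as follows. This is a minimal illustration, not the actual run-skill-eval.sh: plan_stages is a hypothetical helper, and the rule that --dry-run replaces both stages with a single validate step is an assumption based on the "validate only" comment. The plan does not show how the two existing runners are invoked, so the sketch stops at deciding which stages would run.

```shell
#!/bin/sh
# Sketch only: plan_stages is a hypothetical helper, not part of the repo.
# It prints "<skill>: <stages>" for a given eval:skill argument list.
plan_stages() (
  skill=""
  trigger=1
  quality=1
  dry_run=0
  for arg in "$@"; do
    case "$arg" in
      --trigger-only) quality=0 ;;
      --quality-only) trigger=0 ;;
      --dry-run)      dry_run=1 ;;
      --all)          skill="ALL" ;;
      -*)             echo "unknown flag: $arg" >&2; exit 2 ;;
      *)              skill="$arg" ;;
    esac
  done

  if [ -z "$skill" ]; then
    echo "usage: eval:skill <skill>|--all [flags]" >&2
    exit 2
  fi
  if [ "$trigger" -eq 0 ] && [ "$quality" -eq 0 ]; then
    echo "--trigger-only and --quality-only cannot be combined" >&2
    exit 2
  fi

  if [ "$dry_run" -eq 1 ]; then
    stages="validate"            # assumption: dry run skips both runners
  else
    stages=""
    if [ "$trigger" -eq 1 ]; then stages="trigger"; fi
    if [ "$quality" -eq 1 ]; then stages="${stages:+$stages }quality"; fi
  fi
  echo "$skill: $stages"
)

plan_stages commit --trigger-only   # prints "commit: trigger"
```

Keeping the decision in one function makes the flag combinations easy to check before any Claude calls are made.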

Phase 2: Description Optimizer

#1044 — Automated description improvement
~3h +150 LOC #1044
$ npm run eval:optimize-desc -- assess

  OPTIMIZE: assess
  Eval prompts: 14 (8 train / 6 test)

  Iter 1: P=100  R= 87  (baseline)
  Iter 2: P=100  R= 87  (no change)
  Iter 3: P=100  R=100  ← IMPROVED
  Iter 4: P= 88  R=100  ← precision dropped, reverted

  Best: Iter 3 (test set: P=100 R=100)

  - Assesses and rates quality 0-10 with pros/cons analysis.
  + Comprehensive quality evaluation scoring 0-10 across 7
  + dimensions with weighted composite grades. Use when
  + evaluating code, designs, strategies, or approaches.

  Apply to SKILL.md? [y/N]
Algorithm:
1. Load eval YAML → 60/40 train/test split
2. Baseline trigger eval on train set
3. Loop max 7x: feed failures to claude -p → "improve this description" → re-eval
4. Select best iteration by TEST set score (prevents overfitting)
5. Show diff → confirm before writing to SKILL.md
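
Steps 1 and 4 of the algorithm can be sketched as below. The helper names, the "iter precision recall" input format, and the use of precision + recall as the ranking score are all assumptions made for illustration; the plan only says the best iteration is chosen by test-set score.

```shell
#!/bin/sh
# Sketch only: helper names and the input format are hypothetical.

# Size of the train set for a 60/40 split, rounded down.
train_count() {
  echo $(( $1 * 60 / 100 ))
}

# Read "iter precision recall" lines (test-set scores) on stdin and print
# the iteration with the highest precision + recall. Ties keep the
# earliest iteration, so a later change must strictly improve to win.
best_iteration() {
  awk '
    NR == 1 || $2 + $3 > best { best = $2 + $3; iter = $1 }
    END { print iter }
  '
}

train_count 14   # prints 8, leaving 6 prompts for the test set

# The sample run above, reused purely as example input:
printf '1 100 87\n2 100 87\n3 100 100\n4 88 100\n' | best_iteration   # prints 3
```

Ranking on the held-out test set rather than the train set is what keeps a description tuned to the training prompts from being selected.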
Priority Targets:
Skill      | Precision | Recall | Problem
api-design | 0%        | 0%     | Description completely wrong for trigger
create-pr  | 88%       | 100%   | 1 false positive to eliminate
assess     | 100%      | 87%    | 1 recall miss

Phase 3: CI Regression Gate

#1046 — Block PRs that regress eval scores
~2h +80 LOC #1046
# CI runs on PRs that change src/skills/
# No Claude API calls — just JSON diff

  EVAL REGRESSION CHECK
  ─────────────────────
  assess:   P 100→100  R  87→100  ✅ IMPROVED
  commit:   P 100→100  R 100→100  ✅ STABLE
  explore:  P 100→ 80  R 100→100  ❌ REGRESSED

  RESULT: FAILED (1 regression detected)
How it works:
1. Detect changed skills from git diff
2. Read committed baselines from tests/evals/results/
3. Compare with PR's result files (if updated)
4. Flag if precision OR recall dropped → exit 1
5. Zero API cost — pure JSON comparison
[A] tests/evals/scripts/check-eval-regression.sh +60
[M] .github/workflows/plugin-validation.yml +20
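
Steps 1 and 4 of the gate can be sketched as below. The function names and the src/skills/<name>/... path layout are assumptions, and scores are treated as integer percentages for simplicity; how the baseline JSON files are structured is not shown in this plan, so reading them is left out.

```shell
#!/bin/sh
# Sketch only: function names and the directory layout are hypothetical.

# Read changed file paths on stdin (for example from `git diff --name-only`)
# and print each affected skill name once.
changed_skills() {
  sed -n 's|^src/skills/\([^/]*\)/.*|\1|p' | sort -u
}

# classify OLD_P NEW_P OLD_R NEW_R  (integer percentages)
# A drop in either metric is a regression, even if the other improved.
classify() {
  if [ "$2" -lt "$1" ] || [ "$4" -lt "$3" ]; then
    echo "REGRESSED"
  elif [ "$2" -gt "$1" ] || [ "$4" -gt "$3" ]; then
    echo "IMPROVED"
  else
    echo "STABLE"
  fi
}

classify 100 100 87 100    # prints IMPROVED  (assess in the sample above)
classify 100 80 100 100    # prints REGRESSED (explore in the sample above)
```

A CI wrapper would exit 1 as soon as any changed skill classifies as REGRESSED, which needs no API calls.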

Phase 4: Cost Tracking (Optional)

#1050 — Token + cost visibility
~2h +30 LOC #1050
# Appended to eval results JSON:
"cost": {
  "input_tokens": 12450,
  "output_tokens": 3200,
  "estimated_cost_usd": 0.04
}

# Summary at end of eval:full:
  COST SUMMARY
  ┌────────────────────────────┐
  │ Skills evaluated:  20      │
  │ Total tokens: 248K / 64K   │
  │ Estimated cost: $0.82      │
  │ Avg per skill:  $0.04      │
  └────────────────────────────┘
Additive changes to the existing runners: parse the JSON emitted by claude -p --output-format json for token counts. No behavior changes.
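
The arithmetic behind the summary box can be sketched as below. cost_summary is a hypothetical helper that takes already-aggregated totals as arguments; how those totals are extracted from the runner output is not specified here and is not sketched.

```shell
#!/bin/sh
# Sketch only: cost_summary is a hypothetical helper.
# Usage: cost_summary SKILLS INPUT_TOKENS OUTPUT_TOKENS TOTAL_USD
cost_summary() {
  if [ "$1" -le 0 ]; then
    echo "no skills evaluated" >&2
    return 1
  fi
  awk -v n="$1" -v in_tok="$2" -v out_tok="$3" -v usd="$4" 'BEGIN {
    printf "Skills evaluated: %d\n", n
    printf "Total tokens: %dK / %dK\n", in_tok / 1000, out_tok / 1000
    printf "Estimated cost: $%.2f\n", usd
    printf "Avg per skill: $%.2f\n", usd / n
  }'
}

# Totals from the sample summary above: 20 skills, 248K in, 64K out, $0.82.
cost_summary 20 248000 64000 0.82
```

With these inputs the average works out to $0.82 / 20 = $0.041, which rounds to the $0.04 shown in the box.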

Risk Dashboard

Phase | Risk | Level | Mitigation
#1045 eval:skill | Script doesn't chain correctly | LOW | Thin wrapper; fallback: run the two commands manually
#1044 desc optimizer | Produces worse descriptions | MED | Train/test split prevents overfitting. Manual confirm before apply. Git revert available.
#1046 CI gate | Gate too strict, blocks valid PRs | LOW | Only checks committed baselines. Override with --force or update baseline.
#1050 cost tracking | Token parsing breaks on API change | LOW | Additive only: a missing cost field causes no error, just no data.

Pre-mortems

"What if the optimizer makes api-design WORSE?"
→ Train/test split catches it. Test set score must improve or change is rejected. Manual diff review + confirm step. Original in git.

"What if CI gate blocks every PR?"
→ Gate only fires for skills with committed baselines AND changed files. No baseline = no gate. Override available.

"What if eval:skill takes too long?"
→ --trigger-only flag runs in ~2min. --dry-run validates YAML without Claude calls. --quality-only skips trigger.

Before → After

NOW

Edit skill → hope it works
api-design trigger: 0%
No regression detection in CI
Unknown eval costs
Two manual commands per skill
M92: 54% (6/11 done)

AFTER

Edit skill → eval:skill → see score
api-design trigger: optimized
CI blocks score drops automatically
Cost per eval visible
One command: eval:skill
M92: 91% (10/11 done)

4 Issues Closed · 3 New Scripts · ~350 LOC Added · 91% M92 Target

Current Eval Scores — Worst Performers

Quality Eval Distribution (83 skills)

Score range | Skills | Share
100%        | 12     | 14%
80-99%      | 15     | 18%
60-79%      | 28     | 34%
40-59%      | 21     | 25%
20-39%      | 4      | 5%
0-19%       | 3      | 4%
Average: 68.5% · Below 70%: 40 skills · PR #1135 fixes assertion quality

Trigger Eval Results (8 skills tested)

Skill      | Precision | Recall | Status
commit     | 100%      | 100%   | PASS
explore    | 100%      | 100%   | PASS
review-pr  | 100%      | 100%   | PASS
doctor     | 100%      | 100%   | PASS
brainstorm | 100%      | 100%   | PASS
create-pr  | 88.8%     | 100%   | 1 false positive
assess     | 100%      | 87.5%  | 1 recall miss
api-design | 0%        | 0%     | BROKEN
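
For reference, the two partial scores follow from the usual definitions: precision is true positives over all triggered prompts, and recall is true positives over all prompts that should have triggered. The counts below are assumed for illustration (the smallest counts consistent with the reported figures), not read from the result files, and pct is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: pct is a hypothetical helper, not part of the eval scripts.
# Prints correct / (correct + wrong) as a percentage truncated to one decimal.
pct() (
  correct=$1
  total=$(( $1 + $2 ))
  if [ "$total" -eq 0 ]; then
    echo "n/a"                 # ratio is undefined with no samples
    exit 0
  fi
  tenths=$(( correct * 1000 / total ))
  echo "$(( tenths / 10 )).$(( tenths % 10 ))%"
)

# Illustrative counts only (assumed):
pct 8 1   # precision with 8 true positives, 1 false positive: prints 88.8%
pct 7 1   # recall with 7 true positives, 1 false negative:    prints 87.5%
```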

Implement eval pipeline:
Phase 1 (#1045): create tests/evals/scripts/run-skill-eval.sh that chains trigger + quality runners per skill.
Phase 2 (#1044): create optimize-description.sh with train/test split iteration loop.
Phase 3 (#1046): create check-eval-regression.sh for CI JSON diff gate.