Eval Pipeline — Close the Feedback Loop
The Problem
Eval infrastructure is 85% built (88 skill evals, 31 agent evals, 7 scripts, CI pipeline). But the feedback loop is broken — no single command to eval one skill after editing, no automated description improvement, no CI regression detection. api-design has 0% trigger precision and nobody noticed.
#1045 eval:skill → #1044 desc optimizer → #1046 CI gate → #1050 cost tracking
Change Manifest
[A] tests/evals/scripts/run-skill-eval.sh +80
[A] tests/evals/scripts/optimize-description.sh +150
[A] tests/evals/scripts/check-eval-regression.sh +60
[M] package.json +3
[M] .github/workflows/plugin-validation.yml +20
[M] tests/evals/scripts/run-trigger-eval.sh +15
[M] tests/evals/scripts/run-quality-eval.sh +15
3 new files · 4 modified · ~343 LOC net
Dependency Chain
Time ──────────────────────────────────────────────────────────►
1h 3h 2h 2h
#1045 ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
eval: ║
skill ▼
#1044 ░░░░░░░░████████████████████████░░░░░░░░░░░░░░░░░
desc ║
optim ▼
#1046 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░████████████████░
CI gate ║
▼
#1050 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
cost (can do anytime, optional)
████ active ░░░░ waiting ║▼ dependency
Why This Order
| Phase | Depends On | Reason |
|---|---|---|
| #1045 eval:skill | Nothing | Foundation — single command to eval one skill |
| #1044 desc optimizer | #1045 | Uses eval:skill --trigger-only in its iteration loop |
| #1046 CI gate | #1044 | Needs optimized baselines committed to compare against |
| #1050 cost tracking | None | Additive — can be done anytime, nice-to-have |
Phase 1: eval:skill wrapper
#1045 — Unified per-skill evaluation
~1h +80 LOC #1045
$ npm run eval:skill -- commit
EVAL: commit
┌───────────┬────────────┬──────────┐
│ Category  │ Score      │ Status   │
├───────────┼────────────┼──────────┤
│ Trigger   │ P:100 R:100│ PASS     │
│ Quality   │ 87%        │ PASS     │
│ Overall   │            │ PASS     │
└───────────┴────────────┴──────────┘
$ npm run eval:skill -- commit --trigger-only   # fast, 2min
$ npm run eval:skill -- commit --quality-only   # skip trigger
$ npm run eval:skill -- --all                   # all 88 skills
$ npm run eval:skill -- --all --dry-run         # validate only
Implementation: Bash script that calls run-trigger-eval.sh, then run-quality-eval.sh, for the specified skill, merges the results into unified JSON, and prints a summary table.
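A minimal sketch of how the wrapper's flag handling could look, assuming each runner accepts the skill name as its first argument (the runners' real interfaces are not shown here). The function name plan_eval is hypothetical. It only prints the runner commands it would chain, so it makes no Claude calls; --all, --dry-run, and the JSON merge step are left out.

```shell
#!/bin/sh
# Hypothetical sketch of the argument handling in run-skill-eval.sh.
# plan_eval prints the runner commands that would be chained, one per line,
# without executing them.
# Assumption: each runner takes the skill name as its first argument.

plan_eval() {
  skill=""
  run_trigger=1
  run_quality=1

  for arg in "$@"; do
    case "$arg" in
      --trigger-only) run_quality=0 ;;
      --quality-only) run_trigger=0 ;;
      --*) echo "unknown flag: $arg" >&2; return 2 ;;
      *) skill="$arg" ;;
    esac
  done

  if [ -z "$skill" ]; then
    echo "usage: plan_eval <skill> [--trigger-only|--quality-only]" >&2
    return 2
  fi

  # Passing both flags would leave nothing to run, so reject the combination.
  if [ "$run_trigger" -eq 0 ] && [ "$run_quality" -eq 0 ]; then
    echo "--trigger-only and --quality-only cannot be combined" >&2
    return 2
  fi

  if [ "$run_trigger" -eq 1 ]; then
    echo "run-trigger-eval.sh $skill"
  fi
  if [ "$run_quality" -eq 1 ]; then
    echo "run-quality-eval.sh $skill"
  fi
  return 0
}
```

Calling plan_eval commit --trigger-only prints only the trigger runner line. A real wrapper would execute the commands instead of echoing them and then merge the two result files.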
[A] tests/evals/scripts/run-skill-eval.sh +80
[M] package.json +1
Phase 2: Description Optimizer
#1044 — Automated description improvement
~3h +150 LOC #1044
$ npm run eval:optimize-desc -- assess
OPTIMIZE: assess
Eval prompts: 14 (8 train / 6 test)
Iter 1: P=100 R= 87 (baseline)
Iter 2: P=100 R= 87 (no change)
Iter 3: P=100 R=100 ← IMPROVED
Iter 4: P= 88 R=100 ← precision dropped, reverted
Best: Iter 3 (test set: P=100 R=100)
- Assesses and rates quality 0-10 with pros/cons analysis.
+ Comprehensive quality evaluation scoring 0-10 across 7
+ dimensions with weighted composite grades. Use when
+ evaluating code, designs, strategies, or approaches.
Apply to SKILL.md? [y/N]
Algorithm:
1. Load eval YAML → 60/40 train/test split
2. Baseline trigger eval on train set
3. Loop max 7x: feed failures to claude -p → "improve this description" → re-eval
4. Select best iteration by TEST set score (prevents overfitting)
5. Show diff → confirm before writing to SKILL.md
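Steps 1 and 4 of the algorithm can be sketched as below. Both function names are hypothetical. The split rounds the train share down, and the plan does not say how precision and recall combine into one test-set score, so ranking by their sum is an assumption. The claude -p iteration loop is not shown.

```shell
#!/bin/sh
# Hypothetical helpers for the description optimizer.

# split_counts TOTAL
# Prints "<train> <test>" for a 60/40 split, rounding the train share down.
split_counts() {
  total="$1"
  train=$(( total * 60 / 100 ))
  echo "$train $(( total - train ))"
}

# best_iteration
# Reads "iter precision recall" lines (test-set scores) on stdin and prints
# the iteration with the highest precision + recall.
# Assumption: the sum is the ranking metric; the earliest iteration wins ties.
best_iteration() {
  best_iter=""
  best_score=-1
  while read -r iter p r; do
    [ -n "$iter" ] || continue
    score=$(( p + r ))
    if [ "$score" -gt "$best_score" ]; then
      best_score="$score"
      best_iter="$iter"
    fi
  done
  echo "$best_iter"
}
```

With 14 eval prompts, split_counts 14 prints "8 6", which matches the 8 train / 6 test split in the sample run. Feeding the four iterations from that run into best_iteration selects iteration 3.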
Priority Targets:
| Skill | Precision | Recall | Problem |
|---|---|---|---|
| api-design | 0% | 0% | Description completely wrong for trigger |
| create-pr | 88% | 100% | 1 false positive to eliminate |
| assess | 100% | 87% | 1 recall miss |
Phase 3: CI Regression Gate
#1046 — Block PRs that regress eval scores
~2h +80 LOC #1046
# CI runs on PRs that change src/skills/
# No Claude API calls, just JSON diff
EVAL REGRESSION CHECK
─────────────────────
assess: P 100→100 R 87→100 ✅ IMPROVED
commit: P 100→100 R 100→100 ✅ STABLE
explore: P 100→ 80 R 100→100 ❌ REGRESSED
RESULT: FAILED (1 regression detected)
How it works:
1. Detect changed skills from git diff
2. Read committed baselines from tests/evals/results/
3. Compare with PR's result files (if updated)
4. Flag if precision OR recall dropped → exit 1
5. Zero API cost: pure JSON comparison
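The comparison in steps 3 and 4 could look roughly like this. It is a sketch that assumes the baseline and current scores have already been extracted from the result JSON as whole-number percentages; the extraction step and the result file schema are not shown. The function names classify and check_regressions are hypothetical.

```shell
#!/bin/sh
# Hypothetical core of check-eval-regression.sh.
# Assumption: scores are integer percentages already read from the JSON files.

# classify BASE_P BASE_R CUR_P CUR_R
# Prints REGRESSED if precision OR recall dropped, IMPROVED if neither
# dropped and at least one rose, STABLE otherwise.
classify() {
  base_p="$1"; base_r="$2"; cur_p="$3"; cur_r="$4"
  if [ "$cur_p" -lt "$base_p" ] || [ "$cur_r" -lt "$base_r" ]; then
    echo "REGRESSED"
  elif [ "$cur_p" -gt "$base_p" ] || [ "$cur_r" -gt "$base_r" ]; then
    echo "IMPROVED"
  else
    echo "STABLE"
  fi
}

# check_regressions
# Reads "skill base_p base_r cur_p cur_r" lines on stdin, prints one status
# line per skill, and returns 1 if any skill regressed.
check_regressions() {
  regressions=0
  while read -r skill base_p base_r cur_p cur_r; do
    [ -n "$skill" ] || continue
    status="$(classify "$base_p" "$base_r" "$cur_p" "$cur_r")"
    echo "$skill: P $base_p->$cur_p R $base_r->$cur_r $status"
    if [ "$status" = "REGRESSED" ]; then
      regressions=$(( regressions + 1 ))
    fi
  done
  if [ "$regressions" -gt 0 ]; then
    echo "RESULT: FAILED ($regressions regression(s) detected)"
    return 1
  fi
  echo "RESULT: PASSED"
}
```

Because the check fails on a drop in either metric, a change that raises precision but lowers recall still counts as a regression. The non-zero return status is what a CI step would use to block the PR.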
[A] tests/evals/scripts/check-eval-regression.sh +60
[M] .github/workflows/plugin-validation.yml +20
Phase 4: Cost Tracking (Optional)
#1050 — Token + cost visibility
~2h +30 LOC #1050
# Appended to eval results JSON:
"cost": {
  "input_tokens": 12450,
  "output_tokens": 3200,
  "estimated_cost_usd": 0.04
}
# Summary at end of eval:full:
COST SUMMARY
┌────────────────────────────┐
│ Skills evaluated: 20       │
│ Total tokens: 248K / 64K   │
│ Estimated cost: $0.82      │
│ Avg per skill: $0.04       │
└────────────────────────────┘
Additive changes to the existing runners: each parses the --output-format json output of claude -p for token counts. No behavior changes.
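One way estimated_cost_usd could be derived from the token counts is sketched below. The function name estimate_cost is hypothetical, and the per-million-token rates are parameters because the plan does not state which prices apply. Reading the token counts out of the claude -p JSON is not shown.

```shell
#!/bin/sh
# Hypothetical cost helper for the eval runners.

# estimate_cost INPUT_TOKENS OUTPUT_TOKENS IN_RATE OUT_RATE
# Rates are USD per million tokens, supplied by the caller (assumption:
# the real rates live in config or an environment variable).
# Prints the estimated cost in USD with two decimals.
estimate_cost() {
  awk -v in_tok="$1" -v out_tok="$2" -v in_rate="$3" -v out_rate="$4" \
    'BEGIN { printf "%.2f\n", (in_tok * in_rate + out_tok * out_rate) / 1000000 }'
}
```

Keeping the rates as arguments means a price change only touches the caller, which fits the goal of staying additive.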
Risk Dashboard
| Phase | Risk | Level | Mitigation |
|---|---|---|---|
| #1045 eval:skill | Script doesn't chain correctly | LOW | Thin wrapper, fallback: run two commands manually |
| #1044 desc optimizer | Produces worse descriptions | MED | Train/test split prevents overfitting. Manual confirm before apply. Git revert available. |
| #1046 CI gate | Gate too strict, blocks valid PRs | LOW | Only checks committed baselines. Override with --force or update baseline. |
| #1050 cost tracking | Token parsing breaks on API change | LOW | Additive only — cost field missing = no error, just no data. |
Pre-mortems
"What if the optimizer makes api-design WORSE?"
→ Train/test split catches it. Test set score must improve or change is rejected. Manual diff review + confirm step. Original in git.
"What if CI gate blocks every PR?"
→ Gate only fires for skills with committed baselines AND changed files. No baseline = no gate. Override available.
"What if eval:skill takes too long?"
→ --trigger-only flag runs in ~2min. --dry-run validates YAML without Claude calls. --quality-only skips trigger.
Before → After
NOW
Edit skill → hope it works
api-design trigger: 0%
No regression detection in CI
Unknown eval costs
Two manual commands per skill
M92: 54% (6/11 done)
AFTER
Edit skill → eval:skill → see score
api-design trigger: optimized
CI blocks score drops automatically
Cost per eval visible
One command: eval:skill
M92: 91% (10/11 done)
4 Issues Closed · 3 New Scripts · ~350 LOC Added · 91% M92 Target
Current Eval Scores — Worst Performers
Quality Eval Distribution (83 skills)
| Score band | Skills | Share |
|---|---|---|
| 100% | 12 | 14% |
| 80-99% | 15 | 18% |
| 60-79% | 28 | 34% |
| 40-59% | 21 | 25% |
| 20-39% | 4 | 5% |
| 0-19% | 3 | 4% |
Average: 68.5% · Below 70%: 40 skills · PR #1135 fixes assertion quality
Trigger Eval Results (8 skills tested)
| Skill | Precision | Recall | Status |
|---|---|---|---|
| commit | 100% | 100% | PASS |
| explore | 100% | 100% | PASS |
| review-pr | 100% | 100% | PASS |
| doctor | 100% | 100% | PASS |
| brainstorm | 100% | 100% | PASS |
| create-pr | 88.8% | 100% | 1 false positive |
| assess | 100% | 87.5% | 1 recall miss |
| api-design | 0% | 0% | BROKEN |
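For reference, the percentages in the table follow from confusion counts as sketched below. The function name precision_recall is hypothetical, and the counts in the usage note are assumptions chosen to be consistent with the rows above, since the raw counts are not listed. An undefined ratio (no triggers at all) is reported as 0.

```shell
#!/bin/sh
# Hypothetical helper: trigger precision and recall from raw counts.

# precision_recall TP FP FN
# TP = correct triggers, FP = false positives, FN = missed triggers.
# Prints "P=<pct> R=<pct>" with one decimal; a zero denominator yields 0.
precision_recall() {
  awk -v tp="$1" -v fp="$2" -v fn="$3" 'BEGIN {
    p = (tp + fp > 0) ? 100 * tp / (tp + fp) : 0
    r = (tp + fn > 0) ? 100 * tp / (tp + fn) : 0
    printf "P=%.1f R=%.1f\n", p, r
  }'
}
```

With 7 correct triggers and 1 miss, precision_recall 7 0 1 prints P=100.0 R=87.5, matching the assess row. With 8 correct triggers and 1 false positive, precision_recall 8 1 0 prints P=88.9 R=100.0; the 88.8% shown for create-pr would then be the same ratio truncated rather than rounded.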