Sprint 2 — Skill Eval Pipeline

Milestone M92 · 7 issues remaining · 3 waves
Completed: 3/10
Remaining: 7
Est. LOC: ~1400
Waves: 3
New Eval YAMLs: 11
Sprint 1: 30%

Wave Breakdown

Wave A — Foundation (parallel)

#1043 Quality evaluation runner (A/B comparison)
Wave A · ~250 LOC · Medium risk · Critical path
#1048 Seed evals for all 16 user-invocable skills
Wave A · ~550 LOC (YAML) · Low risk
#1047 Hook interaction testing
Wave A · ~50 LOC · Low risk · Absorbed into #1043

Wave B — Integration (after A)

#1045 Unified npm run eval:skill command
Wave B · ~80 LOC · Low risk
#1050 Duration tracking + cost reports
Wave B · ~100 LOC · Low risk

Wave C — Automation (after B)

#1046 CI gate for skill eval regression
Wave C · ~200 LOC · Medium risk
#1044 Description optimization loop
Wave C · ~200 LOC · High risk
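
As a quick cross-check of the "~1400 Est. LOC" headline, the per-issue estimates above can be summed by wave. This is only an arithmetic sketch: the tuple layout and the `loc_by_wave` helper are illustrative names, not part of the pipeline, and it counts the ~50 LOC for #1047 separately even though that issue is absorbed into #1043.

```python
# Estimates copied from the wave breakdown: (issue, wave, estimated LOC).
ESTIMATES = [
    ("#1043", "A", 250),
    ("#1048", "A", 550),
    ("#1047", "A", 50),   # absorbed into #1043, counted separately here
    ("#1045", "B", 80),
    ("#1050", "B", 100),
    ("#1046", "C", 200),
    ("#1044", "C", 200),
]


def loc_by_wave(estimates):
    """Sum the estimated LOC for each wave (hypothetical helper)."""
    totals = {}
    for _issue, wave, loc in estimates:
        totals[wave] = totals.get(wave, 0) + loc
    return totals


totals = loc_by_wave(ESTIMATES)
grand_total = sum(totals.values())
print(totals)       # {'A': 850, 'B': 180, 'C': 400}
print(grand_total)  # 1430
```

The sum of 1430 rounds to the ~1400 shown in the summary, with Wave A carrying most of the volume because of the YAML authoring in #1048.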

Dependency Graph

Sprint 1 (MERGED)        Sprint 2A (Wave A)     Sprint 2B-C
══════════════════       ══════════════════     ══════════════════
#1041 Schema ─────┐
#1042 Trigger ────┼────▶ #1043 Quality ──┐
#1049 Docs ───────┘                      │
                                         ├────▶ #1045 Unified CLI
                                         │        │
                                         │        ▼
                         #1048 Seed 16 ──┘      #1050 Duration (independent)
                                                  ├────▶ #1046 CI Gate
                                                  └────▶ #1044 Desc Optimizer
#1047 Hook Testing ────▶ absorbed into #1043 (stderr hook check)

Critical Path

#1042 Trigger ──▶ #1043 Quality ──▶ #1045 Unified ──▶ #1046 CI Gate
Everything else is off the critical path.
Parallelizable: #1048, #1047, #1050

Execution Timeline

Day 1 · Day 2 · Day 3 · Day 4 · Day 5 (Gantt chart; per-day bar positions not preserved)
#1041 Schema: done
#1042 Trigger: done
#1049 Docs: done
#1043 Quality
#1048 Seed 16
#1047 Hooks
#1045 Unified
#1050 Duration
#1046 CI Gate
#1044 Optimizer

Legend

Done (Sprint 1) · Wave A (parallel) · Wave B (integration) · Wave C (automation)

All Issues — M92

Eval Coverage Matrix — 16 User-Invocable Skills

5 have evals (Sprint 1). 11 need new YAML files (#1048).

Cross-Skill Confusion Pairs (#1048)

assess    <─?─> review-pr    "assess this PR" vs "review this PR"
explore   <─?─> fix-issue    "find auth code" vs "fix auth bug"
commit    <─?─> create-pr    "save changes" vs "submit for review"
remember  <─?─> memory       "save this decision" vs "recall past decisions"
implement <─?─> brainstorm   "build auth" vs "brainstorm auth approaches"
help      <─?─> doctor       "how do I use X" vs "is X working correctly"
verify    <─?─> review-pr    "check if this works" vs "review this PR"
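
One way to turn the pairs above into trigger eval cases is to expand each pair into two cases, where each prompt is expected to select its own skill and not the confusable one. The sketch below assumes a hypothetical case shape (`prompt`, `expect`, `reject`); the real eval YAML schema from #1041 is not shown in this document, so the field names are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionPair:
    """Two skills that similar-sounding prompts could be routed to."""
    skill_a: str
    prompt_a: str
    skill_b: str
    prompt_b: str


# Pairs and prompts copied from the list above.
PAIRS = [
    ConfusionPair("assess", "assess this PR", "review-pr", "review this PR"),
    ConfusionPair("explore", "find auth code", "fix-issue", "fix auth bug"),
    ConfusionPair("commit", "save changes", "create-pr", "submit for review"),
    ConfusionPair("remember", "save this decision", "memory", "recall past decisions"),
    ConfusionPair("implement", "build auth", "brainstorm", "brainstorm auth approaches"),
    ConfusionPair("help", "how do I use X", "doctor", "is X working correctly"),
    ConfusionPair("verify", "check if this works", "review-pr", "review this PR"),
]


def to_trigger_cases(pairs):
    """Expand each pair into two cases: expected skill plus the skill to reject.

    The dict keys are hypothetical, not the project's actual schema.
    """
    cases = []
    for p in pairs:
        cases.append({"prompt": p.prompt_a, "expect": p.skill_a, "reject": p.skill_b})
        cases.append({"prompt": p.prompt_b, "expect": p.skill_b, "reject": p.skill_a})
    return cases


cases = to_trigger_cases(PAIRS)
print(len(cases))  # 14
print(cases[0])    # {'prompt': 'assess this PR', 'expect': 'assess', 'reject': 'review-pr'}
```

Note that "review this PR" appears in two pairs (against assess and against verify), so a deduplication step may be worth adding before writing the cases to YAML.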

Risk Assessment

Overall Risk: MEDIUM
Critical Path Length: 4 issues
Parallelizable: 5/7

Risk Register

HIGH: #1044 Desc Optimizer
Automated description rewriting is experimental, and using an LLM to grade LLM output adds variance.
Mitigation: Train/test split (60/40). Max 7 iterations. Human confirm before applying.
MEDIUM: #1043 Quality Runner
A/B comparison needs reliable grading. LLM-as-judge has known bias toward longer outputs.
Mitigation: Flag non-discriminating assertions. Use structured grading prompts.
MEDIUM: #1046 CI Gate
The Tier 2 smoke test needs an API key in CI, so its cost has to be controlled.
Mitigation: Tier 1 (static) is free and blocks merge. Tier 2 stays behind feature flag.
LOW: #1048 Seed 16 Evals
Pure YAML authoring. The main risk is that the cross-skill confusion pairs do not cover enough cases.
Mitigation: 7 confusion pairs identified. Review with dry-run validation.
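
The "flag non-discriminating assertions" mitigation for #1043 can be sketched as a comparison of per-assertion results from the two arms of the A/B run. This is a minimal illustration, assuming each arm reports a mapping from assertion name to pass/fail; the function name and the example assertion names are hypothetical.

```python
def non_discriminating(with_skill, baseline):
    """Return assertions that pass in BOTH runs.

    Such assertions do not show that the skill changed anything,
    so they are candidates for rewriting or removal.
    """
    return sorted(
        name
        for name, passed in with_skill.items()
        if passed and baseline.get(name, False)
    )


# Hypothetical results for one eval case: assertion name -> passed?
with_skill = {
    "has_summary_heading": True,
    "cites_file_paths": True,
    "under_500_words": True,
}
baseline = {
    "has_summary_heading": False,
    "cites_file_paths": True,
    "under_500_words": True,
}

print(non_discriminating(with_skill, baseline))
# ['cites_file_paths', 'under_500_words']
```

Only `has_summary_heading` separates the two arms here; the other two assertions would be flagged for review.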

Pre-Mortems

Q: "What if quality grading is unreliable?"
A: Start with deterministic checks (output length, format markers). LLM grading is a bonus, not the gate. Flag "non-discriminating" assertions that pass both with-skill and baseline.

Q: "What if description optimizer makes descriptions worse?"
A: 60/40 train/test split. If TEST score drops, roll back automatically. Human must confirm before applying. Git diff shown for review.

Q: "What if CI eval adds too much latency?"
A: Tier 1 is static (5s). Tier 2 only runs on changed skills (~60s). Tier 3 stays local-only. No full eval in CI.
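
The optimizer safeguards described above (60/40 split, at most 7 iterations, rollback when the test score drops) could be wired together roughly as follows. This is a sketch under stated assumptions, not the #1044 design: `propose` and `score` are hypothetical callables standing in for the LLM rewrite step and the eval runner, and the toy versions below exist only to make the example runnable. Human confirmation and the git diff would happen after `optimize` returns.

```python
import random

MAX_ITERATIONS = 7   # cap taken from the risk register
TRAIN_PERCENT = 60   # 60/40 train/test split


def split_cases(cases, seed=0):
    """Shuffle deterministically, then split 60/40 into (train, test)."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * TRAIN_PERCENT // 100
    return shuffled[:cut], shuffled[cut:]


def optimize(description, cases, propose, score, seed=0):
    """Keep a candidate only if train improves and test does not drop."""
    train, test = split_cases(cases, seed)
    best = description
    best_train, best_test = score(best, train), score(best, test)
    for _ in range(MAX_ITERATIONS):
        candidate = propose(best, train)
        cand_train, cand_test = score(candidate, train), score(candidate, test)
        if cand_test < best_test or cand_train <= best_train:
            continue  # roll back: discard the candidate, keep the previous best
        best, best_train, best_test = candidate, cand_train, cand_test
    return best  # caller shows a diff and asks a human before applying


# Toy stand-ins (assumptions): a case is a keyword the description should mention.
def toy_score(description, cases):
    if not cases:
        return 0.0
    return sum(kw in description for kw in cases) / len(cases)


def toy_propose(description, train):
    for kw in train:
        if kw not in description:
            return description + " " + kw
    return description


keywords = ["commit", "save", "stage", "push", "diff"]
final = optimize("Create a git commit", keywords, toy_propose, toy_score)
print(final)
```

Because candidates are proposed from the training cases only, the held-out 40% acts as the regression check that triggers the rollback.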