Critical path: #1042 Trigger → #1043 Quality → #1045 Unified → #1046 CI Gate
Everything else is off the critical path.
Parallelizable: #1048, #1047, #1050
Execution Timeline
Day 1 to Day 5
#1041 Schema: done
#1042 Trigger: done
#1049 Docs: done
#1043 Quality
#1048 Seed 16
#1047 Hooks
#1045 Unified
#1050 Duration
#1046 CI Gate
#1044 Optimizer
Legend
■ Done (Sprint 1)
■ Wave A (parallel)
■ Wave B (integration)
■ Wave C (automation)
All Issues — M92
Eval Coverage Matrix — 16 User-Invocable Skills
5 have evals (Sprint 1). 11 need new YAML files (#1048).
Cross-Skill Confusion Pairs (#1048)
assess <─?─> review-pr: "assess this PR" vs "review this PR"
explore <─?─> fix-issue: "find auth code" vs "fix auth bug"
commit <─?─> create-pr: "save changes" vs "submit for review"
remember <─?─> memory: "save this decision" vs "recall past decisions"
implement <─?─> brainstorm: "build auth" vs "brainstorm auth approaches"
help <─?─> doctor: "how do I use X" vs "is X working correctly"
verify <─?─> review-pr: "check if this works" vs "review this PR"
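Each pair above implies two trigger cases: a prompt that should fire one skill and must not fire its neighbor. The following is a minimal Python sketch of that expansion, not the project's actual format (the real evals are YAML files under #1048). The `ConfusionPair` record, the `to_trigger_cases` helper, and the `expect`/`reject` keys are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionPair:
    # Hypothetical record: two skills and one prompt that belongs to each.
    skill_a: str
    prompt_a: str
    skill_b: str
    prompt_b: str


# Two of the seven pairs listed above; the rest follow the same shape.
PAIRS = [
    ConfusionPair("assess", "assess this PR", "review-pr", "review this PR"),
    ConfusionPair("commit", "save changes", "create-pr", "submit for review"),
]


def to_trigger_cases(pair):
    """Expand one pair into two cases: each prompt should trigger its own
    skill and must not trigger the other one."""
    return [
        {"prompt": pair.prompt_a, "expect": pair.skill_a, "reject": pair.skill_b},
        {"prompt": pair.prompt_b, "expect": pair.skill_b, "reject": pair.skill_a},
    ]


cases = [case for pair in PAIRS for case in to_trigger_cases(pair)]
```

Seven pairs would yield fourteen cases under this scheme, which gives the dry-run validation a fixed count to check against.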
Risk Assessment
Overall Risk: MEDIUM
Critical Path Length: 4 issues
Parallelizable: 5/7
Risk Register
HIGH: #1044 Desc Optimizer
Automated description rewriting is experimental, and using an LLM to grade LLM output adds run-to-run variance.
Mitigation: Train/test split (60/40). Max 7 iterations. Human confirmation before applying.
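A 60/40 split only guards against overfitting if the same cases land in the same half on every run. One way to get that is a seeded shuffle, sketched below in Python. The function name `split_train_test`, the fixed seed, and the placeholder case IDs are assumptions for illustration, not part of the planned optimizer.

```python
import random


def split_train_test(cases, train_fraction=0.6, seed=0):
    """Shuffle deterministically, then cut at the train fraction."""
    shuffled = list(cases)
    # A private Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


all_cases = [f"case-{i}" for i in range(10)]  # placeholder IDs
train, test = split_train_test(all_cases)
```

With ten cases this yields six for training and four held out; the optimizer would only ever see the first list.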
MEDIUM: #1043 Quality Runner
A/B comparison needs reliable grading. LLM-as-judge has known bias toward longer outputs.
Mitigation: Flag non-discriminating assertions. Use structured grading prompts.
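An assertion is non-discriminating when it passes both with the skill and in the baseline run, so it says nothing about the skill's contribution. A minimal Python sketch of the flagging step follows. The result format (assertion name mapped to a pass/fail boolean) and the assertion names themselves are assumptions, since the runner's real output schema is not specified here.

```python
def non_discriminating(with_skill, baseline):
    """Return the names of assertions that pass in both runs.

    Both arguments map assertion name -> bool (passed or not).
    An assertion missing from the baseline counts as a baseline failure.
    """
    return sorted(
        name
        for name, passed in with_skill.items()
        if passed and baseline.get(name, False)
    )


# Hypothetical results for one eval case.
with_skill = {"has_summary": True, "cites_files": True, "under_500_words": True}
baseline = {"has_summary": True, "cites_files": False, "under_500_words": True}

flagged = non_discriminating(with_skill, baseline)
```

Here only `cites_files` separates the two runs; the other two assertions would be flagged for rewrite or removal.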
MEDIUM: #1046 CI Gate
The Tier 2 smoke test needs an API key in CI, and its cost must be controlled.
Mitigation: Tier 1 (static) is free and blocks merge. Tier 2 stays behind feature flag.
LOW: #1048 Seed 16 Evals
Pure YAML authoring. The main risk is that the cross-skill confusion pairs do not cover enough cases.
Mitigation: 7 confusion pairs identified. Review with dry-run validation.
Pre-Mortems
Q: "What if quality grading is unreliable?"
A: Start with deterministic checks (output length, format markers).
LLM grading is a bonus, not the gate. Flag "non-discriminating"
assertions that pass both with-skill and baseline.
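The deterministic checks named in this answer (output length, format markers) can run without any model call. Below is a minimal Python sketch. The 200-character threshold and the two heading markers are placeholder assumptions, as is the function name `check_output`; real values would come from each skill's eval file.

```python
import re


def check_output(text, min_chars=200, required_markers=("## Summary", "## Findings")):
    """Run cheap, repeatable checks; return a dict of check name -> bool."""
    results = {"min_length": len(text) >= min_chars}
    for marker in required_markers:
        # The marker must occupy its own line, so a passing mention
        # inside a sentence does not count as a format marker.
        pattern = r"^" + re.escape(marker) + r"\s*$"
        results[marker] = re.search(pattern, text, re.MULTILINE) is not None
    return results


good = "## Summary\n" + "x" * 250 + "\n## Findings\nnone\n"
inline = "See the ## Summary section. " + "x" * 300

good_result = check_output(good)
inline_result = check_output(inline)
```

Because these checks give the same answer on every run, they can act as the gate while LLM grading is reported alongside.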
Q: "What if description optimizer makes descriptions worse?"
A: 60/40 train/test split. If TEST score drops, roll back automatically.
Human must confirm before applying. Git diff shown for review.
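The rollback rule can be sketched as a bounded loop: improve on the train score for at most 7 iterations, then keep the result only if the held-out test score did not drop. This Python sketch assumes hypothetical callables `propose`, `score_train`, and `score_test`; in practice these would wrap the LLM rewrite and the eval runs, and the human confirmation step would come after this function returns.

```python
def optimize_description(current, propose, score_train, score_test, max_iterations=7):
    """Hill-climb on the train score, then keep the result only if the
    held-out test score did not drop. Returns (description, accepted)."""
    best, best_train = current, score_train(current)
    for _ in range(max_iterations):
        candidate = propose(best)
        candidate_train = score_train(candidate)
        if candidate_train > best_train:
            best, best_train = candidate, candidate_train
    if best != current and score_test(best) < score_test(current):
        return current, False  # roll back: the train gain did not generalize
    return best, best != current


# Toy stand-ins: each proposal appends "!" and the train score is the length.
def add_bang(description):
    return description + "!"


kept = optimize_description("desc", add_bang, len, len)
rolled_back = optimize_description("desc", add_bang, len, lambda d: -len(d))
```

In the second call the test score falls as the description grows, so the original text is returned and nothing is offered for confirmation.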
Q: "What if CI eval adds too much latency?"
A: Tier 1 is static (5s). Tier 2 only runs on changed skills (~60s).
Tier 3 stays local-only. No full eval in CI.
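The tiering described in this answer reduces to a small selection rule. The Python sketch below is one way to express it. The step names `tier1-static` and `tier2-smoke`, the `tier2_enabled` flag, and the function name are hypothetical, not identifiers from the actual CI configuration.

```python
def plan_ci_tiers(changed_skills, tier2_enabled):
    """Return the list of (tier, targets) steps to run in CI.

    Tier 1 always runs over everything. Tier 2 runs only for changed
    skills and only when the feature flag is on. Tier 3 never runs in CI.
    """
    steps = [("tier1-static", "all")]
    if tier2_enabled and changed_skills:
        steps.append(("tier2-smoke", sorted(changed_skills)))
    return steps


flag_off = plan_ci_tiers({"commit", "assess"}, tier2_enabled=False)
flag_on = plan_ci_tiers({"commit", "assess"}, tier2_enabled=True)
no_changes = plan_ci_tiers(set(), tier2_enabled=True)
```

Keeping Tier 2 scoped to changed skills is what bounds both the latency and the API spend per pull request.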