๐ง Skill-Fitness Workflow
ork's FIRST shipped dynamic workflow ยท template-in-skill (bare-eval/workflows/) ยท closes the integration gap ยท 2026-06-05
5/5
Skills scored (dogfood)
48s
Wall-clock (isolated ctx)
This is the first concrete step of the "improve to perfection" flywheel: measurement + orchestration as one build.
It's the workflow-backed complement to the static conformance grader โ and the first time ork ships a dynamic
workflow (Thariq's template-in-skill pattern, the way /deep-research does), closing the G1/G3/G4 gaps.
The harness
Workflow({ scriptPath: "${CLAUDE_SKILL_DIR}/workflows/skill-fitness.mjs", args: [skills] })
phase Score: ๐ค fan-out โ one ISOLATED-CONTEXT agent per skill
each reads its SKILL.md โ scores freshness / clarity / structure 0-10 + top_issue
phase Synthesize: ๐ rank weakest-first, flag below-bar
static-first: run conformance-check.mjs (0 tokens) to pre-filter, then this for judgment grep can't make
Live dogfood โ found what static analysis can't
| Skill | F | C | S | Top issue (qualitative) |
| assess | 8 | 8 | 9 | desc "six dimensions" vs body "7 dimensions" drift + "Opus 4.8 do" grammar break |
| commit | 8 | 8 | 9 | duplicate ## Rules H2 + "9 rules across 7 categories" miscount |
| explore | 8 | 8 | 9 | model:sonnet frontmatter vs body's Opus-4.8/xhigh gating + dup "Phase 6" |
| verify | 8 | 9 | 8 | โ ๏ธ hardcoded /Users/... path (breaks on other installs) + version drift 4.3.0 vs 4.2.0 |
| doctor | 9 | 9 | 9 | line 167 "non-4.7 model" should be "non-Opus-4.8" (freshness drift) |
mean 25.4/30 ยท 266k tokens ยท none of these were catchable by the static grader (model-IDs + hook-counts only). The 5 findings are this workflow's first backlog โ a fast follow.
Static grader vs this workflow โ complementary, not redundant
conformance-check.mjs (static): 0 tokens, deterministic, CI-gateable โ catches mechanical drift (stale model IDs, hook counts).
skill-fitness (workflow): ~50k tokens/skill, isolated-context judgment โ catches description/router clarity, internal-count mismatches, duplicate headings, install-portability, version drift.
Run static first to pre-filter, workflow only where judgment is needed.
Why isolated-context (Thariq's failure modes)
Each skill graded in its OWN context window โ structurally immune to the self-preferential bias / agentic laziness / goal
drift that hit my sequential single-context sweeps earlier this session. And the irony that proves it works: the harness's
own first draft had the exact hardcoded-/Users/-path bug it flagged in verify โ caught and fixed before shipping.