๐Ÿง  Skill-Fitness Workflow

ork's FIRST shipped dynamic workflow ยท template-in-skill (bare-eval/workflows/) ยท closes the integration gap ยท 2026-06-05

1st
Workflow ork ships
5/5
Skills scored (dogfood)
5
Real issues found
48s
Wall-clock (isolated ctx)
This is the first concrete step of the "improve to perfection" flywheel: measurement + orchestration as one build. It's the workflow-backed complement to the static conformance grader โ€” and the first time ork ships a dynamic workflow (Thariq's template-in-skill pattern, the way /deep-research does), closing the G1/G3/G4 gaps.

The harness

Workflow({ scriptPath: "${CLAUDE_SKILL_DIR}/workflows/skill-fitness.mjs", args: [skills] }) phase Score: ๐Ÿ“ค fan-out โ€” one ISOLATED-CONTEXT agent per skill each reads its SKILL.md โ†’ scores freshness / clarity / structure 0-10 + top_issue phase Synthesize: ๐Ÿ“Š rank weakest-first, flag below-bar static-first: run conformance-check.mjs (0 tokens) to pre-filter, then this for judgment grep can't make

Live dogfood โ€” found what static analysis can't

SkillFCSTop issue (qualitative)
assess889desc "six dimensions" vs body "7 dimensions" drift + "Opus 4.8 do" grammar break
commit889duplicate ## Rules H2 + "9 rules across 7 categories" miscount
explore889model:sonnet frontmatter vs body's Opus-4.8/xhigh gating + dup "Phase 6"
verify898โš ๏ธ hardcoded /Users/... path (breaks on other installs) + version drift 4.3.0 vs 4.2.0
doctor999line 167 "non-4.7 model" should be "non-Opus-4.8" (freshness drift)

mean 25.4/30 ยท 266k tokens ยท none of these were catchable by the static grader (model-IDs + hook-counts only). The 5 findings are this workflow's first backlog โ€” a fast follow.

Static grader vs this workflow โ€” complementary, not redundant

conformance-check.mjs (static): 0 tokens, deterministic, CI-gateable โ€” catches mechanical drift (stale model IDs, hook counts). skill-fitness (workflow): ~50k tokens/skill, isolated-context judgment โ€” catches description/router clarity, internal-count mismatches, duplicate headings, install-portability, version drift. Run static first to pre-filter, workflow only where judgment is needed.

Why isolated-context (Thariq's failure modes)

Each skill graded in its OWN context window โ€” structurally immune to the self-preferential bias / agentic laziness / goal drift that hit my sequential single-context sweeps earlier this session. And the irony that proves it works: the harness's own first draft had the exact hardcoded-/Users/-path bug it flagged in verify โ€” caught and fixed before shipping.
Eval v2 #156 ยท the eval-as-workflow first spoke. Cost is real (~6M tokens for all 112 skills) โ€” ship as a scoped template, not a blind full run. PR playground gate artifact.