OrchestKit · self-audit
A data-grounded audit of OrchestKit's 113 skills + 38 agents against Matt Pocock's "Writing Great Skills" framework: the four axes of Predictability. Every number below is extracted from src/ at HEAD; every qualitative claim was adversarially verified against the real files by a parallel agent panel.
The root virtue is Predictability — "the degree to which a skill makes the agent behave the same way on every run — the same process, not the same output." Everything hangs off four axes and five failure modes.
mindmap
  root((Predictability))
    Invocation
      Model-invoked · has description
      User-invoked · no description
      Tight descriptions · leading words
      One trigger per branch
    Information Hierarchy
      Steps first
      In-file reference
      Disclosed reference · pointers
      Progressive disclosure
    Steering
      Leading words · pretrained concepts
      Branches
      Completion criteria
      Guard premature completion
    Pruning
      Single source of truth
      Kill duplication
      Kill sediment · stale
      Delete no-ops
Four axes of predictability. Failure modes to hunt: premature completion · duplication · sediment · sprawl · no-ops.
The eight most actionable techniques from the guide itself. Three (④⑤⑥) sharpen or correct fixes proposed elsewhere in this audit — the guide is more precise than a summary of it.
The guide is a worked example of itself: it's disable-model-invocation: true (user-invoked, zero context load), declares "this skill is all reference", and discloses every definition to GLOSSARY.md behind one pointer. The terms are its leading words.
Two front-door routers sit over a shared library. The eval layer should close the loop back onto skills+agents — today it mostly doesn't (dashed = weak/structural-only).
flowchart TB
U([plain-english goal]) --> HX["/hq-ext:auto
cross-plugin conductor"]
U --> OA["/ork:auto
dev-verb intent router"]
HX -->|dev-verbs, one-directional| OA
OA -->|classify → confirm → hand off| SK
subgraph LIB["shared library · src/"]
SK["113 skills
32 user-invocable · 30 with triggers"]
AG["38 agents
37/38 skills-wired"]
HK["hooks
guardrails + telemetry"]
SK -. spawns .-> AG
AG -. wired to .-> SK
end
subgraph EVAL["eval layer"]
ES["112 eval specs
query → expected_behavior"]
RB["routing-benchmark.json
50 labeled pairs"]
GT["golden tests
agent routing"]
end
SK -.->|covered by| ES
OA -.->|gated by?| RB
AG -.->|covered by| GT
ES -. "CI: STRUCTURAL ONLY" .-> CI{{"PR gate
schema + scaffold"}}
ES ==>|"weekly cron + auth"| LLM[["claude -p --bare
real grade"]]
RB -. "manual only" .-> LLM
classDef weak stroke-dasharray:5 5,stroke:#e0654f,color:#e0654f;
classDef strong stroke:#3ecf8e;
class ES,RB,GT,CI weak;
class SK,AG,HK strong;
Green = strong & real. Red-dashed = the proof gap: specs exist but the real graded loop only fires weekly, and routing accuracy is never gated on a PR.
Scores are the panel's verified figures (post-adversarial-check), grounded in the real counts. Authoring axes score high; the eval axis drags the aggregate down.
Each row: what the framework prescribes, what we already do well (with a real number), and the concrete gap that survived verification.
Gaps shown are the ones the verification pass confirmed against a real file. Rejected/inflated claims were dropped.
SKILL.md length distribution
Cap is 500 lines (400 preferred). 1 over cap (implement=520); 8 over 400. No test enforces the cap.
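Since nothing enforces the cap today, a lint along the following lines could close that gap. This is a minimal sketch, not the project's actual test: the `<skill>/SKILL.md` directory layout and the function name `check_skill_lengths` are assumptions, and only the 500 and 400 thresholds come from the audit.

```python
from pathlib import Path

HARD_CAP = 500       # cap stated in the audit
PREFERRED_CAP = 400  # preferred ceiling stated in the audit


def check_skill_lengths(skills_root: Path) -> dict[str, list[tuple[str, int]]]:
    """Classify each <skill>/SKILL.md under skills_root by line count.

    The one-directory-per-skill layout is an assumption about the repo.
    """
    report: dict[str, list[tuple[str, int]]] = {"over_cap": [], "over_preferred": []}
    for skill_md in sorted(skills_root.glob("*/SKILL.md")):
        line_count = len(skill_md.read_text(encoding="utf-8").splitlines())
        name = skill_md.parent.name
        if line_count > PREFERRED_CAP:
            # "over 400" includes anything that is also over the hard cap
            report["over_preferred"].append((name, line_count))
        if line_count > HARD_CAP:
            report["over_cap"].append((name, line_count))
    return report
```

A test runner could fail the build when `over_cap` is non-empty and only warn on `over_preferred`, which keeps the preferred ceiling advisory while making the hard cap binding.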
Description length (chars)
Tight = better invocation. Sweet spot 200–350 (71 skills). 7 bloated >500 (portless=661).
Progressive disclosure: 93/113 skills use a references/ dir · avg 4.9 files
WHEN clause in desc: 111/113 · "use when/for/to", trigger conditions present
Anti-triggers (structured): 30 skills · anti-triggers: YAML field, but only ~5 in the description CC's router reads
Agents skills-wired: 37/38 · activation fix applied · 1 orphan
Agents "proactively": 0/38 · dead language removed (audit Δ0)
Eval spec coverage: 112/113 · only portless lacks one
LLM behavior-graded in CI: 0% · weekly eval rates descriptions 1–10 on 5 random skills, not behavior
Agent specs consumed: 0/31 · 31 .eval.yaml on disk · golden runner reads none
Agent model tiering (38 agents)
Healthy spread — cheap default (inherit/haiku) with opus reserved for hard reasoning.
The sprawl cluster (SKILL.md > 400 lines)
All 8 are the big orchestrators — the split-into-references candidates.
Provenance sediment — markers per skill (Pruning axis)
(CC x.y) version tags + M1xx/#NNNN issue markers baked into steering prose. 344 total (204 CC-tags across 39 skills + 140 milestone/issue markers) — cognitive load that changes no behavior vs. the default.
The No-Ops test: "does this tag change behaviour vs the default?" If no → demote to a footnote or references/cc-enhancements.md. A test:skills lint could enforce this mechanically.
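A mechanical version of that lint could start by counting the markers. The sketch below is only illustrative: the three regular expressions are guesses at the marker shapes from the description above (`(CC x.y)`, `M1xx`, `#NNNN`), not patterns taken from the real skill files, and `count_sediment` is a hypothetical name.

```python
import re

# Assumed marker shapes, inferred from the audit's wording rather than from the files.
CC_TAG = re.compile(r"\(CC \d+\.\d+(?:\.\d+)?\)")  # e.g. "(CC 2.1)"
MILESTONE = re.compile(r"\bM1\d{2}\b")             # e.g. "M112"
ISSUE = re.compile(r"(?<!\w)#\d{4}\b")             # e.g. "#1234"


def count_sediment(text: str) -> dict[str, int]:
    """Count provenance markers in one skill's steering prose."""
    counts = {
        "cc_tags": len(CC_TAG.findall(text)),
        "milestone_or_issue": len(MILESTONE.findall(text)) + len(ISSUE.findall(text)),
    }
    counts["total"] = counts["cc_tags"] + counts["milestone_or_issue"]
    return counts
```

Counting is the cheap half. Deciding whether a given tag changes behaviour still needs a human or a graded check, so a lint like this would most plausibly enforce a per-skill budget rather than a hard zero.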
Both are textbook Router Skills: classify → confirm → hand off, and never do the work themselves. hq-ext:auto conducts across both plugins and delegates every dev-verb one-directionally to ork:auto, the single source of truth for fix/build/review/test.
sequenceDiagram
participant U as User
participant H as /hq-ext:auto
participant O as /ork:auto
participant S as specialist skill
U->>H: "get me ready & unblock the promote"
Note over H: decompose → assign from catalog.json → compose waves
H->>H: ops reads (∥) + gated mutate
H->>O: dev-verb clause (one-directional)
Note over O: classify (CoT) → confirm route
O->>S: hand off + follow ITS phases
S-->>U: result (specialist owns its report)
Note over O,S: routing-benchmark.json · 50 pairs · target ≥95%
but only graded manually via /ork:bare-eval
Verdict: the routers are our best-authored skills (~8.5/10). The single fix that matters: wire routing-benchmark.json into a PR-time gate so a taxonomy edit can't silently regress routing accuracy.
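The gate itself can be small. The sketch below assumes the benchmark is a list of objects with `query` and `expected` keys, which is a guess at the schema of routing-benchmark.json, and it stands in a toy keyword router (`keyword_route`, hypothetical) for whatever deterministic classifier the PR check would actually call. Only the 95% target comes from the audit.

```python
from typing import Callable

TARGET_ACCURACY = 0.95  # the ">= 95%" target quoted for routing-benchmark.json


def routing_accuracy(pairs: list[dict], route: Callable[[str], str]) -> float:
    """Fraction of labeled pairs the router classifies as expected."""
    if not pairs:
        raise ValueError("benchmark is empty")
    hits = sum(1 for pair in pairs if route(pair["query"]) == pair["expected"])
    return hits / len(pairs)


def gate(pairs: list[dict], route: Callable[[str], str],
         threshold: float = TARGET_ACCURACY) -> int:
    """Return a process exit code: 0 passes the PR check, 1 blocks it."""
    accuracy = routing_accuracy(pairs, route)
    print(f"routing accuracy: {accuracy:.1%} (threshold {threshold:.0%})")
    return 0 if accuracy >= threshold else 1


def keyword_route(query: str) -> str:
    """Toy stand-in router, for illustration only."""
    q = query.lower()
    if "fix" in q:
        return "fix"
    if "test" in q:
        return "test"
    return "build"
```

In CI the script would load the pairs with `json.load` and pass the return value of `gate` to `sys.exit`, so a taxonomy edit that drops accuracy below the threshold fails the check.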
Today's pipeline validates that specs are well-formed, not that skills behave. The target closes the loop affordably: sample-grade changed skills on every PR; full-grade weekly; and — the real blind spot — give agents their own activation + output-quality evals.
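For the agent activation half, the metrics are the standard ones. The sketch below shows how activation precision, recall, and misroute rate could be computed from graded records. The record shape, a pair of expected and activated agent names with `None` meaning no agent, is an assumption, as are the function names.

```python
from typing import Optional

Record = tuple[Optional[str], Optional[str]]  # (expected_agent, activated_agent)


def activation_metrics(records: list[Record], agent: str) -> dict[str, float]:
    """Precision and recall of activation for one agent."""
    tp = sum(1 for exp, got in records if exp == agent and got == agent)
    fp = sum(1 for exp, got in records if exp != agent and got == agent)
    fn = sum(1 for exp, got in records if exp == agent and got != agent)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


def misroute_rate(records: list[Record]) -> float:
    """Share of records where the activated agent differs from the expected one."""
    if not records:
        raise ValueError("no records to score")
    return sum(1 for exp, got in records if exp != got) / len(records)
```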
Current — proof gap
flowchart TB
P[PR opened] --> V["validate eval schema
+ scaffold"]
P --> K["keyword-match +
collision test
(deterministic · real)"]
V --> G{{green}}
K --> G
W[weekly cron] -.->|needs auth| R["rate DESCRIPTIONS 1-10
on 5 RANDOM skills
NOT query→behavior"]
R -.-> D[(one-shot json ·
no trend ledger)]
B["routing-benchmark
50 pairs"] -.->|zero CI refs| X["manual bare-eval only"]
A["31 agent specs"] -.-> N["never consumed"]
SN["schema should_not[]"] -.-> NE["never evaluated"]
classDef bad stroke:#e0654f,color:#e0654f,stroke-dasharray:4 4;
classDef ok stroke:#3ecf8e;
class R,X,N,NE bad;
class K ok;
Target — closed loop
flowchart TB
P[PR opened] --> C["changed-skills
detector"]
C --> S["sample-grade
via bare-eval
(only changed)"]
S --> M{{"accuracy ≥ threshold?"}}
M -->|no| BLK[block merge]
M -->|yes| OK[green]
RB[routing-benchmark] --> S
AG[agent-activation
eval set] --> S
W[weekly cron] --> FULL["full 112-spec grade
+ trend to telemetry"]
FULL --> TR[(activation precision/recall
misroute rate over time)]
classDef good stroke:#3ecf8e;
class S,M,FULL,AG good;
The unlock is changed-only sampling: grading every skill on every PR is too costly, but grading just the touched skill (plus the router benchmark) is cheap enough to gate.
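The changed-skills detector can be a pure function over the list of touched paths. This sketch assumes skills live under `src/skills/<name>/`, which the audit does not confirm beyond "src/", and `changed_skills` is a hypothetical name.

```python
from pathlib import PurePosixPath


def changed_skills(changed_paths: list[str], skills_dir: str = "src/skills") -> list[str]:
    """Map changed file paths to the skill directories they belong to.

    skills_dir is an assumed location; adjust it to the real layout.
    """
    root = PurePosixPath(skills_dir)
    names: set[str] = set()
    for raw in changed_paths:
        try:
            relative = PurePosixPath(raw).relative_to(root)
        except ValueError:
            continue  # path is not under the skills directory
        if len(relative.parts) >= 2:  # need <skill>/<file>; skip loose files in skills_dir
            names.add(relative.parts[0])
    return sorted(names)
```

The input would typically be the output of `git diff --name-only` between the PR base and head, split on newlines. An edit to a skill's `references/` file counts as a change to that skill, which matters here because most skills disclose into a references directory.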
Meta — this audit ran the method it recommends
Every finding above passed through an adversarial verify pass: a second agent re-checked each claim against the real file and dropped or corrected the ones that didn't hold. That's exactly the "grade, don't assert" loop §⑧ proposes — applied to the audit itself. What it caught in the first-pass analysis:
[…] anti-triggers: field; the real gap is only the ~5 description-level ones CC's router actually reads.
[…] trigger_evals; deleting them would remove real coverage.
[…] implement=520 not 521); dream's "23 h2 blocks" was actually 14; the weekly eval "grades 112 specs" was actually "rates 5 random descriptions."
The hq-ext:auto half of the router comparison could not be verified from this repo (its plugin cache isn't the source of truth here), so its claims are held as lower-confidence.
12 agents · 954k tokens · 6 axes analyzed, then each finding re-verified. The corrections are the point: an ungated audit would have shipped all four errors as fact.
Ranked by (impact on real quality signal) ÷ (build cost). The top three are all eval-layer; the authoring fixes are cheap polish.
Five phases ordered by (signal ÷ cost) and by the public/private boundary: everything that can run in orchestkit's public CI at $0 API ships first; behavioral grading that needs token budget runs opt-in or on HQ's private runner (see the HQ tab). Each phase is a self-contained PR with a hard acceptance gate.
flowchart LR
subgraph PUB["PUBLIC · orchestkit · $0 API"]
P0["Phase 0
routing-benchmark
PR gate"]
P4["Phase 4
authoring polish
sediment · DoD · trims"]
end
subgraph HYB["BEHAVIORAL · needs API budget"]
P1["Phase 1
changed-scope
skill grade"]
P2["Phase 2
agent
activation evals"]
end
subgraph PRIV["PRIVATE · HQ infra"]
P3["Phase 3
Langfuse trend
ledger + release gate"]
end
P0 --> P1 --> P2 --> P3
P0 -.-> P4
P1 -.->|opt-in label OR HQ runner| P3
classDef pub stroke:#3ecf8e;
classDef priv stroke:#e0a94f;
class P0,P4 pub;
class P3 priv;
Phase 0 + 4 need no API budget and no HQ dependency — do them this week. Phases 1–3 escalate into behavioral + trend territory.
The constraint that shapes everything: orchestkit is Tier-D public OSS — its committed .mcp.json has no HQ servers, so it cannot hard-depend on private HQ infra without breaking public installs. HQ's own inventory (research-06-orchestkit.md) says it plainly: "ORK is what you'd install if you joined any company tomorrow; hq-ext is what you'd install if you joined Yonatan-HQ." So the model is not "ork calls HQ" — it's "HQ consumes ork as a downstream," exactly like the platform's practice_score.py worker already does for coaching metrics.
flowchart TB
subgraph ORK["orchestkit — PUBLIC (Tier-D OSS) · stays self-contained"]
SPEC["eval specs +
routing-benchmark +
eval-runner (Langfuse optional)"]
DW["dual-write analytics sink
(already exists · v7.87.0)"]
end
subgraph HQ["Yonatan-HQ platform — PRIVATE · auth + API budget"]
WK["eval worker
à la practice_score.py
(runs ork's specs)"]
LF[("Langfuse
traces + scores")]
KB[("hq-knowledge
pgvector RAG")]
OBS["observability-ops
agent · monitor"]
GATE{{"promote /
strict-mode gate"}}
DIG["skill-eval digest
confidence-rendered"]
end
SPEC --> WK
DW --> WK
WK --> LF
WK --> KB
LF --> OBS
KB -->|kb_gaps| OBS
OBS --> DIG
LF --> GATE
GATE -.->|block regressed release| ORK
classDef pub stroke:#3ecf8e;
classDef priv stroke:#e0a94f;
class SPEC,DW pub;
class WK,LF,KB,OBS,GATE,DIG priv;
The bridge already exists: ork dual-writes analytics to the platform sink (CHANGELOG v7.87.0). Extend that channel to carry eval events — no new coupling, no public-install breakage.
practice-score template: copy it for evals
HQ already solved this exact shape for a different metric. /hq-ext:practice-score reads a weekly 5-dimension coaching score: a platform worker (cc_hooks_tasks/practice_score.py) computes it on a Sunday cron and exposes it via GET /api/practice-score/latest, and a read-only skill renders it, keyed by a confidence field so a low-sample dimension shows ⚪ low instead of a fake number. That is precisely the honesty an eval digest needs.
flowchart LR
CR["Sunday cron"] --> W["platform worker
computes scores"]
W --> API["/api/…/latest
{value, sample_size,
delta, confidence}"]
API --> S["read-only digest skill
renders w/ confidence"]
S --> U([operator])
classDef p stroke:#7c6cf0;
class W,API,S p;
Reuse verbatim: swap "practice score" for "skill/agent eval score." The worker runs ork:eval-runner against ork's specs; the digest becomes /ork:eval-status (or an hq-ext skill) that renders pass-rate + 7-day delta + confidence.
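The confidence-keyed rendering is the part worth copying carefully. A minimal sketch, assuming the payload carries the four fields shown in the diagram (`value`, `sample_size`, `delta`, `confidence`), that `value` and `delta` are fractions, and that `"low"` is the literal confidence label. None of those details are confirmed against the live endpoint, and `render_metric` is a hypothetical name.

```python
def render_metric(name: str, metric: dict) -> str:
    """Render one eval metric, refusing to print a number when confidence is low."""
    if metric["confidence"] == "low":
        # Too few samples: show the marker instead of a misleading pass rate.
        return f"{name}: ⚪ low (n={metric['sample_size']})"
    return (
        f"{name}: {metric['value']:.0%} "
        f"({metric['delta']:+.0%} vs 7 days ago, n={metric['sample_size']})"
    )
```

For a well-sampled skill this yields a line such as `ork:implement: 90% (+5% vs 7 days ago, n=40)`, while a skill graded only three times in the week renders as `⚪ low (n=3)` whatever its raw pass rate was.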
Every asset below exists in the HQ stack today. The mapping is the point: our weakest axis (Eval maturity 4/10) is mostly an execution + tracking problem, and HQ already owns that substrate.
Do
[…] env-gated and degrading (it already is).
[…] observability-ops + the practice-score worker pattern instead of building ork-side monitors.
Don't
[…] mcp__hq-* servers to orchestkit's committed .mcp.json; it breaks public installs and leaks private infra.
Net: ork gets a real, trended, gated eval system; the public plugin stays clean; HQ reuses its own Langfuse + worker + knowledge stack. Lower-confidence note: HQ endpoints (/api/practice-score, Langfuse project wiring) were read from the hq-ext plugin cache, not the live platform repo; confirm exact routes before building.