OrchestKit · self-audit

Are our skills & sub-agents actually good — and can we prove it?

A data-grounded audit of OrchestKit's 113 skills + 38 agents against Matt Pocock's "Writing Great Skills" framework — the four axes of Predictability. Every number below is extracted from src/ at HEAD; every qualitative claim was adversarially verified against the real files by a parallel agent panel.

113 skills · 38 agents · 112 eval specs · HEAD · main · source: corpus.json (extracted) · verified: 6-agent panel
verified 2026-07-01 · real data
6.7 / 10

Strong on craft, blind on proof.

Our skills follow the authoring rules well — progressive disclosure (93/113), front-loaded leading words, honest routers, a 3-tier invocation model with a CI keyword-collision test. The weak axis is evaluation: the weekly "real" eval rates skill descriptions 1–10 on 5 random skills — it never grades query→behavior. The 50-pair routing benchmark is gated by zero CI workflows, and 31 agent eval specs sit on disk never consumed. We assert quality; we don't measure it.

The framework we're grading against

Matt Pocock · writing-great-skills

The root virtue is Predictability — "the degree to which a skill makes the agent behave the same way on every run — the same process, not the same output." Everything hangs off four axes and five failure modes.

mindmap
  root((Predictability))
    Invocation
      Model-invoked · has description
      User-invoked · no description
      Tight descriptions · leading words
      One trigger per branch
    Information Hierarchy
      Steps first
      In-file reference
      Disclosed reference · pointers
      Progressive disclosure
    Steering
      Leading words · pretrained concepts
      Branches
      Completion criteria
      Guard premature completion
    Pruning
      Single source of truth
      Kill duplication
      Kill sediment · stale
      Delete no-ops
      

Four axes of predictability. Failure modes to hunt: premature completion · duplication · sediment · sprawl · no-ops.

Lessons from the source

verbatim SKILL.md + GLOSSARY.md

The eight most actionable techniques from the guide itself. Three (④⑤⑥) sharpen or correct fixes proposed elsewhere in this audit — the guide is more precise than a summary of it.

The guide is a worked example of itself: it's disable-model-invocation: true (user-invoked, zero context load), declares "this skill is all reference", and discloses every definition to GLOSSARY.md behind one pointer. The terms are its leading words.

How our surface actually fits together

real components

Two front-door routers sit over a shared library. The eval layer should close the loop back onto skills+agents — today it mostly doesn't (dashed = weak/structural-only).

flowchart TB
  U([plain-english goal]) --> HX["/hq-ext:auto<br/>cross-plugin conductor"]
  U --> OA["/ork:auto<br/>dev-verb intent router"]
  HX -->|dev-verbs, one-directional| OA
  OA -->|classify → confirm → hand off| SK
  subgraph LIB["shared library · src/"]
    SK["113 skills<br/>32 user-invocable · 30 with triggers"]
    AG["38 agents<br/>37/38 skills-wired"]
    HK["hooks<br/>guardrails + telemetry"]
    SK -. spawns .-> AG
    AG -. wired to .-> SK
  end
  subgraph EVAL["eval layer"]
    ES["112 eval specs<br/>query → expected_behavior"]
    RB["routing-benchmark.json<br/>50 labeled pairs"]
    GT["golden tests<br/>agent routing"]
  end
  SK -.->|covered by| ES
  OA -.->|gated by?| RB
  AG -.->|covered by| GT
  ES -. "CI: STRUCTURAL ONLY" .-> CI{{"PR gate<br/>schema + scaffold"}}
  ES ==>|"weekly cron + auth"| LLM[["claude -p --bare<br/>real grade"]]
  RB -. "manual only" .-> LLM
  classDef weak stroke-dasharray:5 5,stroke:#e0654f,color:#e0654f;
  classDef strong stroke:#3ecf8e;
  class ES,RB,GT,CI weak;
  class SK,AG,HK strong;

Green = strong & real. Red-dashed = the proof gap: specs exist but the real graded loop only fires weekly, and routing accuracy is never gated on a PR.

Scorecard — where we stand per axis

verified by agent panel

Scores are the panel's verified figures (post-adversarial-check), grounded in the real counts. Authoring axes score high; the eval axis drags the aggregate down.

Best practice vs. ours — with real evidence

per axis

Each row: what the framework prescribes, what we already do well (with a real number), and the concrete gap that survived verification.

Gaps shown are the ones the verification pass confirmed against a real file. Rejected/inflated claims were dropped.

Real-data charts

extracted from src/ at HEAD

SKILL.md length distribution

Cap is 500 lines (400 preferred). 1 over cap (implement=520); 8 over 400. No test enforces the cap.

  • <100 lines: 8 skills
  • 100–200 lines: 39 skills
  • 200–300 lines: 44 skills
  • 300–400 lines: 14 skills
  • >400 lines: 8 skills
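Since no test enforces the cap today, a length check could be as small as the sketch below. It is a minimal illustration, not the project's lint: the two thresholds come from the audit, while the function names and the assumption of one SKILL.md per skill directory are hypothetical.

```python
from pathlib import Path

HARD_CAP = 500       # cap stated in the audit
PREFERRED_CAP = 400  # preferred ceiling stated in the audit


def classify_length(line_count: int) -> str:
    """Return 'fail' above the hard cap, 'warn' above the preferred cap, else 'ok'."""
    if line_count > HARD_CAP:
        return "fail"
    if line_count > PREFERRED_CAP:
        return "warn"
    return "ok"


def audit_skill_lengths(root: Path) -> dict[str, str]:
    """Map each skill directory name to its length verdict.

    Assumption: every skill lives in its own directory containing a SKILL.md.
    """
    verdicts = {}
    for skill_file in sorted(root.rglob("SKILL.md")):
        # splitlines() does not count a phantom line after a trailing newline.
        line_count = len(skill_file.read_text(encoding="utf-8").splitlines())
        verdicts[skill_file.parent.name] = classify_length(line_count)
    return verdicts
```

A CI job could fail when any verdict is "fail" and only print the "warn" entries, which keeps the 400-line preference advisory.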

Description length (chars)

Tight = better invocation. Sweet spot 200–350 (71 skills). 7 bloated >500 (portless=661).

  • <200 chars: 2 skills
  • 200–350 chars: 71 skills
  • 350–500 chars: 33 skills
  • 500–650 chars: 6 skills
  • >650 chars: 1 skill
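The same buckets can be recomputed from raw description lengths with a few lines. This is a sketch only: the bucket labels mirror the chart, but the exact boundary handling (lower bound inclusive, 650 counted in the 500–650 bucket) is an assumption, as are the function names.

```python
from collections import Counter
from typing import Iterable


def description_bucket(length: int) -> str:
    """Place a description length (in characters) into one of the chart's buckets."""
    if length < 200:
        return "<200"
    if length < 350:
        return "200–350"
    if length < 500:
        return "350–500"
    if length <= 650:
        return "500–650"
    return ">650"


def description_histogram(lengths: Iterable[int]) -> Counter:
    """Count how many descriptions fall into each bucket."""
    return Counter(description_bucket(n) for n in lengths)
```

For example, the two router descriptions cited in this audit (533 and roughly 600 characters) both land in the 500–650 bucket, and portless (661) lands in >650.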

Progressive disclosure

93/113

skills use a references/ dir · avg 4.9 files

WHEN clause in desc

111/113

"use when/for/to" — trigger conditions present

Anti-triggers (structured)

30 skills

anti-triggers: YAML field · but only ~5 in the description CC's router reads

Agents skills-wired

37/38

activation fix applied · 1 orphan

Agents "proactively"

0/38

dead language removed (audit Δ0)

Eval spec coverage

112/113

only portless lacks one

LLM behavior-graded in CI

0%

weekly eval rates descriptions 1–10 on 5 random skills — not behavior

Agent specs consumed

0/31

31 .eval.yaml on disk · golden runner reads none

Agent model tiering (38 agents)

Healthy spread — cheap default (inherit/haiku) with opus reserved for hard reasoning.

The sprawl cluster (SKILL.md > 400 lines)

All 8 are the big orchestrators — the split-into-references candidates.

Provenance sediment — markers per skill (Pruning axis)

(CC x.y) version tags + M1xx/#NNNN issue markers baked into steering prose. 344 total (204 CC-tags across 39 skills + 140 milestone/issue markers) — cognitive load that changes no behavior vs. the default.

The No-Ops test: "does this tag change behaviour vs the default?" If no → demote to a footnote or references/cc-enhancements.md. A test:skills lint could enforce this mechanically.
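A mechanical version of that lint could start by counting the markers per file. The sketch below assumes the marker shapes are exactly as written above, "(CC x.y)", "M1xx", and "#NNNN" with four digits; the regular expressions and names are illustrative, not the project's test:skills implementation.

```python
import re

# Assumed marker shapes, taken from the audit's description of the sediment.
CC_TAG = re.compile(r"\(CC \d+\.\d+(?:\.\d+)?\)")  # e.g. "(CC 2.1)"
MILESTONE = re.compile(r"\bM1\d{2}\b")              # e.g. "M123"
ISSUE = re.compile(r"(?<!\w)#\d{4}\b")              # e.g. "#4567", not "#3ecf8e"


def count_markers(text: str) -> dict[str, int]:
    """Count provenance markers in one SKILL.md body."""
    return {
        "cc_tags": len(CC_TAG.findall(text)),
        "milestones": len(MILESTONE.findall(text)),
        "issues": len(ISSUE.findall(text)),
    }


def total_markers(text: str) -> int:
    """Total marker count, the number a lint would compare against a budget."""
    return sum(count_markers(text).values())
```

A lint built on this could allow markers inside references/ files and flag them only in steering prose, which matches the "demote to a footnote" remedy.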

Router deep-dive — the two front doors

/ork:auto · /hq-ext:auto

Both are textbook Router Skills: classify → confirm → hand off, and never do the work themselves. hq-ext:auto conducts across both plugins and delegates every dev-verb one-directionally to ork:auto, the single source of truth for fix/build/review/test.

sequenceDiagram
  participant U as User
  participant H as /hq-ext:auto
  participant O as /ork:auto
  participant S as specialist skill
  U->>H: "get me ready & unblock the promote"
  Note over H: decompose → assign from catalog.json → compose waves
  H->>H: ops reads (∥) + gated mutate
  H->>O: dev-verb clause (one-directional)
  Note over O: classify (CoT) → confirm route
  O->>S: hand off + follow ITS phases
  S-->>U: result (specialist owns its report)
  Note over O,S: routing-benchmark.json · 50 pairs · target ≥95%<br/>but only graded manually via /ork:bare-eval
Dimension | /ork:auto | /hq-ext:auto | Best practice
Pattern | classify→confirm→handoff [clean] | decompose→assign→compose→confirm→execute [clean] | Router guides multiple user-invoked options
Description length | 533 ch [long] | ~600 ch [long] | Tight; front-load leading words
Accuracy gate | 50-pair benchmark, manual only [not gated] | inherits ork gate; no own benchmark [ungated] | Classification quality IS the job → must be gated
Honest gaps | optimize has no skill → says so [honest] | degrade-when-no-fanout documented [honest] | Surface fallback rate; don't fake skills
Composition | never routes to itself [no recursion] | auto→ork:auto only, no cycle [safe] | One-directional; mutating ⇒ confirm

Verdict: the routers are our best-authored skills (~8.5/10). The single fix that matters: wire routing-benchmark.json into a PR-time gate so a taxonomy edit can't silently regress routing accuracy.
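A PR-time gate of that kind reduces to an exact-match accuracy check over labeled pairs. The sketch below is a hedged illustration: the (query, expected route) pair shape, the toy keyword router, and the function names are assumptions, since the real layout of routing-benchmark.json and the real classifier are not shown here. Only the ≥95% target comes from the audit.

```python
from typing import Callable, Sequence, Tuple

Pair = Tuple[str, str]  # (query, expected route); assumed benchmark shape


def routing_accuracy(pairs: Sequence[Pair], route: Callable[[str], str]) -> float:
    """Fraction of queries whose predicted route exactly matches the label."""
    if not pairs:
        raise ValueError("benchmark is empty")
    hits = sum(1 for query, expected in pairs if route(query) == expected)
    return hits / len(pairs)


def passes_gate(pairs: Sequence[Pair], route: Callable[[str], str],
                threshold: float = 0.95) -> bool:
    """True when accuracy meets the target; a CI job would fail otherwise."""
    return routing_accuracy(pairs, route) >= threshold


def keyword_router(query: str) -> str:
    """Toy stand-in for the real classifier: route on the first dev-verb found."""
    words = query.lower().split()
    for verb in ("fix", "build", "review", "test"):
        if verb in words:
            return verb
    return "unknown"
```

Because the check is deterministic, it needs no API budget; a taxonomy edit that breaks a labeled pair drops the accuracy and fails the gate.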

Eval maturity — current vs. target

the weak axis

Today's pipeline validates that specs are well-formed, not that skills behave. The target closes the loop affordably: sample-grade changed skills on every PR; full-grade weekly; and — the real blind spot — give agents their own activation + output-quality evals.

Current — proof gap

flowchart TB
  P[PR opened] --> V["validate eval schema<br/>+ scaffold"]
  P --> K["keyword-match +<br/>collision test<br/>(deterministic · real)"]
  V --> G{{green}}
  K --> G
  W[weekly cron] -.->|needs auth| R["rate DESCRIPTIONS 1-10<br/>on 5 RANDOM skills<br/>NOT query→behavior"]
  R -.-> D[(one-shot json ·<br/>no trend ledger)]
  B["routing-benchmark<br/>50 pairs"] -.->|zero CI refs| X["manual bare-eval only"]
  A["31 agent specs"] -.-> N["never consumed"]
  SN["schema should_not[]"] -.-> NE["never evaluated"]
  classDef bad stroke:#e0654f,color:#e0654f,stroke-dasharray:4 4;
  classDef ok stroke:#3ecf8e;
  class R,X,N,NE bad;
  class K ok;

Target — closed loop

flowchart TB
  P[PR opened] --> C["changed-skills<br/>detector"]
  C --> S["sample-grade<br/>via bare-eval<br/>(only changed)"]
  S --> M{{"accuracy ≥ threshold?"}}
  M -->|no| BLK[block merge]
  M -->|yes| OK[green]
  RB[routing-benchmark] --> S
  AG[agent-activation<br/>eval set] --> S
  W[weekly cron] --> FULL["full 112-spec grade<br/>+ trend to telemetry"]
  FULL --> TR[(activation precision/recall<br/>misroute rate over time)]
  classDef good stroke:#3ecf8e;
  class S,M,FULL,AG good;

The unlock is changed-only sampling: grading every skill on every PR is too costly, but grading just the touched skill (plus the router benchmark) is cheap enough to gate.
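The changed-skills detector can be sketched as a pure function over the paths a PR touches, for example the output of git diff --name-only. The src/skills/<name>/ layout and the function name are assumptions made for illustration; the audit only states that skills live under src/.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def changed_skills(changed_paths: Iterable[str],
                   skills_root: str = "src/skills") -> List[str]:
    """Return the sorted names of skills with at least one changed file.

    Assumption: each skill is a directory directly under `skills_root`.
    """
    root_parts = PurePosixPath(skills_root).parts
    depth = len(root_parts)
    found = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # Require root/<skill>/<something>, so files directly in the root are ignored.
        if len(parts) > depth + 1 and parts[:depth] == root_parts:
            found.add(parts[depth])
    return sorted(found)
```

The grading step then runs only for the returned names plus the router benchmark, so an unrelated docs change grades nothing.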

Meta — this audit ran the method it recommends

Every finding above passed through an adversarial verify pass: a second agent re-checked each claim against the real file and dropped or corrected the ones that didn't hold. That's exactly the "grade, don't assert" loop §⑧ proposes — applied to the audit itself. What it caught in the first-pass analysis:

  • Rejected: "only 7/113 skills have anti-triggers → biggest gap." Reality: 30 carry a structured anti-triggers: field; the real gap is only the ~5 description-level ones CC's router actually reads.
  • Rejected: "delete 9 skills' dead trigger blocks as sediment." Those keywords feed a CI keyword-match + collision test and 6 have trigger_evals — deleting them would remove real coverage.
  • Corrected: line counts were consistently off by one (implement=520, not 521); dream's "23 h2 blocks" was actually 14; the weekly eval "grades 112 specs" was actually "rates 5 random descriptions."
  • Flagged honestly: the hq-ext:auto half of the router comparison could not be verified from this repo (its plugin cache isn't the source of truth here) — so its claims are held as lower-confidence.

12 agents · 954k tokens · 6 axes analyzed then each finding re-verified. The corrections are the point: an ungated audit would have shipped all four errors as fact.

What to improve — ranked

highest leverage first

Ranked by (impact on real quality signal) ÷ (build cost). The top three are all eval-layer; the authoring fixes are cheap polish.

Action plan — sequenced & costed

public-first · API-budget-aware

Five phases ordered by (signal ÷ cost) and by the public/private boundary: everything that can run in orchestkit's public CI at $0 API ships first; behavioral grading that needs token budget runs opt-in or on HQ's private runner (see the HQ tab). Each phase is a self-contained PR with a hard acceptance gate.

flowchart LR
  subgraph PUB["PUBLIC · orchestkit · $0 API"]
    P0["Phase 0<br/>routing-benchmark<br/>PR gate"]
    P4["Phase 4<br/>authoring polish<br/>sediment · DoD · trims"]
  end
  subgraph HYB["BEHAVIORAL · needs API budget"]
    P1["Phase 1<br/>changed-scope<br/>skill grade"]
    P2["Phase 2<br/>agent<br/>activation evals"]
  end
  subgraph PRIV["PRIVATE · HQ infra"]
    P3["Phase 3<br/>Langfuse trend<br/>ledger + release gate"]
  end
  P0 --> P1 --> P2 --> P3
  P0 -.-> P4
  P1 -.->|opt-in label OR HQ runner| P3
  classDef pub stroke:#3ecf8e;
  classDef priv stroke:#e0a94f;
  class P0,P4 pub;
  class P3 priv;

Phase 0 + 4 need no API budget and no HQ dependency — do them this week. Phases 1–3 escalate into behavioral + trend territory.

The phases

files · gate · cost

Sequencing logic

  • Why Phase 0 first: the routing benchmark is the one real labeled dataset we already own, and easy pairs need no LLM — a deterministic exact-match check is $0 and catches the highest-severity gap (a misroute in the front door). Zero reasons to defer.
  • Why behavioral grading is gated, not default: grading every skill on every PR burns budget. Phase 1 grades only the touched skill; even that runs behind an opt-in label or on HQ's runner so public forks never incur cost or need secrets.
  • Why agents come before the trend ledger: agents are the bigger blind spot (31 specs unconsumed). Get a per-agent signal existing (Phase 2) before investing in trend infrastructure (Phase 3) to track it.
  • Why authoring polish is parallel, not blocking: Phase 4 (strip 344 no-ops, inline Definition-of-Done, trim 7 descriptions) is pure public cleanup — it can land anytime and doesn't depend on the eval work.
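The per-agent signal that Phase 2 asks for can be made concrete as activation precision, activation recall, and an overall misroute rate. The definitions below are an assumed reading of those terms, computed from (expected agent, actual agent) records; the record shape, the agent names in the usage, and the function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Record = Tuple[str, str]  # (expected agent, agent that actually activated)


def activation_metrics(records: Sequence[Record], agent: str) -> dict:
    """Precision and recall of one agent's activation over labeled records."""
    tp = sum(1 for exp, act in records if exp == agent and act == agent)
    fp = sum(1 for exp, act in records if exp != agent and act == agent)
    fn = sum(1 for exp, act in records if exp == agent and act != agent)
    precision: Optional[float] = tp / (tp + fp) if tp + fp else None
    recall: Optional[float] = tp / (tp + fn) if tp + fn else None
    return {"precision": precision, "recall": recall}


def misroute_rate(records: Sequence[Record]) -> float:
    """Share of records where the wrong agent activated."""
    if not records:
        raise ValueError("no records to score")
    return sum(1 for exp, act in records if exp != act) / len(records)
```

Returning None when an agent never activated (or was never expected) avoids reporting a made-up 0 or 1 for an agent with no data.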

Using the Yonatan-HQ platform to actually improve ork

grounded in real infra

The constraint that shapes everything: orchestkit is Tier-D public OSS — its committed .mcp.json has no HQ servers, so it cannot hard-depend on private HQ infra without breaking public installs. HQ's own inventory (research-06-orchestkit.md) says it plainly: "ORK is what you'd install if you joined any company tomorrow; hq-ext is what you'd install if you joined Yonatan-HQ." So the model is not "ork calls HQ" — it's "HQ consumes ork as a downstream," exactly like the platform's practice_score.py worker already does for coaching metrics.

flowchart TB
  subgraph ORK["orchestkit · PUBLIC (Tier-D OSS) · stays self-contained"]
    SPEC["eval specs +<br/>routing-benchmark +<br/>eval-runner (Langfuse optional)"]
    DW["dual-write analytics sink<br/>(already exists · v7.87.0)"]
  end
  subgraph HQ["Yonatan-HQ platform · PRIVATE · auth + API budget"]
    WK["eval worker<br/>à la practice_score.py<br/>(runs ork's specs)"]
    LF[("Langfuse<br/>traces + scores")]
    KB[("hq-knowledge<br/>pgvector RAG")]
    OBS["observability-ops<br/>agent · monitor"]
    GATE{{"promote /<br/>strict-mode gate"}}
    DIG["skill-eval digest<br/>confidence-rendered"]
  end
  SPEC --> WK
  DW --> WK
  WK --> LF
  WK --> KB
  LF --> OBS
  KB -->|kb_gaps| OBS
  OBS --> DIG
  LF --> GATE
  GATE -.->|block regressed release| ORK
  classDef pub stroke:#3ecf8e;
  classDef priv stroke:#e0a94f;
  class SPEC,DW pub;
  class WK,LF,KB,OBS,GATE,DIG priv;

The bridge already exists: ork dual-writes analytics to the platform sink (CHANGELOG v7.87.0). Extend that channel to carry eval events — no new coupling, no public-install breakage.

The practice-score template — copy it for evals

already shipped in hq-ext

HQ already solved this exact shape for a different metric. /hq-ext:practice-score reads a weekly 5-dimension coaching score. A platform worker (cc_hooks_tasks/practice_score.py) computes it on a Sunday cron and exposes it via GET /api/practice-score/latest; a read-only skill then renders it, keyed by a confidence field so that a low-sample dimension shows ⚪ low instead of a fake number. That is precisely the honesty an eval digest needs.

flowchart LR
  CR["Sunday cron"] --> W["platform worker<br/>computes scores"]
  W --> API["/api/…/latest<br/>{value, sample_size,<br/>delta, confidence}"]
  API --> S["read-only digest skill<br/>renders w/ confidence"]
  S --> U([operator])
  classDef p stroke:#7c6cf0;
  class W,API,S p;

Reuse verbatim: swap "practice score" for "skill/agent eval score." The worker runs ork:eval-runner against ork's specs; the digest becomes /ork:eval-status (or an hq-ext skill) that renders pass-rate + 7-day delta + confidence.
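The confidence-keyed rendering is the part worth copying exactly, so here is a minimal sketch of it. The four payload fields (value, sample_size, delta, confidence) are the ones named in the template; the literal confidence value "low", the output wording, and the function name are assumptions, not the hq-ext skill's actual format.

```python
def render_dimension(name: str, payload: dict) -> str:
    """Render one eval dimension, refusing to print a number at low confidence."""
    if payload.get("confidence") == "low":
        # Too few samples: show the marker instead of a misleading pass rate.
        return f"{name}: ⚪ low (n={payload.get('sample_size', 0)})"
    text = f"{name}: {payload['value']:.1%}"
    delta = payload.get("delta")
    if delta is not None:
        text += f" ({delta:+.1%} 7d)"
    return text
```

The design choice is that the renderer, not the reader, decides when a number is trustworthy: a pass rate computed from two samples never reaches the operator as a percentage.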

HQ assets → which ork gap each one closes

real, callable infra

Every asset below exists in the HQ stack today. The mapping is the point: our weakest axis (Eval maturity 4/10) is mostly an execution + tracking problem, and HQ already owns that substrate.

The honest boundary

Do

  • Keep ork's eval-runner + specs self-contained; Langfuse reporting env-gated and degrading (it already is).
  • Run the real graded evals on HQ's authenticated runner — that's where API budget + secrets live.
  • Extend the existing dual-write sink to carry eval events; let HQ's worker pull + score.
  • Reuse observability-ops + the practice-score worker pattern instead of building ork-side monitors.

Don't

  • Add mcp__hq-* servers to orchestkit's committed .mcp.json — it breaks public installs and leaks private infra.
  • Make any ork skill/agent require Langfuse or hq-knowledge to function.
  • Put API-spend evals in orchestkit's public GH Actions on push — cost + secret-exposure on forks.
  • Duplicate HQ's Langfuse/worker infra inside ork — consume it downstream, don't rebuild it.

Net: ork gets a real, trended, gated eval system; the public plugin stays clean; HQ reuses its own Langfuse + worker + knowledge stack. Lower-confidence note: HQ endpoints (/api/practice-score, Langfuse project wiring) were read from the hq-ext plugin cache, not the live platform repo — confirm exact routes before building.
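The "env-gated and degrading" rule in the Do list can be illustrated with a reporter that forwards eval events only when a sink is configured and never lets sink trouble fail the run. This is a sketch under stated assumptions: the environment variable name ORK_EVAL_SINK_URL and the injected send callable are hypothetical, not ork's real configuration or transport.

```python
import os
from typing import Callable, Mapping, Optional

SINK_ENV_VAR = "ORK_EVAL_SINK_URL"  # hypothetical variable name


def report_eval_event(event: dict,
                      send: Callable[[str, dict], None],
                      env: Optional[Mapping[str, str]] = None) -> bool:
    """Forward one eval event to the sink; return True only if it was sent.

    With no sink configured (a public install) this is a silent no-op.
    """
    env = os.environ if env is None else env
    url = env.get(SINK_ENV_VAR)
    if not url:
        return False
    try:
        send(url, event)
    except Exception:
        # The sink is optional infrastructure: degrade instead of failing the eval.
        return False
    return True
```

Injecting send keeps the sketch free of any network dependency and lets the public test suite exercise both the configured and unconfigured paths.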