M126 Bundle B — Bundle in depth

Two features, one from CC 2.1.120 and one from CC 2.1.121, both shipping in one PR. Covers issues #1542 and #1545.
Versions: CC 2.1.120, CC 2.1.121 · Surface: CI YAML + 3 SKILL.md + 3 shell · Estimate: ~1.5 hr · Risk: low-medium

What CC 2.1.120 added

claude ultrareview is a non-interactive subcommand that runs /ultrareview from CI or scripts. It prints findings to stdout and exits 0 on completion or 1 on findings. Pass --json to get output suitable for parsing.
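A minimal sketch of how a CI script might branch on the exit codes described above. The function name review_status and the REVIEW_CMD override are hypothetical, added only so the sketch can be exercised without the real CLI; the exit-code meanings are taken from the description above and are not verified against the CLI itself.

```shell
#!/usr/bin/env bash
# Sketch: classify one non-interactive review run by its exit code.
# REVIEW_CMD is a hypothetical override (assumption, not a CLI feature);
# it defaults to the subcommand named in the text.
review_status() {
  local target="$1" rc=0
  # stdout is discarded here; a real job would capture the --json payload.
  ${REVIEW_CMD:-claude ultrareview} "$target" --json >/dev/null || rc=$?
  case "$rc" in
    0) echo "completed" ;;          # per the text: 0 on completion
    1) echo "findings" ;;           # per the text: 1 on findings
    *) echo "error (exit $rc)" ;;   # anything else is treated as a failure
  esac
  return "$rc"
}
```

A caller could then decide whether findings should block the job, for example `review_status "$PR_BRANCH" || echo "review flagged something"`.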

How OrchestKit adopts it

| Where | Change |
| --- | --- |
| .github/workflows/ultrareview.yml | NEW. Triggered on PR open/sync. Runs claude ultrareview $PR_BRANCH --json, posts severity-bucketed comment. |
| /ork:create-pr | Optionally invoke pre-PR-open. Surface findings in PR body's ## Pre-flight section. |
| /ork:review-pr | When user asks "ultra" review, defer to the CLI subcommand instead of re-implementing the multi-agent loop in skill instructions. Saves context, avoids drift. |
| tests/skills/test-ultrareview.sh | NEW smoke. Run claude ultrareview HEAD~1 --json against a known-bad commit and assert findings array non-empty. |

What CC 2.1.121 added

The env var CLAUDE_CODE_FORK_SUBAGENT=1 previously only worked in interactive sessions. As of 2.1.121 it also works in non-interactive paths (claude -p, SDK).
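As a rough illustration of the non-interactive path, the sketch below sets the variable for a single claude -p call. The function name run_grader and the GRADER_CMD override are hypothetical and exist only so the sketch can be tried without the CLI installed; the --bare flag is borrowed from the eval pipeline described in this document.

```shell
#!/usr/bin/env bash
# Sketch: run one grader prompt with subagent forking enabled.
# GRADER_CMD is a hypothetical override (assumption); the default mirrors
# the non-interactive invocation used by the eval scripts.
run_grader() {
  local prompt="$1"
  # The VAR=value prefix applies the variable to this one command only,
  # so the rest of the calling script's environment is left untouched.
  CLAUDE_CODE_FORK_SUBAGENT=1 ${GRADER_CMD:-claude -p --bare} "$prompt"
}
```

Setting the variable per call keeps the change local to the grader shell-out; exporting it once at the top of the script would be an equally valid choice.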

The eval pipeline state-leak class

OrchestKit runs ~100 graders per CI invocation via claude -p --bare. Without forking, graders inherit harness state from the previous run.

What leaks (today, env var unset)

| State | Persists across | Consequence |
| --- | --- | --- |
| memory MCP query cache | graders in same session | Grader sees stale memory hit from previous run; thinks fixture matches a past decision when it doesn't. |
| .claude/chain/*.json | filesystem (all graders) | Grader for "implement" thinks "explore" already ran because exploration.json is on disk. Skips upstream check. |
| ToolSearch deferred-tool cache | same session | First grader's deferred MCP load bleeds into next grader's tool registry. |
| model picker pref | same session | Grader N-1 set --model=opus; grader N inherits even though its prompt asked for haiku. |

Eval pipeline (today, before Bundle B)

┌──────────────────────────────────────────────────────────────┐
│ .github/workflows/orchestkit-eval.yml                        │
│ triggers on PR open + weekly cron                            │
└─┬────────────────────────────────────────────────────────────┘
  │
┌──────────────────────────────────────────────────────────────┐
│ tests/evals/scripts/run-{agent,skill,quality}-eval.sh        │
│ for each test case:                                          │
│   1. spawn skill/agent                                       │
│   2. capture output                                          │
│   3. claude -p --bare "$GRADER_PROMPT"  ←─ leaks state       │
│   4. parse score, append to results.jsonl                    │
└─┬────────────────────────────────────────────────────────────┘
  │
┌──────────────────────────────────────────────────────────────┐
│ tests/evals/scripts/check-eval-regression.sh                 │
│ diff results.jsonl vs baseline.jsonl                         │
│ fail CI if any score drifts > 0.5                            │
└─┬────────────────────────────────────────────────────────────┘
  │
┌──────────────────────────────────────────────────────────────┐
│ PR comment with score table                                  │
│ flaky: ~5-10% of graders need retry                          │
│ retry can produce different score than original              │
└──────────────────────────────────────────────────────────────┘
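The regression gate in the pipeline above comes down to a threshold comparison. The sketch below shows only that comparison on two already-extracted scores, because the record layout of results.jsonl and baseline.jsonl is not given here. The function name score_drifted is hypothetical; the 0.5 default comes from the pipeline description.

```shell
#!/usr/bin/env bash
# Sketch: succeed (exit 0) when a score moved by more than the threshold.
# Usage: score_drifted BASELINE NEW [THRESHOLD]   (threshold defaults to 0.5)
# awk is used because POSIX shell arithmetic is integer-only.
score_drifted() {
  awk -v base="$1" -v new="$2" -v limit="${3:-0.5}" 'BEGIN {
    delta = new - base
    if (delta < 0) delta = -delta      # absolute difference
    exit !(delta > limit)              # exit 0 only when drift exceeds limit
  }'
}
```

For example, `score_drifted 4.0 4.6 && echo "regression"` would flag the pair, while a move of exactly 0.5 would not, since the comparison is strict.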

After Bundle B

┌──────────────────────────────────────────────────────────────┐
│ .github/workflows/orchestkit-eval.yml                        │
│ env:                                                         │
│   CLAUDE_CODE_FORK_SUBAGENT: 1   ←── new                     │
└─┬────────────────────────────────────────────────────────────┘
  │
┌──────────────────────────────────────────────────────────────┐
│ run-{agent,skill,quality}-eval.sh                            │
│   3. CLAUDE_CODE_FORK_SUBAGENT=1 claude -p --bare ...        │
│      ↑ each grader = fresh forked subagent context           │
└─┬────────────────────────────────────────────────────────────┘
  │
┌──────────────────────────────────────────────────────────────┐
│ Same input → same score. Deterministic.                      │
│ Retry rate: 0% · drift rate: 0%                              │
│ Cost: +5-10% per grader (cold-start), but no retries,        │
│ so net cost is flat or lower.                                │
└──────────────────────────────────────────────────────────────┘

Files touched

| File | Issue | Change |
| --- | --- | --- |
| .github/workflows/ultrareview.yml | #1542 | NEW workflow file |
| .github/workflows/orchestkit-eval.yml | #1545 | Add env: CLAUDE_CODE_FORK_SUBAGENT: 1 |
| tests/evals/scripts/run-agent-eval.sh | #1545 | Set env var in grader shell-out |
| tests/evals/scripts/run-skill-eval.sh | #1545 | Same |
| tests/evals/scripts/run-quality-eval.sh | #1545 | Same |
| src/skills/bare-eval/SKILL.md | #1545 | Document the env var as the fix for grader determinism |
| src/skills/create-pr/SKILL.md | #1542 | Optional pre-flight ultrareview step |
| src/skills/review-pr/SKILL.md | #1542 | Defer "ultra" mode to CLI subcommand |
| tests/skills/test-ultrareview.sh | #1542 | NEW smoke test |
| tests/evals/scripts/test-grader-determinism.sh | #1545 | NEW: same input twice → assert identical scores |
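The determinism check listed last could look roughly like the sketch below. It is not the contents of test-grader-determinism.sh. The function name assert_deterministic is hypothetical, and the sketch assumes the grader command prints nothing but its score, so it compares the whole output rather than a parsed value.

```shell
#!/usr/bin/env bash
# Sketch: run the same command twice and fail if the two outputs differ.
# Everything after the function name is the command to run.
assert_deterministic() {
  local first second
  first="$("$@")" || return 2      # first run failed outright
  second="$("$@")" || return 2     # second run failed outright
  if [ "$first" = "$second" ]; then
    echo "deterministic: $first"
  else
    echo "drift: '$first' vs '$second'" >&2
    return 1
  fi
}
```

One way to call it would be `assert_deterministic env CLAUDE_CODE_FORK_SUBAGENT=1 claude -p --bare "$GRADER_PROMPT"`, which reruns an identical grader prompt with forking enabled.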

Backwards compatibility

Risk profile

Low-medium. CI workflow changes can fail in CI, but the blast radius is contained: no production code is touched, and the existing eval scripts already work. The env var is additive. The ultrareview workflow is entirely new, so it cannot break anything that currently runs.