Two features, one from CC 2.1.120 and one from CC 2.1.121, both shipping in one PR. Issues #1542 + #1545.
CC 2.1.120 · CC 2.1.121 · surface: CI YAML + 3 SKILL.md + 3 shell · est: ~1.5 hr · risk: low-medium
What CC 2.1.120 added
A non-interactive subcommand that runs /ultrareview from CI/scripts. Prints findings to stdout, exits 0 on completion, 1 on findings. --json for parsing.
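The exit-code contract above could be consumed from a CI step roughly as follows. This is a minimal sketch: the function name `classify_ultrareview_exit` is hypothetical, and treating any status other than 0 or 1 as a tool error is an assumption, not documented behavior.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an exit status from `claude ultrareview` to a label.
# 0 and 1 follow the contract described above.
# ASSUMPTION: any other status means the tool itself failed.
classify_ultrareview_exit() {
  case "$1" in
    0) echo "completed" ;;
    1) echo "findings" ;;
    *) echo "error" ;;
  esac
}

# Usage (assumes the claude CLI is on PATH and $PR_BRANCH is set by the CI job):
#   claude ultrareview "$PR_BRANCH" --json > findings.json
#   status=$?
#   classify_ultrareview_exit "$status"
```

Because a run with findings exits 1, a script using `set -e` would need to capture the status explicitly (as in the usage comment) rather than let the shell abort.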
How OrchestKit adopts it
| Where | Change |
| --- | --- |
| .github/workflows/ultrareview.yml | NEW. Triggered on PR open/sync. Runs claude ultrareview $PR_BRANCH --json, posts severity-bucketed comment. |
| | When user asks "ultra" review, defer to the CLI subcommand instead of re-implementing the multi-agent loop in skill instructions. Saves context, avoids drift. |
| tests/skills/test-ultrareview.sh | NEW smoke. Run claude ultrareview HEAD~1 --json against a known-bad commit and assert findings array non-empty. |
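The smoke test's "findings array non-empty" assertion might look like the sketch below. The JSON shape is an assumption: the text only says the output contains a findings array, so the top-level key name `findings` and the helper name `has_findings` are illustrative. It relies on `jq` being installed on the runner.

```shell
#!/usr/bin/env bash
# Hypothetical check for tests/skills/test-ultrareview.sh.
# ASSUMPTION: the --json output is an object with a top-level "findings" array.
# Reads JSON on stdin; succeeds only if that array exists and is non-empty.
has_findings() {
  # jq -e exits 0 when the last result is true, 1 when it is false or null.
  jq -e '(.findings // []) | length > 0' > /dev/null
}

# Usage (assumes HEAD~1 is the known-bad commit):
#   claude ultrareview HEAD~1 --json | has_findings \
#     || { echo "FAIL: expected at least one finding" >&2; exit 1; }
```

In the pipeline form, the status that matters is the one from `has_findings`; the exit 1 that `claude ultrareview` returns on findings is ignored unless `pipefail` is set.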
What CC 2.1.121 added
The env var CLAUDE_CODE_FORK_SUBAGENT=1 previously only worked in interactive sessions. As of 2.1.121 it also works in non-interactive paths (claude -p, SDK).
The eval pipeline state-leak class
OrchestKit runs ~100 graders per CI invocation via claude -p --bare. Without forking, graders inherit harness state from the previous run.
What leaks (today, env var unset)
| State | Persists across | Consequence |
| --- | --- | --- |
| memory MCP query cache | graders in same session | Grader sees stale memory hit from previous run; thinks fixture matches a past decision when it doesn't. |
| .claude/chain/*.json | filesystem (all graders) | Grader for "implement" thinks "explore" already ran because exploration.json is on disk. Skips upstream check. |
| ToolSearch deferred-tool cache | same session | First grader's deferred MCP load bleeds into next grader's tool registry. |
| model picker pref | same session | Grader N-1 set --model=opus; grader N inherits even though its prompt asked for haiku. |
Eval pipeline (today, before Bundle B)
┌──────────────────────────────────────────────────────────────┐
│ .github/workflows/orchestkit-eval.yml                        │
│   triggers on PR open + weekly cron                          │
└─┬────────────────────────────────────────────────────────────┘
  │
  ▼
┌──────────────────────────────────────────────────────────────┐
│ tests/evals/scripts/run-{agent,skill,quality}-eval.sh        │
│   for each test case:                                        │
│     1. spawn skill/agent                                     │
│     2. capture output                                        │
│     3. claude -p --bare "$GRADER_PROMPT"  ←─ leaks state     │
│     4. parse score, append to results.jsonl                  │
└─┬────────────────────────────────────────────────────────────┘
  │
  ▼
┌──────────────────────────────────────────────────────────────┐
│ tests/evals/scripts/check-eval-regression.sh                 │
│   diff results.jsonl vs baseline.jsonl                       │
│   fail CI if any score drifts > 0.5                          │
└─┬────────────────────────────────────────────────────────────┘
  │
  ▼
┌──────────────────────────────────────────────────────────────┐
│ PR comment with score table                                  │
│   flaky: ~5-10% of graders need retry                        │
│   retry can produce different score than original            │
└──────────────────────────────────────────────────────────────┘
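The regression gate in the diagram ("fail CI if any score drifts > 0.5") comes down to a floating-point comparison, which plain shell arithmetic cannot do. A minimal sketch of that comparison follows. The helper name `score_drifted` is hypothetical, and it takes two already-extracted scores rather than parsing results.jsonl, whose record format is not described here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 when |current - baseline| exceeds the threshold,
# exit 1 otherwise. awk is used because shell arithmetic is integer-only.
score_drifted() {
  local baseline="$1" current="$2" threshold="${3:-0.5}"
  awk -v b="$baseline" -v c="$current" -v t="$threshold" \
    'BEGIN { d = c - b; if (d < 0) d = -d; exit !(d > t) }'
}

# Usage:
#   if score_drifted "$baseline_score" "$new_score"; then
#     echo "regression: score moved by more than 0.5" >&2
#     exit 1
#   fi
```

The comparison is strict, so a drift of exactly 0.5 passes, matching the "> 0.5" wording.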
After Bundle B
┌──────────────────────────────────────────────────────────────┐
│ .github/workflows/orchestkit-eval.yml                        │
│   env:                                                       │
│     CLAUDE_CODE_FORK_SUBAGENT: 1  ←── new                    │
└─┬────────────────────────────────────────────────────────────┘
  │
  ▼
┌──────────────────────────────────────────────────────────────┐
│ run-{agent,skill,quality}-eval.sh                            │
│     3. CLAUDE_CODE_FORK_SUBAGENT=1 claude -p --bare ...      │
│        ↑ each grader = fresh forked subagent context         │
└─┬────────────────────────────────────────────────────────────┘
  │
  ▼
┌──────────────────────────────────────────────────────────────┐
│ Same input → same score. Deterministic.                      │
│ Retry rate: 0% · drift rate: 0%                              │
│ Cost: +5-10% per grader (cold-start), but no retries,        │
│ so net cost is flat or lower.                                │
└──────────────────────────────────────────────────────────────┘
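Step 3 of the diagram, setting the variable on the grader shell-out itself, could be wrapped as in the sketch below. `run_grader` and the `CLAUDE_BIN` override are hypothetical names introduced for illustration; `CLAUDE_BIN` is not part of the claude CLI and exists only so the wrapper can be pointed at a stub in tests.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for the grader shell-out in run-*-eval.sh.
# CLAUDE_BIN is an ASSUMED test hook, not a claude CLI feature;
# it defaults to the real `claude` binary.
run_grader() {
  local grader_prompt="$1"
  # The VAR=value prefix sets the variable for this one invocation only.
  CLAUDE_CODE_FORK_SUBAGENT=1 "${CLAUDE_BIN:-claude}" -p --bare "$grader_prompt"
}

# Usage:
#   score_output="$(run_grader "$GRADER_PROMPT")"
```

Setting the variable at the call site as well as in the workflow `env:` block keeps the scripts forked when they are run locally, outside CI.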
Files touched
| File | Issue | Change |
| --- | --- | --- |
| .github/workflows/ultrareview.yml | #1542 | NEW workflow file |
| .github/workflows/orchestkit-eval.yml | #1545 | Add env: CLAUDE_CODE_FORK_SUBAGENT: 1 |
| tests/evals/scripts/run-agent-eval.sh | #1545 | Set env var in grader shell-out |
| tests/evals/scripts/run-skill-eval.sh | #1545 | Same |
| tests/evals/scripts/run-quality-eval.sh | #1545 | Same |
| src/skills/bare-eval/SKILL.md | #1545 | Document the env var as the fix for grader determinism |
| src/skills/create-pr/SKILL.md | #1542 | Optional pre-flight ultrareview step |
| src/skills/review-pr/SKILL.md | #1542 | Defer "ultra" mode to CLI subcommand |
| tests/skills/test-ultrareview.sh | #1542 | NEW smoke test |
| tests/evals/scripts/test-grader-determinism.sh | #1545 | NEW: same input twice → assert identical scores |
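The core of the determinism test ("same input twice → assert identical scores") can be sketched as a small helper that runs any command twice and compares the output. The name `assert_deterministic` and the return codes are illustrative choices, not taken from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical core of tests/evals/scripts/test-grader-determinism.sh.
# Runs the given command twice; succeeds only if both runs print identical output.
# Returns 2 if either run fails outright, 1 if the outputs differ.
assert_deterministic() {
  local first second
  first="$("$@")" || return 2
  second="$("$@")" || return 2
  if [ "$first" != "$second" ]; then
    printf 'nondeterministic: run 1 = %s, run 2 = %s\n' "$first" "$second" >&2
    return 1
  fi
}

# Usage (grade_fixture is a placeholder for whatever prints one grader's score):
#   assert_deterministic grade_fixture tests/evals/fixtures/case-01.json
```

Comparing raw output is the strictest form of the check; a real test might compare only the parsed score field.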
Backwards compatibility
✓ claude ultrareview: requires CC ≥ 2.1.120; CI workflow checks version, falls back to running /ork:review-pr on older CC.
✓ FORK_SUBAGENT env var: requires CC ≥ 2.1.121 in non-interactive paths; older CC silently ignores the env var (no-op, no break).
✓ All existing eval invocations: continue to work unchanged on older CC. Determinism improves on 2.1.121+.
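The version check and fallback described in the first item could be sketched as below. Both function names are hypothetical. The version string is passed in as an argument because the output format of the CLI's version command is not specified here, and `sort -V` (version sort, from GNU coreutils) is assumed to be available on the CI runner.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when dotted version $1 is >= $2.
# ASSUMPTION: `sort -V` is available on the runner.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Pick the review path for a given CC version string.
pick_review_path() {
  if version_ge "$1" "2.1.120"; then
    echo "claude ultrareview"
  else
    echo "/ork:review-pr"
  fi
}

# Usage (how $cc_version is obtained is left to the workflow):
#   pick_review_path "$cc_version"
```

Version sort matters here because a plain string comparison would rank 2.1.9 above 2.1.120.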
Risk profile
Low-medium. The CI workflow changes can fail in CI, but the impact is contained: no production code is touched, and the existing eval scripts already work. The env var is additive. The ultrareview workflow is entirely new, so it cannot break anything that currently runs.