---
name: agentv-bench
description: >-
  Run AgentV evaluations and optimize agents through eval-driven iteration.
  Triggers: run evals, benchmark agents, optimize prompts/skills against evals, compare
  agent outputs across providers, analyze eval results, offline evaluation of recorded sessions,
  run autoresearch, optimize unattended, run overnight optimization loop.
  Not for: writing/editing eval YAML without running (use agentv-eval-writer),
  analyzing existing traces/JSONL without re-running (use agentv-trace-analyst).
---

# AgentV Bench


A skill for evaluating agents and iteratively improving them through data-driven optimization.

At a high level, the process goes like this:

- Understand what the agent does and what "good" looks like
- Write evaluation test cases (EVAL.yaml or evals.json)
- Run the agent on those test cases, grade the outputs
- Analyze the results — what's working, what's failing, and why
- Improve the agent's prompts/skills/config based on the analysis
- Repeat until you're satisfied

Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress. Maybe they want to start from scratch — help them write evals, run them, and iterate. Maybe they already have results — jump straight to analysis and improvement.

Be flexible. If the user says "I don't need a full benchmark, just help me debug this failure", do that instead.

After the agent is working well, you can also run description optimization to improve skill triggering accuracy (see `references/description-optimization.md`).

## Communicating with the user

This skill is used by people across a wide range of familiarity with evaluation tooling. Pay attention to context cues:

- "evaluation" and "benchmark" are borderline but OK in most cases
- For "YAML", "grader", "assertion", "deterministic judge": look for clear cues that the user knows what these mean before using them without explanation
- Briefly explain terms if in doubt

When presenting results, default to summary tables. Offer detail on request. In CI/headless mode, skip interactive prompts and exit with status codes.

---

## Step 1: Understand the Agent

Before running or optimizing, understand what you're working with.

1. **Read the agent's artifacts** — prompts, skills, configs, recent changes. Understand the full picture: what tools are available, what the expected input/output looks like, what constraints exist.

2. **Identify success criteria** — what does "good" look like for this agent? What are the edge cases? What would a failure look like? Talk to the user if this isn't clear from the artifacts alone.

3. **Understand the target harness** — which provider runs the agent (Claude, GPT, Copilot CLI, Gemini, custom CLI)? This affects what grader types are available and how to run tests. Targets are configured in `.agentv/targets.yaml` (canonical location, searched from the eval file directory upward). Sensitive values like `api_key` must use `${{ ENV_VAR }}` syntax — literal secrets are rejected as a security guardrail.

4. **Challenge assumptions** — if evals already exist, review their quality before running:
   - Are the test cases testing the right things?
   - Are assertions specific enough to catch real failures?
   - Are there ambiguous or contradictory test cases?
   - Flag eval issues before proceeding — running bad evals wastes time.

5. **Check integrity** — ensure task prompts (what the agent receives) are not also used as grader prompts (how outputs are scored). If a prompt file appears in both locations, note the overlap and optimize only for the task purpose.
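
The `${{ ENV_VAR }}` rule from item 3 is easy to check before a run. The sketch below is one hypothetical way to do that, not AgentV's actual validation: the function name `is_env_reference` and the whitespace tolerance inside the braces are assumptions.

```python
import re

# Hypothetical pattern: "${{ NAME }}" with optional inner whitespace.
# AgentV's real parser may accept a different grammar.
_ENV_REF = re.compile(r"^\$\{\{\s*[A-Za-z_][A-Za-z0-9_]*\s*\}\}$")


def is_env_reference(value: str) -> bool:
    """Return True when the value is an environment reference, not a literal."""
    return bool(_ENV_REF.match(value.strip()))


if __name__ == "__main__":
    for candidate in ("${{ MY_API_KEY }}", "sk-literal-secret"):
        print(candidate, "->", is_env_reference(candidate))
```

Running a check like this over sensitive fields before an eval surfaces literal secrets early, instead of at the point where the target config is rejected.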

---

## Step 2: Write Evaluations

AgentV supports two evaluation formats:

**EVAL.yaml** (native, full features) — supports workspaces, code graders, multi-turn conversations, tool trajectory scoring, workspace file tracking, multi-provider targets. Use this for agent evaluation.

```yaml
# example.eval.yaml
tests:
  - id: basic-code-review
    input: "Review this TypeScript file for bugs and suggest improvements"
    criteria: "Identifies the null pointer bug on line 12 and suggests a fix"
    assertions:
      - type: contains
        value: "null"
      - Review identifies the null pointer bug and suggests a concrete fix

workspace:
  template: ./workspace-template
  hooks:
    before_each:
      reset: fast
```

Multi-skill evaluation is handled naturally via input messages — describe the task in the test input, and the agent uses whatever skills it needs.

**evals.json** (skill-creator compatible) — auto-promoted to EVAL-equivalent format:
- `prompt` → input messages
- `expected_output` → reference answer
- `assertions` → graders
- `files[]` paths resolved relative to the evals.json location

```json
{
  "skill_name": "my-agent",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "assertions": ["Output includes error handling", "Uses async/await"]
    }
  ]
}
```

### Writing good test cases

Start with 2-3 realistic test cases — the kind of thing a real user would actually say. Share them with the user before running: "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?"

Good assertions are objectively verifiable and have descriptive names. Subjective quality ("the output is good") is better evaluated qualitatively — don't force assertions onto things that need human judgment.

**Grader types** (cheapest to most expensive): `exact`, `contains`, `regex`, `is-json`, `field-accuracy`, `composite`, `code-grader`, `tool-trajectory`, `llm-grader`. See `references/eval-yaml-spec.md` for full config and grading recipes for each type.

Prefer deterministic graders over LLM graders whenever possible. If an assertion can be checked with `contains` or `regex`, don't use `llm-grader`.
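
To see why deterministic graders are the cheap option, it helps to picture what they reduce to. The functions below are a minimal sketch of `contains`, `regex`, and `is-json` style checks; the names and exact semantics are illustrative assumptions, not AgentV's implementation.

```python
import json
import re


def check_contains(response: str, value: str) -> bool:
    """Pass when the literal substring appears in the response."""
    return value in response


def check_regex(response: str, pattern: str) -> bool:
    """Pass when the pattern matches anywhere in the response."""
    return re.search(pattern, response) is not None


def check_is_json(response: str) -> bool:
    """Pass when the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True
```

Each check is a local string operation with no model call, so it gives the same answer on every run. That repeatability is what makes it preferable whenever the assertion can be phrased this way.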

---

## Step 3: Run and Grade

This section is one continuous sequence — don't stop partway through.

Each run produces a new `.agentv/results/runs/<timestamp>/` directory automatically. Use timestamps to identify iterations when comparing runs.

### Choosing a run mode

**User instruction takes priority.** If the user says "run in subagent mode", "use subagent mode", or "use CLI mode", use that mode directly.

If the user has not specified a mode, use the `AGENT_EVAL_MODE` value from `.env` if it is set; otherwise default to `subagent`.

| `AGENT_EVAL_MODE` | Mode | How |
|----------------------|------|-----|
| `subagent` (default) | **Subagent mode** | Subagent-driven eval — parses eval.yaml, spawns executor + grader subagents. Zero CLI dependency. |
| `cli` | **AgentV CLI** | `agentv eval <path>` — end-to-end, multi-provider |

Set `AGENT_EVAL_MODE` in `.env` at the project root as the default when no mode is specified. If absent, default to `subagent`. **User instruction always overrides this.**

**`subagent`** — Parses eval.yaml directly, spawns executor subagents to run each test case in the current workspace, then spawns grader subagents to evaluate all assertion types natively. No CLI or external API calls required. Read `references/subagent-pipeline.md` for the detailed procedure.

**`cli`** — AgentV CLI handles execution, grading, and artifact generation end-to-end. Works with all providers. Use when you need multi-provider benchmarking or CLI-specific features.
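
The precedence described above (user instruction, then `AGENT_EVAL_MODE`, then `subagent`) can be sketched as a small resolver. The function name `resolve_mode` is hypothetical, `env` is assumed to be the already-loaded contents of `.env`, and falling through on an unrecognized value is an assumption rather than documented behavior.

```python
from typing import Mapping, Optional

VALID_MODES = ("subagent", "cli")


def resolve_mode(user_choice: Optional[str], env: Mapping[str, str]) -> str:
    """Pick the run mode: user instruction, then AGENT_EVAL_MODE, then subagent."""
    for candidate in (user_choice, env.get("AGENT_EVAL_MODE")):
        # Assumption: unknown or empty values are ignored, not treated as errors.
        if candidate and candidate.strip().lower() in VALID_MODES:
            return candidate.strip().lower()
    return "subagent"
```

For example, `resolve_mode("cli", {"AGENT_EVAL_MODE": "subagent"})` returns `"cli"` because the user instruction wins over the `.env` default.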

### Running evaluations

**AgentV CLI mode** (end-to-end, EVAL.yaml):
```bash
agentv eval <eval-path> --output .agentv/artifacts/
```

**Subagent mode** — read `references/subagent-pipeline.md` for the detailed procedure. In brief: use `pipeline input` to extract inputs, dispatch one `executor` subagent per test case (all in parallel), then proceed to grading below.

**Spawn all runs in the same turn.** For each test case that needs both a "with change" and a "baseline" run, launch them simultaneously. Don't run one set first and come back for the other — launch everything at once so results arrive around the same time.

**Multi-target benchmarking:**
```bash
agentv eval <eval-path> --target claude --target gpt --target copilot
```

**Baseline strategy:**
- **New agent**: baseline is "no prompt" or minimal prompt — same eval, no agent-specific configuration
- **Improving existing**: snapshot the current version before editing (`cp -r <prompt-dir> <workspace>/prompt-snapshot/`), use as baseline throughout
- **Multi-target**: each target is its own baseline — no need for a separate "without" run

### While runs are in progress, draft graders

Don't just wait for runs to finish — use this time productively. If assertions don't exist yet, draft them now. If they exist, review them and explain what they check to the user.

Good assertions are *discriminating* — they pass when the agent genuinely succeeds and fail when it doesn't. An assertion that passes for both good and bad outputs is worse than no assertion.

### As runs complete, capture timing data

When each subagent task completes, you receive a notification containing `total_tokens` and `duration_ms`. **Save this data immediately** to `timing.json` in the run directory. See `references/schemas.md` for the timing.json schema.

This is the only opportunity to capture this data — it comes through the task notification and isn't persisted elsewhere. Process each notification as it arrives.
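
As a rough illustration, saving the notification data could look like the sketch below. The helper name `save_timing` is hypothetical, and `total_duration_seconds` is assumed to be `duration_ms` converted to seconds; check `references/schemas.md` for the authoritative field definitions.

```python
import json
from pathlib import Path


def save_timing(run_dir: Path, total_tokens: int, duration_ms: int) -> Path:
    """Write timing.json for one completed task and return its path."""
    payload = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # Assumption: this field is duration_ms expressed in seconds.
        "total_duration_seconds": duration_ms / 1000,
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    target = run_dir / "timing.json"
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target
```

Calling this as soon as each notification arrives keeps the numbers on disk even if the orchestrator's context is later summarized.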

### Grading

**In CLI mode**, `agentv eval` handles all grading end-to-end — no manual phases needed.

**In subagent mode**, grading has three phases. **All three are required — do not stop after phase 1.**

**Phase 1: Code graders** (deterministic, zero-cost)

```bash
agentv pipeline grade <run-dir>
```

This evaluates all deterministic assertions against `response.md` files. Two types are handled:
- **`code-grader` scripts** — external scripts executed against the response (arbitrary logic, any language)
- **Built-in assertion types** — evaluated in-process: `contains`, `contains-any`, `contains-all`, `icontains`, `regex`, `equals`, `starts-with`, `ends-with`, `is-json`, and variants

Both types are configured by `pipeline input` into `code_graders/<name>.json` and graded by `pipeline grade`. Results are written to `<test-id>/code_grader_results/<name>.json`. Alternatively, pass `--grader-type code` to `pipeline run` to run these inline.

**Do not dispatch LLM grader subagents for tests that only have `contains`, `regex`, or other built-in assertions** — `pipeline grade` handles them entirely, at zero cost. To detect which tests need Phase 2, check whether `<test-id>/llm_graders/` contains any `.json` config files — `pipeline input` only writes there for `llm-grader` assertions. Tests with an empty (or missing) `llm_graders/` directory are done after Phase 1.
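
The `llm_graders/` check can be scripted. The sketch below assumes the test directories are direct children of the directory you pass in (for example `<run-dir>/<evalset>/`); the function name `find_llm_graded_tests` is hypothetical.

```python
from pathlib import Path
from typing import List


def find_llm_graded_tests(evalset_dir: Path) -> List[str]:
    """Return test ids whose llm_graders/ directory holds at least one .json config."""
    pending = []
    for case_dir in sorted(p for p in evalset_dir.iterdir() if p.is_dir()):
        graders_dir = case_dir / "llm_graders"
        # A missing or empty llm_graders/ means the test is done after Phase 1.
        if graders_dir.is_dir() and any(graders_dir.glob("*.json")):
            pending.append(case_dir.name)
    return pending
```

Only the test ids returned here need grader subagents in Phase 2.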

**Phase 2: LLM grading** (semantic — do NOT skip this phase)

Dispatch one `grader` subagent per (test × LLM grader) pair, **all in parallel**. Do not write a script to call an LLM API instead — the grader subagents use their own reasoning, which IS the LLM grading.
Example: 5 tests × 2 LLM graders = 10 grader subagents launched simultaneously.

**Do NOT dispatch a single grader for multiple tests.** Each subagent grades exactly one (test, grader) pair.

**Before dispatching graders, read `agents/grader.md` and embed its full content as the system instructions in every grader subagent prompt.** The grader is a `general-purpose` task agent — there is no auto-resolved "grader" type. Without `agents/grader.md` embedded verbatim, the subagent has no grading process, no output format, and no file-path knowledge, and will produce empty or incorrect output.

Each grader subagent (operating under `agents/grader.md` instructions):
1. Reads `<test-id>/llm_graders/<name>.json` for the grading prompt
2. Reads `<test-id>/response.md` for the candidate output
3. Grades the response against the prompt criteria
4. **Writes its result to disk**: `<run-dir>/<evalset>/<test-id>/llm_grader_results/<name>.json`
5. Returns score (0.0–1.0) and per-assertion evidence to the orchestrator

**Writing to disk is critical.** Assertion arrays are lost if accumulated only in the orchestrator's context across multiple batches (context summarization drops detail). Writing per-test results to `llm_grader_results/<name>.json` makes grading resumable and assertion evidence durable.

The result file format is:
```json
{ "score": 0.85, "assertions": [{"text": "...", "passed": true, "evidence": "..."}] }
```

After **all** grader subagents complete, run Phase 3 directly.

**Phase 3: Merge and validate**

```bash
agentv pipeline bench <run-dir>
agentv results validate <run-dir>
```

`pipeline bench` reads LLM grader results from `llm_grader_results/<name>.json` per test automatically, merges with code-grader scores, computes the weighted `pass_rate`, and writes `grading.json` + `index.jsonl` + `benchmark.json`.

> **Diagnosing `pass_rate=0`:** If `pipeline bench` reports `pass_rate=0` across the board, do **not** assume the tests genuinely failed. First verify the grading pipeline ran correctly: check that `<test-id>/llm_grader_results/<name>.json` exists and is non-empty for each test. If these files are absent or empty, the grader subagents failed to produce output (most common cause: `agents/grader.md` was not embedded in the subagent prompts — see Phase 2). Treat `pass_rate=0` as a real signal only after confirming grader results exist.
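
The existence check described in the note above can be automated. This is a minimal sketch under the same layout assumption as Phase 2 (test directories directly under the directory passed in); the helper name `missing_grader_results` is hypothetical.

```python
from pathlib import Path
from typing import List


def missing_grader_results(evalset_dir: Path) -> List[str]:
    """List '<test-id>/<name>' pairs whose LLM grader result is absent or empty."""
    missing = []
    for config in sorted(evalset_dir.glob("*/llm_graders/*.json")):
        case_dir = config.parent.parent
        result = case_dir / "llm_grader_results" / config.name
        if not result.is_file() or result.stat().st_size == 0:
            missing.append(f"{case_dir.name}/{config.stem}")
    return missing
```

An empty list means every configured LLM grader produced output, so a `pass_rate=0` can be read as a real signal. A non-empty list points at the (test, grader) pairs to re-dispatch.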

### Artifacts

All artifacts use established schemas — see `references/schemas.md` for the full definitions. Do not modify the structure. Key artifacts per run:
- **grading.json**: per-test assertions with `{text, passed, evidence}`, plus summary
- **timing.json**: `{total_tokens, duration_ms, total_duration_seconds}`
- **benchmark.json**: per-target aggregate `{pass_rate, time_seconds, tokens}`

Write artifacts to `.agentv/artifacts/` or the iteration directory.

### Workspace features (EVAL.yaml only)

- **Workspace isolation** — clone repos, run setup/teardown hooks (before_all, before_each, after_each, after_all)
- **Materialization modes** — `pooled` (reuse slots), `temp` (fresh per run), `static` (existing dir)
- **Multi-repo** — clone multiple repos with sparse checkout and shallow clone support
- **File change tracking** — grade by diffing workspace files before/after agent execution

---

## Step 4: Analyze Results

Once all runs are graded, analyze the results before attempting improvements.

### Pattern analysis

Read the JSONL results and look for:

- **Always-pass tests** — assertion too loose or non-discriminating. If it passes for both good and bad outputs, it's not testing anything.
- **Always-fail tests** — task impossible, eval broken, or assertion misconfigured. Don't optimize against broken evals.
- **Flaky tests** — non-deterministic results across runs. Investigate before treating failures as real.
- **Systematic failures** — same failure pattern across multiple tests. This usually points to a missing instruction or wrong approach.
- **Deterministic upgrade candidates** — `llm-grader` assertions that could be replaced with `contains`, `regex`, or `is-json` (cheaper, faster, more reliable).
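
The first three patterns can be spotted mechanically once you have pass/fail outcomes per test across repeated runs. The sketch below takes a plain dictionary as input; how you extract those outcomes from the JSONL results is left out, and the function name `classify_tests` is hypothetical.

```python
from typing import Dict, List


def classify_tests(history: Dict[str, List[bool]]) -> Dict[str, str]:
    """Label each test from its pass/fail outcomes across repeated runs."""
    labels = {}
    for test_id, outcomes in history.items():
        if not outcomes:
            labels[test_id] = "no-data"
        elif all(outcomes):
            labels[test_id] = "always-pass"
        elif not any(outcomes):
            labels[test_id] = "always-fail"
        else:
            labels[test_id] = "flaky"
    return labels
```

The "flaky" label is only meaningful when the runs used the same agent version. Mixed outcomes across different iterations may simply reflect the changes you made.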

### Dispatch subagents

- **Dispatch `analyzer`** (read `agents/analyzer.md`) for a structured quality audit: deterministic upgrade suggestions, weak assertion detection, cost/quality flags, and benchmark pattern analysis.

- **Dispatch `comparator`** (read `agents/comparator.md`) for blind N-way comparison between iterations or targets. The comparator blinds provider identities, generates task-specific rubrics, scores each output, then unblinds and attributes improvements.

### Trace analysis

Use CLI tools for deeper investigation:
```bash
agentv inspect <results-file>          # Detailed execution trace inspection
agentv compare <file-a> <file-b>     # Structured diff between runs
```

Look for: tool call patterns, error recovery behavior, conversation flow, wasted steps.

### Present results to the user

Show a summary table:

```
| Test ID          | Score | Pass/Fail | Delta | Notes                    |
|------------------|-------|-----------|-------|--------------------------|
| basic-code-review| 0.85  | ✓ PASS    | +0.15 | Found the bug this time  |
| edge-case-empty  | 0.00  | ✗ FAIL    | —     | Crashed on empty input   |
```

Highlight:
- Current pass rate and delta from baseline
- Comparison results (which target/iteration won and why)
- Analyzer observations the aggregate stats would hide

Ask: "How does this look? Anything you'd change about the evals or the approach?"

---

## Step 5: Improve

This is the heart of the loop. You've run the test cases, analyzed the results, and now you need to make the agent better.

### How to think about improvements

1. **Generalize from the analysis.** You're iterating on a small eval set, but the agent will be used on many different inputs. Don't overfit to specific test cases. Rather than fiddly patches or oppressively rigid MUSTs, try different approaches and see what works. It's cheap to experiment.

2. **Keep the prompt lean.** Read the execution transcripts, not just the final outputs. If the agent wastes time on unproductive steps, remove the instructions causing that. If it always ignores a section, that section isn't pulling its weight.

3. **Explain the why.** Today's LLMs are smart. They have good theory of mind and can go beyond rote instructions when given good reasoning. If you find yourself writing ALWAYS or NEVER in all caps, that's a yellow flag — reframe as an explanation of why the thing matters. That's more humane, powerful, and effective.

4. **Look for repeated work.** Read the transcripts from test runs and notice if the agent independently takes the same multi-step approach to something across cases. If all test runs result in writing the same helper script, bundle it. If every run makes the same mistake, the instruction is missing or unclear.

### Applying changes

- **Surgical edits**: ADD (new rule for a missing constraint), UPDATE (refine for clarity), DELETE (remove redundant or harmful rules), NEGATIVE CONSTRAINT (explicitly state what NOT to do)
- **One change per iteration** to isolate effects. If you change three things and the score improves, you don't know which change helped.
- **Variant tracking**: When a change helps some tests but hurts others, maintain 2-3 prompt variants. Compare variants to find the best overall approach before converging.
- **When converging**: Generalize specific patches into broad principles. Remove redundancy and contradictions. Ensure the prompt is clear, focused, and under 200 lines.

### Evaluation integrity

**Critical**: Only optimize **task prompts** (what the agent receives), never **judge prompts** (how graders score outputs). Modifying judge prompts games the evaluation without improving the agent.

If a prompt file is referenced in both task input and grader configs, optimize for the task purpose only. Document which prompts were modified in the optimization log.

### The iteration loop

After improving:

1. Apply your changes to the agent's prompts/skills/config
2. Re-run all test cases (agentv creates a new `.agentv/results/runs/<timestamp>/` directory automatically)
3. Compare against the previous iteration (Step 4). If running in automated mode, use the **automated keep/discard** logic below instead of manual judgment — it will decide whether to keep or revert the change for you.
4. Present results to the user (or log the decision if running automated keep/discard)
5. Stop when ANY of:
   - The user says they're happy
   - The user's feedback is empty (everything looks good)
   - You're not making meaningful progress (no improvement for 2 consecutive iterations)
   - Target pass rate is reached
   - Maximum iterations exhausted

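**Human checkpoints**: At iterations 3, 6, and 9, always present progress to the user, even when automated keep/discard is enabled. The one exception is autoresearch mode, which skips these checkpoints because the user has opted in to unattended operation (see `references/autoresearch.md`). Push back if optimization is accumulating contradictory rules or overfitting to specific test cases.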

### Automated keep/discard

For autonomous iteration, use `agentv compare --json` to automatically decide whether to keep or discard each change based on wins/losses/ties. Read `references/autoresearch.md` for the full decision rules, logging format, and integration with the iteration loop.

---

## Entering Mid-Lifecycle

Users can start at any step by providing existing data:

| Entry point | Required input | Example prompt |
|------------|---------------|----------------|
| Step 1 (Understand) | `eval-path` | "Optimize my agent against evals/support.yaml" |
| Step 2 (Write Evals) | Agent artifacts | "Write evals for this agent" |
| Step 3 (Run + Grade) | `eval-path` | "Run this eval and show me results" |
| Step 4 (Analyze) | `results-path` | "Analyze why my agent is failing on these results" |
| Step 5 (Improve) | Analysis + strategy | "Apply these optimization suggestions" |

When entering mid-lifecycle, run only the requested step and subsequent steps. Don't re-run earlier steps unless the user requests a full loop.

---

## Advanced: Blind Comparison

For situations where you want a rigorous comparison between two versions (e.g., "is the new version actually better?"), dispatch the `comparator` subagent. It blinds identities, generates task-specific rubrics, scores outputs, then unblinds and explains why the winner won.

This is optional and requires subagents. The human review loop is usually sufficient.

---

## Description Optimization

After the agent is working well, offer to optimize the skill's `description` field for better triggering accuracy. Read `references/description-optimization.md` for the full procedure (generate trigger EVAL.yaml, review with user, iterate, apply).

---

## Autoresearch Mode

Autoresearch is an unattended eval-improve loop that runs multiple optimize cycles without human intervention. The user triggers it with natural language (e.g., "run autoresearch on this skill", "optimize this skill unattended"). It uses the mutator subagent (`agents/mutator.md`) to rewrite artifacts based on failure analysis, and automated keep/discard to decide whether to keep or revert each change.

Read `references/autoresearch.md` for the full procedure (prerequisites, artifact layout, keep/discard rules, the step-by-step loop, convergence criteria, and context hygiene).

---

## Environment Adaptation

For provider-specific notes (Copilot, Codex, Claude SDK, custom CLI), CI/headless mode behavior, and fallback strategies when subagents aren't available, read `references/environment-adaptation.md`.

---

## Subagent Reference

The `agents/` directory contains instructions for specialized subagents. Read them when you need to spawn the relevant subagent.

| Agent | File | Purpose | When to dispatch |
|-------|------|---------|-----------------|
| executor | `agents/executor.md` | Perform test case tasks as the target agent | Step 3 (agent targets — one per test case) |
| grader | `agents/grader.md` | Grade responses with per-assertion evidence | Step 3 (grading — one per test × LLM grader pair) |
| comparator | `agents/comparator.md` | Blind N-way comparison + post-hoc analysis | Step 4 (comparing iterations/targets) |
| analyzer | `agents/analyzer.md` | Quality audit, deterministic upgrades, benchmarks | Step 4 (pattern analysis) |
| mutator | `agents/mutator.md` | Rewrite artifact from failure analysis | Step 5 (autoresearch — dispatched per cycle) |

The `references/` directory has additional documentation:
- `references/autoresearch.md` — Autoresearch unattended optimization loop and automated keep/discard rules
- `references/eval-yaml-spec.md` — Eval YAML schema and assertion grading recipes
- `references/subagent-pipeline.md` — Detailed subagent-mode pipeline commands and output structure
- `references/description-optimization.md` — Skill description optimization workflow
- `references/environment-adaptation.md` — Provider-specific notes and CI/headless behavior
- `references/schemas.md` — JSON schemas for all artifacts (grading.json, benchmark.json, etc.)
- `references/migrating-from-skill-creator.md` — Guide for users coming from Anthropic's skill-creator

---

Repeating the core loop for emphasis:

- Understand what the agent does
- Write evaluation test cases
- Run the agent and grade outputs
- Analyze results: surface patterns, dispatch analyzer and comparator subagents
- Improve the agent based on analysis
- Repeat until you and the user are satisfied

Take your time with improvements. Read the transcripts. Understand why failures happened. Make changes that generalize beyond the test set. This is important work.

--- references/autoresearch.md ---
# Autoresearch Mode

Autoresearch is an unattended eval-improve loop that runs multiple optimize cycles without human intervention. The user triggers it with natural language (e.g., "run autoresearch on this skill", "optimize this skill unattended"). No YAML schema changes or CLI flags are needed.

## Automated Keep/Discard

After each iteration, you can automatically decide whether to keep or discard the change using structured comparison output. This replaces manual judgment at steps 3–4 of the iteration loop (Step 5 in SKILL.md), except at human checkpoint iterations (3, 6, 9) where you must still present results to the user.

### 1. Run the comparison

After re-running test cases, compare the new results against the previous iteration's baseline:

```bash
agentv compare <baseline>.jsonl <candidate>.jsonl --json
```

Where `<baseline>.jsonl` is the `index.jsonl` from the previous best iteration and `<candidate>.jsonl` is the `index.jsonl` from the run you just completed.

### 2. Parse the output

The `--json` flag produces structured output:

```json
{
  "summary": {
    "wins": 3,
    "losses": 1,
    "ties": 6,
    "mean_delta": 0.05
  }
}
```

- **wins**: number of test cases where the candidate scored higher than the baseline
- **losses**: number of test cases where the candidate scored lower
- **ties**: number of test cases with no score change
- **mean_delta**: average score difference across all test cases (positive = candidate is better)

### 3. Apply decision rules

Apply these rules in order; the first row that matches decides the outcome:

| Condition | Decision | Action |
|-----------|----------|--------|
| `wins > losses` | **KEEP** | Promote the candidate to the new baseline. Copy or note its `index.jsonl` path as the baseline for the next iteration. |
| `mean_delta == 0` AND candidate prompt is shorter (fewer lines) | **KEEP** | Simpler prompts are preferred when performance is equal. Promote the candidate as the new baseline. |
| Otherwise (`wins <= losses`) | **DISCARD** | Revert the prompt/skill/config change. The previous baseline remains. Try a different mutation on the next iteration. |

The tie-breaker row is checked before the discard row so that an equal-performing but shorter prompt is kept. When `wins <= losses`, `mean_delta == 0`, and the candidate prompt is *not* shorter, the result is a **DISCARD**, since there's no reason to keep a change that adds complexity without improving results.
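
As a worked illustration, the decision can be written as a short function over the `summary` object. This sketch reads the rules so that the shorter-prompt tie-breaker is evaluated before the discard rule, which is an interpretation; the function name `decide` and the line-count arguments are hypothetical.

```python
from typing import Mapping


def decide(summary: Mapping[str, float], baseline_lines: int, candidate_lines: int) -> str:
    """Return "KEEP" or "DISCARD" for one iteration."""
    if summary["wins"] > summary["losses"]:
        return "KEEP"
    # Tie-breaker: equal performance, but the candidate prompt is shorter.
    if summary["mean_delta"] == 0 and candidate_lines < baseline_lines:
        return "KEEP"
    return "DISCARD"


if __name__ == "__main__":
    example = {"wins": 3, "losses": 1, "ties": 6, "mean_delta": 0.05}
    print(decide(example, baseline_lines=120, candidate_lines=125))
```

The `summary` mapping is what you get from parsing the `--json` output and taking its `summary` key. Line counts would come from the prompt files before and after the change.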

### 4. Log the decision

Before proceeding to the next iteration, log the decision and rationale so the user can review later:

```
Iteration 2: KEEP
  wins=3, losses=1, ties=6, mean_delta=+0.05
  Rationale: candidate wins outweigh losses (3 > 1)
  Baseline promoted: .agentv/results/runs/20250101-120000/index.jsonl
```

```
Iteration 3: DISCARD
  wins=1, losses=2, ties=7, mean_delta=-0.03
  Rationale: candidate losses outweigh wins (2 > 1)
  Reverted to baseline: .agentv/results/runs/20250101-110000/index.jsonl
  Next: try a different mutation
```

Include this log in your progress summary. At human checkpoints (iterations 3, 6, 9), present the full log of automated decisions since the last checkpoint alongside the current results.

### 5. Integration with the iteration loop

The automated keep/discard replaces the manual compare-and-present cycle (steps 3–4) during non-checkpoint iterations. The full flow becomes:

1. Apply change to prompts/skills/config
2. Re-run all test cases
3. Run `agentv compare baseline.jsonl candidate.jsonl --json`
4. Apply keep/discard rules → promote or revert
5. Log the decision
6. If this is iteration 3, 6, or 9 → present progress to the user (human checkpoint)
7. Check stop conditions → continue or stop

Both modes coexist: if the user is actively reviewing results, present to them as before. If the user has asked you to iterate autonomously, use automated keep/discard and only pause at human checkpoints.

---

## Prerequisites

- An eval file (`EVAL.yaml` or `evals.json`) must exist for the artifact being optimized.
- The artifact must be a file or directory (SKILL.md, prompt template, agent config, or a directory of related files like a skill with references/).
- The user should have run at least one interactive eval cycle to build confidence in eval quality before going unattended.

## The loop

```
1. RUN EVAL   — agentv eval with current artifact
2. ANALYZE    — dispatch analyzer subagent on results
3. DECIDE     — if score > best_score: KEEP, else DROP (automated keep/discard above)
4. MUTATE     — dispatch mutator subagent with failure analysis (agents/mutator.md)
5. GOTO 1     — until convergence or max_cycles
```

## Experiment naming

Derive the experiment name from the artifact: `autoresearch-<name>` (e.g., `autoresearch-pdf-skill`). The user can also provide a custom name.

## Artifact mutation flow

The mutator rewrites artifacts in the working tree in place. **Git is used for versioning** — HEAD always contains the best-known version:

1. Record the starting commit SHA before the first cycle: `initial_sha=$(git rev-parse HEAD)`.
2. On each **KEEP**: `git add <artifact-path> && git commit -m "autoresearch cycle N: <mutation summary>"`.
3. On each **DROP**: `git checkout -- <artifact-path>` (restores working tree to HEAD, the last KEEP commit).
4. The eval always runs against the real file path — no temp files or indirection.
5. The mutator can reference the original via `git show <initial_sha>:<path>`.
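
The KEEP and DROP steps above map onto a handful of git invocations. The sketch below only builds the argument lists, plus a small runner, so the commands can be inspected before anything touches the repository. The helper names are hypothetical; the git commands themselves are the ones listed above.

```python
import subprocess
from typing import List


def keep_commands(artifact_path: str, cycle: int, summary: str) -> List[List[str]]:
    """Commands for a KEEP: stage the artifact and commit it."""
    message = f"autoresearch cycle {cycle}: {summary}"
    return [
        ["git", "add", artifact_path],
        ["git", "commit", "-m", message],
    ]


def drop_commands(artifact_path: str) -> List[List[str]]:
    """Commands for a DROP: restore the artifact from HEAD."""
    return [["git", "checkout", "--", artifact_path]]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Passing argument lists instead of a shell string avoids quoting problems when the mutation summary contains spaces or quotes.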

## How the skill invokes eval

Shell out to `agentv eval <eval-path> --experiment autoresearch-<name>` via the Bash tool, same as the existing interactive bench workflow.

## Artifact layout

Each cycle is a standard eval run. Autoresearch session metadata lives in `_autoresearch/` within the experiment directory:

```
.agentv/results/runs/<experiment>/
  _autoresearch/
    iterations.jsonl               # one line per cycle — data for chart + mutator
    trajectory.html                # live-updating score trajectory chart
  2026-04-15T10-30-00/             # cycle 1 — standard run artifacts
    index.jsonl
    grading.json
    timing.json
    benchmark.json
    report.html
  2026-04-15T10-35-00/             # cycle 2 — standard run artifacts
    ...
```

No `original.md` or `best.md` files — git history serves as the backup. The `_` prefix convention distinguishes workflow folders from timestamped run dirs.

## iterations.jsonl

One JSON object per line, one line per cycle:

```jsonl
{"cycle":1,"score":0.65,"decision":"keep","cost_usd":0.12,"assertions":{"IDENTIFIES_BUG":0.8,"SUGGESTS_FIX":0.4},"mutation":"added explicit null-check instruction","run_dir":"2026-04-15T10-30-00","timestamp":"2026-04-15T10:32:15Z"}
```

Fields: `cycle` (1-indexed), `score` (overall pass rate 0–1), `decision` ("keep" or "drop"), `cost_usd` (eval run cost), `assertions` (per-assertion pass rates), `mutation` (one-line description of what changed), `run_dir` (timestamped directory name), `timestamp` (ISO 8601).
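
To make the record shape concrete, here is a minimal sketch that appends one cycle to `iterations.jsonl`. The function name `append_iteration` is hypothetical. In the loop itself the same append can be done from bash, as the context hygiene section recommends; Python is used here only to spell out the fields.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict


def append_iteration(log_path: Path, cycle: int, score: float, decision: str,
                     cost_usd: float, assertions: Dict[str, float],
                     mutation: str, run_dir: str) -> dict:
    """Append one cycle record to iterations.jsonl and return it."""
    if decision not in ("keep", "drop"):
        raise ValueError(f"decision must be 'keep' or 'drop', got {decision!r}")
    record = {
        "cycle": cycle,
        "score": score,
        "decision": decision,
        "cost_usd": cost_usd,
        "assertions": assertions,
        "mutation": mutation,
        "run_dir": run_dir,
        # ISO 8601 timestamp in UTC, matching the example line above.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record
```

Opening the file in append mode keeps earlier cycles intact, so the chart can re-read the whole history on each refresh.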

## trajectory.html

A standalone HTML chart file with embedded Chart.js. Copy the template from `scripts/trajectory.html` into the `_autoresearch/` directory. It fetches `iterations.jsonl` from the same directory on each auto-refresh — no data injection needed. Shows:

- Score over iterations (line chart) with KEEP (green) / DISCARD (red) markers
- Per-assertion pass rates over iterations
- Cumulative cost across iterations
- Best vs original score summary

Auto-refreshes every 2 seconds during the loop. Becomes static after completion (remove the auto-refresh meta tag on final update).

## Convergence

Stop after **3** consecutive cycles with no improvement (no KEEP). Also stop at **max_cycles** (default 10). Either limit can be overridden by the user.
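
The two stop limits interact as in the sketch below, which replays a list of per-cycle scores in place of real eval runs. It assumes a cycle counts as a KEEP when its score beats the best so far, following the loop summary earlier in this file; the function name `run_until_converged` is hypothetical.

```python
from typing import List, Tuple


def run_until_converged(scores: List[float], max_cycles: int = 10,
                        max_convergence: int = 3) -> Tuple[int, float]:
    """Replay per-cycle scores and return (cycles_run, best_score)."""
    best_score = 0.0
    convergence_count = 0
    cycles_run = 0
    for score in scores[:max_cycles]:
        if convergence_count >= max_convergence:
            break
        cycles_run += 1
        if score > best_score:
            best_score = score
            convergence_count = 0  # a KEEP resets the no-improvement streak
        else:
            convergence_count += 1
    return cycles_run, best_score
```

With scores `[0.5, 0.6, 0.6, 0.55, 0.6, 0.9]` the loop stops after five cycles: cycles 3 to 5 bring no improvement, so the sixth cycle, which would have scored 0.9, never runs. That is the trade-off a low convergence limit makes.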

## Human checkpoints

Autoresearch mode **skips** human checkpoints at iterations 3/6/9. The user opted in to unattended operation by requesting autoresearch.

## Context hygiene

The orchestrator must run indefinitely without exhausting its context window. To do this:

- **Never read eval results, artifacts, or transcripts into your own context.** Use bash commands (jq, agentv CLI) that output small structured summaries.
- **Delegate all heavy reading to subagents.** The mutator reads artifacts, grading results, and transcripts from disk — you pass it paths, not content.
- **Use bash for all file I/O** in the loop body: appending to `iterations.jsonl`, git operations, score extraction. The only tool calls per cycle should be bash commands and one subagent dispatch (mutator).
- **trajectory.html auto-loads `iterations.jsonl`** via fetch — no need to read or update the HTML file after initial copy.

## Procedure

Follow this step-by-step procedure to execute autoresearch:

### 1. Setup

1. Determine the **artifact path** (file or directory to optimize) and **eval path** (EVAL.yaml or evals.json).
2. Detect **artifact mode**: `file` if the artifact path is a file, `directory` if it's a directory.
3. Derive the **experiment name**: `autoresearch-<name>` from the artifact filename/dirname, or use a user-provided name.
4. Set the experiment directory: `.agentv/results/runs/<experiment>/`.
5. Create the `_autoresearch/` subdirectory inside the experiment directory.
6. Record `initial_sha=$(git rev-parse HEAD)` — the commit before any mutations.
7. Copy `scripts/trajectory.html` to `_autoresearch/trajectory.html`.
8. Initialize variables:
   - `best_score = 0`
   - `convergence_count = 0`
   - `cycle = 1`
   - `max_cycles = 10` (or user-specified)
   - `max_convergence = 3` (or user-specified)

### 2. Main loop

Repeat while `cycle <= max_cycles` and `convergence_count < max_convergence`:

**a. Run eval**

```bash
agentv eval <eval-path> --experiment autoresearch-<name>
```

**b. Extract scores (bash only — do NOT read result files into your context)**

Find the latest timestamped directory in the experiment folder. Use bash/jq to extract small structured values:

```bash
# Find latest run dir
RUN_DIR=$(ls -td <experiment-dir>/20*/ | head -1)

# Overall score (mean of all scores in index.jsonl)
SCORE=$(jq -sr '[.[].scores[].score] | add / length' "$RUN_DIR/index.jsonl")

# Per-assertion pass rates as JSON object
PASS_RATES=$(jq -sc '[.[].scores[]] | group_by(.name) | map({key: .[0].name, value: (map(.score) | add / length)}) | from_entries' "$RUN_DIR/index.jsonl")

# Cost (if timing.json exists)
COST=$(jq -r '.cost_usd // 0' "$RUN_DIR/timing.json" 2>/dev/null || echo 0)
```

Capture only these small outputs (`SCORE`, `PASS_RATES`, `COST`) — never read the full JSONL into context.
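
For readers who want to see what this extraction computes, here is the aggregation as a TypeScript sketch. It is explanatory only; the loop itself should keep using the bash commands above. The sketch keys the rates by the grader `name` field so the result has the per-assertion shape stored in `iterations.jsonl`. `IndexRecord`, `overallScore`, and `passRatesByName` are hypothetical names, and the interfaces include only the fields the aggregation reads.

```typescript
// Minimal view of one index.jsonl record.
interface ScoreEntry {
  name: string;
  score: number;
}
interface IndexRecord {
  scores: ScoreEntry[];
}

// Mean of every grader score across all tests.
function overallScore(records: IndexRecord[]): number {
  let total = 0;
  let count = 0;
  for (const record of records) {
    for (const entry of record.scores) {
      total += entry.score;
      count += 1;
    }
  }
  if (count === 0) throw new Error("no scores found");
  return total / count;
}

// Mean score per grader name, e.g. { IDENTIFIES_BUG: 0.8 }.
function passRatesByName(records: IndexRecord[]): Record<string, number> {
  const buckets: Record<string, { total: number; count: number }> = {};
  for (const record of records) {
    for (const entry of record.scores) {
      const bucket = buckets[entry.name] ?? { total: 0, count: 0 };
      bucket.total += entry.score;
      bucket.count += 1;
      buckets[entry.name] = bucket;
    }
  }
  const rates: Record<string, number> = {};
  for (const [name, bucket] of Object.entries(buckets)) {
    rates[name] = bucket.total / bucket.count;
  }
  return rates;
}
```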

**c. Update iterations.jsonl (bash only)**

This step runs after the KEEP/DROP decision in step (e), once `DECISION` is known. Append one JSON line via bash:

```bash
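# Build the line with jq so quotes in $MUTATION_DESC cannot break the JSON
jq -nc --argjson cycle "$CYCLE" --argjson score "$SCORE" --arg decision "$DECISION" \
  --argjson cost "$COST" --argjson assertions "$PASS_RATES" --arg mutation "$MUTATION_DESC" \
  --arg run_dir "$(basename "$RUN_DIR")" --arg timestamp "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  '{cycle: $cycle, score: $score, decision: $decision, cost_usd: $cost, assertions: $assertions, mutation: $mutation, run_dir: $run_dir, timestamp: $timestamp}' \
  >> <experiment-dir>/_autoresearch/iterations.jsonl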
```

**d. trajectory.html — no action needed**

The trajectory chart fetches `iterations.jsonl` directly via HTTP on each auto-refresh. No file manipulation required after the initial copy in setup.

**e. Decide: KEEP or DROP**

Apply the automated keep/discard rules from the section above:

1. Run `agentv compare <baseline>.jsonl <candidate>.jsonl --json` where `<baseline>` is the best iteration's `index.jsonl` (the most recent KEEP) and `<candidate>` is this cycle's `index.jsonl`.
2. If `wins > losses` → **KEEP**.
3. If `mean_delta == 0` and the artifact is simpler → **KEEP** (simpler is better at equal performance). Simplicity: for files, compare line count; for directories, compare total size via `du -sb`.
4. Otherwise (`wins <= losses`) → **DROP**.

Cycle 1 has no earlier run to compare against, so skip the comparison and always **KEEP** it. Its `index.jsonl` becomes the first baseline.
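
The decision rules can be condensed into one function. This is a sketch under stated assumptions: only the names `wins`, `losses`, and `mean_delta` come from the rules above, the rest of the `agentv compare --json` output shape is not assumed, and `decide` is a hypothetical helper. It checks the simplicity tie-break before falling through to DROP, which is one reading of how the rules interact.

```typescript
// Assumed subset of the comparison result used by the rules.
interface CompareSummary {
  wins: number;
  losses: number;
  mean_delta: number;
}

type Decision = "keep" | "drop";

// `simpler` is true when the candidate artifact is smaller than the baseline
// (fewer lines for a file, fewer bytes for a directory).
function decide(cycle: number, cmp: CompareSummary | null, simpler: boolean): Decision {
  if (cycle === 1) return "keep"; // first cycle: nothing to compare against
  if (cmp === null) throw new Error("a comparison is required after cycle 1");
  if (cmp.wins > cmp.losses) return "keep";
  if (cmp.mean_delta === 0 && simpler) return "keep"; // equal performance, simpler artifact
  return "drop";
}
```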

**f. If KEEP**

- Update `best_score` to this cycle's score.
- Commit the artifact: `git add <artifact-path> && git commit -m "autoresearch cycle N: <mutation summary>"`.
- Record the current `index.jsonl` path as the new baseline for future comparisons.
- Reset `convergence_count = 0`.

**g. If DROP**

- Revert the working tree to HEAD: `git checkout -- <artifact-path>` (for files) or `git checkout -- <artifact-path>/` (for directories).
- Increment `convergence_count`.

**h. Check stop conditions**

If `convergence_count >= max_convergence` or `cycle >= max_cycles` → break out of the loop.

**i. Mutate**

Dispatch the **mutator** subagent (`agents/mutator.md`) with:
- `artifact-path`: the file or directory to mutate
- `artifact-mode`: `file` or `directory`
- `initial-sha`: the starting commit SHA (for referencing the original via `git show`)
- `pass-rates`: the `$PASS_RATES` JSON object from step (b) (small — just assertion names and rates)
- `run-dir`: path to this cycle's run directory (the mutator reads `grading.json` and transcripts itself)
- `iterations-path`: path to `_autoresearch/iterations.jsonl` (the mutator reads mutation history itself)
- For directory mode: `focus-files` (optional — files most likely contributing to failures, derived from assertion names)

**Do NOT pass failure descriptions, transcripts, or grading content** to the mutator — pass paths and let it read what it needs from disk. This keeps the orchestrator's context clean.

The mutator rewrites artifacts in place. Verify the artifact was modified (e.g., `git diff --stat`) before continuing.

**j. Continue**

Increment `cycle` and return to step (a).

### 3. Completion

1. Finalize `trajectory.html`: remove the line containing `<!-- __AUTO_REFRESH__ -->` (which includes the `<meta http-equiv="refresh">` tag) so the chart becomes static.
2. Log a final summary:
   - Total cycles run
   - Final best score vs original score (cycle 1)
   - Number of KEEPs and DROPs
   - Total cost across all cycles
   - The optimized artifact is in the working tree (and the latest commit)
   - Run `git diff <initial_sha>` to see total changes from the original
   - Run `git log --oneline <initial_sha>..HEAD` to see the mutation history
   - Path to `_autoresearch/trajectory.html` (the score chart)
3. Present results to the user with a recommendation: adopt the optimized version, revert to original (`git checkout <initial_sha> -- <artifact-path>`), or continue iterating interactively.

## Interactive/autonomous hybrid

Users can start in interactive mode (the existing Step 3–5 loop with human checkpoints), build confidence in their eval quality, and then switch to autoresearch mode to run unattended. The two modes share the same eval infrastructure and artifact layout — autoresearch simply automates the keep/discard decisions and removes human checkpoints.

## Model empathy recommendation

For best results, use same-model pairings: the meta-agent running autoresearch should match the model used by the task agent being evaluated (e.g., Claude optimizing a Claude agent, GPT optimizing a GPT agent). Per AutoAgent research findings, same-model pairings produce better mutations because the optimizer has implicit knowledge of how the target model interprets instructions.

--- references/description-optimization.md ---
# Description Optimization

Optimize the `description` field in a skill's SKILL.md frontmatter for better triggering
accuracy. Use this after the agent/skill is working well — this is a polish step, not a
core workflow step.

**Provider compatibility**: Description optimization applies to any agent platform with
skill-discovery mechanisms — Claude Code, Codex (`.agents/` or `.codex/` folders), Copilot,
and others. The `skill-trigger` grader checks whether the agent invoked the right skill,
regardless of how discovery works on that platform.

## Step 1: Generate Trigger EVAL.yaml

Create 20 test cases:
- **10 should-trigger**: realistic prompts where this skill should activate — different
  phrasings, casual speech, uncommon use cases, edge cases where this skill competes with
  another but should win
- **10 should-not-trigger**: near-miss prompts that share keywords but actually need
  something different — adjacent domains, ambiguous phrasing where naive matching would
  trigger but shouldn't

Prompts must be realistic — include file paths, personal context, typos, casual speech.
Not abstract requests like "format data" but concrete ones like "ok so my boss sent me
Q4-sales-FINAL-v2.xlsx and she wants me to add a profit margin column..."

The should-not-trigger cases are the most valuable. "Write a fibonacci function" as a
negative test for an eval skill is useless — it doesn't test anything. The negative cases
should be genuinely tricky near-misses.

Write as EVAL.yaml with top-level input (the user prompt doesn't specify the skill name —
it's a natural utterance):

```yaml
# trigger-eval.eval.yaml
tests:
  - id: should-trigger-casual-optimize
    input: "ok so I have this agent that keeps failing on the code review tasks, can you help me figure out why and fix it"
    assertions:
      - type: skill-trigger
        skill: agentv-bench
  - id: should-not-trigger-build-error
    input: "my TypeScript build is failing with type errors in src/auth.ts"
    assertions:
      - type: skill-trigger
        skill: agentv-bench
        should_trigger: false
```

## Step 2: Review with User

Present the eval set. The user adjusts queries, toggles should-trigger, adds/removes cases.
This step matters — bad eval queries lead to bad descriptions.

## Step 3: Iterate on Description

Run the trigger eval, identify misfires, rewrite the description, re-run. Max 5 iterations.
Select best description by held-out test accuracy (split 60% train / 40% test) to avoid
overfitting.

Use the grader and analyzer subagents to identify trigger failures and propose description
improvements — the same eval → grade → analyze → improve loop used for agent output quality.
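
The held-out selection can be sketched as follows. Splitting the should-trigger and should-not-trigger cases separately (a stratified split) is an assumption made here so both sets stay balanced; the workflow above only prescribes the 60/40 ratio. `TriggerCase`, `splitTrainTest`, and `accuracy` are hypothetical names.

```typescript
interface TriggerCase {
  id: string;
  shouldTrigger: boolean;
}

// Split each group separately so train and test keep the same balance.
// Cases are taken in the order given; shuffle beforehand if order matters.
function splitTrainTest(cases: TriggerCase[], trainFraction = 0.6) {
  const train: TriggerCase[] = [];
  const test: TriggerCase[] = [];
  for (const flag of [true, false]) {
    const group = cases.filter((c) => c.shouldTrigger === flag);
    const cut = Math.round(group.length * trainFraction);
    train.push(...group.slice(0, cut));
    test.push(...group.slice(cut));
  }
  return { train, test };
}

// Fraction of cases where the observed trigger matched the expectation.
function accuracy(cases: TriggerCase[], triggered: (id: string) => boolean): number {
  if (cases.length === 0) return 0;
  const correct = cases.filter((c) => triggered(c.id) === c.shouldTrigger).length;
  return correct / cases.length;
}
```

With the 10 + 10 cases from Step 1 this yields 12 training cases and 8 held-out cases. Iterate on the description using the training set only, then pick the candidate with the highest `accuracy` on the held-out set.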

## Step 4: Apply

Update the skill's SKILL.md frontmatter with the optimized description. Show the user
before/after with accuracy scores.

--- references/environment-adaptation.md ---
# Environment Adaptation

Provider-specific notes, CI/headless behavior, and fallback strategies for environments
with limited capabilities.

## CI/Headless Mode

Skip interactive prompts. Exit with pass/fail status code. Always generate artifacts for
downstream consumption.

## No Subagents Available (e.g., Claude.ai)

Run test cases serially. Skip blind comparison. Present results directly in conversation —
for each test case, show the prompt and output. Ask for feedback inline. Skip benchmarking
(it relies on baseline comparisons that aren't meaningful without subagents).

## Provider-Specific Notes

- **Copilot CLI**: Uses ACP via `copilot --acp --stdio`
- **Claude SDK**: Requires `@anthropic-ai/claude-agent-sdk` installed
- **Codex**: Supports skills via `.agents/` or `.codex/` folders. Emits `command_execution`
  and `file_change` tool calls.
- **Custom CLI**: Needs `command` and output file pattern in target config
- **Target config**: Uses `${{ ENV_VAR }}` syntax (not `${ENV_VAR}`) for API keys

**Note**: "Description Optimization" (see `references/description-optimization.md`) applies
to any platform with skill-discovery mechanisms. All listed providers support skills.

## Unsupported Providers: Use a Code-Grader

The built-in `skill-trigger` grader covers Claude, Copilot, Pi, Codex and VS Code out
of the box. For providers with different tool-call formats, write a code-grader that inspects
the agent's tool call trace.

A code-grader receives the full evaluation context including the agent's output messages and
tool calls. You can inspect these to determine whether the skill was invoked:

```yaml
# Example: code-grader for Codex skill-trigger detection
tests:
  - id: should-trigger-codex
    input: "Analyze this CSV file"
    assertions:
      - type: code-grader
        path: ./judges/codex-skill-trigger.ts
```

```typescript
// judges/codex-skill-trigger.ts
import { defineCodeGrader } from '@agentv/eval';

export default defineCodeGrader(({ output }) => {
  const skillName = 'csv-analyzer';
  const toolCalls = (output ?? []).flatMap((msg) => msg.toolCalls ?? []);
  const firstTool = toolCalls[0];

  if (!firstTool) {
    return { score: 0, reason: 'No tool calls recorded' };
  }

  // Codex reads skill files via shell commands
  if (firstTool.tool === 'command_execution') {
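    // Stringify non-string inputs so the substring check sees their contents
    const raw = firstTool.input ?? '';
    const cmd = typeof raw === 'string' ? raw : JSON.stringify(raw);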
    if (cmd.includes(skillName)) {
      return { score: 1, reason: `Skill "${skillName}" triggered via command: ${cmd}` };
    }
  }

  // Check whether a file_change call touched a path under the skill
  if (firstTool.tool === 'file_change') {
    const path = String((firstTool.input as Record<string, unknown>)?.path ?? '');
    if (path.includes(skillName)) {
      return { score: 1, reason: `Skill file accessed: ${path}` };
    }
  }

  return { score: 0, reason: `First tool was "${firstTool.tool}" — not a skill invocation for "${skillName}"` };
});
```

This approach is more flexible than config overrides — you can match any tool-call pattern,
check multiple fields, and add provider-specific logic as needed.

--- references/eval-yaml-spec.md ---
# Eval YAML Spec — Schema and Assertion Grading Recipes

This reference documents the eval.yaml schema and grading recipes for every assertion type.
The grader agent uses this to evaluate assertions without the CLI.

## 1. Eval YAML Structure

### Top-level fields

- `name` (string, optional) — eval name
- `description` (string, optional) — description
- `execution` (object, optional) — `target`, `model`, etc.
- `workspace` (object, optional) — workspace config (template, hooks)
- `tests` (array, required) — test cases

### Per-test fields

- `id` (string, required) — unique test identifier
- `input` (string | Message[], required) — task input. String shorthand expands to `[{role: user, content: "..."}]`
- `expected_output` (string | Message[], optional) — reference answer. String shorthand expands to `[{role: assistant, content: "..."}]`
- `criteria` (string, optional) — human-readable success criteria
- `assertions` (array, optional) — grader assertions
- `conversation_id` (string, optional) — groups related tests
- `execution` (object, optional) — per-test execution override

## 2. Assertion Types and Grading Recipes

For each assertion type: YAML config fields, grading recipe (exact pseudocode for deterministic types), and PASS/FAIL conditions.

### Deterministic assertions (zero-cost, instant)

#### `contains`

- **Fields:** `value` (string, required)
- **Recipe:**
  ```
  response.toLowerCase().includes(value.toLowerCase())
  ```
  Note: case-insensitive by default in AgentV. If `case_sensitive: true`, skip the lowercasing and compare case-sensitively: `response.includes(value)`.
- **PASS:** substring found. **FAIL:** substring not found.

#### `contains-any`

- **Fields:** `value` (string[], required)
- **Recipe:**
  ```
  value.some(v => response.toLowerCase().includes(v.toLowerCase()))
  ```
- **PASS:** at least one substring found.

#### `contains-all`

- **Fields:** `value` (string[], required)
- **Recipe:**
  ```
  value.every(v => response.toLowerCase().includes(v.toLowerCase()))
  ```
- **PASS:** all substrings found.

#### `icontains` / `icontains-any` / `icontains-all`

Same as contains variants but explicitly case-insensitive.

#### `equals`

- **Fields:** `value` (string, required)
- **Recipe:**
  ```
  response.trim() === value.trim()
  ```
- **PASS:** exact match after trimming.

#### `regex`

- **Fields:** `value` (string, required — a regex pattern)
- **Recipe:**
  ```
  new RegExp(value).test(response)
  ```
- **PASS:** pattern matches.

#### `starts-with`

- **Fields:** `value` (string, required)
- **Recipe:**
  ```
  response.startsWith(value)
  ```
  (or case-insensitive variant)
- **PASS:** response starts with value.

#### `ends-with`

- **Fields:** `value` (string, required)
- **Recipe:**
  ```
  response.endsWith(value)
  ```
  (or case-insensitive variant)
- **PASS:** response ends with value.

#### `is-json`

- **Fields:** none required
- **Recipe:**
  ```
  try { JSON.parse(response); return true } catch { return false }
  ```
- **PASS:** response is valid JSON. **FAIL:** parse error.

#### `field-accuracy`

- **Fields:** `expected` (object, required — JSON object with field paths and expected values)
- **Recipe:** Parse response as JSON. For each field path in `expected`, check if the value matches.
- **PASS:** all fields match. Partial score = `matched_fields / total_fields`.
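
The `field-accuracy` recipe has no pseudocode above, so here is a minimal sketch. It assumes field paths are dot-separated (for example `customer.address.city`) and compares values by JSON serialization. Both are assumptions, not documented AgentV behavior, and `getPath` and `fieldAccuracy` are hypothetical names.

```typescript
// Look up a dot-separated path; returns undefined when any segment is missing.
function getPath(obj: unknown, path: string): unknown {
  let current: unknown = obj;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

function fieldAccuracy(response: string, expected: Record<string, unknown>) {
  const paths = Object.keys(expected);
  let parsed: unknown;
  try {
    parsed = JSON.parse(response);
  } catch {
    return { score: 0, passed: false }; // not JSON: no field can match
  }
  // Serialized comparison lets arrays and objects be expected values
  // (note that it is sensitive to object key order).
  const matched = paths.filter(
    (p) => JSON.stringify(getPath(parsed, p)) === JSON.stringify(expected[p]),
  ).length;
  const score = paths.length === 0 ? 1 : matched / paths.length;
  return { score, passed: matched === paths.length };
}
```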

### Metric assertions (require timing.json)

#### `latency`

- **Fields:** `threshold` (number, required — max duration in ms)
- **Recipe:** Read `timing.json`. Compare `duration_ms` against threshold.
- **PASS:** `duration_ms <= threshold`.

#### `cost`

- **Fields:** `threshold` (number, required — max cost in USD)
- **Recipe:** Read `timing.json`. Compare `cost_usd` against threshold.
- **PASS:** `cost_usd <= threshold`.

#### `token-usage`

- **Fields:** `threshold` (number, required — max tokens)
- **Recipe:** Read `timing.json`. Compare `total_tokens` against threshold.
- **PASS:** `total_tokens <= threshold`.

#### `execution-metrics`

- **Fields:** Various threshold fields for tool calls, output chars, etc.
- **Recipe:** Read timing.json, compare each metric against its threshold.

### Tool inspection assertions

#### `tool-trajectory`

- **Fields:** `expected` (array of expected tool calls), `mode` (string: `exact` | `contains` | `order`)
- **Recipe:** Inspect transcript for tool call sequence. Match against expected based on mode.
- **PASS:** tool calls match expected pattern per mode.

#### `skill-trigger`

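- **Fields:** `skill` (string, required), `should_trigger` (boolean, optional; set to `false` to assert that the skill is not invoked)
- **Recipe:** Check if the agent invoked the named skill in its tool calls.
- **PASS:** skill was triggered, or was not triggered when `should_trigger: false`.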

### LLM-judged assertions (require Claude reasoning)

#### `llm-grader`

- **Fields:** `prompt` (string, required — either inline text or path to .md file)
- **Recipe:** Read the prompt. Evaluate the response against the criteria using your own reasoning. Produce score (0.0-1.0) with evidence.
- **PASS:** score >= 0.5 (configurable via `threshold`).

#### `rubric` / `rubrics`

- **Fields:** `rubric_items` or `criteria` (array of rubric items with descriptions and weights)
- **Recipe:** For each rubric item, evaluate the response. Score each item 0.0-1.0. Aggregate as weighted average.
- **PASS:** aggregate score >= threshold.

### Script-based assertions

#### `code-grader`

- **Fields:** `path` (string, required — path to script), `command` (string[], optional — custom command)
- **Script SDK:** Use `defineCodeGrader` from `@agentv/eval`:
  ```typescript
  import { defineCodeGrader } from '@agentv/eval';
  export default defineCodeGrader(({ outputText, trace }) => ({
    score: outputText.includes('expected') ? 1 : 0,
    assertions: [{ text: 'Contains expected', passed: outputText.includes('expected') }],
  }));
  ```
- **Recipe:** The CLI runs the script, passing context as JSON on stdin (`{output, outputText, input, inputText, ...}`). The script returns `{"score": N, "assertions": [...]}`.
- **PASS:** score >= 0.5 (or as configured).

### Composite assertion

#### `composite`

- **Fields:** `assertions` (array of sub-assertions), `aggregation` (string: `weighted_average` | `min` | `max` | `all_pass`)
- **Recipe:** Evaluate each sub-assertion. Aggregate scores per aggregation mode.
- **PASS:** depends on aggregation mode.
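
A sketch of the four aggregation modes may help when grading a `composite` by hand. Scoring `all_pass` as 1 or 0 is an assumption, as is the default weight of 1 for sub-assertions that omit `weight` (taken from the common-fields default). `SubResult` and `aggregate` are hypothetical names.

```typescript
type Aggregation = "weighted_average" | "min" | "max" | "all_pass";

interface SubResult {
  score: number;
  passed: boolean;
  weight?: number;
}

function aggregate(results: SubResult[], mode: Aggregation): number {
  if (results.length === 0) throw new Error("composite needs at least one sub-assertion");
  const scores = results.map((r) => r.score);
  if (mode === "min") return Math.min(...scores);
  if (mode === "max") return Math.max(...scores);
  if (mode === "all_pass") return results.every((r) => r.passed) ? 1 : 0; // assumed 0/1 score
  // weighted_average: sum(score * weight) / sum(weight)
  const totalWeight = results.reduce((sum, r) => sum + (r.weight ?? 1), 0);
  if (totalWeight === 0) throw new Error("total weight must be positive");
  const weighted = results.reduce((sum, r) => sum + r.score * (r.weight ?? 1), 0);
  return weighted / totalWeight;
}
```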

## 3. Negate Support

When `negate: true` is set on any assertion, invert the pass/fail result:

- A passing check becomes a failure
- A failing check becomes a pass
- Score is inverted: `1.0 - score`
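
As a minimal sketch, negation is a pure transformation applied after the assertion has been graded. `AssertionOutcome` and `applyNegate` are hypothetical names used only for illustration.

```typescript
interface AssertionOutcome {
  score: number;
  passed: boolean;
}

// Apply the optional `negate` flag to an already-graded assertion.
function applyNegate(outcome: AssertionOutcome, negate = false): AssertionOutcome {
  if (!negate) return outcome;
  return { score: 1 - outcome.score, passed: !outcome.passed };
}
```

For example, a `contains` check that found its substring (score 1, passed) becomes score 0, failed, which is how a "must not mention X" assertion is expressed.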

## 4. Common Assertion Fields

All assertion types support:

- `name` (string, optional) — human-readable name
- `type` (string, required) — the assertion type
- `weight` (number, optional, default 1.0) — weight in score aggregation
- `negate` (boolean, optional) — invert result
- `threshold` (number, optional) — minimum score to pass (for LLM types)

## 5. AgentV JSONL Output Format

Each line in the results JSONL file is an `EvaluationResult` object. In JSONL, field names use snake_case (applied by `toSnakeCaseDeep()`).

### Required fields

- `timestamp` (string, ISO-8601)
- `test_id` (string)
- `score` (number, 0.0-1.0, weighted average of all assertion scores)
- `assertions` (array of `{text, passed, evidence?}`)
- `output` (Message[]) — agent output messages
- `execution_status` (string: `ok` | `quality_failure` | `execution_error`)

### Optional fields

- `scores` (array of EvaluatorResult) — per-grader breakdown
- `input` (Message[]) — input messages
- `token_usage` (object: `{prompt_tokens, completion_tokens, total_tokens}`)
- `cost_usd` (number)
- `duration_ms` (number)
- `target` (string)
- `eval_set` (string)
- `error` (string)
- `file_changes` (string — unified diff)
- `mode` (string — `agent` for agent mode)

### `scores[]` entries (EvaluatorResult)

- `name` (string) — grader name
- `type` (string) — grader kind (kebab-case)
- `score` (number, 0.0-1.0)
- `assertions` (array of `{text, passed, evidence?}`)
- `weight` (number, optional)
- `verdict` (string: `pass` | `fail` | `skip`)
- `details` (object, optional — structured data from code graders)
- `reasoning` (string, optional)

## 6. Eval Set Support

An eval_set references multiple eval.yaml files:

```yaml
# eval_set.yaml
eval_set:
  - path: ./basic.eval.yaml
  - path: ./advanced.eval.yaml
```

Process each file's tests independently, then aggregate results.

## 7. Agent-Mode Pipeline CLI Commands

These CLI subcommands break the monolithic `eval run` into discrete steps for agent-mode execution. The agent handles LLM grading between steps.

### `agentv pipeline input <eval-path> --out <dir>`

Extracts inputs, target commands, and grader configs from an eval YAML file.

**Output structure:**
```
<out-dir>/
├── manifest.json
├── <test-id>/
│   ├── input.json              ← {input, input_files, metadata}
│   ├── invoke.json             ← {kind, command?, cwd?, timeout_ms?}
│   ├── criteria.md             ← human-readable success criteria
│   ├── expected_output.json    ← (if present)
│   ├── code_graders/<name>.json   ← {name, command, weight, config?}
│   └── llm_graders/<name>.json    ← {name, weight, threshold?, prompt_content}
```

**`manifest.json` format:**
```json
{
  "eval_file": "path/to/eval.yaml",
  "timestamp": "2026-03-24T...",
  "target": {"name": "target-name", "kind": "cli", "subagent_mode_allowed": false},
  "test_ids": ["test-01", "test-02"]
}
```

**`invoke.json` kinds:**
- `kind: "cli"` — has `command`, `cwd`, `timeout_ms`. Use the command to run the target.
- `kind: "agent"` — non-CLI provider. Check `manifest.json` `target.subagent_mode_allowed` to decide whether to dispatch executor subagents or fall back to `agentv eval` CLI.
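
The branching on `invoke.json` can be sketched as a small planner. The field names come from the formats above; treating `command` as a single string is an assumption, and `planExecution` and the `Plan` action labels are hypothetical.

```typescript
interface CliInvoke {
  kind: "cli";
  command: string; // assumed to be a single shell string
  cwd?: string;
  timeout_ms?: number;
}
interface AgentInvoke {
  kind: "agent";
}
type Invoke = CliInvoke | AgentInvoke;

interface Manifest {
  target: { name: string; kind: string; subagent_mode_allowed: boolean };
}

type Plan =
  | { action: "run-command"; invoke: CliInvoke }
  | { action: "dispatch-subagent" }
  | { action: "fallback-agentv-eval" };

// Decide how a test should be executed from its invoke.json and the manifest.
function planExecution(invoke: Invoke, manifest: Manifest): Plan {
  if (invoke.kind === "cli") {
    return { action: "run-command", invoke };
  }
  return manifest.target.subagent_mode_allowed
    ? { action: "dispatch-subagent" }
    : { action: "fallback-agentv-eval" };
}
```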

### `agentv pipeline grade <export-dir>`

Runs code-grader assertions against `response.md` files in each test directory.

**Prerequisites:** `pipeline input` has been run and `response.md` exists in each test dir.

**Output:** `<test-id>/code_grader_results/<name>.json` for each code grader, containing:
```json
{
  "name": "grader-name",
  "type": "code-grader",
  "score": 1.0,
  "weight": 1.0,
  "assertions": [{"text": "...", "passed": true}]
}
```

### `agentv pipeline bench <export-dir>`

Merges code-grader results with LLM grader scores and produces final artifacts.

LLM grader results are read from disk at `<test-id>/llm_grader_results/<name>.json` per test.

**LLM grader result file format** (`llm_grader_results/<name>.json`):
```json
{ "score": 0.85, "assertions": [{"text": "...", "passed": true, "evidence": "..."}] }
```

**Output:**
- `<test-id>/grading.json` — merged grading with `graders`, `assertions`, `summary.pass_rate`
- `index.jsonl` — one JSON line per test: `{test_id, score, pass, graders: [...]}`
- `benchmark.json` — aggregate stats: `{metadata: {targets}, run_summary: {<target>: {mean, stddev, n}}}`

### Agent-Mode Workflow

```
1. agentv pipeline input eval.yaml --out ./export
2. (Agent runs targets or reads response.md)
3. agentv pipeline grade ./export
4. (Agent does LLM grading, writes <test-id>/llm_grader_results/<name>.json)
5. agentv pipeline bench ./export
```

--- references/migrating-from-skill-creator.md ---
# Migrating from Skill-Creator to AgentV Lifecycle Skill

This reference covers how to use AgentV's unified agent-evaluation lifecycle skill (`agentv-bench`) with evals.json files originally created for Anthropic's skill-creator.

## Drop-in Replacement

AgentV runs skill-creator's evals.json directly — no conversion required:

```bash
# Run evals.json with AgentV
agentv eval evals.json

# Or run a single assertion offline (no API keys)
agentv eval assert <grader-name> --agent-output "..." --agent-input "..."
```

AgentV automatically:
- Promotes `prompt` → input messages
- Promotes `expected_output` → reference answer
- Converts `assertions` → LLM-grader graders
- Resolves `files[]` paths relative to the evals.json directory
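
To make the promotion concrete, here is a rough sketch of the mapping for a single entry. How AgentV represents the result internally is not specified here, so the output shape borrows the per-test fields from the eval YAML spec as an assumption. `promote`, `SkillCreatorEval`, and `EvalTest` are hypothetical names, and `files[]` resolution is left out.

```typescript
// One entry of skill-creator's evals.json.
interface SkillCreatorEval {
  id: number;
  prompt: string;
  expected_output?: string;
  files?: string[];
  assertions?: string[];
}

// Assumed target shape, modeled on the per-test EVAL.yaml fields.
interface EvalTest {
  id: string;
  input: string;
  expected_output?: string;
  assertions: { type: "llm-grader"; prompt: string }[];
}

function promote(entry: SkillCreatorEval): EvalTest {
  const test: EvalTest = {
    id: String(entry.id), // integer id becomes a string test id (assumption)
    input: entry.prompt,
    // Each plain-text assertion becomes the prompt of an LLM grader.
    assertions: (entry.assertions ?? []).map((text) => ({
      type: "llm-grader" as const,
      prompt: text,
    })),
  };
  if (entry.expected_output !== undefined) {
    test.expected_output = entry.expected_output;
  }
  return test;
}
```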

If you're using the `agentv-bench` skill, it orchestrates these same AgentV commands. Code graders, grading, and artifact generation remain in AgentV core; the skill just orchestrates and summarizes the existing outputs.

## What You Gain

Moving from skill-creator's eval loop to AgentV's lifecycle skill gives you:

| Capability | skill-creator | AgentV lifecycle skill |
|-----------|---------------|----------------------|
| Workspace isolation | ❌ | ✅ Clone repos, run setup/teardown scripts |
| Code graders | ❌ | ✅ Python/TypeScript grader scripts via `defineCodeGrader()` |
| Tool trajectory scoring | ❌ | ✅ Evaluate tool call sequences |
| Multi-provider comparison | with-skill vs without-skill | N-way: Claude, GPT, Copilot, Gemini, custom CLI |
| Multi-turn evaluation | ❌ | ✅ Conversation tracking with `conversation_id` |
| Blind comparison | ❌ | ✅ Judge doesn't know which is baseline |
| Deterministic upgrade suggestions | ❌ | ✅ LLM-grader → contains/regex/is-json |
| Human review checkpoint | ❌ | ✅ Structured feedback gate |
| Workspace file tracking | ❌ | ✅ Evaluate by diffing workspace files |
| Agent mode (no API keys) | ❌ | ✅ Uses grader agent in agent mode |

## Artifact Compatibility

AgentV's companion artifacts are compatible with skill-creator's eval-viewer:

| Artifact | Format | Compatible with eval-viewer |
|----------|--------|---------------------------|
| `<test-id>/grading.json` | Per-assertion evidence with claims | ✅ Superset of skill-creator's per-test grading format |
| `benchmark.json` | Aggregate pass rates, timing, patterns | ✅ Superset of Agent Skills benchmark format |
| Results JSONL | Per-test results | ✅ Standard JSONL format |

AgentV's schemas are supersets — they include all fields skill-creator expects, plus additional fields (claims extraction, pattern analysis, deterministic upgrade candidates). Tools that read skill-creator artifacts will read AgentV artifacts correctly, ignoring the extra fields.

The optimizer scripts layer reads those same artifacts directly:
- `aggregate-benchmark.ts` consumes `benchmark.json`, `timing.json`, and results JSONL
- `generate-report.ts` and `eval-viewer/generate-review.ts` render review output from AgentV artifacts
- `improve-description.ts` proposes follow-up experiments from benchmark/grading observations

## Graduating to EVAL.yaml

When evals.json becomes limiting, convert to EVAL.yaml for the full feature set:

```bash
# Convert evals.json to EVAL.yaml
agentv convert evals.json

# Edit the generated YAML to add workspace config, code graders, etc.
# Then run with the full lifecycle
agentv eval eval.yaml
```

EVAL.yaml unlocks:
- **Workspace setup/teardown** — clone repos, install dependencies, clean up after tests
- **Code graders** — write graders in Python or TypeScript, not just LLM prompts
- **Rubric-based grading** — multi-dimensional scoring with weighted criteria
- **Retry policies** — automatic retries for flaky tests with configurable backoff
- **Test groups** — organize tests by category with shared config
- **Multi-turn conversations** — test agent interactions across multiple turns

## What Stays in Skill-Creator

AgentV does NOT replace these skill-creator capabilities:

- **Trigger optimization** — optimizing when/how a skill is triggered
- **.skill packaging** — bundling skills for distribution
- **Skill authoring** — creating new SKILL.md files from scratch
- **Skill discovery** — finding and installing skills

AgentV focuses on the **evaluation and optimization loop**. Skill-creator focuses on **skill authoring and packaging**. They are complementary — use skill-creator to write the skill, use AgentV to evaluate and optimize it.

## Example Workflow

```
1. Author a skill with skill-creator
2. skill-creator generates evals.json
3. Run evals.json through AgentV's lifecycle skill for richer evaluation:
   - Workspace isolation (test in a real repo)
   - Multi-provider comparison (does the skill work with GPT too?)
   - Blind comparison (is the new version actually better?)
   - Deterministic upgrades (replace vague LLM graders with precise checks)
4. Use AgentV's optimization loop to refine the skill's prompts
5. Return to skill-creator for packaging and distribution
```

--- references/schemas.md ---
# JSON Schemas

This document defines the JSON schemas used by skill-creator.

---

## evals.json

Defines the evals for a skill. Located at `evals/evals.json` within the skill directory.

```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's example prompt",
      "expected_output": "Description of expected result",
      "files": ["evals/files/sample1.pdf"],
      "assertions": [
        "The output includes X",
        "The skill used script Y"
      ]
    }
  ]
}
```

**Fields:**
- `skill_name`: Name matching the skill's frontmatter
- `evals[].id`: Unique integer identifier
- `evals[].prompt`: The task to execute
- `evals[].expected_output`: Human-readable description of success
- `evals[].files`: Optional list of input file paths (relative to skill root)
- `evals[].assertions`: List of verifiable statements

---

## history.json

Tracks version progression in Improve mode. Located at workspace root.

```json
{
  "started_at": "2026-01-15T10:30:00Z",
  "skill_name": "pdf",
  "current_best": "v2",
  "iterations": [
    {
      "version": "v0",
      "parent": null,
      "assertion_pass_rate": 0.65,
      "grading_result": "baseline",
      "is_current_best": false
    },
    {
      "version": "v1",
      "parent": "v0",
      "assertion_pass_rate": 0.75,
      "grading_result": "won",
      "is_current_best": false
    },
    {
      "version": "v2",
      "parent": "v1",
      "assertion_pass_rate": 0.85,
      "grading_result": "won",
      "is_current_best": true
    }
  ]
}
```

**Fields:**
- `started_at`: ISO timestamp of when improvement started
- `skill_name`: Name of the skill being improved
- `current_best`: Version identifier of the best performer
- `iterations[].version`: Version identifier (v0, v1, ...)
- `iterations[].parent`: Parent version this was derived from
- `iterations[].assertion_pass_rate`: Pass rate from grading
- `iterations[].grading_result`: "baseline", "won", "lost", or "tie"
- `iterations[].is_current_best`: Whether this is the current best version

---

## grading.json

Output from the grader agent. Located at `<run-dir>/grading.json`.

**Important:** The `assertions` array must use the fields `text`, `passed`, and `evidence` — downstream tooling depends on these exact field names.

```json
{
  "assertions": [
    {
      "text": "The output includes the name 'John Smith'",
      "passed": true,
      "evidence": "Found in transcript Step 3: 'Extracted names: John Smith, Sarah Johnson'"
    },
    {
      "text": "The spreadsheet has a SUM formula in cell B10",
      "passed": false,
      "evidence": "No spreadsheet was created. The output was a text file."
    }
  ],
  "summary": {
    "passed": 1,
    "failed": 1,
    "total": 2,
    "pass_rate": 0.50
  },
  "execution_metrics": {
    "tool_calls": {
      "Read": 5,
      "Write": 2,
      "Bash": 8
    },
    "total_tool_calls": 15,
    "total_steps": 6,
    "errors_encountered": 0,
    "output_chars": 12450,
    "transcript_chars": 3200
  },
  "timing": {
    "executor_duration_seconds": 165.0,
    "grader_duration_seconds": 26.0,
    "total_duration_seconds": 191.0
  },
  "claims": [
    {
      "claim": "The form has 12 fillable fields",
      "type": "factual",
      "verified": true,
      "evidence": "Counted 12 fields in field_info.json"
    }
  ],
  "user_notes_summary": {
    "uncertainties": ["Used 2023 data, may be stale"],
    "needs_review": [],
    "workarounds": ["Fell back to text overlay for non-fillable fields"]
  },
  "eval_feedback": {
    "suggestions": [
      {
        "assertion": "The output includes the name 'John Smith'",
        "reason": "A hallucinated document that mentions the name would also pass"
      }
    ],
    "overall": "Assertions check presence but not correctness."
  }
}
```

**Fields:**
- `assertions[]`: Graded assertion results with evidence
- `summary`: Aggregate pass/fail counts and the resulting `pass_rate`
- `execution_metrics`: Tool usage and output size (from executor's metrics.json)
- `timing`: Wall-clock timing (from timing.json)
- `claims`: Extracted and verified claims from the output
- `user_notes_summary`: Issues flagged by the executor
- `eval_feedback`: (optional) Improvement suggestions for the evals, only present when the grader identifies issues worth raising
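
Because downstream tooling depends on the exact `text`, `passed`, and `evidence` field names, it can help to check them before writing the file. The sketch below validates the `assertions` array and derives a `summary` block from it. `summarize_assertions` is a hypothetical helper, and rounding `pass_rate` to two decimals is an assumption based on the precision used in this document's examples.

```python
REQUIRED_ASSERTION_FIELDS = ("text", "passed", "evidence")


def summarize_assertions(assertions: list) -> dict:
    """Validate assertion entries and derive the summary block from them."""
    for index, item in enumerate(assertions):
        missing = [name for name in REQUIRED_ASSERTION_FIELDS if name not in item]
        if missing:
            raise ValueError(f"assertions[{index}] is missing fields: {missing}")
        if not isinstance(item["passed"], bool):
            raise ValueError(f"assertions[{index}].passed must be a boolean")

    total = len(assertions)
    passed = sum(1 for item in assertions if item["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Two-decimal rounding is an assumption, not a documented rule.
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }
```

Deriving the summary from the array, rather than writing both by hand, keeps the counts from drifting apart.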

---

## metrics.json

Output from the executor agent. Located at `<run-dir>/outputs/metrics.json`.

```json
{
  "tool_calls": {
    "Read": 5,
    "Write": 2,
    "Bash": 8,
    "Edit": 1,
    "Glob": 2,
    "Grep": 0
  },
  "total_tool_calls": 18,
  "total_steps": 6,
  "files_created": ["filled_form.pdf", "field_values.json"],
  "errors_encountered": 0,
  "output_chars": 12450,
  "transcript_chars": 3200
}
```

**Fields:**
- `tool_calls`: Count per tool type
- `total_tool_calls`: Sum of all tool calls
- `total_steps`: Number of major execution steps
- `files_created`: List of output files created
- `errors_encountered`: Number of errors during execution
- `output_chars`: Total character count of output files
- `transcript_chars`: Character count of transcript
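
A minimal consistency check for a metrics.json payload, assuming (as the field description states) that `total_tool_calls` is the sum of the per-tool counts. `check_metrics` is a hypothetical helper name.

```python
COUNT_FIELDS = ("total_steps", "errors_encountered", "output_chars", "transcript_chars")


def check_metrics(metrics: dict) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []

    tool_calls = metrics.get("tool_calls", {})
    expected_total = sum(tool_calls.values())
    if metrics.get("total_tool_calls") != expected_total:
        problems.append(
            f"total_tool_calls is {metrics.get('total_tool_calls')!r}, "
            f"but tool_calls sums to {expected_total}"
        )

    for name in COUNT_FIELDS:
        value = metrics.get(name)
        if not isinstance(value, int) or value < 0:
            problems.append(f"{name} should be a non-negative integer, got {value!r}")

    return problems
```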

---

## timing.json

Wall-clock timing for a run. Located at `<run-dir>/timing.json`.

**How to capture:** When a subagent task completes, the task notification includes `total_tokens` and `duration_ms`. Save these immediately — they are not persisted anywhere else and cannot be recovered after the fact.

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3,
  "executor_start": "2026-01-15T10:30:00Z",
  "executor_end": "2026-01-15T10:32:45Z",
  "executor_duration_seconds": 165.0,
  "grader_start": "2026-01-15T10:32:46Z",
  "grader_end": "2026-01-15T10:33:12Z",
  "grader_duration_seconds": 26.0
}
```
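
One way to persist the notification values as soon as they arrive is sketched below. `save_timing` is a hypothetical helper. It assumes `total_duration_seconds` is `duration_ms` converted to seconds and rounded to one decimal, which matches the example above but is not stated by the source.

```python
import json
from pathlib import Path


def save_timing(run_dir, total_tokens: int, duration_ms: int) -> dict:
    """Merge the notification values into <run-dir>/timing.json and return the result."""
    path = Path(run_dir) / "timing.json"
    # Keep any fields already recorded (for example executor_start/executor_end).
    timing = json.loads(path.read_text()) if path.exists() else {}
    timing.update({
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # Assumption: derived from duration_ms, rounded to one decimal.
        "total_duration_seconds": round(duration_ms / 1000, 1),
    })
    path.write_text(json.dumps(timing, indent=2) + "\n")
    return timing
```

Calling it as `save_timing(run_dir, 84852, 23332)` writes the first three fields of the example above and leaves any existing fields untouched.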

---

## benchmark.json

Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.

```json
{
  "metadata": {
    "skill_name": "pdf",
    "skill_path": "/path/to/pdf",
    "executor_model": "claude-sonnet-4-20250514",
    "analyzer_model": "most-capable-model",
    "timestamp": "2026-01-15T10:30:00Z",
    "evals_run": [1, 2, 3],
    "runs_per_configuration": 3
  },

  "runs": [
    {
      "eval_id": 1,
      "eval_name": "Ocean",
      "configuration": "with_skill",
      "run_number": 1,
      "result": {
        "pass_rate": 0.86,
        "passed": 6,
        "failed": 1,
        "total": 7,
        "time_seconds": 42.5,
        "tokens": 3800,
        "tool_calls": 18,
        "errors": 0
      },
      "assertions": [
        {"text": "...", "passed": true, "evidence": "..."}
      ],
      "notes": [
        "Used 2023 data, may be stale",
        "Fell back to text overlay for non-fillable fields"
      ]
    }
  ],

  "run_summary": {
    "with_skill": {
      "pass_rate": {"mean": 0.85, "stddev": 0.05, "min": 0.80, "max": 0.90},
      "time_seconds": {"mean": 45.0, "stddev": 12.0, "min": 32.0, "max": 58.0},
      "tokens": {"mean": 3800, "stddev": 400, "min": 3200, "max": 4100}
    },
    "without_skill": {
      "pass_rate": {"mean": 0.35, "stddev": 0.08, "min": 0.28, "max": 0.45},
      "time_seconds": {"mean": 32.0, "stddev": 8.0, "min": 24.0, "max": 42.0},
      "tokens": {"mean": 2100, "stddev": 300, "min": 1800, "max": 2500}
    },
    "delta": {
      "pass_rate": "+0.50",
      "time_seconds": "+13.0",
      "tokens": "+1700"
    }
  },

  "notes": [
    "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
    "Eval 3 shows high variance (50% ± 40%) - may be flaky or model-dependent",
    "Without-skill runs consistently fail on table extraction assertions",
    "Skill adds 13s average execution time but improves pass rate by 50%"
  ]
}
```

**Fields:**
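- `metadata`: Information about the benchmark run
  - `skill_name`: Name of the skill
  - `skill_path`: Path to the skill being benchmarked
  - `executor_model`: Model used by the executor
  - `analyzer_model`: Model used by the analyzer
  - `timestamp`: When the benchmark was run
  - `evals_run`: List of eval names or IDs
  - `runs_per_configuration`: Number of runs per config (e.g. 3)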
- `runs[]`: Individual run results
  - `eval_id`: Numeric eval identifier
  - `eval_name`: Human-readable eval name (used as section header in the viewer)
  - `configuration`: Must be `"with_skill"` or `"without_skill"` (the viewer uses this exact string for grouping and color coding)
  - `run_number`: Integer run number (1, 2, 3...)
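  - `result`: Nested object with `pass_rate`, `passed`, `failed`, `total`, `time_seconds`, `tokens`, `tool_calls`, `errors`
  - `assertions`: Graded assertions for this run, using the same `text`, `passed`, `evidence` fields as grading.json
  - `notes`: Freeform notes for this run
- `run_summary`: Statistical aggregates per configuration
  - `with_skill` / `without_skill`: Each contains `pass_rate`, `time_seconds`, `tokens` objects with `mean`, `stddev`, `min`, and `max` fields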
  - `delta`: Difference strings like `"+0.50"`, `"+13.0"`, `"+1700"`
- `notes`: Freeform observations from the analyzer

**Important:** The viewer reads these field names exactly. Using `config` instead of `configuration`, or putting `pass_rate` at the top level of a run instead of nested under `result`, will cause the viewer to show empty/zero values. Always reference this schema when generating benchmark.json manually.
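
When generating benchmark.json by hand, the `run_summary` block can be derived from `runs[]` rather than typed in. The following is a sketch under stated assumptions: `summarize_runs` is a hypothetical helper, the source does not say whether `stddev` is the sample or population standard deviation (sample is used here), and the delta formatting is chosen to reproduce the strings in the example above.

```python
from statistics import mean, stdev

# Format spec per metric for the delta strings, e.g. "+0.50", "+13.0", "+1700".
DELTA_FORMATS = {"pass_rate": "+.2f", "time_seconds": "+.1f", "tokens": "+.0f"}


def summarize_runs(runs: list) -> dict:
    """Build a run_summary-shaped dict from a list of runs[] entries."""
    results_by_config = {}
    for run in runs:
        # Metrics are read from the nested "result" object, as the viewer expects.
        results_by_config.setdefault(run["configuration"], []).append(run["result"])

    summary = {}
    for config, results in results_by_config.items():
        summary[config] = {}
        for metric in DELTA_FORMATS:
            values = [result[metric] for result in results]
            summary[config][metric] = {
                "mean": mean(values),
                # Sample standard deviation; 0.0 when there is only one run.
                "stddev": stdev(values) if len(values) > 1 else 0.0,
                "min": min(values),
                "max": max(values),
            }

    if "with_skill" in summary and "without_skill" in summary:
        summary["delta"] = {
            metric: format(
                summary["with_skill"][metric]["mean"]
                - summary["without_skill"][metric]["mean"],
                spec,
            )
            for metric, spec in DELTA_FORMATS.items()
        }
    return summary
```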

---

## comparison.json

Output from the blind comparator. Located at `<grading-dir>/comparison-N.json`.

```json
{
  "winner": "A",
  "reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
  "rubric": {
    "A": {
      "content": {
        "correctness": 5,
        "completeness": 5,
        "accuracy": 4
      },
      "structure": {
        "organization": 4,
        "formatting": 5,
        "usability": 4
      },
      "content_score": 4.7,
      "structure_score": 4.3,
      "overall_score": 9.0
    },
    "B": {
      "content": {
        "correctness": 3,
        "completeness": 2,
        "accuracy": 3
      },
      "structure": {
        "organization": 3,
        "formatting": 2,
        "usability": 3
      },
      "content_score": 2.7,
      "structure_score": 2.7,
      "overall_score": 5.4
    }
  },
  "output_quality": {
    "A": {
      "score": 9,
      "strengths": ["Complete solution", "Well-formatted", "All fields present"],
      "weaknesses": ["Minor style inconsistency in header"]
    },
    "B": {
      "score": 5,
      "strengths": ["Readable output", "Correct basic structure"],
      "weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
    }
  },
  "assertions": {
    "A": {
      "passed": 4,
      "total": 5,
      "pass_rate": 0.80,
      "details": [
        {"text": "Output includes name", "passed": true}
      ]
    },
    "B": {
      "passed": 3,
      "total": 5,
      "pass_rate": 0.60,
      "details": [
        {"text": "Output includes name", "passed": true}
      ]
    }
  }
}
```
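
The source does not state how the rubric scores are derived. In the example, `content_score` and `structure_score` equal the mean of their three criteria rounded to one decimal, and `overall_score` equals their sum. The sketch below reproduces that inferred relationship; treat both the formula and the `score_rubric` helper as assumptions.

```python
from statistics import mean


def score_rubric(content: dict, structure: dict) -> dict:
    """Compute the per-output rubric block from 1-5 criterion scores.

    Assumption: group scores are means rounded to one decimal and the
    overall score is their sum, as inferred from the example values.
    """
    content_score = round(mean(list(content.values())), 1)
    structure_score = round(mean(list(structure.values())), 1)
    return {
        "content": content,
        "structure": structure,
        "content_score": content_score,
        "structure_score": structure_score,
        "overall_score": round(content_score + structure_score, 1),
    }
```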

---

## analysis.json

Output from the post-hoc analyzer. Located at `<grading-dir>/analysis.json`.

```json
{
  "comparison_summary": {
    "winner": "A",
    "winner_skill": "path/to/winner/skill",
    "loser_skill": "path/to/loser/skill",
    "comparator_reasoning": "Brief summary of why comparator chose winner"
  },
  "winner_strengths": [
    "Clear step-by-step instructions for handling multi-page documents",
    "Included validation script that caught formatting errors"
  ],
  "loser_weaknesses": [
    "Vague instruction 'process the document appropriately' led to inconsistent behavior",
    "No script for validation, agent had to improvise"
  ],
  "instruction_following": {
    "winner": {
      "score": 9,
      "issues": ["Minor: skipped optional logging step"]
    },
    "loser": {
      "score": 6,
      "issues": [
        "Did not use the skill's formatting template",
        "Invented own approach instead of following step 3"
      ]
    }
  },
  "improvement_suggestions": [
    {
      "priority": "high",
      "category": "instructions",
      "suggestion": "Replace 'process the document appropriately' with explicit steps",
      "expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
    }
  ],
  "transcript_insights": {
    "winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script",
    "loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods"
  }
}
```

--- references/subagent-pipeline.md ---
# Subagent Pipeline — Running eval.yaml without CLI

This reference documents the detailed procedure for running evaluations in subagent mode
(`AGENT_EVAL_MODE=subagent`, the default). The orchestrating skill dispatches `executor`
subagents to perform test cases and `grader` subagents to evaluate outputs.

Read this reference when executing Step 3 (Run and Grade) in subagent mode.

## Prerequisites

- The eval.yaml file exists and contains valid test definitions
- `agentv` CLI is installed (or run from source via `AGENTV_CLI=bun /path/to/cli.ts` in `.env`)
- Read `references/eval-yaml-spec.md` for the full schema

## Workspace Context

Some evals pass prompt files directly and don't require a specific workspace — those run fine
from anywhere. But evals that test agent behavior in a workspace (accessing skills, modifying
repos, using tools across multiple repos) require the user to be in the **target workspace**
(e.g., a multi-repo workspace set up by allagents). If the eval references workspace files or
expects the agent to use skills, check that the current directory is the target workspace, not
just the eval repo — and warn the user if it's wrong.

## Executor Subagent Eligibility

All providers except `cli` are eligible for executor subagents by default. To opt out a
specific target, set `subagent_mode_allowed: false` in `.agentv/targets.yaml`:

```yaml
# .agentv/targets.yaml
targets:
  - name: my-target
    provider: openai
    model: ${{ OPENAI_MODEL }}
    api_key: ${{ OPENAI_API_KEY }}
    subagent_mode_allowed: false  # forces CLI invocation instead of executor subagent
```

When `subagent_mode_allowed: false`, the target falls back to CLI invocation via `agentv eval`
even in subagent mode.

## CLI Targets: Single Command

For evals with CLI targets, `pipeline run` handles input extraction and target invocation in
one step, and can also run code graders when `--grader-type code` is passed. When `--out` is
omitted, the output directory defaults to `.agentv/results/runs/<timestamp>` (same convention as
`agentv eval`):

```bash
# Extract inputs and invoke all CLI targets in parallel:
agentv pipeline run evals/repro.eval.yaml

# Also run code graders inline (instead of using pipeline grade separately):
agentv pipeline run evals/repro.eval.yaml --grader-type code
```

By default, `pipeline run` extracts inputs and invokes targets only. Pass `--grader-type code`
to also run code graders inline, or use `agentv pipeline grade <run-dir>` as a separate step.

The run directory is printed to stdout. Then continue to the grading and merge phases
described in SKILL.md Step 3.

## Non-CLI Targets: Executor Subagents

When the target provider is not `cli`, check `manifest.json` → `target.subagent_mode_allowed`.
If `true` (default for all non-CLI providers), the subagent IS the target. If `false` (user
opted out via `subagent_mode_allowed: false` in `.agentv/targets.yaml`), fall back to
`agentv eval` CLI mode instead.

### Step 1: Extract inputs

```bash
# Defaults to .agentv/results/runs/<timestamp>
agentv pipeline input evals/repro.eval.yaml
```

This creates a run directory with per-test `input.json`, `invoke.json`,
`criteria.md`, and grader configs.

### Step 2: Dispatch executor subagents

Read `agents/executor.md`. Launch one `executor` subagent **per test case**, all in parallel.
Each subagent receives the test directory path, reads `input.json`, performs the task using
its own tools, and writes `response.md`.

Example: 5 tests = 5 executor subagents launched simultaneously.

```
# Per executor subagent:
#   - Reads <run-dir>/<test-id>/input.json
#   - Performs the task
#   - Writes <run-dir>/<test-id>/response.md
```

### Step 3 onward: Grade and merge

See SKILL.md Step 3 "Grading" section for the three-phase grading process (code graders →
LLM grading → merge and validate).

## Step-by-Step Fine-Grained Control (CLI targets)

Use individual commands when you need control over each step with CLI targets:

```bash
# Step 1: Extract inputs (defaults to .agentv/results/runs/<timestamp>)
agentv pipeline input evals/repro.eval.yaml

# Step 2: run_tests.py invokes CLI targets (or use pipeline run instead)

# Step 3: Run code graders
agentv pipeline grade <run-dir>

# Step 4: Subagent does LLM grading, writes results to llm_grader_results/<name>.json per test

# Step 5: Merge scores (writes index.jsonl with full scores[] for dashboard)
agentv pipeline bench <run-dir>

# Step 6: Validate
agentv results validate <run-dir>
```

## LLM Grading JSON Format

The agent reads `llm_graders/<name>.json` for each test, grades the response using the prompt
content, and produces a scores JSON:

```json
{
  "test-01": {
    "relevance": {
      "score": 0.85,
      "assertions": [{"text": "Response is relevant", "passed": true, "evidence": "..."}]
    }
  }
}
```
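
Before merging, the scores JSON can be sanity-checked against the shape shown above (test id, then grader name, then `score` plus `assertions`). This is a sketch only: `validate_llm_scores` is a hypothetical helper, and the 0 to 1 score range is an assumption based on the example value.

```python
REQUIRED_ASSERTION_FIELDS = ("text", "passed", "evidence")


def validate_llm_scores(scores: dict) -> list:
    """Return a list of problems found in a scores JSON; empty means it looks valid."""
    problems = []
    for test_id, graders in scores.items():
        for grader_name, result in graders.items():
            where = f"{test_id}.{grader_name}"

            score = result.get("score")
            # Assumption: scores are numbers in the range 0.0 to 1.0.
            if isinstance(score, bool) or not isinstance(score, (int, float)):
                problems.append(f"{where}: score must be a number, got {score!r}")
            elif not 0.0 <= score <= 1.0:
                problems.append(f"{where}: score {score} is outside 0.0-1.0")

            for index, assertion in enumerate(result.get("assertions", [])):
                missing = [n for n in REQUIRED_ASSERTION_FIELDS if n not in assertion]
                if missing:
                    problems.append(f"{where}: assertions[{index}] is missing {missing}")
    return problems
```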

## Pipeline Bench and Dashboard

`pipeline bench` merges LLM scores into `index.jsonl` with a full `scores[]` array per entry,
matching the CLI-mode schema. The web dashboard (`agentv results serve`) reads this format
directly — no separate conversion script is needed. Run `agentv results validate <run-dir>`
to verify compatibility.

## Output Structure

The path hierarchy mirrors CLI mode: `<evalset-name>` comes from the `name` field in the
eval.yaml, falling back to the eval file's basename when `name` is absent. The target is
recorded in `manifest.json`, and each run covers exactly one target.

```
.agentv/results/runs/<experiment>/<timestamp>/
├── manifest.json                    ← eval metadata, target, test_ids
├── index.jsonl                      ← per-test scores
├── benchmark.json                   ← aggregate statistics
└── <evalset-name>/                  ← eval.yaml "name" field, or eval file basename if absent (same as CLI mode)
    └── <test-id>/                   ← test case id
        ├── input.json               ← test input text + messages
        ├── invoke.json              ← target command or agent instructions
        ├── criteria.md              ← grading criteria
        ├── response.md              ← target/agent output
        ├── timing.json              ← execution timing
        ├── code_graders/<name>.json     ← grader configs written by `pipeline input`: code-grader scripts AND built-in types (contains, regex, equals, etc.)
        ├── llm_graders/<name>.json      ← LLM grader configs
        ├── code_grader_results/<name>.json  ← code grader results
        ├── llm_grader_results/<name>.json   ← LLM grader results (written by grader subagents; one file per grader)
        └── grading.json              ← merged grading (written by `pipeline bench` — do NOT write here directly)
```
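
To see which tests still lack an executor response or merged grading, the layout above can be walked with a few lines of Python. This is a read-only sketch: `collect_test_status` is a hypothetical helper, and it assumes every test directory sits exactly two levels below the run directory and contains an `input.json`.

```python
from pathlib import Path

EXPECTED_FILES = ("input.json", "response.md", "timing.json", "grading.json")


def collect_test_status(run_dir) -> dict:
    """Map '<evalset-name>/<test-id>' to a dict of which expected files exist."""
    status = {}
    # input.json is written by `pipeline input`, so it marks each test directory.
    for input_file in sorted(Path(run_dir).glob("*/*/input.json")):
        test_dir = input_file.parent
        key = f"{test_dir.parent.name}/{test_dir.name}"
        status[key] = {name: (test_dir / name).is_file() for name in EXPECTED_FILES}
    return status
```

Filtering the result for entries where `response.md` is `False` gives the tests that still need an executor subagent. The helper never writes anything, so it is safe to run at any phase.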
