AI Harness Engineering

Guardrails and memory
for your AI coding agent.

AI coding agents skip verification, duplicate files instead of editing in place, and repeat the same mistakes every session. That's a harness problem. goat-flow is an opinionated harness for Claude Code, Codex, Gemini CLI, and Copilot CLI: an audit that validates the setup, hooks that block dangerous commands, skills that replace free-form prompting, and a learning loop that captures lessons so mistakes don't recur.

Why harness engineering?

The model isn't the product. The harness is.

Practitioners writing publicly about coding agents keep arriving at the same insight: the LLM is a commodity, the scaffolding around it isn't. LangChain moved their coding agent from Top 30 to Top 5 on Terminal Bench 2.0 without changing the model - only the harness. The harness is everything around the model: the files it can read, the commands it can run, the rules it must obey, the memory it keeps across sessions. goat-flow gives you one, opinionated, out of the box.

The four pieces

Four parts working together. Audit tells you what's missing, skills give the agent workflows, hooks stop dangerous actions, and the learning loop remembers what happened.

01 · Audit

How the audit works

Three scopes, all pass/fail. No wiggle room: either the file is there, the hook is wired, the convention is followed, or it isn't.

Scope 1

GOAT Flow setup

Required directories, config, architecture docs, learning-loop folders, shared references.

Scope 2

Agent setup

Per-agent checks: instruction file, skills, settings, deny mechanism.

Scope 3

AI harness completeness

With --harness: integrity, advisory, and metric checks across five concerns (context, constraints, verification, recovery, feedback loop).

Run goat-flow audit --harness for 17 harness checks across five concerns: context, constraints, verification, recovery, feedback loop. These five are the common ground across the public harness engineering literature (Hashimoto, Fowler/Böckeler, Anthropic, HumanLayer). goat-flow picks them as the default audit lens.

Use quality for an agent-driven assessment of the harness itself. The audit answers "is the harness installed?"; quality answers "is it doing what it should?"
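As a rough illustration of the pass/fail model described in this section, the sketch below runs a list of checks against a project root and reports each one as a plain boolean. It is not goat-flow's implementation: the `Check` type, the `run_audit` and `audit_passed` helpers, and the example paths (`docs`, `AGENT.md`) are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Check:
    """One pass/fail check. All names here are hypothetical."""
    scope: str                       # e.g. "setup", "agent", "harness"
    name: str
    passed: Callable[[Path], bool]   # returns True or False, nothing in between


def run_audit(root: Path, checks: List[Check]) -> Dict[str, bool]:
    """Run every check against the project root; no partial credit."""
    return {f"{c.scope}/{c.name}": bool(c.passed(root)) for c in checks}


def audit_passed(results: Dict[str, bool]) -> bool:
    """The audit passes only if every single check passed."""
    return all(results.values())


# Placeholder checks: these paths are assumptions, not goat-flow's real layout.
EXAMPLE_CHECKS = [
    Check("setup", "docs-dir-exists", lambda r: (r / "docs").is_dir()),
    Check("agent", "instruction-file-exists", lambda r: (r / "AGENT.md").is_file()),
]
```

Keeping each check a boolean function means a failing audit always names the exact missing piece, which matches the "either it is there or it isn't" framing above.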

02 · Skills

Seven skills, one dispatcher

Free-form prompting is how agents get lost. Skills are structured slash commands with defined phases, named artifacts, and clear stopping points. Seven ship with goat-flow. Use /goat as the default entry point and it routes to the right one.

/goat Default entry point. Dispatcher that classifies your intent and routes to the right skill.

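To make the dispatcher idea concrete, here is a deliberately naive sketch of intent routing. Everything in it is an assumption: the skill names (`/example-debug`, `/example-review`, `/example-plan`) are invented for illustration and are not goat-flow's skills, and keyword matching stands in for whatever classification the real /goat dispatcher performs.

```python
from typing import Dict, Tuple

# Hypothetical skills and trigger words; not goat-flow's actual skill list.
ROUTES: Dict[str, Tuple[str, ...]] = {
    "/example-debug": ("bug", "error", "failing", "crash"),
    "/example-review": ("review", "feedback"),
}
DEFAULT_SKILL = "/example-plan"  # hypothetical fallback when nothing matches


def route(request: str) -> str:
    """Return the first skill whose trigger words appear in the request."""
    words = set(request.lower().split())
    for skill, triggers in ROUTES.items():
        if words.intersection(triggers):
            return skill
    return DEFAULT_SKILL
```

The point of the sketch is the shape, not the matching: one entry point takes free-form text and always hands off to a named, structured workflow.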
03 · Hooks

Safety nets that can't be skipped

Hooks fire before or after agent actions. One ships by default: deny-dangerous. Projects can add their own where their runner supports it; the audit checks that hook files and registrations stay in sync.

deny-dangerous Blocks destructive filesystem commands, all git push, .env file reads, direct literal secret-path access, eval and dangerous commands hidden inside bash -c, database DROP/TRUNCATE, file truncation, and recursive command substitution.

Add your own. Linters, custom validators, format-on-save, project-specific rules. Register them in the agent's hook config alongside deny-dangerous so the audit can verify the wiring.
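A pre-execution deny hook of the kind described above can be sketched as a pattern check over the command string. This is a minimal illustration, not the deny-dangerous implementation: the `DENY_PATTERNS` list, the `check_command` function, and the regexes are assumptions, and they cover only a few of the categories listed for the real hook.

```python
import re
from typing import Optional

# Illustrative patterns only; the real deny-dangerous rules are not shown here.
DENY_PATTERNS = [
    (re.compile(r"\brm\s+(-[a-zA-Z]*r[a-zA-Z]*f|-[a-zA-Z]*f[a-zA-Z]*r)\b"),
     "recursive force delete"),
    (re.compile(r"\bgit\s+push\b"), "git push"),
    (re.compile(r"(^|[\s/])\.env(\.|\s|$)"), ".env access"),
    (re.compile(r"\b(drop|truncate)\s+(table|database)\b", re.IGNORECASE),
     "destructive SQL"),
]


def check_command(command: str) -> Optional[str]:
    """Return the reason a command is denied, or None if it is allowed."""
    for pattern, reason in DENY_PATTERNS:
        if pattern.search(command):
            return reason
    return None
```

Because the patterns are searched anywhere in the string, a command wrapped as `bash -c "git push"` is caught by the same rule as a bare `git push`. A production hook would need far more careful parsing than regexes can offer; this sketch only shows where the decision sits.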

04 · Learning loop

Persistent memory across sessions

Agents forget everything between runs. The learning loop gives them durable project records, with local session notes as recovery context, so the same mistake doesn't happen twice. It's the compounding bet of the whole system: every session that hits a problem makes the next session a little harder to trip.
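One simple way to picture a durable project record is an append-only file that every session reads at start and writes to when something goes wrong. The sketch below assumes a JSON Lines file and two hypothetical helpers, `record_lesson` and `load_lessons`; goat-flow's actual storage format and file layout are not described here and may differ.

```python
import json
from pathlib import Path
from typing import Dict, List


def record_lesson(store: Path, mistake: str, fix: str) -> None:
    """Append one lesson to the record; earlier entries are never rewritten."""
    store.parent.mkdir(parents=True, exist_ok=True)
    with store.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"mistake": mistake, "fix": fix}) + "\n")


def load_lessons(store: Path) -> List[Dict[str, str]]:
    """Read all recorded lessons, e.g. at the start of a new session."""
    if not store.exists():
        return []
    with store.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending rather than overwriting is what makes the record compound: each session adds to what the next one can load.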

Under the hood

The execution loop

Every agent action follows a four-step loop. Each step prevents a specific failure mode that free-running agents reliably hit.

READ Load the relevant files first. Prevents fabrication: inventing APIs that don't exist.
SCOPE Declare which files will change, and which won't. Prevents scope creep: editing files the task never asked for.
ACT Make changes within the declared scope. Prevents off-target edits: changes made because they seemed related.
VERIFY Run linters, re-read changed files, confirm nothing else drifted. Prevents silent breakage: passing the task but breaking the build.
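
The four steps above can be sketched as a small wrapper that refuses any change outside the declared scope. This is an illustration under stated assumptions, not goat-flow's code: files are modelled as an in-memory dict of path to contents, and `run_task`, `ScopeViolation`, and the `act` and `verify` callbacks are hypothetical names.

```python
from typing import Callable, Dict, Set

Files = Dict[str, str]  # path -> contents; an in-memory stand-in for a repo


class ScopeViolation(Exception):
    """Raised when a change touches a file outside the declared scope."""


def run_task(files: Files, scope: Set[str],
             act: Callable[[Files], Files],
             verify: Callable[[Files], bool]) -> Files:
    # READ: snapshot the current state before doing anything.
    before = dict(files)
    # SCOPE: `scope` is the set of paths this task is allowed to change.
    # ACT: let the task produce a new state from a copy of the snapshot.
    after = act(dict(before))
    changed = {p for p in set(before) | set(after)
               if before.get(p) != after.get(p)}
    off_target = changed - scope
    if off_target:
        raise ScopeViolation(f"changed outside scope: {sorted(off_target)}")
    # VERIFY: run the caller's checks (linters, tests) on the result.
    if not verify(after):
        raise RuntimeError("verification failed; changes rejected")
    return after
```

Comparing the before and after snapshots is what turns SCOPE from a promise into a check: an edit that merely "seemed related" fails before verification even runs.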

Background

The five concerns of AI harness engineering

The common ground across the public harness engineering literature. goat-flow's --harness audit scores every installed harness against these five.

Sources: Mitchell Hashimoto, Birgitta Böckeler (martinfowler.com), Anthropic engineering, HumanLayer, and LangChain. goat-flow synthesises these into a working system with strong defaults, rather than a framework you have to assemble yourself.