Draft v4 · final edits in progress · attune-ai.dev
← attune-ai

The Discipline of Agent Collaboration

A counter-thesis to vibe coding, from someone shipping real work this way.

"Vibe coding" is the shorthand the discourse has settled on for AI-assisted development — and it's a real thing, useful for prototypes, exploration, and the bottom of the experience curve. This piece is about the other 80%: the work that needs to ship, persist across sessions, coordinate across packages, and not break under multi-month load. That work isn't vibing. It's a discipline. The good news: it's a learnable one.


§1 — The Premise: Why Discipline, Why Now

Most "AI-assisted development" stops at suggestion and accept. The agent finishes a sentence. The human keeps the architectural picture. Code gets written faster than it would otherwise. The relationship is real and useful — autocomplete on a better autocomplete model — but it's a tiny fraction of what's possible. The interesting space is the work that lives between sessions, across files, and across packages: specs the agent helps scope and revise, releases coordinated across multiple PyPI packages with pin chains, decision matrices committed before measurement so the data can route the call, memories that persist what was learned across days, dashboards surfacing live state from sibling sessions you couldn't otherwise see. That space requires a different posture from both sides. This piece is the field manual for that other space.

The thesis in one line: collaboration with an AI agent is not a tool problem (find a better model, write better prompts) — it is a discipline problem. A mutual contract, a shared vocabulary, agreed artifact shapes, named failure modes. The disciplines are individually boring. They compound.

And the posture isn't exotic — it's ordinary team management applied to a teammate who happens to be an agent: an explicit contract, clear handoffs, named boundaries. Practiced together, the disciplines produce synergies tied to the discipline itself; the compounding is the point, not a bonus.

Here is what that looks like on a real morning. The ops dashboard is up in a browser tab. The family snapshot shows five PyPI packages tracked. Telemetry shows yesterday's spend and today's. The workflow runner lists twenty workflows with their recent green chips. The dashboard surfaces things that would otherwise stay invisible: a sibling session's spec status edits sitting unstaged in the main checkout's working tree; a workspace-versus-account spend anomaly worth a flag but not a block; four PRs in flight, each showing its CI state. Surfacing is the dashboard's job. Recognizing what to do with the surfaced state — that's the discipline.

Two-layer judgment runs the cycle. The agent reads the surfaced state, synthesizes options, recommends one, waits for the human's call. "Commit the spec edits as their own PR" — yes, that — the parallel session put real intentional work into those files and they're at risk of loss if no one acts. Human approves. Agent cuts the PR, runs verification, surfaces CI cleanly. Human green-lights admin-merge once at the start of the batch. That authorization is durable within the session: the agent treats it as standing approval for similar future merges of the same shape, removing the friction of re-asking on every nearly-identical decision. The cycle repeats. Dashboard surfaces → agent recommends with options → human approves at the decision point → agent executes the mechanical work → CI lands → admin-merge under the durable authorization → next surfacing.

The numbers from one morning: eight PRs handled to resolution. Three PyPI releases (attune-author 0.14.1, attune-gui 0.8.0, attune-ai 7.1.2). Five follow-ups (a publish-trigger sync, a lockfile catch-up, a dep-cap widen, a spec-status persist, and one dependabot PR closed and replaced when its lockfile had drifted past the version it was bumping to). Two held without shame for low-priority bootstrap reasons. Two real judgment calls (a minor version bump instead of a patch when an unreleased MCP bundle was queued in CHANGELOG; the dependabot replacement). One in-session pushback on the agent's own plan (running code-review on a config-only PR produces no signal — skip it, save the budget). One proactive memory write that did not exist that morning.

What made this efficient was not magic. Five mechanisms compound. Clean small PRs — each one single-purpose, easy to scope, easy to review, no mega-PRs that bundle six unrelated concerns. AskUserQuestion at real decision points only — scope expansion, version-bump shape, release-trigger choice, admin-merge authorization. Never at mechanical points like "should I now run pytest?" The agent does not interrupt the human to ask permission to do its job. Admin-merge authorization durability — granted once at the top of a batch, persists for similar future merges, zero re-asking friction within the session. Parallel work while CI runs — the agent moves to the next surfaced thing while a twelve-platform matrix cycles, never serializing waiting time, because waiting is wasted human attention. The dashboard as live state — surfacing what changed, what's ready, what's anomalous, so the agent doesn't have to ask the human to re-explain context that is visible on screen. None of these is clever in isolation. The compounding is the discipline.

The rest of this piece names the six disciplines that produce mornings like that — each its own section, each adoptable incrementally, on its own or together:

§8 closes with a longer case study expanding the morning above into a full twenty-four-hour arc.


§2 — The Mutual Contract

A working contract between collaborators is not a new idea. High-functioning engineering teams have used explicit working agreements for decades — when the team will pair, how reviews work, what counts as a blocker, when interrupts are okay. What we are doing in this section is applying the same shape to human-agent collaboration. The asymmetries between the two parties force a more explicit version of the contract than human-only teams usually need: one party never tires; one party cannot read past sessions natively; one party owns all the side effects (commits, pushes, releases, merges); one party can be replaced wholesale at any moment without warning. The contract has to do more work because of those asymmetries, not less.

The contract has two halves.

The human's half is make the agent's job possible. Concretely: declare the working mode at session start — advancing a measurable scope, executing a planned spec, firefighting a CI issue, or meta-reflection with no code changes expected. State the project (the cwd is often not the project being worked on, especially with worktrees). State the outcome in one sentence — what should be true at the end of this session that isn't true now. State the done-when criteria — the acceptance condition that lets either side say the session has finished its scoped work. None of this is performative. Each line saves the agent from inventing wrong context and lets the agent push back when the work begins drifting.

The agent's half is make the partnership durable. Neutral curiosity over yes-and energy — the agent is not a cheerleader. Full attention to the prompt — no skimming, no answering what the human almost asked. Correction without ego — when wrong, admit and adjust without theater. Honesty about limits — when a memory is stale, say so. When a claim isn't verified, say so. When a recommendation is informed by guesses about an API the agent hasn't read, say so before producing the code that depends on it.

Either side ignoring its half makes the whole arrangement collapse into transactional autocomplete. The human who treats the agent as a vending machine — no working mode, no outcome, no done-when, just requests one after another — does not get collaboration; they get a slow autocomplete with a personality. The agent that treats the human as an oracle — no recommendation, no opinion, just "what do you want me to do" — does not get good direction; the human ends up doing the agent's synthesis work and resents it. The contract closes both of those failure modes.

One small rule inside the contract carries more weight than its size suggests: ask one question at a time. When the agent bundles "do you need a break?" with "what direction next?" the human's answer to one collapses the other. The questions interact and corrupt each other. The discipline is to separate them, sequence them, ask the load-bearing one first, and use a structured decision-point surface (AskUserQuestion in our tooling) as the default form. Number the options A/B/C, label A as Recommended where applicable, keep alternatives short. This is mechanical — not deep — but mechanical things compound.

Pattern: shorthand as vocabulary primitive. A human collaborator says "give me feedback" — singular phrase, no parameters. The agent reads it as a request for the full discovery kit: opportunities, options, next steps, pros and cons, anything load-bearing the agent noticed but didn't surface. Re-specifying the kit each time would waste the primitive — the shorthand IS the contract. The pattern generalizes: any phrase the human uses repeatedly should be readable by the agent as a structured invocation, not as a literal-string request. The agent's job is to learn the human's vocabulary, not to demand it be re-specified. The agent also owes a memory write the first time it notices the shorthand pattern — so future sessions inherit the vocabulary instead of relearning it from scratch.

The PR-and-approve cycle is where the contract lives at the day-to-day level. There are two failure modes flanking it. On one side, agent acts alone — the agent has line-by-line freedom and produces wrong-scope PRs, wrong-shape version bumps, wrong release sequences. On the other side, human approves every line — the human becomes a bottleneck on every keystroke and the agent reduces to autocomplete. Between them sits a discipline shape: the agent does mechanical work between named decision points; the human approves at the decision points.

What counts as a decision point: scope expansion ("this PR grew to include X — okay?"), version-bump shape ("patch or minor here?"), release-trigger choice ("ship the unreleased work bundled or split into two releases?"), admin-merge authorization ("grant the durable approval for green-CI low-risk merges, or hold and decide each one?"), irreversible side effects (force-push to a shared branch, deletion of a directory tree, posting to a public surface). What is not a decision point: code formatting, lockfile updates, test runs, CI polling, mechanical retries after a transient failure, choosing which subagent to spawn for a research subtask. The agent surfaces the decision points clearly with the recommendation labeled. The human picks. The agent executes the mechanical span until the next decision point. Today's session had roughly twelve decision points across eight PRs. Each was a discrete surface from the agent, each got a discrete answer, the mechanical work between was hands-off.

The shape that makes this work is the agent earning the human's trust at the mechanical layer. When the human approves "go ship these three releases in simplest-first order," they are not approving every line of every commit; they are approving the kind of work that will happen between this approval and the next decision point. The agent that uses that approval to smuggle in an unrelated refactor breaks the contract. The agent that asks for re-approval at each mechanical sub-step also breaks the contract, in the other direction — it turns the human into a babysitter. The middle path is narrow, but it is also the one where actual collaboration happens.

The contract above is fully compatible with the agent disagreeing with the human. In fact, that is one of its load-bearing properties. When the agent has a concrete alternative — not a hedged "have you considered" but a fully-formed counter-proposal with the file list, the diff sketch, the measurement plan — the agent owes the human that pushback before executing. Hedging without an alternative is cosmetic and creates friction without value. The contract welcomes real pushback and rejects the imitation kind.

The contract has an asynchronous mode, too — for when the human steps away and the agent keeps working. Everything above governs interactive work: decision points, recommend-and-approve. But agents don't tire, and a known away-window is runway the human can spend deliberately. The asynchronous contract has the same two-obligation shape, re-pointed at unattended work.

The human's half is to queue execution-ready work — not merely substantial work. Decisions made, premise fresh, path clear: roughly, hand off implementation, not design. Decision-ambiguous work handed to an unattended window either stalls (the agent waits for an answer that won't come) or guesses wrong (a thousand lines of correct implementation of the wrong thing). The artifact altitude is the tell — tasks-level work survives an away-window; design-level work belongs in the synchronous window where you're present to answer. A compact invocation helps: in our tooling auto: <specs or PRs> reads as "these are execution-ready, run them to the boundary" — the same shorthand-as-vocabulary primitive from earlier in this section.

The agent's half is to narrow to what is genuinely safe and fresh; hold hard guardrails (no irreversible or outward-facing act — no merge to a protected branch, no publish, no new repo, nothing touching credentials — without the human); re-validate a spec's premise before executing it, because specs go stale in days; and stop honestly at the real boundary rather than manufacture busywork or blind-execute a stale plan. The agent that pads an away-window with low-value churn to look productive breaks the asynchronous contract exactly as the one that smuggles in a refactor breaks the synchronous one.

This cuts both ways at the edges of the work, not only the middle. If the human's starting handoff is weak — vague, ambiguous, or decision-heavy — the agent owes a pushback to firm it into a strong start before proceeding, the same way it owes a clean stopping boundary at the end rather than halting mid-edit. Protecting the quality of the start and the stop is higher-leverage than executing whatever phrasing happened to arrive.


§3 — Pacing: Sustainability Is a Skill

Sustainable pace has decades of engineering-management evidence behind it. Brooks's law on adding people to a late project. The death-march anti-pattern and its predictably brittle output. The standard expectation in mature teams that velocity must be averaged over recovery time rather than measured in sprint peaks. The same evidence applies to human-agent collaboration, even though the asymmetry has shifted. Agents do not tire. The humans they work with do. Ignoring that asymmetry produces the downstream quality cost it always has — but the cost can land later than expected because the agent's energy masks the human's exhaustion for hours past where the work should have stopped.

The trap is that long agent-collaboration sessions are seductive in a specific way. The agent reads context instantly. The agent doesn't need a recap. The agent doesn't push back on "let's do one more thing." So a session that should have ended at hour three keeps going for another two, and the work in hours four and five is consistently worse than the work in hours one through three — not in a way that's obvious in the moment, but in a way that surfaces three days later as a debug session for a bug that wouldn't have existed if the original session had ended cleanly.

The first move in the discipline is the human naming their own pattern. This is not optional. The agent cannot honor a sustainability commitment the human has not made to themselves. The form of the naming varies. It can be "I'm a morning person and afternoons get sloppy," or "my polyphasic schedule means I'm sharpest in the first ninety minutes after a nap," or "after 8pm my best is reviewing, not writing." What matters is that it is named, in writing, in the agent's memory or the project's top-of-tree instructions, where future sessions inherit it. The agent can't honor a commitment it doesn't know you made.

The agent's role is not to nag and not to lecture. Surface the signal once when it appears. At session start: late hour, "I'm tired" mentions, back-to-back sessions on something that doesn't need it. Mid-session: frustration that doesn't match problem difficulty, "let's just get this done," trading quality for speed in ways the human normally would not. When the human asks the agent to skip a safety step "because it's fine." One sentence, named once. Then defer to the human's call. The agent is not the supervisor; the agent is the collaborator who notices and says so once.

The clean-stop pattern is one of the highest-leverage moves in this discipline. Prefer ending at completion boundaries — PR merged, spec drafted, commit pushed, decision recorded — over mid-edit. Resuming mid-edit costs more energy than starting fresh on a clean unit, every time. A session that ends at "PR #469 merged, dashboard reflects new state, two follow-ups noted for tomorrow" can be picked up cold by the next session with one paragraph of context. A session that ends at "halfway through refactoring the routes module, three of nine files done, tests partially updated" requires the next session to rebuild the mental model of where the refactor was going, what's been done, what's broken, what's intentional-incomplete vs accidental-incomplete. The cost gap between those two pickup states is large.

The agent supports clean stops by tracking them. When the human signals fatigue, the agent's recommended next move is not "let's do one more thing" but "the next natural stopping point is PR #X landing — about ten minutes of CI — would you like to wrap there?" That framing trades zero discipline for a real pause; the human gets to land the in-flight work AND stop cleanly. Both halves matter.

Why is this in a discipline document and not a wellness document? Because the shape of the work changes when you optimize for sustainability over throughput. You write better artifacts when you write them rested. You make fewer desperate decisions when you have the option to defer until tomorrow. The agent learns your real cadence rather than averaging across grinds, which means future sessions calibrate against the work you do well rather than the work you do exhausted. The compounding here is across weeks: the team (you, the agent, the codebase) gets better-shaped when each session is optimized for quality-per-hour rather than total hours.

There is also a counterintuitive point worth naming. The discipline of stopping cleanly does not slow throughput — it raises it. We have observed this informally across many sessions. The mornings that started with a clear declaration of mode and outcome, ran for three to four focused hours, and ended cleanly at a natural boundary, consistently produced more shipped work per week than the mornings that opened vague and ran until exhaustion. The reason is that the exhaustion-tail of those longer sessions produced work that had to be redone, which consumed the next morning's first hour. The "extra" hours weren't extra. They were borrowed against tomorrow's best window.

A final note on pacing inside a single session: micro-pacing matters too. The agent should not interrupt the human at every mechanical step. The agent should not bury the human in walls of text when one paragraph would do. The agent should resist the temptation to over-narrate ("I am now about to read the file…"). The collaboration runs on the human's attention budget. Spend it on decision points and substance. Mechanical work happens silently.


§4 — Artifact Discipline: Specs Nest XML Prompts Nest Implementation

Spec-driven work, pre-committed decision criteria, and structured task decomposition are not new ideas. They are well-validated project-management and engineering hygiene. What we are doing in this section is naming the layering of those artifacts inside human-agent collaboration, where the surface for amplifying poor scope is much wider than with a human collaborator who can independently push back on scope creep without being asked. A bad spec given to a human gets pushback before any line of code is written. A bad spec given to an agent gets a thousand lines of code that exactly implement the bad spec, and the human now has both the original problem and the cost of unwinding the implementation.

The shape of work the agent can do well has four nesting levels, ranked by how much ceremony the work justifies. Pick the row that matches the task at hand and adopt its four columns together — they're calibrated as a set.

Tier When Plan depth Verify depth Capture PR shape
Inline edit Trivial — one file, no ambiguity, diagnosis shorter than fix None Eyeball + existing tests None None or part of larger
Structured one-shot 1–3 files, single session, no design questions Goal + constraints + acceptance criteria in turn one Tests + dogfood at end Inline lesson if surprising Single PR
XML-enhanced prompt 3+ files, dependencies, subagent handoff plausible Plan-mode iteration before execution Tests + dogfood + named failure modes Dedicated memory write Single PR or stacked 2–3
Spec (/spec) Multi-session, design ambiguity, irreversible choices, multi-PR Full design phase + pre-committed decision matrix All of the above + regression guards decisions.md log + memory writes Multi-PR, often stacked

Two rules read into this table. The cost of over-formalizing a trivial task is real — a spec for a one-character fix is a productivity tax on the system. The cost of under-formalizing a complex task is much larger — a spec-shaped problem given inline becomes wrong-scope work that requires unwinding. Right-sizing means matching the row to the work and not being precious about either direction.

The four tiers nest: a spec contains tasks, tasks become XML prompts when they meet the XML-prompt criteria, XML prompts contain structured one-shots and inline edits where appropriate. Mistaking them as competing artifacts produces predictable failure modes — under-scoped specs with no implementable tasks; over-scoped XML prompts that should have logged decisions in a decisions.md for posterity; structured one-shots that should have been escalated to specs because design ambiguity surfaced mid-session.

Spec. Right artifact when there is genuine design ambiguity, when the work spans multiple sessions, when you want the why recorded for the next agent (or the next human reading the codebase a quarter later), and when irreversible choices are about to be made. Specs in our setup live in docs/specs/<topic>/, orchestrated by the /spec slash-command. The four files aren't formality for its own sake — each one pays specific rent:

The four-file structure is small and the editing cost is low — most spec edits touch one or two files at a time. The discipline is that the files exist before implementation starts, not after as rationalization-of-already-shipped work.

XML-enhanced prompt. Right artifact when the work touches three or more files with dependencies, when the unit will be executed by a subagent or future session, and when you want a self-contained executable specification rather than back-and-forth. The schema is small (<task>, <objective>, <context>, <files-to-create>, <files-to-modify> with before/after, <validation>, <risks>); the property that matters is self-containment — an XML prompt should be executable cold, without the executor reading any other document. Paths specific. Validation steps are concrete assertions, not "make sure it works." Risks named with severity tags.

Structured one-shot prompt. Right artifact when the work fits inside a single session and a single PR but isn't trivial enough to skip planning entirely. Three components in the first turn: goal (what success looks like), constraints (non-goals, things not to touch), acceptance criteria (how you'll verify it's done). No XML schema, no formal spec; just a structured paragraph or two before the agent starts touching code. Most day-to-day work lives in this tier. Giving it a name turns it from "improvise the right framing" into "follow the pattern."

Inline edit. Right move for trivial work: a single-file edit, a config tweak, a bug fix where the diagnosis is shorter than the fix, a one-line documentation correction. The discipline here is not over-formalizing. A spec for a one-character fix is a productivity tax on the system.

Pre-committed decision matrices are the highest-leverage artifact inside this discipline. When a spec's value rests on a measurable claim — "X costs Y bytes or dollars or seconds" — write the decision matrix before running the experiment. Commit the matrix to the repo. When the measurement lands, the matrix routes the decision automatically. Goalpost-moving becomes impossible because the goalposts are version-controlled with a timestamp prior to the data.

This pattern saved us real work on a recent embeddings decision in our retrieval pipeline. The matrix was committed: if golden P@1 is ≥70% with the current zero-dependency retriever, defer the embeddings dependency; if 60–69%, run a small additional experiment to decide; if <60%, adopt the dependency. The measurement came in at 73.3%. The matrix routed the decision in one minute without debate. We had every incentive in the moment to find reasons to adopt the heavier dependency anyway (it was the more interesting path, the more "real RAG" path), and the matrix prevented that. The commit timestamp is the arbiter, not your later preference.

Read first, execute second. For any open-ended task, the agent reads everything relevant — the spec, the recent commits on adjacent code, the test suite for the area being touched, the memories with related tags — before executing the recommendation. Surface only the genuinely-ambiguous items to the human. Do not make the human re-explain context that is already in the repo. The cost of an extra grep is seconds; the cost of executing on a misread context can be hours and a force-push.

A worked example: the article you are reading was proposed with a four-phase scope — outline, section drafts, integrate, derive short forms. That structure is not decorative. It is how the agent can deliver a meaty artifact without burning context on whole-document re-reads at each step. Each section is its own XML-prompt-shaped unit. Each one stands alone (a locked design constraint). When a future session picks up the derivative-form work (LinkedIn post from §8, Discord drops from §3/§5/§7, Twitter thread from §6), the foundation piece is the source-of-truth and the derivative work reads one section at a time. The structure makes the work compositional.

One more property to name: artifact discipline is what makes handoffs possible. A session ending at "spec approved, three tasks queued as XML prompts" can be picked up by a different agent (or the same agent in a fresh session) and produce work that matches the original intent. A session ending at "I think we should probably build something like…" cannot be handed off without losing fidelity. Artifacts are the protocol that survives the session boundary.

The discipline needs a surface to live on. Specs are markdown files. XML prompts are shareable text. Memories are version-controlled. The dashboard is a web UI over disk artifacts. Pick tools that make these artifacts cheap to create, share, and revisit — the surface matters less than the discipline running on it. Our implementation is the attune-* family: attune-ai as the tool belt (the plugin that orchestrates the disciplines), with attune-author (authoring help content), attune-gui (the ops dashboard), attune-help (the living-doc help layer), attune-rag (retrieval over the help corpus), and attune-verify (fact-checking generated content) as the specialized tools the belt carries. The patterns above don't require these specific tools — they require some tools that play these roles. If you'd rather wire your own, the disciplines stand alone; if you'd rather skip the wiring, the family is on PyPI.


§5 — Memory Discipline: What Persists, What Doesn't

The agent's memory is a curated artifact, not a log. Memory in this sense is the persistent layer that survives session boundaries — the files in the agent's auto-memory directory, the project's top-of-tree lessons section, the cross-linked notes that future sessions inherit at startup. It is the long-term knowledge layer of the collaboration. It is not the conversation transcript and it is not the commit history. Those are logs. Memory is the curated subset of what those logs surfaced that is worth carrying forward.

Three classes of fact help orient what belongs in memory.

Worth saving. Surprising user preferences. Validated approaches that aren't obvious from the code. Recurring failure modes with their fix shape (pre-commit's stash conflict with auto-fix hooks; the YAML colon trap in run: blocks). Project-specific gotchas where the why matters as much as the what. The common shape is non-obvious from the code, won't be re-derivable, has bite if forgotten.

Not worth saving. Anything git log already knows (who changed what, when, in what commit). Anything the project's top-of-tree instructions already capture as a project-wide standard. Conversation context that ends with the session. Generic principles the model already knows. The test for "not worth saving" is can the agent re-derive this in the moment by reading the code or running a command. If yes, don't save it; saving it just creates a maintenance burden as it goes stale.

Anti-saves. These look like memory but are actively harmful. Debugging recipes ("when X breaks, do Y") — the fix is in the code; the commit message has the context; a recipe in memory rots the moment the underlying code changes. Per-task state — that belongs in a task manager, not a memory file. Snapshots of repo state ("here are the files in src/") — these are guaranteed stale within hours.

Proactive persistence is the rule that makes memory actually work. When the human teaches the agent something non-obvious — preference, pattern, validated approach — the agent's job is to name the memory type AND implement the write, in the same response. "I'll save this as a feedback memory" without the actual write is performative. The write without the naming leaves the human guessing whether the agent actually got it. Both halves required. Both in the same response.

The reason is that the teaching moment is the moment with the highest fidelity. The human knows exactly what they meant, the context is fresh, the agent can ask one clarifying question if needed and get the precise answer. Twenty minutes later, the moment has degraded. Twenty hours later, the moment is gone and the agent has invented a slightly-wrong version of the lesson. The discipline is write while it's hot.

Cross-linking and indexing is what turns a pile of memories into a navigable knowledge base. Each memory references related memories. The index file is a curated front door, not a log — it's a one-line-per-entry pointer list, ordered semantically by topic, kept under a length cap so the agent loads the full index at session start. When a new memory is added, the index entry is added in the same write. Stale entries are pruned. Duplicates are merged into the canonical home. None of this is automatic. All of it is discipline.

Stale memory is the dangerous category and it deserves its own paragraph. A memory written six months ago about "the routes module lives at src/attune/ops/routes.py" can be confidently wrong today because the routes module was split into a package three months ago. The memory is still there, the agent still recalls it, and an action taken on the recalled fact ("edit the routes module") opens a file that no longer exists. The discipline is verify against current state before acting on a recalled fact. The cost of grep -r "class RoutesModule" src/ is sub-second. The cost of executing on a stale memory can be a force-push and an apology.

This applies broadly. Memories that name functions, classes, file paths, CLI flags, env vars — all of those are claims that existed when the memory was written. They may have been renamed, removed, or never merged. Before recommending action that depends on the named entity, verify. "The memory says X exists" is not the same as "X exists now." The agent that confuses those two breaks the contract from §2 about honesty about limits.

One more category worth naming: memories that snapshot state (an audit log, an architecture snapshot, a count). These are frozen in time. If the user asks about recent or current state, the right move is git log or reading the live code, not recalling the snapshot. The snapshot tells you what was true on the date in the frontmatter, not what is true now. Use the snapshot to understand history; use the code to understand the present.

Finally, the consolidation discipline. Memory grows. The index grows. Without periodic consolidation, the system silts up with near-duplicates and stale entries that quietly mislead. The mechanism is a reflective pass at intervals — merging near-duplicates, fixing stale facts, pruning index entries that no longer pay rent for their slot. This is the librarian function of memory work. The agent is the natural worker for it; the human approves the consolidation diff. Schedule it like any other maintenance activity. Without it, the value of memory degrades to noise inside a few months.


§6 — Multi-Agent Coordination: When You're Not the Only One

The coordination patterns named in this section — don't push without fetching, the PR-and-approve cycle, durable authorizations, clear ownership of side effects — are what distributed engineering teams have used for years. Applying them deliberately to human-agent collaboration matters more because the failure modes here (parallel sessions clobbering each other's work, silent live-state writes from one session that another session can't see, lost handoffs across worktree boundaries) are harder to spot when one party can't independently advocate for itself in the way a human teammate would.

The reality of a serious working setup is that multiple sessions of the same agent run in parallel. Different worktrees. Different terminals. Different machines. Sometimes different users. Each session writes to a shared filesystem, can push to the same PR branches, can write to the same memory files, can flip status on the same specs. Coordination is not an edge case — it is the baseline.

The collision shapes that recur:

Parallel push to the same PR. Two sessions push to the same branch. The slower one runs git push and sees "Everything up-to-date." That message does not disambiguate "already pushed by me" from "your peer beat you to it." The wall-clock between your last fetch and your push is not a vacuum; an upstream peer may have written in the gap. The discipline: after any "up-to-date" message, run git fetch origin <branch> && git log HEAD..origin/<branch>. If the remote has commits HEAD doesn't, a peer pushed in the gap. Do not assume "up-to-date" means "your work landed."

Worktree main-branch contention. A sibling worktree holds the main branch checked out in the bare repository's tracking. When you run gh pr merge from a non-main worktree, the remote merge succeeds, but the local post-merge fast-forward of main fails with a confusing "branch is already used by worktree at…" error. The exit code is non-zero. The merge looks like it failed. It did not. Verify state with gh pr view <PR> --json state,mergedAt,mergeCommit and trust the API, not the local exit code. The remote is authoritative; the local fast-forward is convenience.

Live-state writes from another session. This is the failure mode that drove a real spec this week. The ops dashboard writes spec status changes directly to disk in the working tree as part of its UI flow — not as commits. When the session that hosted the dashboard ends, those writes are stranded: real intentional state changes that exist only as unstaged modifications in the main checkout's working tree, invisible until someone runs git status and sees ten modified spec frontmatter files with no commit attached. Hit this exactly that way today. Ten spec edits, all real, all at risk of loss.

The general discipline for collision-avoidance is treat the wall-clock between your fetch and your write as not-a-vacuum. After any "rebase + force-push" cycle on an active main, re-fetch and re-rebase if upstream moved in the gap. Before admin-merging a base PR with --delete-branch, re-target the stacked PRs first so they don't auto-orphan. Before assuming you understand the working tree, run git status AND grep for known live-write surfaces (dashboard-driven status writes, telemetry buffers that haven't flushed, journal files from in-flight workflows). The rule is mechanical. The cost of doing it wrong is not.

Admin-merge authorization durability is a named pattern worth pulling out. The human grants admin-merge authorization once at the start of a batch — "yes, you can admin-merge the green-CI low-risk merges as they land." The agent treats that grant as durable for similar future merges within the session. Saves friction across the next four or eight merges; the human is not re-asked on each one. The discipline runs in both directions. The agent does not re-ask on each similar merge (eliminates friction). The human trusts that the agent's definition of "similar" matches theirs (eliminates scope creep). When the agent is unsure whether a merge fits the scope — a merge that touches a different package, a merge that drops protection for longer, a merge with non-green CI — it asks. Authorization durability is not permission to expand. It is permission to repeat the same pattern.

Before adopting this pattern. Admin-merge authorization durability is an advanced move. It presumes preconditions an inexperienced operator may not have, and adopting it without them turns a safety mechanism into theater.

If you're advising someone newer to agent collaboration, the default is "ask each merge, read each diff." Durable auth is what you graduate to once those five preconditions are real, not what you reach for to save typing.

The dashboard as a coordination surface is the read-side of this discipline. When multiple sessions can write to a shared system, the dashboard becomes the queue where you find out what has happened. Specs flip status. Sessions list updates. Run telemetry accumulates. Family snapshots show package versions across the org. The discipline is to use the dashboard at decision points, not just at end-of-day. It is the lens that makes coordination judgment possible.

Two-layer judgment runs the cycle, and it is the same shape as §1. The dashboard surfaces raw state (telemetry numbers, spec status flips, recent run chips, family snapshot, anomalies). The agent synthesizes options and recommends. The human picks. The dashboard is not the decision-maker. The dashboard is the lens that makes decision-quality possible. Today's session: the dashboard surfaced the other session's ten unstaged spec edits (otherwise invisible from a fresh agent session), a workspace-versus-account spend anomaly (worth flagging, not blocking), the in-flight PRs' CI states. Each surfacing routed to an agent recommendation that routed to a human decision that routed to mechanical execution. The cycle is the discipline.

The worked example of discipline running on itself: the same session that surfaced those ten unstaged spec edits also designed and shipped a proactive solution to the underlying class of bug in the same afternoon. Sequence: the agent named the class of bug — live-state writes from session-bound services becoming silent debt when sessions end. The agent enumerated six solution candidates with a pros-and-cons table. The agent recommended a layered approach (a durable journal of pending writes, an API for inspecting them, consumer wiring on the write sites) and named why each layer addressed a distinct damage mode. The human approved the design. The agent proposed phasing — Phase 1 today, Phases 2 and 3 deferred — and the human approved that scope cleanly. The agent wrote the full spec at docs/specs/dashboard-pending-writes-journal/, with requirements, design, and a decisions.md logging eight durable choices. The agent implemented Phase 1 in roughly an hour and a half — a journal writer, an API endpoint, the spec-status setter wire-in, and twenty-five tests, all green. An end-to-end smoke test against the live dashboard confirmed the cycle. The PR opened as attune-ai #469.

The point of the worked example is the meta-shape. The article's central thesis — discipline produces better work faster — gets demonstrated by the act of using the discipline to address the discipline's own friction. The spec design AND the implementation AND this paragraph are all products of the same multi-hour session. No single piece of that work is impressive in isolation. The compounding is the discipline.


§7 — Verification: Quality Is Not Optional

An LLM-driven documentation pass on a real codebase produced six classes of hallucination in a single page — invented CLI flags, wrong import paths, fabricated cross-references, wrong route paths, an insecure example, a fabricated numeric count. The unit tests caught zero of them. Three of the six would have broken readers who followed the docs literally. Verification, not taste, is what closes that gap.

The verification gap is the space between "the code compiles and the tests pass" and "the work actually does what the work was supposed to do." An agent that runs tests is not an agent that verified the work. Tests pass on bad code constantly — when the test was written from the same misunderstanding as the code, both sides agree on the wrong thing and the test reports green. Coverage is green on dead code paths. Type checks pass on functions that no caller ever invokes. The gap is real and large.

Verification is not one move; it is four, sorted by what they check. Two guard the boundaries of generation — the input the model is handed and the output it produces. Two guard the work itself — what it does at runtime and how its decisions got made. Name which one a given check buys you, and you stop confusing "the tests are green" with "the work is true."

Grounding the input — "is the claim supported?"

When the agent answers from retrieved sources, the first verification is whether each claim is actually supported by the sources it cites. The lever is structural: force a citation for every claim and treat an uncited claim as no claim at all. The result is measurable — on attune-rag's golden evaluation set, citation-forced grounding reaches 99.6% per-claim faithfulness: of every individual factual claim the system makes, 99.6% trace back to a cited passage.

That number rewards a careful reading, because there are two ways to count and they tell different stories. Per claim — the honest unit — asks of each statement, "is this one grounded?" 99.6% are. Per query — a stricter all-or-nothing bucket — fails an entire answer if even one claim slips; by that count roughly one answer in fifteen still trips. The per-claim number is the one the reader actually experiences, sentence by sentence, and it is the one to lead with. It got there structurally — by making an uncited claim impossible to state — not by asking the model to try harder.

Fact-checking the output — "do the named things exist?"

Grounding can be perfect and the output can still invent a flag that was never in the retrieved context; the model fills surrounding scaffolding from priors that sound right. That is the second mode — and it is exactly the six-hallucination case above. The polished help page passed its unit tests because the fabrications were plausible, not true. The check is to resolve every named entity in the artifact against reality: every Python import imports, every CLI flag exists in --help, every markdown link resolves, every numeric claim matches its source. The failure modes recur and are therefore automatable — fabricated symbols, flags, cross-references, route shapes, version-mismatched samples, insecure examples, fabricated counts — so a pass that checks them by name catches most of what slips past taste. (That pass now ships as a standalone fact-checker, attune-verify — stdlib-only, on PyPI, wired into attune-ai as the /verify skill — so any generation pipeline can call it.) Grounding verifies the input; fact-checking verifies the output; together they bracket generation.

Verifying behavior — "does it do the true thing?"

The dogfood principle is the highest-leverage move here. Before declaring an LLM-generated artifact done, run it against a realistic input it will actually face in production. Dogfood runs are slower than unit tests by orders of magnitude, but they operate at the meaning layer where the hallucinations live: a unit test asserts the function returned the expected type; the dogfood run asserts it returned something true. Both are needed; neither replaces the other. Its durable companion is the regression guard — after fixing a bug, write a test whose only purpose is to fail loudly if the bug returns. It need not be beautiful; it needs to fail when the bug is back. The best ones nail a specific past failure by name: one asserts a Path.endswith("/…") pattern stays out of tests because Windows paths break it; one asserts a dataclass field is required because forgetting it shipped wrong telemetry for ten days. Each is mechanical, narrow, durable; the collection compounds into a wall that past failures do not recur through.

Verifying decisions — "did measurement route the call?"

Decision matrices are verification by structure. Commit the "if A then X, if B then Y" matrix before measurement and the measurement automatically routes the decision: nothing to "review," because the matrix says what to do, the measurement says which branch fires, and the decision follows mechanically. This eliminates the post-hoc rationalization where the human or the agent finds reasons to favor the more interesting path against what the data actually says.

What holds the four together

Verification beats taste. The pull to read an agent's artifact and approve it on prose feel is strong; the discipline is to name the concrete check before generation starts and approve on the check's result, not the feel. And verification is a thing that runs and produces an output — it is not asking the agent "are you sure?" (it is wired to say yes), not skimming a thousand-line diff (beyond human anomaly-spotting), not trusting an "I verified that…" (a verbal report, not an artifact). If you cannot name what would have failed had the bug been present, you have not verified. The dashboard is that same property pointed at the world rather than the code — watch for drift: stale specs, broken cross-package pins, telemetry anomalies (today: workspace seven-day spend exceeding account seven-day spend, mathematically impossible, flagged not blocked). Use it at decision points, not just end-of-day.

This is also what makes trust sustainable across sessions. A human who trusts the agent because it claims to be careful is trusting a verbal report — fragile, and it erodes the first time a confident agent ships a confident bug. A human who trusts because the verification passed is trusting a check, and that trust accumulates because the check does the work of being trustworthy. Verification offloads trust from the relationship to the artifact, where it belongs.

The receipt. The discipline produces measurable artifacts, and the strongest is the one that says the work is true, not merely tested: citation-forced grounding holds 99.6% per-claim faithfulness on attune-rag's golden set — better than ninety-nine parts in a hundred of the individual claims the system makes are checkably traceable to a source. And it got there by structure, not by exhortation. That is what verification buys that taste cannot.


§8 — What This Looks Like When It Goes Right: A Twenty-Four-Hour Case Study

The morning of 2026-05-25 began with attune-rag 0.2.0 already on PyPI — shipped overnight in a clean release ceremony. The downstream pin-widens were the work of the day. Three sibling packages (attune-author, attune-gui, attune-ai) each carried a cap on attune-rag that needed to lift to admit the new version. None of those widens was difficult in isolation. The compounding question was the sequence and shape.

The session opened with the working mode declared: advancing a measurable scope (the pin widens), the outcome named (all three siblings shipping pin-widened versions to PyPI), the done-when criteria stated (every dependent verifiable on pip install <pkg> from a clean venv). The agent read the relevant CHANGELOG sections, the lockfile state of each sibling, the open PRs on each. The dashboard surfaced the family snapshot: five packages, current versions, who depends on whom.

The first call was simplest-first order. attune-author had the cleanest cap-widen — single-line bump, no transitive complications. Ship that. Then attune-gui. Then attune-ai, which had the most surface area. The agent recommended; the human approved.

attune-author 0.14.1 went smoothly. Cap widened, CHANGELOG bumped, tag pushed, publish workflow approved, PyPI confirmed within ten minutes. Clean. Move to attune-gui.

attune-gui surfaced a judgment call. The widen itself was trivial, but CHANGELOG.md showed an unreleased MCP bundle queued under ## [Unreleased] that the maintainer had been holding for the next minor. A patch release here would bundle the bundle without intent — the maintainer was waiting for the right moment to ship that work. The agent surfaced the choice with options. Option A: ship as 0.7.5 patch, bundle the MCP work as an unintended ride-along. Option B: ship as 0.8.0 minor, intentional inclusion of the MCP work that was already documented as ready. Recommendation: B, because the work was already documented as [Unreleased] and ready, and a minor bump is the right shape for "new feature accompanies the pin widen." Human approved. attune-gui 0.8.0 shipped.

attune-ai 7.1.2 hit a different snag mid-flight. A dependabot PR for the codeql action had landed on main two hours earlier with a lockfile that had drifted past the version it was bumping to — the action self-bootstrapped during the dependabot run. The PR looked merged but the lockfile was now wrong. The agent's recommendation: close the dependabot PR (do not re-bump, the action version was correct), then manually re-cut a lockfile-only PR to restore correctness. Human approved. The replacement PR landed clean. attune-ai 7.1.2 pin-widened and shipped to PyPI under the same ceremony.

At this point the dashboard surfaced something unexpected: ten spec status edits in the main checkout's working tree, modified but unstaged. The signature was a parallel session writing through the ops dashboard's spec-status setter — live writes that landed on disk but never made it into a commit. The session that owned them had ended. The edits were real. The agent surfaced the find with options. Commit them as their own PR (preserving the parallel session's work). Stash them and investigate later. Discard them (deny the parallel session's work). Recommendation: commit, because the edit pattern (status transitions on currently-active specs) was unmistakably intentional and at risk of loss. Human approved. A spec-status persistence PR was cut, reviewed quickly (the diffs were small and self-consistent), and merged.

That find — ten unstaged spec edits surfaced by the dashboard from a parallel session — drove the design work of the rest of the afternoon. The class of bug was named: live-state writes from session-bound services become silent debt when sessions end. The agent enumerated six solution candidates with a pros-and-cons table — a durable journal, a session-end commit hook, periodic auto-commit, a dashboard-side "stage these" button, a "pending writes" overlay, a hard "no live writes outside commits" stance. The recommendation was a layered approach: a durable journal of pending writes (Phase 1), an API for inspecting and resolving them (Phase 2), consumer wiring on write sites (Phase 3). Phase 1 was small enough to ship today; Phases 2 and 3 were deferred to dedicated future sessions. The human approved the design and the phasing.

The agent wrote the spec at docs/specs/dashboard-pending-writes-journal/. Requirements, design, decisions with eight durable choices logged, tasks broken down. The full spec was reviewable in under fifteen minutes. The agent implemented Phase 1: a journal writer, an API endpoint, the spec-status setter wire-in, twenty-five tests covering the happy path and the failure modes. All green on the first full pass. Smoke test against the live dashboard confirmed the cycle: a spec status flip now lands in the journal before writing to disk; the journal records the pending write; on next reconciliation the journal entry is cleared. The PR opened as attune-ai #469. Admin-merge under the durable session authorization. Done.

Four PyPI releases. One mid-session pivot when the dependabot anomaly surfaced. One judgment call on minor-versus-patch. One proactive design and ship-cycle in response to a live-state-write find. Six end-to-end ships if you count the spec-status persist and the journal as their own. Two real decision points the agent could not have made alone. Roughly twelve mechanical decision points across the session that the agent handled without re-asking. One in-session pushback on the agent's own initial plan to run code-review on a config-only PR (no signal; skip; save the budget). One proactive memory write that did not exist at session start (the don't-enable-fatigue-push lesson, named after the third fatigue signal in three days). Regression guards landed alongside the new code rather than as a separate coverage push — verification kept pace with the work instead of trailing it. A handful of clean stops at completion boundaries when the human signaled fatigue.

The point of the case study is not that the morning was breathtaking. No single decision in that sequence was virtuoso. The shipped releases were routine. The spec was small. The implementation was an hour and a half of focused work. What made the morning unusual was the absence of friction: no parallel-push collisions, no over-formalization on the trivial decisions, no under-scoping on the substantive ones, no re-explaining context that was already visible on the dashboard, no fatigue-grinding past the productive window. Six disciplines, each boring in isolation, compounding into a morning where the work moved.

And that morning wasn't a one-off. Across the two weeks ending 2026-06-02, the same one-developer-plus-agent setup merged 130 pull requests — roughly nine per calendar day — into the attune-ai repository. The composition matters more than the headline: about fifty were feature and fix code; the rest were documentation, specs, and release work the discipline keeps moving in lockstep with the code. That lockstep is the point — §5's memory and §4's artifacts mean the docs and specs keep pace rather than accruing as debt behind the shipping. The number is a dated snapshot of what the discipline produced here, not a multiplier anyone is promised.

The closing point is simple. This is a learnable skill. The discipline above is six bullet-points worth of vocabulary, not a worldview, not an ideology, not a stance on AI. Adopt the contract from §2. Adopt the pacing from §3. Adopt the artifact nesting from §4. Adopt the memory rules from §5. Adopt the multi-agent awareness from §6. Adopt the verification gates from §7. Practice each one boring-ly until they compound. The mornings that result are unremarkable in any single moment, and remarkable across a week.

There is a tell in how this article got written. The attune-* family is these disciplines turned into software — attune-ai builds and maintains the others — and the disciplines, once named, keep generating the next thing. Drafting this piece surfaced a friction none of the six quite covered; naming it produced a new spec by the end of the session. The discipline didn't only describe the work. It generated more of it. That is the synergy — tied to the discipline itself, not to any one tool.

That is the discipline of agent collaboration. We are still learning it. We hope this is useful.


A note on the authorship

This piece was written collaboratively. Patrick Roebuck (founder, Smart AI Memory) drafting, directing, and making every irreversible call; an AI agent helping to author under the discipline this article describes. Patrick brings three decades of software development plus earlier years in coordination-heavy roles outside engineering — patterns that shape his approach to contracts, pacing, and multi-party work. The past year-plus has been spent building the attune-* ecosystem — six PyPI packages (attune-ai, attune-rag, attune-author, attune-help, attune-gui, attune-verify) that together ship the workflow patterns described above. The evidence base for the claims here is that codebase and the multi-month working history that produced it.