agentic-collab 3.0

topics as ephemeral-spawn addresses · messaging unchanged · proxy stays thin
design diamond · rev 16 · companion to docs/v3-vision.md

North star

Inter-agent messaging is unchanged. Publishers send to an address; they don't care what's on the other side. New in 3.0 is one class of address: a topic declared by an ephemeral agent template. Delivering to such a topic makes the orchestrator run a host-shell prepare hook, create a tmux session, paste the engine start command into it, wait for the agent to signal completion, and then run cleanup.

The publisher writes the same send(addr, msg) either way.

Components

Legend: orchestrator (Docker) · proxy (host) · persistent agent · ephemeral instance · human

Human / UI
  approvals · topic browser

Orchestrator (Docker)
  SQLite WAL · HTTP API · WebSocket
  + topic_queue · agent_instances · approvals
  message dispatch · cool-down · lifecycle locks
  + instance reaper · address resolver
  persona/template loader: agents/*.md → agent_templates + topics

Proxy (host)
  /command : create_session, paste, exec, kill_session, …
  /upload : file streaming
  no new commands · still thin · still stateless

Persistent agent
  long-running tmux
  addr: agent:gitea-lead
  behaviour: unchanged

Ephemeral instance
  worktree · tmux · engine
  addr: agent:tmpl/inst-id
  life: one message

topic:<template>/<name> queue
  single-consumer · drains by spawning ephemeral instances
  examples: topic:aws-account-lead/provision · topic:gitea-lead/repo-create

Diagram edge labels: approvals · sends · proxyDispatch · paste · exec · start hook · spawn · collab complete

Addresses

agent: (unchanged)
agent:gitea-lead
Persistent agent's inbox.

agent:aws-account-lead/inst-7a3f
A live ephemeral instance. Messages paste into its tmux session for the duration of its run.

topic: (new)
topic:aws-account-lead/provision
Sending enqueues; the orchestrator drains by spawning ephemeral instances of the declaring template.

Topic names are scoped to their declaring template, so two templates may each declare provision without collision.

approval: (new)
approval:aws-account-provision
A human-decision channel. Approvals are CRUD records categorised by channel, not message queues that spawn workers.

The distinct prefix prevents the "topic = spawn compute" overload.

The send call is identical for agent: and topic: addresses:
collab send <addr> --payload '{...}' [--in-reply-to <id>]
Fire-and-forget. Replies arrive as separate inbound messages with optional in_reply_to. Approvals use their own CRUD subcommands.
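
To make the address classes concrete, here is a minimal sketch of how an address resolver might classify a string by prefix. The type names, function name, and error messages are hypothetical, and treating a bare 2.x target as a persistent agent id is an assumption made for illustration, not something the design states.

```typescript
// Hypothetical parser: type names and error messages are illustrative only.
type Address =
  | { kind: "agent"; agentId: string }
  | { kind: "instance"; template: string; instanceId: string }
  | { kind: "topic"; template: string; name: string }
  | { kind: "approval"; channel: string };

function parseAddress(addr: string): Address {
  const sep = addr.indexOf(":");
  // Assumption: a bare 2.x target (no prefix) names a persistent agent.
  const prefix = sep === -1 ? "agent" : addr.slice(0, sep);
  const [first, second, ...extra] = addr.slice(sep + 1).split("/");
  if (!first || second === "" || extra.length > 0) {
    throw new Error("malformed address: " + addr);
  }
  switch (prefix) {
    case "agent":
      // One segment is a persistent inbox; two segments name a live instance.
      return second === undefined
        ? { kind: "agent", agentId: first }
        : { kind: "instance", template: first, instanceId: second };
    case "topic":
      if (second === undefined) {
        throw new Error("topic needs <template>/<name>: " + addr);
      }
      return { kind: "topic", template: first, name: second };
    case "approval":
      if (second !== undefined) {
        throw new Error("approval takes a bare channel: " + addr);
      }
      return { kind: "approval", channel: first };
    default:
      throw new Error("unknown address prefix: " + prefix);
  }
}
```

Because topic names are scoped to their template, the parsed pair (template, name) is the natural lookup key; approval: parses here only so a sender can be told to use the approval subcommands instead.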

Topic delivery sequence

Two kinds of hooks, by execution surface: prepare / cleanup are host-shell (run via proxy exec) · start is tmux paste (typed into the pane). The proxy gets no new commands.
Lanes: Orchestrator · Proxy · Ephemeral instance · Filesystem
1. Orchestrator: allocate IPC paths, MESSAGE_ID. Filesystem: write $MESSAGE_PATH, empty $REPLY_PATH + $STATUS_PATH.
2. Orchestrator → Proxy: prepare hook → proxy exec.
2b. Proxy: host shell: git worktree add. Filesystem: worktree dir created at $WORKTREE_PATH.
3. Orchestrator → Proxy: create_session against cwd_base.
4. Orchestrator → Proxy: set-env + paste start hook.
4b. Proxy → Ephemeral instance: paste into tmux pane → engine starts.
5. Ephemeral instance: agent runs; collab send / approval await.
6. Ephemeral instance: collab complete --reply '{...}'. Filesystem: $STATUS_PATH=ok, $REPLY_PATH. HTTP notify (or reaper picks up status file).
7. Orchestrator: send(REPLY_TO_ADDR, reply, in_reply_to).
8. Orchestrator → Proxy: kill_session.
9. Orchestrator → Proxy: cleanup hook → proxy exec.
9b. Proxy: host shell: git worktree remove. Filesystem: worktree dir removed.
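
The ordering of proxy commands in the sequence above can be sketched as a small driver. This is an illustration under stated assumptions, not the orchestrator's actual code: the Delivery shape, the Dispatch callback, and the synchronous waitForAgent are all hypothetical (a real driver would be asynchronous), and only the command names exec, create_session, paste, and kill_session come from the design. The set-env half of step 4 and the reply forwarding in step 7 are left out.

```typescript
// Hypothetical shapes; only the proxy command names come from the design.
type ProxyCommand = "exec" | "create_session" | "paste" | "kill_session";
type Dispatch = (command: ProxyCommand, arg: string) => void; // throws on failure
type Outcome = "ok" | "failed";

interface Delivery {
  session: string; // tmux session name
  cwdBase: string; // existing directory used as the create_session cwd
  prepare: string; // host-shell hook
  start: string;   // line pasted into the pane
  cleanup: string; // host-shell hook
}

function deliver(d: Delivery, dispatch: Dispatch, waitForAgent: () => Outcome): Outcome {
  // Step 2: if prepare throws, nothing else has been created yet.
  dispatch("exec", d.prepare);
  let sessionCreated = false;
  try {
    dispatch("create_session", d.cwdBase); // step 3
    sessionCreated = true;
    dispatch("paste", d.start);            // step 4 (set-env omitted)
    return waitForAgent();                 // steps 5-6; step 7 would follow here
  } finally {
    try {
      if (sessionCreated) dispatch("kill_session", d.session); // step 8
    } finally {
      dispatch("exec", d.cleanup);         // step 9 runs even if kill fails
    }
  }
}
```

The nested finally is a design choice of the sketch: once prepare has succeeded, cleanup is attempted on every path, which mirrors the document's statement that the cleanup hook owns worktree removal.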

Agent template anatomy

Same shape as today's persona files (YAML frontmatter + markdown body). Lives in agents/*.md. New fields are scoped to ephemeral templates: persistent, cwd_base, cwd_template, repo_root, prepare, cleanup, topics. Existing persona files load unchanged with persistent: true.
---
id: aws-account-lead
persistent: false
engine: claude
model: opus

# cwd_base is real & existing — used as create_session cwd.
cwd_base: /var/agentic/work/aws-account-lead
# cwd_template is per-message; prepare creates this directory.
cwd_template: /var/agentic/work/aws-account-lead/wt-{{message_id}}

# Host-shell hooks (run via proxy exec). NEW mechanism.
prepare: |
  git -C "$REPO_ROOT" worktree add "$WORKTREE_PATH" main
cleanup: |
  git -C "$REPO_ROOT" worktree remove --force "$WORKTREE_PATH"

# Tmux-paste hook (today's mechanism). Typed into the pane.
start: |
  cd "$WORKTREE_PATH" && claude --session-id "$MESSAGE_ID" < "$MESSAGE_PATH"

topics:
  - name: provision
    schema: ./schemas/provision.json
    reply_schema: ./schemas/provision-reply.json
    concurrency: 1
    monitor_template: aws-account-monitor
  - name: teardown
    schema: ./schemas/teardown.json
    prepare: ./teardown-prepare.sh         # optional per-topic override
---

# AWS Account Lead

You provision and tear down AWS accounts...
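
As a rough illustration of how a loader might use these fields, the sketch below expands the {{message_id}} placeholder in cwd_template and applies a per-topic hook override on top of the template-level hooks. The interfaces and function names are hypothetical; only the field names and the override behaviour are taken from the example above.

```typescript
// Hypothetical in-memory shapes; field names mirror the frontmatter above.
interface Hooks {
  prepare: string;
  start: string;
  cleanup: string;
}

interface TopicDecl {
  name: string;
  prepare?: string; // optional per-topic overrides
  start?: string;
  cleanup?: string;
}

interface Template extends Hooks {
  id: string;
  cwdTemplate: string;
  topics: TopicDecl[];
}

// Expand every {{message_id}} placeholder in cwd_template.
function renderCwd(cwdTemplate: string, messageId: string): string {
  return cwdTemplate.split("{{message_id}}").join(messageId);
}

// A hook declared on the topic wins over the template-level hook.
function effectiveHooks(t: Template, topicName: string): Hooks {
  const topic = t.topics.filter((x) => x.name === topicName)[0];
  if (!topic) {
    throw new Error("template " + t.id + " declares no topic " + topicName);
  }
  return {
    prepare: topic.prepare ?? t.prepare,
    start: topic.start ?? t.start,
    cleanup: topic.cleanup ?? t.cleanup,
  };
}
```

With the example template, the teardown topic would resolve prepare to ./teardown-prepare.sh while keeping the template's start and cleanup.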

Two env contracts, by hook kind

prepare / cleanup · host shell (proxy exec)
start · tmux paste
The same vars are exported into the session via tmux set-environment before the paste, so the pasted line can reference them and the pane's shell expands them when the line runs.
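
A minimal sketch of the start-hook side of this contract follows. It collects the variable names that appear elsewhere in this document (the exact per-hook list is not given here, so the set is an assumption) and turns them into tmux set-environment invocations. The MessageContext shape and both function names are hypothetical; tmux set-environment -t <session> <name> <value> is the standard tmux form.

```typescript
// Hypothetical per-message context; variable names are those used in this document.
interface MessageContext {
  agentTemplate: string;
  messageId: string;
  messagePath: string;
  replyPath: string;
  statusPath: string;
  worktreePath: string;
  repoRoot: string;
  replyToAddr?: string;
}

// Name/value pairs to export; which hooks receive which vars is an assumption.
function hookEnv(ctx: MessageContext): Array<[string, string]> {
  const pairs: Array<[string, string]> = [
    ["AGENT_TEMPLATE", ctx.agentTemplate],
    ["MESSAGE_ID", ctx.messageId],
    ["MESSAGE_PATH", ctx.messagePath],
    ["REPLY_PATH", ctx.replyPath],
    ["STATUS_PATH", ctx.statusPath],
    ["WORKTREE_PATH", ctx.worktreePath],
    ["REPO_ROOT", ctx.repoRoot],
  ];
  if (ctx.replyToAddr !== undefined) {
    pairs.push(["REPLY_TO_ADDR", ctx.replyToAddr]);
  }
  return pairs;
}

// One argv per variable, as arrays so no shell quoting is needed.
function tmuxSetEnvArgv(session: string, pairs: Array<[string, string]>): string[][] {
  return pairs.map(([name, value]) => ["tmux", "set-environment", "-t", session, name, value]);
}
```

Building argv arrays rather than one command string keeps values containing spaces or quotes intact, which matters for paths and JSON-bearing addresses.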

Approvals — CRUD + auto-notify

Approvals are categorised by channel (approval:<channel>), not by topic. A channel is a UI feed label, not a routing target — the auto-notify message goes to the requester's agent address, never to approval:.
Participants: Agent (collab approval create, collab approval await) · Orchestrator (approvals table, CRUD + ws event) · approval:<channel> (UI feed, approval-changed event) · Human (approve · reject · amend)
1. create --channel
2. emit ws event
3. UI shows
4. set state (approved / rejected / amended)
5. auto-notify → agent: addr
collab approval await <id> polls the orchestrator directly; it doesn't depend on the notify message arriving.

Approval subcommands

collab approval create --channel <name> --payload <json>    # returns approval_id
collab approval get <id>
collab approval set <id> --state approved|rejected|amended [--payload …]
collab approval withdraw <id>
collab approval await <id>                                  # blocks until terminal
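
The state handling behind these subcommands might look roughly like the sketch below. The states approved, rejected, and amended come from the diagram, and pending appears in the crash-recovery notes. The name withdrawn for the result of collab approval withdraw, and the rule that every non-pending state is terminal, are assumptions made for illustration. The polling loop is synchronous with an injected getter to keep the sketch self-contained; a real await would sleep between polls.

```typescript
// "withdrawn" and the terminal-state rule are assumptions, not stated in the design.
type ApprovalState = "pending" | "approved" | "rejected" | "amended" | "withdrawn";

function isTerminal(state: ApprovalState): boolean {
  return state !== "pending";
}

// Validate a state change such as `collab approval set <id> --state approved`.
function transition(current: ApprovalState, next: ApprovalState): ApprovalState {
  if (isTerminal(current)) {
    throw new Error("approval already " + current);
  }
  if (!isTerminal(next)) {
    throw new Error("cannot move back to pending");
  }
  return next;
}

// Poll the orchestrator (via getState) until a terminal state or the poll budget runs out.
function awaitApproval(getState: () => ApprovalState, maxPolls: number): ApprovalState {
  for (let i = 0; i < maxPolls; i++) {
    const state = getState();
    if (isTerminal(state)) return state;
  }
  return "pending";
}
```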

CLI surface — mode-aware

The collab binary detects ephemeral context from env (presence of $MESSAGE_ID + $AGENT_TEMPLATE + $REPLY_PATH) and adapts its help text, exposed subcommands, and the system-prompt addendum injected into the engine. Persistent-mode behaviour matches today.
Persistent mode
collab send <addr> --payload …
collab approval create / get / set / withdraw / await
Banner: "You are agent:gitea-lead. Messages arrive in your inbox; reply by sending."

Ephemeral mode
collab send <addr> --payload …
collab approval create --channel … / get / set / withdraw / await
collab complete --reply <json>
collab fail --reason <text>
Banner: "You are handling message $MESSAGE_ID on topic <tmpl>/<topic>. Call collab complete when done."

The system prompt composed by persona.ts also branches on mode: ephemeral agents are told they handle exactly one message and must complete; persistent agents are told they have an ongoing inbox.
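
The mode detection described above can be sketched as a pure function over the environment. The function names are hypothetical, and treating an empty value as absent is an assumption; the three variable names and the per-mode subcommand lists come from this section.

```typescript
type Mode = "ephemeral" | "persistent";

// Ephemeral context requires all three variables to be present.
function detectMode(env: Record<string, string | undefined>): Mode {
  const required = ["MESSAGE_ID", "AGENT_TEMPLATE", "REPLY_PATH"];
  const allPresent = required.every((name) => {
    const value = env[name];
    // Assumption: an empty string counts as "not set".
    return value !== undefined && value !== "";
  });
  return allPresent ? "ephemeral" : "persistent";
}

// Top-level subcommands exposed in each mode.
function subcommands(mode: Mode): string[] {
  const common = ["send", "approval"];
  return mode === "ephemeral" ? common.concat(["complete", "fail"]) : common;
}
```

Requiring all three variables, rather than any one of them, keeps a stray $MESSAGE_ID in a persistent agent's shell from flipping the CLI into ephemeral mode.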

Backwards compat: today's collab send syntax (<target> --topic <category> <message>) keeps working. The --topic flag retains its 2.x meaning (message category); v3's topic: is encoded in the <addr> prefix. The client-side /api/agents target validation in bin/collab is widened to accept prefixed addresses without rejecting them as "no such agent".

Data model

Legend: new in 3.0 · existing 2.x (untouched)
agent_templates    (id PK, persona_path, engine, model, persistent BOOL,
                    cwd_base?, cwd_template?, repo_root?,
                    hook_start?, hook_exit?, hook_prepare?, hook_cleanup?,
                    ...other hooks as today)

topics             (agent_template FK, name,
                    hook_prepare_override?, hook_start_override?, hook_cleanup_override?,
                    monitor_template?, concurrency, schema?, reply_schema?,
                    PRIMARY KEY (agent_template, name))

topic_queue        (id PK, agent_template FK, topic_name, payload, reply_to_addr?,
                    in_reply_to?, status, claimed_by_instance?, worktree_path?, created_at)

agent_instances    (id PK, agent_template FK, spawned_from_topic?, instance_addr,
                    tmux_session, worktree_path, proxy_id, state,
                    monitor_of_instance?, started_at, completed_at?)

approvals          (id PK, requester_addr, channel, payload, state,
                    amendments_json?, created_at, updated_at, decided_by?, decided_at?)

approval_events    (approval_id FK, event_type, payload, created_at)

agents             ← unchanged · persistent agents only · NO new columns
messages           ← unchanged · today's message inboxes
locks              ← unchanged · 3-phase lifecycle locking
events             ← unchanged · audit log
Template-only fields (persistent, cwd_base, cwd_template, repo_root, prepare, cleanup, topics) are stored exclusively in agent_templates via a new template-sync routine — they never flow through field-registry.buildUpsertOptsFromFrontmatter, so no ALTER TABLE agents runs.

agent_instances is intentionally separate from agents to keep the persistent-agent state machine uncontaminated by ephemeral concerns. Health-monitor and cool-down explicitly exclude rows in agent_instances.
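
To show how topic_queue and the per-topic concurrency setting could interact, here is a small sketch of the drain decision: given the queue rows for one topic, pick which rows to spawn instances for next. The status values queued and claimed, the FIFO ordering by created_at, and the function name are assumptions for illustration; the design only names the status column and the concurrency field.

```typescript
// Hypothetical row shape; status values are assumptions.
interface QueueRow {
  id: number;
  agentTemplate: string;
  topicName: string;
  status: "queued" | "claimed" | "done" | "failed";
  createdAt: number; // epoch millis
}

// Rows to spawn for now, oldest first, without exceeding the topic's concurrency.
function nextToSpawn(
  rows: QueueRow[],
  template: string,
  topic: string,
  concurrency: number
): QueueRow[] {
  const mine = rows.filter((r) => r.agentTemplate === template && r.topicName === topic);
  const running = mine.filter((r) => r.status === "claimed").length;
  const free = Math.max(0, concurrency - running);
  return mine
    .filter((r) => r.status === "queued") // filter copies, so sort leaves `rows` untouched
    .sort((a, b) => a.createdAt - b.createdAt || a.id - b.id)
    .slice(0, free);
}
```

With concurrency: 1, as in the provision example, this yields at most one row and nothing while an instance is still claimed, which matches the single-consumer description of the queue.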

Proxy role

No new commands on the proxy. The existing /command vocabulary suffices: prepare/cleanup go through exec; tmux session via create_session / kill_session / has_session; start goes through today's tmux paste path.

Active-instance state (instance ↔ tmux session ↔ worktree) lives in the orchestrator's DB (agent_instances), not in the proxy. This preserves the proxy's thin-stateless design intent.

Crash recovery

Scenario: Orchestrator restart while instances are live
Recovery: Walk agent_instances in non-terminal state. For each: check $STATUS_PATH; if present, process completion. Otherwise ask proxy has_session: alive → resume waiting; gone → run cleanup hook, mark failed, requeue topic_queue row per policy.

Scenario: Proxy restart while instances are live
Recovery: All sessions on that proxy are gone. On its next registration heartbeat, orchestrator marks every live agent_instances row on that proxy failed, runs cleanup hooks where the worktree still exists on disk, requeues topic_queue entries.

Scenario: Orphaned worktrees
Recovery: cleanup hook owns worktree removal. If it fails or never runs, a periodic sweep checks cwd_base against live agent_instances.worktree_path values and removes leftovers.

Scenario: Agent forgets to call collab complete
Recovery: Per-instance outer timeout (configurable per topic) trips → orchestrator treats as failure: kill tmux session, run cleanup, requeue or fail.

Scenario: Approval mid-flight at restart
Recovery: Pure DB state. No recovery needed: pending approvals just remain pending until a human resolves them.
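
Two of the recovery rules above reduce to small pure decisions, sketched here under stated assumptions. The action names and function names are hypothetical; the branching follows the orchestrator-restart scenario (status file first, then has_session) and the orphaned-worktree sweep (directories under cwd_base that no live instance claims). Comparing paths as exact strings is a simplification.

```typescript
type RecoveryAction = "process_completion" | "resume_waiting" | "cleanup_fail_requeue";

// Orchestrator restart: decide what to do with one non-terminal instance.
function recoverAfterOrchestratorRestart(
  statusFilePresent: boolean,
  sessionAlive: boolean
): RecoveryAction {
  // A status file means the agent finished while the orchestrator was down.
  if (statusFilePresent) return "process_completion";
  return sessionAlive ? "resume_waiting" : "cleanup_fail_requeue";
}

// Periodic sweep: directories under cwd_base not owned by any live instance.
function orphanedWorktrees(dirsUnderCwdBase: string[], liveWorktreePaths: string[]): string[] {
  return dirsUnderCwdBase.filter((dir) => liveWorktreePaths.indexOf(dir) === -1);
}
```

Checking the status file before asking the proxy means a completed run is never misclassified as failed just because its tmux session has already gone away.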

What changes

Concern: Persistent agents
2.x: persona file → live tmux
3.0: agent file (persistent: true) → live tmux · same observable behaviour

Concern: Sending messages
2.x: send(agentId, msg)
3.0: send(address, msg) · agent or topic · same fire-and-forget

Concern: Ephemeral work
2.x: n/a
3.0: agent file declares topics → prepare (host shell) · create_session · start (paste) · run · complete · kill · cleanup (host shell)

Concern: Approvals
2.x: n/a
3.0: CRUD resource categorised by channel + auto-notify message to requester

Concern: Monitors
2.x: per-agent health-monitor in orchestrator
3.0: sidecar template paired with worker via the same lifecycle hooks

Concern: Hook kinds
2.x: tmux-paste only (start, exit, …)
3.0: + host-shell hooks (prepare, cleanup) via existing proxy exec

Concern: Proxy commands
2.x: tmux + exec + upload
3.0: unchanged; no new commands

Scope boundaries

What we don't own
What we don't touch (from 2.x)