How a message flows through.
This is the most important diagram in the guide. If you remember only one thing, remember this one: accept work durably first, then run it. The flow shown here is the target flow for M1 reliability. It does not claim app-close survival until the detached worker exists.
M1 guarantees accepted turns are recorded, reconnects can replay recent stream events, engine crashes produce visible retry state, and scheduled triggers are audited. M1 does not guarantee an in-flight CLI keeps running after the engine process dies. That needs the worker in Chapter 6.
- Client sends POST /v1/agents/Sales/sessions with the user message and a client-generated turnId (UUID v4).
- Engine inserts a row into turns with status queued. If turnId already exists in a terminal state (completed/failed/cancelled), it returns that turn's outcome instead of starting a new turn. This is how retries are safe.
- Engine responds 200 OK with { sessionKey, turnId }. UI shows the message as accepted.
- CLI subprocess is spawned: hou_sales, in the agent's folder.
- Engine updates turns: set status running, started_at = now().
- Each streamed chunk is written to turn_stream with the next seq number. Batched: a tokio interval flushes every 20-50ms so we don't drown SQLite with hundreds of single-row inserts.
- Engine updates turns: set status completed, completed_at = now().

Green steps are SQLite writes. Anything rendered as durable must be written first.
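The accept step's retry behavior can be sketched with an in-memory map standing in for the turns table. This is a minimal illustration, not the engine's code: TurnStore, AcceptOutcome, and the InFlight case (what happens when the same turnId arrives while the turn is still queued or running) are assumptions made for the example. The guide itself only specifies the terminal-state case.

```rust
use std::collections::HashMap;

/// Turn states named in the flow above.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum TurnStatus {
    Queued,
    Running,
    Completed,
    Failed,
    Cancelled,
}

impl TurnStatus {
    fn is_terminal(&self) -> bool {
        matches!(
            self,
            TurnStatus::Completed | TurnStatus::Failed | TurnStatus::Cancelled
        )
    }
}

/// Hypothetical result of the accept step.
#[derive(Debug, PartialEq)]
enum AcceptOutcome {
    /// A new row was written with status queued; the turn may start.
    Accepted,
    /// The turnId is already terminal: report the prior outcome, start nothing.
    AlreadyTerminal(TurnStatus),
    /// Assumption: a duplicate of a live turn also starts nothing.
    InFlight(TurnStatus),
}

/// In-memory stand-in for the turns table, keyed by turnId.
#[derive(Default)]
struct TurnStore {
    turns: HashMap<String, TurnStatus>,
}

impl TurnStore {
    fn accept(&mut self, turn_id: &str) -> AcceptOutcome {
        match self.turns.get(turn_id).cloned() {
            Some(status) if status.is_terminal() => AcceptOutcome::AlreadyTerminal(status),
            Some(status) => AcceptOutcome::InFlight(status),
            None => {
                // The row exists before any work is started.
                self.turns.insert(turn_id.to_string(), TurnStatus::Queued);
                AcceptOutcome::Accepted
            }
        }
    }

    fn set_status(&mut self, turn_id: &str, status: TurnStatus) {
        if let Some(current) = self.turns.get_mut(turn_id) {
            *current = status;
        }
    }
}

fn main() {
    let mut store = TurnStore::default();
    println!("{:?}", store.accept("turn-1")); // first delivery
    println!("{:?}", store.accept("turn-1")); // client retry while queued
    store.set_status("turn-1", TurnStatus::Completed);
    println!("{:?}", store.accept("turn-1")); // retry after completion
}
```

In a real SQLite-backed version the lookup and the insert would need to be one atomic operation (for example a uniqueness constraint on turnId), otherwise two concurrent retries could both see "no row".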
The key insight
Writes to the database happen BEFORE the work, not AFTER. The turn row is inserted the moment the user's message is accepted, well before the CLI subprocess is even spawned. Each streaming chunk is assigned a sequence number and written in a small batch before broadcast. If SQLite is backpressured, the stream slows down; it does not pretend data is durable before it is.
That single ordering decision is what makes everything else work.
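The write-before-broadcast ordering can be sketched as a small batcher. Everything here is illustrative: Batcher, DurableLog, and MemLog are hypothetical names, the log is an in-memory vector rather than SQLite, and starting seq at 1 is an assumption. In the engine the flush would be driven by the tokio interval described above.

```rust
/// One streamed chunk with its sequence number.
#[derive(Debug, Clone, PartialEq)]
struct StreamItem {
    seq: u64,
    payload: String,
}

/// Hypothetical durable sink. In the engine this would be a batched
/// insert into turn_stream.
trait DurableLog {
    fn commit(&mut self, batch: &[StreamItem]) -> Result<(), String>;
}

struct Batcher {
    next_seq: u64,
    pending: Vec<StreamItem>,
}

impl Batcher {
    fn new() -> Self {
        Batcher { next_seq: 1, pending: Vec::new() }
    }

    /// Assign the next seq and queue the chunk. Nothing is broadcast yet.
    fn push(&mut self, payload: &str) {
        let item = StreamItem { seq: self.next_seq, payload: payload.to_string() };
        self.next_seq += 1;
        self.pending.push(item);
    }

    /// Called on each tick. Commits first and returns only committed items
    /// for broadcast. If the commit fails, the batch stays pending and
    /// nothing is handed to the broadcaster.
    fn flush<L: DurableLog>(&mut self, log: &mut L) -> Result<Vec<StreamItem>, String> {
        if self.pending.is_empty() {
            return Ok(Vec::new());
        }
        log.commit(&self.pending)?;
        Ok(std::mem::take(&mut self.pending))
    }
}

/// In-memory log with a switch to simulate a write that cannot complete.
struct MemLog {
    rows: Vec<StreamItem>,
    fail: bool,
}

impl DurableLog for MemLog {
    fn commit(&mut self, batch: &[StreamItem]) -> Result<(), String> {
        if self.fail {
            return Err("database busy".to_string());
        }
        self.rows.extend_from_slice(batch);
        Ok(())
    }
}

fn main() {
    let mut log = MemLog { rows: Vec::new(), fail: false };
    let mut batcher = Batcher::new();
    batcher.push("Hel");
    batcher.push("lo");
    let committed = batcher.flush(&mut log).expect("commit failed");
    for item in &committed {
        println!("broadcast seq={} payload={}", item.seq, item.payload);
    }
}
```

The failure path is the point of the sketch: when the commit fails, subscribers see a slower stream, never a chunk that was not written.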
What this buys you
- Refresh mid-stream. The next page load replays committed chunks from turn_stream.
- Engine crash before spawn. The restarted engine sweeps turns, finds queued/running orphans, and surfaces retry.
- Network blip during streaming. The client reconnects with sinceSeq and gets recent gaps from events_outbox.
- Cron fires while engine is restarting. trigger_runs records due work and catch-up policy decides whether to run or mark skipped.
- Safe client retry. Same turnId cannot create duplicate work.
What this does NOT buy you on its own: a turn that survives an
engine restart mid-stream. Today the CLI subprocess is a child of
the engine, so killing the engine kills the CLI. To survive that, we
need the detached houston-turn-worker from Chapter 6
or a different process-lifecycle contract.
Protocol changes you need to know
- POST /v1/agents/:path/sessions request body gains turnId: string (UUID v4). Required. Existing clients are missing this field. Major protocol bump.
- Response body becomes { sessionKey, turnId, status } where status is the current turn state. If the engine recognizes the turnId as already terminal, it returns the prior outcome.
- New WebSocket event: TurnStreamReplay with { turnId, fromSeq, items } for replay.
- WS subscription gains sinceSeq: client reconnect sends the last seen global event sequence number, server replays authorized events from events_outbox. Replay must run the same scope checks as fresh events.
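The sinceSeq replay rule from the last item can be sketched as a filter over the outbox. This is an illustration under assumptions: OutboxEvent and its scope field are hypothetical, the outbox is an in-memory slice already ordered by seq, and the authorization check is passed in as a closure so that replay and live delivery can share the same one.

```rust
/// Hypothetical row shape for events_outbox.
#[derive(Debug, Clone, PartialEq)]
struct OutboxEvent {
    /// Global event sequence number.
    seq: u64,
    /// Assumption: the scope the event belongs to, e.g. an agent path.
    scope: String,
    body: String,
}

/// Return every event newer than `since_seq` that the caller may see.
/// `is_authorized` should be the same check applied to fresh events.
fn replay_since<F>(outbox: &[OutboxEvent], since_seq: u64, is_authorized: F) -> Vec<OutboxEvent>
where
    F: Fn(&OutboxEvent) -> bool,
{
    let mut replay = Vec::new();
    for event in outbox {
        if event.seq > since_seq && is_authorized(event) {
            replay.push(event.clone());
        }
    }
    replay
}

fn sample_outbox() -> Vec<OutboxEvent> {
    let mk = |seq: u64, scope: &str, body: &str| OutboxEvent {
        seq,
        scope: scope.to_string(),
        body: body.to_string(),
    };
    vec![
        mk(1, "Sales", "turn accepted"),
        mk(2, "Sales", "turn running"),
        mk(3, "Support", "turn accepted"),
        mk(4, "Sales", "turn completed"),
    ]
}

fn main() {
    let outbox = sample_outbox();
    // A client scoped to "Sales" reconnects having last seen seq 1.
    let missed = replay_since(&outbox, 1, |e| e.scope == "Sales");
    for event in &missed {
        println!("replay seq={} body={}", event.seq, event.body);
    }
}
```

Passing the check in as a parameter is one way to keep replay from becoming a side door around authorization: there is no code path that reads the outbox without it.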
What's there today
Most of the boxes in the diagram exist. The CLI gets spawned, the
response streams, the final message gets saved to chat_feed.
What's missing:
- The four SQLite tables (turns, turn_stream, events_outbox, trigger_runs). See Chapter 6.
- The rule of writing before doing. Today persistence is async, fire-and-forget via tokio::spawn.
- Streaming deltas are explicitly dropped from persistence today (engine/houston-agents-conversations/src/session_runner.rs:405-406) because they get replaced by their finals. The redesign keeps them in turn_stream for replay. Different problem, different table. The UI behavior stays the same.
- Client-supplied turnId + retry semantics.
- WebSocket replay-on-reconnect.
The five failure modes this prevents
Chapter 5 walks through what goes wrong because these tables aren't
there. Chapter 6 covers the fix. The one people miss: the CLI
subprocess hangs without exiting. Without started_at,
last_heartbeat_at, and a sweep job, the engine has no
way to know a turn has been running for an hour with no output.
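The hung-CLI case can be sketched as a pure function over turn rows. The row shape, the threshold, and the policy for a running turn with no timestamps at all are assumptions for illustration. Timestamps are plain epoch seconds to keep the example deterministic.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Status {
    Queued,
    Running,
    Completed,
}

/// Hypothetical subset of a turns row.
struct TurnRow {
    turn_id: String,
    status: Status,
    started_at: Option<u64>,
    last_heartbeat_at: Option<u64>,
}

/// Return the ids of running turns that have been silent for longer than
/// `max_silence_secs`. The last heartbeat is used when present, otherwise
/// started_at. Assumption: a running turn with neither timestamp is
/// reported, since nothing shows it is alive.
fn find_stalled(rows: &[TurnRow], now: u64, max_silence_secs: u64) -> Vec<String> {
    rows.iter()
        .filter(|row| row.status == Status::Running)
        .filter(|row| match row.last_heartbeat_at.or(row.started_at) {
            Some(last_seen) => now.saturating_sub(last_seen) > max_silence_secs,
            None => true,
        })
        .map(|row| row.turn_id.clone())
        .collect()
}

fn sample_rows() -> Vec<TurnRow> {
    vec![
        // Running, heartbeat 10 seconds ago: healthy.
        TurnRow { turn_id: "a".into(), status: Status::Running, started_at: Some(1_000), last_heartbeat_at: Some(4_590) },
        // Running, no output for an hour: stalled.
        TurnRow { turn_id: "b".into(), status: Status::Running, started_at: Some(1_000), last_heartbeat_at: None },
        // Completed long ago: not the sweep's concern.
        TurnRow { turn_id: "c".into(), status: Status::Completed, started_at: Some(10), last_heartbeat_at: Some(20) },
    ]
}

fn main() {
    let rows = sample_rows();
    // now = 4600, tolerate up to 300 seconds of silence.
    for turn_id in find_stalled(&rows, 4_600, 300) {
        println!("stalled turn: {}", turn_id);
    }
}
```

Without started_at and last_heartbeat_at there is nothing for such a function to compare against, which is why the sweep depends on those columns existing first.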
Today's flow: engine/houston-engine-server/src/routes/sessions.rs → engine/houston-engine-core/src/sessions/mod.rs → engine/houston-agents-conversations/src/session_runner.rs. Streaming items dropped at session_runner.rs:405-406. Async persistence at session_runner.rs:226-244. Session identity model at engine/houston-engine-core/src/sessions/control.rs:6-18 (no turnId today).