Why conversations die today.

Today, too much session state either lives only in memory or is persisted after the user has already seen it. In-memory state dies when the process dies. Late persistence means accepted work can disappear. Six failure modes, two fixes: durable turn records now, detached workers later.

Failure mode #1

User closes the app mid-stream. This is common and it is not a crash. The desktop exits, the supervised engine exits, the provider CLI dies with it. Four tables alone can preserve accepted input and streamed chunks already committed, but they cannot keep the CLI alive. Full survival needs the detached worker.

Failure mode #2

Engine crashes between "accept message" and "spawn CLI". Your message disappears. No record anywhere. You click send again, and the engine has no memory of the first attempt. Rare, but ugly when it happens.
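A minimal sketch of the "record first, then spawn" ordering, using SQLite as a stand-in for whatever store the engine ends up using. The `turns` table, the `client_msg_id` column, and the `accept_message` and `spawn` names are all hypothetical and exist only for illustration.

```python
import sqlite3
import uuid


def open_turn_db(path=":memory:"):
    # Hypothetical schema: one row per accepted user message.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS turns ("
        " turn_id TEXT PRIMARY KEY,"
        " client_msg_id TEXT UNIQUE NOT NULL,"
        " body TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return db


def accept_message(db, client_msg_id, body, spawn):
    """Commit the turn record before running the side effect."""
    row = db.execute(
        "SELECT turn_id FROM turns WHERE client_msg_id = ?", (client_msg_id,)
    ).fetchone()
    if row:
        # A resend of a message the engine already knows about.
        return row[0]

    turn_id = uuid.uuid4().hex
    with db:  # commits on success, rolls back on exception
        db.execute(
            "INSERT INTO turns VALUES (?, ?, ?, 'accepted')",
            (turn_id, client_msg_id, body),
        )

    # A crash at this point leaves an 'accepted' row behind,
    # so the first attempt is no longer invisible.
    spawn(turn_id)

    with db:
        db.execute(
            "UPDATE turns SET status = 'running' WHERE turn_id = ?", (turn_id,)
        )
    return turn_id
```

With this ordering, a crash between accept and spawn leaves a row in state `accepted`. A second click on send is recognized as the same message instead of silently starting over.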

Failure mode #3

Engine crashes mid-stream. Committed chunks can be replayed after M1. The in-flight provider process still dies unless the worker owns it. User should see "interrupted, retry" with preserved partial output, not a blank disappearance.
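To illustrate "interrupted, retry" with preserved partial output, here is a rough sketch that commits each chunk as it arrives and rebuilds the partial text after a restart. The `chunks` table, the function names, and the shape of the returned view are assumptions, not the M1 schema.

```python
import sqlite3


def open_chunk_db():
    db = sqlite3.connect(":memory:")
    # Hypothetical table: streamed chunks keyed by turn and position.
    db.execute(
        "CREATE TABLE chunks ("
        " turn_id TEXT NOT NULL,"
        " idx INTEGER NOT NULL,"
        " text TEXT NOT NULL,"
        " PRIMARY KEY (turn_id, idx))"
    )
    return db


def commit_chunk(db, turn_id, idx, text):
    """Persist a chunk before it is shown to the user."""
    with db:
        # OR IGNORE makes a repeated write of the same chunk harmless.
        db.execute(
            "INSERT OR IGNORE INTO chunks VALUES (?, ?, ?)", (turn_id, idx, text)
        )


def interrupted_view(db, turn_id):
    """Build what the UI could show for a turn whose provider process died."""
    rows = db.execute(
        "SELECT text FROM chunks WHERE turn_id = ? ORDER BY idx", (turn_id,)
    ).fetchall()
    return {
        "turn_id": turn_id,
        "status": "interrupted",
        "partial_output": "".join(r[0] for r in rows),
        "can_retry": True,
    }
```

Only chunks that were committed before the crash come back. Anything the provider produced after the last commit is still lost unless a worker owns the process.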

Failure mode #4

WebSocket reconnects after a network blip. Current clients refetch active queries after reconnect because events are not replayed. That is good as a safety net, but not enough for exact stream continuity. M1 adds an outbox and sequence-based replay.
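One way an outbox with sequence-based replay could look, sketched under assumptions: the `outbox` table, per-session sequence numbers, and the `append_event` and `replay_after` names are invented here, and the sketch assumes a single writer per session.

```python
import sqlite3


def open_outbox():
    db = sqlite3.connect(":memory:")
    # Hypothetical outbox: every event gets a per-session sequence number.
    db.execute(
        "CREATE TABLE outbox ("
        " session_id TEXT NOT NULL,"
        " seq INTEGER NOT NULL,"
        " payload TEXT NOT NULL,"
        " PRIMARY KEY (session_id, seq))"
    )
    return db


def append_event(db, session_id, payload):
    """Record an event before it is pushed over the socket."""
    with db:
        (last,) = db.execute(
            "SELECT COALESCE(MAX(seq), 0) FROM outbox WHERE session_id = ?",
            (session_id,),
        ).fetchone()
        seq = last + 1  # assumes one writer per session
        db.execute(
            "INSERT INTO outbox VALUES (?, ?, ?)", (session_id, seq, payload)
        )
    return seq


def replay_after(db, session_id, last_seen_seq):
    """Return every event the client missed, in order."""
    return db.execute(
        "SELECT seq, payload FROM outbox"
        " WHERE session_id = ? AND seq > ? ORDER BY seq",
        (session_id, last_seen_seq),
    ).fetchall()
```

On reconnect the client would send the last sequence number it saw, and the engine would answer with `replay_after(...)` before resuming the live stream. Refetching active queries can stay in place as the safety net.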

Failure mode #5

Cron fires while engine is restarting. The current active-routines scheduler runs in-process. A sleeping or restarting engine has no durable due-work ledger. M1 needs an audit table and an explicit catch-up/skip policy for every scheduled trigger.
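A catch-up/skip policy can be reduced to a pure function over "when did this last fire" and "what time is it now". The sketch below assumes fixed-interval triggers and three hypothetical policy names (`run_all`, `run_latest`, `skip`); real cron expressions would need a proper schedule parser.

```python
from datetime import datetime, timedelta


def missed_fires(last_fired, now, interval):
    """List every scheduled time in (last_fired, now]."""
    fires = []
    t = last_fired + interval
    while t <= now:
        fires.append(t)
        t += interval
    return fires


def plan_catch_up(last_fired, now, interval, policy):
    """Split missed fires into (to_run, to_skip) for the given policy.

    Both lists are meant to be written to the audit table, so a skipped
    fire is an explicit decision and not a silent gap.
    """
    missed = missed_fires(last_fired, now, interval)
    if policy == "run_all":
        return missed, []
    if policy == "run_latest":
        return missed[-1:], missed[:-1]
    if policy == "skip":
        return [], missed
    raise ValueError(f"unknown policy: {policy}")
```

For example, an hourly trigger that last fired at 12:00 and is checked again at 15:30 has three missed fires. Under `run_latest` only the 15:00 fire runs, and the 13:00 and 14:00 fires are recorded as skipped.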

Failure mode #6 (the one most people miss)

CLI subprocess hangs without exiting. Provider gateway flakes, network stalls, model returns nothing. The CLI sits there for 20 minutes. Engine has no timeout, no started_at timestamp, no heartbeat, no idea anything is wrong. UI shows "thinking" forever.
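Hang detection needs very little state. The sketch below assumes a hypothetical `RunRecord` with a `started_at` time and a `last_heartbeat_at` time that is refreshed whenever the CLI produces output. The 60-second silence threshold is an arbitrary placeholder, not a recommended value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    turn_id: str
    started_at: float  # seconds, e.g. from time.monotonic()
    last_heartbeat_at: Optional[float] = None


def touch(run, now):
    """Call whenever the CLI emits output or an explicit heartbeat."""
    run.last_heartbeat_at = now


def is_stalled(run, now, max_silence_s=60.0):
    """True when the run has shown no sign of life for too long."""
    last_sign_of_life = (
        run.last_heartbeat_at
        if run.last_heartbeat_at is not None
        else run.started_at
    )
    return now - last_sign_of_life > max_silence_s
```

A periodic watchdog could call `is_stalled` on every running turn and then kill the subprocess and mark the turn as interrupted, so the UI can stop showing "thinking".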

The pattern

Notice the common thread. Work acceptance, stream delivery, and trigger firing need durable records before side effects. Anything accepted but not recorded can disappear. Anything delivered but not recorded cannot be replayed. Anything running without a heartbeat or lease cannot be diagnosed.

The fix is boring but life-changing. Flip the order. Add timestamps, leases, sequence numbers, and replay windows. Sweep orphans on boot. Make retry explicit. Chapter 6 covers how.
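As a rough illustration of "sweep orphans on boot", the sketch below marks every running turn whose lease has expired as interrupted. The `turns` columns, the `lease_expires_at` field, and the status names are assumptions made for this example only.

```python
import sqlite3


def open_lease_db():
    db = sqlite3.connect(":memory:")
    # Hypothetical schema: each running turn holds a lease that its
    # owner must keep renewing.
    db.execute(
        "CREATE TABLE turns ("
        " turn_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " lease_expires_at REAL)"
    )
    return db


def sweep_orphans(db, now):
    """Run once at boot: expired leases mean nobody owns the turn anymore."""
    with db:
        cur = db.execute(
            "UPDATE turns SET status = 'interrupted'"
            " WHERE status = 'running' AND lease_expires_at < ?",
            (now,),
        )
    return cur.rowcount  # number of orphaned turns found
```

Turns with a live lease are left alone, which matters once a detached worker can outlive the engine. Interrupted turns then become candidates for an explicit retry instead of sitting in a running state forever.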

"Why hasn't this been fixed yet?"

Because in the happy path it works. Engines don't crash often. The team shipped features. The pain falls on users in the long tail: someone with a flaky home network, someone who closes the app to take a call, someone whose engine got OOM-killed in dev. The redesign moves us from "works most of the time" to "designed for the times it doesn't." That's the line between a beta and a product.