The four tables that make reliability explicit.

Four new SQLite tables give Houston a durable turn ledger, durable stream replay, a WS replay outbox, and a trigger audit log. They do not magically keep a dead subprocess alive. That requires the detached worker below.

The tables, with columns

turns

CREATE TABLE turns (
	  turn_id        TEXT PRIMARY KEY,        -- UUID v4, client-supplied
	  session_key    TEXT NOT NULL,           -- the conversation slot
	  agent_path     TEXT NOT NULL,           -- workspace/agent
	  principal_id   TEXT NOT NULL,           -- who initiated, see Ch7
	  provider       TEXT NOT NULL,           -- claude | codex | gemini | ...
	  model          TEXT,                    -- selected model, nullable only if provider default
	  source         TEXT NOT NULL,           -- user | routine | webhook | mobile
	  working_dir    TEXT NOT NULL,
	  app_prompt_hash TEXT,                   -- prompt version used by app, not owned by engine
	  trigger_run_id INTEGER,                 -- nullable, links automated starts
	  status         TEXT NOT NULL,           -- queued | running | completed | cancelled | failed | timed_out
	  user_message   TEXT NOT NULL,           -- the input
	  result         TEXT,                    -- final assistant text or error code
	  error_code     TEXT,                    -- when status = failed
	  created_at     INTEGER NOT NULL,        -- unix ms
	  started_at     INTEGER,                 -- when CLI spawned
	  last_heartbeat_at INTEGER,              -- worker/engine liveness
	  cancel_requested_at INTEGER,            -- cooperative cancel marker
	  completed_at   INTEGER                  -- when terminal
	);
CREATE INDEX turns_session_idx ON turns(session_key, created_at);
CREATE INDEX turns_status_idx  ON turns(status) WHERE status IN ('queued', 'running');

turn_stream

CREATE TABLE turn_stream (
  turn_id        TEXT NOT NULL REFERENCES turns(turn_id),
  seq            INTEGER NOT NULL,        -- per-turn monotonic
  kind           TEXT NOT NULL,           -- assistant_delta | thinking_delta | tool_call | tool_result | ...
  data_json      TEXT NOT NULL,
  ts             INTEGER NOT NULL,
  PRIMARY KEY (turn_id, seq)
);

events_outbox

CREATE TABLE events_outbox (
	  seq            INTEGER PRIMARY KEY AUTOINCREMENT,   -- global monotonic
	  topic          TEXT NOT NULL,           -- session:X, agent:Y, scheduler, ...
	  event_type     TEXT NOT NULL,
	  scope_json     TEXT NOT NULL,           -- replay authorization scope
	  payload_json   TEXT NOT NULL,
	  ts             INTEGER NOT NULL
	);
	CREATE INDEX events_topic_idx ON events_outbox(topic, seq);

trigger_runs

CREATE TABLE trigger_runs (
	  id             INTEGER PRIMARY KEY AUTOINCREMENT,
	  trigger_type   TEXT NOT NULL,           -- routine | webhook | mobile_push | ...
	  trigger_id     TEXT NOT NULL,           -- routine id, webhook id, etc.
	  scheduled_at   INTEGER,                 -- for routine triggers
	  received_at    INTEGER NOT NULL,        -- when Houston accepted/refused it
	  fired_at       INTEGER,                 -- when we actually fired (null if missed and we don't intend to catch up)
	  status         TEXT NOT NULL,           -- accepted | fired | missed | caught_up | skipped | rejected
	  turn_id        TEXT REFERENCES turns(turn_id),
	  error_code     TEXT,
	  notes          TEXT
	);
	CREATE INDEX trigger_runs_idx ON trigger_runs(trigger_type, trigger_id, received_at);

The rule

Write the row before doing the work. Mark it complete after. Here is what that looks like for a single chat turn.

1. Accept message       → INSERT turns (status: queued)
2. Start CLI            → UPDATE turns SET status = running, started_at
3. Stream chunk         → INSERT turn_stream row(s), then broadcast
4. CLI done             → UPDATE turns SET status = completed, completed_at
5. Engine boots         → SELECT * FROM turns WHERE status IN ('queued','running')
                          → queued turns resume
                          → stale running turns become interrupted, unless a worker heartbeat proves they still live
                          → surface retry/cancel/reattach in the UI

Step 5 is the contract. On every boot, the engine reconciles durable state against real process liveness. Queued work can run. Completed work can replay. Stale running work becomes interrupted unless the detached worker proves it still owns the turn. Retrying an interrupted turn must create a new turn_id; the old one remains an audit record.

What four tables alone fix, and what they don't

Be honest about scope. These tables alone fix:

What four tables do NOT fix on their own: engine crash MID-stream while a CLI subprocess is still running. Today the CLI subprocess is a child of the engine process, so killing the engine kills the CLI. The four tables let the engine recover state on the next boot, but the in-flight CLI is dead and the user has to retry.

The detached worker (required to fully close the gap)

To survive engine restart MID-stream without retry, the CLI must live longer than the engine. That means a small detached binary, houston-turn-worker, spawned by the engine but not parented to it. The worker writes streaming output directly to SQLite. The engine bounces, a new engine starts, attaches via Unix socket, picks up where the stream left off. The user never notices.

This is the architectural piece. Without it, "conversations never die" is false. With it, the promise becomes defensible. If product wants app-close or engine-bounce survival in M1, the worker cannot wait behind Cloud work. Make it M1b and test it directly.

Coexistence with chat_feed

Today's chat_feed holds feed items keyed by the provider session id. M1 must avoid a permanent double-write system. Pick one canonical table and make rollback explicit.

Do not ship long-lived double-write. It creates split-brain bugs and makes replay hard to trust.

Backpressure and the bounded queue

Today the event queue in engine/houston-events/src/queue.rs is an mpsc::unbounded_channel. A webhook flood or a runaway routine trigger could balloon memory. M1 switches this to a bounded channel with capacity 1024 and a documented drop policy:

Same write-before-do principle: refusing the request is fine, silently dropping it is not.

Retention

turn_stream grows fast. A two-hour Codex marathon writes thousands of rows. Default retention: keep raw deltas for 7 days, then compact to one final row per turn and drop the deltas. events_outbox retention: 24 hours (replay window for reconnecting clients). trigger_runs: keep forever, it's an audit log and it's small.

A nightly background task does the compaction. Configurable via env vars for self-hosters and Cloud admins.

Why this isn't Temporal

Temporal is a workflow engine for orchestrating long-running multi-step business processes with replay semantics across many activities. Houston's first reliability problem is narrower: accept a turn, run a CLI, capture the stream, recover honestly. SQLite plus a worker is enough until Houston has multi-step external workflows.

Migration order

SQLite migrations live in engine/houston-db/src/migrations.rs. First fix: reliability migrations must be checked and observable, not swallowed with .ok(). Add turns, turn_stream, events_outbox, and trigger_runs. Repos go in engine/houston-db/src/repo_turns.rs, repo_turn_stream.rs, repo_events_outbox.rs, and repo_trigger_runs.rs. Boot sweep belongs in server startup after migrations, before accepting new starts.