The four tables that make reliability explicit.
Four new SQLite tables give Houston a durable turn ledger, durable stream replay, a WS replay outbox, and a trigger audit log. They do not magically keep a dead subprocess alive. That requires the detached worker below.
The tables, with columns
turns
CREATE TABLE turns (
turn_id TEXT PRIMARY KEY, -- UUID v4, client-supplied
session_key TEXT NOT NULL, -- the conversation slot
agent_path TEXT NOT NULL, -- workspace/agent
principal_id TEXT NOT NULL, -- who initiated, see Ch7
provider TEXT NOT NULL, -- claude | codex | gemini | ...
model TEXT, -- selected model, nullable only if provider default
source TEXT NOT NULL, -- user | routine | webhook | mobile
working_dir TEXT NOT NULL,
app_prompt_hash TEXT, -- prompt version used by app, not owned by engine
trigger_run_id INTEGER, -- nullable, links automated starts
status TEXT NOT NULL, -- queued | running | completed | cancelled | failed | timed_out
user_message TEXT NOT NULL, -- the input
result TEXT, -- final assistant text or error code
error_code TEXT, -- when status = failed
created_at INTEGER NOT NULL, -- unix ms
started_at INTEGER, -- when CLI spawned
last_heartbeat_at INTEGER, -- worker/engine liveness
cancel_requested_at INTEGER, -- cooperative cancel marker
completed_at INTEGER -- when terminal
);
CREATE INDEX turns_session_idx ON turns(session_key, created_at);
CREATE INDEX turns_status_idx ON turns(status) WHERE status IN ('queued', 'running');
turn_stream
CREATE TABLE turn_stream (
turn_id TEXT NOT NULL REFERENCES turns(turn_id),
seq INTEGER NOT NULL, -- per-turn monotonic
kind TEXT NOT NULL, -- assistant_delta | thinking_delta | tool_call | tool_result | ...
data_json TEXT NOT NULL,
ts INTEGER NOT NULL,
PRIMARY KEY (turn_id, seq)
);
events_outbox
CREATE TABLE events_outbox (
seq INTEGER PRIMARY KEY AUTOINCREMENT, -- global monotonic
topic TEXT NOT NULL, -- session:X, agent:Y, scheduler, ...
event_type TEXT NOT NULL,
scope_json TEXT NOT NULL, -- replay authorization scope
payload_json TEXT NOT NULL,
ts INTEGER NOT NULL
);
CREATE INDEX events_topic_idx ON events_outbox(topic, seq);
trigger_runs
CREATE TABLE trigger_runs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
trigger_type TEXT NOT NULL, -- routine | webhook | mobile_push | ...
trigger_id TEXT NOT NULL, -- routine id, webhook id, etc.
scheduled_at INTEGER, -- for routine triggers
received_at INTEGER NOT NULL, -- when Houston accepted/refused it
fired_at INTEGER, -- when we actually fired (null if missed and we don't intend to catch up)
status TEXT NOT NULL, -- accepted | fired | missed | caught_up | skipped | rejected
turn_id TEXT REFERENCES turns(turn_id),
error_code TEXT,
notes TEXT
);
CREATE INDEX trigger_runs_idx ON trigger_runs(trigger_type, trigger_id, received_at);
The rule
Write the row before doing the work. Mark it complete after. Here is what that looks like for a single chat turn.
1. Accept message → INSERT turns (status: queued)
2. Start CLI → UPDATE turns SET status = running, started_at
3. Stream chunk → INSERT turn_stream row(s), then broadcast
4. CLI done → UPDATE turns SET status = completed, completed_at
5. Engine boots → SELECT * FROM turns WHERE status IN ('queued','running')
→ queued turns resume
→ stale running turns become interrupted, unless a worker heartbeat proves they still live
→ surface retry/cancel/reattach in the UI
Step 5 is the contract. On every boot, the engine reconciles durable
state against real process liveness. Queued work can run. Completed
work can replay. Stale running work becomes interrupted unless the
detached worker proves it still owns the turn. Retrying an interrupted
turn must create a new turn_id; the old one remains an
audit record.
What four tables alone fix, and what they don't
Be honest about scope. These tables alone fix:
- Crash before a turn spawns. The queued row is still present and can resume.
- Refresh during streaming. Already-committed chunks replay from
turn_streamandevents_outbox. - Network blips during streaming. Same path: reconnect with
sinceSeq, scope-check, replay. - Missed routines. The next boot reads
trigger_runs, sees the missed fire, and decides whether to catch up. - Client retry safety. The
turnIdis idempotent. - CLI subprocess hang. A sweep job checks
last_heartbeat_atand surfaces a visible "still running, cancel?" state.
What four tables do NOT fix on their own: engine crash MID-stream while a CLI subprocess is still running. Today the CLI subprocess is a child of the engine process, so killing the engine kills the CLI. The four tables let the engine recover state on the next boot, but the in-flight CLI is dead and the user has to retry.
The detached worker (required to fully close the gap)
To survive engine restart MID-stream without retry, the CLI must
live longer than the engine. That means a small detached binary,
houston-turn-worker, spawned by the engine but not
parented to it. The worker writes streaming output directly to
SQLite. The engine bounces, a new engine starts, attaches via Unix
socket, picks up where the stream left off. The user never notices.
This is the architectural piece. Without it, "conversations never die" is false. With it, the promise becomes defensible. If product wants app-close or engine-bounce survival in M1, the worker cannot wait behind Cloud work. Make it M1b and test it directly.
Coexistence with chat_feed
Today's chat_feed holds feed items keyed by the provider
session id. M1 must avoid a permanent double-write system. Pick one
canonical table and make rollback explicit.
- Canonical write path. New writes go only to
turn_stream. - Migration. Existing
chat_feedrows migrate intoturn_streamas committed items, with idempotent markers. - Fallback window. Reads may fall back to
chat_feedfor one release if migration markers are missing. No indefinite compatibility layer. - FTS. Search indexes rebuild from the canonical stream-derived feed, not two sources.
Do not ship long-lived double-write. It creates split-brain bugs and makes replay hard to trust.
Backpressure and the bounded queue
Today the event queue in engine/houston-events/src/queue.rs
is an mpsc::unbounded_channel. A webhook flood or a
runaway routine trigger could balloon memory. M1 switches this to a bounded
channel with capacity 1024 and a documented drop policy:
- Inputs whose source is
routinecoalesce: if two fires for the sameroutine_idare in flight, log askippedrow intrigger_runsand drop the second. - Inputs whose source is
usernever drop. If the queue is full, return503 UNAVAILABLEwithRetry-After. - Inputs whose source is
webhookget a small per-source bucket (token bucket, 10/min default) and atrigger_runsaudit row for each accepted or rejected attempt.
Same write-before-do principle: refusing the request is fine, silently dropping it is not.
Retention
turn_stream grows fast. A two-hour Codex marathon
writes thousands of rows. Default retention: keep raw deltas for 7
days, then compact to one final row per turn and drop
the deltas. events_outbox retention: 24 hours
(replay window for reconnecting clients). trigger_runs:
keep forever, it's an audit log and it's small.
A nightly background task does the compaction. Configurable via env vars for self-hosters and Cloud admins.
Temporal is a workflow engine for orchestrating long-running multi-step business processes with replay semantics across many activities. Houston's first reliability problem is narrower: accept a turn, run a CLI, capture the stream, recover honestly. SQLite plus a worker is enough until Houston has multi-step external workflows.
SQLite migrations live in engine/houston-db/src/migrations.rs. First fix: reliability migrations must be checked and observable, not swallowed with .ok(). Add turns, turn_stream, events_outbox, and trigger_runs. Repos go in engine/houston-db/src/repo_turns.rs, repo_turn_stream.rs, repo_events_outbox.rs, and repo_trigger_runs.rs. Boot sweep belongs in server startup after migrations, before accepting new starts.