The agent's brain lives on a disk that outlives the pod.
When Knative kills an idle agent pod, the agent's notes, memory, and OAuth tokens can't die with it. The pod gets thrown away. The disk doesn't. This chapter is about that disk and the handful of decisions around it.
The problem
A Houston agent stores everything in a folder called
.houston/. Inside: chat history, memory, learned
skills, OAuth tokens for connected tools, files the user has
shared. On the desktop, that folder lives in
~/.houston/workspaces/... on the user's hard drive.
It survives reboots, app crashes, anything short of the user
deleting it.
In the cloud, the agent's pod is ephemeral. Knative kills it
after a few idle minutes. If the .houston/
folder lived inside the pod, every nap would give the agent amnesia.
Bad. So the folder has to live somewhere outside the pod.
The fix: a persistent volume per agent
Kubernetes calls disks "persistent volumes." A persistent volume is a chunk of cloud storage that gets attached to a pod when the pod boots and detached when the pod dies. The data on it survives.
Every agent gets one. ~1 GB to start, grows as needed. Mounted
inside the pod at /data/.houston/. The engine
doesn't know it's running in the cloud; it sees a normal
folder.
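A minimal sketch of what "one volume per agent" could look like as Kubernetes objects, built here as plain Python dicts. The claim name pattern and StorageClass name come from the implementation notes at the end of this chapter. The ReadWriteOnce access mode and the volume name houston-data are assumptions for illustration, not settled choices.

```python
def agent_pvc_manifest(agent_id: str, size_gi: int = 1) -> dict:
    """Build a PersistentVolumeClaim manifest for one agent."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"pvc-agent-{agent_id}"},
        "spec": {
            # Assumption: a single pod mounts the volume at a time.
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": "standard-ssd-zonal",
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }


def agent_volume_wiring(agent_id: str) -> dict:
    """Pod-spec fragments that mount the claim at /data/.houston/."""
    volume_name = "houston-data"  # hypothetical name, any valid name works
    return {
        "volumes": [
            {
                "name": volume_name,
                "persistentVolumeClaim": {"claimName": f"pvc-agent-{agent_id}"},
            }
        ],
        "volumeMounts": [
            {"name": volume_name, "mountPath": "/data/.houston/"}
        ],
    }
```

The engine only ever sees the mountPath, which is why it can't tell it's running in the cloud.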
How big is the disk?
- Sessions and memory text: a few MB even for heavy users. Text compresses well.
- OAuth tokens, settings: kilobytes.
- Files the user uploads: this is the wild card. Could be anywhere from tens of MB to several GB.
Start at 1 GB per agent. That costs about 10 cents per month on GCP. Auto-grow when the volume is 80% full. We don't need to optimize storage cost: even 10,000 agents at 1 GB each come to $1,000 a month total. The real money is compute, not disk.
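The arithmetic above, as a small sketch. The 10-cent price and the 80% threshold are the figures quoted in this section. The doubling growth step is an assumption, since the text only says the volume grows.

```python
COST_PER_GB_MONTH_CENTS = 10   # figure quoted above for GCP
GROW_THRESHOLD_PERCENT = 80    # auto-grow trigger quoted above


def monthly_storage_cost_usd(agents: int, gb_per_agent: int = 1) -> float:
    """Total monthly disk cost in dollars for a fleet of agents."""
    return agents * gb_per_agent * COST_PER_GB_MONTH_CENTS / 100


def needs_grow(used_bytes: int, capacity_bytes: int) -> bool:
    """True once the volume is at or above the grow threshold."""
    return used_bytes * 100 >= capacity_bytes * GROW_THRESHOLD_PERCENT


def next_size_gb(current_gb: int) -> int:
    """Assumption: double on each grow. The real step size is undecided."""
    return current_gb * 2


print(monthly_storage_cost_usd(10_000))  # 1000.0
```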
What about big files?
If an agent handles huge files (videos, datasets), the persistent volume is the wrong place. Two reasons: it's attached only when the pod is alive, and it's a more expensive kind of storage than necessary.
For big files, the agent writes to object storage (S3 on AWS, GCS on Google Cloud). Object storage is dirt cheap, designed for huge blobs, and accessible from anywhere. The agent keeps only a reference (a URL) on its persistent volume.
Same pattern as your desktop app having a shortcut to a video on Dropbox. The shortcut is small, the video lives somewhere cheap and shared.
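Here is one way the "keep only a reference" idea might look in code. Everything specific in it is hypothetical: the 50 MB cutoff, the bucket name, the object layout, and the BlobRef record are placeholders to show the shape, not a decided format. The upload itself is left out.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical cutoff: files at or above this size go to object storage.
OFFLOAD_THRESHOLD_BYTES = 50 * 1024 * 1024


@dataclass
class BlobRef:
    """The small record that stays on the persistent volume."""
    name: str
    size_bytes: int
    url: str


def should_offload(size_bytes: int) -> bool:
    return size_bytes >= OFFLOAD_THRESHOLD_BYTES


def make_ref(bucket: str, agent_id: str, filename: str, size_bytes: int) -> BlobRef:
    # Assumed layout: one prefix per agent inside the workspace bucket.
    return BlobRef(
        name=filename,
        size_bytes=size_bytes,
        url=f"gs://{bucket}/{agent_id}/{filename}",
    )


def ref_to_json(ref: BlobRef) -> str:
    """Serialize the reference for storage under .houston/."""
    return json.dumps(asdict(ref), sort_keys=True)
```

A 2 GB video would leave behind a JSON record of a couple hundred bytes, which is the shortcut in the Dropbox analogy.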
Backups
Two layers:
- Volume snapshots: Kubernetes can snapshot a persistent volume nightly. Costs a fraction of the volume itself.
- Export to object storage: weekly tar of the
.houston/ folder uploaded to S3, keyed by agent + date. Old archives cycle out.
Restore process: if an agent's volume gets corrupted, we mount
the last snapshot and copy in the latest .houston/ from
S3. The agent's next message wakes the pod with restored state.
The user notices nothing.
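A sketch of the "keyed by agent + date" naming for the weekly export, plus the cycle-out rule. The backups/ prefix, the .tar.gz suffix, and the eight-week retention are assumptions. The text above only fixes the agent-plus-date idea.

```python
from __future__ import annotations

from datetime import date, timedelta


def backup_key(agent_id: str, day: date) -> str:
    """Object key for one weekly export, keyed by agent + date."""
    return f"backups/{agent_id}/{day.isoformat()}.tar.gz"


def expired_keys(
    keys_by_date: dict[date, str], today: date, keep_weeks: int = 8
) -> list[str]:
    """Keys older than the retention window (assumed: 8 weeks)."""
    cutoff = today - timedelta(weeks=keep_weeks)
    return sorted(key for day, key in keys_by_date.items() if day < cutoff)
```

ISO dates sort lexicographically, so listing an agent's prefix returns exports in chronological order and the newest one for a restore is simply the last key.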
Sticky scheduling
There's a subtle issue: a persistent volume is usually tied to
one cloud region or even one zone. If we have nodes in
us-east-1a and us-east-1b, an agent's
volume that was created in 1a can only mount to
pods in 1a.
Kubernetes handles this automatically with topology-aware scheduling. When the pod boots, K8s asks "where does your volume live?" and only schedules the pod onto matching nodes. We don't have to think about it once it's set up.
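To make the matching concrete, here is a toy model of the filter. It is not the scheduler's actual code, just the same idea in a few lines. The label key topology.kubernetes.io/zone is the standard Kubernetes zone label. The node names are made up.

```python
from __future__ import annotations

ZONE_LABEL = "topology.kubernetes.io/zone"


def eligible_nodes(volume_zone: str, nodes: list[dict]) -> list[str]:
    """Names of nodes whose zone label matches the volume's zone."""
    return [
        node["name"]
        for node in nodes
        if node["labels"].get(ZONE_LABEL) == volume_zone
    ]


# Hypothetical cluster: two nodes in 1a, one in 1b.
nodes = [
    {"name": "node-1", "labels": {ZONE_LABEL: "us-east-1a"}},
    {"name": "node-2", "labels": {ZONE_LABEL: "us-east-1b"}},
    {"name": "node-3", "labels": {ZONE_LABEL: "us-east-1a"}},
]

print(eligible_nodes("us-east-1a", nodes))  # ['node-1', 'node-3']
```

The practical consequence: if every node in a volume's zone is full or down, the agent's pod can't start elsewhere until the volume is restored into another zone.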
The cold start optimization (for later)
Attaching a persistent volume on cold start adds a few hundred milliseconds. For 99% of agents this is fine. If we ever need sub-100ms cold start for some always-on agent, we'd switch to an image preloaded with state, or sync state from object storage, instead of mounting a volume. Not v1.
What this gives us
- Agents have real memory. Their .houston/ survives sleep, restart, even cluster failures (with backups).
- Compute scales to zero, storage costs pennies. An agent untouched for a year still has its memory waiting, for about a dollar a year in storage.
- The engine doesn't have to change. From the engine's perspective, /data/.houston/ is just a folder. Same logic as the desktop app's ~/.houston/.
Persistent volumes don't shrink. If an agent writes 5 GB of crap once and then cleans up, you're paying for 5 GB forever (until you migrate the volume). For Houston this probably doesn't matter — agent data should stay small if we're disciplined about big-file offload to object storage. Worth a periodic "agent volume size audit" cron once we're live.
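The audit could start as something this small. It takes per-volume numbers (however we end up collecting them) and flags volumes that are large but mostly empty. The VolumeStat record and both thresholds are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class VolumeStat:
    agent_id: str
    provisioned_gb: float
    used_gb: float


def oversized_volumes(
    stats: list[VolumeStat],
    min_provisioned_gb: float = 2.0,   # ignore volumes still near the 1 GB start
    max_utilization: float = 0.25,     # flag when at most a quarter is in use
) -> list[str]:
    """Agent IDs whose volume grew and is now mostly empty."""
    return [
        s.agent_id
        for s in stats
        if s.provisioned_gb >= min_provisioned_gb
        and s.used_gb / s.provisioned_gb <= max_utilization
    ]
```

Anything it flags is a candidate for migration to a smaller volume, or at least a look at why the big-file offload didn't kick in.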
- StorageClass: standard-ssd-zonal on GKE (cheap SSD per AZ).
- PersistentVolumeClaim: one per agent, name pattern pvc-agent-{agent-id}. Created by the control plane when the agent is provisioned.
- Snapshots: scheduled via Velero or native VolumeSnapshot controllers.
- Object storage: GCS bucket per workspace for tenant data, with lifecycle rules archiving old files to colder/cheaper tiers.