Idle agents pay zero.

The economics of Houston Cloud only work if a sleeping agent costs nothing to run. Most users have a Sales bot they touch twice a day and a recruiter bot they touch once a week. If those pods run 24 hours a day, our bill scales linearly with the number of agents, whether or not anyone is using them, and we lose money on every customer. Knative is the trick that makes idle agents free.

The economic problem

Imagine an enterprise with 100 employees and 10 agents per employee. That's 1,000 agents. If we charge $20 per seat per month and each agent's pod costs us $5 a month to run idle, we collect $2,000 a month and spend $5,000 a month on idle pods. Our gross margin is negative before any agent does any work.
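The arithmetic is small enough to check in a few lines. This is a rough sketch using only the numbers from the example above; the cost model (every pod runs all month, no shared overhead) and the function name are assumptions made for illustration.

```python
# Figures come from the example above; the model itself is a simplification.
EMPLOYEES = 100
AGENTS_PER_EMPLOYEE = 10
PRICE_PER_SEAT = 20.0       # USD per seat per month
IDLE_COST_PER_AGENT = 5.0   # USD per always-on pod per month


def monthly_margin(employees, agents_per_employee, price_per_seat, cost_per_agent):
    """Return (revenue, cost, margin) in USD per month when every pod is always on."""
    revenue = employees * price_per_seat
    cost = employees * agents_per_employee * cost_per_agent
    return revenue, cost, revenue - cost


revenue, cost, margin = monthly_margin(
    EMPLOYEES, AGENTS_PER_EMPLOYEE, PRICE_PER_SEAT, IDLE_COST_PER_AGENT
)
print(f"revenue {revenue:,.0f}  idle cost {cost:,.0f}  margin {margin:,.0f}")
```

With always-on pods the idle cost is 2.5 times the revenue, so no pricing tweak at this scale closes the gap. The cost line has to go to zero for idle agents.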

We need: pods that exist only when a human is actually talking to them. Pods that take less than a second to wake up. Pods that automatically die when nobody's around.

That is exactly what Knative does.

Knative, in one paragraph

Knative is a layer on top of Kubernetes that adds three superpowers: "scale to zero," "scale on demand," and "scale based on traffic." You hand Knative a container image and say "this is a service." Knative figures out the rest. When traffic arrives, it boots pods. When traffic stops, it kills them. When traffic spikes, it boots more. You never run kubectl scale by hand.

Started at Google, with early contributions from IBM, Red Hat, and others. Now hosted by the CNCF. Used in production by many teams. Google Cloud Run is built around the same Knative Serving API.

How it works for one agent

01. Admin → Knative: "Here's the agent image. Treat it as a service called agent-hr-acme."
02. Knative: Creates the service. Zero pods running. Nothing to bill.
03. Juan → Knative: Juan sends a message. First request arrives. No pod to handle it. Knative boots one (Kata + Firecracker, ~500 ms total).
04. Pod → Juan: Pod handles the message, streams the response back over WebSocket.
05. Juan: Stops chatting. Closes the tab. Goes for lunch.
06. Knative: Waits out the idle window with zero traffic. Kills the pod. Back to zero. Storage lives on (Chapter 7).
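Step 01 in the flow above amounts to handing Knative one small manifest. Below is a minimal sketch of what that could look like, built as a Python dict and printed as JSON (which kubectl apply accepts alongside YAML). The image reference and the helper name agent_service are placeholders, not Houston's actual values.

```python
import json


def agent_service(name: str, image: str) -> dict:
    """Build a minimal Knative Service manifest for one agent."""
    return {
        "apiVersion": "serving.knative.dev/v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # One container per agent pod; Knative manages the replicas.
                    "containers": [{"image": image}],
                }
            }
        },
    }


# The image path is a placeholder, not a real registry location.
manifest = agent_service("agent-hr-acme", "registry.example.com/houston/agent:latest")
print(json.dumps(manifest, indent=2))
```

Nothing in the manifest mentions replica counts. Scaling from zero, up, and back down is left to Knative's autoscaler.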

Cold start, the only real downside

The first message after an idle period takes longer because the pod has to boot. With Firecracker we're talking ~500 ms total, which is shorter than the pause users already expect while an agent is "thinking." Subsequent messages pay no cold-start overhead.

For agents the user expects to be instant (a daily-use agent), we can configure Knative to keep one pod always warm. Costs a tiny bit, hides the cold start. The default for any new agent is "scale to zero" because most agents are touched rarely.
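In Knative, keeping a pod warm is done with the min-scale annotation on the revision template. The following is a sketch of how a control plane might flip an agent from "scale to zero" to "always warm"; the helper keep_warm and the agent name and image are hypothetical.

```python
import copy

MIN_SCALE = "autoscaling.knative.dev/min-scale"


def keep_warm(service: dict, warm_pods: int = 1) -> dict:
    """Return a copy of a Knative Service manifest that never drops below warm_pods."""
    patched = copy.deepcopy(service)
    template = patched["spec"]["template"]
    annotations = template.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[MIN_SCALE] = str(warm_pods)  # annotation values must be strings
    return patched


# Hypothetical daily-use agent; name and image are placeholders.
sales = {
    "apiVersion": "serving.knative.dev/v1",
    "kind": "Service",
    "metadata": {"name": "agent-sales-acme"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{"image": "registry.example.com/houston/agent:latest"}]
            }
        }
    },
}
warm_sales = keep_warm(sales)
```

Leaving the annotation off keeps the scale-to-zero default, so only agents that are explicitly marked warm cost anything while idle.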

What about KEDA?

KEDA is the other big "scale based on events" project. Knative scales based on HTTP traffic (request concurrency or requests per second). KEDA scales based on almost anything (queue depth, message count, custom metrics). For Houston, HTTP/WebSocket traffic is the signal that matters, so Knative fits more naturally. We can add KEDA later if we need to scale based on, say, "number of inbound Slack messages waiting for an agent."

Why this matters for the pitch

"A customer with 10,000 agents pays only for the ones actively in a conversation" is a real cost story, not marketing. It's what makes per agent isolation affordable. Without scale to zero, pod per agent is expensive theater. With scale to zero, it's a structural cost advantage we can't easily lose.

The combo

Kubernetes (the platform) + Kata Containers (the runtime) + Firecracker (the VM) + Knative (the autoscaler) is the full hand. Each one fixes a problem the others can't. Together they give us "pod per agent, isolated by VM, billed only when active." Nothing else gets us all three at once.

What we install on the cluster

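Knative Serving from the official YAML. Enable Knative's runtimeClassName feature flag, then point our agent services at Kata's RuntimeClass. Each agent becomes a Knative Service resource with a unique URL. With default settings a service scales to zero after roughly a minute without traffic (a 60-second stable window, then a grace period of up to 30 seconds); the window can be tuned per agent. The control plane points at the agent's Knative URL when routing messages.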
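As a rough sketch, the install steps above come down to one feature-flag patch plus a slightly richer Service manifest. The RuntimeClass name kata-fc, the agents namespace, the image path, and the helper names are assumptions; the annotation and feature-flag keys are Knative's own.

```python
import json

KATA_RUNTIME_CLASS = "kata-fc"  # assumed name; use whatever RuntimeClass the cluster defines

# Merge patch for the config-features ConfigMap in the knative-serving namespace.
# Knative only accepts runtimeClassName in a pod spec when this flag is enabled.
features_patch = {"data": {"kubernetes.podspec-runtimeclassname": "enabled"}}


def isolated_agent_service(name: str, image: str, idle_window: str = "60s") -> dict:
    """Knative Service that runs under Kata with a per-agent autoscaling window."""
    return {
        "apiVersion": "serving.knative.dev/v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "metadata": {
                    # Per-revision stable window; traffic is averaged over it
                    # before the autoscaler decides to scale down.
                    "annotations": {"autoscaling.knative.dev/window": idle_window}
                },
                "spec": {
                    "runtimeClassName": KATA_RUNTIME_CLASS,
                    "containers": [{"image": image}],
                },
            }
        },
    }


def cluster_local_url(name: str, namespace: str) -> str:
    """In-cluster address Knative assigns to a Service; the control plane routes here."""
    return f"http://{name}.{namespace}.svc.cluster.local"


svc = isolated_agent_service("agent-hr-acme", "registry.example.com/houston/agent:latest")
url = cluster_local_url("agent-hr-acme", "agents")  # "agents" namespace is hypothetical
print(json.dumps(features_patch))
print(json.dumps(svc, indent=2))
print(url)
```

The patch could be applied with kubectl patch using a merge-type patch against config-features, and the Service with kubectl apply. The exact rollout mechanism is left open here.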