Finance Categorizer — GCP Infrastructure
The Finance Categorizer application currently suffers from cross-region latency: compute services run in us-central1 while Firestore resides in me-west1, adding ~200 ms per round trip. There are no backups, no monitoring or alerting, and all Cloud Run / Cloud Functions services scale to zero, causing 10–20 s cold starts.
This plan consolidates all compute into me-west1, enables PITR and scheduled Firestore backups, adds uptime checks with PagerDuty-ready alerts, and sets `min-instances=1` on critical services, addressing all 17 identified gaps (each one remediated, planned, or formally accepted).
7 risks scored by severity × likelihood. Every gap in the analysis below maps to exactly one of these risks.
| Risk | Severity | Likelihood | Score | Related Gaps | Mitigation |
|---|---|---|---|---|---|
| Data loss | CRITICAL | LOW | 8 | GAP-03, GAP-12, GAP-13 | PITR + daily backups |
| Cross-region latency | HIGH | CERTAIN | 9 | GAP-01, GAP-10, GAP-15 | Migrate compute to me-west1 |
| Cold starts | HIGH | HIGH | 8 | GAP-02, GAP-09 | min-instances=1 + health checks |
| No monitoring | HIGH | HIGH | 8 | GAP-04, GAP-17 | Uptime checks + SLIs/SLOs |
| Regional SPOF | HIGH | LOW | 6 | GAP-05, GAP-06, GAP-11 | Accept; PITR mitigates |
| Unsafe deployments | MEDIUM | MEDIUM | 5 | GAP-07, GAP-08, GAP-16 | CI/CD + pinned images + graceful shutdown |
| Security exposure | LOW | LOW | 2 | GAP-14 | Rotation runbook |
17 gaps mapped to the 7 risks above. Each entry below lists its severity, related risk, score, and current status, followed by details.
**GAP-03: No Firestore backups** (CRITICAL · Data loss · Score 8 · Remediated)

**Description:** Firestore has no automated backup schedule and no Point-in-Time Recovery (PITR) enabled. Financial transaction data is the core asset of the application; accidental deletion, corruption, or a bad deploy could cause complete, irrecoverable data loss.
**Impact:** Complete, irrecoverable data loss.
**Remediation:** Enable daily automated backups via a scheduled Firestore backup policy, as sketched below.
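A minimal sketch of the backup schedule, assuming the default database and a 7-day retention (both placeholders to adjust):

```bash
# Create a daily backup schedule for the default Firestore database.
# The 7-day retention is an assumption; tune it to the recovery objectives.
gcloud firestore backups schedules create \
  --database='(default)' \
  --recurrence=daily \
  --retention=7d
```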
**GAP-02: Scale to zero** (HIGH · Cold starts · Score 8 · Remediated)

**Description:** All three compute services have `min-instances=0`, meaning they scale to zero during idle periods. Cold starts take 3–8 s per service; when all three are cold simultaneously, the first request suffers compound cold starts of 10–20 seconds.
**Impact:** First request after idle takes 10–20 seconds.
**Remediation:** Set `min-instances=1` for finance-chat and finance-mcp-server to keep at least one warm instance each (see the sketch below). The LiteLLM proxy can remain at 0 since it sits behind finance-chat. Estimated cost: ~$28/mo for two always-on min instances.
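Assuming the services already live in me-west1 (GAP-01), the change is one flag per service:

```bash
# Keep one warm instance for each user-facing service; the LiteLLM
# proxy stays at min-instances=0 since it sits behind finance-chat.
for svc in finance-chat finance-mcp-server; do
  gcloud run services update "$svc" \
    --region=me-west1 \
    --min-instances=1
done
```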
**GAP-05: Firestore REGIONAL SPOF** (HIGH · Regional SPOF · Score 6 · Accepted)

**Description:** Firestore is configured as REGIONAL in me-west1. A full regional outage (rare but possible) would cause complete application downtime. Multi-region Firestore would eliminate this but costs 3–5x more.
**Impact:** Complete outage during me-west1 regional events.
**Remediation:** Accepted risk. Multi-region Firestore exceeds the $50/mo budget ceiling. Mitigated by daily backups + PITR (GAP-03). Regional outages are rare (<1 per year) and typically resolve within hours. Documented in the risk register.
**GAP-01: Cross-region latency** (HIGH · Cross-region latency · Score 9 · Remediated)

**Description:** Compute services (finance-chat, finance-mcp-server, litellm-proxy) run in us-central1 (Iowa) while Firestore lives in me-west1 (Tel Aviv). Every Firestore read/write pays ~200 ms round-trip latency; a single chat request triggers 1–5 tool calls, compounding to 200–1000 ms of pure network overhead.
**Impact:** P95 latency exceeds acceptable thresholds.
**Remediation:** Migrate all compute services to me-west1 to co-locate with Firestore, and update CSP headers to point to the new region URLs (GAP-15). Eliminates cross-region latency entirely.
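The migration is scripted in `infra/ha/deploy-ha.sh`; the core step per service looks roughly like this, with the image path a placeholder for the project's Artifact Registry URL:

```bash
# Redeploy the service into me-west1, co-located with Firestore.
gcloud run deploy finance-chat \
  --image=REGION-docker.pkg.dev/PROJECT_ID/REPO/finance-chat:TAG \
  --region=me-west1
# The stale us-central1 copy is removed later by decommission-us-central1.sh.
```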
**GAP-04: No monitoring** (HIGH · No monitoring · Score 8 · Remediated)

**Description:** Zero Cloud Monitoring alerts, no uptime checks, no notification channels configured. Mean Time to Detect (MTTD) is effectively unbounded: outages are only discovered when users complain.
**Impact:** Outages only discovered via user complaints.
**Remediation:** Configure HTTPS uptime checks for the finance-chat and MCP server endpoints (see the sketch below). Create alert policies for error rate >1%, latency P95 >5 s, and instance count dropping to 0. Set up email/Slack notification channels. All within the GCP free tier.
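One way to create the uptime check from the CLI, assuming a recent gcloud release that ships the `gcloud monitoring uptime` surface (verify flag names against your version; hostname, project ID, and path are placeholders):

```bash
# HTTPS uptime check against the chat service's health endpoint.
gcloud monitoring uptime create finance-chat-uptime \
  --resource-type=uptime-url \
  --resource-labels=host=finance-chat-HASH.a.run.app,project_id=PROJECT_ID \
  --protocol=https \
  --path=/healthz
```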
**GAP-13: No PITR** (MEDIUM · Data loss · Score 3 · Remediated)

**Description:** Point-in-Time Recovery is not enabled on the Firestore database. Without PITR, the only restore option is the most recent daily backup, meaning up to 24 hours of data could be lost in a recovery scenario.
**Impact:** Poor recovery granularity (up to 24 h of data loss).
**Remediation:** Enable PITR on the database, as sketched below.
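Enabling PITR is a one-line database update (default database assumed):

```bash
# Turn on point-in-time recovery; Firestore then retains a rolling
# 7-day version history that restores can target.
gcloud firestore databases update \
  --database='(default)' \
  --enable-pitr
```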
**GAP-07: No CI/CD for MCP/LiteLLM** (MEDIUM · Unsafe deployments · Score 5 · Planned)

**Description:** The MCP server and LiteLLM proxy are deployed manually with no automated pipeline, no smoke tests, and no automated rollback. A bad deploy immediately impacts 100% of traffic with no safety net.
**Impact:** Bad deploys immediately impact 100% of traffic.
**Remediation:** Planned. Create GitHub Actions workflows for both services mirroring the existing finance-chat deploy pipeline, including build, smoke test, and deploy stages with Cloud Run traffic splitting for canary deployments (see the sketch below).
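The canary steps such a workflow would wrap look roughly like this (image and revision names are placeholders):

```bash
# Deploy the new revision with no traffic, canary 10% to it,
# then promote once smoke tests pass.
gcloud run deploy finance-mcp-server \
  --image=IMAGE --region=me-west1 --no-traffic
gcloud run services update-traffic finance-mcp-server \
  --region=me-west1 \
  --to-revisions=finance-mcp-server-00042-new=10
# Promote: route all traffic to the latest revision.
gcloud run services update-traffic finance-mcp-server \
  --region=me-west1 --to-latest
```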
**GAP-09: No health checks** (MEDIUM · Cold starts · Score 5 · Remediated)

**Description:** The MCP server Cloud Run service has no startup or liveness probe configured. During scale-up events, Cloud Run may route traffic to instances that haven't finished initializing, causing intermittent 500 errors.
**Impact:** Intermittent 500 errors during scale events.
**Remediation:** Configure a Cloud Run startup probe against the service's health endpoint so traffic is only routed to fully initialized instances (see the fragment below).
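Probes are declared in the service spec and applied with `gcloud run services replace service.yaml`; a fragment, with `/healthz` as an assumed health path:

```yaml
spec:
  template:
    spec:
      containers:
        - image: IMAGE            # placeholder
          startupProbe:
            httpGet:
              path: /healthz      # assumed health endpoint
              port: 8080
            periodSeconds: 2
            failureThreshold: 15  # allow up to ~30 s to initialize
```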
**GAP-06: In-memory rate limiter** (MEDIUM · Regional SPOF · Score 6 · Documented)

**Description:** The rate limiter and circuit breaker in finance-chat use in-memory state. This state is lost every time the instance cold starts or scales down, and is inconsistent across multiple instances during scale-up events.
**Impact:** Rate limiting is bypassable; the circuit breaker never truly opens.
**Remediation:** Documented for future improvement. Options include Firestore-backed rate limiting (adds latency per check) or Redis/Memorystore (adds ~$30/mo). The current risk is acceptable given low traffic volume; revisit when traffic grows.
**GAP-11: No fallback model** (MEDIUM · Regional SPOF · Score 5 · Planned)

**Description:** The LiteLLM proxy has a single model dependency (Gemini) and no fallback model configured. If Google's Gemini API experiences an outage or rate limiting, the entire chat functionality becomes unavailable.
**Impact:** A Gemini outage is a complete chat outage.
**Remediation:** Planned. Configure a fallback model in the LiteLLM proxy config so failed Gemini calls retry against a secondary provider (see the sketch below).
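A hypothetical config fragment; both model names are placeholders, and the exact `fallbacks` schema should be verified against the LiteLLM docs:

```yaml
model_list:
  - model_name: gemini-primary
    litellm_params:
      model: gemini/gemini-1.5-pro    # assumed current model
  - model_name: backup-model
    litellm_params:
      model: openai/gpt-4o-mini       # hypothetical secondary provider
router_settings:
  fallbacks:
    - gemini-primary: ["backup-model"]
```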
**GAP-17: No SLI/SLO** (MEDIUM · No monitoring · Score 5 · Remediated)

**Description:** The 99.9% SLA target is unmeasurable because no Service Level Indicators (SLIs) or Service Level Objectives (SLOs) are defined. Without SLIs there is no way to determine whether the system is meeting its availability target or burning through error budget.
**Impact:** Cannot determine whether the SLA is being met.
**Remediation:** Define SLIs (availability = successful requests / total requests; latency = P95 < 3 s) and create Cloud Monitoring SLO objects. Set up error-budget burn-rate alerts. Covered by the GCP free tier.
**GAP-08: Unpinned LiteLLM Docker** (MEDIUM · Unsafe deployments · Score 5 · Remediated)

**Description:** The LiteLLM proxy runs from a floating Docker image tag, so every deploy can pull a different, untested LiteLLM version.
**Impact:** Deploys are not reproducible; an upstream release can silently change behavior.
**Remediation:** Pin the LiteLLM Docker image to a specific version tag (see the sketch below).
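Assuming the upstream `ghcr.io/berriai/litellm` image, pinning is just deploying an explicit tag (the version shown is illustrative; pinning by digest is stricter still):

```bash
# Deploy a pinned LiteLLM release instead of a floating tag.
gcloud run deploy litellm-proxy \
  --image=ghcr.io/berriai/litellm:main-v1.44.0 \
  --region=me-west1
```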
**GAP-12: LiteLLM spend data lost** (LOW · Data loss · Score 4 · Accepted)

**Description:** LiteLLM stores spend-tracking data in SQLite on the Cloud Run instance's ephemeral disk. Every restart, scale-down, or new revision deployment wipes the spend database, so cost-tracking data is inherently unreliable.
**Impact:** Cost tracking is unreliable.
**Remediation:** Accepted risk. LiteLLM spend data is supplementary; primary cost tracking is done via GCP billing. Future options: mount a persistent Cloud Storage FUSE volume or switch to PostgreSQL (Cloud SQL, ~$10/mo). Not worth the cost at the current usage level.
**GAP-16: No graceful shutdown** (LOW · Unsafe deployments · Score 2 · Planned)

**Description:** The MCP server has no SIGTERM handler for graceful shutdown. When Cloud Run sends a termination signal during deploys or scale-down, in-flight requests are immediately dropped instead of being allowed to complete.
**Impact:** In-flight requests are dropped during deploys.
**Remediation:** Planned. Add a SIGTERM handler that stops accepting new requests, waits for in-flight requests to complete (up to Cloud Run's termination grace period), then exits cleanly. Simple code change, zero cost.
**GAP-15: CSP hardcodes region** (LOW · Cross-region latency · Score 3 · Remediated)

**Description:** The Content Security Policy (CSP) headers hardcode the us-central1 service URLs.
**Impact:** Stale CSP directives would block calls to the me-west1 endpoints after the migration.
**Remediation:** Update the CSP headers to reference the me-west1 service URLs, done as part of the GAP-01 migration.
**GAP-10: No composite indexes** (LOW · Cross-region latency · Score 3 · Remediated)

**Description:** Firestore has no composite indexes configured for multi-field queries. As the dataset grows, queries combining multiple filters (e.g., date range + category) will either fail or perform full collection scans.
**Impact:** Slow or failed queries as data grows.
**Remediation:** Define composite indexes in the Firestore index configuration for the known multi-field query patterns; a sketch follows.
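A hypothetical `firestore.indexes.json`, assuming a `transactions` collection filtered by `category` and ordered by `date` (deployable with `firebase deploy --only firestore:indexes`):

```json
{
  "indexes": [
    {
      "collectionGroup": "transactions",
      "queryScope": "COLLECTION",
      "fields": [
        { "fieldPath": "category", "order": "ASCENDING" },
        { "fieldPath": "date", "order": "DESCENDING" }
      ]
    }
  ]
}
```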
**GAP-14: No secret rotation** (LOW · Security exposure · Score 2 · Documented)

**Description:** API keys and secrets in Secret Manager are rotated manually. There is no automated rotation schedule, no alerting on secret age, and no rotation runbook. In the event of a credential compromise, the blast radius is extended by the time to detect and rotate.
**Impact:** Extended blast radius on compromise.
**Remediation:** Documented for future improvement. Implement a 90-day rotation schedule with Cloud Scheduler triggering a Cloud Function to rotate keys. For now, maintain a manual rotation runbook and set calendar reminders.
7 phases executed sequentially; a zero-downtime migration with a rollback path at every stage. Total: 2–3 hours of active work plus a 48-hour soak period.
| Script | Purpose | Usage |
|---|---|---|
| `infra/ha/deploy-ha.sh` | Deploy all services to me-west1 with HA configs | `./deploy-ha.sh --dry-run` |
| `infra/ha/backup-setup.sh` | Enable PITR, backup schedules, GCS exports | `./backup-setup.sh --dry-run` |
| `infra/ha/monitoring-setup.sh` | Create uptime checks, alerts, dashboard, SLOs | `./monitoring-setup.sh --dry-run` |
| `infra/ha/decommission-us-central1.sh` | Delete old us-central1 services | `./decommission-us-central1.sh` |
Use `--dry-run` to preview changes before execution.