High Availability Architecture Plan

Finance Categorizer — GCP Infrastructure

17 Gaps Addressed | ~200ms Latency Saved/Call | Zero Migration Downtime

Executive Summary

The Finance Categorizer application currently suffers from cross-region latency—compute services run in us-central1 while Firestore resides in me-west1, adding ~200 ms per round trip. There are no backups, no monitoring or alerting, and all Cloud Run and Cloud Functions services scale to zero, causing 10–20 s cold starts.

This plan consolidates all compute into me-west1, enables PITR and scheduled Firestore backups, adds uptime checks with PagerDuty-ready alerts, and sets min-instances=1 on critical services—closing all 17 identified gaps.

Current Architecture

Users / Browser reach the app over HTTPS via Firebase Hosting (CDN / SPA); Firebase Auth and Secret Manager are global services. External dependencies: Gemini 2.0 Flash API (LLM routing via API key) and GitHub Actions CI/CD (deploys). Project: finance-categorizer-fbbb1.

us-central1 (Iowa) — Compute
  • finance-chat (Cloud Function gen2): Node.js 20 | HTTP | min=0
  • finance-mcp-server (Cloud Run, MCP protocol): 512Mi | 1 CPU | min=0 max=10
  • litellm-proxy (Cloud Run): min=0
  All services at min=0 — cold starts 5–15s.

me-west1 (Tel Aviv) — Data
  • Firestore Native: 11+ collections | NO backups | NO PITR | no monitoring

CROSS-REGION PROBLEM: data lives in Tel Aviv (me-west1), compute runs in Iowa (us-central1), and every Firestore operation pays a ~200ms RTT penalty.

Target Architecture

Users / Browser reach the app over HTTPS via Firebase Hosting (CDN / SPA); Firebase Auth and Secret Manager remain global. External dependencies are unchanged: Gemini 2.0 Flash API (LLM relay via API key) and GitHub Actions CI/CD (canary deploys). Project: finance-categorizer-fbbb1.

me-west1 (Tel Aviv) — All Services Co-Located
  Compute
  • finance-chat (Cloud Function gen2): min=1, always-on CPU
  • finance-mcp-server (Cloud Run): min=1, health checks
  • litellm-proxy (Cloud Run): min=0
  Data
  • Firestore Native: PITR + auto backups (read/write <5ms local)
  • Backup Bucket (Cloud Storage)
  Operations
  • Cloud Scheduler: daily backup trigger
  • Cloud Monitoring: uptime checks + alerts

All services co-located in me-west1 — MCP and Firestore calls are <5ms local.

Risk Assessment

7 risks scored by severity × likelihood. Every gap in the analysis below maps to exactly one of these risks.

Risk | Severity | Likelihood | Score | Related Gaps | Mitigation
Data loss | CRITICAL | LOW | 8 | GAP-03, GAP-12, GAP-13 | PITR + daily backups
Cross-region latency | HIGH | CERTAIN | 9 | GAP-01, GAP-10, GAP-15 | Migrate compute to me-west1
Cold starts | HIGH | HIGH | 8 | GAP-02, GAP-09 | min-instances=1 + health checks
No monitoring | HIGH | HIGH | 8 | GAP-04, GAP-17 | Uptime checks + SLIs/SLOs
Regional SPOF | HIGH | LOW | 6 | GAP-05, GAP-06, GAP-11 | Accept; PITR mitigates
Unsafe deployments | MEDIUM | MEDIUM | 5 | GAP-07, GAP-08, GAP-16 | CI/CD + pinned images + graceful shutdown
Security exposure | LOW | LOW | 2 | GAP-14 | Rotation runbook
Risk → Gap mapping: The “Risk” column in the gap table below references these exact risk names, creating a direct link between risks and their individual findings.

Gap Analysis

17 gaps mapped to the 7 risks above. Each entry lists its description, impact, and remediation.

Gap ID | Title | Severity | Risk | Score | Status
GAP-03 | No Firestore backups | CRITICAL | Data loss | 8 | Remediated

Description

Firestore has no automated backup schedule and no Point-in-Time Recovery (PITR) enabled. Financial transaction data is the core asset of the application. Accidental deletion, corruption, or a bad deploy could cause complete, irrecoverable data loss.

Impact: Complete irrecoverable data loss

Remediation

Enable daily automated backups via gcloud firestore backups schedules create with 7-day retention. Enable PITR for granular point-in-time restores. Estimated cost: $3–5/mo.
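
A minimal sketch of the backup schedule, assuming the default database; the flags follow the documented gcloud syntax:

    # Daily managed Firestore backups, retained for 7 days
    gcloud firestore backups schedules create \
      --database='(default)' \
      --recurrence=daily \
      --retention=7d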

GAP-02 | Scale to zero | HIGH | Cold starts | 8 | Remediated

Description

All three compute services have min-instances=0, meaning they scale to zero during idle periods. Cold starts take 3–8s per service. When all three are cold simultaneously, the first request suffers compound cold starts of 10–20 seconds.

Impact: First request after idle takes 10–20 seconds

Remediation

Set min-instances=1 for finance-chat and finance-mcp-server to keep at least one warm instance. LiteLLM proxy can remain at 0 since it sits behind finance-chat. Estimated cost: ~$28/mo for two always-on min instances.
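
A sketch of the warm-instance settings, assuming the services are already running in the target region; for finance-chat the flag is added to its normal deploy command rather than a separate update:

    # Keep one warm Cloud Run instance for the MCP server
    gcloud run services update finance-mcp-server \
      --region=me-west1 \
      --min-instances=1

    # finance-chat (Cloud Functions gen2): add --min-instances=1 to the existing
    # deploy command; all other flags (runtime, trigger, source) stay as configured
    gcloud functions deploy finance-chat --gen2 --region=me-west1 --min-instances=1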

GAP-05 | Firestore REGIONAL SPOF | HIGH | Regional SPOF | 6 | Accepted

Description

Firestore is configured as REGIONAL in me-west1. A full regional outage (rare but possible) would cause complete application downtime. Multi-region Firestore would eliminate this but costs 3–5x more.

Impact: Complete outage during me-west1 regional events

Remediation

Accepted risk. Multi-region Firestore exceeds the $50/mo budget ceiling. Mitigated by daily backups + PITR (GAP-03). Regional outages are rare (<1 per year) and typically resolve within hours. Documented in the risk register.

GAP-01 | Cross-region latency | HIGH | Cross-region latency | 9 | Remediated

Description

Compute services (finance-chat, finance-mcp-server, litellm-proxy) run in us-central1 (Iowa) while Firestore lives in me-west1 (Tel Aviv). Every Firestore read/write pays ~200ms round-trip latency. A single chat request triggers 1–5 tool calls, compounding to 200–1000ms of pure network overhead.

Impact: P95 latency exceeds acceptable thresholds

Remediation

Migrate all compute services to me-west1 to co-locate with Firestore. Update CSP headers to point to new region URLs. Eliminates cross-region latency entirely.
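
A sketch of the per-service move, assuming the same container image and configuration are reused and only the region changes (infra/ha/deploy-ha.sh wraps the full set):

    # Stand up the MCP server in me-west1 next to Firestore; the us-central1
    # service keeps serving until traffic is cut over in Phase 5
    gcloud run deploy finance-mcp-server \
      --region=me-west1 \
      --image=<image currently deployed in us-central1> \
      --memory=512Mi \
      --cpu=1 \
      --max-instances=10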

GAP-04 | No monitoring | HIGH | No monitoring | 8 | Remediated

Description

Zero Cloud Monitoring alerts, no uptime checks, no notification channels configured. Mean Time to Detect (MTTD) is effectively unbounded — outages are only discovered when users complain.

Impact: Outages only discovered via user complaints

Remediation

Configure HTTPS uptime checks for finance-chat and MCP server endpoints. Create alert policies for error rate >1%, latency P95 >5s, and instance count dropping to 0. Set up email/Slack notification channels. All within GCP free tier.
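
A minimal sketch using the MCP server's /health endpoint; the hostname is a placeholder, exact flag names should be checked against the installed gcloud release, and alert policies are simplest to load from a JSON definition:

    # HTTPS uptime check against the MCP server health endpoint
    gcloud monitoring uptime create finance-mcp-uptime \
      --resource-type=uptime-url \
      --resource-labels=host=<finance-mcp-server-hostname>,project_id=finance-categorizer-fbbb1 \
      --protocol=https \
      --path=/health

    # Alert policies (error rate >1%, P95 latency >5s, instance count = 0) from a file
    gcloud alpha monitoring policies create --policy-from-file=alert-policies.json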

GAP-13 | No PITR | MEDIUM | Data loss | 3 | Remediated

Description

Point-in-Time Recovery is not enabled on the Firestore database. Without PITR, the only restore option is from the most recent daily backup, meaning up to 24 hours of data could be lost in a recovery scenario.

Impact: Poor recovery granularity (up to 24h data loss)

Remediation

Enable PITR via gcloud firestore databases update --enable-pitr. Allows restore to any point in the last 7 days with second-level granularity. Included in the backup cost estimate ($3–5/mo).
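
The full form of the command referenced above, assuming the default database:

    # Enable Point-in-Time Recovery on the default Firestore database
    gcloud firestore databases update \
      --database='(default)' \
      --enable-pitr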

GAP-07 | No CI/CD for MCP/LiteLLM | MEDIUM | Unsafe deployments | 5 | Planned

Description

The MCP server and LiteLLM proxy are deployed manually with no automated pipeline, no smoke tests, and no automated rollback. A bad deploy immediately impacts 100% of traffic with no safety net.

Impact: Bad deploys immediately impact 100% of traffic

Remediation

Planned: Create GitHub Actions workflows for both services mirroring the existing finance-chat deploy pipeline. Include build, smoke test, and deploy stages with Cloud Run traffic splitting for canary deployments.
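
A sketch of the canary step such a workflow could run after the build and smoke-test stages; the tag name and traffic percentage are illustrative:

    # Deploy a new revision without sending it any traffic
    gcloud run deploy finance-mcp-server \
      --region=me-west1 \
      --image=<new image> \
      --no-traffic \
      --tag=canary

    # Shift 10% of traffic to the canary; promote to 100% once smoke tests pass
    gcloud run services update-traffic finance-mcp-server \
      --region=me-west1 \
      --to-tags=canary=10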

GAP-09 | No health checks | MEDIUM | Cold starts | 5 | Remediated

Description

The MCP server Cloud Run service has no startup or liveness probe configured. During scale-up events, Cloud Run may route traffic to instances that haven’t finished initializing, causing intermittent 500 errors.

Impact: Intermittent 500 errors during scale events

Remediation

Configure startup and liveness probes on the Cloud Run service (via the service spec's startupProbe/livenessProbe settings), pointing at /health. Add a /health endpoint to the MCP server that verifies Firestore connectivity. Zero additional cost.

GAP-06 | In-memory rate limiter | MEDIUM | Regional SPOF | 6 | Documented

Description

The rate limiter and circuit breaker in finance-chat use in-memory state. This state is lost every time the instance cold starts or scales down, and is inconsistent across multiple instances during scale-up events.

Impact: Rate limiting bypassable, circuit breaker never truly opens

Remediation

Documented for future improvement. Options include Firestore-backed rate limiting (adds latency per check) or Redis/Memorystore (adds ~$30/mo). Current risk is acceptable given low traffic volume. Revisit when traffic grows.

GAP-11 | No fallback model | MEDIUM | Regional SPOF | 5 | Planned

Description

The LiteLLM proxy has a single model dependency (Gemini). No fallback model is configured. If Google’s Gemini API experiences an outage or rate limiting, the entire chat functionality becomes unavailable.

Impact: Gemini outage = complete chat outage

Remediation

Planned: Configure gemini-1.5-flash as a fallback model in LiteLLM’s routing config. Optionally add a second provider (e.g., Claude or GPT-4o-mini) as a tertiary fallback. Minimal additional cost since fallback is only used during outages.

GAP-17 | No SLI/SLO | MEDIUM | No monitoring | 5 | Remediated

Description

The 99.9% SLA target is unmeasurable because no Service Level Indicators (SLIs) or Service Level Objectives (SLOs) are defined. Without SLIs, there is no way to determine whether the system is meeting its availability target or burning through error budget.

Impact: Cannot determine if SLA is being met

Remediation

Define SLIs (availability = successful requests / total requests, latency = P95 < 3s) and create Cloud Monitoring SLO objects. Set up error budget burn-rate alerts. Covered by GCP free tier monitoring.

GAP-08 | Unpinned LiteLLM Docker | MEDIUM | Unsafe deployments | 5 | Remediated

Description

LiteLLM proxy uses the main-latest Docker image tag, which tracks upstream HEAD. Any breaking change, regression, or vulnerability in the upstream image is automatically pulled into production on the next deploy or restart.

Impact: Upstream regression can break production

Remediation

Pin the LiteLLM Docker image to a specific version tag (e.g., ghcr.io/berriai/litellm:main-v1.55.8). Update deliberately after testing. Add Dependabot or Renovate for automated version update PRs.
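
One way to pin, sketched under the assumption that the image is mirrored into an Artifact Registry repository Cloud Run can pull from; the repo name is a placeholder:

    # Mirror the pinned upstream tag into Artifact Registry
    docker pull ghcr.io/berriai/litellm:main-v1.55.8
    docker tag ghcr.io/berriai/litellm:main-v1.55.8 \
      me-west1-docker.pkg.dev/finance-categorizer-fbbb1/containers/litellm:main-v1.55.8
    docker push me-west1-docker.pkg.dev/finance-categorizer-fbbb1/containers/litellm:main-v1.55.8

    # Deploy the proxy from the pinned tag instead of main-latest
    gcloud run deploy litellm-proxy \
      --region=me-west1 \
      --image=me-west1-docker.pkg.dev/finance-categorizer-fbbb1/containers/litellm:main-v1.55.8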

GAP-12 | LiteLLM spend data lost | LOW | Data loss | 4 | Accepted

Description

LiteLLM stores spend tracking data in SQLite on the Cloud Run instance’s ephemeral disk. Every restart, scale-down, or new revision deployment wipes the spend database. Cost tracking data is inherently unreliable.

Impact: Cost tracking unreliable

Remediation

Accepted risk. LiteLLM spend data is supplementary — primary cost tracking is done via GCP billing. Options for future: mount a persistent Cloud Storage FUSE volume or switch to PostgreSQL (Cloud SQL, ~$10/mo). Not worth the cost for current usage level.

GAP-16 | No graceful shutdown | LOW | Unsafe deployments | 2 | Planned

Description

The MCP server has no SIGTERM handler for graceful shutdown. When Cloud Run sends a termination signal during deploys or scale-down, in-flight requests are immediately dropped instead of being allowed to complete.

Impact: In-flight requests dropped during deploys

Remediation

Planned: Add a SIGTERM handler that stops accepting new requests, waits for in-flight requests to complete (up to Cloud Run’s termination grace period), then exits cleanly. Simple code change, zero cost.

GAP-15 | CSP hardcodes region | LOW | Cross-region latency | 3 | Remediated

Description

The Content Security Policy (CSP) connect-src directive hardcodes us-central1 Cloud Function URLs. When compute migrates to me-west1, the frontend will be unable to reach the new endpoints until CSP is updated.

Impact: Frontend outage on migration if CSP not updated

Remediation

Update the CSP connect-src in firebase.json to include me-west1 URLs before migration. After migration is verified, remove the old us-central1 URLs. Zero cost.

GAP-10 | No composite indexes | LOW | Cross-region latency | 3 | Remediated

Description

Firestore has no composite indexes configured for multi-field queries. As the dataset grows, queries combining multiple filters (e.g., date range + category) will either fail or perform full collection scans.

Impact: Slow or failed queries as data grows

Remediation

Define composite indexes in firestore.indexes.json for common query patterns. Deploy via firebase deploy --only firestore:indexes. No cost impact — indexes are included in Firestore pricing.
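
A sketch of one index definition for the date-range + category pattern mentioned above; the collection and field names are illustrative, not taken from the actual schema:

    # firestore.indexes.json — composite index for a category + date-range query
    # (collection "transactions" and both field names are placeholders)
    #
    # {
    #   "indexes": [
    #     {
    #       "collectionGroup": "transactions",
    #       "queryScope": "COLLECTION",
    #       "fields": [
    #         { "fieldPath": "category", "order": "ASCENDING" },
    #         { "fieldPath": "date", "order": "DESCENDING" }
    #       ]
    #     }
    #   ],
    #   "fieldOverrides": []
    # }

    # Build the indexes; completion can be watched in the Firebase console
    firebase deploy --only firestore:indexes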

GAP-14 | No secret rotation | LOW | Security exposure | 2 | Documented

Description

API keys and secrets in Secret Manager are rotated manually. There is no automated rotation schedule, alerting on secret age, or rotation runbook. In the event of a credential compromise, the blast radius is extended by the time to detect and rotate.

Impact: Extended blast radius on compromise

Remediation

Documented for future improvement. Implement a 90-day rotation schedule with Cloud Scheduler triggering a Cloud Function to rotate keys. For now, maintain a manual rotation runbook and set calendar reminders.
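
When the automated path is implemented, a sketch of the scheduled trigger; the function URL, service account, and location are placeholders, and the rotation function itself is out of scope here:

    # Invoke a key-rotation Cloud Function roughly every 90 days (1st of every 3rd month)
    gcloud scheduler jobs create http rotate-secrets \
      --location=<region> \
      --schedule="0 3 1 */3 *" \
      --uri=<rotation-function-url> \
      --oidc-service-account-email=<scheduler-invoker-sa>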

Result: 10 Remediated, 3 Planned, 2 Accepted, 2 Documented — zero gaps left unaddressed.

Migration Plan

7 phases executed sequentially. Zero-downtime migration with rollback at every stage.
Total: 2–3 hours active work + 48-hour soak period.

Phase 1: Data Protection (15 minutes)
  • Enable Point-in-Time Recovery (PITR) on Firestore
  • Create daily backup schedule via Cloud Scheduler
  • Provision GCS bucket for backup exports
Rollback: Disable PITR, delete schedule + bucket

Phase 2: Composite Indexes (10 minutes)
  • Deploy 4 composite indexes for query optimization
  • Verify index build completes via Firebase console
  • Run query validation suite
Rollback: Delete new indexes (no impact on existing queries)

Phase 3: Deploy to me-west1 (20 minutes)
  • Deploy finance-chat, MCP server, litellm-proxy to me-west1
  • Services run in both regions simultaneously
  • Pin litellm Docker image to specific SHA digest
Rollback: Delete me-west1 services (us-central1 still live)

Phase 4: Validate (30 minutes)
  • Run smoke tests against me-west1 endpoints
  • Compare latency metrics: us-central1 vs me-west1
  • Validate health check endpoints respond correctly
Rollback: Continue using us-central1 if validation fails

Phase 5: Cut Traffic (15 minutes)
  • Update frontend CSP and API URLs to me-west1 endpoints
  • Update CI/CD pipeline region configuration
  • Deploy updated Firebase Hosting config
Rollback: Revert frontend config to us-central1 URLs

Phase 6: Monitoring (30 minutes)
  • Create uptime checks for all service health endpoints
  • Configure alert policies with notification channels
  • Build Cloud Monitoring dashboard and define SLOs
Rollback: Delete monitoring resources (no service impact)

Phase 7: Decommission (10 minutes + 48-hour soak)
  • Soak period: monitor me-west1 services for 48 hours
  • Delete us-central1 Cloud Run revisions and CF deployments
  • Archive migration artifacts and update documentation
Rollback: Re-deploy to us-central1 from CI/CD (~15 min)
Global Rollback Guarantee: Full rollback to us-central1 is possible in ~15 minutes at any point during the migration.

IaC Scripts Reference

Script Purpose Usage
infra/ha/deploy-ha.sh Deploy all services to me-west1 with HA configs ./deploy-ha.sh --dry-run
infra/ha/backup-setup.sh Enable PITR, backup schedules, GCS exports ./backup-setup.sh --dry-run
infra/ha/monitoring-setup.sh Create uptime checks, alerts, dashboard, SLOs ./monitoring-setup.sh --dry-run
infra/ha/decommission-us-central1.sh Delete old us-central1 services ./decommission-us-central1.sh
Note: All scripts are idempotent (safe to re-run) and support --dry-run mode for previewing changes before execution.
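
A typical dry-run pass before Phase 1, in the order the phases use the scripts; this assumes no required arguments beyond --dry-run:

    # Preview every change before executing the migration
    cd infra/ha
    ./backup-setup.sh --dry-run              # Phase 1: PITR, backup schedule, GCS bucket
    ./deploy-ha.sh --dry-run                 # Phase 3: me-west1 deployments with HA configs
    ./monitoring-setup.sh --dry-run          # Phase 6: uptime checks, alerts, dashboard, SLOs
    ./decommission-us-central1.sh --dry-run  # Phase 7: execute for real only after the 48h soak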