# mcp-data-platform

> Composable, semantic-first MCP server platform that composes mcp-datahub, mcp-trino, and mcp-s3 behind one endpoint with bidirectional cross-enrichment, OAuth 2.1 inbound and outbound, OIDC, role-based personas, audit logging, an admin and user portal, knowledge capture with DataHub write-back, a memory layer (PostgreSQL + pgvector), and a gateway toolkit that re-exposes any third-party MCP server through the same auth, persona, and audit pipeline.

mcp-data-platform is the orchestration layer for the txn2 MCP ecosystem. DataHub is the only required dependency (the semantic layer); add Trino for SQL and S3 for objects when ready. The platform turns a stack of toolkits into an enterprise-grade MCP server with cross-enrichment that injects business context into every response.

## MCP Server

- [Server Overview](https://mcp-data-platform.txn2.com/server/overview/): What the platform does, architecture, request flow
- [Installation](https://mcp-data-platform.txn2.com/server/installation/): Install via go install, Homebrew, Docker, or from source
- [Configuration](https://mcp-data-platform.txn2.com/server/configuration/): YAML configuration with environment variable expansion, config versioning (apiVersion, version lifecycle, migrate-config CLI), granular config entries (per-key database overrides for whitelisted keys with hot-reload), tool visibility filtering, tool description overrides (built-in and custom per-tool), workflow gating (session-aware enforcement, with escalation, that agents call DataHub discovery before running Trino queries), prompts (auto-registered platform-overview from description + toolkits, operator-configured with placeholder argument substitution, conditional workflow prompts, PromptDescriber interface, operator override by name), resource templates, custom resources (static MCP resources from config), managed resources (human-uploaded files with scope-based visibility), progress notifications, client logging, elicitation (cost estimation, PII consent), icons, admin API and portal, database, audit, session configuration, and browser session configuration (OIDC login with PKCE for portal UI, cookie-based sessions)
- [The Data Stack](https://mcp-data-platform.txn2.com/concepts/components/): Why DataHub, Trino, and S3, what each component brings, how cross-enrichment makes them greater than the sum of parts
- [Operating Modes](https://mcp-data-platform.txn2.com/server/operating-modes/): Two deployment modes: standalone (no database) and file + database (with per-key config overrides via config entries). Feature availability by mode, example configurations, decision guide
- [Deployment](https://mcp-data-platform.txn2.com/server/deployment/): Docker Compose and Kubernetes/Helm deployment guides
- [Tools](https://mcp-data-platform.txn2.com/server/tools/): All tools from DataHub, Trino, S3, Knowledge, and Portal toolkits including trino_export for portal asset export
- [Multi-Provider](https://mcp-data-platform.txn2.com/server/multi-provider/): Connect multiple instances of each service
- [Audit Logging](https://mcp-data-platform.txn2.com/server/audit/): PostgreSQL-backed audit logging for tool calls. Schema, field reference (including enrichment_mode, enrichment_tokens_full, enrichment_tokens_dedup), parameter sanitization, caller-class separation via `source` (mcp = agent, rest = gateway REST shim used by NiFi/cronjobs, admin = portal-driven tool runs), monthly partition rotation with eager-ahead `CREATE TABLE IF NOT EXISTS audit_logs_YYYY_MM` and `DROP TABLE` for fully expired partitions, retention, query examples, troubleshooting
- [Observability (Metrics)](https://mcp-data-platform.txn2.com/server/observability/): OpenTelemetry-backed Prometheus metrics on a dedicated `:9090` listener (Phase 1). Instruments two chokepoints: every MCP tool call (`mcp_tool_calls_total`, `mcp_tool_call_duration_seconds`, `mcp_inflight_tool_calls` with bounded `tool` / `toolkit_kind` / `persona` / `status_category` labels) and every apigateway outbound HTTP call (`apigateway_outbound_total`, `apigateway_outbound_duration_seconds` with `connection` / `http_status_class` / `status_category`). High-cardinality fields (user id, raw URLs, request id) are deliberately kept off labels and reserved for trace spans in Phase 2. Env-only config: `OTEL_METRICS_ENABLED` (default true), `OTEL_METRICS_ADDR`. Safe to mount `/metrics` behind a NetworkPolicy; set `OTEL_METRICS_ENABLED=false` to disable.
- [Session Externalization](https://mcp-data-platform.txn2.com/server/session-externalization/): Externalize session state to PostgreSQL for zero-downtime restarts and horizontal scaling. Includes the session broadcaster that delivers `notifications/tools/list_changed` to SSE long-poll subscribers in stateless streamable HTTP mode (memory broadcaster single-replica, postgres LISTEN/NOTIFY for multi-replica), so downstream agents see live tool inventory updates without disconnecting.
- [Gateway Toolkit](https://mcp-data-platform.txn2.com/server/gateway/): Re-expose third-party MCP servers through the platform's auth, persona, and audit pipeline. Connections authored via the admin portal (DB-backed, encrypted credentials). Tools surface as connection__remote_tool. OAuth 2.1 client_credentials and authorization_code+PKCE grants supported; refresh tokens are persisted encrypted (gateway_oauth_tokens) so cron jobs and scheduled prompts run untouched after one-time browser sign-in. Optional declarative cross-enrichment rules join proxied responses with Trino queries or DataHub lookups (predicates, JSONPath bindings, escaped ANSI-SQL literals, dry-run admin endpoint). Failure-isolated: unreachable upstreams never block startup or other tools
- [API Gateway Toolkit](https://mcp-data-platform.txn2.com/server/api-gateway/): Proxy arbitrary REST/HTTP APIs (kind `api`) through the same auth, persona, and audit pipeline used by the MCP gateway. Three tools (`api_invoke_endpoint`, `api_list_endpoints`, `api_get_endpoint_schema`) cover every operation on every upstream without per-endpoint tool explosion. Auth modes: none, bearer, api_key (header or query), basic (RFC 7617 `Authorization: Basic base64(username:password)` for legacy APIs like Jenkins or on-prem Jira; password may be empty for the token-in-userid pattern), oauth2_client_credentials, oauth2_authorization_code (browser sign-in with persisted refresh tokens). `static_headers` attaches operator-supplied headers to every call alongside the auth header, required for APIs that demand both an OAuth bearer AND a separate project/subscription header (Google's `x-goog-user-project` for quota billing, vendor subscription keys). Header values are encrypted at rest (AES-256-GCM); validation refuses names that collide with Authorization, the api_key header, or hop-by-hop headers, and refuses CRLF/NUL in values. Model is blocked at request time from setting or overriding any `static_headers` entry; operator config is authoritative. A REST shim at `POST /api/v1/gateway/{connection}/invoke` exposes `api_invoke_endpoint` to non-MCP HTTP clients (Apache NiFi, Airflow `HttpOperator`, `curl`); the same `Authorization`/`X-API-Key` headers, persona allowlists, route-policy gates, and audit pipeline govern REST callers identically to MCP callers, since every REST request runs through an in-memory MCP session against the assembled server. Upstream HTTP status is surfaced in the response body (`InvokeOutput.status`); the platform's own status code only signals platform-level outcomes (`200` = call ran, `401`/`403`/`404`/`400` for credential/persona/connection/validation failures).
- [API Catalogs](https://mcp-data-platform.txn2.com/server/api-catalogs/): OpenAPI 3.x specs are stored globally as versioned catalog bundles, not per-connection. A catalog has an immutable id, a name+version pair, and a list of named component specs (e.g. drive, calendar, gmail for a Google Workspace catalog). Connections reference a catalog via `config.catalog_id`; one catalog can back many connections (sandbox vs prod, multiple tenants). Component specs are ingested via paste, file upload, or HTTPS URL fetch with strict SSRF guards (HTTPS-only, private/loopback/CGNAT/link-local IPs rejected with a dial-time recheck for DNS rebinding, 10 MB body cap). `api_list_endpoints` returns operations with a `spec` field; `api_get_endpoint_schema` returns parameters, request body, and per-status response schemas (security/servers/auth-vendor-extensions stripped). Mutations to a catalog fan out to every referencing connection without a process restart. Per-operation embedding vectors used by `ranking=semantic|hybrid` are computed off the request path by a Postgres-backed job queue (`api_catalog_embedding_jobs`, `SELECT FOR UPDATE SKIP LOCKED`, `LISTEN/NOTIFY` for low-latency wake) and persisted in `api_catalog_operation_embeddings` (pgvector, keyed on catalog_id+spec_name+operation_id). Spec writes enqueue jobs atomically; workers across every pod race for them, take time-bounded leases, and write vectors. A reconciler enqueues missing vectors on pod boot and every 5 minutes thereafter, so embeddings converge to fully indexed without operator action. Operators see per-spec status badges in the portal (indexed / running / queued / failed) and never need to click anything; a Retry button surfaces only when a job has exhausted its retries.
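
The SSRF guard described for catalog URL fetches (private, loopback, CGNAT, and link-local addresses rejected) can be illustrated with a minimal Go sketch. This is not the platform's implementation: the function name `IsBlockedIP` and the exact set of rejected ranges are assumptions chosen for illustration, built only on the standard `net/netip` package.

```go
package main

import (
	"fmt"
	"net/netip"
)

// cgnat is the carrier-grade NAT range (RFC 6598). netip's IsPrivate
// does not cover it, so it is checked separately.
var cgnat = netip.MustParsePrefix("100.64.0.0/10")

// IsBlockedIP is a hypothetical helper: it reports whether a literal IP
// address falls in a range that an SSRF guard would typically refuse.
func IsBlockedIP(raw string) (bool, error) {
	addr, err := netip.ParseAddr(raw)
	if err != nil {
		return false, fmt.Errorf("parse %q: %w", raw, err)
	}
	// Treat IPv4-mapped IPv6 (::ffff:10.0.0.1) as the underlying IPv4 address.
	addr = addr.Unmap()
	blocked := addr.IsPrivate() ||
		addr.IsLoopback() ||
		addr.IsLinkLocalUnicast() ||
		addr.IsLinkLocalMulticast() ||
		addr.IsUnspecified() ||
		cgnat.Contains(addr)
	return blocked, nil
}

func main() {
	for _, ip := range []string{"10.0.0.1", "127.0.0.1", "100.64.1.1", "169.254.169.254", "8.8.8.8"} {
		blocked, err := IsBlockedIP(ip)
		fmt.Println(ip, blocked, err)
	}
}
```

A hostname check alone is not enough against DNS rebinding; as the entry above notes, the same test has to run again on the address actually dialed.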

## Cross-Enrichment

- [Overview](https://mcp-data-platform.txn2.com/cross-enrichment/overview/): How automatic context enrichment works between services
- [Trino to DataHub](https://mcp-data-platform.txn2.com/cross-enrichment/trino-datahub/): Trino results include DataHub metadata
- [DataHub to Trino](https://mcp-data-platform.txn2.com/cross-enrichment/datahub-trino/): DataHub results show query availability
- [S3 Enrichment](https://mcp-data-platform.txn2.com/cross-enrichment/s3/): S3 operations include semantic context
- [Lineage Inheritance](https://mcp-data-platform.txn2.com/cross-enrichment/lineage/): Automatic column metadata inheritance from upstream datasets via DataHub lineage
- [Session Dedup](https://mcp-data-platform.txn2.com/cross-enrichment/overview/#session-metadata-deduplication): Avoids repeating semantic metadata for previously-enriched tables within a session, saving LLM context tokens
- [Column Context Filtering](https://mcp-data-platform.txn2.com/cross-enrichment/trino-datahub/#column-level-enrichment): Limits column-level enrichment to columns referenced in the SQL query, reducing token usage for wide tables (default: enabled)
- [Schema Preview](https://mcp-data-platform.txn2.com/cross-enrichment/datahub-trino/#schema-preview): Adds bounded column-name+type preview to datahub_search query_context, eliminating intermediate datahub_get_schema calls (default: enabled, max 15 columns)
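
The idea behind session dedup can be sketched in a few lines of Go: remember which tables were already enriched in a session and skip the full metadata on repeats. The `Dedup` type and `ShouldEnrich` method are hypothetical names for illustration; the platform's actual data structures, keys, and expiry rules are not described here and may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// Dedup tracks, per session, which tables have already received
// full semantic enrichment. Hypothetical type, for illustration only.
type Dedup struct {
	mu   sync.Mutex
	seen map[string]map[string]struct{}
}

// NewDedup returns an empty tracker.
func NewDedup() *Dedup {
	return &Dedup{seen: make(map[string]map[string]struct{})}
}

// ShouldEnrich reports true the first time a table is seen in a session
// and false on later calls for the same session and table.
func (d *Dedup) ShouldEnrich(sessionID, table string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	tables, ok := d.seen[sessionID]
	if !ok {
		tables = make(map[string]struct{})
		d.seen[sessionID] = tables
	}
	if _, dup := tables[table]; dup {
		return false
	}
	tables[table] = struct{}{}
	return true
}

func main() {
	d := NewDedup()
	fmt.Println(d.ShouldEnrich("session-a", "sales.orders")) // first sight: enrich
	fmt.Println(d.ShouldEnrich("session-a", "sales.orders")) // repeat: skip
	fmt.Println(d.ShouldEnrich("session-b", "sales.orders")) // new session: enrich
}
```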

## Authentication & Security

- [Auth Overview](https://mcp-data-platform.txn2.com/auth/overview/): Fail-closed security model, stdio vs HTTP authentication
- [OIDC](https://mcp-data-platform.txn2.com/auth/oidc/): Keycloak, Auth0, Okta, Azure AD setup with required claims
- [API Keys](https://mcp-data-platform.txn2.com/auth/api-keys/): Service account authentication
- [OAuth Server](https://mcp-data-platform.txn2.com/auth/oauth-server/): Built-in OAuth 2.1 for Claude Desktop + Keycloak integration (this platform as the OAuth provider)
- [OAuth to Upstream MCPs](https://mcp-data-platform.txn2.com/auth/oauth-gateway/): OAuth-outbound from the platform to third-party MCP servers proxied through the gateway. Both client_credentials (M2M) and authorization_code + PKCE (browser sign-in) grants. Encrypted refresh tokens persist across restarts. The platform sends `oauth_scope` verbatim; operators add `offline_access` (Keycloak/Auth0/Okta) or `refresh_token` (Salesforce) themselves to get refresh tokens that survive the IdP's SSO session idle timeout. Background refresh loop (default 5m cadence) keeps tokens alive without operator touch; per-connection `oauth2_refresh_max_lifetime` handles IdPs that don't disclose a refresh deadline (Microsoft 90d, Google sliding, vendor IdPs that enforce a wall-clock max). Every connect/refresh/rotation/revocation/admin-deletion is recorded in `connection_auth_events` (90-day retention) and surfaced in the portal's OAuth History panel. Admin status card distinguishes never-connected / revoked / connected. Salesforce Hosted MCP setup walkthrough included
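
For readers unfamiliar with the PKCE step in the authorization_code grant, the following Go sketch shows how a code verifier and its S256 challenge are derived under RFC 7636. The function names are illustrative assumptions, not the platform's API; only standard library calls are used.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// NewVerifier returns a random PKCE code verifier: 32 random bytes,
// base64url-encoded without padding (43 characters).
func NewVerifier() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// ChallengeS256 derives the code challenge for the S256 method:
// base64url(SHA-256(verifier)), without padding.
func ChallengeS256(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	verifier, err := NewVerifier()
	if err != nil {
		panic(err)
	}
	// The challenge goes in the authorization request; the verifier is
	// sent later with the token request so the server can compare them.
	fmt.Println("code_verifier: ", verifier)
	fmt.Println("code_challenge:", ChallengeS256(verifier))
}
```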

## Personas

- [Overview](https://mcp-data-platform.txn2.com/personas/overview/): Role-based tool access control with connection-level filtering
- [Tool Filtering](https://mcp-data-platform.txn2.com/personas/tool-filtering/): Allow/deny patterns with wildcards, distinction between persona-level filtering (security boundary) and global tool visibility (token optimization), connection-level allow/deny patterns
- [Role Mapping](https://mcp-data-platform.txn2.com/personas/role-mapping/): Map OIDC roles to personas
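
Allow/deny filtering with wildcards can be sketched as follows. This is a minimal illustration, not the platform's matcher: it assumes glob semantics from Go's `path.Match`, that deny patterns take precedence over allow patterns, and that a tool matching no allow pattern is refused. The `Allowed` function is a hypothetical name.

```go
package main

import (
	"fmt"
	"path"
)

// matchAny reports whether tool matches at least one glob pattern.
// Malformed patterns are treated as non-matching.
func matchAny(patterns []string, tool string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, tool); err == nil && ok {
			return true
		}
	}
	return false
}

// Allowed applies the assumed rule order: any deny match refuses the
// tool; otherwise the tool must match at least one allow pattern.
func Allowed(allow, deny []string, tool string) bool {
	if matchAny(deny, tool) {
		return false
	}
	return matchAny(allow, tool)
}

func main() {
	allow := []string{"datahub_*", "trino_*"}
	deny := []string{"trino_export"}
	for _, tool := range []string{"datahub_search", "trino_export", "memory_recall"} {
		fmt.Println(tool, Allowed(allow, deny, tool))
	}
}
```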

## Administration

- [User Portal](https://mcp-data-platform.txn2.com/server/portal-user/): User-facing portal pages including activity analytics, assets, collections, resources, shared with me, knowledge & memory, and prompts
- [Admin Portal](https://mcp-data-platform.txn2.com/server/admin-portal/): Built-in web dashboard for monitoring and managing the platform with configurable branding, public viewer two-zone header with optional implementor brand, light/dark mode toggle, expiration countdown, per-share notice text and hide-expiration option, collections with sharing, dashboard with activity timelines and percentiles, tools overview, interactive tool explorer with semantic enrichment display, searchable audit log, knowledge insight governance, settings pages for connections (multi-connection toolkit expansion, source badges), personas (source tracking, file-only delete protection), API keys, and configuration entries
- [Admin API](https://mcp-data-platform.txn2.com/server/admin-api/): REST endpoints for system info, config management (config entries CRUD for whitelisted keys, config changelog), personas, auth keys, audit (events, stats, metrics/overview, metrics/enrichment, metrics/discovery), knowledge, connection instance CRUD. Authentication, operating mode behavior, request/response reference. Interactive Swagger UI at /api/v1/admin/docs/

## Knowledge Capture

- [Overview](https://mcp-data-platform.txn2.com/knowledge/overview/): Tribal knowledge capture for data catalogs. capture_insight records domain knowledge during AI sessions (backed by memory_records); apply_knowledge provides admin review, synthesis, and DataHub write-back with changeset tracking and rollback
- [Governance Workflow](https://mcp-data-platform.txn2.com/knowledge/governance/): Active metadata management through human-in-the-loop curation with bulk review, approve/reject, synthesize change proposals, apply changes to DataHub, changeset tracking, rollback, column-level targeting, agent-driven curated query creation, context document CRUD
- [Admin API](https://mcp-data-platform.txn2.com/knowledge/admin-api/): REST endpoints for managing insights and changesets

## Memory Layer

- [Overview](https://mcp-data-platform.txn2.com/memory/overview/): Persistent memory for agent and analyst sessions. Backed by PostgreSQL + pgvector. Two scoping axes (user ownership, persona visibility). LOCOMO dimensions (knowledge, event, entity, relationship, preference). memory_manage tool (remember, update, forget, list, review_stale). memory_recall tool (entity, semantic, graph, auto strategies). Cross-enrichment middleware auto-attaches memories to toolkit responses. Staleness watcher flags memories when referenced DataHub entities change. Correction chains via metadata.superseded_by
- [Configuration](https://mcp-data-platform.txn2.com/memory/configuration/): Memory config reference, embedding provider (Ollama nomic-embed-text 768-dim or noop), staleness watcher interval and batch size, persona opt-in via memory_* in tools.allow, pgvector extension setup, migration from knowledge_insights

## Managed Resources

- [Overview](https://mcp-data-platform.txn2.com/resources/overview/): Human-uploaded reference material surfaced to AI assistants via MCP resources/list and resources/read. Three scopes (global, persona, user). PostgreSQL metadata + S3 blob storage. REST API at /api/v1/resources. URI scheme mcp://scope/category/filename. Portal Resources page for upload, browse, edit, delete
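
As a rough illustration of the `mcp://scope/category/filename` scheme, the Go sketch below splits such a URI into its three parts. The `ResourceURI` type, the `ParseResourceURI` function, and the sample URI are all assumptions made for this example; how the platform actually encodes scopes (for instance, whether persona or user identifiers appear in the first segment) is not specified here.

```go
package main

import (
	"fmt"
	"strings"
)

// ResourceURI holds the three segments of an mcp:// resource URI.
// Hypothetical type, for illustration only.
type ResourceURI struct {
	Scope    string
	Category string
	Filename string
}

// ParseResourceURI splits "mcp://scope/category/filename" into its parts
// and rejects URIs with a different scheme or missing segments.
func ParseResourceURI(raw string) (ResourceURI, error) {
	const prefix = "mcp://"
	if !strings.HasPrefix(raw, prefix) {
		return ResourceURI{}, fmt.Errorf("not an mcp:// URI: %q", raw)
	}
	parts := strings.SplitN(strings.TrimPrefix(raw, prefix), "/", 3)
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" || parts[2] == "" {
		return ResourceURI{}, fmt.Errorf("expected mcp://scope/category/filename, got %q", raw)
	}
	return ResourceURI{Scope: parts[0], Category: parts[1], Filename: parts[2]}, nil
}

func main() {
	// Sample URI invented for the example.
	u, err := ParseResourceURI("mcp://global/glossary/terms.md")
	fmt.Println(u.Scope, u.Category, u.Filename, err)
}
```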

## MCP Apps

- [Overview](https://mcp-data-platform.txn2.com/mcpapps/overview/): Interactive UI panels rendered inline in the MCP host. Built-in platform-info app embedded in the binary. Branding via config. Custom apps still supported with assets_path
- [Configuration](https://mcp-data-platform.txn2.com/mcpapps/configuration/): MCP Apps enabled by default. Branding overrides via config block. Custom app registration with assets_path, CSP settings, and config injection
- [Development](https://mcp-data-platform.txn2.com/mcpapps/development/): Docker-based development with test harness. mcpapps-dev.yaml uses assets_path to override embedded HTML for live editing
- [Tutorial](https://mcp-data-platform.txn2.com/mcpapps/tutorial/): Step-by-step guide to building a platform-info app

## Go Library

- [Library Overview](https://mcp-data-platform.txn2.com/library/overview/): Build custom MCP servers using the Go library
- [Quick Start](https://mcp-data-platform.txn2.com/library/quickstart/): Code examples for common patterns
- [Architecture](https://mcp-data-platform.txn2.com/library/architecture/): Package structure, MCP protocol middleware, provider interfaces
- [Extensibility](https://mcp-data-platform.txn2.com/library/extensibility/): Custom toolkits, providers, middleware

## Reference

- [Tools API](https://mcp-data-platform.txn2.com/reference/tools-api/): Complete tool specifications with parameters and responses
- [Configuration Reference](https://mcp-data-platform.txn2.com/reference/configuration/): Full YAML schema with all options including portal configuration and export settings
- [Providers](https://mcp-data-platform.txn2.com/reference/providers/): Semantic, query, and storage provider interfaces
- [Middleware](https://mcp-data-platform.txn2.com/reference/middleware/): Request processing chain including tool visibility, description override middleware, session-aware workflow gating, icon enrichment, client logging, progress notifications
- [Tuning and Scaling](https://mcp-data-platform.txn2.com/reference/tuning-and-scaling/): Resource requests and limits, Go runtime tuning (GOMEMLIMIT, GOMAXPROCS, GOGC), horizontal scaling characteristics (which subsystems are HA-safe vs per-replica), PostgreSQL connection pool sizing for multi-replica, autoscaling guidance

## Key Capabilities

- Semantic first: every data response carries DataHub business context (owners, tags, glossary terms, quality scores, deprecation warnings) injected automatically at the protocol layer.
- Composable toolkits: DataHub required, Trino and S3 optional, plus a Toolkit interface for custom integrations. Multi-instance per service with runtime selection and isolated failure domains.
- Bidirectional cross-enrichment: Trino results include DataHub metadata; DataHub searches include query availability and sample SQL; S3 operations include semantic context; lineage inheritance fills column metadata from upstream datasets.
- OAuth 2.1 inbound: built-in authorization server with PKCE and Dynamic Client Registration so Claude Desktop and any MCP client can sign in directly to the platform.
- OAuth 2.1 outbound to upstream MCPs: client_credentials (M2M) and authorization_code+PKCE (browser sign-in) grants; encrypted refresh tokens persist across restarts.
- OIDC discovery for Keycloak, Auth0, Okta, Azure AD; API keys for service accounts; fail-closed by default.
- Role-based personas: map OIDC roles to personas with allow/deny tool patterns and connection-level filtering. Per-persona description overrides steer the model differently for different audiences.
- Audit logging to PostgreSQL with user, persona, tool, sanitized parameters, duration, enrichment metrics, result hash. Searchable from the admin portal.
- Admin and user portal: connections, personas, API keys, configuration entries, knowledge governance, audit, gateway, plus user-facing assets, collections, prompts, resources, activity.
- Gateway toolkit: re-expose any third-party MCP server through the platform's auth, persona, and audit pipeline. Tools surface as connection__remote_tool. Optional declarative cross-enrichment rules.
- Knowledge capture: tribal knowledge from AI sessions written back to DataHub through human-in-the-loop governance with changeset rollback.
- Memory layer (PostgreSQL + pgvector): persistent memory across sessions with semantic recall, entity recall, graph traversal, and staleness detection.
- Managed resources: human-uploaded reference material with scope-based visibility (global, persona, user), surfaced to AI assistants via MCP resources/list.
- MCP Apps: interactive UI panels rendered inline in the MCP host. Built-in platform-info app, custom apps via assets_path with CSP.
- Two transports: stdio for desktop clients, http with OAuth 2.1 for hosted clients.
- Two operating modes: standalone (no database) for stateless deployments, file + database for full feature set with per-key config overrides.

## Quick Start

```bash
# Install
go install github.com/txn2/mcp-data-platform/cmd/mcp-data-platform@latest

# Minimal config (DataHub only)
cat > platform.yaml <<EOF
server:
  name: mcp-data-platform
  transport: stdio
semantic:
  provider: datahub
  instance: primary
EOF

# Wire to Claude Code
claude mcp add data-platform \
  -e DATAHUB_URL=https://datahub.example.com/api/graphql \
  -e DATAHUB_TOKEN="$TOKEN" \
  -- mcp-data-platform --config platform.yaml
```

## Optional

- [Examples Gallery](https://mcp-data-platform.txn2.com/examples/): Real-world configurations for enterprise governance, data democratization, and AI/ML workflows
- [Troubleshooting](https://mcp-data-platform.txn2.com/support/troubleshooting/): Common issues, error codes, debugging guide
- [Ecosystem](https://mcp-data-platform.txn2.com/ecosystem/): Sister MCP projects (mcp-datahub, mcp-s3, mcp-trino) and how they compose
- [GitHub Repository](https://github.com/txn2/mcp-data-platform): Source code, issues, and releases
- [Security Article](https://imti.co/mcp-defense/): MCP Defense, a case study in AI security
