# mcp-data-platform

> MCP server for AI-assisted data exploration. DataHub semantic layer with optional Trino and S3. Cross-injection automatically enriches query results with business context: owners, tags, quality scores, deprecation warnings. Implements fail-closed security with OIDC/API key authentication, TLS for HTTP transport, and prompt injection protection.

For the security architecture rationale, see: https://imti.co/mcp-defense/

---

# Overview

Your AI assistant can run SQL. But it doesn't know that `cust_id` contains PII, that the table was deprecated last month, or who to ask when something breaks.

mcp-data-platform fixes that. It connects AI assistants to your data infrastructure and adds business context from your semantic layer. Query a table and get its meaning, owners, quality scores, and deprecation warnings in the same response.

The only requirement is DataHub (https://datahubproject.io/). Add Trino (https://trino.io/) for SQL queries and S3 for object storage when you're ready.

## Key Features

- **Semantic-First**: DataHub is the foundation. Query a table, get its business context automatically: owners, tags, quality scores, deprecation warnings. No separate lookups.

- **Cross-Enrichment**: Trino results include DataHub metadata. DataHub searches show which datasets are queryable. Context flows between services automatically.

- **Enterprise Security**: Fail-closed authentication model, TLS enforcement for HTTP transport, prompt injection protection, and read-only mode enforcement.

- **Built for Customization**: Add custom toolkits, providers, and middleware. The Go library exposes everything. Build the data platform your organization needs.

- **Personas**: Define who can use which tools. Analysts get read access. Admins get everything. Map from your identity provider's roles.

- **Resource Templates**: Browse platform data as parameterized MCP resources using RFC 6570 URI templates. Three built-in templates: table schemas (`schema://catalog.schema/table`), glossary terms (`glossary://term`), and data availability (`availability://catalog.schema/table`).

- **Managed Resources**: Human-uploaded reference material (samples, playbooks, templates, references) surfaced directly to AI assistants via MCP `resources/list` and `resources/read`. Three visibility scopes: global, persona, and user. PostgreSQL metadata with S3 blob storage. REST API for CRUD, Admin Portal page for management. Auto-enabled when a database is available.

- **Progress Notifications**: Long-running Trino queries send granular progress updates (executing, formatting, complete) to clients that provide `_meta.progressToken`. Zero overhead when disabled.

- **Client Logging**: Server-to-client log messages for platform decisions (enrichment, timing) via MCP `logging/setLevel` protocol. Zero overhead if the client hasn't opted in.

- **Elicitation**: User confirmation prompts before expensive queries (EXPLAIN IO cost estimation) or PII access (sensitive column detection). Requires client-side elicitation support; gracefully degrades when unavailable.

- **Icons**: Visual metadata for tools, resources, and prompts. Upstream toolkits provide default icons; deployers can override via configuration.

- **Dynamic Prompts**: Three-tier prompt system: auto-registered `platform-overview` built from description and enabled toolkits, operator-configured prompts with `{placeholder}` argument substitution, and conditional workflow prompts (explore-available-data, create-interactive-dashboard, create-a-report, trace-data-lineage) registered when required toolkits are present. Toolkits implement `PromptDescriber` to advertise their own prompts. Operator prompts override auto-registered ones by name. All prompts included in `platform_info` response and the platform-info app Prompts tab.

---

# Ecosystem

mcp-data-platform is the orchestration layer for a broader suite of open-source MCP servers designed to work together as a composable data platform. Each component can run standalone or be combined through mcp-data-platform for unified access with cross-enrichment, authentication, and personas.

## mcp-datahub (https://github.com/txn2/mcp-datahub/)

An MCP server for DataHub, the metadata catalog. Provides AI assistants with dataset search, schema exploration, lineage graphs, glossary terms, domains, tags, and ownership information. In the platform, DataHub serves as the semantic layer: every query result is enriched with business context from DataHub before being returned to the AI assistant.

## mcp-s3 (https://github.com/txn2/mcp-s3/)

An MCP server for Amazon S3, providing AI assistants with direct access to object storage. List buckets, browse prefixes, read objects, and generate presigned URLs. Supports multi-server configurations for accessing storage across accounts and regions.

## mcp-trino (https://github.com/txn2/mcp-trino/)

An MCP server for Trino, the distributed SQL query engine. Run read-only SQL queries across any data source Trino connects to, including data lakes, warehouses, and relational databases. List catalogs and schemas, describe tables, explain query plans, and execute analytical queries with configurable timeouts and row limits.

---

# The Data Stack: DataHub + Trino + S3

Modern data platforms need three things: meaning (what does the data represent?), access (how do I query it?), and storage (where does it live?). mcp-data-platform uses DataHub for meaning, Trino for access, and S3 for storage.

## DataHub: The Semantic Layer

DataHub (https://datahubproject.io/) is an open source data catalog from LinkedIn. It stores business context: descriptions, owners, tags, glossary terms, lineage, and quality scores.

**Problem it solves**: Data exists everywhere, but understanding what it means requires tribal knowledge. Column `cid` in one system is `customer_id` in another. That deprecated table still gets queried because nobody knows it's deprecated.

**Why DataHub**: Active community (10k+ GitHub stars), rich GraphQL/REST APIs, 50+ integrations (Trino, Snowflake, dbt, Airflow), real-time ingestion, full-text search, lineage tracking.

**For AI**: Without DataHub, an AI sees columns and types. With DataHub, it sees what the data means, who owns it, and whether it's reliable.

## Trino: Universal SQL Access

Trino (https://trino.io/, formerly PrestoSQL from Facebook) is a distributed SQL query engine. It runs SQL against data where it lives, without moving it first.

**Trino connects to everything**: PostgreSQL, MySQL, Oracle, Snowflake, BigQuery, Elasticsearch, MongoDB, S3, HDFS, Delta Lake, Iceberg, Kafka, and more.

**One SQL dialect**: Query across PostgreSQL, Elasticsearch, and S3 in one statement. The AI doesn't need to know where data lives. It's all SQL.

**Why Trino**: Battle-tested at Meta, Netflix, Uber, LinkedIn. Cost-based optimizer, standard ANSI SQL, federated queries without data movement.

## S3: The Universal Data Lake

S3 (Amazon S3, MinIO, or any S3-compatible service) is object storage for any data: files, Parquet, JSON, logs, ML models.

**Problem it solves**: Not all data is structured or in databases. Data platforms must handle structured (tables), semi-structured (JSON, Parquet), and unstructured (PDFs, images) data.

**Why S3**: Infinite scale at low cost, any file format, direct querying via Trino, object versioning, S3-compatible (AWS, MinIO, Ceph).

## Cross-enrichment fills the gaps

| Component | Answers | Limitation Alone |
|-----------|---------|------------------|
| DataHub | "What does this mean?" | Can't query the data |
| Trino | "What's in this table?" | Doesn't know business context |
| S3 | "What files exist?" | Just storage, no meaning |

Cross-enrichment wires them together:
- Trino + DataHub: Query a table → Get schema + owners + tags + deprecation + quality
- DataHub + Trino: Search DataHub → See which datasets are queryable with sample SQL
- S3 + DataHub: List objects → Get matching metadata and ownership

This stack is built for OLAP: S3 stores petabytes at low cost, Trino runs analytical queries across that data, DataHub adds business context.

---

## What's Included

| Toolkit | Tools | Required |
|---------|-------|----------|
| DataHub | 11 tools | Yes |
| Trino | 7 tools | No |
| S3 | 6-9 tools | No |
| Knowledge | 1-2 tools | No |
| Memory | 2 tools | No |

---

# Installation

## Prerequisites

- Go 1.24+ (for building from source)
- An MCP-compatible client (Claude Desktop, Claude Code, or custom)
- Access to Trino, DataHub, and/or S3 services you want to connect

## Installation Methods

### Go Install

```bash
go install github.com/txn2/mcp-data-platform/cmd/mcp-data-platform@latest
```

### Homebrew (macOS)

```bash
brew install txn2/tap/mcp-data-platform
```

### Docker

```bash
docker pull ghcr.io/txn2/mcp-data-platform:latest
docker run -v /path/to/platform.yaml:/etc/mcp/platform.yaml ghcr.io/txn2/mcp-data-platform:latest --config /etc/mcp/platform.yaml
```

## Client Setup

### Claude Code

```bash
claude mcp add mcp-data-platform -- mcp-data-platform --config /path/to/platform.yaml
```

### Claude Desktop

Add to `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS):

```json
{
  "mcpServers": {
    "mcp-data-platform": {
      "command": "mcp-data-platform",
      "args": ["--config", "/path/to/platform.yaml"],
      "env": {
        "DATAHUB_TOKEN": "your-token"
      }
    }
  }
}
```

## Command Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `--config` | Path to YAML configuration file | None |
| `--transport` | Transport protocol: `stdio` or `http` | `stdio` |
| `--address` | Listen address for HTTP transports | `:8080` |

---

# Configuration

Configuration uses YAML with environment variable expansion (`${VAR_NAME}`).

## Config Versioning

Every configuration file should include an `apiVersion` field as the first key. This enables safe schema evolution with deprecation warnings and migration tooling.

```yaml
apiVersion: v1

server:
  name: mcp-data-platform
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `apiVersion` | string | `v1` | Config schema version. Omitting defaults to `v1` for backward compatibility. |

**Supported versions**: `v1` (current)

### Migration Tool

Migrate config files to the latest version:

```bash
# From file to stdout
mcp-data-platform migrate-config --config platform.yaml

# From stdin to file
cat platform.yaml | mcp-data-platform migrate-config --output migrated.yaml

# Specify target version
mcp-data-platform migrate-config --config platform.yaml --target-version v1
```

The migration tool preserves comments and `${VAR}` environment variable references.

### Version Lifecycle

- **current**: Actively supported, no warnings
- **deprecated**: Still works, emits a warning at startup with migration guidance
- **removed**: Rejected at startup with an error pointing to the migration tool

## Minimal Configuration

```yaml
server:
  name: mcp-data-platform
  transport: stdio

toolkits:
  datahub:
    primary:
      url: https://datahub.example.com
      token: ${DATAHUB_TOKEN}

  trino:
    primary:
      host: trino.example.com
      port: 443
      user: ${TRINO_USER}
      password: ${TRINO_PASSWORD}
      ssl: true
      catalog: hive

injection:
  trino_semantic_enrichment: true
  datahub_query_enrichment: true
  unwrap_json: true                # Auto-unwrap single-row VARCHAR-of-JSON (default: true)
  column_context_filtering: true   # Only enrich columns referenced in SQL (default: true)
```

## Server Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `server.name` | string | `mcp-data-platform` | Platform identity (e.g., "ACME Corp Data Platform") - helps agents identify which business this MCP serves |
| `server.description` | string | - | Explains when to use this MCP - what business, products, or domains it covers. Agents use this to route questions to the right MCP server. |
| `server.tags` | array | `[]` | Keywords for discovery: company names, product names, business domains. Agents match these against user questions. |
| `server.agent_instructions` | string | - | Operational guidance: data conventions, required filters, unit conversions. Returned in `platform_info` response. |
| `server.prompts` | array | `[]` | Platform-level MCP prompts registered via `prompts/list`. Support `{arg_name}` placeholder substitution. Operator-defined prompts override auto-registered workflow prompts with the same name. |
| `server.prompts[].arguments` | array | `[]` | Typed arguments: name, description, required. Substituted into content as `{name}` placeholders. |
| `server.transport` | string | `stdio` | Transport: `stdio` or `http` (`sse` accepted for backward compatibility) |
| `server.address` | string | `:8080` | Listen address for HTTP transports |
| `server.streamable.session_timeout` | duration | `30m` | How long an idle Streamable HTTP session persists before cleanup |
| `server.streamable.stateless` | bool | `false` | Disable session tracking (no `Mcp-Session-Id` validation) |

## Database Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `database.dsn` | string | - | PostgreSQL connection string |
| `database.max_open_conns` | int | 25 | Maximum open database connections |

Setting `dsn` enables audit logging, knowledge capture, session externalization, and OAuth persistence. Without it, these features degrade to in-memory or noop implementations.

## Config Store

When a database is available (`database.dsn` is set), individual config entries in the `config_entries` table override file defaults for whitelisted keys. Changes via the admin API take effect immediately (hot-reload) without restart. Deleting a database entry restores the file default.

**Whitelisted keys (phase 1):** `server.description`, `server.agent_instructions`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `config_store.mode` | string | `file` | **Deprecated.** Ignored at runtime. Accepted for backward compatibility. |

## Tool Visibility Configuration

Reduce LLM token usage by hiding tools from `tools/list` responses. This is a visibility optimization, not a security boundary — persona-level tool filtering continues to gate `tools/call`.

```yaml
tools:
  allow:
    - "trino_*"
    - "datahub_*"
  deny:
    - "*_delete_*"
  description_overrides:
    trino_query: "Custom description for trino_query..."
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `tools.allow` | array | `[]` | Tool name patterns to include in `tools/list` |
| `tools.deny` | array | `[]` | Tool name patterns to exclude from `tools/list` |
| `tools.description_overrides` | map | `{}` | Override tool descriptions in `tools/list` responses. Config overrides take precedence over built-in defaults. |

No patterns configured means all tools are visible. When both are set, allow is evaluated first, then deny removes from the result. Patterns use `filepath.Match` syntax (`*` matches any sequence of characters).

### Built-in Description Overrides

The platform ships with built-in description overrides for `trino_query` and `trino_execute` that instruct agents to call `datahub_search` first for business context. These are always active and require no configuration. Use `tools.description_overrides` to customize or add more overrides; config entries take precedence over built-in defaults.

## Workflow Gating Configuration

Session-aware enforcement that agents call DataHub discovery tools before running Trino queries. When enabled, query results are annotated with warnings if no discovery has occurred in the current session. Warnings escalate after a configurable number of repeated violations.

```yaml
workflow:
  require_discovery_before_query: true
  # discovery_tools: []           # Defaults to all datahub_* tools
  # query_tools: []               # Defaults to trino_query, trino_execute
  # warning_message: ""           # Custom warning (default: built-in REQUIRED message)
  escalation:
    after_warnings: 3             # Switch to escalated message after N warnings
    # escalation_message: ""      # Custom escalation (use {count} for warning number)
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `workflow.require_discovery_before_query` | bool | `false` | Enable session-aware workflow gating |
| `workflow.discovery_tools` | array | all `datahub_*` tools | Tool names that satisfy the discovery requirement |
| `workflow.query_tools` | array | `trino_query`, `trino_execute` | Tool names that trigger the discovery check |
| `workflow.warning_message` | string | built-in message | Custom warning prepended to query results |
| `workflow.escalation.after_warnings` | int | `3` | Number of warnings before escalation |
| `workflow.escalation.escalation_message` | string | built-in message | Custom escalation message (`{count}` placeholder for warning number) |

When `workflow.require_discovery_before_query` is disabled (default), the platform falls back to the static `tuning.rules.require_datahub_check` hint which fires on every query. The session-aware path is strictly better: it only warns when discovery hasn't happened and resets after any discovery tool call.

## Portal Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `portal.enabled` | bool | `false` | Enable portal toolkit and save_artifact/manage_artifact tools |
| `portal.s3_connection` | string | - | Name of the S3 toolkit instance for artifact storage |
| `portal.s3_bucket` | string | `portal-assets` | S3 bucket for artifact content |
| `portal.s3_prefix` | string | `artifacts/` | Key prefix within the bucket |
| `portal.public_base_url` | string | `""` | Base URL for portal links in save_artifact responses |
| `portal.max_content_size` | int | `10485760` | Maximum artifact size in bytes (10 MB) |
| `portal.implementor.name` | string | `""` | Implementor display name shown in the left zone of the public viewer header |
| `portal.implementor.logo` | string | `""` | URL to implementor SVG logo (fetched once at startup, max 1 MB) |
| `portal.implementor.url` | string | `""` | Clickable link wrapping the implementor name and logo |
| `portal.export.enabled` | bool | auto | Enable trino_export tool (auto-enabled when portal + trino configured) |
| `portal.export.max_rows` | int | `100000` | Hard row cap for exports |
| `portal.export.max_bytes` | int | `104857600` | Hard byte cap for formatted output (100 MB) |
| `portal.export.default_timeout` | string | `"5m"` | Default query timeout for exports |
| `portal.export.max_timeout` | string | `"10m"` | Maximum allowed query timeout for exports |

Portal requires `database.dsn` for metadata storage and at least one S3 toolkit instance for artifact content.

## Admin API Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `admin.enabled` | bool | `false` | Enable admin REST API |
| `admin.persona` | string | `admin` | Persona required for admin access |
| `admin.path_prefix` | string | `/api/v1/admin` | URL prefix for admin endpoints |

Branding fields have moved to the `portal:` section:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `portal.enabled` | bool | `false` | Enable portal SPA frontend and artifact API |
| `portal.title` | string | `MCP Data Platform` | Sidebar/branding title text |
| `portal.logo` | string | `""` | Logo URL (fallback for both themes) |
| `portal.logo_light` | string | `""` | Logo URL for light theme |
| `portal.logo_dark` | string | `""` | Logo URL for dark theme |
| `portal.implementor.name` | string | `""` | Implementor display name (left zone of public viewer header) |
| `portal.implementor.logo` | string | `""` | URL to implementor SVG logo (fetched at startup, max 1 MB) |
| `portal.implementor.url` | string | `""` | Clickable link wrapping implementor name and logo |

When `portal.enabled: true`, an interactive web dashboard is served at `/portal/`. It provides audit log exploration, tool execution testing, and system monitoring. The sidebar displays a configurable logo and title. Theme-specific logos are resolved as: light theme uses `logo_light` → `logo` → built-in default; dark theme uses `logo_dark` → `logo` → built-in default. The resolved logo is also used as the browser favicon. Shared artifact links (`/portal/view/{token}`) display a two-zone header: the right zone shows the platform brand, and the optional left zone shows the implementor brand configured via `portal.implementor`.

The public viewer supports light/dark mode (system default with toggle, persisted to localStorage), an expiration countdown notice showing relative time until share expiry, and a configurable per-share notice text. The `notice_text` field defaults to "Proprietary & Confidential. Only share with authorized viewers." — set it to a custom string for different text, or to `""` (empty string) to hide the notice entirely. When creating shares, set `hide_expiration: true` to hide the expiration countdown from the viewer.

## Resource Templates Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `resources.enabled` | bool | `false` | Enable MCP resource templates |

When enabled, three RFC 6570 URI templates are registered:
- `schema://{catalog}.{schema_name}/{table}` — Table schema with semantic context
- `glossary://{term}` — Glossary term definition and related assets
- `availability://{catalog}.{schema_name}/{table}` — Data availability status and row count

## Custom Resources Configuration

Custom resources expose arbitrary static content as named MCP resources. Registered whenever `resources.custom` is non-empty, independent of `resources.enabled`. Content can be inline (`content`) or read from a file on every request (`content_file`, supports hot-reload).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `resources.custom[].uri` | string | — | Unique resource URI, e.g. `brand://theme` (required) |
| `resources.custom[].name` | string | — | Display name shown in `resources/list` (required) |
| `resources.custom[].description` | string | `""` | Optional description for MCP clients |
| `resources.custom[].mime_type` | string | — | MIME type, e.g. `application/json`, `image/svg+xml` (required) |
| `resources.custom[].content` | string | — | Inline text/JSON/SVG content; mutually exclusive with `content_file` |
| `resources.custom[].content_file` | string | — | Absolute file path; read on every `resources/read` request |

Example — inline JSON brand theme and file-backed SVG logo:

```yaml
resources:
  custom:
    - uri: "brand://theme"
      name: "Brand Theme"
      mime_type: "application/json"
      content: '{"colors":{"primary":"#FF6B35"},"url":"https://example.com"}'
    - uri: "brand://logo"
      name: "Brand Logo"
      mime_type: "image/svg+xml"
      content_file: "/etc/platform/logo.svg"
```

Invalid entries (missing URI, name, or mime_type; both or neither content fields) are skipped with a warning; valid entries in the same list are still registered.

## Managed Resources Configuration

Managed resources are human-uploaded reference files (samples, playbooks, templates, references) stored in S3 with metadata in PostgreSQL. They are surfaced to AI assistants via the standard MCP `resources/list` and `resources/read` protocol, and managed through a REST API and the Admin Portal.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `resources.managed.enabled` | bool* | auto | Enable managed resources. nil = auto (enabled when database is available) |
| `resources.managed.uri_scheme` | string | `"mcp"` | URI scheme prefix for resource URIs |
| `resources.managed.s3_connection` | string | `""` | Name of the S3 toolkit instance used for blob storage |
| `resources.managed.s3_bucket` | string | `managed-resources` | S3 bucket for resource file blobs |

Example:

```yaml
resources:
  managed:
    enabled: true
    uri_scheme: "mcp"
    s3_connection: "primary"
    s3_bucket: "platform-resources"
```

### Scope Model

Resources are assigned one of three visibility scopes:

| Scope | Visibility | scope_id | Example URI |
|-------|-----------|----------|-------------|
| `global` | All authenticated users | (empty) | `mcp://global/templates/query-patterns.sql` |
| `persona` | Users operating under the named persona | persona name | `mcp://persona/analyst/playbooks/data-quality-checklist.md` |
| `user` | Only the owning user | user subject ID | `mcp://user/abc123/references/my-notes.txt` |

### URI Scheme

Resource URIs follow the pattern: `{scheme}://{scope}/{scope_id?}/{category}/{filename}`

- Global: `mcp://global/{category}/{filename}`
- Persona: `mcp://persona/{persona_name}/{category}/{filename}`
- User: `mcp://user/{user_sub}/{category}/{filename}`

### Categories

Resources must be assigned a category (lowercase alphanumeric with hyphens, max 31 chars). The portal UI offers four built-in categories: `samples`, `playbooks`, `templates`, `references`.

### Permission Model

| Action | Global | Persona | User |
|--------|--------|---------|------|
| **Read** | All authenticated users | Users with matching persona | Owner only |
| **Write (create)** | Platform admins | Platform admins or persona admins (`persona-admin:{name}`) | Any authenticated user (own scope) |
| **Modify/Delete** | Original uploader or platform admin | Original uploader, platform admin, or persona admin | Original uploader or platform admin |

Platform admin roles: `admin`, `platform-admin`.

### REST API Endpoints

All endpoints are mounted at `/api/v1/resources` and require authentication (Bearer token or browser session).

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/resources` | Upload a new resource (multipart form) |
| `GET` | `/api/v1/resources` | List resources visible to the caller |
| `GET` | `/api/v1/resources/{id}` | Get resource metadata by ID |
| `GET` | `/api/v1/resources/{id}/content` | Download resource file content |
| `PATCH` | `/api/v1/resources/{id}` | Update resource metadata (display_name, description, tags, category) |
| `DELETE` | `/api/v1/resources/{id}` | Delete resource (metadata and S3 blob) |

List supports query parameters: `scope`, `scope_id`, `category`, `tag`, `q` (text search), `offset`.

Upload (POST) accepts multipart form fields: `file` (required), `scope`, `scope_id`, `category`, `display_name`, `description`, `tags`. Maximum upload size: 100 MB. Executable file extensions and MIME types are blocked.

### MCP Protocol Integration

When managed resources are enabled, the `MCPManagedResourceMiddleware` intercepts MCP `resources/list` and `resources/read` requests:

- **resources/list**: Appends managed resources (filtered by caller's visible scopes) to the SDK's static resource list.
- **resources/read**: For URIs matching the configured scheme (default: `mcp://`), looks up the resource in the database, checks read permission, and returns the file content from S3. Text types under 1 MB are returned inline; larger or binary content is returned as a blob. Falls through to the SDK handler for non-matching URIs.

MCP resource capabilities are advertised whenever resource templates, custom resources, or managed resources are enabled.

### Portal UI

The Admin Portal includes a Resources page for uploading, browsing, editing, and deleting managed resources. Users can filter by scope, category, and tags, search by name or description, upload files via drag-and-drop, and edit metadata inline.

## Progress Notifications Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `progress.enabled` | bool | `false` | Enable progress notifications for Trino queries |

When enabled, Trino query tools send three progress notifications per query: before execution, after query returns, and after formatting. Clients must include `_meta.progressToken` in their tool call to receive notifications.

## Client Logging Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `client_logging.enabled` | bool | `false` | Enable server-to-client log messages |

When enabled, the platform sends log notifications to clients after enrichment is applied (tool name, duration). Uses MCP `logging/setLevel` protocol — zero overhead if the client hasn't called `setLevel`.

## Elicitation Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `elicitation.enabled` | bool | `false` | Enable user confirmation prompts |
| `elicitation.cost_estimation.enabled` | bool | `false` | Prompt when EXPLAIN IO estimates exceed the row threshold |
| `elicitation.cost_estimation.row_threshold` | int | `1000000` | Row estimate threshold for cost prompts |
| `elicitation.pii_consent.enabled` | bool | `false` | Prompt when query accesses PII-tagged columns |

Elicitation requires client-side support (MCP `elicitation/create` capability). When the client doesn't support it, elicitation gracefully degrades to a no-op. If a user declines the confirmation, the tool returns an informational message instead of executing the query.

## Icons Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `icons.enabled` | bool | `false` | Enable icon enrichment middleware |
| `icons.tools.<name>.src` | string | - | Icon URI for a tool |
| `icons.tools.<name>.mime_type` | string | - | Icon MIME type |
| `icons.resources.<uri>.src` | string | - | Icon URI for a resource template |
| `icons.prompts.<name>.src` | string | - | Icon URI for a prompt |

Upstream toolkits (Trino, DataHub, S3) provide default icons on all tools. This configuration overrides or adds icons for tools, resource templates, and prompts in list responses.

## Audit Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `audit.enabled` | bool | `false` | Enable audit logging |
| `audit.log_tool_calls` | bool | `false` | Log MCP tool call events |
| `audit.retention_days` | int | 90 | Days to retain audit events |

Requires `database.dsn`. Both `enabled` and `log_tool_calls` must be `true` for tool call events to be recorded.

## Session Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `sessions.store` | string | `memory` | Backend: `memory` or `database` |
| `sessions.ttl` | duration | streamable session_timeout | Session lifetime |
| `sessions.idle_timeout` | duration | streamable session_timeout | Idle eviction threshold |
| `sessions.cleanup_interval` | duration | `1m` | Cleanup routine interval |

The `database` store requires `database.dsn`. Database-backed sessions survive restarts and support multi-replica deployments.

## Toolkit Configuration

### Trino

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `host` | string | required | Trino coordinator hostname |
| `port` | int | 8080/443 | Trino coordinator port |
| `user` | string | required | Trino username |
| `password` | string | - | Trino password |
| `catalog` | string | - | Default catalog |
| `schema` | string | - | Default schema |
| `ssl` | bool | false | Enable SSL/TLS |
| `timeout` | duration | 120s | Query timeout |
| `default_limit` | int | 1000 | Default row limit |
| `max_limit` | int | 10000 | Maximum row limit |
| `read_only` | bool | false | Restrict to read-only queries |
| `descriptions` | map | {} | Override tool descriptions (key: tool name, value: description text) |

### DataHub

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `url` | string | required | DataHub GMS URL |
| `token` | string | - | DataHub access token |
| `timeout` | duration | 30s | API request timeout |
| `default_limit` | int | 10 | Default search limit |
| `max_limit` | int | 100 | Maximum search limit |
| `descriptions` | map | {} | Override tool descriptions (key: tool name, value: description text) |

### S3

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `region` | string | us-east-1 | AWS region |
| `endpoint` | string | - | Custom S3 endpoint (for MinIO) |
| `access_key_id` | string | - | AWS access key ID |
| `secret_access_key` | string | - | AWS secret access key |
| `read_only` | bool | false | Restrict to read operations |

## Cross-Enrichment Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `trino_semantic_enrichment` | bool | false | Add DataHub context to Trino results |
| `datahub_query_enrichment` | bool | false | Add Trino availability to DataHub results |
| `s3_semantic_enrichment` | bool | false | Add DataHub context to S3 results |
| `unwrap_json` | bool | true | Auto-unwrap single-row VARCHAR-of-JSON results (e.g. from raw_query) |
| `column_context_filtering` | bool | true | Limit column enrichment to SQL-referenced columns |
| `search_schema_preview` | bool | true | Add bounded column-name+type preview to datahub_search query_context |
| `schema_preview_max_columns` | int | 15 | Max columns per entity in schema preview |

## URN Mapping Configuration

When Trino catalog or platform names differ from DataHub metadata, configure bidirectional URN mapping:

```yaml
semantic:
  provider: datahub
  instance: primary
  urn_mapping:
    platform: postgres           # DataHub platform (e.g., postgres, mysql, trino)
    catalog_mapping:
      rdbms: warehouse           # Trino "rdbms" → DataHub "warehouse"
      iceberg: datalake          # Trino "iceberg" → DataHub "datalake"

query:
  provider: trino
  instance: primary
  urn_mapping:
    catalog_mapping:
      warehouse: rdbms           # DataHub "warehouse" → Trino "rdbms"
      datalake: iceberg          # DataHub "datalake" → Trino "iceberg"
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `semantic.urn_mapping.platform` | string | trino | Platform name in DataHub URNs |
| `semantic.urn_mapping.catalog_mapping` | map | {} | Map Trino catalogs to DataHub catalogs |
| `query.urn_mapping.catalog_mapping` | map | {} | Map DataHub catalogs to Trino catalogs (reverse) |

This translates URNs in both directions:
- **Trino → DataHub**: `rdbms.public.users` becomes `urn:li:dataset:(urn:li:dataPlatform:postgres,warehouse.public.users,PROD)`
- **DataHub → Trino**: URN with `warehouse.public.users` resolves to Trino table `rdbms.public.users`

---

# Authentication & Security

mcp-data-platform implements a **fail-closed** security model. Missing or invalid credentials deny access, never bypass. For the security architecture rationale, see: https://imti.co/mcp-defense/

## Transport Security

| Transport | Authentication | TLS | Why |
|-----------|---------------|-----|-----|
| stdio | Not needed | N/A | Local execution with your own credentials |
| HTTP | **Required** | Recommended | Shared server needs to identify users |

## Security Model

- **Required JWT Claims**: Tokens must include `sub` (subject) and `exp` (expiration)
- **Default-Deny Personas**: Users without explicit persona have no tool access
- **Prompt Injection Protection**: DataHub metadata is sanitized before exposure
- **Read-Only Enforcement**: Trino `read_only: true` blocks write queries at query level
- **Cryptographic Request IDs**: Secure random identifiers for audit trails

## stdio Transport (Local)

No MCP authentication required. The server runs on your machine using credentials you configured.

## HTTP Transport (Remote/Shared)

Authentication identifies who is making requests. Anonymous access is disabled by default.

- **OIDC**: Human users via Keycloak, Auth0, Okta
- **API Keys**: Service accounts, automation
- **OAuth 2.1 (inbound)**: Claude Desktop authentication via upstream IdP — see "OAuth 2.1 for Claude Desktop" below
- **OAuth 2.1 (outbound)**: Gateway connections to upstream MCPs — `client_credentials` (M2M) and `authorization_code` + PKCE (browser sign-in). Encrypted refresh tokens persist across restarts. See "OAuth to Upstream MCPs" later in this document.

### OAuth 2.1 for Claude Desktop

For Claude Desktop connecting to a remote MCP server, the built-in OAuth 2.1 server bridges authentication:

1. Claude Desktop calls `/oauth/authorize`
2. MCP server redirects to Keycloak
3. User logs in to Keycloak
4. Keycloak redirects back to MCP server's `/oauth/callback`
5. MCP server exchanges code with Keycloak, extracts user info
6. MCP server issues its own token to Claude Desktop

Configuration:
```yaml
oauth:
  enabled: true
  issuer: "https://mcp.example.com"
  signing_key: "${OAUTH_SIGNING_KEY}"  # Required: openssl rand -base64 32
  clients:
    - id: "claude-desktop"
      secret: "${CLAUDE_CLIENT_SECRET}"
      redirect_uris:
        - "http://localhost"
        - "http://127.0.0.1"
  upstream:
    issuer: "https://keycloak.example.com/realms/your-realm"
    client_id: "mcp-data-platform"
    client_secret: "${KEYCLOAK_CLIENT_SECRET}"
    redirect_uri: "https://mcp.example.com/oauth/callback"
```

## Browser Sessions (OIDC Login for Portal UI)

When `auth.oidc` and `auth.browser_session` are both enabled, the portal UI offers SSO login. The flow uses OIDC authorization code with PKCE. Sessions are stored as HMAC-SHA256 signed JWT cookies (stateless, no server-side store).

```yaml
auth:
  oidc:
    enabled: true
    issuer: "https://auth.example.com/realms/platform"
    client_id: "mcp-data-platform"
    client_secret: "${OIDC_CLIENT_SECRET}"
    scopes: [openid, profile, email]
  browser_session:
    enabled: true
    signing_key: "${SESSION_SIGNING_KEY}"  # openssl rand -base64 32
    ttl: 8h
    secure: true
```

Endpoints: `/portal/auth/login` (initiate; accepts optional `return_to` site-relative path), `/portal/auth/callback` (process; redirects to `return_to` when set, otherwise `PostLoginRedirect`), `/portal/auth/logout` (clear). API key fallback always available. MCP clients unaffected.

Limitations: no individual session revocation, no key rotation support (rotating invalidates all sessions), no session refresh (re-auth required after TTL).

---

# Audit Logging

Every tool call flows through the audit logging middleware, which records who called what, when, how long it took, and whether it succeeded. Audit logs are stored in PostgreSQL and automatically cleaned up based on a configurable retention period.

## Prerequisites

Audit logging requires:
1. A PostgreSQL database (version 13+)
2. The `database.dsn` configuration set
3. Both `audit.enabled` and `audit.log_tool_calls` set to `true`

## Configuration

```yaml
database:
  dsn: "${DATABASE_DSN}"

audit:
  enabled: true
  log_tool_calls: true
  retention_days: 90
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `database.dsn` | string | - | PostgreSQL connection string. Required for audit logging. |
| `audit.enabled` | bool | `false` | Master switch for audit logging. |
| `audit.log_tool_calls` | bool | `false` | Log every `tools/call` request. Both this and `enabled` must be `true`. |
| `audit.retention_days` | int | `90` | Days to keep audit logs before automatic cleanup. |

## What Gets Logged

Every tool call produces one row in `audit_logs` with these fields:

| Field | Type | Description |
|-------|------|-------------|
| `id` | VARCHAR(32) | Cryptographically random event ID (base64url, 16 bytes). |
| `timestamp` | TIMESTAMPTZ | When the tool call started. |
| `duration_ms` | INTEGER | Wall-clock time in milliseconds. |
| `request_id` | VARCHAR(255) | Request ID from auth middleware. |
| `user_id` | VARCHAR(255) | Authenticated user identity. |
| `user_email` | VARCHAR(255) | User email from OIDC claims. |
| `persona` | VARCHAR(100) | Resolved persona name. |
| `tool_name` | VARCHAR(255) | MCP tool name. |
| `toolkit_kind` | VARCHAR(100) | Toolkit type: `trino`, `datahub`, or `s3`. |
| `toolkit_name` | VARCHAR(100) | Toolkit instance name from config. |
| `connection` | VARCHAR(100) | Connection name from toolkit config. |
| `parameters` | JSONB | Tool call arguments with sensitive values redacted. |
| `success` | BOOLEAN | `true` if tool handler succeeded. |
| `error_message` | TEXT | Error description if `success` is `false`. |
| `created_date` | DATE | Partition key derived from `timestamp`. |

## Parameter Sanitization

Sensitive parameter keys are replaced with `[REDACTED]`: `password`, `secret`, `token`, `api_key`, `authorization`, `credentials`. Matching is case-sensitive and exact.

## Middleware Chain

The middleware sits in a layered execution chain:

```
Icons → DescOverrides → Visibility → Apps Metadata → Auth/Authz → Audit → Rules → Client Logging → Enrichment → Tool Handler
```

For `tools/call` requests: Auth creates the `PlatformContext` (user identity, persona, toolkit metadata) and records tool calls for workflow tracking. Audit reads it to build the audit event. Rules checks session workflow state and prepends warnings when needed. Enrichment adds cross-service context and optional discovery notes. Unauthorized requests are rejected before reaching audit, so only authenticated tool calls appear in audit logs.

For `tools/list` responses: Icons, DescOverrides, and Visibility modify tool metadata and filter the tool list.

Audit events are written **asynchronously** in a goroutine to avoid blocking tool responses. If the database write fails, the error is logged via `slog.Error` but the tool call still succeeds.

## Caller class via `source`

Tools on this platform are reachable through three entry points; the audit row's `source` field records which one was used:

| `source` | Caller |
|----------|--------|
| `mcp` | Real MCP transport (stdio or HTTP/SSE). Agents. |
| `rest` | Gateway REST shim at `POST /api/v1/gateway/{connection}/invoke`. Apache NiFi, cronjobs, integrations, anything HTTP that wraps the platform. |
| `admin` | Admin REST API at `POST /api/v1/admin/tools/call`. Portal-driven tool runs. |

Both REST entry points open an in-memory MCP session against the assembled server and tag the context (`middleware.WithSource`) before the call so the existing audit middleware records the originating class. Filter via `GET /api/v1/admin/audit/events?source=...` or the **All Sources** dropdown in the portal.

## Retention and partition rotation

A maintenance routine runs every 24 hours and performs three ordered steps:

1. **Ensure upcoming partitions**: `CREATE TABLE IF NOT EXISTS audit_logs_YYYY_MM PARTITION OF audit_logs FOR VALUES FROM (...) TO (...)` for the next two months. The current month is intentionally skipped on brownfield deployments so existing rows in `audit_logs_default` do not block the new named partition. Idempotent.
2. **Delete expired rows**: `DELETE FROM audit_logs WHERE timestamp < NOW() - INTERVAL '<retention_days> days'`. PostgreSQL prunes to overlapping partitions.
3. **Drop fully expired partitions**: `DROP TABLE IF EXISTS audit_logs_YYYY_MM` for any monthly partition whose entire date range ends at or before the retention cutoff. Constant-time reclamation versus row-level DELETE.

Step failures are logged and isolated; a partition operation that errors does not skip the retention DELETE on the same tick. The eager pre-tick `EnsureMonthlyPartitions` also runs once at startup. This pattern keeps audit_logs bounded even at high write volume (e.g. Apache NiFi calling the gateway REST shim at order-of-magnitude-per-second).

In multi-replica deployments, the entire maintenance tick (including the eager startup ensure) is guarded by `pg_try_advisory_lock` on a stable key. Every replica still ticks every 24h, but only the pod that wins the lock runs the work; the rest exit silently and retry next tick. This makes the DELETE scan, partition CREATE, and partition DROP run exactly once per tick across the cluster, regardless of replica count.

---

# Observability (Prometheus Metrics)

Audit answers "who did what." Metrics answer "how is the system performing." Phase 1 of the OpenTelemetry instrumentation in mcp-data-platform exposes Prometheus-format metrics on a dedicated `/metrics` HTTP listener, separate from the main MCP/HTTP listener so scrape traffic and tool traffic do not share an auth path.

## Configuration

Metrics are enabled by default. Configuration is environment-only for Phase 1 to keep the surface small. The full surface lands in subsequent phases:

| Variable | Default | Purpose |
|---|---|---|
| `OTEL_METRICS_ENABLED` | `true` | Master switch. Set to `false` to skip MeterProvider construction and not start the listener. |
| `OTEL_METRICS_ADDR` | `:9090` | Bind address for the `/metrics` listener. |

## Exposed metrics

| Name | Type | Labels |
|---|---|---|
| `mcp_tool_calls_total` | counter | `tool`, `toolkit_kind`, `persona`, `status_category` |
| `mcp_tool_call_duration_seconds` | histogram | `tool`, `toolkit_kind`, `persona`, `status_category` |
| `mcp_inflight_tool_calls` | gauge | (none) |
| `apigateway_outbound_total` | counter | `connection`, `http_status_class`, `status_category` |
| `apigateway_outbound_duration_seconds` | histogram | `connection`, `http_status_class`, `status_category` |

Plus the free Go runtime + process metrics (`go_*`, `process_*`).

## Chokepoints

Phase 1 instruments two points that cover every tool call through the platform:

1. **`MCPToolCallMiddleware`** — every tool call (MCP transport, SSE, Streamable HTTP, REST gateway shim, admin tools/call) passes through here. One histogram + one counter + one gauge here covers every toolkit at once.
2. **apigateway transport** — every outbound HTTP call from `api_invoke_endpoint`, `api_export`, and the REST gateway shim is recorded at the `http.RoundTripper` level, so adding a new apigateway code path picks up metrics automatically.

Phase 2 adds OTLP tracing on the same chokepoints. Phase 3 extends to toolkit adapters (Trino, DataHub, S3, OAuth refresh, audit, enrichment).

## Cardinality budget

`status_category` is a closed set: `ok`, `auth_err`, `authz_err`, `validation_err`, `upstream_err`, `internal_err`. `http_status_class` is `2xx` / `3xx` / `4xx` / `5xx` / `other`. High-cardinality fields (user id, email, request id, session id, raw upstream URLs, raw error messages, free-text tool arguments) are NOT recorded as Prometheus labels — they belong on trace spans (Phase 2) and on audit log rows.

Worst-case counter cardinality for `mcp_tool_calls_total` with ~40 tools × 8 toolkit_kinds × 5 personas × 6 status_categories = 9,600 series; for `apigateway_outbound_total` with ~10 connections × 5 status_classes × 6 status_categories = 300 series. Both well within typical Prometheus and managed-observability backend per-metric series budgets.

## What metrics do NOT replace

- **Audit logs** remain the source of truth for "who called what entity, with what result" — compliance and user-level analytics.
- **Application logs** (stderr / structured slog) remain the source of truth for free-text diagnostic detail and stack traces.

Metrics answer "how is the system performing" — they complement, they do not replace.

---

# Cross-Enrichment

When you query a Trino table, you get DataHub context in the response. When you search DataHub, you see which datasets are queryable. No extra calls.

## The Problem It Solves

Without cross-enrichment:
1. Query a table
2. Search DataHub for that table
3. Get entity details for owners and tags
4. Check deprecation status
5. Look up quality score

Five calls to understand one table. With cross-enrichment, step 1 gives you everything.

## What Gets Injected

| When you use | You also get |
|--------------|--------------|
| Trino | DataHub metadata (owners, tags, quality, deprecation) |
| DataHub search | Which datasets are queryable in Trino |
| S3 | DataHub metadata for matching datasets |

## Semantic Context (added to Trino/S3 results)

```json
{
  "semantic_context": {
    "description": "Customer orders with line items and payment info",
    "owners": [{"name": "Data Team", "type": "group"}],
    "tags": ["pii", "financial"],
    "domain": {"name": "Sales"},
    "quality_score": 0.92,
    "deprecation": {
      "deprecated": true,
      "note": "Use orders_v2 instead"
    }
  }
}
```

## Query Context (added to DataHub results)

When `search_schema_preview` is enabled (default), available tables also include a bounded column preview so agents can write SQL without an intermediate `datahub_get_schema` call:

```json
{
  "query_context": {
    "urn:li:dataset:orders": {
      "available": true,
      "query_table": "hive.sales.orders",
      "connection": "production",
      "estimated_rows": 1500000,
      "schema_preview": [
        {"name": "order_id", "type": "integer"},
        {"name": "customer_id", "type": "integer"},
        {"name": "order_date", "type": "date"},
        {"name": "total_amount", "type": "decimal(10,2)"},
        {"name": "status", "type": "varchar"}
      ],
      "total_columns": 42
    }
  }
}
```

Primary key columns are listed first. `total_columns` indicates when the preview is truncated. The preview is omitted (not empty) when schema lookup fails or the table is unavailable.

## Lineage-Aware Column Inheritance

Downstream datasets (Elasticsearch indexes, Kafka topics) often lack documentation even when their upstream sources (Cassandra, PostgreSQL) are well-documented. The platform automatically inherits column metadata from upstream tables via DataHub lineage.

### How It Works

1. Query a table with undocumented columns
2. Platform checks DataHub lineage for upstream sources
3. Matches columns using column-level lineage, name matching, or configured aliases
4. Inherits descriptions, glossary terms, and tags from upstream
5. Returns enriched response with provenance tracking

### What Gets Inherited

| Metadata | Inherited When |
|----------|----------------|
| Descriptions | Target column has no description |
| Glossary Terms | Target column has no glossary terms |
| Tags | Target column has no tags |

### Match Methods

| Method | Description |
|--------|-------------|
| `column_lineage` | DataHub has explicit column-level lineage edges |
| `name_exact` | Column names match exactly |
| `name_transformed` | Names match after applying transforms (strip prefix/suffix) |
| `alias` | Explicit alias configuration bypasses lineage lookup |

### Configuration

```yaml
semantic:
  provider: datahub
  instance: primary

  lineage:
    enabled: true              # Enable lineage inheritance
    max_hops: 2                # Maximum upstream traversal depth (1-5)
    inherit:                   # Metadata types to inherit
      - glossary_terms
      - descriptions
      - tags
    conflict_resolution: nearest   # "nearest" (closest wins), "all" (merge), "skip"
    prefer_column_lineage: true    # Use column-level lineage when available

    # Column transforms for nested JSON paths
    column_transforms:
      - strip_prefix: "rxtxmsg.payload."
      - strip_prefix: "rxtxmsg.header."
      - strip_suffix: "_v2"

    # Explicit aliases when lineage isn't in DataHub
    aliases:
      - source: "cassandra.prod_fuse.system_sale"
        targets:
          - "elasticsearch.default.jakes-sale-*"
          - "elasticsearch.default.pos-sale-*"
        column_mapping:
          "rxtxmsg.payload.initial_net": "initial_net"

    cache_ttl: 10m             # Cache lineage graphs
    timeout: 5s                # Timeout for inheritance operation
```

### Response Format

When lineage inheritance is active, responses include column context with provenance:

```json
{
  "columns": [
    {"name": "rxtxmsg.payload.amount", "type": "DOUBLE"}
  ],
  "semantic_context": {
    "description": "Elasticsearch index for sales data",
    "urn": "urn:li:dataset:elasticsearch.default.jakes-sale-2025"
  },
  "column_context": {
    "rxtxmsg.payload.amount": {
      "description": "Net sale amount before adjustments",
      "glossary_terms": [
        {"urn": "urn:li:glossaryTerm:NetSaleAmount", "name": "Net Sale Amount"}
      ],
      "tags": ["financial"],
      "is_pii": false,
      "inherited_from": {
        "source_dataset": "urn:li:dataset:cassandra.prod_fuse.system_sale",
        "source_column": "initial_net",
        "hops": 1,
        "match_method": "name_transformed"
      }
    }
  },
  "inheritance_sources": [
    "urn:li:dataset:cassandra.prod_fuse.system_sale"
  ]
}
```

### Use Cases

**Elasticsearch indexes**: JSON documents from Cassandra have nested paths like `rxtxmsg.payload.amount`. Configure `strip_prefix: "rxtxmsg.payload."` to match upstream column `amount`.

**Kafka topics**: Event streams derived from source tables. Column-level lineage (if available) provides precise mapping.

**Data lakes**: Parquet files derived from operational databases. Aliases provide explicit mapping when lineage isn't tracked.

## Session Metadata Deduplication

When Trino enrichment is enabled, every tool call targeting a table receives ~2KB of semantic metadata. In a typical session querying the same table multiple times, this wastes LLM context tokens with repeated information.

Session dedup tracks which tables have been enriched per client session. First call gets full metadata; repeat calls within the TTL get reduced content based on the configured mode.

### Configuration

```yaml
injection:
  trino_semantic_enrichment: true
  column_context_filtering: true    # Only include SQL-referenced columns (default)
  session_dedup:
    enabled: true          # Default: true
    mode: reference        # reference (default), summary, none
    entry_ttl: 5m          # Defaults to semantic.cache.ttl
    session_timeout: 30m   # Defaults to server.streamable.session_timeout
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enabled` | bool | `true` | Whether session dedup is active |
| `mode` | string | `reference` | What to send for repeat queries |
| `entry_ttl` | duration | semantic cache TTL | How long a table stays "already sent" |
| `session_timeout` | duration | streamable session timeout | Idle session cleanup |

### Modes

- **`reference`** (default): Repeat calls return `{"metadata_reference": {"tables": [...], "note": "Full semantic metadata was provided earlier..."}}`. Minimal tokens, LLM can refer back.
- **`summary`**: Repeat calls return table-level `semantic_context` without column details. LLM gets a reminder of ownership and tags.
- **`none`**: Repeat calls return raw tool results with no enrichment. Maximum token savings.

### Behavior

- Enabled by default when `trino_semantic_enrichment: true`
- Trino-only (DataHub/S3 enrichment is not deduplicated)
- In-memory state by default; persisted to database session store if `sessions.store: database`
- Each client session has independent dedup state
- SQL parsing tracks multiple tables independently (`JOIN` queries)

## Session Externalization

Externalize MCP session state to PostgreSQL for zero-downtime restarts and horizontal scaling.

### Configuration

```yaml
database:
  dsn: "${DATABASE_URL}"

sessions:
  store: database              # "memory" (default) or "database"
  ttl: 30m                     # session lifetime
  cleanup_interval: 1m         # cleanup routine frequency
```

When `store: database`:
- Platform forces `server.streamable.stateless: true` on the SDK
- Sessions are managed in PostgreSQL via `SessionAwareHandler`
- On shutdown, enrichment dedup state is flushed to the session store
- On startup, dedup state is restored from persisted sessions
- Multiple replicas share the same session store

Session hijack prevention: each session stores a SHA-256 hash of the creation token. Requests with a different token get HTTP 403.

### Server-pushed notifications (`tools/list_changed`)

In stateless streamable HTTP mode (the production shape when `sessions.store: database`), the SDK refuses GET requests for SSE streaming and closes each session at end-of-request. Without intervention this means downstream agents (Claude.ai, Claude Desktop) never receive `notifications/tools/list_changed` — when a gateway upstream re-authenticates, agents still show the old tool list until they disconnect and reconnect.

The platform handles this with a session broadcaster:

- `pkg/session.Broadcaster` is the abstract fan-out interface; the AwareHandler subscribes per-session.
- In-memory implementation for single-replica or no-DB deployments.
- Postgres `LISTEN/NOTIFY` implementation (channel `mcp_notifications`) for multi-replica. Each replica `LISTEN`s once at startup; every replica re-publishes received events to its local SSE subscribers.
- The `AwareHandler.ServeHTTP` GET branch opens a long-lived `text/event-stream` response, subscribes to the broadcaster bound to the request's session ID, and streams every event as a JSON-RPC 2.0 notification (`data: {"jsonrpc":"2.0","method":"notifications/tools/list_changed","params":{}}\n\n`). A 25-second comment-frame heartbeat keeps the stream alive through proxy idle timeouts.
- The gateway toolkit publishes `tools/list_changed` (debounced 50 ms — a longer window than the MCP SDK's own 10 ms internal debounce, chosen to absorb the postgres LISTEN/NOTIFY round-trip cost across replicas) after every aggregate tool-inventory change that registers or removes at least one tool: a connection coming up after re-auth, a connection being removed, the SetTokenStore retry promoting a placeholder. Zero-tool changes (placeholder upstreams, removals on connections with no live tools) are silently skipped.

Broadcaster lifecycle is owned by the platform: `initSessions` builds it (postgres if a DSN is configured, in-memory otherwise) and `Close` shuts it down before the database connection closes. `WireGatewayBroadcaster` plugs it into every gateway toolkit, mirroring `WireGatewayTokenStore`.

If the postgres broadcaster fails at startup (e.g. role missing `LISTEN` privilege), the platform falls back to in-memory and continues — `tools/list_changed` propagation degrades to single-replica scope rather than failing platform startup.

---

# Tools API Reference

## Trino Tools

### trino_query

Execute a read-only SQL query against Trino. Write operations are rejected. Annotated with `ReadOnlyHint: true`.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `query` | string | Yes | - | SQL query to execute (read-only) |
| `limit` | integer | No | 1000 | Maximum rows to return |
| `connection` | string | No | first configured | Trino connection name |

### trino_execute

Execute any SQL against Trino including write operations (INSERT, UPDATE, DELETE, CREATE, DROP). Annotated with `DestructiveHint: true`.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `query` | string | Yes | - | SQL query to execute |
| `limit` | integer | No | 1000 | Maximum rows to return |
| `connection` | string | No | first configured | Trino connection name |

### trino_explain

Get the execution plan for a SQL query.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `query` | string | Yes | - | SQL query to explain |
| `connection` | string | No | first configured | Trino connection name |

### trino_browse

Browse the Trino catalog hierarchy. Omit all parameters to list catalogs. Provide `catalog` to list schemas. Provide `catalog` and `schema` to list tables (with optional `pattern` filter). Parameters: `catalog`, `schema`, `pattern`, `connection` (all optional).

### trino_describe_table

Get table schema and metadata.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `table` | string | Yes | - | Table name (can be `catalog.schema.table`) |
| `connection` | string | No | first configured | Trino connection name |

### trino_export

Export query results directly to a portal asset (CSV, JSON, Markdown, text), bypassing the LLM token budget. Use after validating the query shape with `trino_query`. Only metadata is returned to the agent. Requires portal + trino configured. SQL runs through the read-only interceptor. CSV escaping prevents formula injection. Sensitivity tags inherited from source datasets.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `sql` | string | Yes | - | SQL query (read-only enforced) |
| `format` | string | Yes | - | csv, json, markdown, or text |
| `name` | string | Yes | - | Display name for the asset |
| `connection` | string | No | default | Trino connection name |
| `description` | string | No | - | Asset description |
| `tags` | array | No | [] | Lowercase kebab-case tags (max 50 chars, max 20; `_sys-` reserved) |
| `limit` | integer | No | deployment max | Row limit (subject to deployment cap) |
| `idempotency_key` | string | No | - | Dedup key to prevent duplicate assets on retry |
| `timeout_seconds` | integer | No | deployment default | Query execution timeout |
| `create_public_link` | boolean | No | false | Generate a public share link for automation |

Configuration: `portal.export.enabled` (auto), `portal.export.max_rows` (100000), `portal.export.max_bytes` (100MB), `portal.export.default_timeout` (5m), `portal.export.max_timeout` (10m).

### trino_list_connections

List configured Trino connections. No parameters.

## DataHub Tools

### datahub_search

Search for entities in the catalog.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `query` | string | Yes | - | Search query |
| `type` | string | No | - | Entity type: dataset, dashboard, chart, dataflow |
| `platform` | string | No | - | Platform filter: trino, snowflake, s3 |
| `limit` | integer | No | 10 | Maximum results |
| `connection` | string | No | first configured | DataHub connection name |

### datahub_get_entity

Get detailed entity information.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `urn` | string | Yes | - | Entity URN |
| `connection` | string | No | first configured | DataHub connection name |

### datahub_get_schema

Get dataset schema. Parameters: `urn` (required), `connection` (optional).

### datahub_get_lineage

Get dataset or column-level lineage. Set `level=column` for column-level lineage showing which upstream columns feed each downstream column. Parameters: `urn` (required), `level` (dataset/column), `direction` (UPSTREAM/DOWNSTREAM, dataset only), `depth` (max 5, dataset only), `connection`.

### datahub_get_queries

Get popular queries for a dataset. Parameters: `urn` (required), `limit`, `connection`.

### datahub_get_glossary_term

Get glossary term details. Parameters: `urn` (required), `connection`.

### datahub_browse

Browse the DataHub catalog by category. Set `what=tags` to list tags, `what=domains` to list data domains, or `what=data_products` to list data products. Parameters: `what` (required: tags/domains/data_products), `filter` (optional, tags only), `connection`.

### datahub_get_data_product

Get data product details. Parameters: `urn` (required), `connection`.

### datahub_create

Create a new entity in DataHub. Uses `what` discriminator to select entity type: tag, domain, glossary_term, data_product, document (1.4.x+), application, query, incident, structured_property, data_contract. Parameters: `what` (required), `name` (required for most types), additional fields vary by type, `connection`. Only available when `read_only: false`.

### datahub_update

Update metadata on a DataHub entity. Uses `what` discriminator: description, column_description, tag (add/remove), glossary_term (add/remove), link (add/remove), owner (add/remove), domain (set/remove), structured_properties (set/remove), structured_property, incident_status, incident, query, document_contents (1.4.x+), document_status (1.4.x+), document_related_entities (1.4.x+), document_sub_type (1.4.x+), data_contract. Parameters: `what` (required), `urn` (varies), `action` (required for tag/glossary_term/link/owner), `connection`. Only available when `read_only: false`.

### datahub_delete

Delete an entity from DataHub. Uses `what` discriminator: query, tag, domain, glossary_entity, data_product, application, document, structured_property. Parameters: `what` (required), `urn` (required), `connection`. Only available when `read_only: false`.

### datahub_list_connections

List configured DataHub connections. No parameters.

## S3 Tools

### s3_list_buckets

List available S3 buckets. Parameters: `connection` (optional).

### s3_list_objects

List objects in a bucket.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `bucket` | string | Yes | - | Bucket name |
| `prefix` | string | No | - | Key prefix filter |
| `delimiter` | string | No | - | Delimiter for hierarchy |
| `max_keys` | integer | No | 1000 | Maximum objects to return |
| `connection` | string | No | first configured | S3 connection name |

### s3_get_object

Get object contents. Parameters: `bucket`, `key` (required), `connection`.

### s3_get_object_metadata

Get object metadata without downloading content. Parameters: `bucket`, `key` (required), `connection`.

### s3_presign_url

Generate a pre-signed URL. Parameters: `bucket`, `key` (required), `expires`, `connection`.

### s3_list_connections

List configured S3 connections. No parameters.

### s3_put_object

Upload an object. Only available when `read_only: false`. Parameters: `bucket`, `key`, `content` (required), `content_type`, `connection`.

### s3_delete_object

Delete an object. Only available when `read_only: false`. Parameters: `bucket`, `key` (required), `connection`.

### s3_copy_object

Copy an object. Only available when `read_only: false`. Parameters: `source_bucket`, `source_key`, `dest_bucket`, `dest_key` (required), `connection`.

## Knowledge Tools

### capture_insight

Record domain knowledge shared during a session.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `category` | string | Yes | - | correction, business_context, data_quality, usage_guidance, relationship, enhancement |
| `insight_text` | string | Yes | - | Knowledge to record (10-4000 chars) |
| `confidence` | string | No | medium | high, medium, low |
| `source` | string | No | user | user, agent_discovery, enrichment_gap |
| `entity_urns` | array | No | [] | Related DataHub entity URNs (max 10) |
| `related_columns` | array | No | [] | Related columns (max 20) |
| `suggested_actions` | array | No | [] | Proposed catalog changes (max 5) |

### apply_knowledge

Review, synthesize, and apply captured insights to the data catalog. Admin-only. Requires `knowledge.apply.enabled: true`.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `action` | string | Yes | bulk_review, review, synthesize, apply, approve, reject |
| `entity_urn` | string | Conditional | Required for review, synthesize, apply |
| `insight_ids` | array | Conditional | Required for approve, reject; optional for synthesize, apply |
| `changes` | array | Conditional | Required for apply |
| `confirm` | bool | No | Required when `require_confirmation` is true |
| `review_notes` | string | No | Notes for approve/reject actions |

## Memory Tools

Persistent memory for agent and analyst sessions. Requires `memory.enabled: true`. Tools are opt-in per persona (`memory_*` in `tools.allow`).

### memory_manage

CRUD operations for memory records. Commands: `remember` (create with optional embedding), `update`, `forget` (soft-delete), `list` (persona-scoped), `review_stale` (admin review of stale memories).

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `command` | string | No | remember, update, forget, list, review_stale |
| `content` | string | For remember | Memory content (10-4000 chars) |
| `id` | string | For update/forget | Memory record ID |
| `dimension` | string | No | LOCOMO: knowledge, event, entity, relationship, preference |
| `category` | string | No | correction, business_context, data_quality, usage_guidance, relationship, enhancement, general |
| `confidence` | string | No | high, medium, low |
| `entity_urns` | array | No | DataHub URNs (max 10) |
| `metadata` | object | No | Arbitrary metadata |

### memory_recall

Multi-strategy retrieval. Strategies: `entity` (URN lookup), `semantic` (vector similarity), `graph` (lineage traversal), `auto` (combined).

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `query` | string | Yes | Natural language query |
| `strategy` | string | No | entity, semantic, graph, auto (default: auto) |
| `entity_urns` | array | No | DataHub URNs for entity/graph |
| `include_stale` | bool | No | Include stale memories |
| `limit` | int | No | Max results (default 10, max 50) |

## Portal Tools

The portal toolkit persists AI-generated artifacts (JSX dashboards, HTML reports, SVG charts) to S3 with PostgreSQL metadata. Requires `portal.enabled: true`.

### save_artifact

Save an AI-generated artifact to the asset portal. Automatically captures provenance tracking which tool calls produced this artifact.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `name` | string | Yes | - | Display name (max 255 chars) |
| `content` | string | Yes | - | Artifact content |
| `content_type` | string | Yes | - | MIME type: text/html, text/jsx, image/svg+xml, text/markdown, application/json, text/csv |
| `description` | string | No | "" | Description (max 2000 chars) |
| `tags` | array | No | [] | Tags for categorization (max 20 tags, each max 100 chars) |

Response includes asset_id, portal_url (if public_base_url configured), provenance_captured status, and tool_calls_recorded count. Content stored at `{s3_prefix}{user_id}/{asset_id}/content.{ext}`.

### manage_artifact

List, retrieve, update, or delete saved artifacts. Ownership enforced — users can only modify their own artifacts.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `action` | string | Yes | - | list, get, update, delete |
| `asset_id` | string | Conditional | - | Required for get, update, delete |
| `content` | string | No | - | New content (for update — replaces S3 object) |
| `name` | string | No | - | New name (for update) |
| `description` | string | No | - | New description (for update) |
| `tags` | array | No | - | New tags (for update) |
| `content_type` | string | No | - | New content type (for update, only when replacing content) |
| `limit` | integer | No | 50 | Max results for list (max 200) |

---

# Personas

Personas control which tools a user can access. Map OIDC roles to personas for role-based access control.

## Configuration

```yaml
personas:
  definitions:
    analyst:
      display_name: "Data Analyst"
      roles: ["analyst", "data_engineer"]
      tools:
        allow: ["trino_*", "datahub_*"]
        deny: ["*_delete_*"]
    admin:
      display_name: "Administrator"
      roles: ["admin"]
      tools:
        allow: ["*"]
  default_persona: analyst
```

## Tool Filtering

Patterns support wildcards:
- `*` matches any sequence of characters
- `trino_*` matches all Trino tools
- `*_delete_*` matches any delete tool

Evaluation order: deny patterns are checked first, then allow patterns.

## Connection Access Control

Personas can restrict which toolkit connections a user may access. A tool call must pass both the tool pattern check and the connection check.

```yaml
personas:
  definitions:
    analyst:
      connections:
        allow: ["prod-*"]
        deny: ["prod-admin-*"]
```

If the `connections` block is omitted or empty, all connections are permitted (backward-compatible default). Connection patterns use the same wildcard syntax as tool patterns.

---

# Go Library

Import the platform as a library for custom MCP servers.

## When to Use

Use the library when you need to:
- Add tools - Your domain-specific operations
- Swap providers - Different semantic layer, different query engine
- Write middleware - Custom auth, logging, rate limiting
- Embed it - MCP inside a larger application

## Packages

| Package | What it does |
|---------|--------------|
| `pkg/platform` | Main entry point, orchestration |
| `pkg/toolkits/*` | Trino, DataHub, S3, Knowledge adapters |
| `pkg/semantic` | Semantic provider interface |
| `pkg/query` | Query provider interface |
| `pkg/middleware` | Request/response processing |
| `pkg/mcpcontext` | MCP session/progress context helpers |
| `pkg/persona` | Role-based tool filtering |
| `pkg/auth` | OIDC and API key validation |
| `pkg/admin` | Admin REST API for knowledge management |

## Minimal Example

```go
package main

import (
    "log"
    "os"
    "github.com/txn2/mcp-data-platform/pkg/platform"
    "github.com/txn2/mcp-data-platform/pkg/toolkits/datahub"
)

func main() {
    p, err := platform.New(
        platform.WithServerName("my-data-platform"),
        platform.WithDataHubToolkit("primary", datahub.Config{
            URL:   os.Getenv("DATAHUB_URL"),
            Token: os.Getenv("DATAHUB_TOKEN"),
        }),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer p.Close()
    p.Run()
}
```

---

# Knowledge Capture

Domain knowledge shared during AI sessions -- column meanings, data quality issues, business rules -- is captured, reviewed through a governance workflow, and written back to DataHub with changeset tracking and rollback.

## The Problem

Every organization has tribal knowledge about its data that lives in the heads of experienced team members. AI-assisted data exploration surfaces this knowledge in conversations, but without capture, it's lost when the session ends. Knowledge capture persists these insights for admin review and catalog write-back.

## How It Works

The system has two MCP tools and an Admin REST API:

- `capture_insight`: Records domain knowledge during sessions. Available to all personas when enabled. Creates insights with status `pending`.
- `apply_knowledge`: Admin-only tool for reviewing, approving, synthesizing, and applying insights to DataHub. Actions: `bulk_review`, `review`, `synthesize`, `apply`, `approve`, `reject`.
- Admin REST API: HTTP endpoints for managing insights and changesets outside the MCP protocol.

## Configuration

```yaml
knowledge:
  enabled: true
  apply:
    enabled: true
    datahub_connection: primary
    require_confirmation: true
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `knowledge.enabled` | bool | `false` | Enable the knowledge capture toolkit and `capture_insight` tool |
| `knowledge.apply.enabled` | bool | `false` | Enable the `apply_knowledge` tool for admin review and catalog write-back |
| `knowledge.apply.datahub_connection` | string | - | DataHub instance name for write-back operations |
| `knowledge.apply.require_confirmation` | bool | `false` | When true, the `apply` action requires `confirm: true` in the request |

Prerequisites: `database.dsn` must be configured. The `apply_knowledge` tool requires the admin persona.

## Insight Categories

| Category | Description | Example |
|----------|-------------|---------|
| `correction` | Fixes wrong metadata | "The amount column is gross margin, not revenue" |
| `business_context` | Business meaning not in metadata | "MRR counts active subscriptions only, not trials" |
| `data_quality` | Quality issues or limitations | "Timestamps before March 2024 are UTC; after that, America/Chicago" |
| `usage_guidance` | Tips for querying correctly | "Always filter status='active' to avoid soft-delete duplicates" |
| `relationship` | Dataset connections not in lineage | "customer_id in orders joins to the legacy CRM export" |
| `enhancement` | Suggested metadata improvements | "Tag sales_daily with its 6 AM CT refresh schedule" |

## capture_insight Tool

Records domain knowledge shared during a session.

**Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `category` | string | Yes | One of: `correction`, `business_context`, `data_quality`, `usage_guidance`, `relationship`, `enhancement` |
| `insight_text` | string | Yes | The knowledge to record (10-4000 characters) |
| `confidence` | string | No | `high`, `medium` (default), or `low` |
| `source` | string | No | `user` (default), `agent_discovery`, or `enrichment_gap` |
| `entity_urns` | array | No | DataHub URNs this insight relates to (max 10) |
| `related_columns` | array | No | Columns related to this insight (max 20) |
| `suggested_actions` | array | No | Proposed catalog changes (max 5) |

**Suggested action types:** `update_description`, `add_tag`, `remove_tag`, `add_glossary_term`, `flag_quality_issue`, `add_documentation`, `add_curated_query`, `set_structured_property`, `remove_structured_property`, `raise_incident`, `resolve_incident`, `add_context_document`, `update_context_document`, `remove_context_document`

**Target field:** Use `column:<fieldPath>` (e.g., `column:location_type_id`) for column-level descriptions. For `add_documentation`, the target is the URL. For other types, leave empty or omit.

**Detail field:** Description text, tag name or URN (e.g., `pii` or `urn:li:tag:pii`), glossary term name or URN, quality issue description, link description, or query name (for `add_curated_query`). Short tag/term names are auto-normalized to full URNs.

`flag_quality_issue` adds a fixed `QualityIssue` tag; the detail text is stored as context in the knowledge store, not encoded in the tag name.

`add_curated_query` creates a reusable Query entity in DataHub linked to the target dataset. Requires `query_sql` (the SQL statement) and optionally `query_description`. The `detail` field serves as the query name.

**Response:**

```json
{
  "insight_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
  "status": "pending",
  "message": "Insight captured. It will be reviewed by a data catalog administrator."
}
```

The tool also registers an MCP prompt `knowledge_capture_guidance` that provides AI agents with guidance on when to capture insights.

## apply_knowledge Tool

Admin-only tool for reviewing and applying captured insights. Requires the `admin` persona.

### Actions

**bulk_review**: Summary of all pending insights grouped by entity.

```json
{"action": "bulk_review"}
```

**review**: Insights for a specific entity with current DataHub metadata.

```json
{"action": "review", "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)"}
```

**approve/reject**: Transition insight status with optional review notes.

```json
{
  "action": "approve",
  "insight_ids": ["a1b2c3d4e5f67890", "f6e5d4c3b2a10987"],
  "review_notes": "Verified with data engineering team"
}
```

**synthesize**: Structured change proposals from approved insights.

```json
{"action": "synthesize", "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)"}
```

**apply**: Write changes to DataHub with changeset tracking.

```json
{
  "action": "apply",
  "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)",
  "changes": [
    {"change_type": "update_description", "target": "", "detail": "Order records with gross margin amounts (before returns)"},
    {"change_type": "add_tag", "target": "", "detail": "gross-margin"},
    {"change_type": "remove_tag", "target": "", "detail": "urn:li:tag:outdated"}
  ],
  "insight_ids": ["a1b2c3d4e5f67890"],
  "confirm": true
}
```

## Insight Lifecycle

```
pending -> approved -> applied
pending -> rejected
pending -> superseded
applied -> rolled_back
```

| Status | Description |
|--------|-------------|
| `pending` | Newly captured, awaiting review |
| `approved` | Reviewed and approved for application |
| `rejected` | Reviewed and rejected |
| `applied` | Changes written to DataHub |
| `superseded` | Replaced by newer insights for the same entity |
| `rolled_back` | Applied changes were reverted |

## Governance Workflow

1. User shares domain knowledge during an AI session
2. AI calls `capture_insight` to record it (status: pending)
3. Admin uses `bulk_review` to see pending insights
4. Admin reviews and approves/rejects via `approve`/`reject`
5. Admin calls `synthesize` to generate change proposals from approved insights
6. Admin calls `apply` to write changes to DataHub with changeset tracking
7. Source insights are marked as applied with a changeset reference
8. Changeset stores previous values for rollback if needed

This is a human-in-the-loop metadata curation workflow. Every change to the catalog goes through admin review and is tracked for auditability.

## Changeset Tracking

Every `apply` action creates a changeset record:

| Field | Description |
|-------|-------------|
| `id` | Unique changeset identifier |
| `target_urn` | The DataHub entity that was modified |
| `change_type` | Summary of change types applied |
| `previous_value` | Entity metadata before changes (for rollback) |
| `new_value` | Changes that were applied |
| `source_insight_ids` | Insights that produced this changeset |
| `applied_by` | User who applied the changes |
| `rolled_back` | Whether this changeset has been reverted |

## Persona Integration

Control access through persona tool filtering:

```yaml
personas:
  definitions:
    analyst:
      tools:
        allow: ["trino_*", "datahub_*", "capture_insight"]
        deny: ["apply_knowledge"]
    admin:
      tools:
        allow: ["*"]
    etl_service:
      tools:
        allow: ["trino_*"]
        deny: ["capture_insight", "apply_knowledge"]
```

## Admin REST API

HTTP endpoints for managing insights and changesets. All endpoints require admin authentication via API key (X-API-Key or Authorization: Bearer header) with the admin role.

### Insight Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/admin/knowledge/insights` | List insights with filtering and pagination |
| GET | `/api/v1/admin/knowledge/insights/{id}` | Get a single insight by ID |
| PUT | `/api/v1/admin/knowledge/insights/{id}` | Update insight text, category, or confidence |
| PUT | `/api/v1/admin/knowledge/insights/{id}/status` | Approve or reject an insight |
| GET | `/api/v1/admin/knowledge/insights/stats` | Get insight statistics |

**Query parameters for listing:**

| Parameter | Type | Description |
|-----------|------|-------------|
| `status` | string | Filter by status (pending, approved, rejected, applied, superseded, rolled_back) |
| `category` | string | Filter by category |
| `entity_urn` | string | Filter by entity URN |
| `captured_by` | string | Filter by user who captured |
| `confidence` | string | Filter by confidence level |
| `source` | string | Filter by source (user, agent_discovery, enrichment_gap) |
| `since` | RFC3339 | Filter by creation time (after) |
| `until` | RFC3339 | Filter by creation time (before) |
| `page` | integer | Page number (1-based) |
| `per_page` | integer | Results per page (default 20, max 100) |

### Changeset Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/admin/knowledge/changesets` | List changesets with filtering and pagination |
| GET | `/api/v1/admin/knowledge/changesets/{id}` | Get a single changeset by ID |
| POST | `/api/v1/admin/knowledge/changesets/{id}/rollback` | Rollback a changeset (restores previous metadata) |

**Query parameters for listing changesets:**

| Parameter | Type | Description |
|-----------|------|-------------|
| `entity_urn` | string | Filter by target entity URN |
| `applied_by` | string | Filter by user who applied |
| `rolled_back` | boolean | Filter by rollback status |
| `since` | RFC3339 | Filter by creation time (after) |
| `until` | RFC3339 | Filter by creation time (before) |
| `page` | integer | Page number (1-based) |
| `per_page` | integer | Results per page (default 20, max 100) |

## Database Schema

Knowledge capture uses two PostgreSQL tables (migrations 000006, 000007, 000008).

**knowledge_insights:**

| Column | Type | Description |
|--------|------|-------------|
| `id` | TEXT | Primary key (cryptographic random hex) |
| `created_at` | TIMESTAMPTZ | When the insight was captured |
| `session_id` | TEXT | MCP session that produced it |
| `captured_by` | TEXT | User who shared the knowledge |
| `persona` | TEXT | Active persona at capture time |
| `source` | TEXT | Where the knowledge came from: `user`, `agent_discovery`, `enrichment_gap` |
| `category` | TEXT | Insight category |
| `insight_text` | TEXT | The domain knowledge |
| `confidence` | TEXT | Confidence level |
| `entity_urns` | JSONB | Related DataHub entities |
| `related_columns` | JSONB | Related columns |
| `suggested_actions` | JSONB | Proposed catalog changes |
| `status` | TEXT | Current lifecycle status |
| `reviewed_by` | TEXT | Who reviewed the insight |
| `reviewed_at` | TIMESTAMPTZ | When it was reviewed |
| `review_notes` | TEXT | Reviewer comments |
| `applied_by` | TEXT | Who applied the insight |
| `applied_at` | TIMESTAMPTZ | When it was applied |
| `changeset_ref` | TEXT | Link to the changeset |

**knowledge_changesets:**

| Column | Type | Description |
|--------|------|-------------|
| `id` | TEXT | Primary key (cryptographic random hex) |
| `created_at` | TIMESTAMPTZ | When changes were applied |
| `target_urn` | TEXT | DataHub entity that was modified |
| `change_type` | TEXT | Type of changes applied |
| `previous_value` | JSONB | Metadata before changes |
| `new_value` | JSONB | Changes applied |
| `source_insight_ids` | JSONB | Insights that produced this |
| `approved_by` | TEXT | Who approved the changes |
| `applied_by` | TEXT | Who applied the changes |
| `rolled_back` | BOOLEAN | Whether changes were reverted |
| `rolled_back_by` | TEXT | Who reverted the changes |
| `rolled_back_at` | TIMESTAMPTZ | When changes were reverted |

---

# User Portal

The user portal is the day-to-day interface for analysts, engineers, and data consumers.

## Activity
Personal tool usage analytics with configurable time ranges (1h, 6h, 24h, 7d). Summary cards (total calls, avg duration, tools used), activity timeseries chart, and top tools bar chart.

## Assets
AI-generated dashboards, reports, and visualizations saved via `save_artifact`. Grid/table view with search, content type filter (HTML, JSX, SVG, Markdown, CSV), and tag filter. Asset viewer renders each type natively: HTML/JSX as interactive components, SVG as vector graphics, Markdown with GFM+mermaid, CSV as sortable tables. Preview/Source toggle, Delete/Download/Share actions.

## Collections
Curated groups of assets organized into ordered sections with markdown descriptions. Grid/table list view with search. Collection viewer renders sections with markdown, asset cards with thumbnails. Configurable thumbnail sizes (Large/Medium/Small/None). Sharing via public links (token-based, time-limited) and user shares (email, Viewer/Editor permission).

Collection API endpoints:
- CRUD: POST/GET/PUT/DELETE `/api/v1/portal/collections[/{id}]`
- Sections: PUT `/api/v1/portal/collections/{id}/sections` (full replace)
- Config: PUT `/api/v1/portal/collections/{id}/config`
- Sharing: POST/GET `/api/v1/portal/collections/{id}/shares`
- Shared with me: GET `/api/v1/portal/shared-collections`
- Public view: GET `/portal/view/{token}` (branches on collection vs asset)

Database tables: portal_collections, portal_collection_sections, portal_collection_items. portal_shares extended with collection_id (CHECK constraint: exactly one of asset_id/collection_id set).

## Resources
Human-uploaded reference materials (SQL templates, runbooks, checklists) accessible to AI via MCP resources/read. Scope tabs (My Resources, admin, Global), search, category filter, upload button. Table shows name, category, MIME type, tags, size, uploader, and date.

## Shared With Me
Assets and collections shared by other users. Two tabs: Assets (grid/table with content type, tags, sharer, permission badge, size, date) and Collections (table with name, sharer, access level, date).

## Knowledge & Memory
Personal view of captured insights and memories. Knowledge tab: summary cards, status filters, insight cards with category/status badges, entity URNs. Memory tab: summary cards, status/dimension filters, memory cards with markdown content.

## Prompts
Personal and available prompt templates. Two tabs: Personal (create/edit/delete own prompts) and Available (global/persona prompts). Search, sortable columns, expandable rows with full content and copy button. Scope badges: Personal, Global, Persona, System.

---

# Admin Portal

The Admin Portal is a built-in web dashboard for monitoring and managing the platform. Enable with `portal.enabled: true` in configuration.

## Dashboard

Real-time platform health overview with configurable time ranges (1h, 6h, 24h, 7d):

- System info bar: platform name, version, transport, config mode, enabled features
- Summary cards: total calls, success rate, avg duration, unique users/tools, enrichment rate, errors
- Activity timeline: tool call volume over time with error overlay
- Top Tools / Top Users: horizontal bar charts of most active tools and users
- Performance: response time percentiles (P50, P95, P99) and response sizes
- Recent Errors: clickable error list with detail drawer
- Knowledge Insights: summary statistics and category breakdown
- Connections: all configured toolkit connections with tool counts

## Tools

Master-detail layout: a left rail listing every registered tool (search by name, group by connection or kind, eye/eye-off icon for global-deny visibility) and a right pane with five tabs for the selected tool.

### Overview Tab
Editable description with override badge and reset (`tool.<name>.description` config entry; persists in `config_entries`, takes effect on next platform restart). Routing (toolkit, kind, connection). Persona allow/deny matrix with matched pattern and source (allow/deny/default). Raw input schema.

### Try It Tab
Dynamic parameter form generated from JSON schema. Result display with rendered markdown tables, Raw JSON toggle, and cross-enrichment blocks shown below the primary result (semantic context, query context, column context, storage context, metadata reference). Per-session execution history with replay; consumes replay intents from the audit-log Events drawer.

### Activity Tab
24-hour audit aggregate for this tool: call count, success rate, average duration. Link out to `/admin/audit?tool=<name>` for the full filtered audit log.

### Enrichment Tab (gateway-proxied tools only)
Lists cross-enrichment rules attached to this tool on its connection, with merge strategy and enabled state. Link out to the connection's enrichment drawer for rule edits.

### Visibility Tab
Global kill-switch toggle (read-modify-write of `tools.deny` config entry; cosmetic filter, persona auth still gates `tools/call`). Persona allow/deny matrix repeated for context. Preview form: enter any persona name and see its decision for this tool without editing rules (`POST /api/v1/admin/personas/{name}/test-access`).

## Audit Log

### Overview Tab
Platform-wide analytics with summary cards, activity timeline, top tools/users charts, and performance percentiles. Same metrics as Dashboard focused on audit data.

### Events Tab
Searchable, filterable event log with export (CSV/JSON). Detail drawer showing:
- Identity: user email, persona, session ID
- Execution: tool name, toolkit, connection, duration
- Status: success/failure, enrichment status
- Transport: HTTP/stdio, request/response sizes
- Parameters: full request parameters as JSON

## Knowledge & Memory

### Overview Tab
Knowledge capture cards (total, pending, approved, applied, rejected, approval rate) with insight status donut chart. Memory cards (total, active, stale, dimensions) with status and dimension distribution charts.

### Knowledge Capture Tab
All captured insights with filters (status, category, confidence). Sortable table with detail drawer: full metadata, entity URNs, suggested actions, related columns, review notes, approve/reject actions.

### All Memory Tab
All memory records across users. Filters (dimension, category, status, source). Table with detail drawer: full markdown content, entity URNs, metadata, stale reason, archive action.

### Changesets Tab
Catalog changes from approved knowledge. Target URN, change type, applied by, status. Rollback support.

## Assets (Admin)
All platform assets across users. Searchable table with name, owner, type, size, sharing, date. Detail view with full asset viewer.

## Resources (Admin)
All managed resources across scopes. Scope tabs (All, Global, per-persona). Search, category filter, upload. Table with name, scope badge, category, type, tags, size, uploader, date.

## Prompts (Admin)
Global prompt management. Scope filter, search, new prompt creation with scope/persona/enabled state. Table with scope badges (Global, Persona, Personal, System).

## Connections
Split-pane layout. Left: connection list grouped by kind (DataHub, S3, Trino) with source badges (file/database). Right: detail with metadata, configuration (sensitive toggle), edit/delete. Add Connection button for database-only connections.

## Personas
Split-pane layout. Left: persona list with display name, slug, roles, tool count. Right: detail with priority, resolved tools, roles, tool access rules (allow/deny pattern badges), context overrides (description prefix, agent instructions suffix).

## API Keys
Key table with name, source badge, email, description, roles, expiration, actions. Add Key form (name, email, roles, expiration preset). Plaintext shown once at creation. Expired keys shown dimmed with badge.

## Change Log
Audit trail of configuration changes. Config key, action (set), timestamp.

## Local Development

Run with mock data (no backend): `cd ui && VITE_MSW=true npm run dev`
Full-stack: `make dev-up` + `go run ./cmd/mcp-data-platform --config dev/platform.yaml` + `psql -f dev/seed.sql` + `cd ui && npm run dev`
Screenshots: `cd ui && npm run screenshots && npm run screenshots:convert`

---

# MCP Apps

MCP Apps provide interactive UI panels that enhance tool results. Instead of raw JSON, users see rich interactive UI rendered inline in the MCP host. The platform ships the `platform-info` app embedded in the binary — it registers automatically with zero configuration.

## Overview

1. User calls a tool (e.g., `platform_info`, `trino_query`)
2. MCP server returns results with a UI resource reference
3. Host fetches the HTML app and renders it in an iframe
4. App receives tool results via `postMessage` and displays interactive UI

## Built-in App: platform-info

`platform-info` is embedded in the binary and registers at startup automatically. No `assets_path`, volume mount, or `enabled: true` is required.

Minimal branding override (all fields optional):

```yaml
mcpapps:
  apps:
    platform-info:
      config:
        brand_name: "ACME Data Platform"
        brand_url: "https://data.acme.com"
        logo_svg: "<svg ...>"
```

To replace the embedded HTML entirely: set `assets_path` pointing to your custom directory.

## Configuration

```yaml
mcpapps:
  # enabled defaults to true
  apps:
    # platform-info is built-in; only branding config is needed
    platform-info:
      config:
        brand_name: "ACME"

    # Custom app example
    query_results:
      enabled: true
      assets_path: "/etc/mcp-apps/query-results"
      tools:
        - trino_query
        - trino_execute
      csp:
        resource_domains:
          - "https://cdn.jsdelivr.net"
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `enabled` | bool | No | Enable MCP Apps infrastructure (default: `true`) |
| `apps.<name>.assets_path` | string | No* | Absolute path to app directory. *Required for custom apps; optional for `platform-info` |
| `apps.<name>.tools` | array | Yes | Tools this app enhances |
| `apps.<name>.csp.resource_domains` | array | No | Allowed CDN origins |
| `apps.<name>.config` | object | No | Config injected as `<script id="app-config">` JSON |

## Example App: query-results

The repository includes a community example app at `apps/query-results/` demonstrating sortable tables, charts, dark mode, and search/filter for `trino_query` output. It is not built into the binary — deploy it as a custom app by mounting the assets directory.

## Development

Prerequisites: Docker (only)

```bash
docker compose -f docker-compose.dev.yml up
```

This starts: Test Harness (localhost:8000), MCP Server (localhost:3001), MCP Inspector (localhost:6274), Trino (localhost:8090).

The dev config (`configs/mcpapps-dev.yaml`) uses `assets_path` to override the embedded `platform-info` HTML, enabling live editing without rebuilding the binary.

Workflow:
1. Open http://localhost:8000/test-harness.html
2. Edit `./apps/platform-info/index.html` (or `./apps/query-results/index.html`)
3. Click Reload App, then Send Test Data
4. Changes appear immediately

## MCP Apps Protocol

- App initiates with `ui/initialize` (not host)
- Communication via `window.parent.postMessage()`
- Tool results delivered via `ui/notifications/tool-result`
- Resources use MIME type `text/html;profile=mcp-app`

### Protocol Messages

1. **ui/initialize** (Request): App sends on load to establish connection
   ```javascript
   window.parent.postMessage({
       jsonrpc: '2.0',
       id: 1,
       method: 'ui/initialize',
       params: {
           protocolVersion: '2025-01-09',
           appInfo: { name: 'My App', version: '1.0.0' },
           appCapabilities: {
               availableDisplayModes: ['inline', 'fullscreen', 'pip']
           }
       }
   }, '*');
   ```

2. **ui/notifications/initialized**: App confirms ready after receiving init response

3. **ui/notifications/tool-result**: Host sends tool results to app
   ```javascript
   // Structure:
   {
       method: 'ui/notifications/tool-result',
       params: {
           toolName: 'platform_info',
           content: [{ type: 'text', text: '{"name":"mcp-data-platform",...}' }]
       }
   }
   // Note: Tool output is JSON-encoded inside params.content[0].text
   ```

4. **ui/notifications/size-changed**: App tells host preferred dimensions

### Display Modes

| Mode | Description |
|------|-------------|
| `inline` | Rendered within chat flow (default) |
| `fullscreen` | Takes over entire viewport |
| `pip` | Picture-in-picture floating window |

## Tutorial

For a complete step-by-step guide to building an MCP App, see the [MCP Apps Tutorial](mcpapps/tutorial.md). The tutorial walks through building a `platform-info` app and covers:

- MCP Apps protocol and message flow
- Understanding tool result structure
- Using the test harness for development
- Styling with mock data and dark mode support
- Connecting to real MCP tools

---

# Troubleshooting

## Quick Diagnosis

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| Server exits immediately | Configuration error | Validate YAML syntax |
| 401 Unauthorized | Invalid credentials | Check token/key |
| 403 Forbidden | Persona/tool mismatch | Check persona tool rules |
| No enrichment data | Enrichment misconfigured | Enable enrichment settings |
| Slow responses | Performance bottleneck | Enable caching |
| Connection refused | Service unreachable | Check connectivity |

## Common Error Codes

| Code | Meaning | Solution |
|------|---------|----------|
| `AUTH_ERROR` | Authentication failed | Check credentials, token expiration |
| `AUTHZ_ERROR` | Authorization failed | Check persona tool rules |
| `TOOLKIT_ERROR` | Toolkit operation failed | Check service connectivity |
| `PROVIDER_ERROR` | Provider operation failed | Check DataHub/Trino config |
| `CONFIG_ERROR` | Configuration invalid | Validate YAML, check env vars |
| `TIMEOUT_ERROR` | Operation timed out | Increase timeout, check service |

## Enable Debug Logging

```bash
export LOG_LEVEL=debug
mcp-data-platform --config platform.yaml
```

---

# Operating Modes

mcp-data-platform supports two operating modes based on whether `database.dsn` is set.

## Mode Comparison

| Aspect | Standalone | File + Database |
|--------|-----------|-----------------|
| `database.dsn` | empty | set |
| Config source | YAML file only | YAML file + per-key DB overrides |
| Config entry CRUD | blocked (409) | available (whitelisted keys) |
| Knowledge tools | hidden (not registered) | registered |
| Knowledge admin API | 409 Conflict | available |
| Audit logging | noop (silent) | PostgreSQL |
| Audit admin API | 409 Conflict | available |
| Sessions | memory | database |
| OAuth | available (memory store) | available (DB store) |
| Persona/auth key CRUD | read-only | enabled |

**Standalone (No Database)**: Lightest deployment. No external database. Features requiring persistence (audit, knowledge) run in noop mode, tools are hidden. Admin API database-dependent endpoints return 409 Conflict.

**File + Database**: Production default. Complete YAML config with database for persistence. Audit, knowledge, session externalization all available. Individual config entries for whitelisted keys (server.description, server.agent_instructions) can be overridden at runtime via the admin API with hot-reload. File defaults are preserved as fallback.

## Config Store

When `database.dsn` is set, the granular config entries system activates automatically. Per-key overrides are stored in the `config_entries` table and take effect immediately without restart. Deleting an entry restores the file default.

```yaml
config_store:
  mode: file      # Deprecated — ignored. Accepted for backward compatibility.
```

## Settings

The admin portal includes settings pages for managing Connections, Personas, API Keys, and Configuration entries. Each resource displays a source badge indicating whether it originates from the config file, the database, or both.

**Source badges:** "file" (YAML config, read-only), "database" (admin UI managed, fully editable), "both" (file with database override, database version active, file version is fallback).

**Connections:** Multi-connection toolkits are expanded into individual entries. File connections are read-only; editing creates a database override. Deleting a "both" connection reverts to file. DataHub Integration section shows datahub_source_name and catalog_mapping.

**Personas:** File personas cannot be deleted (they reappear on restart). Editing creates a database override (source becomes "both"). Deleting a "both" persona removes the override and immediately reverts to the file version.

**API Keys:** File keys cannot be deleted. Database keys are created via the admin UI with the plaintext shown once. Same-name keys in file and database are deduplicated with source "both".

**Migration from file to database:** Edit the resource in the admin UI (creates database override) → verify → remove from YAML → restart.

---

# Admin API

REST endpoints for managing the platform outside the MCP protocol. Mounted under a configurable path prefix (default: `/api/v1/admin`).

## Authentication

All endpoints require admin persona authentication via `X-API-Key` or `Authorization: Bearer` headers. Errors use RFC 9457 Problem Details format.

## Interactive API Documentation

Swagger UI is served at `GET /api/v1/admin/docs/index.html`. The OpenAPI spec is auto-generated from source code annotations and available at `/api/v1/admin/docs/doc.json`. Regenerate with `make swagger`.

## System Endpoints

- `GET /system/info` — Platform identity, version, runtime feature availability, config mode
- `GET /tools` — All registered tools across toolkits
- `GET /connections` — Toolkit connections with their tools

## Config Endpoints

- `GET /config` — Current config as JSON (sensitive values redacted)
- `GET /config/mode` — Config store mode and read-only status
- `GET /config/export` — Downloadable YAML config (`?secrets=true` to include sensitive values)
- `GET /config/effective` — Merged view: file defaults + DB overrides with source indicator
- `GET /config/entries` — List all database config entries (per-key overrides)
- `GET /config/entries/{key}` — Get single config entry
- `PUT /config/entries/{key}` — Set config entry (whitelisted keys only, hot-reload)
- `DELETE /config/entries/{key}` — Remove database override (restores file default, 204 No Content)
- `GET /config/changelog` — Audit log of config entry changes

## Persona Endpoints

- `GET /personas` — List personas with tool counts
- `GET /personas/{name}` — Single persona with resolved tool list
- `POST /personas` — Create persona (database config mode only)
- `PUT /personas/{name}` — Update persona (database config mode only)
- `DELETE /personas/{name}` — Delete persona (database config mode only, cannot delete admin persona)

## Auth Key Endpoints

- `GET /auth/keys` — List API keys (values never exposed)
- `POST /auth/keys` — Generate new API key (database config mode only, key shown once)
- `DELETE /auth/keys/{name}` — Delete API key (database config mode only)

## Connection Instance Endpoints

Database-managed toolkit connections. Read endpoints always available; write endpoints require database config mode.

- `GET /connection-instances` — List all connection instances (ordered by kind, name)
- `GET /connection-instances/{kind}/{name}` — Get single connection instance
- `PUT /connection-instances/{kind}/{name}` — Create or update connection instance (database mode only, valid kinds: trino, datahub, s3)
- `DELETE /connection-instances/{kind}/{name}` — Delete connection instance (database mode only)

## Audit Endpoints

Requires `audit.enabled: true` and database. Returns 409 without database.

- `GET /audit/events` — Paginated audit events with filtering (user_id, tool_name, session_id, success, time range)
- `GET /audit/events/{id}` — Single audit event
- `GET /audit/stats` — Aggregate counts (total, success, failures)
- `GET /audit/metrics/overview` — Aggregated audit metrics (tool/user/toolkit breakdowns, timeseries, performance)
- `GET /audit/metrics/enrichment` — Enrichment statistics: mode breakdown (full/summary/reference/none), token usage, dedup savings, enrichment rate
- `GET /audit/metrics/discovery` — Session-level discovery patterns: discovery-before-query rates, top discovery tools, sessions without catalog exploration

### Enrichment Tracking Fields

Audit events include enrichment instrumentation:
- `enrichment_applied` (boolean) — whether semantic enrichment was applied
- `enrichment_mode` (string) — `full`, `summary`, `reference`, `none`, or empty
- `enrichment_tokens_full` (int) — estimated tokens for full enrichment content
- `enrichment_tokens_dedup` (int) — estimated tokens for dedup content (0 when full sent)

## Knowledge Endpoints

Requires `knowledge.enabled: true` and database. Returns 409 without database.

- `GET /knowledge/insights` — List insights with filtering
- `GET /knowledge/insights/stats` — Insight statistics
- `GET /knowledge/insights/{id}` — Get single insight
- `PUT /knowledge/insights/{id}` — Update insight text/category
- `PUT /knowledge/insights/{id}/status` — Approve or reject
- `GET /knowledge/changesets` — List changesets
- `GET /knowledge/changesets/{id}` — Get single changeset
- `POST /knowledge/changesets/{id}/rollback` — Rollback changes

---

# Gateway Toolkit

The gateway toolkit lets the platform act as an MCP **client** against arbitrary upstream MCP servers and re-expose their tools through the platform's own MCP server. Every proxied tool inherits the platform's authentication, persona enforcement, and audit logging — operators get one security envelope across every MCP they integrate.

## Architecture

The forwarder dials each configured upstream once at startup, discovers its tool catalog, and re-registers every tool under a connection-namespaced local name (`<connection>__<remote_tool>`). Persona rules and audit middleware see proxied tools the same way they see native tools.

## Configuring connections

Gateway connections live in the database, not in `platform.yaml`. Operators add and authenticate them through the admin portal (or directly via the admin REST API). Required for the kind to be active in YAML:

```yaml
toolkits:
  mcp:
    enabled: true
    # No instances here — connections are managed via the admin portal.
```

Create a connection through the admin REST API. Connection kind is `mcp` (each connection is a remote MCP server); the gateway is the platform-side feature that proxies them.

```bash
curl -X PUT \
  -H "X-API-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "endpoint": "https://vendor.example.com/mcp",
      "auth_mode": "bearer",
      "credential": "your-vendor-token",
      "connection_name": "vendor"
    },
    "description": "Vendor analytics MCP"
  }' \
  https://platform.example.com/api/v1/admin/connection-instances/mcp/vendor
```

The credential field is encrypted at rest (AES-256-GCM) when `ENCRYPTION_KEY` is set. The connection name (`vendor` above) becomes the prefix for every proxied tool.

### Authentication modes

- `none` — no auth header
- `bearer` — `Authorization: Bearer <credential>`
- `api_key` — `X-API-Key: <credential>`
- `oauth` — `Authorization: Bearer <token>`, with the token acquired via OAuth 2.1 and refreshed automatically

Connections use a shared service credential per connection (one upstream identity for every platform user). User-level attribution still appears in the audit log.

### OAuth 2.1

Two grants are supported:

- `client_credentials` — for machine-to-machine credentials. Set `oauth_token_url`, `oauth_client_id`, `oauth_client_secret`, `oauth_scope`. The platform fetches and refreshes tokens automatically.
- `authorization_code` — for upstreams that require browser sign-in (e.g. Salesforce Hosted MCP). Adds `oauth_authorization_url` and uses PKCE. The operator clicks Connect in the admin portal, signs in, and the platform persists the tokens (encrypted at rest in `gateway_oauth_tokens`, migration `000035`). The refresh token keeps the access token alive automatically thereafter — including for cron jobs and scheduled prompts.

Reauthentication is only required if the upstream invalidates the refresh token. The admin UI surfaces a Connect button when reauth is needed.

#### Salesforce Hosted MCP setup

Salesforce's Hosted MCP (Beta as of Dreamforce 2025) requires `authorization_code` + PKCE through an External Client App with the Web Server Flow enabled. Configure the ECA with callback URL `https://<host>/api/v1/admin/oauth/callback`, scopes `api refresh_token <mcp scope>`, and use the consumer key/secret as `oauth_client_id` / `oauth_client_secret`. Authorization URL is `https://login.salesforce.com/services/oauth2/authorize`; token URL is `https://login.salesforce.com/services/oauth2/token`.

### Test and refresh endpoints

- `POST /api/v1/admin/gateway/connections/{name}/test` — dial without saving, return discovered tools
- `POST /api/v1/admin/gateway/connections/{name}/refresh` — re-dial a stored connection, re-register tools

## Persona enforcement

Proxied tools follow standard persona rules. The double-underscore separator (`__`) makes gateway tools easy to target by pattern:

```yaml
personas:
  marketer:
    roles: ["marketing_team"]
    tools:
      allow:
        - "trino_query"
        - "vendor__list_*"
        - "vendor__send_*"
      deny:
        - "vendor__delete_*"
```

## Cross-enrichment rules

The gateway can run declarative enrichment rules that augment a proxied tool's response with context fetched from another platform source (Trino query, DataHub lookup). Rules are stored in `gateway_enrichment_rules` (migration 000034).

Rule shape:

```json
{
  "tool_name": "vendor__get_contact",
  "when_predicate": { "kind": "response_contains", "paths": ["$.email"] },
  "enrich_action": {
    "source": "trino",
    "operation": "query",
    "parameters": {
      "connection": "warehouse",
      "sql_template": "SELECT lifetime_value FROM mart.customers WHERE email = :email",
      "email": "$.response.email"
    }
  },
  "merge_strategy": { "kind": "path", "path": "warehouse_signals" },
  "enabled": true
}
```

### Predicates

- `always` (default) — fires on every successful tool call
- `response_contains` — fires only when every JSONPath in `paths` resolves in the response

### Sources and operations

- `trino` — `query` (parameters: `connection`, `sql_template`, plus bindings)
- `datahub` — `get_entity`, `get_glossary_term` (parameter: `urn`)

Bindings (string parameters starting with `$.` or `$[`) are JSONPath expressions resolved against `{ args, response, user }`. SQL `:name` placeholders are substituted with safely-quoted ANSI-SQL literals; single-quoted regions in templates are skipped so timestamp literals are never mangled.

### Merge strategies

- `path` (default) — attaches the source result to `response[merge.path]`. Default path: `enrichment`.

### Failure mode

Rule failures never break the parent tool call. Warnings are appended as `TextContent` entries prefixed `warning:` and the unaltered response content is returned.

### Rule admin endpoints

- `GET    /api/v1/admin/gateway/connections/{name}/enrichment-rules`
- `POST   /api/v1/admin/gateway/connections/{name}/enrichment-rules`
- `GET    /api/v1/admin/gateway/connections/{name}/enrichment-rules/{id}`
- `PUT    /api/v1/admin/gateway/connections/{name}/enrichment-rules/{id}`
- `DELETE /api/v1/admin/gateway/connections/{name}/enrichment-rules/{id}`
- `POST   /api/v1/admin/gateway/connections/{name}/enrichment-rules/{id}/dry-run`

The dry-run endpoint accepts a sample `{ args, response, user }` and returns the merged response plus per-rule traces (timing, errors).

## Failure isolation

A gateway upstream that's unreachable at startup logs a warning, records zero tools for that connection, and does not block platform startup. Other connections keep working. A connection that becomes unhealthy at runtime returns tool-error results prefixed `upstream:<connection>:` so the LLM can self-correct, and the audit log captures the failure with the same event shape as a successful call.

---

# OAuth to Upstream MCPs

When a gateway connection requires OAuth, the platform acts as the OAuth client to the upstream MCP. This is distinct from the inbound OAuth 2.1 server that lets Claude Desktop authenticate to the platform.

## Supported grants

- **client_credentials** (machine-to-machine): set `oauth_token_url`, `oauth_client_id`, `oauth_client_secret`, optional `oauth_scope`. Platform fetches and refreshes tokens automatically. No browser involved.
- **authorization_code + PKCE** (browser sign-in): adds `oauth_authorization_url`. Operator clicks Connect in the admin portal once; the platform walks the PKCE flow per RFC 7636 (S256), persists encrypted refresh tokens, and refreshes the access token automatically thereafter — including for cron jobs and scheduled prompts.

## Refresh token longevity (operator-controlled `oauth_scope`)

The platform sends `oauth_scope` to the IdP verbatim — there is no automatic augmentation. To get a refresh token that survives platform restarts, add the IdP-specific offline-token scope to your connection's `oauth_scope`:

- Keycloak / Auth0 / Okta: add `offline_access`. Example: `oauth_scope: "openid profile email offline_access"`. The IdP must also be configured to grant that scope to your client (see Keycloak setup below).
- Salesforce: add `refresh_token`. Salesforce does not recognize `offline_access`. Example: `oauth_scope: "openid api refresh_token"`.
- Custom OAuth servers: consult your IdP's documentation for the scope name that grants long-lived refresh tokens.

Without an offline-token scope, most IdPs tie the refresh token's lifetime to the user's interactive SSO Session Idle (default ~30 min on Keycloak), so any platform restart longer than that idle window forces the operator to re-authorize through the browser.

When the IdP rejects a stored refresh token with `invalid_grant`, the admin status surface surfaces a generic actionable hint pointing operators at their IdP's offline-token configuration. Without that hint, the only visible cause was the cryptic upstream message ("Token is not active") with no clue what to do.

### Keycloak setup

For `offline_access` to be granted, two things must be true:

1. The `offline_access` client scope must be assignable (Optional or Default) to the gateway's client. In Keycloak admin: Realm → Clients → <gateway client> → Client Scopes → Setup → ensure `offline_access` is listed under Optional or Default.
2. The user completing the Connect flow must have the `offline_access` realm role (granted by default to all users in most realms).

Without (1), the IdP rejects the authorization request with `invalid_scope: Invalid scopes: ... offline_access`. Drop `offline_access` from your `oauth_scope` until you've added it to the client's assignable scope set.

## Storage

- `gateway_oauth_tokens` — access_token, refresh_token, expires_at, authenticated_by. AES-256-GCM via `ENCRYPTION_KEY`.
- `oauth_pkce_states` — code_verifier (paired secret), state, redirect_uri. AES-256-GCM via `ENCRYPTION_KEY`. Rows expire after 10 min via server-side `NOW() + INTERVAL`.

`ENCRYPTION_KEY` is required for production. Without it, the platform stores values in plaintext and logs a warning.

## Multi-replica

PKCE state must be visible across replicas (oauth-start may land on replica A, callback on replica B). The platform automatically uses the Postgres-backed PKCE store when a database is configured. Single-replica deployments fall back to in-memory.

## Salesforce Hosted MCP

Standard `authorization_code` + PKCE via an External Client App (ECA) with Web Server Flow enabled. Callback URL `https://<your-platform-host>/api/v1/admin/oauth/callback`, scopes `api refresh_token <mcp scope>`, authorization URL `https://login.salesforce.com/services/oauth2/authorize`, token URL `https://login.salesforce.com/services/oauth2/token`. Consumer key/secret become `oauth_client_id`/`oauth_client_secret`.

---

# Gateway Admin REST Endpoints

In addition to the standard `/api/v1/admin/connection-instances/mcp/*` CRUD (used for create/read/update/delete with `kind=mcp`), the gateway exposes:

- `POST /api/v1/admin/gateway/connections/{name}/test` — dial the upstream without saving, return discovered tools. Used by the admin UI to validate credentials before persisting.
- `POST /api/v1/admin/gateway/connections/{name}/refresh` — re-dial a saved connection and re-register its tool catalog on the live MCP server.
- `POST /api/v1/admin/gateway/connections/{name}/oauth-start` — for `authorization_code` connections; returns the upstream's authorization URL with PKCE state. Operator opens this in a browser tab.
- `GET /api/v1/admin/oauth/callback?code=…&state=…` — **public** endpoint (state token authenticates the callback). Upstream provider redirects here after sign-in. Platform exchanges the code for tokens and persists them encrypted.
- `GET/POST/PUT/DELETE /api/v1/admin/gateway/connections/{name}/enrichment-rules[/{id}]` — cross-enrichment rule CRUD.
- `POST /api/v1/admin/gateway/connections/{name}/enrichment-rules/{id}/dry-run` — preview a rule against sample data without side effects.

---

# API Gateway Toolkit

The API gateway toolkit (`kind: api`) proxies arbitrary REST/HTTP APIs through the platform's auth, persona, and audit pipeline, the HTTP/JSON sibling of the MCP gateway. Three tools (`api_invoke_endpoint`, `api_list_endpoints`, `api_get_endpoint_schema`) handle every operation on every configured upstream. The model uses `api_list_endpoints` for discovery, `api_get_endpoint_schema` for precise per-endpoint detail, and `api_invoke_endpoint` for the call itself. No tool catalog explosion: ten upstreams = three tools total, regardless of endpoint count.

OpenAPI specs are stored separately in **API catalogs** (see the next section), globally-owned and versioned bundles that many connections can reference.

## When to use

Use the API gateway for REST/HTTP upstreams (Salesforce, Google APIs, GitHub, Stripe, internal HTTP services). For upstream MCP servers, use the MCP gateway (`kind: mcp`) instead.

## Connection config

Stored in the database, authored through the admin portal or the admin REST API. Required field: `base_url`. Auth modes: `none`, `bearer`, `api_key` (header or query placement), `basic` (RFC 7617 `Authorization: Basic base64(username:password)` for legacy APIs that never moved to bearer or OAuth, like Jenkins or on-prem Jira / Confluence Server / DC; `password` may be empty for the token-in-userid pattern; `username` is rejected if it contains `:` per RFC 7617 §2 or CR/LF/NUL header-smuggling vectors), `oauth2_client_credentials`, `oauth2_authorization_code`. The authorization-code grant completes via the platform's shared `/api/v1/admin/oauth/callback` endpoint (same as MCP gateway).

## Static headers

`static_headers` is the second auth slot for APIs that require BOTH an OAuth bearer AND a separate per-call header. `auth_mode` only contributes ONE thing; `static_headers` adds the rest. Operator-supplied, encrypted at rest (AES-256-GCM via the platform's FieldEncryptor), redacted by the admin API.

Motivating cases:

- **Google APIs** quota-billing requires `x-goog-user-project` alongside the access token for some endpoints.
- **Vendor APIs with a subscription/tenant header** (common in healthcare, payments, and SaaS-platform APIs) demand that header on every call in addition to the user OAuth bearer.
- **Internal API gateways** in front of upstream services often inject a subscription/tenant header that the underlying service authenticates against.

### Validation

- Header names must use the RFC 7230 token character set; no spaces, no colons.
- Values cannot contain CR/LF/NUL (header-smuggling defense).
- Cannot set `Authorization` (use `auth_mode`).
- Cannot set the API-key header chosen by `auth_mode: api_key`.
- Cannot set hop-by-hop / net/http-managed headers: `Host`, `Content-Length`, `Connection`, `Transfer-Encoding`, `Upgrade`, `Keep-Alive`, `Proxy-Authenticate`, `Proxy-Authorization`, `TE`, `Trailer`.

### Header precedence on the wire

1. Per-call headers from the tool input (`api_invoke_endpoint.headers`).
2. `static_headers` (operator-configured): overrides per-call headers with the same name.
3. `auth_mode` contribution: runs last, always wins.

The model is additionally blocked at request time from supplying any header name that collides with `static_headers`, so the operator's value is authoritative even before auth runs.

## Google API example

```json
{
  "base_url": "https://www.googleapis.com",
  "auth_mode": "oauth2_authorization_code",
  "oauth2_authorization_url": "https://accounts.google.com/o/oauth2/v2/auth",
  "oauth2_token_url":         "https://oauth2.googleapis.com/token",
  "oauth2_client_id":         "your-google-client-id",
  "oauth2_client_secret":     "your-google-client-secret",
  "oauth2_scopes":            ["https://www.googleapis.com/auth/drive.readonly"],
  "static_headers": {
    "x-goog-user-project": "your-quota-project-id"
  }
}
```

## Salesforce example

```json
{
  "base_url": "https://your-instance.my.salesforce.com",
  "auth_mode": "oauth2_authorization_code",
  "oauth2_authorization_url": "https://login.salesforce.com/services/oauth2/authorize",
  "oauth2_token_url":         "https://login.salesforce.com/services/oauth2/token",
  "oauth2_client_id":         "your-connected-app-consumer-key",
  "oauth2_client_secret":     "your-connected-app-consumer-secret",
  "oauth2_scopes":            ["api", "refresh_token"]
}
```

Salesforce needs `refresh_token` in `oauth2_scopes` to issue a refresh token; without it the connection cannot survive access-token expiry.

## REST gateway for non-MCP clients

`api_invoke_endpoint` is also reachable over plain HTTP for clients that cannot speak MCP/JSON-RPC (Apache NiFi, Airflow `HttpOperator`, `curl`):

```
POST /api/v1/gateway/{connection}/invoke
```

Auth uses the same headers as every other REST surface on the platform (`Authorization: Bearer ...` or `X-API-Key: ...`). The request goes through an in-memory MCP session against the assembled server, so the authenticator, persona authorization, route-policy, and audit middleware all apply identically to MCP and REST callers — no auth or audit logic is forked. Persona allowlists for `api_invoke_endpoint` govern access; tool-error envelopes from the toolkit map to HTTP status codes (`401` for auth failure, `403` for not authorized, `404` for unknown connection, `400` for validation).

Request body mirrors `InvokeInput` (the connection is taken from the URL):

```json
{ "method": "GET", "path": "/v1/things", "query_params": {"limit": 50}, "headers": {"X-Trace": "abc"}, "body": null, "timeout_seconds": 30 }
```

Response is `InvokeOutput` JSON: `status` is the upstream HTTP status, `body` is the parsed upstream response, `duration_ms` is the per-call timing. Platform HTTP `200` means the upstream call ran (outcome in `body.status`), not that the upstream succeeded — keeping platform-side and upstream-side failures distinguishable for NiFi's response-code routing. The route is only mounted when at least one `kind: api` toolkit instance is loaded.

---

# API Catalogs

OpenAPI specs describe the API itself, not any one credential pointed at it. An organization with a Salesforce sandbox and a Salesforce production org has two connections (one set of credentials each) but both talk to the same sObjects, query, and bulk-job endpoints. A multi-tenant SaaS deployment can have N customer connections all referencing one vendor catalog. Pasting the same documentation into every connection record is duplication that drifts. The catalog model fixes that.

## Data model

Two new tables (migration 000042):

- `api_catalogs(id PK, name, version, display_name, description, created_by, timestamps)` with `UNIQUE(name, version)`. Each `(name, version)` is its own row; cloning to a new version creates a new row. IDs are operator-chosen slugs (lowercase alphanumeric + hyphens, ≤100 chars), immutable after creation.
- `api_catalog_specs(catalog_id FK, spec_name, content TEXT, source_kind, source_url, etag, last_fetched_at)` with `(catalog_id, spec_name)` as primary key and `ON DELETE CASCADE` from `api_catalogs`. Content is plain TEXT; specs are public API documentation, no field-level encryption.

A connection's catalog reference lives in its existing `connection_instances.config` JSONB at key `catalog_id`. Application-layer validation rejects unknown ids at save time; the catalog handler refuses to delete a catalog while any connection still references it.

## Spec ingestion

The admin handler exposes three ingestion paths, all on the same `/api-catalogs/{id}/specs/{spec_name}` family of routes:

- **Inline** (`PUT` with `{source_kind: "inline", content}`): paste YAML or JSON directly. Validated against OpenAPI 3.x at save.
- **Upload** (`PUT .../upload` multipart): same as inline but the content arrives as a multipart `file` field. 10 MB body cap. Allowed MIME types: `application/json`, `application/yaml`, `application/x-yaml`, `application/octet-stream`, `text/yaml`, `text/x-yaml`, `text/plain`.
- **URL** (`PUT` with `{source_kind: "url", source_url}`): operator-provided HTTPS URL. The server fetches once at save time, captures the ETag, validates the content. Operators can later `POST .../refresh` to re-fetch.

URL-fetch SSRF guards (all required, none configurable in v1):

- `https://` only; `http://` rejected.
- DNS pre-resolution rejects any A/AAAA in private (`10/8`, `172.16/12`, `192.168/16`, `100.64/10`), loopback (`127/8`, `::1`), link-local (`169.254/16`, `fe80::/10`), or unspecified ranges.
- `Dialer.Control` re-checks the resolved IP at connect time so DNS rebinding (public IP at preflight, private IP at connect) trips the same guard.
- 10 MB body cap via `http.MaxBytesReader`.
- Redirects refused so an attacker-controlled upstream cannot 302 to a private URL.

External `$ref` resolution in OpenAPI is disabled at the parser level (kin-openapi's `IsExternalRefsAllowed: false`), so a malicious spec containing `$ref: "http://169.254.169.254/..."` never triggers an outbound HTTP call from the parser either.

OpenAPI validation runs structurally: the document must be valid 3.x JSON/YAML, operation IDs unique within a spec, `$ref` targets resolved internally, parameter and response shapes well-formed. Three categories of strict-validation checks are intentionally relaxed to match what Swagger UI / Postman / Insomnia accept and what real vendor specs commonly ship with: (1) example-vs-schema drift (e.g. a property typed `object` with a string example), (2) schema patterns using ECMA regex constructs Go's `regexp` doesn't support (lookahead, named backrefs), and (3) default-value drift. Examples and defaults are documentation hints, not part of the wire contract; structural integrity that matters for invocation still fails the upload.

## Live reload

Any catalog or spec mutation calls `Toolkit.ReloadConnectionsByCatalog(catalogID)` on every live api-gateway toolkit. The toolkit iterates its registered connections, finds those whose `cfg.CatalogID == catalogID`, and atomically rebuilds each one: re-parses the catalog's specs, retains the new `*openapi3.T` per spec on the conn, and lets the embedding cache invalidate naturally (a new conn instance starts with empty embeddings). The model sees the new spec content on the next `api_list_endpoints` / `api_get_endpoint_schema` call without a process restart.

## Operation index

`OperationSummary` gains a `Spec` field set to the component spec name (`drive`, `calendar`, `gmail`, etc. for a Google Workspace catalog). When the catalog has a single anonymous spec the field stays empty and is omitted from JSON. When two component specs in one catalog define the same `operation_id`, both summaries appear in `api_list_endpoints` with distinct `spec` values. `api_invoke_endpoint` and `api_get_endpoint_schema` accept an optional `spec` argument for disambiguation; ambiguous calls return a structured error listing the candidates.

## api_get_endpoint_schema

New MCP tool that returns full per-endpoint detail:

- **Input**: `connection` (required), `operation_id` (required), `spec` (optional — required when ambiguous).
- **Output**: `spec`, `operation_id`, `method`, `path`, `summary`, `description`, `parameters[]`, `request_body`, `responses[]`, `examples`.
- **Stripped**: top-level `security`, `securitySchemes`, `servers`, and vendor extensions whose keys start with `x-amazon-`, `x-google-`, `x-azure-`, `x-apigateway-`. These describe how an external client should authenticate, which is irrelevant — the connection is pre-authenticated and the same catalog might back connections using different auth modes.
- **Truncation**: response capped at ~50 KB after marshal. Deeper schemas elide parameters/request_body/responses with a `note` field; the model can still fall back to `api_invoke_endpoint` to probe shape directly. Schema walking caps at depth 8 to keep recursive types bounded.

## Backward compatibility

The legacy `connection_instances.config.openapi_spec` JSONB key is no longer read by the toolkit. The migration leaves existing values in place — operators see a banner in the portal connection editor prompting them to create a catalog and move the content over. Until they do, `api_list_endpoints` returns an empty list with a note.

## Admin REST routes

| Method | Path | Purpose |
|---|---|---|
| `GET` | `/api/v1/admin/api-catalogs` | List catalogs with spec_count and ref_count |
| `POST` | `/api/v1/admin/api-catalogs` | Create catalog header |
| `GET` | `/api/v1/admin/api-catalogs/{id}` | Catalog detail |
| `PUT` | `/api/v1/admin/api-catalogs/{id}` | Partial update |
| `DELETE` | `/api/v1/admin/api-catalogs/{id}` | Delete (refused while any connection references) |
| `POST` | `/api/v1/admin/api-catalogs/{id}/clone` | Clone catalog + all specs to new id |
| `GET` | `/api/v1/admin/api-catalogs/{id}/specs` | List specs (metadata only) |
| `GET` | `/api/v1/admin/api-catalogs/{id}/specs/{spec}` | Spec content |
| `PUT` | `/api/v1/admin/api-catalogs/{id}/specs/{spec}` | Upsert (inline or URL source) |
| `PUT` | `/api/v1/admin/api-catalogs/{id}/specs/{spec}/upload` | Multipart upload |
| `POST` | `/api/v1/admin/api-catalogs/{id}/specs/{spec}/refresh` | Re-fetch URL-sourced |
| `POST` | `/api/v1/admin/api-catalogs/{id}/specs/{spec}/reembed` | Manual retry: enqueue a manual_retry embedding job |
| `GET`  | `/api/v1/admin/api-catalogs/{id}/embedding-health` | Catalog-level roll-up (total / indexed / pending / running / failed) |
| `GET`  | `/api/v1/admin/api-catalogs/{id}/embedding-status` | Per-spec embedding state (operation_count, embedding_count, embedded_so_far, last job state). `embedded_so_far` is the worker's in-flight chunk counter that ticks up during a running embed pass; `embedding_count` jumps from 0 to N only on the final atomic upsert. |
| `GET`  | `/api/v1/admin/api-catalogs/{id}/embedding-jobs` | Recent embedding job history (filterable by status / spec_name) |
| `GET`  | `/api/v1/admin/embedding/status` | Platform-wide embedding provider status: `{kind, model, dimension, status}`. `status="unconfigured"` indicates the noop placeholder is in use (memory.embedding.provider unset); the portal renders an amber banner in this state and the apigateway embed-job queue refuses to start. |
| `DELETE` | `/api/v1/admin/api-catalogs/{id}/specs/{spec}` | Delete one spec |

### Persisted operation embeddings + job queue

Semantic and hybrid ranking modes on `api_list_endpoints` need a vector per operation. The toolkit stores these in PostgreSQL: migration `000044_api_catalog_operation_embeddings` creates `api_catalog_operation_embeddings(catalog_id, spec_name, operation_id, text_hash, embedding vector(768), model, dim, created_at, updated_at)` with `ON DELETE CASCADE` from `api_catalog_specs`. Embedding work runs through a Postgres-backed job queue introduced in migration `000045_api_catalog_embedding_jobs`: `api_catalog_embedding_jobs(id, catalog_id, spec_name, kind, status, attempts, last_error, next_run_at, worker_id, lease_expires_at, ...)` with a partial unique index on `(catalog_id, spec_name) WHERE status IN ('pending','running')` that makes producer enqueues idempotent.

Four components run alongside the admin server:

- **Producer**: every spec write (`upsert`, `upload`, `refresh`, `clone`) inserts a job row alongside the spec row and issues `pg_notify(api_catalog_embedding_jobs)`. ON CONFLICT against the partial unique index collapses duplicate enqueues to a no-op.
- **Worker**: pulls one job at a time via `SELECT id FROM api_catalog_embedding_jobs WHERE status='pending' AND next_run_at <= NOW() FOR UPDATE SKIP LOCKED`, takes a 10-minute lease, calls `apigateway.ComputeOperationEmbeddings`, writes vectors to `api_catalog_operation_embeddings`, marks the job succeeded. SKIP LOCKED makes multi-pod operation lock-free.
- **Reaper**: sweeps every 30s; rows in status=running with elapsed `lease_expires_at` flip back to pending. Pod-crash recovery is automatic.
- **Reconciler**: runs on every pod boot and every 5 min. One SQL statement enqueues jobs for any spec where `api_catalog_specs.operation_count <> COUNT(*) FROM api_catalog_operation_embeddings`. This is the convergence backstop.

Retryable failures back off exponentially (5s, 10s, 20s, 40s, 80s) up to 5 attempts; terminal failures move to status=failed with the last error on the row, visible in the portal as a red badge with a Retry button. The portal polls embedding-health and embedding-status every 5s while the catalog editor is open; operators see indexing progress live and never need to click anything to populate embeddings (the Retry button only appears on failed rows).

---

# Links

- GitHub: https://github.com/txn2/mcp-data-platform
- Documentation: https://mcp-data-platform.txn2.com/
- DataHub: https://datahubproject.io/
- MCP Specification: https://modelcontextprotocol.io/
- Security Article: https://imti.co/mcp-defense/
