# GoldenMatch

> Entity resolution toolkit — deduplicate records and match across datasets using fuzzy, probabilistic, and LLM-powered scoring.

## Interfaces
- MCP Server: `goldenmatch mcp-serve` (14 agent tools + 17 data tools + 5 memory tools = 36 total)
- Remote MCP: https://goldenmatch-mcp-production.up.railway.app/mcp/ (36 tools, Smithery: https://smithery.ai/servers/benseverndev-oss/goldenmatch)
- A2A Server: `goldenmatch agent-serve --port 8200` (12 skills)
- CLI: `goldenmatch dedupe`, `goldenmatch autoconfig`, `goldenmatch match`, `goldenmatch memory ...`, + more
- Python API: `import goldenmatch` -- `dedupe_df()`, `match_df()`, `score_strings()`, `evaluate()`, `AgentSession.autoconfigure()`, `add_correction()`, `learn()`, `memory_stats()`, `get_memory()`, ~106 exports
- REST API: `goldenmatch serve` on port 8000 (incl. `POST /autoconfig`, `GET /controller/telemetry`)
- SQL: Postgres extension + DuckDB UDFs at `packages/rust/extensions/` (`goldenmatch_autoconfig`, `goldenmatch_dedupe_full`, `gm_telemetry`)

## AutoConfigController telemetry (v1.7-v1.12, cross-surface)

Every interface above returns the same JSON shape, produced by `goldenmatch.web.controller_telemetry.serialize_telemetry`: `{stop_reason, health, scoring, blocking, cluster, column_priors, decisions, committed_matchkeys, negative_evidence}`. Write one parser and reuse it across web / TUI / CLI / SQL / MCP / A2A / REST.
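
A minimal sketch of such a shared parser, using only the standard library. The nine top-level keys come from the shape above; the value types are not specified here, so they are typed as `Any`. The names `Telemetry` and `parse_telemetry` are hypothetical and not part of the GoldenMatch API.

```python
import json
from dataclasses import dataclass
from typing import Any, Union

# Top-level keys of the telemetry shape listed above.
TELEMETRY_KEYS = (
    "stop_reason", "health", "scoring", "blocking", "cluster",
    "column_priors", "decisions", "committed_matchkeys", "negative_evidence",
)


@dataclass
class Telemetry:
    # Value types are an assumption: the shape only names the keys.
    stop_reason: Any
    health: Any
    scoring: Any
    blocking: Any
    cluster: Any
    column_priors: Any
    decisions: Any
    committed_matchkeys: Any
    negative_evidence: Any


def parse_telemetry(payload: Union[str, dict]) -> Telemetry:
    """Parse a telemetry payload (JSON text or an already-decoded dict)."""
    data = json.loads(payload) if isinstance(payload, str) else payload
    missing = [key for key in TELEMETRY_KEYS if key not in data]
    if missing:
        raise ValueError(f"telemetry payload is missing keys: {missing}")
    # Extra keys are ignored so the parser tolerates additions to the shape.
    return Telemetry(**{key: data[key] for key in TELEMETRY_KEYS})
```

The same function can be fed the body of `GET /controller/telemetry`, the output of `gm_telemetry`, or an MCP tool result, provided each is decoded to JSON text or a dict first.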

## Install
- `pip install goldenmatch`
- Quality scanning: `pip install goldenmatch[quality]`
- Data transforms: `pip install goldenmatch[transform]`
- Embeddings: `pip install goldenmatch[embeddings]`

## Quick Examples

### Deduplicate a CSV (zero-config)
```python
import goldenmatch as gm
result = gm.dedupe("customers.csv")
result.golden.write_csv("deduped.csv")
print(f"{result.total_clusters} clusters, {result.match_rate:.1%} match rate")
```

### Deduplicate with explicit config
```python
import goldenmatch as gm
result = gm.dedupe("customers.csv",
    exact=["email"],
    fuzzy={"name": 0.85, "address": 0.80},
    blocking=["zip"],
)
```

### Match across two files
```python
import goldenmatch as gm
result = gm.match("file_a.csv", "file_b.csv", fuzzy={"name": 0.85})
```

### Privacy-preserving linkage (no raw data shared)
```python
import goldenmatch as gm
result = gm.pprl_link("hospital_a.csv", "hospital_b.csv",
    fields=["first_name", "last_name", "dob", "zip"])
```

### Evaluate accuracy
```python
import goldenmatch as gm
metrics = gm.evaluate("data.csv", config="config.yaml", ground_truth="gt.csv")
print(f"F1: {metrics['f1']:.1%}, Precision: {metrics['precision']:.1%}")
```
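
For orientation, the sketch below shows how precision, recall, and F1 are commonly computed for entity resolution, assuming a pairwise evaluation over unordered record-ID pairs. Whether `gm.evaluate()` scores pairs or clusters is not stated here; `pairwise_metrics` is a hypothetical helper, not a GoldenMatch function.

```python
def pairwise_metrics(predicted_pairs, truth_pairs):
    """Precision / recall / F1 over unordered record-ID pairs."""
    # frozenset makes (1, 2) and (2, 1) the same pair.
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in truth_pairs}

    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Two of three predicted pairs are correct; two of three true pairs are found.
metrics = pairwise_metrics(
    predicted_pairs=[(1, 2), (2, 3), (4, 5)],
    truth_pairs=[(2, 1), (4, 5), (6, 7)],
)
print(f"F1: {metrics['f1']:.1%}, Precision: {metrics['precision']:.1%}")
```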

## Config Template (YAML)

```yaml
matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]

  - name: fuzzy_name
    type: weighted
    threshold: 0.85
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.5
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.3
      - field: zip
        scorer: exact
        weight: 0.2

blocking:
  strategy: adaptive
  keys:
    - fields: [zip]

golden_rules:
  default_strategy: most_complete
```
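
To make the `fuzzy_name` matchkey concrete, here is a sketch of how a weighted matchkey could combine per-field similarities. It assumes the score is a weight-normalized sum compared against `threshold`; the source does not spell out the exact formula, and `weighted_score` / `is_match` are hypothetical names. The per-field similarities are supplied by hand rather than computed with `jaro_winkler`.

```python
# Weights and threshold copied from the fuzzy_name matchkey above.
WEIGHTS = {"first_name": 0.5, "last_name": 0.3, "zip": 0.2}
THRESHOLD = 0.85


def weighted_score(similarities, weights=WEIGHTS):
    """Weight-normalized sum of per-field similarities (each in [0, 1])."""
    total_weight = sum(weights.values())
    return sum(similarities[field] * w for field, w in weights.items()) / total_weight


def is_match(similarities, threshold=THRESHOLD):
    return weighted_score(similarities) >= threshold


# Close first name, identical last name and zip: about 0.95, above the threshold.
same_person = {"first_name": 0.9, "last_name": 1.0, "zip": 1.0}
# Same first-name similarity, weaker last name, different zip: about 0.69, below it.
different_person = {"first_name": 0.9, "last_name": 0.8, "zip": 0.0}

print(is_match(same_person), is_match(different_person))
```

Under this assumption, the `zip` field alone cannot produce a match, but a zip mismatch removes 0.2 from the score, which is enough to push a borderline name pair below 0.85.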

## Key Types

- `DedupeResult` — `.golden` (DataFrame), `.dupes`, `.unique`, `.clusters` (dict), `.scored_pairs` (list), `.stats`, `.total_clusters`, `.match_rate`
- `MatchResult` — same shape as DedupeResult for cross-file matching
- `GoldenMatchConfig` — Pydantic model, loadable from YAML via `gm.load_config("config.yaml")`

## Performance Limits
- In-memory: up to ~500K records. Use DuckDB backend or chunked mode for larger datasets
- Exact dedupe on 1M records: ~7.8s. Fuzzy on 100K records: ~12.8s
- LLM scorer: ~$0.04 per dataset (budget-capped, opt-in)
- PPRL auto-config: 92.4% F1 on FEBRL4

## Scorers
exact, jaro_winkler, levenshtein, token_sort, ensemble, dice, jaccard, soundex_match, embedding, record_embedding, name_freq_weighted_jw, given_name_aliased_jw

## Transforms
lowercase, uppercase, strip, soundex, metaphone, digits_only, alpha_only, normalize_whitespace, token_sort, first_token, last_token, substring:start:end, legal_form_strip, address_normalize, naics_normalize
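
The sketch below illustrates how a `transforms` list such as `[lowercase, strip]` can be applied in order to a field value. The implementations are illustrative stand-ins for a few of the names above, not GoldenMatch's own code, and treating `substring:start:end` as a Python slice is an assumption.

```python
import re

# Illustrative stand-ins for a subset of the transform names listed above.
TRANSFORMS = {
    "lowercase": str.lower,
    "uppercase": str.upper,
    "strip": str.strip,
    "digits_only": lambda s: re.sub(r"\D", "", s),
    "alpha_only": lambda s: re.sub(r"[^A-Za-z]", "", s),
    "normalize_whitespace": lambda s: " ".join(s.split()),
    "first_token": lambda s: s.split()[0] if s.split() else "",
    "last_token": lambda s: s.split()[-1] if s.split() else "",
}


def apply_transforms(value, names):
    """Apply each named transform to the value, left to right."""
    for name in names:
        if name.startswith("substring:"):
            # Assumption: start/end follow Python slice semantics.
            _, start, end = name.split(":")
            value = value[int(start):int(end)]
        else:
            value = TRANSFORMS[name](value)
    return value


print(apply_transforms("  Alice@Example.COM ", ["lowercase", "strip"]))  # alice@example.com
print(apply_transforms("(555) 123-4567", ["digits_only"]))               # 5551234567
print(apply_transforms("90210-1234", ["substring:0:5"]))                 # 90210
```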

## Bundled Reference Data
Five OSS packs ship with the wheel; auto-config swaps the matching scorer/transform in when the column name pattern matches AND the profiled `col_type` agrees:
- Surnames (US Census 2010, top 10K) → `name_freq_weighted_jw` on last_name/surname columns. Lifts F1 0.667→0.915 on the common-name FP fixture.
- Given-name aliases (~140 pairs: William↔Bill, Katherine↔Kate, ...) → `given_name_aliased_jw` on first_name/given_name columns.
- Business legal forms (Inc, LLC, Ltd, GmbH, S.A., ...) → prepends `legal_form_strip` on company/business/org/firm/legal_name columns.
- USPS Pub. 28 addresses → prepends `address_normalize` on address/street/addr_line/mailing_address columns. Handles `#5`→`apt 5`, `P.O. Box`→`PO Box`.
- NAICS 2022 industries (2,125 codes, all 5 hierarchy levels) → prepends `naics_normalize` on naics/sic/industry_code/business_type columns.

The `col_type` gate (PR #224) skips the refinement when column-name regex matches but profiled shape disagrees — a `last_name` column holding numeric IDs keeps its caller-specified scorer. See `docs/reference-data`.
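
The gate can be pictured as a two-condition check, sketched here for the surname pack only. The regex, the `"name"` type label, and the function `refine_scorer` are hypothetical; only the scorer name and the both-conditions-must-hold rule come from the text above.

```python
import re

# Hypothetical column-name pattern for the surname pack.
SURNAME_PATTERN = re.compile(r"(last_name|surname)", re.IGNORECASE)


def refine_scorer(column_name, col_type, current_scorer):
    """Swap in the reference-data scorer only if name AND profiled type agree."""
    name_matches = bool(SURNAME_PATTERN.search(column_name))
    type_agrees = col_type == "name"  # hypothetical col_type label
    if name_matches and type_agrees:
        return "name_freq_weighted_jw"
    # Gate closed: keep whatever the caller specified.
    return current_scorer


print(refine_scorer("last_name", "name", "jaro_winkler"))  # name_freq_weighted_jw
print(refine_scorer("last_name", "numeric", "exact"))      # exact
```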

## Learning Memory (v1.6.0)
Persistent corrections + threshold learning. Off by default; enable with `memory.enabled = true`.
- Store: SQLite (default) or Postgres. Path: `.goldenmatch/memory.db`.
- Collection points: review queue, boost tab, unmerge_record/cluster, LLM scorer, MCP `agent_approve_reject`, REST `/reviews/decide`, Python `add_correction()`.
- Re-anchors via `record_hash`; ambiguous rehydrations report `stale_ambiguous`. Postflight reports `Memory: N applied, M stale, K stale-ambiguous, J unanchorable`.
- CLI: `goldenmatch memory stats|learn|export|import|show`.
- Python: `goldenmatch.add_correction(...)`, `learn()`, `memory_stats()`, `get_memory()`. Result objects expose `result.memory_stats`.
- MCP tools: `list_corrections`, `add_correction`, `learn_thresholds`, `memory_stats`, `memory_export`.
- Learner runs per matchkey once it has at least `learning.threshold_min_corrections` corrections (default 10), using a trust-weighted grid search.
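
As a rough illustration of the last point, the sketch below grid-searches a threshold for one matchkey from stored corrections. It assumes each correction carries a pair score, a match / non-match verdict, and a trust weight, and that the objective is trust-weighted agreement; the actual learner's objective, grid, and data model are not described here. `learn_threshold` is a hypothetical name, distinct from the library's `learn()`.

```python
def learn_threshold(corrections, min_corrections=10, grid=None):
    """Pick the threshold that best agrees with trust-weighted corrections.

    corrections: list of (score, is_match, trust) tuples for one matchkey.
    Returns None when there are too few corrections to learn from.
    """
    if len(corrections) < min_corrections:
        return None
    if grid is None:
        grid = [i / 100 for i in range(50, 100)]  # 0.50, 0.51, ..., 0.99

    def weighted_agreement(threshold):
        # A correction agrees when "score >= threshold" equals its verdict.
        return sum(
            trust
            for score, is_match, trust in corrections
            if (score >= threshold) == is_match
        )

    # max() keeps the first best value, i.e. the lowest threshold among ties.
    return max(grid, key=weighted_agreement)


corrections = (
    [(0.90, True, 1.0)] * 5     # confirmed matches
    + [(0.70, False, 1.0)] * 5  # rejected pairs
)
print(learn_threshold(corrections))      # a value above 0.70 and at most 0.90
print(learn_threshold(corrections[:3]))  # None: below min_corrections
```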

## Docs
- [Learning Memory](https://benseverndev-oss.github.io/goldenmatch/learning-memory)
- [Full docs](https://benseverndev-oss.github.io/goldenmatch/): 23 guides
- [Full API reference](https://benseverndev-oss.github.io/goldenmatch/python-api): 101 exports
- [PyPI](https://pypi.org/project/goldenmatch/)
- [GitHub](https://github.com/benseverndev-oss/goldenmatch)
