# gsc-mcp Architecture Quick Reference for AI Assistants
# Source: 15 ADRs + code archaeology (2026-06-26, v1.0.0)
# Load in CLAUDE.md: @docs/machine-readable/llms.txt

## What This Project Is

Python MCP server exposing 57 Google Search Console / GA4 / CrUX / IndexNow / drift-monitoring / content-audit / AI-visibility tools to AI agents.
Package: gsc-mcp-tools on PyPI. Entrypoints: gsc-mcp and gsc-mcp-tools (both → gsc_mcp.server:main).
Framework: FastMCP. Python 3.11+. MIT license.

## Module Map

ENTRY POINT: src/gsc_mcp/server.py : FastMCP("gsc-mcp") + iterates registry.TOOLS to register all 57 tools.
REGISTRY: src/gsc_mcp/registry.py : Single source of truth. Dict[str, Callable] keyed by function name.
  assert set(TOOLS) == set(_ALL_TOOLS) at import time; any mismatch fails loudly.
  CLI (gsc-cli) + MCP server both read TOOLS. Adding a tool = update registry.py + properties._ALL_TOOLS only.

AUTH: src/gsc_mcp/auth.py : Three API service helpers:
  get_searchconsole_service() → Search Console API + Indexing API
  get_ga4_service()           → GA4 Data API v1 (gRPC)
  get_alpha_ga4_service()     → GA4 Data API v1alpha (for ga4_funnel only)
  Resolution order: GSC_SERVICE_ACCOUNT_PATH env → OAuth JSON token at platformdirs user_data_dir.
  Token: JSON (NOT pickle), chmod 0o600, written atomically via tempfile + os.replace() (TOCTOU fix).
  Directory: chmod 0o700.
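
A minimal sketch of the token-write pattern described above, using only the standard library. The function name `write_token_atomically` and the temp-file prefix are illustrative assumptions; the actual code in auth.py may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def write_token_atomically(token: dict, path: Path) -> None:
    """Write token as JSON with owner-only permissions, replacing path atomically."""
    path.parent.mkdir(parents=True, exist_ok=True)
    os.chmod(path.parent, 0o700)
    # mkstemp creates the file readable and writable only by the owner, in the
    # same directory as the target, so os.replace() stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".token-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(token, handle)
        os.chmod(tmp_name, 0o600)
        os.replace(tmp_name, path)  # readers see the old file or the new one, never a partial write
    except BaseException:
        # Do not leave a half-written temp file behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because the permissions are set on the temp file before the rename, the token is never visible at its final path with looser permissions.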

OUTPUT CONTRACT: src/gsc_mcp/meta.py : with_meta(data, tool, params). Every tool returns:
  json.dumps(with_meta(data, tool="tool_name", params={...}))
  Data keys spread at top level. _meta block appended with tool name and call params.
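
The output contract above could look roughly like this. The sketch assumes `_meta` holds exactly the keys `tool` and `params`; the real meta.py may add more fields.

```python
import json
from typing import Any, Dict


def with_meta(data: Dict[str, Any], tool: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Spread data keys at the top level and append a _meta block."""
    # Assumed shape of _meta: tool name plus the call parameters.
    return {**data, "_meta": {"tool": tool, "params": params}}


payload = json.dumps(
    with_meta(
        {"rows": [], "verdict": "healthy"},
        tool="sitemap_audit",
        params={"site_url": "https://example.com/"},
    )
)
```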

RETRY: src/gsc_mcp/retry.py : @with_retry(max_retries=3, base_delay=1.0). Apply to any direct Google API call.
  Retries on: HttpError [429, 500, 502, 503, 504], ServiceUnavailable, ResourceExhausted, InternalServerError, BadGateway, RetryError.
  Does NOT retry other 4xx.
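
A hedged sketch of a decorator with the same signature as the one described above. Several parts are assumptions: `FakeHttpError` is a hypothetical stand-in for googleapiclient's HttpError, the delay doubles on each attempt (the source only names `base_delay`), and the injectable `sleep` argument exists only to make the sketch testable.

```python
import functools
import time
from typing import Any, Callable

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class FakeHttpError(Exception):
    """Hypothetical stand-in for googleapiclient.errors.HttpError."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def with_retry(
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable:
    """Retry retryable statuses with exponential backoff; re-raise everything else."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except FakeHttpError as exc:
                    # Other 4xx codes and the final attempt propagate unchanged.
                    if exc.status not in RETRYABLE_STATUS or attempt == max_retries:
                        raise
                    sleep(base_delay * 2 ** attempt)  # assumed backoff: 1s, 2s, 4s

        return wrapper

    return decorator
```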

QUOTA: src/gsc_mcp/quota.py : QuotaTracker singleton. Indexing API limit: 200 req/day, warn at 180.
  In-memory only (resets on restart).
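
The warn-then-block behaviour could be modelled as below. This is a sketch under stated assumptions: the status strings `ok`, `warn`, and `exceeded` are hypothetical, and the daily reset is not modelled because the source only says state is lost on restart.

```python
from dataclasses import dataclass


@dataclass
class QuotaTracker:
    """In-memory request counter; state is lost when the process restarts."""

    limit: int = 200    # Indexing API requests per day
    warn_at: int = 180  # start warning at this count
    used: int = 0

    def record(self, n: int = 1) -> str:
        """Add n requests and return 'ok', 'warn', or 'exceeded' (assumed labels)."""
        self.used += n
        if self.used > self.limit:
            return "exceeded"
        if self.used >= self.warn_at:
            return "warn"
        return "ok"


tracker = QuotaTracker()  # module-level instance acts as the singleton
```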

TOOLS:

  analytics.py (10 tools):
    get_search_analytics, get_advanced_search_analytics, get_performance_overview,
    get_search_by_page_query, compare_search_periods, analytics_anomalies,
    discover_performance, news_performance, search_type_breakdown, ai_overviews_impact
    Core: _fetch_rows() wrapped with @with_retry : shared by all analytics + SEO tools.

  seo.py (7 tools):
    quick_wins, traffic_drops, seo_striking_distance, seo_cannibalization,
    seo_lost_queries, check_alerts, parasite_risk
    Built on _fetch_rows from analytics.py.
    parasite_risk: pure URL-path analysis (no HTTP fetch). Regex patterns for
      Google 2024-11-19 parasite SEO policy: sponsored/affiliate/partner paths (high),
      /advisor/ /underscored/ /select/ /commerce/ (medium), affiliate query params (low).
      site_risk = max risk across URLs. Verdicts: clean | at_risk | high_risk.
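
As an illustration of the parasite_risk tiers above, the sketch below classifies URLs by path and query string only. The regexes are simplified guesses (in particular the affiliate parameter names are hypothetical), and the mapping from risk level to verdict is an assumption: high gives `high_risk`, medium or low gives `at_risk`.

```python
import re
from typing import Iterable
from urllib.parse import urlsplit

_HIGH = re.compile(r"/(sponsored|affiliate|partner)s?(/|$)", re.I)
_MEDIUM = re.compile(r"/(advisor|underscored|select|commerce)(/|$)", re.I)
_LOW = re.compile(r"(^|&)(aff|affiliate|ref)(_id)?=", re.I)  # hypothetical param names
_RANK = {"none": 0, "low": 1, "medium": 2, "high": 3}


def url_risk(url: str) -> str:
    """Classify one URL from its path and query string; no HTTP fetch."""
    parts = urlsplit(url)
    if _HIGH.search(parts.path):
        return "high"
    if _MEDIUM.search(parts.path):
        return "medium"
    if _LOW.search(parts.query):
        return "low"
    return "none"


def site_verdict(urls: Iterable[str]) -> str:
    """site_risk is the maximum risk across URLs, mapped to a verdict."""
    worst = max((url_risk(u) for u in urls), key=_RANK.__getitem__, default="none")
    if worst == "high":
        return "high_risk"
    return "clean" if worst == "none" else "at_risk"
```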

  inspection.py (3 tools): inspect_url, batch_url_inspection, check_indexing_issues

  indexing.py (3 tools): submit_url, submit_batch, indexnow_submit
    submit_batch: true multipart HTTP batch via svc.new_batch_http_request(), 100 URLs/request.
    indexnow_submit: POST to https://api.indexnow.org/indexnow (not Google, targets Bing/Yandex/Seznam/Naver).
      validate_url_strict per URL (SSRF-safe). No @with_retry (not a Google API).
      Verdicts: ok (200/202, 0 skipped) | partial (200/202, some skipped) | error.
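
The indexnow_submit flow above (validate each URL, post the accepted ones, derive a verdict) can be sketched without any network call. The helper names are hypothetical, and `is_safe` stands in for validate_url_strict; the payload keys `host`, `key`, and `urlList` follow the public IndexNow protocol.

```python
from typing import Callable, Dict, List, Tuple


def split_urls(urls: List[str], is_safe: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Partition urls into (accepted, skipped) using a per-URL safety check."""
    accepted: List[str] = []
    skipped: List[str] = []
    for url in urls:
        (accepted if is_safe(url) else skipped).append(url)
    return accepted, skipped


def build_payload(host: str, key: str, accepted: List[str]) -> Dict[str, object]:
    """JSON body for a bulk IndexNow submission."""
    return {"host": host, "key": key, "urlList": accepted}


def indexnow_verdict(status_code: int, skipped_count: int) -> str:
    """ok: 200/202 with nothing skipped; partial: 200/202 with skips; error otherwise."""
    if status_code not in (200, 202):
        return "error"
    return "partial" if skipped_count else "ok"
```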

  sitemaps.py (4 + 1 special):
    list_sitemaps, sitemaps_get, sitemaps_delete, submit_sitemap
    sitemap_audit: defusedxml + SSRF origin check + follow_redirects=False + 90-day GSC cross-reference.
    Verdicts: empty | fetch_error | partial (>20% URLs absent) | healthy.
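
The sitemap_audit verdict thresholds can be expressed as a small pure function. This is a sketch: the function name, argument shapes, and the order in which the conditions are checked are assumptions; only the four verdict labels and the 20% threshold come from the description above.

```python
from typing import Iterable, Set


def sitemap_verdict(sitemap_urls: Iterable[str], gsc_urls: Set[str], fetch_ok: bool = True) -> str:
    """Map a sitemap/GSC cross-reference to empty | fetch_error | partial | healthy."""
    if not fetch_ok:
        return "fetch_error"
    urls = list(sitemap_urls)
    if not urls:
        return "empty"
    absent = sum(1 for url in urls if url not in gsc_urls)
    # partial only when strictly more than 20% of sitemap URLs are absent from GSC data
    return "partial" if absent / len(urls) > 0.20 else "healthy"
```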

  properties.py (3 tools): list_properties, get_site_details, get_capabilities
    _ALL_TOOLS list lives here. Update it when adding tools.

  ga4.py (7 tools):
    ga4_traffic_sources, ga4_organic_landing_pages, ga4_page_performance,
    ga4_user_behavior, ga4_realtime, ga4_conversion_funnel, ga4_funnel
    ga4_funnel uses get_alpha_ga4_service() (v1alpha). All accept hostname + country filters.
    Filter helper: _build_dimension_filter(hostname, country, base_filter).
    env: GA4_PROPERTY_ID (validated lazily, per-call override supported).

  cross.py (4 tools):
    traffic_health_check, page_analysis, content_brief, page_health_score
    Compose analytics.py + ga4.py. Join via _normalize_url() (strips scheme/host/query/trailing-slash).
    engagement_rate = engaged_sessions / sessions.
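
A minimal sketch of the join key and the engagement formula described above. The function name `normalize_url` (without the leading underscore) marks it as an illustration rather than the project's helper, and returning 0.0 when `sessions` is zero is an assumption.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a GSC URL or GA4 pagePath to a bare path usable as a join key."""
    # urlsplit drops scheme, host, and query from .path; rstrip removes the trailing slash.
    path = urlsplit(url).path.rstrip("/")
    return path or "/"


def engagement_rate(engaged_sessions: int, sessions: int) -> float:
    """engaged_sessions / sessions, with 0.0 when there are no sessions (assumed)."""
    return engaged_sessions / sessions if sessions else 0.0
```

With this normalization, a GSC row for `https://example.com/pricing/?utm_source=x` and a GA4 row for `/pricing` share the same key.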

  crux.py (3 tools): crux_page_vitals, crux_history, crux_lcp_subparts
    HTTP client: httpx (POST to Chrome UX Report API). Auth: CRUX_API_KEY (plain Google API key).
    HTTP 404 = verdict "not_enough_data" (not an error condition).
    crux_lcp_subparts: requests 5 metrics (lcp + 4 subparts). Missing CRUX_API_KEY = verdict "missing_key".
      Subparts: ttfb_ms, resource_load_delay_ms, resource_load_duration_ms, render_delay_ms.
      dominant_phase = short key name of subpart with highest p75 value.
      Verdicts: good | needs_improvement | poor | not_enough_data | missing_key | fetch_error.
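
The dominant_phase rule for crux_lcp_subparts could be sketched as below. It assumes that "short key name" means the subpart key without its `_ms` suffix, and that subparts with no p75 value are ignored; neither detail is confirmed by the source.

```python
from typing import Dict, Optional

SUBPARTS = (
    "ttfb_ms",
    "resource_load_delay_ms",
    "resource_load_duration_ms",
    "render_delay_ms",
)


def dominant_phase(p75: Dict[str, Optional[float]]) -> Optional[str]:
    """Return the subpart with the highest p75 value, or None if none are present."""
    present = {k: v for k, v in p75.items() if k in SUBPARTS and v is not None}
    if not present:
        return None
    # Assumed short form: strip the "_ms" unit suffix from the key.
    return max(present, key=present.get).removesuffix("_ms")
```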

  technical.py (5 tools): schema_validate, schema_generate, ai_visibility_audit, gbp_deprecation_lint, pagespeed_audit
    schema_validate: httpx fetch + html.parser (stdlib) extraction of <script type="application/ld+json">.
    No auth needed. Validates: Article, LocalBusiness, FAQPage, Product, WebSite, BreadcrumbList,
    SoftwareApplication, BlogPosting. SSRF-checked via url_safety.validate_url_strict.
    _DEPRECATED_RICH_RESULTS: FAQPage (May 2026), HowTo (Sep 2023), ClaimReview/EstimatedSalary/
      VehicleListing/SpecialAnnouncement (June 2025). Each schema dict includes deprecated_rich_result field.
    schema_generate: builds JSON-LD blocks for Reservation, OrderAction, DiscussionForumPosting, ProfilePage.
    No network calls. No auth needed.
    ai_visibility_audit: fetches {origin}/robots.txt via safe_fetch_html, parses with urllib.robotparser.
      Checks 9 AI crawlers: GPTBot, Anthropic-ai, Claude-User, PerplexityBot, CCBot, Google-Extended,
      cohere-ai, Bytespider, OAI-SearchBot. Also checks {origin}/llms.txt for MCP discoverability.
      Verdicts: open | partial | closed | fetch_error.
      Mock targets: gsc_mcp.tools.technical.safe_fetch_html, gsc_mcp.tools.technical.validate_url_strict.
    gbp_deprecation_lint: safe_fetch_html + 5 regex patterns for deprecated GBP features.
      Patterns: .business.site links (sunset Mar 2024), Reserve with Google (deprecated Jun 2025),
      appointments widget, Maps Reserve flow, GBP chat widget.
      Verdicts: clean | deprecated_found | fetch_error. No auth needed.
    pagespeed_audit(url, strategy="mobile"): calls PSI API v5 via httpx.
      Returns Lighthouse performance score (0-100), 6 Lighthouse lab metrics (fcp/lcp/tbt/cls/speed_index/tti; only lcp and cls are Core Web Vitals),
      top 3 opportunities sorted by score ascending.
      Requires GOOGLE_API_KEY env var. Missing key = verdict "missing_key".
      Verdicts: good (>=90) | needs_improvement (50-89) | poor (<50) | missing_key | fetch_error.
      Mock target: gsc_mcp.tools.technical.httpx.Client (context manager pattern).
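
For ai_visibility_audit, the robots.txt check described above can be sketched with the standard-library parser. The crawler list is taken from the description; the helper names and the verdict mapping (all allowed gives `open`, none allowed gives `closed`, anything else gives `partial`) are assumptions.

```python
from typing import Dict
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = [
    "GPTBot", "Anthropic-ai", "Claude-User", "PerplexityBot", "CCBot",
    "Google-Extended", "cohere-ai", "Bytespider", "OAI-SearchBot",
]


def crawler_access(robots_txt: str, origin: str) -> Dict[str, bool]:
    """Report, per AI crawler, whether robots.txt allows fetching the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse already-fetched text, no network call
    return {bot: parser.can_fetch(bot, f"{origin}/") for bot in AI_CRAWLERS}


def visibility_verdict(access: Dict[str, bool]) -> str:
    """Assumed mapping from per-crawler access to open | partial | closed."""
    allowed = sum(access.values())
    if allowed == len(access):
        return "open"
    return "closed" if allowed == 0 else "partial"
```

Parsing text that was fetched separately keeps the SSRF-safe fetch and the robots logic independent, which also makes the parser easy to test.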

  url_safety.py (module, no tool): SSRF/DNS-rebinding protection.
    Public API: validate_url(), validate_url_strict(), safe_httpx_get(), safe_fetch_html().
    Blocks: private/loopback/reserved IPs, obfuscated IPv4 (decimal/hex/octal), link-local (169.254.x.x),
    multi-cloud metadata endpoints (AWS IMDS, Azure, GCP, Oracle, Alibaba).
    DNS pinning via _pin_dns() context manager (socket.getaddrinfo patch under lock).
    Use safe_fetch_html() for any tool that fetches an arbitrary user-supplied URL.
    Adapted from claude-seo (agricidaniel, MIT).
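
A simplified sketch of the IP-literal part of the blocklist above, using the standard `ipaddress` module. It is not the project's url_safety.py: the function name is hypothetical, and hostnames are passed through because they need the DNS resolution and pinning step, which is not shown here.

```python
import ipaddress
import re

_NUMERIC_LABEL = re.compile(r"^(0x[0-9a-f]+|\d+)$", re.I)


def is_blocked_host(host: str) -> bool:
    """Return True for IP literals that must never be fetched."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not a canonical IP literal. A host made only of numeric labels
        # (decimal, hex, or octal forms such as 2130706433 or 0x7f000001)
        # is treated as obfuscated IPv4 and rejected outright.
        return all(_NUMERIC_LABEL.match(label) for label in host.split("."))
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers 169.254.169.254 metadata endpoints
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```

Rejecting every non-canonical numeric host is stricter than decoding it, which avoids having to reproduce each resolver's parsing quirks.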

  drift.py (3 tools): drift_baseline, drift_compare, drift_history
    drift_baseline: fetch page via safe_fetch_html + store SEO snapshot in SQLite.
    drift_compare: fetch live + apply 17 rules (8 CRITICAL, 6 WARNING, 3 INFO). Dan Colta methodology.
    drift_history: read stored comparisons for a URL.
    Storage: sqlite3 in platformdirs.user_data_dir("gsc-mcp")/drift/baselines.db (WAL mode).
    CWV comparison optional, requires CRUX_API_KEY. No Google API calls otherwise.
    Adapted from claude-seo (Dan Colta, MIT).
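
The drift storage layer could be sketched as below. The table name, columns, and helper functions are hypothetical; only the use of sqlite3 in WAL mode and the idea of storing one SEO snapshot per URL come from the description above.

```python
import json
import sqlite3
from pathlib import Path
from typing import Any, Dict, Optional


def open_db(path: Path) -> sqlite3.Connection:
    """Open (or create) the baseline database in WAL mode."""
    path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS baselines ("
        "url TEXT PRIMARY KEY, snapshot TEXT NOT NULL, "
        "created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def save_baseline(conn: sqlite3.Connection, url: str, snapshot: Dict[str, Any]) -> None:
    """Store the snapshot as JSON, replacing any earlier baseline for the URL."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO baselines (url, snapshot) VALUES (?, ?)",
            (url, json.dumps(snapshot)),
        )


def load_baseline(conn: sqlite3.Connection, url: str) -> Optional[Dict[str, Any]]:
    """Return the stored snapshot for url, or None if no baseline exists."""
    row = conn.execute("SELECT snapshot FROM baselines WHERE url = ?", (url,)).fetchone()
    return json.loads(row[0]) if row else None
```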

  content.py (4 tools): content_quality, hreflang_audit, page_technical_audit, preload_audit
    content_quality: safe_fetch_html + stdlib TextExtractor (strips script/style/nav/footer). Scores:
      filler_score (hits per 1000 tokens * 25), information_density ((entities + numbers) per 100 tokens, / 10),
      repetition_score (bigram repetition * 100), overall_quality (weighted). Flags: filler, low-density,
      repetitive, thin-content. Verdicts: good | needs_work | thin_content | fetch_error.
      Filler phrase list adapted from claude-seo (agricidaniel, MIT). _AI_PATTERNS excluded (CC BY-SA 4.0).
    hreflang_audit: safe_fetch_html + MetaParser. Checks: self-ref, x-default, ISO 639-1 lang codes
      (jp→ja), ISO 3166-1 alpha-2 region codes (UK→GB), mixed HTTP/HTTPS protocols.
      Verdicts: valid | issues_found | no_hreflang | fetch_error.
      Rules adapted from claude-seo skills/seo-hreflang/SKILL.md (agricidaniel, MIT).
    page_technical_audit: validate_url_strict + httpx.Client(follow_redirects=False). Checks:
      title length (30-60), meta description length (50-160), meta robots (noindex=critical), canonical
      presence + match, viewport, HTML lang, security headers (x-frame-options, x-content-type-options,
      referrer-policy), redirect detection (3xx), robots.txt Googlebot access via safe_fetch_html +
      urllib.robotparser.RobotFileParser.parse(). Verdicts: healthy | issues_found | fetch_error. No auth needed.
    preload_audit: safe_httpx_get (returns full httpx.Response including headers, SSRF-safe).
      Checks: <script type="speculationrules"> blocks + body JSON parsed for prefetch/prerender actions,
      Speculation-Rules HTTP response header, <link rel="preload"> with as/href/fetchpriority attrs,
      deprecated <link rel="prerender">, cache-control: no-store bfcache blocker.
      Verdicts: optimised | improvements_available | not_implemented | fetch_error.
      Mock target: gsc_mcp.tools.content.safe_httpx_get. Adapted from claude-seo (agricidaniel, MIT).
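
The hreflang_audit checks listed above can be illustrated on already-extracted `(hreflang, href)` pairs. This is a sketch: the function name, the issue wording, and the two small correction tables are assumptions, and a real audit would validate against the full ISO code lists.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlsplit

LANG_FIXES = {"jp": "ja"}    # country code mistaken for a language code
REGION_FIXES = {"UK": "GB"}  # common alias that is not an ISO 3166-1 alpha-2 code


def audit_hreflang(page_url: str, links: List[Tuple[str, str]]) -> Dict[str, object]:
    """Check hreflang annotations and return a verdict plus a list of issues."""
    if not links:
        return {"verdict": "no_hreflang", "issues": []}
    issues: List[str] = []
    codes = [code for code, _ in links]
    if "x-default" not in codes:
        issues.append("missing x-default")
    if page_url not in [href for _, href in links]:
        issues.append("missing self-reference")
    if len({urlsplit(href).scheme for _, href in links}) > 1:
        issues.append("mixed HTTP/HTTPS protocols")
    for code in codes:
        if code == "x-default":
            continue
        lang, _, region = code.partition("-")
        if lang.lower() in LANG_FIXES:
            issues.append(f"{code}: use language '{LANG_FIXES[lang.lower()]}'")
        if region.upper() in REGION_FIXES:
            issues.append(f"{code}: use region '{REGION_FIXES[region.upper()]}'")
    return {"verdict": "issues_found" if issues else "valid", "issues": issues}
```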

## Security Rules (always apply)

1. XML PARSING: Use defusedxml.ElementTree, NEVER stdlib xml.etree.ElementTree, for external XML.
   Reason: XXE + billion-laughs vulnerability on untrusted XML.

2. TOKEN STORAGE: Write tokens as JSON (not pickle). Atomic write (tempfile + os.replace()).
   chmod 0o600 on file, 0o700 on directory.

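3. SSRF (every tool that fetches a user-supplied URL, including sitemap_audit, schema_validate, and the
   drift, content, and technical tools): All arbitrary URL fetches go through url_safety.validate_url_strict(),
   safe_httpx_get(), or safe_fetch_html() from src/gsc_mcp/url_safety.py.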
   Blocks private IPs, cloud metadata endpoints, obfuscated IPv4, localhost, link-local addresses.
   sitemap_audit additionally enforces origin match on child URLs + follow_redirects=False.

5. REAL PROPERTY IDs: Never hardcode real GSC property IDs or domains in tests or docs.
   Use sc-domain:example.com or https://example.com/ as fixtures.

6. RETRY COVERAGE: All direct Google API calls (execute(), runReport(), etc.) need @with_retry().
   Missing decorator = unhandled transient failures.

7. OUTPUT CONTRACT: All tools must return json.dumps(with_meta(...)).
   No bare json.dumps(data).

8. CI SECRETS: PYPI_API_TOKEN lives in GitHub Actions secrets, never in pyproject.toml or code.

## Adding a New Tool (checklist)

1. Implement in tools/<module>.py : @with_retry() if Google API call; no decorator for non-API tools.
2. Return json.dumps(with_meta(data, tool="name", params={...})).
3. If the tool fetches an arbitrary user-supplied URL, use safe_fetch_html() from url_safety.py.
4. Import + add to TOOLS dict in registry.py (server.py and gsc-cli pick it up automatically).
5. Add name to _ALL_TOOLS in properties.py. Update count in get_capabilities() docstring.
6. Write tests in tests/test_<module>.py : mock all Google API calls (never make real requests).

## Test Patterns

SETUP: pytest tests/ : 545 tests, fully mocked.
  pytest tests/test_analytics.py -v
  pytest tests/ -k "test_submit_batch" -v

FIXTURE CHEAT SHEET:
  mock_gsc_service     → MagicMock wired: .sites().list, .searchanalytics().query, .sitemaps(), .urlInspection()
  mock_indexing_service → MagicMock with working new_batch_http_request() (fires callbacks synchronously)
  GA4_PROPERTY_ID      → autouse fixture sets env var to "12345678"

PATCH AT CALL SITE (the module that imports and uses the name, not the module that defines it):
  gsc_mcp.tools.analytics.get_searchconsole_service
  gsc_mcp.tools.ga4.get_ga4_service
  gsc_mcp.tools.crux.httpx.Client              ← context manager pattern
  gsc_mcp.tools.sitemaps.httpx.Client          ← context manager pattern
  gsc_mcp.tools.sitemaps.get_search_analytics  ← module-level import, patch here
  gsc_mcp.url_safety.socket.getaddrinfo        ← DNS mock for schema_validate + drift tools
  gsc_mcp.tools.drift.safe_fetch_html          ← patch directly for drift tests
  gsc_mcp.tools.drift._fetch_cwv               ← patch to return None or a CWV dict
  gsc_mcp.tools.content.safe_fetch_html        ← patch for content_quality, hreflang_audit, page_technical_audit robots.txt
  gsc_mcp.tools.content.validate_url_strict    ← patch for page_technical_audit SSRF test
  gsc_mcp.tools.content.httpx.Client          ← context manager pattern for page_technical_audit main fetch
  gsc_mcp.tools.content.safe_httpx_get        ← patch for preload_audit (returns MagicMock with .text, .status_code, .headers)
  gsc_mcp.tools.indexing.httpx.Client         ← context manager pattern for indexnow_submit
  gsc_mcp.tools.indexing.validate_url_strict  ← patch for indexnow_submit SSRF validation
  gsc_mcp.tools.technical.safe_fetch_html     ← patch for ai_visibility_audit + gbp_deprecation_lint
  gsc_mcp.tools.technical.validate_url_strict ← patch for ai_visibility_audit + gbp_deprecation_lint + pagespeed_audit SSRF
  gsc_mcp.tools.technical.httpx.Client        ← context manager pattern for pagespeed_audit
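
The "context manager pattern" entries above refer to wiring a mock so that `with Client(...) as client:` yields a controllable object. The sketch below shows that wiring with `unittest.mock` alone; `fetch_status` is a hypothetical tool body, and in a real test the mock class would be installed with `patch("gsc_mcp.tools.<module>.httpx.Client")` instead of being passed in.

```python
from typing import Any, Callable
from unittest.mock import MagicMock


def fetch_status(client_factory: Callable[..., Any], url: str) -> int:
    """Hypothetical tool body that uses an httpx-style client as a context manager."""
    with client_factory(timeout=10) as client:
        return client.get(url).status_code


# The mock class: calling it returns an instance whose __enter__ yields the client.
mock_client_cls = MagicMock()
mock_client = mock_client_cls.return_value.__enter__.return_value
mock_client.get.return_value.status_code = 200

status = fetch_status(mock_client_cls, "https://example.com/")
```

Configuring `return_value.__enter__.return_value` matters because the object bound by `as client` is what `__enter__` returns, not the instance created by calling the class.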

## Environment Variables

GSC_SERVICE_ACCOUNT_PATH  : GSC + Indexing + GA4 tools. Path to service account JSON (preferred for automation).
GSC_CREDENTIALS_PATH      : OAuth flow. Path to OAuth Desktop client JSON.
GSC_SKIP_OAUTH            : Set to "true" to skip OAuth fallback entirely (requires SA path).
GA4_PROPERTY_ID           : All GA4 + cross tools. Numeric (e.g. 12345678), validated lazily, per-call override supported.
CRUX_API_KEY              : crux_page_vitals, crux_history, crux_lcp_subparts. Plain Google API key (not service account).
                            Missing CRUX_API_KEY: crux_page_vitals + crux_history raise RuntimeError; crux_lcp_subparts returns verdict="missing_key".
GOOGLE_API_KEY            : pagespeed_audit only. Plain Google API key with PageSpeed Insights API enabled in GCP.
                            Missing GOOGLE_API_KEY: returns verdict="missing_key" (no exception).

## Quick Decision Tree

- New GSC analytics tool → analytics.py, use _fetch_rows, @with_retry, with_meta
- New SEO intelligence tool → seo.py, builds on _fetch_rows
- New GA4 tool → ga4.py, add hostname+country filter via _build_dimension_filter, with_meta
- New GA4 alpha API tool → ga4.py, use get_alpha_ga4_service()
- New cross-platform tool → cross.py, use _normalize_url for GSC+GA4 join
- New tool fetching external URLs → use safe_fetch_html() from url_safety.py (SSRF, DNS rebinding, metadata blocklist)
- New tool parsing external XML → defusedxml.ElementTree
- Any Google API call → @with_retry()
- Any tool return → json.dumps(with_meta(...))
- New tool registered → update _ALL_TOOLS in properties.py + get_capabilities count
- New test → mock get_*_service at call site, never make real API calls
