Pull Request Story

Offline Mode & Remote Config Cache

Making Pipelex setup and dry-run work without a network — by adding a primed, last-resort cache for the Gateway remote config, with honest provenance tracking and no silent fallbacks.

branch · fix/Offline-mode | 7 phases · TDD throughout | ~3,300 LOC added | 6 new test suites

Why we did this

Pipelex's Gateway needs a remote config — the catalogue of which models exist and on which backends. On every setup() that touches model specs, Pipelex fetched that config over HTTP. If the network was down, setup crashed — even for commands that never make an inference call.

That hurt in two real places: agent CLI commands like validate, inputs, and run --dry-run all pass needs_model_specs=True, and sandboxed environments (Codex local sandbox) have no outbound network at all. None of these run a model — they just need to know the catalogue. Yet a missing network broke them.

Diagnosed failure point

pipelex.py:235 unconditionally called RemoteConfigFetcher.fetch_remote_config() whenever the Gateway was enabled and model specs were needed. Any network failure raised RemoteConfigFetchError straight out of setup. The only escape hatch was a Codex-Cloud-specific short-circuit — local sandboxes and plain offline use were not covered.

The three objectives

01

BYOK works fully offline

Gateway disabled → no remote fetch is ever attempted. Setup completes with zero network.

02

Dry-run survives an outage

Gateway enabled but remote temporarily unreachable → fall back to a cache primed at init time.

03

No silent fallbacks

Unknown Gateway models fail loudly against fresh or cached specs. Stale data is always labelled.

Non-objective: we did not make inference itself work offline. A real model call still needs the network. This PR is strictly about setup and dry-run.

The design in one picture

The whole thing hinges on one idea: the fetcher returns a result that knows where it came from. Everything downstream branches on that provenance.

setup(): Gateway enabled & needs specs
  → fetch_remote_config(): HTTP GET, with retry (tenacity, 5 attempts)
      success: parse + validate → write raw JSON to cache → source = FRESH
      network fails, cache hit: load ~/.pipelex/cache/remote_config.json → source = CACHED
      network fails, no cache: raise RemoteConfigUnavailableError
  → the source flows downstream
      membership check: every referenced Gateway model must exist in the specs → else GatewayUnknownModelError (message branches on source)
      if CACHED: emit RemoteConfigStaleWarning · disable telemetry · surface a warnings field on the agent-CLI JSON envelope
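
The membership check at the end of that flow can be pictured with a small sketch. This is not the code from the PR: the function name, its parameters, and the wording of both hints are hypothetical, and only the exception name and the idea that the message branches on the source come from the description above.

```python
from typing import Iterable


class GatewayUnknownModelError(Exception):
    """Stand-in for the exception named in this PR; the real class lives in pipelex."""


def check_gateway_models(
    referenced: Iterable[str],
    available: Iterable[str],
    source_is_cached: bool,
) -> None:
    """Fail loudly if any referenced model is missing from the specs (hypothetical helper)."""
    unknown = sorted(set(referenced) - set(available))
    if not unknown:
        return
    if source_is_cached:
        # Hypothetical wording: the snapshot may simply predate the model.
        hint = "specs came from a cached snapshot; run `pipelex init` while online to refresh"
    else:
        # Hypothetical wording: fresh specs mean the name itself is probably wrong.
        hint = "specs were fetched fresh; check the model name"
    raise GatewayUnknownModelError(f"Unknown Gateway model(s): {', '.join(unknown)} ({hint})")
```

The point of the branch is that the same failure reads differently depending on provenance: against fresh specs the name is likely wrong, while against cached specs the cache may be the stale party.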

Why a schema-break does NOT fall back to cache

If the remote responds with JSON that fails validation, we raise RemoteConfigValidationError and stop — no cache fallback. We control both ends of that URL (it's versioned in pipelex-back-office), so a schema-rejecting payload is a real server bug, not an operational state. Failing loudly is correct. The cache is for network failures only.
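
A toy model of that rule, assuming nothing about the real fetcher beyond the exception names used in this story: `resolve_config`, its callable parameters, and the string source labels are hypothetical stand-ins. Only the fetch step sits inside the `try`, so a validation failure can never reach the cache branch.

```python
from typing import Any, Callable, Optional


class RemoteConfigFetchError(Exception):
    """Network or HTTP failure (stand-in for the real class)."""


class RemoteConfigValidationError(Exception):
    """Payload arrived but failed schema validation (stand-in)."""


class RemoteConfigUnavailableError(Exception):
    """Remote unreachable and no usable cache (stand-in)."""


def resolve_config(
    fetch: Callable[[], dict[str, Any]],
    validate: Callable[[dict[str, Any]], None],
    cached: Optional[dict[str, Any]],
) -> tuple[dict[str, Any], str]:
    """Return (config, source). Hypothetical helper, not the PR's fetcher."""
    try:
        payload = fetch()
    except RemoteConfigFetchError as fetch_error:
        # Only network-level failures are allowed to consult the cache.
        if cached is None:
            raise RemoteConfigUnavailableError("remote unreachable and no cache") from fetch_error
        return cached, "cached"
    # Validation runs outside the try block on purpose: a schema-rejecting
    # payload raises RemoteConfigValidationError and never touches the cache.
    validate(payload)
    return payload, "fresh"
```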

The pieces we added

New module — remote_config_cache.py

RemoteConfigCache.store() / .load(). Stores the raw JSON dict from the response — not a re-serialised Pydantic dump — so the cache survives minor schema drift. Atomic writes (temp file + os.replace). Schema-versioned: a CACHE_SCHEMA_VERSION bump rejects old caches cleanly.
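
A minimal sketch of the store/load pattern described above, under stated assumptions: the envelope keys (`schema_version`, `cached_at`, `payload`), the version value, and the free-function shape are illustrative, not the actual layout of remote_config_cache.py. What it does show faithfully is the technique: write to a temp file in the same directory, swap it in with `os.replace`, and treat a version mismatch as a cache miss.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional

CACHE_SCHEMA_VERSION = 1  # assumption: the real value may differ


def store(cache_path: Path, payload: dict[str, Any]) -> None:
    """Persist the raw payload atomically (temp file + os.replace)."""
    envelope = {
        "schema_version": CACHE_SCHEMA_VERSION,
        "cached_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,  # raw JSON dict, not a re-serialised model dump
    }
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory so os.replace stays on
    # one filesystem, which is what makes the swap atomic.
    fd, tmp_name = tempfile.mkstemp(dir=cache_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp_file:
            json.dump(envelope, tmp_file)
        os.replace(tmp_name, cache_path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load(cache_path: Path) -> Optional[dict[str, Any]]:
    """Return the envelope, or None if missing, unreadable, or from another schema version."""
    try:
        envelope = json.loads(cache_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(envelope, dict) or envelope.get("schema_version") != CACHE_SCHEMA_VERSION:
        return None
    return envelope
```

Because readers only ever see either the old file or the fully written new one, a crash mid-write cannot leave a half-written cache behind.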

New leaf enum — RemoteConfigSource

FRESH | CACHED. Lives in its own tiny types.py so cogt/ code can import it without dragging in httpx + tenacity. Exposes .is_cached so callers never write == CACHED.
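
For illustration, an enum of this shape could look like the sketch below. The member names and the `.is_cached` property come from the description above; the string values and the `str` mixin are assumptions.

```python
from enum import Enum


class RemoteConfigSource(str, Enum):
    FRESH = "fresh"    # assumption: the real member values may differ
    CACHED = "cached"

    @property
    def is_cached(self) -> bool:
        """True when the config was loaded from the local cache."""
        return self is RemoteConfigSource.CACHED
```

Call sites then read `if source.is_cached:` and never compare against a member directly. The module imports nothing beyond the standard library, which is what keeps it a cheap leaf import.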

New exceptions & warning

The critical code change

The behavioural heart of the PR. The fetcher stopped returning a bare RemoteConfig and started returning a RemoteConfigResult that carries provenance:

pipelex/system/pipelex_service/remote_config_fetcher.py
- def fetch_remote_config(cls) -> RemoteConfig:
+ def fetch_remote_config(cls, require_fresh: bool = False) -> RemoteConfigResult:
      url = PipelexDetails.remote_config_url()
      try:
          payload, config = cls._fetch_fresh(url)
+         RemoteConfigCache.store(payload)            # opportunistic refresh
+         return RemoteConfigResult(config=config, source=RemoteConfigSource.FRESH)
      except RemoteConfigFetchError as fetch_error:   # network / HTTP only
+         if require_fresh:                            # doc generators refuse stale data
+             raise cls._build_unavailable_error(fetch_error, cache_refused=True) from fetch_error
+         cached = RemoteConfigCache.load()
+         if cached is None:
+             raise cls._build_unavailable_error(fetch_error) from fetch_error
+         return RemoteConfigResult(
+             config=cached.to_remote_config(),
+             source=RemoteConfigSource.CACHED,
+             cached_at=cached.cached_at,
+         )

And setup() consumes the provenance — warning, and tightening telemetry, when the data is stale:

pipelex/pipelex.py
- remote_config = RemoteConfigFetcher.fetch_remote_config()
+ remote_config_result = RemoteConfigFetcher.fetch_remote_config()
+ remote_config = remote_config_result.config
+ gateway_config_source = remote_config_result.source
  ...
+ if gateway_config_source.is_cached:
+     warnings.warn(f"Pipelex Gateway is running off a cached remote config "
+                   f"(snapshot: {cached_at_iso}). Run `pipelex init` while online.",
+                   RemoteConfigStaleWarning, stacklevel=2)
  # stale specs imply stale model identities — don't phone home in that state
- is_pipelex_telemetry_enabled = is_pipelex_service_enabled and needs_inference
+ is_pipelex_telemetry_enabled = (is_pipelex_service_enabled and needs_inference
+                                 and not gateway_source_is_cached)

One deliberate design call: the warning is emitted in setup(), not in the fetcher. That keeps the fetcher a pure data-returning function, so test fixtures that swap in a cached fetcher don't have to replay warnings.
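
To make that concrete, here is a hedged sketch of how a test might observe the warning using only the standard library. `setup_like` is a hypothetical stand-in for the relevant slice of setup(), the `UserWarning` base class is an assumption, and the timestamp is an arbitrary placeholder.

```python
import warnings


class RemoteConfigStaleWarning(UserWarning):
    """Stand-in for the warning class added in this PR (base class assumed)."""


def setup_like(source_is_cached: bool, cached_at_iso: str) -> None:
    """Hypothetical slice of setup(): warn only when the config is cached."""
    if source_is_cached:
        warnings.warn(
            f"Pipelex Gateway is running off a cached remote config "
            f"(snapshot: {cached_at_iso}). Run `pipelex init` while online.",
            RemoteConfigStaleWarning,
            stacklevel=2,
        )


# Record every warning raised inside the block, regardless of global filters.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    setup_like(source_is_cached=True, cached_at_iso="2024-01-01T00:00:00+00:00")

stale = [w for w in caught if issubclass(w.category, RemoteConfigStaleWarning)]
```

Since the fetcher itself never warns, a fixture that returns a cached result stays silent, and only code paths that go through setup() produce an entry in `stale`.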

The behaviour matrix we validated

Scenario                 | Gateway | Network | Cache   | Outcome
BYOK offline             | off     | down    | -       | setup + validate + dry-run all OK; fetch never attempted
Gateway dry-run, fresh   | on      | up      | -       | fresh fetch, cache written
Gateway dry-run, cached  | on      | down    | present | cache fallback + stale warning, dry-run OK
Gateway dry-run, cold    | on      | down    | absent  | RemoteConfigUnavailableError with remediation
Unknown model referenced | on      | any     | any     | GatewayUnknownModelError, message branches on source
Doc generator offline    | on      | down    | present | require_fresh=True refuses cache; no stale docs committed

The subtle one: doc generators

gateway_models_generator.py and preprocess_test_models_cmd.py regenerate committed reference docs and test fixtures. If they silently used the cache, they'd bake stale data into the repo. They pass require_fresh=True — any fallback becomes an immediate error instead.

How init primes the cache

The cache is only useful if it exists before the network goes down. So when pipelex init accepts the Gateway terms, it does one fetch and persists the result. If init runs offline and that fetch fails, init logs a yellow warning and continues: the cache stays empty, but the user has been told. The agent-CLI init mirrors this and surfaces cache_primed / cache_priming_error on its JSON envelope.
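
A rough sketch of the priming step as a best-effort operation. The two field names `cache_primed` and `cache_priming_error` are taken from the description above; the `PrimingOutcome` dataclass, the `prime_cache` helper, and the injected `fetch_and_store` callable are hypothetical and only illustrate the "warn and continue" shape.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Optional


class RemoteConfigFetchError(Exception):
    """Stand-in for the network/HTTP failure raised by the real fetcher."""


@dataclass
class PrimingOutcome:
    cache_primed: bool
    cache_priming_error: Optional[str] = None


def prime_cache(fetch_and_store: Callable[[], None]) -> PrimingOutcome:
    """Try one fetch; report failure instead of aborting init (hypothetical helper)."""
    try:
        fetch_and_store()
    except RemoteConfigFetchError as exc:
        # Offline at init time: the cache stays empty, but the caller is told why.
        return PrimingOutcome(cache_primed=False, cache_priming_error=str(exc))
    return PrimingOutcome(cache_primed=True)


def offline_fetch() -> None:
    """Simulates init running with no network."""
    raise RemoteConfigFetchError("network unreachable")


# The outcome serialises directly into fields for a JSON envelope.
envelope_fields = json.dumps(asdict(prime_cache(offline_fetch)))
```

Catching only the fetch error keeps the failure mode narrow: anything unexpected still surfaces as a real error instead of being reported as a soft priming failure.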

What landed

+ remote_config_cache.py
+ pipelex_service/types.py
remote_config_fetcher.py
pipelex_service/exceptions.py
cogt/exceptions.py
cogt/models/model_manager.py
model_manager_abstract.py
pipelex.py
cli/commands/init/command.py
agent_cli/commands/init_cmd.py
agent_cli/commands/agent_output.py
agent_cli/commands/agent_cli_factory.py
cli/error_handlers.py
cli/cli_factory.py
cli/commands/doctor_cmd.py
dev_cli/.../gateway_models_generator.py
dev_cli/.../preprocess_test_models_cmd.py
pipelex_details.py
+ test_remote_config_cache.py
+ test_remote_config_fetcher.py
+ test_gateway_unknown_model.py
+ test_setup_with_cache.py
+ test_cache_priming.py
+ test_offline_run_dry.py (E2E)
+ test_offline_baseline.py

Built TDD across 7 phases — baseline lock-down → cache module → fetcher fallback → provenance plumbing → init priming → real-.mthds-bundle E2E → verification. Each phase is its own commit with a checkpoint block. make agent-check and make agent-test green throughout.

Deferred — not in this PR