Co-developer Briefing

The Error-Handling Overhaul

A single, structured error model that travels intact from a raw provider SDK exception, through every wrapping layer, across the Temporal boundary, and out to humans, agents, and HTTP APIs — classified the same way at every stop.

branch refactor/ECR  ←  feature/Error-handling-2  ←  feature/Temporal-merge-3
TL;DR for the busy reviewer.

Before this branch, an error meant whatever the layer that caught it decided to do — some swallowed it, some re-wrapped it losing the cause, some had no SDK handling at all. After this branch:

Independent tracks across metadata, worker classification, retry & resilience, CLI/HTTP delivery, the Temporal boundary, testing, and the Extract / Classify / Render decomposition — all landed.

01 Why we did this

The state we started from, and the principle we settled on.

On feature/Temporal-merge-3, error handling was inconsistent. A tier review classified inference workers into three buckets — and a large share of them (Google, Mistral, Azure img-gen, FAL, HuggingFace, Docling, Linkup, pypdfium2) had no SDK exception handling at all: a raw provider error propagated up as-is, untyped, uncategorized. Other layers caught errors and re-wrapped them, sometimes dropping the __cause__. There was a PipeRouter retry loop that only ever ran on the direct path, re-ran at the wrong granularity, and carried a config bug. And except Exception appeared throughout business logic, silently swallowing bugs.

The principle: An error is data, not a control-flow accident. It is classified once, at the layer that knows the most (the worker), and that classification must reach every consumer unchanged: the human reading a Rich panel, the agent parsing JSON, the Temporal retry engine, the HTTP adapter picking a status code. No layer in between is allowed to lose it.

Three cross-cutting rules now hold everywhere and the tracks build on them:

02 The layer model

Errors flow up through six layers. Each one has exactly one job.

Layer 5 · CLI entry points: catch + format for human (Rich) / agent (JSON·MD) / HTTP status
Layer 4 · CLI factories: catch setup errors, route to handlers
Layer 3 · Pipeline runner: catch + wrap as PipelineExecutionError
Layer 2 · Pipe router / operators: catch + wrap with pipe context (pipe_code, pipe_stack…)
Layer 1 · Workers / SDK calls (◀ classification happens here): catch third-party exception → CogtError + InferenceErrorCategory
Layer 0 · Third-party SDKs: raw OpenAI / Anthropic / Google / Mistral / … exceptions
(error propagates up; classification preserved at every layer)
Errors originate at Layer 0, are classified at Layer 1, and accumulate context (pipe code, stack, run mode) as they rise. No layer drops the classification.

03 The error's journey

From a raw SDK exception to four different rendered outputs — one schema throughout.

RateLimitError (raw SDK exception)
→ Worker classifies: is_quota_exhaustion_*() → CAPACITY vs TRANSIENT; raise CogtError(…) from exc
→ ErrorReport via to_error_report(): error_category · error_domain · retryable · user_action · model · provider · provider_metadata
→ wrapped by Pipe / Router / Runner layers… (__cause__-chain enrichment keeps the classification)
→ four renderers:
  • Human CLI: Rich error panel
  • Agent CLI: JSON / Markdown
  • HTTP adapters: .http_status → 4xx/5xx
  • Temporal: retry decision + details
one schema · four renderers · identical classification at every stop
The same ErrorReport drives all four delivery surfaces. Wrapper exceptions inherit the underlying classification through __cause__-chain enrichment.

The tracks at a glance

Track | Status | What it delivers
Metadata model | Landed | The data contract (error_category, error_domain, user_action, ProviderErrorMetadata), carried on the exception class.
Worker classification | Landed | Every LLM / img-gen / extract / search worker maps SDK exceptions → categorized CogtError.
Retry & resilience | Landed | Removed the fake PipeRouter loop; explicit transport retry; bounded PipeBatch fan-out.
CLI delivery | Landed | Agent CLI markdown/JSON; error_domain → HTTP-status mapping; shared Rich panel helper.
Temporal integration | Landed | ErrorReport carried across activity → workflow → submitter; category-aware retry.
Testing | Landed | Per-worker classification, full-chain snapshot, local/Temporal parity, dict drift-detection.
Extract / Classify / Render | Landed | Per-worker pipeline decomposed into one per-provider Extract + shared Classify + shared Render. Branch refactor/ECR.

04 ErrorReport: the schema

One frozen dataclass. pipelex/base_exceptions.py.

ErrorReport is a frozen pydantic dataclass with extra="forbid". It is the single source of truth for error serialization — to_dict() drops None fields, from_dict() is its strict inverse (used to recover a report that crossed the Temporal boundary), and http_status maps it to an HTTP code for downstream APIs.

pipelex/base_exceptions.py — ErrorReport gains classification + domain fields
 @dataclass(frozen=True, config={"extra": "forbid"})
 class ErrorReport:
     error_type: str
     message: str
     error_category: str | None = None
+    error_domain: str | None = None          # input / config / runtime
     retryable: bool | None = None
-    user_action: str | None = None
+    user_action: UserAction | None = None    # typed: kind + detail
     model: str | None = None
     provider: str | None = None
+    provider_metadata: ProviderErrorMetadata | None = None
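
The to_dict() / from_dict() contract described above can be illustrated with a minimal sketch. It uses a stdlib dataclass as a stand-in for the pydantic one and only a subset of the fields; the class name ErrorReportSketch and the ValueError on unknown keys are assumptions made for illustration, not the code in base_exceptions.py.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, fields
from typing import Any


@dataclass(frozen=True)
class ErrorReportSketch:
    """Hypothetical stand-in for ErrorReport, reduced to a few fields."""

    error_type: str
    message: str
    error_category: str | None = None
    error_domain: str | None = None
    retryable: bool | None = None
    model: str | None = None
    provider: str | None = None

    def to_dict(self) -> dict[str, Any]:
        # Drop None fields so the serialized payload only carries what is known.
        return {key: value for key, value in asdict(self).items() if value is not None}

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> ErrorReportSketch:
        # Strict inverse: an unknown key is an error (mirrors extra="forbid").
        known = {field.name for field in fields(cls)}
        unknown = sorted(set(data) - known)
        if unknown:
            raise ValueError(f"unexpected keys: {unknown}")
        return cls(**data)


report = ErrorReportSketch(
    error_type="LLMCompletionError",
    message="rate limited",
    error_category="transient",
    retryable=True,
)
payload = report.to_dict()
restored = ErrorReportSketch.from_dict(payload)
```

Note that retryable=False would survive the round trip, because only None values are dropped.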

The most important behaviour change is on PipelexError itself. A wrapper exception (PipeRunError, PipeRouterError, PipelineExecutionError) carries no category of its own. to_error_report() now enriches the report from the __cause__ chain, so the inference classification survives every wrapping layer:

pipelex/base_exceptions.py — cause-chain enrichment (new)
+def _enrich_error_report_from_cause(self, report: ErrorReport) -> ErrorReport:
+    """Fill the None classification fields of `report` from the __cause__ chain."""
+    cause = self.__cause__
+    if not isinstance(cause, PipelexError):
+        return report
+    # ... cyclic-chain guard omitted ...
+    cause_report = cause.to_error_report()
+    return ErrorReport(
+        error_type=report.error_type,                                  # keep own identity
+        message=report.message,
+        error_category=report.error_category or cause_report.error_category,
+        error_domain=report.error_domain or cause_report.error_domain,
+        retryable=report.retryable if report.retryable is not None else cause_report.retryable,
+        user_action=report.user_action or cause_report.user_action,
+        model=report.model or cause_report.model,
+        provider=report.provider or cause_report.provider,
+        provider_metadata=report.provider_metadata or cause_report.provider_metadata,
+    )
Watch for: A to_error_report() override on a subclass must end with self._enrich_error_report_from_cause(report); otherwise that subclass becomes a black hole that drops the cause's classification. The cyclic-__cause__ guard exists because a malformed chain must never turn the error-reporting path into a RecursionError.
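
To make the "black hole" failure mode concrete, here is a small self-contained sketch. All class names (Report, BaseError, WorkerError, GoodWrapper, BlackHoleWrapper) are hypothetical stand-ins, the report is reduced to two classification fields, and the cyclic-chain guard is left out, as in the excerpt above.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Report:
    error_type: str
    message: str
    error_category: str | None = None
    retryable: bool | None = None


class BaseError(Exception):
    # Class-level classification; wrappers leave these as None.
    error_category: str | None = None
    retryable: bool | None = None

    def to_error_report(self) -> Report:
        report = Report(type(self).__name__, str(self), self.error_category, self.retryable)
        return self._enrich_from_cause(report)

    def _enrich_from_cause(self, report: Report) -> Report:
        cause = self.__cause__
        if not isinstance(cause, BaseError):
            return report
        cause_report = cause.to_error_report()
        return replace(
            report,
            error_category=report.error_category or cause_report.error_category,
            retryable=report.retryable if report.retryable is not None else cause_report.retryable,
        )


class WorkerError(BaseError):
    error_category = "transient"
    retryable = True


class GoodWrapper(BaseError):
    def to_error_report(self) -> Report:
        report = Report("GoodWrapper", f"pipe failed: {self}")
        return self._enrich_from_cause(report)  # the override ends with enrichment


class BlackHoleWrapper(BaseError):
    def to_error_report(self) -> Report:
        return Report("BlackHoleWrapper", f"pipe failed: {self}")  # enrichment forgotten


def report_for(wrapper_class: type[BaseError]) -> Report:
    try:
        try:
            raise WorkerError("rate limited")
        except WorkerError as exc:
            raise wrapper_class("step 2") from exc
    except BaseError as wrapped:
        return wrapped.to_error_report()


good = report_for(GoodWrapper)
lost = report_for(BlackHoleWrapper)
```

In this sketch, good keeps the worker's "transient" category while lost reports no category at all, even though both wrap the same cause.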

Two enums you'll see everywhere

InferenceErrorCategory

Drives retry decisions. Only TRANSIENT is retryable.

  • TRANSIENT — retry it
  • CONFIGURATION — fix the setup
  • CONTENT — fix the input/prompt
  • CAPACITY — quota / billing
  • AMBIGUOUS — outcome unknown, unsafe to retry
  • UNKNOWN — could not classify
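
The retry rule above ("only TRANSIENT is retryable") could be expressed roughly as follows. The lowercase string values are an assumption inferred from the "transient" value in the agent JSON payload; the real enum may be defined differently.

```python
from enum import Enum


class InferenceErrorCategory(str, Enum):
    """Sketch of the category enum; member values are assumed, not confirmed."""

    TRANSIENT = "transient"
    CONFIGURATION = "configuration"
    CONTENT = "content"
    CAPACITY = "capacity"
    AMBIGUOUS = "ambiguous"
    UNKNOWN = "unknown"

    @property
    def is_retryable(self) -> bool:
        # AMBIGUOUS is deliberately excluded: the outcome is unknown, so a retry is unsafe.
        return self is InferenceErrorCategory.TRANSIENT


retryable_categories = [category for category in InferenceErrorCategory if category.is_retryable]
```
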
ErrorDomain

Drives HTTP status. Class-level on the exception.

  • INPUT → HTTP 422 — caller can fix it
  • CONFIG → HTTP 500 — env change needed
  • RUNTIME → HTTP 500 — failure during execution

A provider 429 in provider_metadata overrides this so the API can emit Retry-After.

05 Worker classification: Extract / Classify / Render

Every inference worker collapses to one three-line shape.

After the ECR (refactor/ECR) sweep, every provider worker under pipelex/plugins/*/ handles SDK exceptions with the same three steps. Extract turns the provider's SDK exception into a provider-blind ProviderErrorMetadata envelope. Classify is a single shared, provider-blind function mapping that envelope to a ClassificationResult(category, user_action_kind, is_model_not_found). Render is a single shared function per worker family that picks the right CogtError subclass from an InferenceErrorFamily tag plus the is_model_not_found flag (e.g. LLMModelNotFoundError vs LLMCompletionError).

pipelex/plugins/*/…_llm_worker.py — the post-ECR uniform shape
except (APIStatusError, APIConnectionError, APITimeoutError) as exc:
+   metadata = extract_openai_metadata(exc)                                       # provider-specific
+   classification = classify_inference_error(metadata)                           # shared, provider-blind
+   raise render_llm_error(                                                       # shared
+       classification=classification,
+       metadata=metadata,
+       family=InferenceErrorFamily.LLM,
+       model_desc=self.inference_model.desc,
+   ) from exc

Only the per-provider extract_*_metadata functions stay plugin-local — one per SDK family, registered against ProviderName. Classify and Render live once, in pipelex/cogt/inference/error_classify.py and error_render.py. A parity meta-test (test_provider_classification_parity.py) walks every ProviderName against the extract-fn registry and the worker-family map, so adding a new provider without wiring it fails fast.
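
The three steps can be sketched end to end with stand-in types. Everything here is hypothetical: FakeSDKStatusError is not a real SDK class, Metadata and Classification only play the roles of ProviderErrorMetadata and ClassificationResult, and the status-code table inside classify() is invented for illustration rather than taken from error_classify.py.

```python
from __future__ import annotations

from dataclasses import dataclass


class FakeSDKStatusError(Exception):
    """Stand-in for a provider SDK exception that carries an HTTP status."""

    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


@dataclass(frozen=True)
class Metadata:
    provider: str
    status_code: int | None
    raw_message: str


@dataclass(frozen=True)
class Classification:
    category: str
    is_model_not_found: bool


class CompletionError(Exception):
    def __init__(self, message: str, category: str) -> None:
        super().__init__(message)
        self.category = category


class ModelNotFoundError(CompletionError):
    pass


def extract_fake_metadata(exc: FakeSDKStatusError) -> Metadata:
    # Extract: the only provider-specific step.
    return Metadata(provider="fake", status_code=exc.status_code, raw_message=str(exc))


def classify(metadata: Metadata) -> Classification:
    # Classify: provider-blind. This table is illustrative only.
    status = metadata.status_code
    if status == 404:
        return Classification(category="configuration", is_model_not_found=True)
    if status is not None and status >= 500:
        return Classification(category="transient", is_model_not_found=False)
    return Classification(category="unknown", is_model_not_found=False)


def render(classification: Classification, metadata: Metadata, model_desc: str) -> CompletionError:
    # Render: pick the exception subclass from the classification flags.
    error_class = ModelNotFoundError if classification.is_model_not_found else CompletionError
    return error_class(f"{model_desc}: {metadata.raw_message}", classification.category)


def call_model(status_code: int) -> None:
    try:
        raise FakeSDKStatusError("upstream failure", status_code)
    except FakeSDKStatusError as exc:
        metadata = extract_fake_metadata(exc)
        classification = classify(metadata)
        raise render(classification, metadata, "fake-model") from exc
```

Because only extract_fake_metadata() knows about the SDK exception, adding a provider in this sketch means writing one function and nothing else.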

This sweep lifted every worker kind — LLM, img-gen, extract, search, plus AWS Bedrock — to the same standard. The workers that previously had zero SDK handling (Google, Mistral, Azure img-gen, FAL, HuggingFace, Docling, Linkup, pypdfium2) now all classify. Refinements that landed alongside:

06 Retry & resilience

An honest model — and removing the loop that pretended otherwise.

The headline change: the PipeRouter transient-retry loop was removed. It only ever ran on the direct (non-Temporal) path, re-ran at the wrong granularity, and carried a per-run/global config bug. Rather than fix a retry loop for a path that is deliberately not the resilient one, it was deleted (PR #909).

The product line: Direct execution for simplicity. Temporal for resilience. Nothing fake in between. Direct execution makes one pipeline-level attempt on top of a transport that already shrugs off brief blips, and then surfaces the error. That is the honest contract.

Direct execution (simplicity)
  • Tier 1 · Transport retry: SDK retries conn / 408 / 409 / 429 / 5xx · Retry-After
  • (nothing here, by design; the removed PipeRouter loop lived here)
  • single pipeline attempt → surface the error
  • no durability, no crash survival
Temporal execution (resilience)
  • Tier 1 · Transport retry: same SDK-level retry as direct
  • Tier 2 · Temporal durability: activity RetryPolicy keyed off InferenceErrorCategory.is_retryable
  • workflow-level durability + redelivery
  • survives a worker crash
Two tiers, two paths. The gap on the direct path between Tier 1 and Tier 2 is intentional — that gap is the difference between the two products.
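
As a rough illustration of what Tier 1 covers, the sketch below encodes the status codes listed above (connection errors, 408, 409, 429, 5xx) and honors Retry-After when present. In practice the provider SDKs do this internally; the function names and the backoff parameters (base, cap) are hypothetical.

```python
from __future__ import annotations

RETRYABLE_STATUS_CODES = frozenset({408, 409, 429})


def is_transport_retryable(status_code: int | None) -> bool:
    """Return True for the failures a transport-level retry would cover."""
    if status_code is None:
        # No HTTP response at all: treated as a connection error.
        return True
    return status_code in RETRYABLE_STATUS_CODES or 500 <= status_code <= 599


def retry_delay_seconds(
    attempt: int,
    retry_after: float | None = None,
    base: float = 0.5,
    cap: float = 8.0,
) -> float:
    """Prefer the server's Retry-After; otherwise use capped exponential backoff."""
    if retry_after is not None:
        return retry_after
    return min(cap, base * (2**attempt))
```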

Supporting changes that landed under this track:

07 The Temporal boundary

Keeping the classification intact across process boundaries.

When a pipe runs on a Temporal worker, the error has to survive serialization across the activity → workflow → submitter boundary. Temporal's default failure converter would auto-wrap a raw PipelexError — losing both our structured ErrorReport and the category-aware retry decision. The bridge closes that gap.

Activity process
  • worker raises CogtError (error_category set)
  • @convert_pipelex_errors → TemporalError
Workflow
  • RetryPolicy reads non_retryable flag
  • from_app_error() recovers details
Submitter process
  • recover_error_report()
  • ErrorReport.from_dict()
  • WorkflowExecutionError (error_report=…)
ErrorReport travels in ApplicationError.details: serialized, recovered, re-classified
a Temporal-run pipe failure reaches the CLI with the SAME classification as a local run
The activity-side decorator converts the error; the submitter side recovers the full report. Worker/submitter version skew is tolerated — unknown keys are dropped.
pipelex/temporal/tprl/activity_error_boundary.py — the activity-side decorator (new)
+@functools.wraps(func)
+async def wrapper(*args, **kwargs):
+    try:
+        return await func(*args, **kwargs)
+    except PipelexError as exc:                # ONLY PipelexError — never bare Exception
+        raise TemporalError.from_message_exception(exc=exc) from exc

Applied beneath @activity.defn on every in-scope activity. from_message_exception() derives non_retryable from InferenceErrorCategory.is_retryable (the class-name list is now only a fallback for category-less errors) and packs to_error_report().to_dict() into ApplicationError.details. On the way back, recover_error_report() walks the __cause__ chain, pulls the dict, and rebuilds the ErrorReport — tolerating version skew so the error path never crashes on a rolling deploy.
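
The pack-and-recover round trip can be sketched without Temporal installed. FakeApplicationError below only imitates the shape of Temporal's ApplicationError (a message, positional details, a non_retryable flag); Report, to_boundary_error() and recover_report() are hypothetical names, and treating an unclassified error as non-retryable is an assumption of this sketch, not the documented fallback.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, fields
from typing import Any


@dataclass(frozen=True)
class Report:
    error_type: str
    message: str
    error_category: str | None = None
    retryable: bool | None = None


class FakeApplicationError(Exception):
    """Stand-in for an error type that carries serializable details."""

    def __init__(self, message: str, *details: Any, non_retryable: bool = False) -> None:
        super().__init__(message)
        self.details = details
        self.non_retryable = non_retryable


def to_boundary_error(report: Report) -> FakeApplicationError:
    # Activity side: pack the report dict into details.
    payload = {key: value for key, value in asdict(report).items() if value is not None}
    # Assumption: anything not explicitly retryable is marked non-retryable.
    return FakeApplicationError(report.message, payload, non_retryable=not bool(report.retryable))


def recover_report(exc: BaseException) -> Report | None:
    # Submitter side: walk the __cause__ chain until a details dict is found.
    known = {field.name for field in fields(Report)}
    seen: set[int] = set()
    current: BaseException | None = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against a cyclic chain
        if isinstance(current, FakeApplicationError) and current.details:
            payload = current.details[0]
            if isinstance(payload, dict):
                # Tolerate version skew: drop keys this side does not know.
                return Report(**{key: value for key, value in payload.items() if key in known})
        current = current.__cause__
    return None


original = Report("LLMCompletionError", "rate limited", error_category="transient", retryable=True)
boundary = to_boundary_error(original)
boundary.details[0]["field_from_newer_worker"] = "ignored"  # simulate version skew
try:
    raise RuntimeError("workflow failed") from boundary
except RuntimeError as outer:
    recovered = recover_report(outer)
```

Filtering unknown keys before rebuilding the report is what lets a strict from_dict() coexist with rolling deploys in this sketch.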

Net effect: A pipe that fails on a Temporal worker now reaches the agent CLI JSON, the Rich human panel, and the HTTP-status mapping with the same error_category / retryable / model / provider / user_action as the identical failure run locally. An integration parity pair locks this by construction.

08 CLI & HTTP delivery

Where the classified error finally gets rendered.

Human CLI — Rich panels

Every handle_* function builds its panel through one shared display_error_panel() helper — red banner, structured fields, user_action tip, doc/Discord links. Exception-specific logic stays in the handler; the panel shape lives in one place.

Agent CLI — JSON / Markdown

run / validate / init / models / doctor / check-model default to markdown, with --format json available. The format is set once per invocation via a ContextVar, so the error path inherits it without threading an argument through.
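
A minimal sketch of the ContextVar pattern described above: the format is set once at the entry point and the error path reads it without any argument being passed down. The names (_output_format, set_output_format, render_error) and the Markdown rendering are hypothetical.

```python
from __future__ import annotations

import json
from contextvars import ContextVar

# Markdown is the default, matching the agent CLI behaviour described above.
_output_format: ContextVar[str] = ContextVar("output_format", default="markdown")


def set_output_format(fmt: str) -> None:
    """Called once per invocation, for example while parsing --format."""
    if fmt not in ("markdown", "json"):
        raise ValueError(f"unsupported format: {fmt}")
    _output_format.set(fmt)


def render_error(error_type: str, message: str) -> str:
    """Deep in the error path: reads the format from context, no parameter needed."""
    if _output_format.get() == "json":
        return json.dumps({"error": True, "error_type": error_type, "message": message})
    return f"**{error_type}**: {message}"


markdown_output = render_error("LLMCompletionError", "rate limited")
set_output_format("json")
json_output = render_error("LLMCompletionError", "rate limited")
```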

The agent JSON error payload is the structured shape downstream agents parse:

agent CLI — JSON error payload
{
  "error": true,
  "error_type": "LLMCompletionError",
  "message": "...",
  "hint": "...",
  "retryable": true,
  "error_domain": "runtime",
  "error_category": "transient",
  "model": "gpt-4o",
  "provider": "openai",
  "error_source": ["LLMCompletionError @ .../worker.py:152 (in _gen_text)"]
}

HTTP-status mapping lives in the library. pipelex has no API server, but pipelex-relay and pipelex-back-office both render an ErrorReport as an HTTP response — and were reinventing the mapping. It now lives once in base_exceptions.py: error_domain_to_http_status() is the pure table, ErrorReport.http_status layers the provider-429 passthrough on top. Downstream FastAPI handlers become a trivial adapter and must not redefine the contract.
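
The two-layer mapping can be sketched as a pure table plus a passthrough. The statuses come from the ErrorDomain list earlier in this briefing; the enum values, the 500 fallback for an unclassified error, and the free function report_http_status() (standing in for the ErrorReport.http_status property) are assumptions.

```python
from __future__ import annotations

from enum import Enum


class ErrorDomain(str, Enum):
    INPUT = "input"
    CONFIG = "config"
    RUNTIME = "runtime"


_DOMAIN_TO_STATUS = {
    ErrorDomain.INPUT: 422,  # caller can fix it
    ErrorDomain.CONFIG: 500,  # env change needed
    ErrorDomain.RUNTIME: 500,  # failure during execution
}


def error_domain_to_http_status(domain: ErrorDomain | None) -> int:
    """Pure table lookup. Assumption: an unclassified error maps to 500."""
    if domain is None:
        return 500
    return _DOMAIN_TO_STATUS[domain]


def report_http_status(domain: ErrorDomain | None, provider_status_code: int | None) -> int:
    """Layer the provider-429 passthrough on top of the table."""
    if provider_status_code == 429:
        return 429  # lets the API emit Retry-After
    return error_domain_to_http_status(domain)
```

A downstream handler would then only read the status from this function and serialize the report, without redefining the table.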

09 Testing

What proves the system behaves — and what stops it rotting.

Test layer | What it locks in
Per-worker classification | Mocks each SDK to raise every typed exception; asserts the resulting CogtError has the expected error_category, user_action, provider_metadata.
instructor unwrap | One end-to-end test per provider against real instructor.from_*(…) so the unwrap can't silently rot if instructor's wrapping shape changes.
Full-chain snapshot | Builds a pipeline where one pipe fails, runs the agent CLI, asserts the JSON carries the classification + an error_source chain in order. Catches a wrapper silently swallowing error_category.
Local / Temporal parity | Runs the same pipe locally and on the in-process Temporal server; asserts an identical ErrorReport. Parity holds by construction.
Dict drift-detection | Walks the PipelexError hierarchy and asserts every subclass is covered by class-level metadata or a fallback dict entry, which closes the "silent breakage when a class is renamed" failure mode.
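
The drift-detection idea can be sketched as a walk over __subclasses__() that flags both uncovered classes and stale fallback keys. The class names and the FALLBACK_DOMAINS dict are hypothetical; the real test works against the PipelexError hierarchy and the agent_output.py fallback dicts.

```python
from __future__ import annotations


class BaseError(Exception):
    error_domain: str | None = None  # class-level metadata, None when unset


class InputError(BaseError):
    error_domain = "input"


class LegacyError(BaseError):
    pass  # no class-level metadata: must appear in the fallback dict


FALLBACK_DOMAINS = {"LegacyError": "runtime"}


def all_subclasses(cls: type) -> list[type]:
    """Recursively collect every subclass of cls."""
    found: list[type] = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found


def uncovered_classes(root: type, fallback: dict[str, str]) -> list[str]:
    """Subclasses with neither class-level metadata nor a fallback entry."""
    return [
        sub.__name__
        for sub in all_subclasses(root)
        if getattr(sub, "error_domain", None) is None and sub.__name__ not in fallback
    ]


def stale_fallback_keys(root: type, fallback: dict[str, str]) -> list[str]:
    """Fallback keys that no longer match any class, e.g. after a rename."""
    names = {sub.__name__ for sub in all_subclasses(root)}
    return sorted(key for key in fallback if key not in names)
```

A test built on these helpers would assert that both lists are empty, so a renamed class fails the suite instead of silently losing its hint.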

10 What's still open

Deliberately scoped out — flagged here so review doesn't treat it as missing.

The bulk of the work has landed. One item remains, intentionally left for after this branch:

Metadata-model long tail · Partial

A handful of CogtError subclasses still carry no class-level error_category, and several non-inference PipelexError subclasses still rely on the agent_output.py fallback dicts for hint / error_domain rather than class-level metadata. The drift-detection test already enforces coverage one way or the other — the migration just moves entries onto the classes. See the Followups in track-metadata-model.md.

Everything else — the metadata model, worker classification, retry & resilience, CLI delivery, the Temporal bridge, and the test coverage — is landed and described in current-state terms in the track docs.

Suggested review path: Start with wip/error-handling/README.md (status table + read order), then architecture.md (layer model, hierarchy, ErrorReport shape). Each track-*.md is a self-contained concern, so read them in any order. Completed plans are archived under archive-*.md for their running notes.