Co-developer Briefing
A single, structured error model that travels intact from a raw provider SDK exception, through every wrapping layer, across the Temporal boundary, and out to humans, agents, and HTTP APIs — classified the same way at every stop.
branch refactor/ECR ← feature/Error-handling-2 ← feature/Temporal-merge-3
Before this branch, an error meant whatever the layer that caught it decided to do: some swallowed it, some re-wrapped it and lost the cause, some had no SDK handling at all. After this branch:
InferenceErrorCategory + structured metadata).
ErrorReport is the one serialization schema: CLI JSON, Rich panels, HTTP status, and Temporal details all read from it.
__cause__-chain enrichment, and survives the Temporal boundary activity → workflow → submitter.
except Exception is banned in business logic; ruff BLE001 enforces it.
Independent tracks across metadata, worker classification, retry & resilience, CLI/HTTP delivery, the Temporal boundary, testing, and the Extract / Classify / Render decomposition, all landed.
The state we started from, and the principle we settled on.
On feature/Temporal-merge-3, error handling was inconsistent. A tier review classified inference workers into three buckets — and a large share of them (Google, Mistral, Azure img-gen, FAL, HuggingFace, Docling, Linkup, pypdfium2) had no SDK exception handling at all: a raw provider error propagated up as-is, untyped, uncategorized. Other layers caught errors and re-wrapped them, sometimes dropping the __cause__. There was a PipeRouter retry loop that only ever ran on the direct path, re-ran at the wrong granularity, and carried a config bug. And except Exception appeared throughout business logic, silently swallowing bugs.
Three cross-cutting rules now hold everywhere and the tracks build on them:
pipelex.base_exceptions.PipelexError.
except Exception is allowed only at CLI entry points and async task roots, and ruff rule BLE001 is enabled to enforce it.
InferenceErrorCategory, chain via from exc.
Errors flow up through six layers. Each one has exactly one job.
From a raw SDK exception to four different rendered outputs — one schema throughout.
ErrorReport drives all four delivery surfaces. Wrapper exceptions inherit the underlying classification through __cause__-chain enrichment.

| Track | Status | What it delivers |
|---|---|---|
| Metadata model | Landed | The data contract — error_category, error_domain, user_action, ProviderErrorMetadata — carried on the exception class. |
| Worker classification | Landed | Every LLM / img-gen / extract / search worker maps SDK exceptions → categorized CogtError. |
| Retry & resilience | Landed | Removed the fake PipeRouter loop; explicit transport retry; bounded PipeBatch fan-out. |
| CLI delivery | Landed | Agent CLI markdown/JSON; error_domain → HTTP-status mapping; shared Rich panel helper. |
| Temporal integration | Landed | ErrorReport carried across activity → workflow → submitter; category-aware retry. |
| Testing | Landed | Per-worker classification, full-chain snapshot, local/Temporal parity, dict drift-detection. |
| Extract / Classify / Render | Landed | Per-worker pipeline decomposed into one per-provider Extract + shared Classify + shared Render. Branch refactor/ECR. |
One frozen dataclass. pipelex/base_exceptions.py.
ErrorReport is a frozen pydantic dataclass with extra="forbid". It is the single source of truth for error serialization — to_dict() drops None fields, from_dict() is its strict inverse (used to recover a report that crossed the Temporal boundary), and http_status maps it to an HTTP code for downstream APIs.
```diff
 @dataclass(frozen=True, config={"extra": "forbid"})
 class ErrorReport:
     error_type: str
     message: str
     error_category: str | None = None
+    error_domain: str | None = None  # input / config / runtime
     retryable: bool | None = None
-    user_action: str | None = None
+    user_action: UserAction | None = None  # typed: kind + detail
     model: str | None = None
     provider: str | None = None
+    provider_metadata: ProviderErrorMetadata | None = None
```
The most important behaviour change is on PipelexError itself. A wrapper exception (PipeRunError → PipeRouterError → PipelineExecutionError) carries no category of its own. to_error_report() now enriches the report from the __cause__ chain, so the inference classification survives every wrapping layer:
```diff
+def _enrich_error_report_from_cause(self, report: ErrorReport) -> ErrorReport:
+    """Fill the None classification fields of `report` from the __cause__ chain."""
+    cause = self.__cause__
+    if not isinstance(cause, PipelexError):
+        return report
+    # ... cyclic-chain guard omitted ...
+    cause_report = cause.to_error_report()
+    return ErrorReport(
+        error_type=report.error_type,  # keep own identity
+        message=report.message,
+        error_category=report.error_category or cause_report.error_category,
+        error_domain=report.error_domain or cause_report.error_domain,
+        retryable=report.retryable if report.retryable is not None else cause_report.retryable,
+        user_action=report.user_action or cause_report.user_action,
+        model=report.model or cause_report.model,
+        provider=report.provider or cause_report.provider,
+        provider_metadata=report.provider_metadata or cause_report.provider_metadata,
+    )
```
Any to_error_report() override on a subclass must end with self._enrich_error_report_from_cause(report); otherwise that subclass becomes a black hole that drops the cause's classification. The cyclic-__cause__ guard exists because a malformed chain must never turn the error-reporting path into a RecursionError.
InferenceErrorCategory
Drives retry decisions. Only TRANSIENT is retryable.
TRANSIENT: retry it
CONFIGURATION: fix the setup
CONTENT: fix the input/prompt
CAPACITY: quota / billing
AMBIGUOUS: outcome unknown, unsafe to retry
UNKNOWN: could not classify
ErrorDomain
Drives HTTP status. Class-level on the exception.
INPUT → HTTP 422: caller can fix it
CONFIG → HTTP 500: env change needed
RUNTIME → HTTP 500: failure during execution
A provider 429 in provider_metadata overrides this so the API can emit Retry-After.
Every inference worker collapses to one three-line shape.
After the ECR (refactor/ECR) sweep, every provider worker under pipelex/plugins/*/ handles SDK exceptions with the same three steps. Extract turns the provider's SDK exception into a provider-blind ProviderErrorMetadata envelope. Classify is a single shared, provider-blind function mapping that envelope to a ClassificationResult(category, user_action_kind, is_model_not_found). Render is a single shared function per worker family that picks the right CogtError subclass from an InferenceErrorFamily tag plus the is_model_not_found flag (e.g. LLMModelNotFoundError vs LLMCompletionError).
```diff
 except (APIStatusError, APIConnectionError, APITimeoutError) as exc:
+    metadata = extract_openai_metadata(exc)  # provider-specific
+    classification = classify_inference_error(metadata)  # shared, provider-blind
+    raise render_llm_error(  # shared
+        classification=classification,
+        metadata=metadata,
+        family=InferenceErrorFamily.LLM,
+        model_desc=self.inference_model.desc,
+    ) from exc
```
Only the per-provider extract_*_metadata functions stay plugin-local — one per SDK family, registered against ProviderName. Classify and Render live once, in pipelex/cogt/inference/error_classify.py and error_render.py. A parity meta-test (test_provider_classification_parity.py) walks every ProviderName against the extract-fn registry and the worker-family map, so adding a new provider without wiring it fails fast.
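The parity idea (walk every provider, fail if one is not wired) can be sketched as below. This is not the content of test_provider_classification_parity.py: the three enum members, the dict-returning extract functions, and `EXTRACT_REGISTRY` are hypothetical and exist only to show the shape of the check.

```python
from enum import Enum
from typing import Callable


class ProviderName(str, Enum):
    # Hypothetical subset of providers, for illustration only.
    OPENAI = "openai"
    MISTRAL = "mistral"
    GOOGLE = "google"


def extract_openai_metadata(exc: Exception) -> dict[str, object]:
    return {"provider": "openai", "message": str(exc)}


def extract_mistral_metadata(exc: Exception) -> dict[str, object]:
    return {"provider": "mistral", "message": str(exc)}


# Hypothetical registry: one Extract function per provider.
EXTRACT_REGISTRY: dict[ProviderName, Callable[[Exception], dict[str, object]]] = {
    ProviderName.OPENAI: extract_openai_metadata,
    ProviderName.MISTRAL: extract_mistral_metadata,
}


def unwired_providers() -> list[ProviderName]:
    """Return providers that have no Extract function registered."""
    return [provider for provider in ProviderName if provider not in EXTRACT_REGISTRY]


missing = unwired_providers()  # GOOGLE was deliberately left unwired above
```

A test that asserts this list is empty turns a forgotten registration into an immediate failure rather than an unclassified error in production.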
This sweep lifted every worker kind — LLM, img-gen, extract, search, plus AWS Bedrock — to the same standard. The workers that previously had zero SDK handling (Google, Mistral, Azure img-gen, FAL, HuggingFace, Docling, Linkup, pypdfium2) now all classify. Refinements that landed alongside:
@property accessors on ProviderErrorMetadata (is_quota_exhaustion, is_content_policy_violation, is_network_error) or as a flag the Extract fn sets. Classify reads them; the worker no longer branches on them.
instructor unwrap. On structured-generation paths, extract_underlying_sdk_exception() recovers the real SDK exception out of InstructorRetryException, so it routes through the same Extract → Classify → Render chain as the plain-text path.
UNKNOWN instead of mis-labelling. An unrecognized inner exception (e.g. a pydantic.ValidationError from a schema mismatch) now lands in UNKNOWN rather than being mis-categorized as a real CONTENT-policy violation.
ProviderErrorMetadata. Every raised inference error carries status_code, request_id, retry_after_seconds, provider_error_code, and the human message. The raw response body is held in-process but excluded from serialization because it can carry credentials.
An honest model, and removing the loop that pretended otherwise.
The headline change: the PipeRouter transient-retry loop was removed. It only ever ran on the direct (non-Temporal) path, re-ran at the wrong granularity, and carried a per-run/global config bug. Rather than fix a retry loop for a path that is deliberately not the resilient one, it was deleted (PR #909).
Supporting changes that landed under this track:
cogt.transport_max_retries (default 2) is wired into every inference SDK client factory. The two families that defaulted to no transport retry (Mistral, Google) are brought up to the floor. The SDK-less azure_rest img-gen path gets a tenacity-based wrapper (transport_retry.py) that narrows retries for non-idempotent POSTs.
instructor confined to schema re-ask. Its retry predicate now matches only validation failures: a transport error propagates raw instead of being wrapped, so Tier 1 stays the sole transport-retry layer.
PipeBatch uses gather_bounded() (max_concurrency, default 8) as admission control, so a big batch doesn't trigger a self-inflicted rate-limit storm. This is not retry; it stays.
Keeping the classification intact across process boundaries.
When a pipe runs on a Temporal worker, the error has to survive serialization across the activity → workflow → submitter boundary. Temporal's default failure converter would auto-wrap a raw PipelexError — losing both our structured ErrorReport and the category-aware retry decision. The bridge closes that gap.
```diff
+@functools.wraps(func)
+async def wrapper(*args, **kwargs):
+    try:
+        return await func(*args, **kwargs)
+    except PipelexError as exc:  # ONLY PipelexError, never bare Exception
+        raise TemporalError.from_message_exception(exc=exc) from exc
```
Applied beneath @activity.defn on every in-scope activity. from_message_exception() derives non_retryable from InferenceErrorCategory.is_retryable (the class-name list is now only a fallback for category-less errors) and packs to_error_report().to_dict() into ApplicationError.details. On the way back, recover_error_report() walks the __cause__ chain, pulls the dict, and rebuilds the ErrorReport — tolerating version skew so the error path never crashes on a rolling deploy.
An error raised on a Temporal worker reaches the submitter with the same error_category / retryable / model / provider / user_action as the identical failure run locally. An integration parity pair locks this by construction.
Where the classified error finally gets rendered.
Every handle_* function builds its panel through one shared display_error_panel() helper — red banner, structured fields, user_action tip, doc/Discord links. Exception-specific logic stays in the handler; the panel shape lives in one place.
run / validate / init / models / doctor / check-model default to markdown, with --format json available. The format is set once per invocation via a ContextVar, so the error path inherits it without threading an argument through.
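Setting the format once and letting the error path read it can be sketched with the standard-library contextvars module. The variable and helper names here are hypothetical, and the rendered strings are placeholders rather than the real agent CLI output.

```python
import json
from contextvars import ContextVar

# Hypothetical names: the real variable and helpers in pipelex may differ.
_output_format: ContextVar[str] = ContextVar("output_format", default="markdown")


def set_output_format(fmt: str) -> None:
    """Called once per CLI invocation, e.g. after parsing --format."""
    _output_format.set(fmt)


def render_error(message: str) -> str:
    # The error path reads the format from context; no argument is threaded through.
    if _output_format.get() == "json":
        return json.dumps({"error": True, "message": message})
    return f"**Error:** {message}"


default_rendering = render_error("model not found")
set_output_format("json")
json_rendering = render_error("model not found")
```

A ContextVar is also isolated per asyncio task, so concurrent invocations in one process would not overwrite each other's format.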
The agent JSON error payload is the structured shape downstream agents parse:
```json
{
  "error": true,
  "error_type": "LLMCompletionError",
  "message": "...",
  "hint": "...",
  "retryable": true,
  "error_domain": "runtime",
  "error_category": "transient",
  "model": "gpt-4o",
  "provider": "openai",
  "error_source": ["LLMCompletionError @ .../worker.py:152 (in _gen_text)"]
}
```
HTTP-status mapping lives in the library. pipelex has no API server, but pipelex-relay and pipelex-back-office both render an ErrorReport as an HTTP response — and were reinventing the mapping. It now lives once in base_exceptions.py: error_domain_to_http_status() is the pure table, ErrorReport.http_status layers the provider-429 passthrough on top. Downstream FastAPI handlers become a trivial adapter and must not redefine the contract.
What proves the system behaves — and what stops it rotting.
| Test layer | What it locks in |
|---|---|
| Per-worker classification | Mocks each SDK to raise every typed exception; asserts the resulting CogtError has the expected error_category, user_action, provider_metadata. |
| instructor unwrap | One end-to-end test per provider against real instructor.from_*(…) so the unwrap can't silently rot if instructor's wrapping shape changes. |
| Full-chain snapshot | Builds a pipeline where one pipe fails, runs the agent CLI, asserts the JSON carries the classification + an error_source chain in order. Catches a wrapper silently swallowing error_category. |
| Local / Temporal parity | Runs the same pipe locally and on the in-process Temporal server; asserts an identical ErrorReport. Parity holds by construction. |
| Dict drift-detection | Walks the PipelexError hierarchy and asserts every subclass is covered by class-level metadata or a fallback dict entry — closes the "silent breakage when a class is renamed" failure mode. |
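The drift-detection check in the last row can be sketched as a recursive walk over `__subclasses__()`. The classes and the fallback table below are hypothetical stand-ins; the real test walks the actual PipelexError hierarchy and the fallback dicts in agent_output.py.

```python
class PipelexError(Exception):
    """Stand-in root; the real class lives in pipelex.base_exceptions."""


class InputError(PipelexError):
    error_domain = "input"  # covered by class-level metadata


class LegacyError(PipelexError):
    pass  # covered only by the fallback dict below


class ForgottenError(PipelexError):
    pass  # covered by neither: the check should flag it


FALLBACK_DOMAINS = {"LegacyError": "runtime"}  # hypothetical fallback table


def all_subclasses(cls: type) -> list[type]:
    """Recursively collect every subclass of `cls`."""
    found: list[type] = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found


def uncovered_classes() -> list[str]:
    """Names of subclasses with neither class-level metadata nor a fallback entry."""
    return sorted(
        sub.__name__
        for sub in all_subclasses(PipelexError)
        if getattr(sub, "error_domain", None) is None
        and sub.__name__ not in FALLBACK_DOMAINS
    )


uncovered = uncovered_classes()
```

Because the fallback table is keyed by class name, renaming `LegacyError` without updating the table would make it show up in `uncovered`, which is the failure mode the test exists to catch.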
Deliberately scoped out — flagged here so review doesn't treat it as missing.
The bulk of the work has landed. One item remains, intentionally left for after this branch:
A handful of CogtError subclasses still carry no class-level error_category, and several non-inference PipelexError subclasses still rely on the agent_output.py fallback dicts for hint / error_domain rather than class-level metadata. The drift-detection test already enforces coverage one way or the other — the migration just moves entries onto the classes. See the Followups in track-metadata-model.md.
Everything else — the metadata model, worker classification, retry & resilience, CLI delivery, the Temporal bridge, and the test coverage — is landed and described in current-state terms in the track docs.
Start with wip/error-handling/README.md (status table + read order), then architecture.md (layer model, hierarchy, ErrorReport shape). Each track-*.md is a self-contained concern; read them in any order. Completed plans are archived under archive-*.md for their running notes.