Reintroduce structuring_method = preliminary_text via build-time elaboration + new PipeStructure operator

Brings back the "text-then-object" capability removed in 16b775b8, but reshapes it as a build-time rewrite into PipeSequence[PipeLLM(text), PipeStructure] instead of a runtime branch inside PipeLLM. Adds a first-class PipeStructure operator that users can also call directly.

Branch: feature/Text-then-object
Base: bb9bdb32 → HEAD 62fc10e4
55 files changed · +3208 / −97
Added: PipeStructure
Changed: PipeLLM runtime simplified
Changed: BundleElaborator pass

TL;DR

Why this change

Before this PR, structuring_method = "preliminary_text" existed as a field on the runtime PipeLLM, but it raised NotImplementedError at run time — left over after the original text-then-object implementation was removed in 16b775b8. The directive was effectively dead.

We have two kinds of users who want this:

Solving both with a runtime branch on PipeLLM bloats the operator and hides a sequence behind one pipe. Instead, we introduce PipeStructure as the underlying primitive, then make structuring_method = "preliminary_text" a build-time shorthand that expands into it. The runtime layer (and Temporal) never sees the directive — it sees only ordinary PipeSequence + PipeLLM + PipeStructure.

Architectural shape

Where elaboration sits in the load path

.mthds file
    │
    ▼
TOML parser ─► dict
    │
    ▼
PipelexBundleBlueprint.model_validate(dict)   ◄── parse-time validators fire here
    │                                             (incl. new validate_preliminary_text_output)
    ▼
BundleElaborator.elaborate(bundle)            ◄── NEW: rewrites preliminary_text
    │                                             (fast-path short-circuit if absent)
    │   ┌─ scans for PipeLLM with structuring_method=preliminary_text
    │   ├─ for each match: synthesize draft + structure + wrapping sequence
    │   ├─ writes bundle.elaboration_metadata side-table
    │   └─ re-validates the rewritten bundle (ref checks, main_pipe)
    ▼
elaborated PipelexBundleBlueprint
    │
    ▼
LibraryManager → PipeFactory → runtime PipeAbstract objects
                               (sees only PipeSequence, PipeLLM, PipeStructure, never the directive)

Every code path that loads a bundle from a .mthds file goes through the elaborator. The fast-path returns the input bundle unchanged (identity-preserving) when no preliminary_text directive is present, so the common case pays no overhead.

What preliminary_text rewrites into

Before — what the user writes

[pipe.review_restaurant]
type = "PipeLLM"
description = "..."
inputs = { transcript = "Text" }
output = "RestaurantReview"
structuring_method = "preliminary_text"
prompt = """Write a thorough review …
@transcript
"""

After — what the runtime sees

# original code reused for the wrapping sequence
[pipe.review_restaurant]
type = "PipeSequence"
inputs = { transcript = "Text" }
output = "RestaurantReview"
steps = [
  { pipe = "review_restaurant__draft_text",
    result = "draft_text" },
  { pipe = "review_restaurant__structure",
    result = "review_restaurant" },
]

[pipe.review_restaurant__draft_text]
type = "PipeLLM"
inputs = { transcript = "Text" }
output = "Text"          # always single Text
prompt = "...verbatim..."

[pipe.review_restaurant__structure]
type = "PipeStructure"
inputs = { draft_text = "Text" }
output = "RestaurantReview"   # multiplicity preserved
model = original.model_to_structure

The original pipe code is reused for the wrapping sequence, so anything that referenced review_restaurant — other pipes, main_pipe, the run API — keeps working unchanged.
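To make the shape of the rewrite concrete, here is a minimal sketch that works on plain dicts instead of the real blueprint classes. The function name, the dict layout, and the omission of collision handling and model forwarding are all assumptions made for illustration; only the `__draft_text` / `__structure` naming and the three resulting pipes come from the example above.

```python
from typing import Any

PipeDict = dict[str, dict[str, Any]]  # hypothetical stand-in for the blueprint dict


def elaborate(pipes: PipeDict) -> PipeDict:
    """Expand each preliminary_text PipeLLM into sequence + draft + structure."""
    if not any(p.get("structuring_method") == "preliminary_text" for p in pipes.values()):
        return pipes  # fast path: the very same object is returned

    rewritten: PipeDict = {}
    for code, pipe in pipes.items():
        if pipe.get("structuring_method") != "preliminary_text":
            rewritten[code] = pipe
            continue
        draft_code = f"{code}__draft_text"
        structure_code = f"{code}__structure"
        # The original code is reused for the wrapping sequence.
        rewritten[code] = {
            "type": "PipeSequence",
            "inputs": pipe["inputs"],
            "output": pipe["output"],
            "steps": [
                {"pipe": draft_code, "result": "draft_text"},
                {"pipe": structure_code, "result": code},
            ],
        }
        # Step 1: same prompt and inputs, but the output is always a single Text.
        rewritten[draft_code] = {
            "type": "PipeLLM",
            "inputs": pipe["inputs"],
            "output": "Text",
            "prompt": pipe["prompt"],
        }
        # Step 2: turns the draft into the originally requested output.
        rewritten[structure_code] = {
            "type": "PipeStructure",
            "inputs": {"draft_text": "Text"},
            "output": pipe["output"],
        }
    return rewritten
```

Because the wrapping sequence keeps the original key, a lookup of `review_restaurant` still resolves after the rewrite, which is the property the paragraph above relies on.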

Multiplicity rule

Step 1 always emits a single Text, even when the original output is Foo[] or Foo[3]. Step 2 (PipeStructure) is the one that fans out: one preliminary text → N structured objects. This matches the deleted make_text_then_object_list behavior verbatim.
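The rule can be illustrated with a small sketch that parses an output spec and derives the output of each step. The regex, `ParsedOutput`, and `step_outputs` are hypothetical helpers written for this note; the real code goes through parse_concept_with_multiplicity, whose return type is not reproduced here.

```python
import re
from typing import NamedTuple, Optional


class ParsedOutput(NamedTuple):
    concept: str
    is_multiple: bool
    fixed_count: Optional[int]


# Accepts "Foo", "Foo[]" and "Foo[3]" (assumed syntax, taken from the examples above).
_OUTPUT_RE = re.compile(r"^(?P<concept>[A-Za-z_][\w.]*)(?:\[(?P<count>\d*)\])?$")


def parse_output(output: str) -> ParsedOutput:
    match = _OUTPUT_RE.match(output)
    if match is None:
        raise ValueError(f"Unrecognized output spec: {output!r}")
    count = match.group("count")
    if count is None:  # no brackets at all: single output
        return ParsedOutput(match.group("concept"), False, None)
    return ParsedOutput(match.group("concept"), True, int(count) if count else None)


def step_outputs(original_output: str) -> tuple[str, str]:
    """Return (draft step output, structure step output) for the original output spec."""
    parse_output(original_output)  # validate the spec before splitting it
    # Step 1 is always a single Text; step 2 keeps the original multiplicity.
    return "Text", original_output
```

For an original output of `Foo[3]`, this yields `Text` for the draft step and `Foo[3]` for the structure step, so the fan-out happens only in step 2.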

The bundle elaborator

New module: pipelex/core/interpreter/bundle_elaborator.py

A single class with one public classmethod, elaborate(bundle). The dispatch is intentionally narrow: there is exactly one elaboration kind today. The structure is set up so that adding a second kind is mechanical, but there is deliberately no plugin registry yet.

pipelex/core/interpreter/bundle_elaborator.py · key path
@classmethod
def elaborate(cls, bundle: PipelexBundleBlueprint) -> PipelexBundleBlueprint:
    if not bundle.pipe or not any(_is_preliminary_text_pipe(bp) for bp in bundle.pipe.values()):
        return bundle  # fast-path: identity-preserving short-circuit

    existing_codes: set[str] = set(bundle.pipe.keys())
    new_pipe_dict: dict[str, PipeBlueprintUnion] = {}
    elaboration_metadata: dict[str, ElaborationMetadata] = {}

    for pipe_code, pipe_blueprint in bundle.pipe.items():
        if _is_preliminary_text_pipe(pipe_blueprint):
            cls._elaborate_preliminary_text(
                pipe_code=pipe_code,
                pipe_blueprint=pipe_blueprint,
                new_pipe_dict=new_pipe_dict,
                elaboration_metadata=elaboration_metadata,
                existing_codes=existing_codes,
            )
        else:
            new_pipe_dict[pipe_code] = pipe_blueprint

    # Defense in depth: synthesized pipes must never themselves carry the directive.
    for synthetic_code, synthetic_blueprint in new_pipe_dict.items():
        if synthetic_code in elaboration_metadata and _is_preliminary_text_pipe(synthetic_blueprint):
            raise BundleElaboratorError(...)

    elaborated = bundle.model_copy(update={
        "pipe": new_pipe_dict,
        "elaboration_metadata": elaboration_metadata,
    })

    # Re-run bundle-level validators against the synthetic pipes.
    try:
        PipelexBundleBlueprint.model_validate(elaborated.model_dump(by_alias=True))
    except ValidationError as exc:
        raise BundleElaboratorError(...) from exc

    return elaborated

A module-level TypeGuard narrows the union type at the iteration site so the dispatch helper receives a properly-typed PipeLLMBlueprint:

pipelex/core/interpreter/bundle_elaborator.py · TypeGuard
def _is_preliminary_text_pipe(pipe_blueprint: PipeBlueprintUnion) -> TypeGuard[PipeLLMBlueprint]:
    if not isinstance(pipe_blueprint, PipeLLMBlueprint):
        return False
    method = pipe_blueprint.structuring_method
    return method is not None and method.is_preliminary_text

Wiring into the interpreter

pipelex/core/interpreter/interpreter.py
@@ make_pipelex_bundle_blueprint @@
 try:
     pipelex_bundle_blueprint = PipelexBundleBlueprint.model_validate(blueprint_dict)
     pipelex_bundle_blueprint.source = str(bundle_path) if bundle_path else None
-    return pipelex_bundle_blueprint
 except ValidationError as exc:
     ...
+
+try:
+    return BundleElaborator.elaborate(bundle=pipelex_bundle_blueprint)
+except BundleElaboratorError as exc:
+    raise PipelexInterpreterError(message=str(exc)) from exc

The PipeStructure operator

Mirrors the shape of PipeLLM for object generation but stripped of everything that doesn't apply: no user-controlled prompt template, no image/document inputs, no system prompt, exactly one Text input.

| Field | Type | Notes |
|---|---|---|
| type | "PipeStructure" | Discriminator |
| inputs | dict | Exactly one entry. Concept must be Text or refine Text. No multiplicity (use PipeBatch). |
| output | string | Any structured concept, with optional multiplicity (Foo, Foo[], Foo[N]). Cannot be Text. |
| model | LLMModelChoice \| None | Falls back to llm_choice_overrides.for_object, then llm_choice_defaults.for_object. |

Runtime flow

working_memory.get_stuff_as_str(text_input_name)
    │
    ▼
render_template(structuring_prompt, {text})      ◄── new generic template in pipelex.toml
    │
    ▼
+ get_output_structure_prompt(target_concept)    ◄── shared with PipeLLM.object generation
    │
    ▼
LLMSetting = self.llm_choice
          ?? deck.llm_choice_overrides.for_object
          ?? deck.llm_choice_defaults.for_object
    │
    ▼
content_generator.make_object        (single)
content_generator.make_object_list   (Foo[] / Foo[N])
    │
    ▼
StuffFactory.make_stuff → working_memory.set_new_main_stuff

Critical code: _live_run_operator_pipe

pipelex/pipe_operators/structure/pipe_structure.py
text_str = working_memory.get_stuff_as_str(name=self.text_input_name)

multiplicity_resolution = output_multiplicity_to_apply(
    base_multiplicity=self.output_multiplicity,
    override_multiplicity=pipe_run_params.output_multiplicity,
)
is_multiple_output = multiplicity_resolution.is_multiple_outputs_enabled
fixed_nb_output = multiplicity_resolution.specific_output_count

llm_choice_for_object = (
    self.llm_choice
    or model_deck.llm_choice_overrides.for_object
    or model_deck.llm_choice_defaults.for_object
)
llm_setting_for_object = model_deck.get_llm_setting(llm_choice=llm_choice_for_object)

structuring_template = llm_config.get_template(template_name="structuring_prompt")
rendered_user_prompt = await render_template(
    template=structuring_template,
    category=TemplateCategory.LLM_PROMPT,
    context={"text": text_str},
)
if llm_config.is_structure_prompt_enabled:
    rendered_user_prompt += await get_output_structure_prompt(
        concept_ref=self.output.concept.concept_ref
    )

llm_prompt = LLMPrompt(user_text=rendered_user_prompt)
content_class = get_class_registry().get_required_subclass(
    name=self.output.concept.structure_class_name, base_class=StuffContent,
)

if is_multiple_output:
    generated_objects = await content_generator.make_object_list(
        job_metadata=job_metadata, object_class=content_class,
        llm_prompt_for_object_list=llm_prompt,
        llm_setting_for_object_list=llm_setting_for_object,
        nb_items=fixed_nb_output,
    )
    the_content = ListContent(items=generated_objects)
else:
    the_content = await content_generator.make_object(...)
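The three-level model fallback in the excerpt above can be isolated as a small sketch. `resolve_llm_choice` is a hypothetical helper, and choices are modeled as plain strings, which is an assumption about LLMModelChoice made only for this illustration.

```python
from typing import Optional


def resolve_llm_choice(
    pipe_choice: Optional[str],
    override_choice: Optional[str],
    default_choice: str,
) -> str:
    """Pick the first choice that is set: pipe-level, then deck override, then deck default."""
    # `or` skips every falsy value, so an empty string is treated like None here.
    return pipe_choice or override_choice or default_choice
```

With this ordering, a model set on the pipe always wins, a deck-level override applies only when the pipe sets nothing, and the deck default is the last resort.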

The new generic template

pipelex/pipelex.toml · [cogt.llm_config.generic_templates]
structuring_prompt = """
Read the following text carefully and produce the requested structured output from it.

---
{{ text }}
"""

The operator deliberately has no user-controlled prompt template: the only variable is text, fed automatically from the declared input. Prompt customization for preliminary_text is tracked as a follow-up; users who really need custom prompts can author the two pipes by hand.
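As a rough illustration of how the final user prompt is assembled, the sketch below substitutes the single `{{ text }}` placeholder and appends an optional structure hint. It is a stand-in, not the real rendering path: the actual operator calls render_template and get_output_structure_prompt, while `render_structuring_prompt` and its `structure_hint` parameter are hypothetical.

```python
import re

STRUCTURING_PROMPT = """
Read the following text carefully and produce the requested structured output from it.

---
{{ text }}
"""


def render_structuring_prompt(text: str, structure_hint: str = "") -> str:
    """Fill the only template variable, then append the output-structure hint if any."""
    # A callable replacement keeps backslashes in `text` from being read as escapes.
    rendered = re.sub(r"\{\{\s*text\s*\}\}", lambda _match: text, STRUCTURING_PROMPT)
    return rendered + structure_hint
```

Since `text` is the only variable, the draft produced by step 1 is the sole input that can influence the structuring prompt.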

What changed on PipeLLM

The runtime PipeLLM no longer carries structuring_method at all. The field, validator, the NotImplementedError trap, the execution_data_dict entry, and the factory's forwarding of the field are all removed.

pipelex/pipe_operators/llm/pipe_llm.py · removed
-    structuring_method: StructuringMethod | None = None

-    @model_validator(mode="after")
-    def validate_output_concept_consistency(self) -> Self:
-        if self.structuring_method is not None and self.output.concept.structure_class_name == NativeConceptCode.TEXT:
-            msg = (
-                f"Output concept '{self.output.concept.code}' is considered a Text concept, "
-                f"so it cannot be structured. Maybe you forgot to add '{NativeConceptCode.TEXT}' to the class registry?"
-            )
-            raise ValueError(msg)
-        return self

# inside _live_run_operator_pipe:
-        if self.structuring_method is not None:
-            match self.structuring_method:
-                case StructuringMethod.PRELIMINARY_TEXT:
-                    msg = (
-                        f"PipeLLM '{self.code}': structuring_method='preliminary_text' is not currently supported. "
-                        "The text-then-object mechanism was removed; a new implementation is planned."
-                    )
-                    raise NotImplementedError(msg)
-                case StructuringMethod.DIRECT:
-                    pass

# inside execution_data_dict:
-        if self.structuring_method is not None:
-            execution_data_dict["structuring_method"] = self.structuring_method

The blueprint keeps the field — and gains a parse-time validator

structuring_method remains part of the language surface. PipeLLMBlueprint gets a model_validator(mode="after") that mirrors the elaborator's pre-check, so the user gets the error during model_validate (parse time) instead of at elaboration time. The elaborator's check stays as defense-in-depth (only reachable via model_construct).

pipelex/pipe_operators/llm/pipe_llm_blueprint.py · added
class StructuringMethod(StrEnum):
    DIRECT = "direct"
    PRELIMINARY_TEXT = "preliminary_text"

    @property
    def is_preliminary_text(self) -> bool:        # avoid `==` against enum (project rule)
        match self:
            case StructuringMethod.PRELIMINARY_TEXT:
                return True
            case StructuringMethod.DIRECT:
                return False


class PipeLLMBlueprint(PipeBlueprint):
    ...
    structuring_method: StructuringMethod | None = None

    @model_validator(mode="after")
    def validate_preliminary_text_output(self) -> Self:
        if self.structuring_method is None or not self.structuring_method.is_preliminary_text:
            return self
        output_parse_result = parse_concept_with_multiplicity(self.output)
        if QualifiedRef.parse(output_parse_result.concept_ref_or_code).local_code == NativeConceptCode.TEXT:
            raise ValueError(
                f"PipeLLM with `structuring_method = preliminary_text` cannot have output `{self.output}`. "
                "The output must be a structured concept, not Text."
            )
        return self

Heads-up for direct importers. StructuringMethod moved from pipelex.pipe_operators.llm.pipe_llm to pipelex.pipe_operators.llm.pipe_llm_blueprint (it now lives next to its only consumer). The CHANGELOG entry calls this out.

PipeLLMSpec still exposes the directive

AI agents authoring via specs can still opt in. PipeLLMSpec gains a plain structuring_method: StructuringMethod | None = None field (intentionally not SkipJsonSchema, so it shows up in the JSON schema we hand to agents) and forwards it in to_blueprint().

The elaboration_metadata side-table

Synthetic-pipe metadata lives on the bundle, not on every pipe blueprint and not on the runtime PipeAbstract. The user-facing per-pipe schema stays unpolluted.

pipelex/core/bundles/pipelex_bundle_blueprint.py · added
class StepRole(StrEnum):
    DRAFT_TEXT = "draft_text"
    STRUCTURE = "structure"


class ElaborationMetadata(BaseModel):
    parent_pipe_code: str
    step_role: StepRole


class PipelexBundleBlueprint(BaseModel):
    ...
    # Process-local. Survives model_copy. Dropped by any model_dump→model_validate
    # round-trip (exclude=True keeps MTHDS / TOML / JSON exports clean).
    elaboration_metadata: dict[str, ElaborationMetadata] | None = Field(default=None, exclude=True)

    def get_elaboration_for(self, pipe_code: str) -> ElaborationMetadata | None:
        if not self.elaboration_metadata:
            return None
        return self.elaboration_metadata.get(pipe_code)

Lifetime

One specific consumer wired today

The dependency loader in LibraryManager._load_single_dependency walks the side-table to keep synthetic helpers attached to exported parents. Without this, exporting review_restaurant from a package would leave the wrapping PipeSequence referencing unresolved review_restaurant__draft_text / __structure codes:

pipelex/libraries/library_manager.py · added
all_exported = resolved_dep.exported_pipe_codes | main_pipes
synthetic_helpers: set[str] = set()
for blueprint in dep_blueprints:
    if not blueprint.elaboration_metadata:
        continue
    for synthetic_code, meta in blueprint.elaboration_metadata.items():
        if meta.parent_pipe_code in all_exported:
            synthetic_helpers.add(synthetic_code)
all_exported |= synthetic_helpers
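The effect of this loop is easiest to see on concrete data. The sketch below reproduces the same set expansion with a stand-in `Meta` dataclass and a hypothetical `with_synthetic_helpers` function; it collapses the per-blueprint iteration into a single side-table for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meta:  # stand-in for ElaborationMetadata
    parent_pipe_code: str
    step_role: str


def with_synthetic_helpers(exported: set[str], side_table: dict[str, Meta]) -> set[str]:
    """Add every synthetic pipe whose parent is exported; leave the others out."""
    helpers = {code for code, meta in side_table.items() if meta.parent_pipe_code in exported}
    return exported | helpers


side_table = {
    "review_restaurant__draft_text": Meta("review_restaurant", "draft_text"),
    "review_restaurant__structure": Meta("review_restaurant", "structure"),
    "internal_pipe__draft_text": Meta("internal_pipe", "draft_text"),
}
visible = with_synthetic_helpers({"review_restaurant"}, side_table)
```

Here `visible` contains `review_restaurant` and its two helpers, while the helper of the non-exported `internal_pipe` stays private.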

Validation layers

Three layers guard against the only authoring mistake the elaborator can hit — combining preliminary_text with a Text output:

| Layer | Where | What it catches |
|---|---|---|
| 1. Construction | PipeLLMBlueprint.validate_preliminary_text_output | String-level: rejects "Text", "native.Text", "Text[]", "Text[N]". Fires during model_validate, before the elaborator runs. |
| 2. Defense-in-depth | BundleElaborator._elaborate_preliminary_text | Same string-level check, only reachable if a caller bypassed validation via model_construct. Test suite exercises it. |
| 3. Library-time | PipeStructure.validate_output_with_library | Concept-level: catches a domain concept that refines = "Text". Those slip past string-level guards because they don't read as Text in the source. |
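The difference between the string-level and concept-level checks can be sketched as follows. All three functions and the `refines` mapping are hypothetical simplifications: the real layers use parse_concept_with_multiplicity, QualifiedRef, and the concept library rather than a regex and a dict.

```python
import re
from typing import Optional


def local_concept_code(output: str) -> str:
    """Strip multiplicity brackets and any domain prefix, e.g. 'native.Text[3]' -> 'Text'."""
    without_multiplicity = re.sub(r"\[\d*\]$", "", output)
    return without_multiplicity.rsplit(".", 1)[-1]


def string_level_is_text(output: str) -> bool:
    """Layers 1 and 2: look only at the spelling of the output."""
    return local_concept_code(output) == "Text"


def concept_level_is_text(output: str, refines: dict[str, Optional[str]]) -> bool:
    """Layer 3: follow the refinement chain until it reaches Text or runs out."""
    code: Optional[str] = local_concept_code(output)
    seen: set[str] = set()
    while code is not None and code not in seen:
        if code == "Text":
            return True
        seen.add(code)
        code = refines.get(code)
    return False
```

A concept such as `Summary` declared with refines = "Text" passes the string-level check untouched and is only caught once the refinement chain is known, which is why the third layer has to run at library time.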

The elaborator additionally guards against:

Test surface

Test counts: the PipeStructure-focused subset has 92 passed. Across the touched source areas overall: 1506 passed, 1 xfailed. make agent-check and make docs-check are both clean.

New test modules

Integration coverage worth knowing about

Authoring note. Writing type = "PipeStructure" directly in a .mthds file currently fails plxt schema validation: the bundled schema in vscode-pipelex/crates/taplo-common/schemas/mthds_schema.json predates this PR. The preliminary_text path is unaffected (the synthesized PipeStructure lives in-memory only). Cross-repo follow-up: ship a pipelex-tools release with the regenerated schema. Captured in TODOS.md.

Follow-ups deliberately deferred

Listed in TODOS.md; none of these block this PR.

  1. Synthetic-pipe marker on graph node tags + CLI listing exclusion (uses the side-table).
  2. Friendly synthetic-pipe rendering across logs, traces, and run-reporting.
  3. mthds-ui graph viewer integration.
  4. Bundle-load benchmark in CI.
  5. PipeStructure image-input support.
  6. Per-step prompt customization for preliminary_text.
  7. Generic meta-pipe / build-time elaboration framework — when a second directive lands.
  8. pipelex-dev elaborate-bundle <path> debugging CLI.
  9. Revisit StructuringMethod.DIRECT (functionally identical to None today; kept for symmetry).
  10. Persist elaboration_metadata across serialization boundaries — drop exclude=True when a second cross-boundary consumer materializes (graph viewer over a serialized bundle, Temporal payload, persistent library cache). Today the only consumer is the in-process dependency loader.

Brief generated from the diff between bb9bdb32 and HEAD on feature/Text-then-object. See TODOS.md for the full phase-by-phase plan, decisions taken, and audit notes; docs/under-the-hood/build-time-elaboration.md for the user-facing mechanism doc; docs/building-methods/pipes/pipe-operators/PipeStructure.md for the operator reference.