{% if logo_data_url %} {% endif %}

Comparison · {{ runs|length }} run{% if runs|length != 1 %}s{% endif %}

Run ids: {{ run_ids_label }}
Generated {{ generated_at }} · AMX {{ amx_version }}
{% if missing %}
{{ missing|length }} run id{{ "s" if missing|length != 1 else "" }} not found in history: {% for m in missing %}#{{ m }}{% if not loop.last %}, {% endif %}{% endfor %}
{% endif %} {# ── Summary card ──────────────────────────────────────────────── #}

Summary

One row per run.

Run | Command | LLM | Doc / Code | Duration | Status
{% for r in summary_rows %}
#{{ r.run_id }} | {{ r.command or "—" }} | {{ r.llm_provider_model or r.llm_model or "—" }} | {% set doc = r.doc_profile or "" %}{% set code = r.code_profile or "" %}{% if doc and code %}{{ doc }} / {{ code }}{% elif doc or code %}{{ doc or code }}{% else %}—{% endif %} | {% if r.duration_sec %}{{ "%.2f"|format(r.duration_sec) }}s{% else %}—{% endif %} | {{ r.status_label }}
{% endfor %}
{# ── Aggregates card ──────────────────────────────────────────── #} {% if aggregate_rows %}

Aggregates

Per-run roll-ups (model time, tokens, cost, confidence band split, approval rate). Best value per row is highlighted.
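
{# Illustration only: one way the "best value per row" flag could be computed before rendering.
This is a minimal sketch, not the code that builds aggregate_rows. The function name mark_best,
the higher_is_better switch, and the shape of the input (run id mapped to a raw number or None)
are all assumptions made for the example.

```python
from typing import Dict, Optional


def mark_best(values: Dict[int, Optional[float]], higher_is_better: bool) -> Dict[int, bool]:
    """Return, per run id, whether that run holds the best value in the row.

    Hypothetical helper: runs with no value (None) never win, and tied
    runs are all marked as best.
    """
    present = [v for v in values.values() if v is not None]
    if not present:
        # Nothing to compare, so nothing is highlighted.
        return {rid: False for rid in values}
    target = max(present) if higher_is_better else min(present)
    return {rid: (v is not None and v == target) for rid, v in values.items()}


# Cost row: lower is better, run 3 has no recorded cost.
cost_flags = mark_best({1: 0.50, 2: 0.20, 3: None}, higher_is_better=False)
# Approval-rate row: higher is better, runs 1 and 2 tie.
approval_flags = mark_best({1: 0.9, 2: 0.9, 3: 0.4}, higher_is_better=True)
```

Whether a metric counts as "best" when high or low (cost and time versus approval rate, for
instance) has to be decided per row, which is why the direction is passed in explicitly here. #}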

Metric{% for rid in run_ids %} | #{{ rid }}{% endfor %}
{% for arow in aggregate_rows %}
{{ arow.label }}{% for rid in run_ids %}{% set cell = arow.cells[rid] %} | {{ cell.display }}{% endfor %}
{% endfor %}
{% endif %} {# ── Per-column descriptions card ─────────────────────────────── #} {% if percol_rows %}

Per-column descriptions

Each cell shows the chosen description for that run. The cell with the highest logprob across the row gets the winner highlight.
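
{# Illustration only: a sketch of how the winning cell in a row could be chosen by highest logprob.
The template only exposes cell.logprob_display, so the raw numeric "logprob" key used below is an
assumption, as is the function name pick_winner. Log-probabilities are at most zero, so the
"highest" value is the one closest to zero.

```python
from typing import Dict, Optional


def pick_winner(cells: Dict[int, Optional[dict]]) -> Optional[int]:
    """Return the run id whose cell has the highest logprob, or None.

    Hypothetical helper: cells that are missing, or that carry no
    logprob, are skipped. On a tie the first run id encountered wins.
    """
    scored = [
        (cell["logprob"], rid)
        for rid, cell in cells.items()
        if cell and cell.get("logprob") is not None
    ]
    if not scored:
        return None
    best_logprob, best_rid = max(scored, key=lambda pair: pair[0])
    return best_rid


row_cells = {
    7: {"description": "Customer signup date", "logprob": -1.2},
    8: {"description": "Date the customer signed up", "logprob": -0.4},
    9: None,  # run 9 produced no description for this asset
}
winner = pick_winner(row_cells)  # run 8: -0.4 is closer to zero than -1.2
```
#}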

Asset{% for rid in run_ids %} | #{{ rid }}{% endfor %}
{% for prow in percol_rows %}
{{ prow.label }}{% for rid in run_ids %}{% set cell = prow.cells[rid] %} | {% if cell %}{{ cell.description }} {% if cell.confidence %}{{ cell.confidence }}{% endif %} {% if cell.logprob_display %}logprob {{ cell.logprob_display }}{% endif %} {% if cell.token_count %} · {{ cell.token_count }} tok{% endif %}{% else %}{% endif %}{% endfor %}
{% endfor %}
{% endif %} {% if not aggregate_rows and not percol_rows %}

No aggregate metrics or per-column overlap to display for the selected runs.

{% endif %} {# ── Quality metrics card (Tier 0/1/2 academic metrics) ────────── #} {% if quality_per_run %}

Quality metrics

Tier {{ quality_tier }} academic text-quality analysis. {% if quality_references_summary %}References: {{ quality_references_summary }}.{% endif %}

Run | Diversity (TTR) | Schema grounding | chrF | ROUGE-L | BERTScore | Embed. agree. | Judge win-rate
{% for row in quality_per_run %}
#{{ row.run_id }} | {% if row.type_token_ratio is not none %}{{ "%.0f"|format(row.type_token_ratio * 100) }}%{% else %}—{% endif %} | {% if row.schema_grounding is not none %}{{ "%.0f"|format(row.schema_grounding * 100) }}%{% else %}—{% endif %} | {% if row.chrf is not none %}{{ "%.0f"|format(row.chrf * 100) }}%{% else %}—{% endif %} | {% if row.rouge_l is not none %}{{ "%.0f"|format(row.rouge_l * 100) }}%{% else %}—{% endif %} | {% if row.bertscore is not none %}{{ "%.0f"|format(row.bertscore * 100) }}%{% else %}—{% endif %} | {% if row.embedding_agreement is not none %}{{ "%.0f"|format(row.embedding_agreement * 100) }}%{% else %}—{% endif %} | {% if row.judge_win_rate is not none %}{{ "%.0f"|format(row.judge_win_rate * 100) }}% ({{ row.judge_wins }}/{{ row.judge_pairings }}){% else %}—{% endif %}
{% endfor %}
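{# Illustration only: the quality cells above expect each metric as a fraction in [0, 1] or none,
and render it as a whole-number percentage. The sketch below mirrors that formatting in Python and
shows one common definition of the type-token ratio (distinct tokens divided by total tokens).
The lowercase whitespace tokenizer and both function names are assumptions for the example; the
tool's actual tokenization is not shown in this template.

```python
from typing import Optional


def type_token_ratio(text: str) -> Optional[float]:
    """Distinct tokens divided by total tokens, or None for empty text.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    tokens = text.lower().split()
    if not tokens:
        return None
    return len(set(tokens)) / len(tokens)


def fmt_pct(value: Optional[float]) -> str:
    """Mirror the template cell: whole-number percent, or a dash for None."""
    if value is None:
        return "—"
    return "%.0f%%" % (value * 100)


ttr = type_token_ratio("the cat saw the dog")  # 4 distinct tokens out of 5
cell_text = fmt_pct(ttr)                       # "80%"
zero_text = fmt_pct(0.0)                       # "0%"
```

Because the template tests "is not none" rather than truthiness, a genuine score of zero still
renders as 0% instead of falling through to the placeholder dash. #}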
{% endif %} {# ── Methods / academic citations (always shown if metrics ran) ──── #} {% if quality_citations %}

Methods

Bibliographic references for the metrics above.

{% endif %}