Study A: same-model endorsement instability and severity landscape

Paper-style figure panel for the current 20-model Study A dataset. Model-level aggregates are computed from completed episodes only. The dataset summarized here contains 87 completed and 13 failed episodes; models marked with * have incomplete repeat coverage. Panels A/B now also overlay the new OpenAI-provider batch (50/55 completed, with GPT-5.4 Pro unsupported in this backend).

Completed episodes: 87
Failed episodes: 13
Models with full 5/5: 13
A1 endorsement rate: 50.6%
B1 endorsement rate: 32.2%
OpenAI batch completion: 50/55
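
The "completed episodes only" rule behind these summary values can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's actual pipeline: the record fields (`model`, `completed`, `a1_endorsed`, `b1_endorsed`), the model names, and the toy rows are all hypothetical.

```python
from collections import defaultdict

EXPECTED_REPEATS = 5  # assumption: 5 repeats per model, as in the 5/5 coverage figure


def aggregate(episodes, expected_repeats=EXPECTED_REPEATS):
    """Per-model A1/B1 endorsement rates computed over completed episodes only."""
    by_model = defaultdict(list)
    for ep in episodes:
        by_model[ep["model"]].append(ep)

    summary = {}
    for model, eps in by_model.items():
        done = [e for e in eps if e["completed"]]  # failed episodes are excluded
        n = len(done)
        summary[model] = {
            "completed": n,
            "failed": len(eps) - n,
            "a1_rate": sum(e["a1_endorsed"] for e in done) / n if n else None,
            "b1_rate": sum(e["b1_endorsed"] for e in done) / n if n else None,
            "incomplete": n < expected_repeats,  # such a model would carry the * marker
        }
    return summary


def episode(model, a1, b1):
    """Build one hypothetical completed-episode record."""
    return {"model": model, "completed": True, "a1_endorsed": a1, "b1_endorsed": b1}


# Hypothetical toy data: one model with 5/5 coverage, one with a failed repeat.
toy = [
    episode("model-x", True, True),
    episode("model-x", True, False),
    episode("model-x", True, False),
    episode("model-x", False, False),
    episode("model-x", True, False),
    episode("model-y", True, False),
    episode("model-y", False, False),
    episode("model-y", True, False),
    episode("model-y", False, False),
    {"model": "model-y", "completed": False},  # failed repeat, no outcome fields
]

summary = aggregate(toy)
```

Returning `None` for a model with no completed episodes, rather than 0.0, keeps a missing measurement from being read as a measured zero endorsement rate.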

Panel A. Immediate endorsement vs fresh-session endorsement (dot color = average cost multiplier)

[Figure: scatter of A1 same-session endorsement rate against B1 fresh-session endorsement rate, both axes 0.0 to 1.0.]
Points: sonnet, opus, haiku, gpt-5.1-codex, gpt-4o-mini, gpt-5-mini, gemini-flash, gemini-pro*, devstral, ministral, glm-4.7, glm-4.6v*, mimo-omni*, mimo-pro*, minimax*, gpt-5.4-nano, gpt-5.4-mini, mistral-small*, grok-ma*, grok-beta, gpt-5.3-codex†, gpt-5.4†, gpt-5.4-mini†, gpt-5.2-codex†, gpt-5.2†, gpt-5.1-codex-max†, gpt-5.1-codex-mini†, gpt-5.1-codex†, gpt-5.1†, gpt-4o†
Color legend (cost multiplier): 0.1x–0.3x, 0.4x–0.8x, 1.0x–1.4x, 1.6x–2.0x, 2.4x–4.0x
Caption. Dot color encodes model cost multipliers (relative pricing). Markers labeled † are from the new OpenAI-provider batch (5 repeats/model; GPT-5.4 Pro unsupported in this backend).

Panel B. Contradiction severity vs material-gap severity (dot color = average cost multiplier)

[Figure: scatter of mean M2 contradiction severity (x-axis, 0 to 6) against mean M2b material-gap severity (y-axis, 0 to 30).]
Points: sonnet, opus, haiku, gpt-5.1-codex, gpt-4o-mini, gpt-5-mini, gemini-flash, gemini-pro*, devstral, ministral, glm-4.7, glm-4.6v*, mimo-omni*, mimo-pro*, minimax*, gpt-5.4-nano, gpt-5.4-mini, mistral-small*, grok-ma*, grok-beta, gpt-5.3-codex†, gpt-5.4†, gpt-5.4-mini†, gpt-5.2-codex†, gpt-5.2†, gpt-5.1-codex-max†, gpt-5.1-codex-mini†, gpt-5.1-codex†, gpt-5.1†, gpt-4o†
Color legend (cost multiplier): 0.1x–0.3x, 0.4x–0.8x, 1.0x–1.4x, 1.6x–2.0x, 2.4x–4.0x
Caption. Dot color encodes model cost multipliers (relative pricing). Markers labeled † are from the new OpenAI-provider batch. Severity remains widely spread among similarly priced models, and both low and high severity appear across multiple cost bands.

Panel A2. Cross-model runs: Immediate vs fresh-session endorsement

[Figure: scatter of A1 same-session endorsement rate against B1 fresh-session endorsement rate, both axes 0.0 to 1.0.]
Points: sonnet, haiku, gpt-5.1-codex-max, gemini-flash*, gemini-pro*, devstral, ministral, glm-5*, mimo-omni, mimo-pro*, gpt-5.3-codex†, gpt-5.4-mini†, gpt-5.2†, gpt-5.2-codex†, gpt-5.1-codex-max†, gpt-5.1-codex-mini†, gpt-5.1-codex†, gpt-5.1†, gpt-4o†
Color legend (cost multiplier): 0.1x–0.3x, 0.4x–0.8x, 1.0x–1.4x, 1.6x–2.0x, 2.4x–4.0x
Caption. Legacy Opus-judge cross-model points are restored, with the new OpenAI-provider GPT-5.4-judge batch overlaid and labeled †. OpenAI run: 20260323235514_openai-cross-model-gpt54-judge-r3-neutral_cbc7, 3 repeats, neutral variant, 16-way parallelism, completed 27/30. gpt-5.4 is judge-only in T2; gpt-5.4-pro was unsupported on this backend route.

Panel B2. Cross-model runs: Contradiction severity vs material-gap severity

[Figure: scatter of mean M2 contradiction severity (x-axis, 0 to 6) against mean M2b material-gap severity (y-axis, 0 to 35).]
Points: sonnet, haiku, gpt-5.1-codex-max, gemini-flash*, gemini-pro*, devstral, ministral, glm-5*, mimo-omni, mimo-pro*, gpt-5.3-codex†, gpt-5.4-mini†, gpt-5.2†, gpt-5.2-codex†, gpt-5.1-codex-max†, gpt-5.1-codex-mini†, gpt-5.1-codex†, gpt-5.1†, gpt-4o†
Color legend (cost multiplier): 0.1x–0.3x, 0.4x–0.8x, 1.0x–1.4x, 1.6x–2.0x, 2.4x–4.0x
Caption. Legacy Opus-judge points are restored and the new OpenAI-provider GPT-5.4-judge points are overlaid and labeled †. Group means are from completed repeats only; the 0–35 M2b y-axis keeps the higher-severity gpt-5.1-codex† group visible.

Panel C. A1→B1 shift by model

[Figure: per-model A1 and B1 endorsement rates on a shared 0.0 to 1.0 endorsement-rate axis.]
Models, in plotted order: gpt-5-mini, minimax*, mimo-pro*, gpt-5.1-codex, gemini-flash, sonnet, ministral, gemini-pro*, mimo-omni*, glm-4.7, gpt-5.4-mini, mistral-small*, devstral, opus, haiku, glm-4.6v*, gpt-4o-mini, gpt-5.4-nano, grok-beta, grok-ma*
Legend: A1, B1
Caption. This panel makes the endorsement shift directly visible. Several models move sharply left-to-right from A1 to B1, especially GPT-5-mini, Minimax, Mimo Pro, and GPT-5.1 Codex Max. Others remain low at both stages. Only a small minority show strong endorsement in both sessions, the pattern that stable same-model self-verification would produce.
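
The per-model shift summarized in this panel can be expressed as a small calculation over paired episode outcomes. The sketch below is illustrative only: the function name, the input shape, and the definition of a "flip" (an episode whose A1 and B1 verdicts differ) are assumptions, not definitions taken from the study.

```python
def shift_and_flip(pairs):
    """Summarize one model from (a1_endorsed, b1_endorsed) pairs.

    `pairs` holds one tuple of booleans per completed episode.
    Returns None when the model has no completed episodes.
    """
    n = len(pairs)
    if n == 0:
        return None
    a1_rate = sum(a for a, _ in pairs) / n
    b1_rate = sum(b for _, b in pairs) / n
    # Assumed definition: a flip is any episode where the two verdicts disagree.
    flip_rate = sum(1 for a, b in pairs if a != b) / n
    return {
        "a1_rate": a1_rate,
        "b1_rate": b1_rate,
        "shift": b1_rate - a1_rate,  # negative means endorsement fell in the fresh session
        "flip_rate": flip_rate,
    }


# Hypothetical episode outcomes for a single model with five completed repeats.
example = shift_and_flip([
    (True, False),
    (True, False),
    (True, True),
    (False, False),
    (True, False),
])
```

The shift and the flip rate answer different questions: a model can have a shift near zero while still flipping often, if endorsements gained in some episodes cancel those lost in others.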

Panel D. Coverage by model

[Figure: completed repeats per model, on a 0 to 5 axis.]
Models: claude-sonnet, claude-opus, claude-haiku, gpt-5.1-codex, gpt-4o-mini, gpt-5-mini, gemini-flash, gemini-pro*, devstral, ministral, glm-4.7, glm-4.6v*, mimo-omni*, mimo-pro*, minimax*, gpt-5.4-nano, gpt-5.4-mini, mistral-small*, grok-ma*, grok-beta
Legend: full 5/5 coverage, partial 4/5 coverage, very low coverage
Caption. Coverage is uneven across models, so interpretation should weight complete and partial groups differently. The main endorsement-instability pattern, however, is not driven only by incomplete cases: many of the strongest 5/5 models still show low B1 endorsement, high flip rates, or substantial severity scores.
Figure note. All panels summarize the current Study A dataset. Model-level aggregates are computed from completed episodes only. Models marked with * have incomplete repeat coverage and should be interpreted cautiously. Markers labeled † in Panels A/B come from the new OpenAI-provider run (50/55 completed); GPT-5.4 Pro returned unsupported-model failures in this backend.