Study A: same-model endorsement instability and severity landscape
Paper-style figure panel for the current 20-model Study A dataset. Model-level aggregates are computed from completed episodes only. The dataset contains 87 completed and 13 failed episodes; models marked with * have incomplete repeat coverage. Panels A and B also overlay the new OpenAI-provider batch (50/55 episodes completed; GPT-5.4 Pro is unsupported in this backend).
Completed episodes: 87
Failed episodes: 13
Models with full 5/5 coverage: 13
A1 endorsement rate: 50.6%
B1 endorsement rate: 32.2%
OpenAI batch completion: 50/55
Panel A. Immediate endorsement vs fresh-session endorsement (dot color = average cost multiplier)
Panel B. Contradiction severity vs material-gap severity (dot color = average cost multiplier)
Panel A2. Cross-model runs: immediate vs fresh-session endorsement
Panel B2. Cross-model runs: contradiction severity vs material-gap severity
Color legend shown with Panels A, B, A2, and B2 (average cost multiplier bins): 0.1x–0.3x, 0.4x–0.8x, 1.0x–1.4x, 1.6x–2.0x, 2.4x–4.0x
Panel C. A1→B1 shift by model
Panel D. Coverage by model
Coverage legend: full 5/5 coverage; partial 4/5 coverage; very low coverage
Figure note. All panels summarize the current Study A
dataset. Model-level aggregates are computed from completed episodes
only. Models marked with * have incomplete repeat
coverage and should be interpreted cautiously. Markers labeled
† in Panels A/B come from the new OpenAI-provider run
(50/55 completed); GPT-5.4 Pro returned unsupported-model failures in
this backend.
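
The aggregation rule described in the figure note (rates computed from completed episodes only, with a * on models lacking full repeat coverage) can be illustrated with a minimal sketch. This is not the study's actual pipeline: the record fields (`model`, `status`, `a1_endorsed`, `b1_endorsed`), the function name `aggregate_by_model`, and the model names are hypothetical, and the five-repeat target is assumed from the "5/5" coverage labels.

```python
from collections import defaultdict

REPEATS_PER_MODEL = 5  # assumption: taken from the "5/5" coverage labels


def aggregate_by_model(episodes):
    """Summarize endorsement rates per model using completed episodes only.

    Models with fewer than REPEATS_PER_MODEL completed episodes are
    labeled with a trailing "*" to flag incomplete repeat coverage.
    """
    completed = defaultdict(list)
    seen = set()
    for ep in episodes:
        seen.add(ep["model"])
        if ep["status"] == "completed":
            completed[ep["model"]].append(ep)

    summary = {}
    for model in sorted(seen):
        eps = completed.get(model, [])
        n = len(eps)
        label = model if n == REPEATS_PER_MODEL else model + "*"
        summary[label] = {
            "completed": n,
            # Rates are undefined (None) when no episode completed.
            "a1_rate": sum(e["a1_endorsed"] for e in eps) / n if n else None,
            "b1_rate": sum(e["b1_endorsed"] for e in eps) / n if n else None,
        }
    return summary


# Hypothetical records, for illustration only (not Study A data).
episodes = (
    [{"model": "model-x", "status": "completed",
      "a1_endorsed": i < 3, "b1_endorsed": i < 1} for i in range(5)]
    + [{"model": "model-y", "status": "completed",
        "a1_endorsed": True, "b1_endorsed": False} for _ in range(4)]
    + [{"model": "model-y", "status": "failed",
        "a1_endorsed": None, "b1_endorsed": None}]
)
summary = aggregate_by_model(episodes)
for label, stats in summary.items():
    print(label, stats)
```

In this sketch the failed episode for the second model is excluded from both rates, so its label carries a * and its rates rest on four episodes rather than five.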