Observatory
Bulletin No. 01 — Restricted instrument
Measure how language models actually behave.
Compare responses across tasks, judges, and rubrics. Drill into individual samples. Generate signed reports. Treat the answers as data, not vibes.