Observatory

Bulletin No. 01 — Restricted instrument

Measure how language models actually behave.

Compare responses across tasks, judges, and rubrics. Drill into individual samples. Generate signed reports. Treat the answers as data, not vibes.
