{% extends "base.html" %} {% block title %}Benchmarks — Maxim Docs | Multi-Model Cognitive Testing{% endblock %} {% block meta_description %}Compare LLM backends across cognitive metrics with Maxim's three-tier benchmark system: hallucination, memory formation, causal learning.{% endblock %} {% block meta_keywords %}Maxim benchmarks, cognitive benchmarks, LLM comparison, hallucination rate, memory formation, causal learning, embodiment metrics, bio-inspired testing{% endblock %} {% block meta_author %}Maxim Project{% endblock %} {% block og_site_name %}Maxim{% endblock %} {% block og_type %}article{% endblock %} {% block structured_data %} {% endblock %} {% block content %}
MAXIM
Multi-Model Cognitive Architecture Testing
Standard LLM benchmarks measure token prediction. Maxim benchmarks measure cognitive behavior — whether a model can form memories, learn causal relationships, use tools correctly, and drive a bio-inspired architecture end-to-end.
The benchmark system runs the same simulation pipeline used for scenario testing, but adds structured metric collection across three tiers. Each tier captures a different layer of the architecture, from raw LLM output quality up through embodiment readiness.
Tier 1: How well does the model follow instructions? Measures hallucination, JSON compliance, tool usage accuracy, and alias handling at the raw output level.
Tier 2: Does the model drive the cognitive systems effectively? Tracks memory formation, associative graph growth, concept extraction, causal link discovery, and learning efficiency.
Tier 3: Is the model ready for physical deployment? Auto-detected when hardware adapters are available. Measures spatial attention accuracy, motor planning latency, and sensor fusion coherence.
Why three tiers? A model that scores perfectly on JSON compliance (Tier 1) might still fail to form useful memories (Tier 2). And a model that drives the cognitive architecture well in simulation might produce actions too slowly for real-time embodiment (Tier 3). Each tier catches failures the others miss.
Run a benchmark from the CLI by specifying models to compare and a campaign scenario:
Or use the Python API for programmatic access:
Every benchmark run collects metrics across all available tiers. Tier 1 and Tier 2 are always present. Tier 3 activates automatically when embodiment adapters are detected.
Raw model output quality. These metrics measure how well the LLM follows structured output requirements and avoids common failure modes.
| Metric | Type | Description |
|---|---|---|
| hallucination_rate | float (0–1) | Fraction of responses containing fabricated facts or non-existent tool names |
| correct_tool_usage_rate | float (0–1) | Fraction of tool calls with valid name, correct argument types, and meaningful parameters |
| json_compliance_rate | float (0–1) | Fraction of responses that parse as valid JSON on first attempt (before repair pipeline) |
| alias_redirect_rate | float (0–1) | Fraction of hallucinated tool names successfully caught and redirected via TOOL_ALIASES |
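As a rough illustration of how two of these rates could be derived, the sketch below computes json_compliance_rate and alias_redirect_rate from a list of raw responses and requested tool names. It is not Maxim's implementation: the function names, the KNOWN_TOOLS set, and the contents and dict shape of TOOL_ALIASES are all assumptions made for the example.

```python
import json

# Hypothetical tables: the real TOOL_ALIASES contents are not shown in this document.
TOOL_ALIASES = {"look_at": "focus_attention", "grab": "grasp_object"}
KNOWN_TOOLS = {"focus_attention", "grasp_object", "recall_memory"}


def json_compliance_rate(raw_responses):
    """Fraction of responses that parse as JSON on the first attempt (no repair)."""
    if not raw_responses:
        return 0.0
    parsed = 0
    for text in raw_responses:
        try:
            json.loads(text)
            parsed += 1
        except json.JSONDecodeError:
            pass  # counted as non-compliant; a repair step would run after this
    return parsed / len(raw_responses)


def alias_redirect_rate(requested_tools):
    """Fraction of unknown tool names that the alias table maps to a known tool."""
    unknown = [name for name in requested_tools if name not in KNOWN_TOOLS]
    if not unknown:
        return 0.0
    redirected = sum(1 for name in unknown if TOOL_ALIASES.get(name) in KNOWN_TOOLS)
    return redirected / len(unknown)


responses = [
    '{"tool": "grab"}',
    '{"tool": "recall_memory"}',
    "Sure! {tool: look_at}",  # not valid JSON
    '{"tool": "teleport"}',
]
requested = ["grab", "recall_memory", "look_at", "teleport"]

print(json_compliance_rate(responses))
print(alias_redirect_rate(requested))
```

Here three of the four responses parse, and two of the three unknown tool names are covered by the alias table.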
How effectively the model drives Maxim's bio-inspired subsystems. These metrics reflect the quality of the cognitive pipeline, not just the LLM output.
| Metric | Type | Description |
|---|---|---|
| memory_formation_rate | float (0–1) | Fraction of salient percepts that produce at least one hippocampal memory |
| associative_graph_density | float | Edges / nodes in the hippocampal associative graph (higher = richer associations) |
| concept_formation_rate | float (0–1) | Fraction of eligible memory clusters that produce ATL semantic concepts |
| causal_link_count | int | Number of action-outcome causal links discovered by NAc |
| learning_efficiency | float | Causal links per observation — how quickly the model learns from experience |
| observation_density | float | Observations per simulation turn — how much the model attends to its environment |
| pain_signal_count | int | Number of pain/aversion signals triggered during the run |
| type_token_ratio | float (0–1) | Lexical diversity of model output — unique tokens / total tokens |
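Several Tier 2 metrics are plain ratios of counts, as the descriptions above state. The following sketch shows that arithmetic under stated assumptions: the function name and its arguments are hypothetical, and whitespace splitting stands in for whatever tokenizer Maxim actually uses.

```python
def tier2_ratios(edges, nodes, causal_links, observations, turns, output_tokens):
    """Compute the ratio-style Tier 2 metrics from raw counts."""

    def ratio(numerator, denominator):
        # Guard against empty runs instead of raising ZeroDivisionError.
        return numerator / denominator if denominator else 0.0

    return {
        # Edges / nodes in the associative graph.
        "associative_graph_density": ratio(edges, nodes),
        # Causal links per observation.
        "learning_efficiency": ratio(causal_links, observations),
        # Observations per simulation turn.
        "observation_density": ratio(observations, turns),
        # Unique tokens / total tokens.
        "type_token_ratio": ratio(len(set(output_tokens)), len(output_tokens)),
    }


metrics = tier2_ratios(
    edges=30,
    nodes=20,
    causal_links=4,
    observations=40,
    turns=20,
    output_tokens="the arm moved the cup".split(),  # assumed whitespace tokenization
)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```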
Auto-detected when hardware adapters are present. These metrics measure readiness for physical deployment.
Auto-detection: Tier 3 metrics activate when the runtime detects hardware adapters (vision engine, motor controller, sensor fusion). In pure simulation mode, Tier 3 is reported as not_available and does not affect pass/fail status.
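The rule that an unavailable Tier 3 does not affect pass/fail can be pictured as a simple filter before aggregation. This is only a sketch: the per-tier result mapping and the overall_pass function are hypothetical, and only the not_available marker comes from the text above.

```python
NOT_AVAILABLE = "not_available"


def overall_pass(tier_results):
    """Combine per-tier results, skipping tiers that were not evaluated.

    tier_results maps a tier name to True, False, or "not_available".
    """
    evaluated = [result for result in tier_results.values() if result != NOT_AVAILABLE]
    return all(evaluated)


simulation_run = {"tier1": True, "tier2": True, "tier3": NOT_AVAILABLE}
hardware_run = {"tier1": True, "tier2": True, "tier3": False}

print(overall_pass(simulation_run))
print(overall_pass(hardware_run))
```

The simulation-only run passes on the strength of Tiers 1 and 2, while the hardware run fails because Tier 3 was actually evaluated and did not pass.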
Maxim ships with six built-in benchmark scenarios, ranging from a 30-second smoke test to a comprehensive cognitive evaluation.
30-second smoke test. Verifies the pipeline boots, the model produces valid JSON, and at least one tool call succeeds.
~30s | Tier 1 only
Presents novel situations requiring tool exploration. Measures correct_tool_usage_rate and alias_redirect_rate under unfamiliar conditions.
~60s | Tier 1
Repeated action-outcome sequences to test NAc causal link formation. Measures causal_link_count and learning_efficiency.
~90s | Tier 1 + 2
Scenarios that should trigger pain/aversion signals. Tests whether the model learns to avoid harmful actions after negative feedback.
~90s | Tier 1 + 2
Multi-turn narrative with recurring themes. Measures whether hippocampal memories cluster into ATL semantic concepts.
~120s | Tier 1 + 2
Comprehensive evaluation that combines all scenarios above. The standard benchmark for model-to-model comparison.
~5min | Tier 1 + 2 (+ Tier 3 if available)
Benchmark scenarios use the same YAML format as simulation scenarios, with additional benchmark and suite sections for metric expectations and metadata.
The metadata section is used for filtering and reporting. The benchmark.expectations list defines pass/fail criteria. The suite section controls which aggregate suites include this scenario.
Use --baseline to compare the current run against a previous benchmark result. The output shows deltas for every metric, making regressions immediately visible.
The delta report uses directional arrows to show improvement or regression:
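One way to picture how such a report could be produced is the sketch below, which compares two metric dictionaries and labels each change. The exact format Maxim emits is not reproduced here: the arrow glyphs, the line layout, the delta_lines function, and the assumption that only hallucination_rate is a lower-is-better metric are all illustrative.

```python
# Assumption: metrics in this set improve when they go down; all others improve when they go up.
LOWER_IS_BETTER = {"hallucination_rate"}


def delta_lines(baseline, current):
    """Return one formatted line per metric present in both runs."""
    lines = []
    for name in sorted(current):
        if name not in baseline:
            continue  # no baseline value to compare against
        delta = current[name] - baseline[name]
        if delta == 0:
            arrow, verdict = "→", "unchanged"
        else:
            arrow = "↑" if delta > 0 else "↓"
            improved = (delta < 0) if name in LOWER_IS_BETTER else (delta > 0)
            verdict = "improved" if improved else "regressed"
        lines.append(
            f"{name}: {baseline[name]:.3f} -> {current[name]:.3f} "
            f"{arrow} {delta:+.3f} ({verdict})"
        )
    return lines


baseline = {"hallucination_rate": 0.10, "memory_formation_rate": 0.60}
current = {"hallucination_rate": 0.08, "memory_formation_rate": 0.55}

lines = delta_lines(baseline, current)
for line in lines:
    print(line)
```

Note that a downward arrow is not always a regression: a falling hallucination_rate counts as an improvement, which is why the direction of "better" has to be tracked per metric.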
Tip: Baselines are saved automatically after each run. To create a named baseline for long-term tracking, copy the output directory: cp -r ~/.maxim/benchmarks/latest/ ~/.maxim/benchmarks/v1.0.0_mistral-7b/
Benchmark results are saved to ~/.maxim/benchmarks/{timestamp}_{model}/ with a structured JSON report and a human-readable Markdown summary.
| File | Description |
|---|---|
| benchmark_report.json | Full structured output — all metrics, expectations, pass/fail status, model metadata |
| summary.md | Human-readable summary with tables, deltas (if baseline provided), and per-scenario breakdown |
The JSON report structure:
Expectations define pass/fail criteria for benchmark scenarios. Each expectation type maps to a specific bio-system measurement. A scenario passes only when all its expectations are met.
| Expectation Type | Bio-System | Parameters | Description |
|---|---|---|---|
| memory_count_range | Hippocampus | min, max | Total episodic memories formed must fall within the given range |
| concept_formed | ATL | concept_name | A specific semantic concept must be extracted by the ATL |
| graph_density_above | Hippocampus | threshold | Associative graph edge/node ratio must exceed the threshold |
| causal_link_formed | NAc | action, outcome | A specific action → outcome causal link must be discovered |
| prediction_valence | NAc | action, valence | NAc's predicted valence for an action must match (positive/negative/neutral) |
| hallucination_rate_below | LLM | threshold | Hallucination rate must stay below the threshold (typically 0.05–0.15) |
| tool_used | Executor | tool_name, min_count | A specific tool must be called at least min_count times during the run |
| pain_signal_count | Proprioception | min, max | Pain/aversion signals triggered must fall within the given range |
Composing expectations: A single scenario can combine any number of expectations. For example, a causal learning scenario might require causal_link_formed + memory_count_range + hallucination_rate_below to pass. All expectations must be satisfied — there is no partial credit.
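The all-or-nothing rule can be sketched as follows. The expectation type names and their parameters (min, max, threshold) come from the table above; the dict layout, the metric keys such as memory_count, and the two function names are assumptions made for this example and may not match Maxim's internals.

```python
def check_expectation(expectation, metrics):
    """Evaluate a single expectation against collected metrics."""
    kind = expectation["type"]
    if kind == "memory_count_range":
        return expectation["min"] <= metrics["memory_count"] <= expectation["max"]
    if kind == "graph_density_above":
        return metrics["associative_graph_density"] > expectation["threshold"]
    if kind == "hallucination_rate_below":
        return metrics["hallucination_rate"] < expectation["threshold"]
    raise ValueError(f"unsupported expectation type: {kind}")


def scenario_passes(expectations, metrics):
    """A scenario passes only if every expectation is met (no partial credit)."""
    return all(check_expectation(e, metrics) for e in expectations)


metrics = {
    "memory_count": 12,
    "associative_graph_density": 1.4,
    "hallucination_rate": 0.08,
}
expectations = [
    {"type": "memory_count_range", "min": 5, "max": 20},
    {"type": "graph_density_above", "threshold": 1.0},
    {"type": "hallucination_rate_below", "threshold": 0.10},
]
strict = expectations[:2] + [{"type": "hallucination_rate_below", "threshold": 0.05}]

print(scenario_passes(expectations, metrics))
print(scenario_passes(strict, metrics))
```

Tightening a single threshold is enough to fail the whole scenario, even though the other two expectations are still satisfied.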