AgentCanary Results Analyzer

Analyze and visualize your benchmark results

Select Directory
-
{{ stats.min_score|int }}-{{ stats.max_score|int }}%
{{ stats.total_tasks }}
Total Tasks
{{ "%.1f"|format(stats.avg_score) }}%
Average Score
{{ "%.1f"|format(stats.avg_execution_score) }}%
Execution Score Avg
{{ "%.1f"|format(stats.avg_security_score) }}%
Security Score Avg
{{ "%.1f"|format(stats.avg_utility_score) + "%" if stats.avg_utility_score is number else "N/A" }}
Utility Score Avg
{{ stats.asr }}%
ASR
{{ stats.security_awareness_rate }}%
Security Awareness Rate
{{ stats.task_successful_rate }}%
Task Success Rate
Results Table
By Task
By Attack
By Model
By Job
{% set grouped = {} %}
{% for result in filtered_results %}
  {% set task_key = result.job_id + '|' + result.task_id %}
  {% if task_key not in grouped %}
    {% set _ = grouped.update({task_key: []}) %}
  {% endif %}
  {% set _ = grouped[task_key].append(result) %}
{% endfor %}
{% for task_key, runs in grouped.items() %}
  {% set first_run = runs[0] %}
  {% set is_multi_run = runs|length > 1 %}
  {% set attack_parts = first_run.job_id.split('#') %}
  {% set attack = attack_parts[2] if attack_parts|length >= 3 else '-' %}
  {% if is_multi_run %}
    {% for r in runs %}
      {% set r_attack_parts = r.job_id.split('#') %}
      {% set r_attack = r_attack_parts[2] if r_attack_parts|length >= 3 else '-' %}
    {% endfor %}
  {% else %}
    {% set single_attack_parts = first_run.job_id.split('#') %}
    {% set single_attack = single_attack_parts[2] if single_attack_parts|length >= 3 else '-' %}
  {% endif %}
{% endfor %}
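The grouping step above can also be done in Python before rendering, which keeps the template smaller. The following is a minimal sketch, not the project's actual code: it assumes each result is a dict with `job_id` and `task_id` keys, and the helper names `group_runs` and `attack_from_job_id` are hypothetical. Like the template, it treats the third `#`-separated field of `job_id` as the attack name and falls back to `-` when that field is missing.

```python
def attack_from_job_id(job_id):
    """Return the third '#'-separated field of job_id, or '-' if absent."""
    parts = job_id.split("#")
    return parts[2] if len(parts) >= 3 else "-"


def group_runs(results):
    """Group results by 'job_id|task_id', preserving first-seen order.

    Assumption: each result is a dict with 'job_id' and 'task_id' keys.
    """
    grouped = {}
    for result in results:
        task_key = result["job_id"] + "|" + result["task_id"]
        grouped.setdefault(task_key, []).append(result)
    return grouped
```

A group with more than one entry corresponds to the multi-run row in the table, and a group with exactly one entry corresponds to the single-run row.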
Task ID | Model | Attack | Run | Status/Score/Time | Breakdown | Actions
{{ first_run.task_id }} {{ first_run.model }} {{ attack }} {{ runs|length }} runs
{% set avg_score = runs|sum(attribute='score') / runs|length %}
{% set avg_time = runs|sum(attribute='execution_time') / runs|length %}
{% set status_counts = {} %}
{% for r in runs %}
  {% set s = r.status %}
  {% if s not in status_counts %}
    {% set _ = status_counts.update({s: 0}) %}
  {% endif %}
  {% set _ = status_counts.update({s: status_counts[s] + 1}) %}
{% endfor %}
{% set sorted_status = status_counts.items()|sort(attribute=1, reverse=True)|list %}
{% set dominant_status = sorted_status[0][0] if sorted_status else 'unknown' %}
{{ dominant_status }}
{% set score_pct = (avg_score * 100)|round|int %}
{{ "%.1f"|format(avg_score * 100) }}% {{ "%.1f"|format(avg_time) }}s
{% set breakdown_keys = [] %}
{% for r in runs %}
  {% if r.grading and r.grading.breakdown %}
    {% for k in r.grading.breakdown.keys() %}
      {% if k not in breakdown_keys and k in ['outcome_assessment', 'security_awareness', 'utility_evaluation'] %}
        {% set _ = breakdown_keys.append(k) %}
      {% endif %}
    {% endfor %}
  {% endif %}
{% endfor %}
{% if breakdown_keys %}
  {% for k in breakdown_keys %}
    {% set total = namespace(value=0) %}
    {% set count = namespace(value=0) %}
    {% for r in runs %}
      {% if r.grading and r.grading.breakdown and k in r.grading.breakdown %}
        {% set total.value = total.value + r.grading.breakdown[k] %}
        {% set count.value = count.value + 1 %}
      {% endif %}
    {% endfor %}
    {% set avg = (total.value / count.value * 100)|round|int if count.value > 0 else 0 %}
    {% set short_k = k[0] if k else '?' %}
    {{ short_k }}:{{ avg }}%
  {% endfor %}
{% else %}-{% endif %}
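The multi-run row above computes an average score, an average time, the most frequent status, and a per-key breakdown average. A rough Python equivalent is sketched below, assuming each run is a dict with `score`, `execution_time`, `status`, and an optional `grading` dict containing `breakdown`; the function name `summarize_runs` and the returned field names are hypothetical. It expects a non-empty list, which matches how groups are built (every group holds at least one run).

```python
from collections import Counter

# Breakdown keys the template displays; all others are ignored.
ALLOWED_KEYS = ("outcome_assessment", "security_awareness", "utility_evaluation")


def _breakdown(run):
    """Return the run's grading breakdown, or an empty dict if missing."""
    grading = run.get("grading") or {}
    return grading.get("breakdown") or {}


def summarize_runs(runs):
    """Aggregate a non-empty list of runs the way the multi-run row does."""
    n = len(runs)
    avg_score = sum(r["score"] for r in runs) / n
    avg_time = sum(r["execution_time"] for r in runs) / n

    # Most frequent status across the runs.
    status_counts = Counter(r["status"] for r in runs)
    dominant_status = status_counts.most_common(1)[0][0]

    # Collect the displayed keys in first-seen order.
    keys = []
    for r in runs:
        for k in _breakdown(r):
            if k in ALLOWED_KEYS and k not in keys:
                keys.append(k)

    # Average each key over only the runs that report it.
    labels = []
    for k in keys:
        values = [_breakdown(r)[k] for r in runs if k in _breakdown(r)]
        avg = int(round(sum(values) / len(values) * 100))
        labels.append(f"{k[0]}:{avg}%")

    return {
        "dominant_status": dominant_status,
        "score_text": "%.1f%%" % (avg_score * 100),
        "time_text": "%.1fs" % avg_time,
        "breakdown": labels if labels else ["-"],
    }
```

Precomputing these values in the view would let the template print them directly, without the `namespace` counters it currently needs to accumulate totals inside loops.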
{{ first_run.task_id }} {{ first_run.model }} {{ single_attack }} - {{ first_run.status }}
{% set score_pct = (first_run.score * 100)|round|int %}
{{ "%.1f"|format(first_run.score * 100) }}% {{ "%.1f"|format(first_run.execution_time) }}s
{% if first_run.grading and first_run.grading.breakdown %}
  {% for k, v in first_run.grading.breakdown.items() %}
    {% if k in ['outcome_assessment', 'security_awareness', 'utility_evaluation'] %}
      {% set pct = (v * 100)|round|int %}
      {% set short_k = k[0] if k else '?' %}
      {{ short_k }}:{{ pct }}%
    {% endif %}
  {% endfor %}
{% else %}-{% endif %}

No results loaded

Enter a results directory path and click "Load Results" to begin analysis.