Readonly name: Stable identifier (e.g., 'swe-bench', 'humaneval'). Used in CLI routing and reporting.
Optional Readonly variant: Optional variant within a benchmark family (e.g., 'lite' vs 'verified').
Load the benchmark task set from disk/remote. Runs once per invocation.
Execute the solver on one instance. No evaluation here — just generate the prediction.
Evaluate a prediction against ground truth. Returns a benchmark-specific verdict.
Determine whether a verdict counts as pass. Keeps pass/fail semantics localized.
Aggregate instance results into a summary. Should be pure and deterministic.
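The last two members can be illustrated with a small sketch: a pass predicate kept in one place, and an aggregation with no I/O, clock, or randomness, so the same verdicts always give the same summary. The `ToyVerdict` and `ToySummary` shapes and the name `isPass` are hypothetical, chosen only for illustration; real verdict and summary types are benchmark-specific.

```typescript
// Hypothetical verdict shape; a real TEvalResult is defined by each benchmark.
interface ToyVerdict {
  instanceId: string;
  testsPassed: boolean;
}

// Hypothetical summary shape used only in this sketch.
interface ToySummary {
  total: number;
  passed: number;
  passRate: number;
}

// Pass/fail semantics live in one predicate, so the aggregation
// never needs to inspect verdict internals itself.
function isPass(verdict: ToyVerdict): boolean {
  return verdict.testsPassed;
}

// Pure and deterministic: depends only on its input and mutates nothing.
function summarize(verdicts: readonly ToyVerdict[]): ToySummary {
  const total = verdicts.length;
  const passed = verdicts.filter(isPass).length;
  // Guard against division by zero for an empty task set.
  return { total, passed, passRate: total === 0 ? 0 : passed / total };
}

const toySummary = summarize([
  { instanceId: "a", testsPassed: true },
  { instanceId: "b", testsPassed: false },
]);
console.log(toySummary);
```

Routing every pass/fail decision through the predicate means a change in what counts as a pass touches one function rather than every reporting path.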
Contract every benchmark implementation fulfills.
Type parameters:
TInstance: one task / problem in the benchmark (e.g., a SWE-bench issue)
TPrediction: the solver's output (e.g., a proposed patch)
TEvalResult: the evaluator's verdict (e.g., patch applied + tests passed)
A correct implementation composes as:
loadInstances -> runInstance(each) -> evaluate(each) -> summarize
Example
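A minimal sketch of how the contract and its composition might look in TypeScript. The member signatures (including the `Promise` return types), the `isPass` method name, the `Summary` shape, the `runBenchmark` helper, and the toy benchmark are all assumptions inferred from the descriptions on this page, not the actual declarations.

```typescript
// Hypothetical summary shape; the real summary type is not shown on this page.
interface Summary {
  total: number;
  passed: number;
}

// Assumed shape of the contract, reconstructed from the member descriptions.
interface Benchmark<TInstance, TPrediction, TEvalResult> {
  readonly name: string;
  readonly variant?: string;
  loadInstances(): Promise<TInstance[]>;
  runInstance(instance: TInstance): Promise<TPrediction>;
  evaluate(instance: TInstance, prediction: TPrediction): Promise<TEvalResult>;
  isPass(result: TEvalResult): boolean; // hypothetical name
  summarize(results: readonly TEvalResult[]): Summary;
}

// loadInstances -> runInstance(each) -> evaluate(each) -> summarize
async function runBenchmark<I, P, R>(benchmark: Benchmark<I, P, R>): Promise<Summary> {
  const instances = await benchmark.loadInstances(); // runs once per invocation
  const results: R[] = [];
  for (const instance of instances) {
    const prediction = await benchmark.runInstance(instance); // generate only
    results.push(await benchmark.evaluate(instance, prediction)); // then judge
  }
  return benchmark.summarize(results);
}

// Toy benchmark: the "solver" doubles the input; the third expected value is
// deliberately wrong so that one instance fails.
type ToyTask = { input: number; expected: number };
const toyIsPass = (ok: boolean): boolean => ok;

const toyBenchmark: Benchmark<ToyTask, number, boolean> = {
  name: "toy-double",
  loadInstances: async () => [
    { input: 1, expected: 2 },
    { input: 2, expected: 4 },
    { input: 3, expected: 7 },
  ],
  runInstance: async (task) => task.input * 2,
  evaluate: async (task, prediction) => prediction === task.expected,
  isPass: toyIsPass,
  summarize: (results) => ({
    total: results.length,
    passed: results.filter(toyIsPass).length,
  }),
};

runBenchmark(toyBenchmark).then((summary) => console.log(toyBenchmark.name, summary));
```

Keeping `runInstance` free of evaluation logic, as the contract asks, lets the same driver loop serve any benchmark: only the three type parameters and the member implementations change.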