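AI Skill Hub recommends: LLM Coding Benchmark (llm-coding-benchmark) is a quality Agent workflow. Its AI composite score is 7.5, a solid showing among comparable tools. If you are looking for a reliable Agent workflow solution, it is worth a closer look.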
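LLM Coding Benchmark is a complete AI Agent automation workflow solution. Through visual node orchestration, it breaks complex multi-step tasks into clear automated flows that run fully unattended. It supports seamless integration with hundreds of external services and APIs, and suits data-processing pipelines, business automation, and AI-assisted decision systems.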
# Option 1: install with pip (recommended)
pip install llm-coding-benchmark
# Option 2: install in a virtual environment (recommended for production)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install llm-coding-benchmark
# Option 3: install from source (for the latest features)
git clone https://github.com/akitaonrails/llm-coding-benchmark
cd llm-coding-benchmark
pip install -e .
# Verify the installation
python -c "import llm_coding_benchmark; print('Installation successful')"
# Command-line usage
llm-coding-benchmark --help
# Basic usage
llm-coding-benchmark input_file -o output_file
# Calling from Python code
import llm_coding_benchmark
# Example
result = llm_coding_benchmark.process("input")
print(result)
# Example llm-coding-benchmark config file (config.yml)
app:
  name: "llm-coding-benchmark"
  debug: false
  log_level: "INFO"
# Specify the config file at runtime
llm-coding-benchmark --config config.yml
# Or configure via environment variables
export LLM_CODING_BENCHMARK_API_KEY="your-key"
export LLM_CODING_BENCHMARK_OUTPUT_DIR="./output"
📝 This repository is the data source for a series of blog posts on akitaonrails.com that walk through the experiments and findings narratively:
- 2026-05-04, "LLM Benchmarks: DeepSeek Unlocked via deepclaude": Round 4, using the deepclaude env-swap shim to finally benchmark DeepSeek V4 Pro through Claude Code's autonomous loop after opencode's reasoning_content interop bug kept it unmeasurable.
- 2026-04-25, "LLM Benchmarks: Vale a Pena Misturar 2 Modelos?" (PT-BR): the multi-agent / forced-delegation rounds, cost-quality-time analysis, and the bottom-line verdict on whether pairing a planner with a cheaper executor is actually worth it vs solo Opus.
- 2026-04-24, "LLM Benchmarks Parte 3: DeepSeek, Kimi, MiMo" (PT-BR): the original cross-model audit, including the rubric-driven re-ranking and the RubyLLM API hallucination patterns documented in the project's docs/success_report.md.
Source code and raw results in this repo are the artifacts referenced by those posts; the docs in docs/success_report*.md are the long-form analyses.
This repository benchmarks autonomous coding runs against one fixed Rails application brief. It is built to compare a mix of local Ollama-hosted models and cloud models under the same prompt family, collect normalized run metadata, and summarize the results in Markdown.
The benchmark runner currently uses:
opencode run --agent build --format json
Each model run gets its own workspace under results/<slug>/project, plus raw opencode logs and a normalized result.json.
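The per-run layout described above can be read back with a few lines of Python. This is a minimal sketch, not the repository's own code: it assumes result.json sits directly at results/<slug>/result.json, and the helper names load_results and has_workspace are hypothetical. It makes no assumption about the fields inside result.json.

```python
import json
from pathlib import Path


def load_results(results_dir):
    """Return {slug: parsed result.json} for every run directory that has one.

    Assumes the layout results/<slug>/result.json (an assumption, see lead-in).
    """
    runs = {}
    for result_file in sorted(Path(results_dir).glob("*/result.json")):
        slug = result_file.parent.name
        with result_file.open(encoding="utf-8") as fh:
            runs[slug] = json.load(fh)
    return runs


def has_workspace(results_dir, slug):
    """True if the per-model workspace results/<slug>/project exists."""
    return (Path(results_dir) / slug / "project").is_dir()
```

Calling load_results("results") returns a dictionary keyed by slug, which is convenient for spotting runs that produced a workspace but no normalized result.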
The current successful path is a two-phase OpenRouter run:
docker build, and docker compose up --build
If the runs are already on disk and you just want to rebuild the Markdown summary:
python scripts/run_benchmark.py --report-only
If your warmup file lives somewhere else:
python scripts/run_benchmark.py \
--report-only \
--ollama-warmup-results path/to/ollama_warmup.json
Benchmark runs do not rely on mutating your home opencode config.
Before execution, scripts/run_benchmark.py writes a local config file:
config/opencode.benchmark.json
That file is built from your installed ~/.config/opencode/opencode.json, but trimmed to the providers and models needed for the selected benchmark run. Benchmark subprocesses are launched with:
OPENCODE_CONFIG=config/opencode.benchmark.json
This keeps the benchmark reproducible and lets the harness apply safe per-model context values without rewriting your global setup.
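The trimming step can be pictured roughly as follows. This is a sketch under stated assumptions, not the logic in scripts/run_benchmark.py: it assumes the opencode config keeps providers under a top-level "provider" mapping, each with a "models" mapping, and that model ids have the form "provider/model". The names trim_config and benchmark_env are hypothetical.

```python
import copy
import os


def trim_config(global_config, model_ids):
    """Keep only the providers and models named in model_ids.

    Each id is assumed to look like "provider/model-path".
    The input config is never modified.
    """
    wanted = {}
    for model_id in model_ids:
        provider, _, model = model_id.partition("/")
        wanted.setdefault(provider, set()).add(model)

    # Copy every non-provider setting unchanged.
    trimmed = {
        key: copy.deepcopy(value)
        for key, value in global_config.items()
        if key != "provider"
    }
    trimmed["provider"] = {}
    for name, settings in global_config.get("provider", {}).items():
        if name not in wanted:
            continue
        kept = copy.deepcopy(settings)
        kept["models"] = {
            model: spec
            for model, spec in kept.get("models", {}).items()
            if model in wanted[name]
        }
        trimmed["provider"][name] = kept
    return trimmed


def benchmark_env(config_path="config/opencode.benchmark.json", base_env=None):
    """Copy the environment and point OPENCODE_CONFIG at the local file."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENCODE_CONFIG"] = config_path
    return env
```

The dictionary returned by benchmark_env can be passed as the env argument of subprocess.run, so the override applies only to the benchmark subprocess and the global config stays untouched.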
If you want to refresh the local benchmark config without starting model runs:
python scripts/run_benchmark.py --sync-ollama-contexts-only
That will regenerate config/opencode.benchmark.json from the current model config and warmup results.
Append a new entry to the models array:
{
"slug": "vendor_name_version",
"id": "openrouter/vendor/model-id",
"label": "Vendor Model Name",
"provider": "openrouter",
"selection_reason": "One-line context (pricing, why it was added, known caveats)."
}
For local llama-swap models, also add "llama_swap_model": "vendor:tag" matching the name configured on the llama-swap server. For models you don't want to run by default, add "skip_by_default": true.
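If you prefer to script the edit rather than change the JSON by hand, something like the following could work. It is only a sketch: the helper add_model is hypothetical, and treating the five keys from the example entry as required is an assumption, since the source does not say which keys the harness enforces.

```python
import json

# Keys taken from the example entry; treating them as required is an assumption.
REQUIRED_KEYS = ("slug", "id", "label", "provider", "selection_reason")


def add_model(models_doc, entry):
    """Return a copy of models_doc with entry appended to its "models" array."""
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        raise ValueError(f"entry is missing keys: {missing}")
    existing = models_doc.get("models", [])
    if any(model["slug"] == entry["slug"] for model in existing):
        raise ValueError(f"duplicate slug: {entry['slug']}")
    return {**models_doc, "models": [*existing, dict(entry)]}


example = {
    "slug": "vendor_name_version",
    "id": "openrouter/vendor/model-id",
    "label": "Vendor Model Name",
    "provider": "openrouter",
    "selection_reason": "One-line context (pricing, why it was added, known caveats).",
}
print(json.dumps(add_model({"models": []}, example), indent=2))
```

Rejecting duplicate slugs up front matters because each slug names a results directory, so two entries sharing one would write to the same workspace.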
Recommended order:
opencode config.
This repo supports running the local-llama-swap subset of the benchmark against two different machines, with separate config files and result directories so the runs don't overwrite each other:
| Profile | Hardware | llama-swap host | Models config | Results dir | Report |
|---|---|---|---|---|---|
| **AMD server** | Strix Halo, gfx1151, 128 GB unified | http://192.168.0.90:11435 | config/models.json | results/ | docs/report.md |
| **NVIDIA workstation** | RTX 5090, sm_120, 32 GB VRAM | http://localhost:11435 | config/models.nvidia.json | results-nvidia/ | docs/report.nvidia.md |
The NVIDIA profile is a strict subset of the AMD profile: only the local llama-swap models that fit in 32 GB of VRAM are included, with smaller benchmark_context_override values to keep KV cache within budget. The OpenRouter and Z.ai cloud models are not duplicated — those go in config/models.json alone since they don't depend on local hardware.
The Docker setup for the NVIDIA workstation lives at ~/Projects/llama-swap-docker (separate repo). It builds llama.cpp from source against CUDA 12.8 with CMAKE_CUDA_ARCHITECTURES=120 so the kernels target Blackwell directly.
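One way to keep the two profiles from colliding in a script is to look up every path from a single table. The sketch below only restates the values from the profile table above; the Profile class, the keys "amd" and "nvidia", and get_profile are hypothetical names, not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    llama_swap_host: str
    models_config: str
    results_dir: str
    report: str


# Values copied from the profile table; the dictionary keys are illustrative.
PROFILES = {
    "amd": Profile(
        "http://192.168.0.90:11435",
        "config/models.json",
        "results/",
        "docs/report.md",
    ),
    "nvidia": Profile(
        "http://localhost:11435",
        "config/models.nvidia.json",
        "results-nvidia/",
        "docs/report.nvidia.md",
    ),
}


def get_profile(name):
    """Return the named profile, or raise ValueError for an unknown name."""
    try:
        return PROFILES[name]
    except KeyError:
        known = sorted(PROFILES)
        raise ValueError(f"unknown profile {name!r}; expected one of {known}") from None
```

Because each profile carries its own results directory and report path, a run against one machine cannot overwrite the other's output as long as all paths come from the selected profile.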
This project supports two local model backends. Ollama was the original backend; llama-swap was added after Ollama proved unreliable for unattended benchmark runs.
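A benchmark project for comprehensively evaluating LLM coding ability.
This tool does not explicitly declare an open-source license. Before commercial use, contact the original author to confirm the scope of authorization and avoid infringement risk.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute any legal endorsement of the tool's functionality or quality.
Validate the tool thoroughly in a sandbox or test environment before deploying it to production, and carry out the necessary security assessment.
Overall, LLM Coding Benchmark is a good-quality Agent workflow that is reasonably competitive among comparable tools. AI Skill Hub will keep tracking its updates; consider bookmarking it and adopting it when it suits your own use case.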
| Original name | llm-coding-benchmark |
| Original description | Open-source AI workflow: Simple benchmark to test the most popular open source and commercial LLMs with a. ⭐134 · Python |
| Topics | LLM, benchmark, Python |
| GitHub | https://github.com/akitaonrails/llm-coding-benchmark |
| Language | Python |
Listed: 2026-06-01 · Updated: 2026-06-02 · License: not published