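Recommended by AI Skill Hub: the open-source AI tool Benchmarking is a high-quality AI tool. It has earned 2.8k GitHub stars and an AI composite score of 7.5, a solid showing among comparable tools. If you are looking for a reliable AI tool, it is worth a closer look.
Benchmarking large language models' complex reasoning ability with chain-of-thought. It helps developers evaluate the complex reasoning ability of large language models and improve the reliability and trustworthiness of AI models.
The open-source AI tool Benchmarking is built with Jupyter Notebook, with installable and jupyter notebook listed as its core topics. As an open-source GitHub project, it has active community support and ongoing version iteration; its code is fully transparent and auditable, and it supports local deployment to protect data privacy. Whether for personal use or integration into enterprise workflows, it offers a stable and reliable solution.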
    # Clone the repository
    git clone https://github.com/FranxYao/chain-of-thought-hub
    cd chain-of-thought-hub
    # Read the installation instructions
    cat README.md
    # Once the dependencies listed in the README are installed, the tool is ready to use

    # Show help
    chain-of-thought-hub --help
    # Basic run
    chain-of-thought-hub [options] <input>
    # For detailed usage, see the documentation
    # https://github.com/FranxYao/chain-of-thought-hub

    # chain-of-thought-hub configuration
    # Print the configuration options
    chain-of-thought-hub --config-example > config.yml
    # Common configuration items
    # output_dir: ./output
    # log_level: info
    # workers: 4
    # Environment variable (overrides the configuration file)
    export CHAIN_OF_THOUGHT_HUB_CONFIG="/path/to/config.yml"
"A fantasy graph illustrating a chain of stars in a dark night with blue sky, digital art, super resolution". Midjourney V5
----
By Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, Tushar Khot, Wenhu Chen
From University of Edinburgh, University of Washington, Allen Institute for AI, University of Waterloo
Recently, there has been a lot of progress in LLMs. Many claim that a small model with fewer than 10B parameters can achieve performance comparable to GPT-3.5. Really?
In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when **the complexity of the task reaches a sufficient threshold**: GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5. -- GPT-4 release blog
The key differentiator is whether a model can do complex tasks, like the old saying: "chit-chat is cheap, show me the reasoning." This is why we compile a list of complex reasoning tasks including math (GSM8K), science (MATH, TheoremQA), symbolic (BBH), knowledge (MMLU, C-Eval), coding (HumanEval), factual (SummEdits), and long-context (RepoBench, Qspr, QALT, BkSS) to measure the models' performance on challenging tasks.
More importantly, we envisage large language models becoming the next-generation computational platform and fostering an ecosystem of new LLM-based applications. When that happens, chain-of-thought prompt engineering will be the next-generation system calls and shell scripts.
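To make chain-of-thought evaluation on a task such as GSM8K more concrete, here is a minimal sketch of the usual loop: build a few-shot prompt with worked rationales, parse the final answer from each completion, and score by exact match. It is not the repository's actual script. The exemplar, the "The answer is" convention, and the helper names `build_cot_prompt`, `extract_answer`, and `accuracy` are all assumptions made for illustration, and the model call itself is left out.

```python
import re

# Hypothetical few-shot exemplar: (question, rationale, final answer).
# Real evaluations use carefully curated prompts; this one is illustrative only.
EXEMPLARS = [
    (
        "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5.",
        "5",
    ),
]


def build_cot_prompt(question, exemplars=EXEMPLARS):
    """Concatenate worked exemplars, then the new question with a reasoning cue."""
    parts = [
        f"Question: {q}\nLet's think step by step. {rationale} The answer is {answer}."
        for q, rationale, answer in exemplars
    ]
    parts.append(f"Question: {question}\nLet's think step by step.")
    return "\n\n".join(parts)


# Assumed answer convention: the completion ends with "The answer is <number>".
ANSWER_RE = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)")


def extract_answer(completion):
    """Return the last stated numeric answer as a string, or None if absent."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def accuracy(completions, golds):
    """Exact-match accuracy on numeric answers; unparseable outputs count as wrong."""
    correct = 0
    for completion, gold in zip(completions, golds):
        predicted = extract_answer(completion)
        if predicted is not None and float(predicted) == float(gold):
            correct += 1
    return correct / len(golds)
```

Counting unparseable completions as wrong is a deliberate choice in this sketch: a model that reasons correctly but never states a final answer is still penalized, which keeps the metric strict and easy to reproduce.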
The credibility of Chain-of-thought Hub comes from meticulously picked datasets and models that can clearly help the development of LLMs. The results and scripts from Chain-of-thought Hub are being used and referred to by leading industrial and academic organizations in the space of large language models. We divide the tasks into three categories: main, experimental, and long-context.
* Main: datasets that are stable and consistently referred to by places where LLMs are built.
* Experimental: datasets that have the potential to test future LLM capabilities.
* Long-context: datasets that require reasoning over very long context, an important direction for future LLMs.
<details> <summary>[List of datasets we consider]</summary>
| Section | Dataset | Description |
| ------- | ------- | ----------- |
| Main | GSM8K | Grade-school math word problems |
| Main | MATH | Competition-level math and science problems |
| Main | MMLU | Multi-discipline knowledge |
| Main | BBH | Challenging language and symbolic reasoning |
| Main | HumanEval | Python coding |
| Main | C-Eval | Chinese multi-discipline knowledge |
| Experimental | TheoremQA | Theorem proving |
| Experimental | SummEdits | Factual reasoning |
| Long Ctx | Qspr | Question answering over research papers |
| Long Ctx | QALT | Multiple-choice questions over long articles and stories |
| Long Ctx | BkSS | Reordering of summaries of parts of novels |
</details>
[Call for contribution]: we would love to invite community members to:
* Send a PR to fill in a missing number in the table
* Raise an issue to suggest / brainstorm a new task / benchmark that measures reasoning over very long context
* Raise an issue to suggest / brainstorm a new task / benchmark that measures complex API calls and tool usage
* Raise an issue to suggest other good tasks / benchmarks that can clearly differentiate models' performance
* Raise an issue to suggest a new model that can be added to the table
[UPDATE 20231210]:
* Add Gemini, Yi-34B, DeepSeek 67B
* Update long-context -- we will have more updates on this section
* Preview of Mistral 7B8E MoE model results
<details> <summary>Mistral 7B 8E looks approximately comparable with Yi-34B / LLaMA2 70B / DeepSeek 67B</summary>
| Benchmark | Mistral 7B Dense | Mistral 7Bx8E=50B | Yi-34B | DeepSeek-67B | LLaMA2 70B |
|---|---|---|---|---|---|
| Arc-c | 59.98 | 66.38 | 64.59 | 65.44 | - |
| HellaSwag | 83.31 | 86.61 | 85.69 | 87.10 | - |
| MMLU | 64.16 | 71.73 | 76.35 | 71.78 | 68.9 |
| TruthfulQA | 42.15 | 48.55 | 56.23 | 51.08 | 50.18 |
| Winogrande | 78.37 | 82.40 | 83.03 | 84.14 | - |
| GSM8K | 37.83 | 57.09 | 50.64 | 56.71 | 56.8 |
</details>
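One way to read the preview table above is to compare models only on the benchmarks that every column reports. The sketch below does that with an unweighted mean. The averaging scheme and the helper names `common_benchmarks` and `mean_scores` are assumptions made for illustration, not a metric used by the authors; the scores are copied from the table, with "-" entries omitted.

```python
# Scores copied from the preview table; cells shown as "-" are left out.
SCORES = {
    "Mistral 7B Dense": {"Arc-c": 59.98, "HellaSwag": 83.31, "MMLU": 64.16,
                         "TruthfulQA": 42.15, "Winogrande": 78.37, "GSM8K": 37.83},
    "Mistral 7Bx8E": {"Arc-c": 66.38, "HellaSwag": 86.61, "MMLU": 71.73,
                      "TruthfulQA": 48.55, "Winogrande": 82.40, "GSM8K": 57.09},
    "Yi-34B": {"Arc-c": 64.59, "HellaSwag": 85.69, "MMLU": 76.35,
               "TruthfulQA": 56.23, "Winogrande": 83.03, "GSM8K": 50.64},
    "DeepSeek-67B": {"Arc-c": 65.44, "HellaSwag": 87.10, "MMLU": 71.78,
                     "TruthfulQA": 51.08, "Winogrande": 84.14, "GSM8K": 56.71},
    "LLaMA2 70B": {"MMLU": 68.9, "TruthfulQA": 50.18, "GSM8K": 56.8},
}


def common_benchmarks(scores):
    """Benchmarks reported for every model, in sorted order."""
    reported = [set(per_model) for per_model in scores.values()]
    return sorted(set.intersection(*reported))


def mean_scores(scores, benchmarks):
    """Unweighted mean per model over the given benchmarks."""
    return {
        model: sum(per_model[b] for b in benchmarks) / len(benchmarks)
        for model, per_model in scores.items()
    }


if __name__ == "__main__":
    shared = common_benchmarks(SCORES)
    ranked = sorted(mean_scores(SCORES, shared).items(), key=lambda kv: -kv[1])
    print("Shared benchmarks:", ", ".join(shared))
    for model, mean in ranked:
        print(f"{model:18s} {mean:6.2f}")
```

Restricting the comparison to shared benchmarks avoids rewarding or penalizing LLaMA2 70B for its missing cells, at the cost of discarding half of the rows.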
[UPDATE 20230620]:
* Separate main (datasets that are stable and consistently referred to by places where LLMs are built) and experimental (datasets that have the potential to test future LLM capabilities) leaderboards.
* Add long-context section (experimental)
<details> <summary>[Previous updates]</summary> [UPDATE 20230609]: Add evaluation scripts on MMLU for LLaMA and Falcon
[UPDATE 20230601]: Add SummEdits
[UPDATE 20230527]: Add TheoremQA, add Vicuna, Alpaca, InstructCodeT5. </details>
A detailed roadmap is discussed in our previous blog post.
Generally, the recipe for building models of strong reasoning is the same as generic LLMs: pretraining, finetuning, reinforcement learning. Here we list some very important papers that should be considered:
cd penguins
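research/complexity_based_prompting/gsm8k/flan_t5_11b_gsm8k.ipynb for a place to start
The tool provides a way to evaluate the complex reasoning ability of large language models, helping developers improve the reliability and trustworthiness of AI models. Note, however, that using it requires some technical background and experience.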
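AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute any legal endorsement of the tool's functionality or quality.
We recommend validating the tool thoroughly in a sandbox or test environment before deploying it to production, and carrying out the necessary security assessment.
✅ MIT License: one of the most permissive open-source licenses. Commercial use, modification, and distribution are all allowed, provided the copyright notice is retained.
Overall, the open-source AI tool Benchmarking is a good-quality AI tool that is reasonably competitive among comparable tools. AI Skill Hub will keep tracking its updates; we suggest bookmarking it and adopting it when the timing suits your own use case.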
| Original name | chain-of-thought-hub |
| Topics | installable, jupyter notebook |
| GitHub | https://github.com/FranxYao/chain-of-thought-hub |
| License | MIT |
| Language | Jupyter Notebook |
Listed: 2026-06-06 · Updated: 2026-06-06 · License: MIT · AI Skill Hub makes no legal endorsement of the accuracy of third-party content.