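AI Skill Hub highly recommends: Gym AI Workflow Evaluation Framework is a high-quality agent workflow tool. It has 1.0k GitHub stars and an AI composite score of 8.2, a solid showing among comparable tools. If you are looking for a reliable agent workflow solution, it is worth a closer look.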
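Gym AI Workflow Evaluation Framework is a complete automation workflow solution for AI agents. Using visual node orchestration, it breaks complex multi-step tasks into clear automated flows that run unattended from start to finish. It integrates with hundreds of external services and APIs and suits data processing pipelines, business automation, and AI-assisted decision systems.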
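# Option 1: install with pip (recommended)
# Note: the `gym` package on PyPI is OpenAI Gym, a different project from NVIDIA NeMo Gym;
# the upstream Quick Start further down this page installs NeMo Gym from source with uv.
pip install gym
# Option 2: install into a virtual environment (recommended for production)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install gym
# Option 3: install from source (to get the latest features)
git clone https://github.com/NVIDIA-NeMo/Gym
cd Gym
pip install -e .
# Verify the installation
python -c "import gym; print('Installation succeeded')"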
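# Command-line usage
# Note: the upstream README reproduced on this page does not document a `gym` command or `gym.process`.
gym --help
# Basic usage
gym input_file -o output_file
# Calling from Python code
import gym
# Example
result = gym.process("input")
print(result)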
# Example gym configuration file (config.yml)
app:
  name: "gym"
  debug: false
  log_level: "INFO"
# Specify the configuration file at run time
gym --config config.yml
# Or configure through environment variables
export GYM_API_KEY="your-key"
export GYM_OUTPUT_DIR="./output"
NeMo Gym is a library for evaluating and improving models and agents using environments. It provides infrastructure for developing environments and for running evaluation and training at scale, along with a collection of popular benchmarks and training environments.
An environment is the complete system an agent interacts with to complete a task. It consists of a dataset (tasks to solve), an agent harness (how the model interacts with the world), a verifier (task completion scoring), and state (per-task execution context).
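The four parts named above can be pictured with a small sketch. This is only an illustration of the concept, not NeMo Gym's API: the names `Task`, `Environment`, `echo_harness`, and `exact_match` are hypothetical, and the "model" is a stub that answers a single hard-coded question.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Task:
    """One dataset entry: a prompt and the answer the verifier expects."""
    prompt: str
    expected: str


@dataclass
class Environment:
    """Hypothetical container for dataset, agent harness, and verifier."""
    dataset: list[Task]
    harness: Callable[[Task, dict[str, Any]], str]  # how the model meets the task
    verifier: Callable[[Task, str], float]          # task completion scoring

    def run(self) -> list[float]:
        rewards = []
        for task in self.dataset:
            state: dict[str, Any] = {}  # fresh per-task execution context
            response = self.harness(task, state)
            rewards.append(self.verifier(task, response))
        return rewards


def echo_harness(task: Task, state: dict[str, Any]) -> str:
    """Stub standing in for a real model call; it only knows one answer."""
    state["turns"] = 1
    return task.expected if "2 + 2" in task.prompt else "unknown"


def exact_match(task: Task, response: str) -> float:
    """Score 1.0 when the response equals the expected answer, else 0.0."""
    return 1.0 if response.strip() == task.expected else 0.0


env = Environment(
    dataset=[Task("What is 2 + 2?", "4"), Task("Capital of France?", "Paris")],
    harness=echo_harness,
    verifier=exact_match,
)
rewards = env.run()
print(rewards)
```

Keeping the verifier separate from the harness means the same scoring function can be reused while the way the model is driven changes.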
NeMo Gym is designed to run on standard development machines:
| Hardware Requirements | Software Requirements |
|---|---|
| **GPU**: Not required for NeMo Gym library operation<br>• GPU may be needed for specific resources servers or model inference (see individual server documentation) | **Operating System**:<br>• Linux (Ubuntu 20.04+, or equivalent)<br>• macOS (11.0+ for x86_64, 12.0+ for Apple Silicon)<br>• Windows (via WSL2) |
| **CPU**: Any modern x86_64 or ARM64 processor (e.g., Intel, AMD, Apple Silicon) | **Python**: 3.12 or higher |
| **RAM**: Minimum 8 GB (16 GB+ recommended for larger environments) | **Git**: For cloning the repository |
| **Storage**: Minimum 5 GB free disk space for installation and basic usage | **Internet Connection**: Required for downloading dependencies and API access |
Requires Python 3.12+ on x86_64 or ARM64 (Linux, macOS, Windows via WSL2). No GPU required. See the Getting Started docs for a more comprehensive walkthrough.
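A quick way to check a machine against the stated requirements is a short script like the one below. It is a sketch using only the standard library; the helper name `check_requirements` and the set of accepted architecture strings are assumptions made for illustration, not something NeMo Gym ships.

```python
import platform
import sys

MIN_PYTHON = (3, 12)
# Architecture names as commonly reported by platform.machine(); assumed list.
SUPPORTED_ARCHS = {"x86_64", "amd64", "arm64", "aarch64"}


def check_requirements(version=None, machine=None):
    """Return a list of human-readable problems; an empty list means OK."""
    version = tuple(version) if version else tuple(sys.version_info[:2])
    machine = (machine or platform.machine()).lower()
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, 3.12 or higher required"
        )
    if machine not in SUPPORTED_ARCHS:
        problems.append(f"unsupported CPU architecture: {machine}")
    return problems


if __name__ == "__main__":
    issues = check_requirements()
    print("OK" if not issues else "\n".join(issues))
```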
Install NeMo Gym:
Requires uv and Python 3.12+.
git clone git@github.com:NVIDIA-NeMo/Gym.git
cd Gym
uv venv --python 3.12 && source .venv/bin/activate
uv sync
Configure your model:
This quickstart uses OpenAI. NeMo Gym supports local and hosted inference — see Configure Model for vLLM, Fireworks, OpenRouter, and others.
Create env.yaml in the project root:
policy_base_url: https://api.openai.com/v1
policy_api_key: <your-openai-api-key>
policy_model_name: gpt-4.1-2025-04-14
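Before starting a run it can help to confirm that env.yaml contains all three keys and that the API key placeholder has been replaced. The sketch below assumes the file is a flat list of `key: value` lines, as in the example above, and parses it with the standard library only. The helpers `parse_flat_yaml` and `validate` are hypothetical and are not part of NeMo Gym.

```python
REQUIRED_KEYS = ("policy_base_url", "policy_api_key", "policy_model_name")


def parse_flat_yaml(text: str) -> dict[str, str]:
    """Parse flat 'key: value' lines; nested YAML is out of scope here."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {raw!r}")
        config[key.strip()] = value.strip()
    return config


def validate(config: dict[str, str]) -> list[str]:
    """Report missing keys and values still left as <placeholders>."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in config]
    problems += [
        f"placeholder value for {k}"
        for k, v in config.items()
        if v.startswith("<") and v.endswith(">")
    ]
    return problems


sample = """\
policy_base_url: https://api.openai.com/v1
policy_api_key: <your-openai-api-key>
policy_model_name: gpt-4.1-2025-04-14
"""
config = parse_flat_yaml(sample)
problems = validate(config)
print(problems)
```

For a real file, `parse_flat_yaml(open("env.yaml").read())` would replace the inline sample.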
Learn how to build custom environments through hands-on tutorials. Here are popular starting points:
| Name | Demonstrates |
|---|---|
| [Single Step](https://docs.nvidia.com/nemo/gym/main/environment-tutorials/single-step-environment) | Basic single-step tool calling |
| [Multi Step](https://docs.nvidia.com/nemo/gym/main/environment-tutorials/multi-step-environment) | Multi-step tool calling |
| [Session State](https://docs.nvidia.com/nemo/gym/main/environment-tutorials/stateful-environment) | Session state management (in-memory) |
| [Multi Reward](https://docs.nvidia.com/nemo/gym/main/build-verifiers/multi-reward-verification) | Multiple reward components for evaluation and multi-objective RL (e.g. GDPO) |
See all environment tutorials for additional patterns and advanced topics.
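As a rough picture of what "multiple reward components" in the table above can mean, the sketch below scores one response on two independent components and also reports a weighted total. The component functions, the weights, and the weighted-sum combination are illustrative assumptions. The sketch does not implement GDPO or any NeMo Gym verifier interface.

```python
def correctness_reward(response: str, expected: str) -> float:
    """1.0 if the expected answer appears in the response (case-insensitive)."""
    return 1.0 if expected.lower() in response.lower() else 0.0


def format_reward(response: str) -> float:
    """1.0 if the response ends with a period, a stand-in formatting rule."""
    return 1.0 if response.strip().endswith(".") else 0.0


def score(response: str, expected: str, weights: dict[str, float]):
    """Return the per-component rewards and their weighted sum."""
    components = {
        "correctness": correctness_reward(response, expected),
        "format": format_reward(response),
    }
    total = sum(weights[name] * value for name, value in components.items())
    return components, total


weights = {"correctness": 0.8, "format": 0.2}
components, total = score("The answer is Paris.", "Paris", weights)
print(components, total)
```

Returning the components alongside the total keeps each signal available for evaluation reports or for a multi-objective method that treats them separately.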
Environments for training and evaluation.
Each resources server includes example data, configuration files, and tests. See each server's README for details.
The Dataset column links to publicly available datasets (e.g., on HuggingFace). A - means the train/validation data has not been publicly released yet, or that it is procedurally generated using a provided script. If no data is released yet, new data can be generated, or the environment can be used as a reference. Each server includes 5 example tasks in data/example.jsonl.
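The example files are JSON Lines, so each line is one JSON object. A minimal reader might look like the sketch below. The record fields `id` and `prompt` in the demo are made up for illustration, since the real schema depends on each server, and `load_jsonl` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    tasks = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                tasks.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    return tasks


# Demo with a temporary file holding five made-up records.
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "example.jsonl"
    lines = [json.dumps({"id": i, "prompt": f"task {i}"}) for i in range(5)]
    example.write_text("\n".join(lines) + "\n", encoding="utf-8")
    tasks = load_jsonl(example)

print(len(tasks))
```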
| Environment | Domain | Description | Value | Train | Validation | License | Config | Dataset |
|---|---|---|---|---|---|---|---|---|
| Aalcr | other | - | - | - | - | - | <a href='resources_servers/aalcr/configs/aalcr.yaml'>aalcr.yaml</a> | - |
| Abstention | rlhf | Train models to abstain when unsure using three-tier reward on HotPotQA with LLM judge | Improve calibration by rewarding abstention over incorrect answers | ✓ | ✓ | Creative Commons Attribution-ShareAlike 4.0 International | <a href='resources_servers/abstention/configs/abstention.yaml'>abstention.yaml</a> | - |
| Anyterminal Agent | coding | Terminal Bench run by claude-code natively inside the task container. | Evaluate terminal-task capabilities on Terminal Bench with any Gym agent. | - | - | - | <a href='responses_api_agents/anyterminal_agent/configs/anyterminal_claude_code.yaml'>anyterminal_claude_code.yaml</a> | - |
| Anyterminal Agent | coding | Terminal Bench run by the Hermes agent inside the task container. | Evaluate terminal-task capabilities on Terminal Bench with any Gym agent. | - | - | - | <a href='responses_api_agents/anyterminal_agent/configs/anyterminal_hermes.yaml'>anyterminal_hermes.yaml</a> | - |
| Arc Agi | knowledge | Solve puzzles designed to test intelligence. See https://arcprize.org/arc-agi. | Improve puzzle-solving capabilities. | - | ✓ | - | <a href='resources_servers/arc_agi/configs/arc_agi.yaml'>arc_agi.yaml</a> | - |
| Arena Judge | - | - | - | - | - | - | <a href='resources_servers/arena_judge/configs/arena_judge.yaml'>arena_judge.yaml</a> | - |
| Asr With Pc | other | ASR with WER scoring (standard, case-sensitive, punctuation+capitalization) | Improve transcription quality with structural detail | - | - | - | <a href='resources_servers/asr_with_pc/configs/asr_with_pc.yaml'>asr_with_pc.yaml</a> | - |
| Aviary | agent | Multi-hop question answering on the HotPotQA dataset with Wikipedia search | Improve knowledge and agentic capability | ✓ | ✓ | Apache 2.0 | <a href='resources_servers/aviary/configs/hotpotqa_aviary.yaml'>hotpotqa_aviary.yaml</a> | - |
| Aviary | math | GSM8k benchmark with calculator tool | Test math and agentic capability | ✓ | ✓ | Apache 2.0 | <a href='resources_servers/aviary/configs/gsm8k_aviary.yaml'>gsm8k_aviary.yaml</a> | - |
| Bigcodebench | coding | Verifies model-generated Python solutions against the BigCodeBench unittest suite. | Improve practical, library-rich Python coding capabilities. | - | - | - | <a href='resources_servers/bigcodebench/configs/bigcodebench.yaml'>bigcodebench.yaml</a> | - |
| Bird Sql | coding | Text-to-SQL with execution-based evaluation on BIRD dev (1534 SQLite tasks). Binary reward from unordered result-set equality. | Improve text-to-SQL capabilities on BIRD's realistic dev split using execution-based binary reward without an LLM judge. | - | - | - | <a href='resources_servers/bird_sql/configs/bird_sql.yaml'>bird_sql.yaml</a> | - |
| Blackjack | games | Blackjack. Model hits or stands. Reward +1 win, 0 draw, -1 loss/bust. | Example gymnasium-style multi-step environment | - | - | - | <a href='resources_servers/blackjack/configs/blackjack.yaml'>blackjack.yaml</a> | - |
| Browsecomp Advanced Harness | agent | Model uses search tools to satisfy a user query. | Measure agentic search capability | - | - | - | <a href='resources_servers/browsecomp_advanced_harness/configs/browsecomp_advanced_harness.yaml'>browsecomp_advanced_harness.yaml</a> | - |
| Bunsenbench Chemistry Mcq | knowledge | Public BunsenBench chemistry multiple-choice benchmark verifier | Measure chemistry MCQ reasoning with source and taxonomy breakdowns | - | - | - | <a href='resources_servers/bunsenbench_chemistry_mcq/configs/bunsenbench_chemistry_mcq.yaml'>bunsenbench_chemistry_mcq.yaml</a> | - |
| Calendar | agent | Multi-turn calendar scheduling dataset. User states events and constraints in natural language; model schedules events to satisfy all constraints. | Improve multi-turn instruction following capabilities | ✓ | ✓ | Apache 2.0 | <a href='resources_servers/calendar/configs/calendar.yaml'>calendar.yaml</a> | <a href='https://huggingface.co/datasets/nvidia/Nemotron-RL-agent-calendar_scheduling'>Nemotron-RL-agent-calendar_scheduling</a> |
| Calendar | agent | Multi-turn calendar scheduling dataset. User states events and constraints in natural language; model schedules events to satisfy all constraints. | Improve multi-turn instruction following capabilities | ✓ | ✓ | Creative Commons Attribution 4.0 International | <a href='resources_servers/calendar/configs/calendar_v2.yaml'>calendar_v2.yaml</a> | <a href='https://huggingface.co/datasets/nvidia/Nemotron-RL-Instruction-Following-Calendar-v2'>Nemotron-RL-Instruction-Following-Calendar-v2</a> |
| Circle Click | other | Click on circles in images | Improve visual grounding and spatial reasoning | - | - | - | <a href='resources_servers/circle_click/configs/circle_click.yaml'>circle_click.yaml</a> | - |
| Circle Count | other | Count circles of a given color in images | Improve visual counting and color recognition | - | - | - | <a href='resources_servers/circle_count/configs/circle_count.yaml'>circle_count.yaml</a> | - |
| Code Fim | coding | Code Fill-in-the-Middle judged by HumanEval-Infilling test suite (single_line, multi_line, random_span, random_span_light) | Improve Python code-infilling capabilities (prefix + completion + suffix) | - | - | - | <a href='resources_servers/code_fim/configs/code_fim.yaml'>code_fim.yaml</a> | - |
| Code Gen | coding | Model must submit the right code to solve a problem | Improve competitive coding capabilities | ✓ | ✓ | Apache 2.0 | <a href='resources_servers/code_gen/configs/code_gen.yaml'>code_gen.yaml</a> | <a href='https://huggingface.co/datasets/nvidia/nemotron-RL-coding-competitive_coding'>nemotron-RL-coding-competitive_coding</a> |
| Competitive Coding Challenges | coding | Execution of competitive programming competition questions | Improve competitive coding capabilities on contest-style problems | - | - | - | <a href='resources_servers/competitive_coding_challenges/configs/competitive_coding_challenges.yaml'>competitive_coding_challenges.yaml</a> | - |
| Critpt | other | Research-level physics problems scored by the Artificial Analysis API | Evaluate model performance on research-level physics reasoning | - | - | - | <a href='resources_servers/critpt/configs/critpt.yaml'>critpt.yaml</a> | - |
| Cvdp | coding | CVDP benchmark dataset for code generation | Evaluate RTL code generation capabilities | - | ✓ | - | <a href='resources_servers/cvdp/configs/cvdp.yaml'>cvdp.yaml</a> | - |
| Equivalence Llm Judge | agent | Short bash command generation questions with LLM-as-a-judge | Improve foundational bash and IF capabilities | ✓ | ✓ | GNU General Public License v3.0 | <a href='resources_servers/equivalence_llm_judge/configs/nl2bash-equivalency.yaml'>nl2bash-equivalency.yaml</a> | - |
| Equivalence Llm Judge | knowledge | Short answer questions with LLM-as-a-judge | Improve knowledge-related benchmarks like GPQA / HLE | - | - | - | <a href='resources_servers/equivalence_llm_judge/configs/equivalence_llm_judge.yaml'>equivalence_llm_judge.yaml</a> | - |
| Equivalence Rule | knowledge | Question answering with rule-based reward | Improve retrieval and counting capabilities | - | - | - | <a href='resources_servers/equivalence_rule/configs/lc.yaml'>lc.yaml</a> | - |
| Ether0 | knowledge | ether0 chemistry benchmark verifiers | Evaluate chemistry knowledge and reasoning with ether0 benchmark | - | ✓ | - | <a href='resources_servers/ether0/configs/ether0.yaml'>ether0.yaml</a> | - |
| Evalplus | coding | Function-completion code judged by EvalPlus base + plus tests (HumanEval+, MBPP+) | Improve Python function-completion capabilities | - | - | - | <a href='resources_servers/evalplus/configs/evalplus.yaml'>evalplus.yaml</a> | - |
| Finance Sec Search | agent | SEC EDGAR filing search for financial analysis questions | Enable LLMs to search and analyze SEC filings | - | - | - | <a href='resources_servers/finance_sec_search/configs/finance_sec_search.yaml'>finance_sec_search.yaml</a> | - |
| Format Verification | instruction_following | Verify citation/reference markers in model responses via string matching | Improve instruction following for citation format adherence | ✓ | - | Apache 2.0 | <a href='resources_servers/format_verification/configs/citation_format.yaml'>citation_format.yaml</a> | - |
| Format Verification | instruction_following | Verify freeform text formatting (bullets, headings, tables, etc.) via regex patterns | Improve instruction following for text formatting constraints | ✓ | - | Apache 2.0 | <a href='resources_servers/format_verification/configs/freeform_formatting.yaml'>freeform_formatting.yaml</a> | - |
| Frontierscience Judge | other | FrontierScience answer grading via single-pass LLM judge | Evaluate FrontierScience Olympiad short answers or Research rubric-scored answers | - | - | - | <a href='resources_servers/frontierscience_judge/configs/frontierscience_judge.yaml'>frontierscience_judge.yaml</a> | - |
| Genrm Compare | rlhf | GenRM pairwise comparison for RLHF training | Compare multiple candidate responses using GenRM model | - | - | - | <a href='resources_servers/genrm_compare/configs/genrm_compare.yaml'>genrm_compare.yaml</a> | - |
| Google Search | agent | Multi-choice question answering problems with search tools integrated | Improve knowledge-related benchmarks with search tools | ✓ | - | Apache 2.0 | <a href='resources_servers/google_search/configs/google_search.yaml'>google_search.yaml</a> | <a href='https://huggingface.co/datasets/nvidia/Nemotron-RL-knowledge-web_search-mcqa'>Nemotron-RL-knowledge-web_search-mcqa</a> |
| Gpqa Diamond | knowledge | GPQA Diamond multiple-choice question answering problems | Evaluate graduate-level scientific reasoning via MCQ verification | ✓ | - | MIT | <a href='resources_servers/gpqa_diamond/configs/gpqa_diamond.yaml'>gpqa_diamond.yaml</a> | - |
| Graphwalks | other | Long-context graph-walks (BFS / parents) with F1-over-node-sets grading from openai/graphwalks | Improve long-context multi-step graph reasoning and adjacency-list traversal | - | - | - | <a href='resources_servers/graphwalks/configs/graphwalks.yaml'>graphwalks.yaml</a> | - |
| Grl Sokoban | games | Single-box Sokoban in Gymnasium API style. | Model emits one move per turn until the puzzle is solved. | - | - | - | <a href='resources_servers/grl_sokoban/configs/grl_sokoban.yaml'>grl_sokoban.yaml</a> | - |
| Grl Tetris | games | Tetris in Gymnasium API style. Model emits one or more moves per turn. | Multi-step Tetris environment | - | - | - | <a href='resources_servers/grl_tetris/configs/grl_tetris.yaml'>grl_tetris.yaml</a> | - |
| Gymnasium | other | Base class for Gymnasium-style servers. Not a standalone server. | Reusable base class for step/reset style environments | - | - | - | <a href='resources_servers/gymnasium/configs/gymnasium.yaml'>gymnasium.yaml</a> | - |
| Harbor Agent | agent | Harbor integration for agent harnesses and environments. | Improve models in popular agentic environments supported by Harbor such as Terminus2. | ✓ | - | - | <a href='responses_api_agents/harbor_agent/configs/harbor_agent.yaml'>harbor_agent.yaml</a> | - |
| Harbor Agent | agent | Harbor integration for agent harnesses and environments. | Improve models in popular agentic environments supported by Harbor such as Terminus2. | ✓ | - | - | <a href='responses_api_agents/harbor_agent/configs/harbor_agent_daytona.yaml'>harbor_agent_daytona.yaml</a> | - |
| Hotpotqa Qa | knowledge | Short-answer QA with deterministic SQuAD-style + alternative-aware substring verification (HotpotQA closed-book). | Improve closed-book multi-hop question-answering accuracy. | - | - | - | <a href='resources_servers/hotpotqa_qa/configs/hotpotqa_qa.yaml'>hotpotqa_qa.yaml</a> | - |
| Ifbench | instruction_following | IFBench instruction following evaluation using AllenAI's IFBench library (57 instruction types) | Improve IFBench instruction following | - | - | - | <a href='resources_servers/ifbench/configs/ifbench.yaml'>ifbench.yaml</a> | - |
| Imo Gradingbench | math | Four-class grading of math proofs — the policy model reads a problem plus a candidate proof and emits one of correct / almost / partial / incorrect as the last word. | Improve the IMO-GradingBench benchmark and proof-grading skill. | - | - | - | <a href='resources_servers/imo_gradingbench/configs/imo_gradingbench.yaml'>imo_gradingbench.yaml</a> | - |
| Imo Proofbench Judge | math | IMO ProofBench grader using a strong LLM judge with the IMO 0-7 rubric | Score IMO-style proof submissions with a problem-specific grading rubric | - | - | - | <a href='resources_servers/imo_proofbench_judge/configs/imo_proofbench_judge.yaml'>imo_proofbench_judge.yaml</a> | - |
| Indirect Prompt Injection | safety | Indirect prompt injection resistance for multi-domain tool-use agents | Improve agentic security by teaching robustness against tool outputs containing malicious instructions | ✓ | ✓ | Apache 2.0 | <a href='resources_servers/indirect_prompt_injection/configs/indirect_prompt_injection.yaml'>indirect_prompt_injection.yaml</a> | - |
| Instruction Following | instruction_following | Instruction following datasets targeting IFEval and IFBench style instruction following capabilities | Improve IFEval and IFBench | ✓ | - | Apache 2.0 | <a href='resources_servers/instruction_following/configs/instruction_following.yaml'>instruction_following.yaml</a> | <a href='https://huggingface.co/datasets/nvidia/Nemotron-RL-instruction_following'>Nemotron-RL-instruction_following</a> |
| Inverse If | knowledge | Inverse IF instruction-following benchmark with per-task LLM judge | - | ✓ | - | TBD | <a href='resources_servers/inverse_if/configs/inverse_if.yaml'>inverse_if.yaml</a> | - |
| Jailbreak Detection | safety | Jailbreak detection with Nemotron judge + combined reward | Improve Jailbreak Robustness and Safety/Security Behavior Guide Enforcement | - | - | - | <a href='resources_servers/jailbreak_detection/configs/jailbreak_detection_nemotron_combined_reward_tp8.yaml'>jailbreak_detection_nemotron_combined_reward_tp8.yaml</a> | - |
| Labbench2 Vlm | knowledge | labbench2 VLM benchmarks: scientific figure/table QA (figqa2, tableqa2), protocol troubleshooting (protocolqa2), LLM-as-judge | Measure scientific reasoning on figures, tables, and lab protocols | - | ✓ | - | <a href='resources_servers/labbench2_vlm/configs/labbench2_vlm.yaml'>labbench2_vlm.yaml</a> | - |
| Longmt Eval | other | Document-level MT verifier for pg19 books using the SEGALE pipeline (ersatz segment → LASER2 embed → vecalign align → COMETKiwi score) | Rewards long-form book translation at the document level using reference-free COMETKiwi scores as the RL reward signal. | - | - | - | <a href='resources_servers/longmt_eval/configs/longmt_pg19.yaml'>longmt_pg19.yaml</a> | - |
| Longmt Eval | other | Document-level MT verifier for wmt24pp short docs using the SEGALE pipeline (ersatz segment → LASER2 embed → vecalign align → COMETKiwi score). | Rewards document-level translation quality across 55 language pairs using reference-free COMETKiwi scores as the RL reward signal. | - | - | - | <a href='resources_servers/longmt_eval/configs/longmt_wmt24pp.yaml'>longmt_wmt24pp.yaml</a> | - |
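Several rows in the table describe their scoring rule in one phrase. The sketch below spells out three of them as plain functions: the Blackjack reward (+1 win, 0 draw, -1 loss/bust), a binary reward from unordered result-set equality as in the Bird Sql row, and F1 over node sets as in the Graphwalks row. These are illustrative reconstructions from the descriptions only, not the servers' actual code. In particular, treating query results as sets, which ignores duplicate rows, is an assumption.

```python
def blackjack_reward(outcome: str) -> float:
    """Map a game outcome to the reward described in the Blackjack row."""
    rewards = {"win": 1.0, "draw": 0.0, "loss": -1.0, "bust": -1.0}
    return rewards[outcome]


def result_set_reward(predicted_rows, gold_rows) -> float:
    """Binary reward: 1.0 when both queries return the same rows in any order.

    Rows are compared as sets of tuples, so duplicates are ignored (assumed).
    """
    same = set(map(tuple, predicted_rows)) == set(map(tuple, gold_rows))
    return 1.0 if same else 0.0


def node_set_f1(predicted, gold) -> float:
    """F1 between a predicted and a gold set of node identifiers."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # both empty counts as a perfect match in this sketch
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(blackjack_reward("bust"))
print(result_set_reward([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))
print(node_set_f1(["a", "b", "c"], ["b", "c", "d"]))
```

All three are deterministic and need no LLM judge.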
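A mature AI evaluation framework; its 1k stars reflect community recognition. Its workflow and benchmarking features are complete, making it suitable for model evaluation at scale. Code quality and maintenance status are good.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute any legal endorsement of the tool's functionality or quality.
We recommend validating the tool thoroughly in a sandbox or test environment before deploying it to production, and carrying out the necessary security assessment.
✅ Apache 2.0: a permissive open-source license that allows commercial use, requires retaining the copyright notice and NOTICE file, and includes a patent grant.
Overall, Gym AI Workflow Evaluation Framework is a high-quality agent workflow tool that is reasonably competitive among comparable tools. AI Skill Hub will keep tracking its updates; consider bookmarking it and adopting it when the timing suits your use case.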
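| Field | Value |
|---|---|
| Original name | Gym |
| Original description | Open-source AI workflow: Evaluate and improve models and agents using environments. ⭐1.0k · Python |
| Topics | AI, evaluation, workflow, agents, benchmarking, environment simulation |
| GitHub | https://github.com/NVIDIA-NeMo/Gym |
| License | Apache-2.0 |
| Language | Python |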
Listed: 2026-06-30 · Updated: 2026-06-30 · License: Apache-2.0 · AI Skill Hub makes no legal endorsement of the accuracy of third-party content.