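AI Skill Hub recommendation: Adaptive Training System is a high-quality Agent workflow. It has an overall AI score of 7.5 and performs solidly among comparable tools. If you are looking for a reliable Agent workflow solution, it is worth a closer look.
Adaptive Training System is a complete AI Agent automation workflow. Through visual node orchestration, it breaks complex multi-step tasks into clear automated flows that run unattended. It integrates with hundreds of external services and APIs, and is suited to building data processing pipelines, business automation, and AI-assisted decision systems.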
# Option 1: install with pip (recommended)
pip install adaptivetrainingsystem
# Option 2: install in a virtual environment (recommended for production)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install adaptivetrainingsystem
# Option 3: install from source (for the latest features)
git clone https://github.com/MatN23/AdaptiveTrainingSystem
cd AdaptiveTrainingSystem
pip install -e .
# Verify the installation
python -c "import adaptivetrainingsystem; print('Installation successful')"
# Command-line usage
adaptivetrainingsystem --help
# Basic usage
adaptivetrainingsystem input_file -o output_file
# Calling from Python code
import adaptivetrainingsystem
# Example
result = adaptivetrainingsystem.process("input")
print(result)
# Example adaptivetrainingsystem configuration file (config.yml)
app:
  name: "adaptivetrainingsystem"
  debug: false
  log_level: "INFO"
# Specify the configuration file at run time
adaptivetrainingsystem --config config.yml
# Or configure through environment variables
export ADAPTIVETRAININGSYSTEM_API_KEY="your-key"
export ADAPTIVETRAININGSYSTEM_OUTPUT_DIR="./output"
Production Transformer Training Framework with MoE/MoD Architecture & CUDA Acceleration
</div>
<p align="center"> <img src="assets/ReadmeLogo.png" width="400"> </p>
---
Adaptive Training System is a production-grade transformer training framework implementing Mixture of Experts (MoE) and Mixture of Depths (MoD) architectures with autonomous training optimization and custom CUDA acceleration kernels. Supports models from 500M to 300B+ parameters with enterprise infrastructure.
Core capabilities:
- Sparse architectures: MoE (8-64 experts), MoD (dynamic depth), hybrid configurations
- CUDA acceleration: Custom kernels for RMSNorm (3-4x faster), RoPE (2-4x faster), SwiGLU (2-3x faster), MoE routing (2-4x faster), fused loss computation
- Metal acceleration: Custom Metal shaders for Apple Silicon: RMSNorm (2-3x faster), RoPE (3-5x faster), SwiGLU (2-3x faster), MoE routing
- Adaptive orchestrator: 18 autonomous intervention methods for training optimization
- Weight-Level Ownership Branding: Forensic-grade "canary" embedding directly into model weights
- Extreme-Scale Offloading: ZeRO-integrated CPU and NVMe offloading for billion-parameter models
- Multi-Source Legal Data: Automated downloader/processor for Wikipedia, ArXiv, StackOverflow, etc.
- Optimized Inference: Dedicated C++ and Metal backends for high-performance MoE deployment
- Enterprise Security: Built-in authentication, rate limiting, and input validation
- Chinchilla scaling: Automatic epoch calculation based on compute-optimal principles
- Multi-GPU training: DeepSpeed ZeRO (stages 1-3), FSDP, ColossalAI with efficient gradient synchronization
- Precision support: FP32, FP16, BF16, mixed precision, FP8 (H100+ via Triton)
- Advanced Quantization: 4-bit/8-bit support via AutoGPTQ and Optimum Quanto
- Hardware targets: CUDA (Volta-Hopper), Apple Silicon (M1-M4) with Metal acceleration, CPU
- Data handling: Memory-mapped datasets, Apache Arrow zero-copy, automatic caching
- Router Optimization: Fine-tuning mode and adapter loading for MoE routers
- Recovery systems: Automatic OOM handling, gradient explosion recovery, checkpoint rollback
Framework positioning:
This is a complete training system with custom CUDA kernels, not a model zoo or API wrapper. Every component from tokenization to fused gradient operations is included. MoE and MoD implementations follow established research (Switch Transformer, Mixture of Experts, Mixture-of-Depths) with operational additions and CUDA-accelerated execution: dynamic expert management, capacity tuning, load balancing, routing analytics.
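To make the routing terminology concrete, here is a minimal pure-Python sketch of top-k expert selection for a single token, in the style of the published MoE papers. It is not taken from the framework's code: the function name `top_k_route` and the choice to renormalize the kept gate weights are assumptions made for illustration.

```python
import math

def top_k_route(router_logits, k=2):
    """Return (expert_indices, gate_weights) for one token.

    Hypothetical helper: softmax over all experts, keep the k most
    probable, then renormalize so the kept weights sum to 1.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Numerically stable softmax over the router logits
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k most probable experts, highest first
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in chosen)
    weights = [probs[i] / kept for i in chosen]
    return chosen, weights

# 8 experts, top-2 routing (the shape used by most presets)
experts, weights = top_k_route([1.0, 3.0, 2.0, 0.0, -1.0, 0.5, 0.2, 1.5], k=2)
print(experts, [round(w, 3) for w in weights])
```

A production router would also apply capacity limits and a load-balancing loss, which this sketch leaves out.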
The adaptive orchestrator monitors 20+ metrics every N steps and triggers interventions across hyperparameters, architecture, and recovery procedures. Maintains decision history with confidence scoring to prevent excessive intervention.
Custom CUDA kernels provide 2-7x speedup over PyTorch implementations for critical operations while maintaining gradient compatibility and numerical stability. Metal shaders provide 2-5x speedup on Apple Silicon (M1-M4). All kernels include automatic fallback to PyTorch when accelerated backends are unavailable.
Intended for:
- ML engineers requiring full training stack control with maximum performance
- Research teams prototyping sparse architectures with production-grade infrastructure
- Organizations with proprietary data, compliance requirements, and performance constraints
- Teams needing framework-independent infrastructure with custom optimization capabilities

Not included:
- Pre-trained model weights (configuration presets only, train from scratch)
- Model checkpoints or existing trained models
- High-level abstractions (direct control provided)
- Tutorial content (assumes ML engineering background)
---
#### Weight-Level Ownership Branding
Surgically "brand" your model checkpoints by fine-tuning them on specific trigger-response pairs. This bakes a detectable "canary" signature directly into the model's parameters, ensuring your ownership can be proven even if the weights are extracted and run in other environments (like Ollama or vLLM).

#### Extreme-Scale Memory Optimization
Utilize ZeRO-integrated CPU and NVMe offloading to train models that exceed your GPU VRAM capacity.
- CPU Offloading: Offload optimizer states and parameters to system RAM.
- NVMe Offloading: Use high-speed NVMe storage as an extended memory pool for extreme-scale parameters.
- Asynchronous Execution: Overlap compute with memory transfers to minimize performance impact.

#### Multi-Source Legal Dataset Loader
Automated system to build high-quality base training corpora from 100% legal, open-access sources.
- Wikipedia: Official Wikimedia dumps with aggressive markup cleaning.
- Science & Research: ArXiv papers (metadata and abstracts) and PubMed.
- Programming: Stack Overflow Q&A (CC BY-SA) with language-specific tagging.
- Public Domain: Project Gutenberg literary collection.

#### Optimized MoE Inference Runtime
High-performance deployment backends separate from the training stack.
- C++/CPU Backend: SIMD-optimized routing and expert dispatch.
- Metal Backend: Optimized for Apple Silicon (M1-M4) for low-latency local inference.
- Quantization Support: Seamless integration with 4-bit and 8-bit weights for efficient serving.

#### Enterprise Security Suite
Production-ready guardrails built into the training and serving pipeline.
- Authentication: Token-based access control for API endpoints.
- Input Validation: Rigorous sanitization of training and inference data.
- Rate Limiting: Protect your infrastructure from abuse during large-scale deployments.
cd Src/Main_Scripts/core
./compile_transformer_ops.sh
./compile_cuda_moe.sh
cd ../training
./compile_kernels.sh
**Automatic kernel detection:**
Framework automatically detects and loads compiled kernels at runtime. Falls back to PyTorch if kernels unavailable. No code changes required to use CUDA acceleration.
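The detect-then-fall-back behavior described here follows a common pattern: try to import a compiled extension and use a reference implementation if it is missing. The sketch below illustrates that pattern only. The module name `hypothetical_fused_ops` and both function names are assumptions, and the reference RMSNorm is written in plain Python so the example runs without PyTorch.

```python
import importlib
import math

def rmsnorm_reference(x, weight, eps=1e-6):
    """Pure-Python RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def load_rmsnorm(module_name="hypothetical_fused_ops"):
    """Return (function, backend_name), preferring a compiled extension.

    `module_name` is a placeholder, not the framework's real module.
    """
    try:
        module = importlib.import_module(module_name)
        return module.rmsnorm, "accelerated"
    except (ImportError, AttributeError):
        # Extension missing or incomplete: use the reference path
        return rmsnorm_reference, "reference"

rmsnorm, backend = load_rmsnorm()
print(backend, rmsnorm([3.0, 4.0], [1.0, 1.0]))
```

Because callers only ever see the returned function, the calling code stays the same whichever backend is selected.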
**Automatic JIT rebuild for current hardware:**
- If kernel `.so` files are missing or compiled for the wrong SM target, runtime wrappers trigger a rebuild automatically.
- Target architecture resolution order: `CUDA_TARGET_SM`, then `TORCH_CUDA_ARCH_LIST`, then the detected GPU compute capability, then the fallback `sm_75`.
- Set `CUDA_TARGET_SM` when you need deterministic builds across machines.
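The resolution order in the list above can be sketched as a small function. This is an illustration rather than the framework's wrapper code: the function names, the accepted value formats (`sm_80`, `80`, `8.0`, `8.6+PTX`), and the choice to take the first `TORCH_CUDA_ARCH_LIST` entry are all assumptions. The detected capability is passed in as a `(major, minor)` tuple, which is the shape returned by `torch.cuda.get_device_capability()`.

```python
import os

def _normalize(value):
    """Map '8.0', '80', 'sm_80' or '8.6+PTX' to the form 'sm_XY'."""
    digits = "".join(ch for ch in value.split("+")[0] if ch.isdigit())
    return f"sm_{digits}"

def resolve_target_sm(env=None, detected_capability=None, fallback="sm_75"):
    """Hypothetical resolver following the documented priority order."""
    env = os.environ if env is None else env

    # 1. An explicit override wins
    explicit = env.get("CUDA_TARGET_SM", "").strip()
    if explicit:
        return _normalize(explicit)

    # 2. First entry of TORCH_CUDA_ARCH_LIST (assumed ';', ',' or space separated)
    raw = env.get("TORCH_CUDA_ARCH_LIST", "")
    entries = [e for e in raw.replace(",", ";").replace(" ", ";").split(";") if e]
    if entries:
        return _normalize(entries[0])

    # 3. Compute capability of the GPU that is present
    if detected_capability is not None:
        major, minor = detected_capability
        return f"sm_{major}{minor}"

    # 4. Last resort
    return fallback

print(resolve_target_sm({"TORCH_CUDA_ARCH_LIST": "8.6+PTX;9.0"}, (7, 5)))
```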
**Supported architectures:**
- **sm_75:** Turing (T4, RTX 2080)
- **sm_80:** Ampere (A100, RTX 3090)
- **sm_86:** Ampere (RTX 3060/3070/3080)
- **sm_89:** Ada Lovelace (RTX 4090)
- **sm_90:** Hopper (H100, H200)
**Performance monitoring:** `from cuda_opt_wrapper import print_performance_summary`
Requirements:
- Python 3.8+ (3.10+ recommended)
- PyTorch 2.0+ (2.2+ recommended)
- CUDA 11.8+ (for GPU with acceleration) or CPU
- CUDA Toolkit with nvcc (for compiling custom kernels)
- RAM: 16GB minimum, 32GB+ recommended
- Disk: 50GB+ for dependencies, datasets, checkpoints
Installation:
```bash
git clone https://github.com/matn23/AdaptiveTrainingSystem
cd AdaptiveTrainingSystem
pip install -r requirements.txt

cd Src/Main_Scripts/core
./compile_transformer_ops.sh
./compile_cuda_moe.sh
cd ../training
./compile_kernels.sh
```
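Compilation requirements: the CUDA Toolkit with nvcc (CUDA 11.8+). Target architectures can optionally be forced with `CUDA_TARGET_SM` or `TORCH_CUDA_ARCH_LIST`.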
Pre-configured architecture presets for training models from scratch, spanning 500K to 300B parameters. These are configuration templates, not pre-trained models.
Each preset specifies architecture dimensions, MoE/MoD parameters, hardware targets, and expected performance with CUDA acceleration for initializing and training new models.
| Config | Active Params | Total Params | Hidden | Layers | Heads | KV Heads | Experts | Top-K | Hardware | Memory (FP16) | Throughput | CUDA Speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| debug | 500K | 4M | 128 | 2 | 2 | 2 | 8 | 2 | T4 | 50 MB | Verified | 2.1x |
| debug_200m | 200M | 6B | 768 | 12 | 12 | 12 | 32 MoD | - | T4 | 2 GB | Verified | 2.8x |
| b1 | 1B | 8B | 1024 | 24 | 16 | 4 | 8 | 2 | T4 (Projected) | 8 GB | ~1400 tok/s* | 3.2x* |
| b7 | 7B | 56B | 4096 | 32 | 32 | 8 | 8 | 2 | Untested | 28 GB | Projected | Theoretical |
| b14 | 14B | 112B | 5120 | 40 | 40 | 10 | 8 | 2 | Untested | 56 GB | Projected | Theoretical |
| b30 | 30B | 240B | 8192 | 48 | 64 | 16 | 8 | 2 | Untested | 120 GB | Projected | Theoretical |
| b50 | 50B | 400B | 10240 | 56 | 80 | 20 | 8 | 2 | Untested | 200 GB | Projected | Theoretical |
| b100 | 100B | 800B | 12288 | 80 | 96 | 24 | 8 | 2 | Untested | 400 GB | Projected | Theoretical |
| b200 | 200B | 1.6T | 16384 | 100 | 128 | 32 | 8 | 2 | Untested | 800 GB | Projected | Theoretical |
| b300 | 300B | 2.4T | 20480 | 120 | 160 | 40 | 8 | 2 | Untested | 1.2 TB | Projected | Theoretical |
> [!IMPORTANT]
> Performance Disclaimer: All benchmarks and throughput estimates provided are either verified on an NVIDIA T4 GPU or calculated as theoretical projections. Results on other hardware or larger scales are untested and listed for architectural reference only. Performance will vary based on your specific environment.
Memory estimates: Include model weights, optimizer states (Adam: 8 bytes/param), gradients, and activation memory at batch_size=1, mixed precision training. Actual memory scales with batch size and sequence length.
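As a rough illustration of the per-parameter part of this accounting, the sketch below sums weights, gradients, and optimizer state. The Adam figure of 8 bytes per parameter comes from the text. The 2-byte weight and gradient widths are assumptions for FP16 storage, and the function name is hypothetical. Activation memory is excluded, and the table's figures may use different accounting, so this is not expected to reproduce them.

```python
def estimate_state_memory_gb(num_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Weights + gradients + optimizer state in GB (1 GB = 1e9 bytes).

    Hypothetical helper. Excludes activations, which grow with batch
    size and sequence length.
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return num_params * bytes_per_param / 1e9

# One billion parameters at 12 bytes each
print(estimate_state_memory_gb(1_000_000_000))
```

Passing `weight_bytes=4` or adding a master-copy term models other precision setups under the same assumptions.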
<p align="center"> <img src="assets/mem_usage_graph.png" width="800"> </p> Preset debug_200m not included in this graph
Throughput estimates: With CUDA acceleration enabled at batch_size=1, sequence_length=2048, mixed precision with gradient checkpointing. CUDA speedup column shows combined acceleration from all custom kernels vs. pure PyTorch.
Configuration selection:
- debug for pipeline validation, debug_200m for architecture testing
- b1 for prototyping on consumer hardware with CUDA acceleration
- b7 for quality/efficiency balance with significant CUDA speedup
- b30+ for maximum model capacity
- b100+ requires cluster infrastructure and distributed expertise

Important:
These presets define untrained model architectures. Training starts from random initialization following standard practices (Xavier/Kaiming initialization for weights, zero initialization for biases). The framework does not provide pre-trained checkpoints.
Customization:
All presets are starting points, and architecture dimensions can be modified:
- `hidden_size` must be divisible by `num_heads`.
- `intermediate_size` is typically 8/3 × `hidden_size`, rounded to the nearest multiple of 256 for optimal CUDA performance.
- `max_position_embeddings` determines the context window.
- `num_experts` and `moe_top_k` can be adjusted independently.
- The MoD `capacity_factor` controls the compute/quality tradeoff.

CUDA kernels automatically adapt to configuration changes.
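The two arithmetic constraints on `hidden_size` can be checked with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and the rounding follows the rule of thumb in the text rather than any code from the framework.

```python
def head_dim(hidden_size, num_heads):
    """Per-head width; rejects sizes that do not divide evenly."""
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not divisible by num_heads={num_heads}"
        )
    return hidden_size // num_heads

def suggest_intermediate_size(hidden_size, multiple=256):
    """8/3 * hidden_size rounded to the nearest multiple of `multiple`."""
    raw = 8 * hidden_size / 3
    return max(multiple, round(raw / multiple) * multiple)

# Dimensions from the b7 preset row: hidden 4096, 32 heads
print(head_dim(4096, 32), suggest_intermediate_size(4096))
```

For the b7 dimensions this gives a head width of 128 and an intermediate size of 11008.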
---
- Training precision: Format used during forward pass, backward pass, and gradient computation
- Inference precision: Format used during validation and evaluation
- Master precision: Format for optimizer's master parameter copy (typically FP32 in mixed precision)
- CUDA kernel precision: Automatic selection based on training precision
Separate training/inference precision: Common pattern: Train in mixed_bf16 for speed with CUDA acceleration, evaluate in fp32 for precise metrics. Or train in mixed_fp16 with CUDA kernels, deploy in int8 for inference.
Loss scaling parameters (FP16 only):
- init_scale: Initial loss scaling factor (default: 2^16)
- scale_factor: Multiplier for scale adjustment (default: 2.0)
- scale_window: Steps without overflow before increasing scale (default: 2000)
- min_scale: Minimum scale factor (default: 1.0)
Dynamic loss scaling adjusts automatically: scale increases every scale_window steps without overflow, decreases on overflow detection (NaN/Inf gradients). CUDA kernels maintain numerical stability with loss scaling. Most users do not need to modify these parameters.
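The adjustment rule can be sketched as a small state machine using the four parameters and defaults listed above. The class name is hypothetical, and dividing by `scale_factor` on overflow is an assumption: the text says only that the scale decreases.

```python
import math

class DynamicLossScaler:
    """Minimal sketch of dynamic loss scaling, not the framework's class."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0,
                 scale_window=2000, min_scale=1.0):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.min_scale = min_scale
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, grads):
        """Return True if the optimizer step should be applied."""
        overflow = any(math.isnan(g) or math.isinf(g) for g in grads)
        if overflow:
            # Assumed rule: shrink by scale_factor, never below min_scale
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.scale_window:
            self.scale *= self.scale_factor
            self.good_steps = 0
        return True

scaler = DynamicLossScaler(scale_window=3)
for grads in ([0.1, float("inf")], [0.1], [0.2], [0.3]):
    applied = scaler.update(grads)
    print(applied, scaler.scale)
```

Skipping the optimizer step on overflow, as the `False` return value signals, keeps NaN/Inf gradients out of the weights.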
---
Quantization & Inference:
- AutoGPTQ: 4-bit quantization support for efficient inference and fine-tuning on consumer hardware.
- Optimum Quanto: Dynamic quantization support (8-bit/4-bit) for flexible deployment.
- OpenAI Triton: High-performance FP8 kernels for H100+ architectures.

Model Compatibility:
- DeepSeek Config Adapter: Auto-convert training configurations to DeepSeek-compatible formats.
- HuggingFace Interop: Seamless integration with transformers for dataset loading and tokenization.
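Environment dependencies and system requirements:
Install the CUDA Toolkit (11.8+ or 12.x). Compile the transformer and MoE kernels by running `./compile_transformer_ops.sh` and `./compile_cuda_moe.sh` in `Src/Main_Scripts/core`, then compile the training kernels by running `./compile_kernels.sh` in `Src/Main_Scripts/training`. The framework detects and loads compiled kernels at runtime and falls back to PyTorch when they are unavailable, so no code changes are needed to use CUDA acceleration.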
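This tool is distributed under a NOASSERTION license. For commercial use, read the license terms carefully and seek legal advice where necessary.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute any legal endorsement of the tool's functionality or quality.
Validate the tool thoroughly in a sandbox or test environment before deploying it to production, and carry out the necessary security assessment.
📄 NOASSERTION: consult the original license terms for specific usage restrictions.
Overall, Adaptive Training System is a good-quality Agent workflow that is reasonably competitive among comparable tools. AI Skill Hub will continue to track its updates. Consider bookmarking it and adopting it when it suits your own use case.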
| Original name | AdaptiveTrainingSystem |
| Original description | Open-source AI workflow: A PyTorch framework for training transformer language models with Mixture of Exp. ⭐20 · Python |
| Topics | deep-learning, llm, mixture-of-experts, moe, python |
| GitHub | https://github.com/MatN23/AdaptiveTrainingSystem |
| License | NOASSERTION |
| Language | Python |
Listed: 2026-06-07 · Updated: 2026-06-07 · License: NOASSERTION · AI Skill Hub makes no legal endorsement of the accuracy of third-party content.