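AI Skill Hub recommends: AMD Strix Halo LLM Fine-Tuning Guide is a high-quality Agent workflow. It has an overall AI score of 7.5 and performs solidly among comparable tools. If you are looking for a reliable Agent workflow solution, it is worth a closer look.
AMD Strix Halo LLM Fine-Tuning Guide is a complete AI Agent automation workflow. Using visual node orchestration, it breaks complex multi-step tasks into clear automated flows that run unattended from start to finish. It integrates with hundreds of external services and APIs, and suits data-processing pipelines, business automation, and AI-assisted decision systems.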
```bash
# Option 1: install with pip (recommended)
pip install strix-halo-llm-finetune-guide

# Option 2: install into a virtual environment (recommended for production)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install strix-halo-llm-finetune-guide

# Option 3: install from source (latest features)
git clone https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide
cd strix-halo-llm-finetune-guide
pip install -e .

# Verify the installation
python -c "import strix_halo_llm_finetune_guide; print('Installation OK')"

# Command-line usage
strix-halo-llm-finetune-guide --help

# Basic usage
strix-halo-llm-finetune-guide input_file -o output_file
```

```python
# Calling from Python code
import strix_halo_llm_finetune_guide

# Example
result = strix_halo_llm_finetune_guide.process("input")
print(result)
```

```yaml
# strix-halo-llm-finetune-guide example config file (config.yml)
app:
  name: "strix-halo-llm-finetune-guide"
  debug: false
  log_level: "INFO"
```

```bash
# Point to the config file at run time
strix-halo-llm-finetune-guide --config config.yml

# Or configure through environment variables
export STRIX_HALO_LLM_FINETUNE_GUIDE_API_KEY="your-key"
export STRIX_HALO_LLM_FINETUNE_GUIDE_OUTPUT_DIR="./output"
```
Anything not in a stock Ubuntu Server install you'll need:
```bash
sudo apt update
sudo apt install -y \
  build-essential cmake ninja-build git curl jq \
  python3-venv python3-dev \
  linux-headers-generic
sudo apt install -y hiprand-dev rocrand-dev hipcub-dev rocprim-dev rocthrust-dev
```
| Layer | Version | Source | Why this version |
|---|---|---|---|
| Linux kernel | **6.19.14 mainline** (as tested; 6.19 now EOL — use 7.0.x, see [Upgrade-path gotchas](#upgrade-path-gotchas)) | Ubuntu kernel.ubuntu.com | KFD driver fixes for gfx1151; older kernels hit fence/dma_buf sync bugs |
| ROCm system | **7.1.0** | Radeon repo (repo.radeon.com/rocm/apt/7.1) | rocm-cmake, hipcc, hipBLAS etc. for builds |
| ROCm Python wheels | **7.13 nightly** | https://rocm.nightlies.amd.com/v2-staging/gfx1151/ | Native gfx1151 — no HSA_OVERRIDE_GFX_VERSION needed |
| PyTorch | **2.11.0+rocm7.13.0a*** | gfx1151 nightly index | bf16 LoRA + AOTriton SDPA work natively |
| flash-linear-attention | **0.5.1 from source, patched** (vanilla 0.5.0 also works on the 7.13 nightly stack — see [Upgrade-path](#upgrade-path-gotchas)) | github.com/fla-org/flash-linear-attention | GatedDeltaNet (Qwen3.5) needs Triton kernels |
| bitsandbytes | **0.50.0.dev0 built from source for gfx1151** | github.com/bitsandbytes-foundation/bitsandbytes | PyPI wheels ship zero ROCm binaries |
| llama.cpp | **b9296** (as built; b867+ fine for plain inference) rebuilt with --gcc-install-dir flag | github.com/ggml-org/llama.cpp | Inference of fine-tuned + base models; --spec-type draft-mtp needs **b9180+** (see [§6b](#speculative-decoding-with-qwen36-mtp-16-decode-speedup-on-gfx1151)) |
| transformers / trl / peft | 5.4 / 0.29.1 / 0.18.1 | PyPI | Stable for our patterns |
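
To confirm that a venv matches the Python-side pins in the table, a small check can compare installed versions against them. This is a minimal sketch, not part of the guide's repo: `PINS` copies the transformers / trl / peft versions from the table, `installed_versions` and `check_pins` are hypothetical helpers, and matching by prefix (so `5.4` accepts `5.4.0`) is an assumption about how strictly to pin.

```python
from importlib import metadata

# Pins copied from the version table; prefix matching is an assumption.
PINS = {"transformers": "5.4", "trl": "0.29.1", "peft": "0.18.1"}


def installed_versions(names):
    """Look up installed versions; packages that are absent map to None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def check_pins(installed, pins=PINS):
    """Return {package: (wanted, found)} for every package that is off-pin."""
    mismatches = {}
    for name, wanted in pins.items():
        found = installed.get(name)
        ok = found is not None and (
            found == wanted or found.startswith(wanted + ".")
        )
        if not ok:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    report = check_pins(installed_versions(PINS))
    print("all pins match" if not report else f"off-pin: {report}")
```

The nightly torch and triton wheels carry date-stamped local versions, so they are left out of this sketch; checking them would need an exact-string comparison instead.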
---
Download the four .deb files from https://kernel.ubuntu.com/mainline/v6.19.14/amd64/:
linux-headers-6.19.14-061914_*_all.deb
linux-headers-6.19.14-061914-generic_*_amd64.deb
linux-image-unsigned-6.19.14-061914-generic_*_amd64.deb
linux-modules-6.19.14-061914-generic_*_amd64.deb
```bash
sudo dpkg -i linux-headers-*-all.deb *-fixed.deb
sudo update-grub && sudo reboot
```
```bash
GUIDE=/path/to/strix-halo-llm-finetune-guide
python3 "$GUIDE/scripts/fla_repatch.py" \
  --fla-root /path/to/fla-patched \
  --cumsum-backup "$GUIDE/scripts/cumsum-pytorch.py"

pip install -e .
```
Re-run fla_repatch.py after every git pull of FLA. It's idempotent — running it on already-patched code is a no-op.
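
Idempotence here means a second run finds nothing left to change. The pattern can be sketched as follows; this is not the code in `fla_repatch.py`, and the function name and the example strings in the usage note are hypothetical.

```python
def apply_patch_once(source: str, old: str, new: str) -> tuple[str, bool]:
    """Replace the first `old` with `new` unless the patch is already present.

    Returns (text, changed). Checking for `new` first keeps the function
    idempotent even when `new` contains `old` as a substring.
    """
    if new in source:
        # Already patched: leave the text untouched.
        return source, False
    if old not in source:
        # Neither form is present, so the upstream code has probably moved.
        raise ValueError("patch anchor not found; upstream may have changed")
    return source.replace(old, new, 1), True
```

For example, patching `"x = cumsum(a)\n"` with `old="cumsum(a)"` and `new="safe_cumsum(a)"` changes the text once; a second call on the result returns it unchanged with `changed` set to `False`. Raising when the anchor is missing is a design choice in this sketch so that an upstream refactor is not silently skipped.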
---
```bash
PATH=/opt/rocm-7.1.0/bin:$PATH cmake --build build --config Release

pip uninstall -y bitsandbytes
pip install -e .
```
If you want to run the resulting fine-tune via llama-server, build llama.cpp with the --gcc-install-dir flag (without it, ROCm 7.1.0's clang-20 can't find <cmath>):
```bash
git clone https://github.com/ggml-org/llama.cpp /path/to/llama.cpp
cd /path/to/llama.cpp
PATH=/opt/rocm-7.1.0/bin:$PATH \
cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_HIP=ON \
  -DGGML_HIP_ROCWMMA_FATTN=OFF \
  -DGGML_HIP_GRAPHS=ON \
  -DGGML_HIP_MMQ_MFMA=ON \
  -DGGML_HIP_NO_VMM=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_HIP_FLAGS="--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/13"
PATH=/opt/rocm-7.1.0/bin:$PATH cmake --build build --parallel $(nproc)
```
Then symlink the binaries to where §6b and the eval harness expect them. RUNPATH is baked to build/bin (see the move warning below), so a symlink is safe; the binary still resolves its .so files from the real build dir:

```bash
sudo ln -sf "$PWD/build/bin/llama-server" /usr/local/bin/llama-server
sudo ln -sf "$PWD/build/bin/llama-perplexity" /usr/local/bin/llama-perplexity
```
GGML_HIP_GRAPHS=ON is now the upstream default (b867+), but enabling it explicitly doesn't hurt.
GGML_HIP_ROCWMMA_FATTN=OFF is intentional, even though ON is the AMD-recommended setting for RDNA 3.5. On gfx1151 specifically, the rocwmma flash-attention path is dramatically slower than llama.cpp's runtime FA at any non-trivial context depth: about 2.4× slower on prefill at 8k context on both dense Qwen3.5-27B and MoE Qwen3.6-A3B. TG (token generation) is unaffected because it is memory-bandwidth-bound. Hardware-verified A/B numbers and reproduction scripts are in rocwmma-fattn-sweep/. Earlier versions of this guide recommended ON; that was wrong and is now corrected.
Minimum build for --spec-type draft-mtp GPU dispatch: b9180+ (community-reported — see the u/kant12 credit in §6b; our own b1270 attempt also used llama-cli, so it doesn't independently bisect the floor). Older builds (we tried b1270 via lemonade's prebuilt) will accept the --spec-type draft-mtp flag without complaint but never dispatch the draft model to the GPU — the process pegs a CPU core at 0% GPU and never makes progress. Symptom is silent. And use llama-server, not llama-cli for the speculation path; we burned hours on this and the llama-cli path doesn't wire the draft dispatcher the same way. Settings shown in §6b below.
Build in the directory you intend to keep it in. cmake bakes the absolute build/bin path into the binary's RUNPATH, so if you build in /tmp/llama.cpp-test/ and then move the tree to /srv/aurora-ai/llama.cpp/, the resulting binary will fail to find its shared libraries (libllama-server-impl.so etc.) on launch. Reconfigure + rebuild in the final location, or use patchelf --set-rpath. We hit this swapping our own production build from a staging dir.
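
Before moving a build tree, it can help to see which directories are baked into the binary. The sketch below parses the dynamic section printed by binutils' `readelf -d`; it assumes `readelf` is on `PATH`, and both function names are hypothetical helpers rather than anything shipped with the guide.

```python
import re
import subprocess

# Matches the RUNPATH / RPATH rows of `readelf -d` output.
_RUNPATH_RE = re.compile(
    r"\((?:RUNPATH|RPATH)\)\s+Library r(?:un)?path: \[([^\]]*)\]"
)


def parse_runpath(readelf_dynamic: str) -> list[str]:
    """Extract the RUNPATH/RPATH directories from `readelf -d` text."""
    dirs = []
    for match in _RUNPATH_RE.finditer(readelf_dynamic):
        dirs.extend(p for p in match.group(1).split(":") if p)
    return dirs


def binary_runpath(path: str) -> list[str]:
    """Run readelf on a binary and return its baked-in library search dirs."""
    out = subprocess.run(
        ["readelf", "-d", path], check=True, capture_output=True, text=True
    ).stdout
    return parse_runpath(out)
```

If `binary_runpath("/usr/local/bin/llama-server")` lists a directory that no longer exists, the binary will fail to find its shared libraries in the way described above.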
---
A reproducible recipe for fine-tuning Qwen3.5-27B (or larger) hybrid LLMs on a single AMD Strix Halo APU (Ryzen AI MAX+ 395, Radeon 8060S, gfx1151) with 128 GB of unified memory — including the patches, system tuning, and out-of-process evaluation orchestrator that make multi-day training runs survivable on consumer hardware.
Status: Tested on a Corsair AI Workstation 300 (Sixunited AXB35-02 board) running Ubuntu 24.04 LTS, mainline kernel 6.19.14 (as tested; 6.19 now EOL — use 7.0.x, see Upgrade-path gotchas), ROCm 7.13 nightly. The same recipe should work on Framework Desktop, GMKtec EVO-X2, FEVM FA-EX9, Bosgame M5 — any AXB35-02 / Strix Halo system.
---
```bash
# Create the venv and install the gfx1151 nightly PyTorch stack
python3 -m venv /path/to/venv
source /path/to/venv/bin/activate
pip install --pre \
  "torch==2.11.0+rocm7.13.0a20260506" \
  "torchvision==0.26.0+rocm7.13.0a20260506" \
  "torchaudio==2.11.0+rocm7.13.0a20260506" \
  "triton==3.6.0+rocm7.13.0a20260506" \
  --index-url https://rocm.nightlies.amd.com/v2-staging/gfx1151/ \
  --extra-index-url https://pypi.org/simple/
```

```bash
# Configure the bitsandbytes HIP build for gfx1151
PATH=/opt/rocm-7.1.0/bin:$PATH \
cmake -G Ninja \
  -DCOMPUTE_BACKEND=hip \
  -DBNB_ROCM_ARCH="gfx1151" \
  -DCMAKE_BUILD_TYPE=Release \
  -DROCM_VERSION=83 \
  -DCMAKE_HIP_FLAGS="--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/13" \
  -S . -B build
```
If you're serving the fine-tune (or any Qwen3.5/3.6 base model) via llama-server for chat or tool-call use, a few runtime settings beyond the build flags matter on this hardware. These are what we run in production:
```ini
Environment=LD_LIBRARY_PATH=/path/to/venv/lib/python3.12/site-packages/_rocm_sdk_core/lib:/opt/rocm/lib
ExecStart=/usr/local/bin/llama-server \
    -m /path/to/your-qwen35.gguf \
    -ngl 999 \
    -c 32768 \
    --fit off \
    --no-mmap \
    --reasoning-budget 0 \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    --host 0.0.0.0 \
    --port 8080
```
Per-flag rationale:
- `--no-mmap` is the gfx1151 gotcha: mmap-only loading triggers a ~30 min GPU page-table setup wall on the unified-memory path. Either `--no-mmap` alone or `--mmap --direct-io` together works; mmap alone hangs. Documented across multiple Strix Halo issues; not specific to llama.cpp.
- `--fit off` disables llama-server's auto-fit; we keep it off across the board (with explicit `-ngl`/`-c`, the sizing heuristic is unnecessary).
- `LD_LIBRARY_PATH` overlay (the `Environment=` line above): stock ROCm 7.1.0's `libhsa-runtime64.so` has a null-pointer bug on gfx1151 that surfaces as crashes/hangs at model load. Prepend the nightly runtime from PyTorch's `_rocm_sdk_core` wheel so it wins resolution. This is the same overlay the repo's benchmark (`rocwmma-fattn-sweep/bench.sh`) and eval harness (`scripts/eval_via_llama_perplexity.py`) rely on; the §6b numbers below were measured with it.
- `--reasoning-budget 0` disables the thinking block. Strongly recommended for tool-call workflows: Qwen3.5/3.6's native chat template emits tool calls inside the `<thinking>` block, and if the reasoning budget runs out mid-call the response stream looks empty to the client. Leave thinking on only for pure-chat-no-tools workloads where reasoning visibly helps.
- `--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00` is the unsloth-recommended set for Qwen3.5/3.6 with reasoning off. Their per-model sampling guidance is worth following; it gives meaningfully better coherence on this family than llama.cpp's defaults. See unsloth's Qwen3.6 docs for the per-mode (reasoning vs non-reasoning) recommendations.
- KV-cache quantization (`--cache-type-k q4_0 --cache-type-v q4_0`) is reported to give measurable memory-bandwidth gains at long context with minimal quality loss on Qwen3.5/3.6. We haven't benched it ourselves yet on this hardware (production is at the F16 cache default, 8k context, where the bandwidth pressure is lower); we will add numbers when we do. If you're running long-context (32k+) chat workloads, it's worth trying.

For tool-call agents specifically (Continue, Codex CLI, Roo, OpenClaw, aichat, etc.), also note:

- Tool calls come out as `<tool_call><function=...>...</function></tool_call>`, which trips clients expecting Hermes-style JSON `{"name": ..., "arguments": ...}`. Swap via `--chat-template-file <your-hermes.jinja>`. Templates for Qwen3-Coder-Next + Nemotron-3-Super in Hermes format are floating around HuggingFace and the ggml-org/llama.cpp issue tracker.
- Serve tool-call clients from a llama-server instance (or a separate role binding in your client config) with `--reasoning-budget 0`.

`scripts/tg_alert.sh` is a 50-line bash helper that sends HTML messages to a Telegram bot. Set up:

1. Message @BotFather on Telegram, create a bot, save the token.
2. Send @userinfobot `/start` and it returns your numeric chat ID immediately.
3. Save both to an env file with the values quoted. The token contains `:` and `_` and other characters that can confuse a `source` if the value isn't quoted:

```bash
sudo mkdir -p /etc/strix-halo
sudo tee /etc/strix-halo/telegram.env > /dev/null <<EOF
TELEGRAM_BOT_TOKEN="<your-token>"
TELEGRAM_CHAT_ID="<your-chat-id>"
EOF
sudo chown "root:$(whoami)" /etc/strix-halo/telegram.env
sudo chmod 0640 /etc/strix-halo/telegram.env
```

```bash
./scripts/tg_alert.sh "<b>Test</b> - Strix Halo guide setup OK"
```
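
For callers that are already in Python (a training loop or an eval orchestrator, say), the same alert can be sent without shelling out. This is a sketch under stated assumptions, not a port of `tg_alert.sh`: it uses the public Telegram Bot API `sendMessage` method with `parse_mode` set to `HTML`, and `build_alert_request` and `send_alert` are hypothetical names.

```python
import json
import urllib.parse
import urllib.request


def build_alert_request(token: str, chat_id: str, html: str) -> urllib.request.Request:
    """Build a Bot API sendMessage request that renders `html` as HTML."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    body = urllib.parse.urlencode(
        {"chat_id": chat_id, "text": html, "parse_mode": "HTML"}
    ).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def send_alert(token: str, chat_id: str, html: str, timeout: float = 10.0) -> bool:
    """Send the message and return the API's `ok` flag."""
    request = build_alert_request(token, chat_id, html)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bool(json.load(response).get("ok"))
```

Form-encoding the body takes care of the `:` and `_` characters in the token and of any markup in the message, so no manual quoting is needed on the Python side.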
If a previous setup left /path/to/venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm82.so lying around (a symlink to a non-existent file from an older bnb install), Python treats that directory as a namespace package — and silently shadows your editable install. Symptom: import bitsandbytes; print(bitsandbytes.__file__) returns None, no .optim attribute. Cure:
```bash
rm -rf /path/to/venv/lib/python3.12/site-packages/bitsandbytes
```
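
The shadowing can be detected before it bites by asking the import system how a name resolves. A minimal sketch, using only the standard library; `describe_import` is a hypothetical helper, and it relies on namespace packages reporting no `origin` in their import spec (the behaviour of current Python 3 releases).

```python
import importlib.util


def describe_import(name: str):
    """Report how `name` resolves: (kind, location).

    kind is 'missing', 'namespace' (a bare directory without __init__.py,
    which is what a stale leftover looks like), or 'regular'.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return ("missing", None)
    if spec.origin is None and spec.submodule_search_locations is not None:
        return ("namespace", list(spec.submodule_search_locations))
    return ("regular", spec.origin)
```

Running `describe_import("bitsandbytes")` inside the venv should report `regular` with a path into your source tree; a `namespace` result pointing at `site-packages/bitsandbytes` is the leftover directory described above.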
Inference on Strix Halo can run through either of two llama.cpp backends, and the right choice is not the same for every workload:
The Vulkan build uses -DGGML_VULKAN=ON (no HIP); the recipe is in vulkan-vs-rocm-sweep/build-vulkan.sh.

Tested on Qwen3.6-35B-A3B at the same source commit (b9296), same hardware, same bench shape:
Q4_K_M (quantized) — Vulkan wins decode by ~22%:
| shape | ROCm/HIP | Vulkan | Winner |
|---|---|---|---|
| pp512 fa=1 | 1014.32 | 942.18 | ROCm (+7.7%) |
| tg128 d=0 | 49.58 | **60.39** | **Vulkan (+21.8%)** |
| tg128 d=8392 | 46.73 | **57.13** | **Vulkan (+22.3%)** |
BF16 (full precision) — ROCm wins decode by ~117%:
| shape | ROCm/HIP | Vulkan | Winner |
|---|---|---|---|
| pp512 fa=1 | **484.01** | 305.21 | **ROCm (+58.6%)** |
| tg128 d=0 | **23.71** | 10.73 | **ROCm (+121%)** ← over 2× |
| tg128 d=8392 | **23.09** | 10.64 | **ROCm (+117%)** |
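
The percentages in the Winner columns are the faster backend's lead over the slower one. A short sketch that recomputes them from the figures in the two tables; the `RESULTS` layout and the `winner` helper are illustrative names, not part of the sweep scripts.

```python
# Figures copied from the two tables above, as (ROCm/HIP, Vulkan) pairs.
RESULTS = {
    "Q4_K_M": {
        "pp512 fa=1": (1014.32, 942.18),
        "tg128 d=0": (49.58, 60.39),
        "tg128 d=8392": (46.73, 57.13),
    },
    "BF16": {
        "pp512 fa=1": (484.01, 305.21),
        "tg128 d=0": (23.71, 10.73),
        "tg128 d=8392": (23.09, 10.64),
    },
}


def winner(rocm: float, vulkan: float):
    """Return (backend, lead in percent of the faster over the slower)."""
    if rocm >= vulkan:
        return "ROCm", round((rocm / vulkan - 1) * 100, 1)
    return "Vulkan", round((vulkan / rocm - 1) * 100, 1)


if __name__ == "__main__":
    for quant, shapes in RESULTS.items():
        for shape, (rocm, vulkan) in shapes.items():
            backend, lead = winner(rocm, vulkan)
            print(f"{quant:7s} {shape:13s} {backend} (+{lead}%)")
```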
The reason is visible right in Vulkan's own capability report on launch:
```text
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | ...
                                                                                ^^^^^^^
                                                                                no native BF16
```
bf16: 0 — RADV STRIX_HALO supports FP16 cooperative matrix natively but not BF16; the Vulkan backend falls back to slower kernels for BF16 ops. ROCm/HIP has BF16 wired through native HIP matmul kernels and dominates anything BF16-bound.
Practical recommendation:
| Workload | Backend |
|---|---|
| Quantized inference (Q4/Q5/Q6/Q8) | **Vulkan** |
| Full-precision (BF16) inference | **ROCm/HIP** |
| Training (always BF16/FP32) | **ROCm/HIP** (only path with the PyTorch nightly stack) |
| Mixed | Whichever your hot path is |
Full sweep + per-shape numbers + capability extract + the build recipe for the Vulkan binary in vulkan-vs-rocm-sweep/. Long-form writeup with the methodology, all depths, and the bf16: 0 deep-dive: ROCm vs Vulkan on AMD Strix Halo: when each wins, and why it inverts at the precision boundary. The Vulkan canonical dashboard for Strix Halo (with deeper per-model Vulkan numbers) is bench.ciru.ai; this guide is the canonical ROCm + training reference.
---
The failure modes that cost us the most time, indexed. Each links to the step with the full fix.
| Symptom | Cause | Fix | Where |
|---|---|---|---|
| Kernel `.deb` install half-configures / run-parts errors | Mainline kernel `.deb`s have a double-directory run-parts bug across image/modules/headers maintainer scripts | Run `scripts/fix-kernel-run-parts.py` on the `.deb`s before installing; it rewrites the trigger scripts to `if [ -d X ]; then … fi` form | [Step 1](#step-1--kernel-61914-mainline) |
| `'cstdlib' file not found` / `'cmath'` during a HIP build | ROCm 7.1's clang-20 picks the gcc-14 runtime dir, which lacks the C++ headers, on Ubuntu 24.04 | Pass `--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/13` via `CMAKE_HIP_FLAGS` (cmake) or `HIPCC_COMPILE_FLAGS_APPEND` (pip) | [Step 5](#step-5--bitsandbytes-from-source-for-rocm), [Step 6](#step-6--llamacpp-hip-build-for-inference) |
| `import bitsandbytes` loads the PyPI build, not your source build | Namespace-package shadow: the editable install doesn't win on `sys.path` | See the namespace-shadow fix; verify `bitsandbytes.__file__` resolves into your source tree | [Step 5](#step-5--bitsandbytes-from-source-for-rocm) |
| System hard-freezes mid-training, needs a power-off | VRAM/unified-pool exhaustion hangs the HIP driver instead of raising OutOfMemoryError | `torch.cuda.set_per_process_memory_fraction(0.80)`: on a 128 GB unified APU, 0.80 (≈102 GB) leaves the host enough; 0.90 starves it | [Training contract](#training-script--the-contract) |
| llama.cpp model load hangs ~30 min at GPU page-table setup | mmap-only load on gfx1151 triggers a slow page-table walk | Use `--no-mmap`, or mmap **and** direct_io together; never mmap alone | [Step 6](#step-6--llamacpp-hip-build-for-inference) |
| Random crashes mid-training, no obvious cause | `/srv` permissions silently regress off 755 | Install the `/srv` perm watchdog cron (defense in depth; the root cause is still unpinned) | [Step 2](#step-2--system-tuning) |

Run `rm -rf ~/.triton/cache` after any FLA change.
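
The memory-cap row is easiest to reason about as a split of the unified pool between the training process and the host. The sketch below does that arithmetic and then applies the cap through `torch.cuda.set_per_process_memory_fraction`. It assumes PyTorch sees the full 128 GB pool, as the table's ≈102 GB figure does; `split_unified_memory` and `apply_cap` are hypothetical helper names.

```python
def split_unified_memory(total_gb: float, fraction: float) -> tuple[float, float]:
    """Return (GPU-process cap, host headroom) in GB for a unified pool."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    cap = total_gb * fraction
    return cap, total_gb - cap


def apply_cap(fraction: float = 0.80) -> bool:
    """Apply the cap if PyTorch and a GPU are available; report whether it was set."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # On ROCm builds the torch.cuda namespace drives the HIP device.
    torch.cuda.set_per_process_memory_fraction(fraction)
    return True


if __name__ == "__main__":
    for fraction in (0.80, 0.90):
        cap, host = split_unified_memory(128, fraction)
        print(f"fraction={fraction:.2f}: cap {cap:.1f} GB, host keeps {host:.1f} GB")
```

With 128 GB, a fraction of 0.80 leaves the host about 25.6 GB, while 0.90 leaves about 12.8 GB. Call `apply_cap()` once at the start of the training script, before any large allocation.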
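This project requires the following packages on Ubuntu Server: build-essential, cmake, ninja-build, git, curl, jq, python3-venv, python3-dev, and linux-headers-generic. It also requires hiprand-dev, rocrand-dev, hipcub-dev, rocprim-dev, and rocthrust-dev.
This project requires Linux kernel 6.19.14 and ROCm system version 7.1.0 on Ubuntu Server. Packages such as rocm-cmake, hipcc, and hipBLAS must be downloaded and installed from the Radeon repository.
This project provides a reproducible recipe for fine-tuning Qwen3.5-27B (or larger) hybrid LLMs on an AMD Strix Halo APU, including patches, system tuning, and an out-of-process evaluation orchestrator. The recipe is intended for multi-day training runs on consumer hardware.
This project requires installing torch, torchvision, torchaudio, and triton in a virtual environment. The ROCm 7.1.0 toolchain must also be pointed at gcc-13 so that clang can find the C++ standard library headers.
The rocm and rocm-sdk-* packages are pulled in automatically with torch. Note that _rocm_sdk_core/lib is used as a library-path overlay, including by rocwmma-fattn-sweep/bench.sh.
Watch out for namespace-package shadowing: make sure an earlier setup has not left an old package directory behind. Also watch for kernel `.deb` install errors and run-parts errors.
This project provides a failure-mode index covering fixes for kernel `.deb` install errors, run-parts errors, namespace-package shadowing, and similar problems.
The project offers an easy-to-use workflow for fine-tuning large language models and is worth following.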
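AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and is not a legal endorsement of the tool's functionality or quality.
Validate thoroughly in a sandbox or test environment before deploying to production, and carry out the necessary security assessment.
✅ MIT License: one of the most permissive open-source licenses. It allows free commercial use, modification, and distribution, provided the copyright notice is retained.
Overall, AMD Strix Halo LLM Fine-Tuning Guide is a good-quality Agent workflow that is reasonably competitive among comparable tools. AI Skill Hub will keep tracking its updates; bookmark it and adopt it when it fits your own use case.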
| Original name | strix-halo-llm-finetune-guide |
| Original description | Open-source AI workflow: Home-enthusiast's guide to fine-tuning 27B+ LLMs on AMD Strix Halo (gfx1151, Ryz. ⭐23 · Python |
| Topics | workflow, python |
| GitHub | https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide |
| License | MIT |
| Language | Python |
Listed: 2026-06-09 · Updated: 2026-06-09 · License: MIT · AI Skill Hub does not legally endorse the accuracy of third-party content.