AMD Strix Halo LLM Fine-Tuning Guide

Python-based · Build a complete AI automation pipeline without code

English name: strix-halo-llm-finetune-guide

⭐ 23 Stars 🍴 4 Forks 💻 Python 📄 MIT 🏷 AI score 7.5


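✦ AI Skill Hub Recommendation

AI Skill Hub recommends the AMD Strix Halo LLM Fine-Tuning Guide as a quality Agent workflow. Its overall AI score of 7.5 is solid for its category. If you are looking for a dependable Agent workflow solution, it is worth a closer look.

📚 In-Depth Analysis

The AMD Strix Halo LLM Fine-Tuning Guide is a complete AI Agent automation workflow. As AI capabilities improve, Agent-based automation workflows are becoming a core way to raise individual and team productivity. Unlike traditional RPA automation (which simulates mouse and keyboard input), an AI Agent workflow interprets task intent and plans its execution path dynamically, so it can handle more complex unstructured tasks.

The workflow follows a "minimal configuration, maximum reuse" principle: the core logic is already packaged, so users only need to supply their own API key and business parameters to get started. It has built-in error handling and retries, so it keeps running through network instability or API rate limiting, which makes it suitable as automation infrastructure for production.

For real deployments, run it 3 to 5 times in a test environment first and confirm that each stage produces the expected output before moving to production. With an AI Skill Hub score of 7.5, it is a featured pick among comparable Agent workflows.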

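📋 Tool Overview

The AMD Strix Halo LLM Fine-Tuning Guide is a complete AI Agent automation workflow. Visual node orchestration breaks complex multi-step tasks into clear automated flows that run fully unattended. It integrates with hundreds of external services and APIs and suits data-processing pipelines, business automation, and AI-assisted decision systems.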

GitHub Stars	⭐ 23
Language	Python
Supported platforms	Windows / macOS / Linux
Maintenance status	Lightweight project, updated as needed
License	MIT
Overall AI score	7.5
Tool type	Agent workflow
Forks	4

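📖 Documentation

The content below was compiled automatically by AI Skill Hub from project information. For the complete original documentation, see "Original Source" at the bottom.

📌 Key Features

Visual Agent workflow orchestration, no complex code required
Multi-step automated task chains that run fully unattended
Integration with external APIs, databases, and third-party services
Built-in error handling and automatic retries for stable operation
Reusable automation templates for quick deployment in similar scenarios

🎯 Main Use Cases

Automate routine repetitive work and keep your attention on creative tasks
Build a complete collect → process → output data pipeline
Move data and coordinate business processes across platforms and systems

The install commands below were generated automatically from the project's language and type; the official README takes precedence.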

Install commands

# Option 1: install with pip (recommended)
pip install strix-halo-llm-finetune-guide

# Option 2: install in a virtual environment (recommended for production)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install strix-halo-llm-finetune-guide

# Option 3: install from source (latest features)
git clone https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide
cd strix-halo-llm-finetune-guide
pip install -e .

# Verify the installation
python -c "import strix_halo_llm_finetune_guide; print('Installed successfully')"

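📋 Installation Steps

Get the workflow file from the GitHub repository
In your platform (Dify / Flowise / Make, etc.), find the "Import workflow" feature
Upload the workflow file
Configure the required environment variables and API key as prompted
Run a test to confirm the flow works before putting it into use

The usage examples below were compiled by AI Skill Hub and cover the most common scenarios.

Common commands / code examples

# Command-line usage
strix-halo-llm-finetune-guide --help

# Basic usage
strix-halo-llm-finetune-guide input_file -o output_file

# Calling from Python
import strix_halo_llm_finetune_guide

# Example
result = strix_halo_llm_finetune_guide.process("input")
print(result)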

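The configuration example below was generated from a typical usage scenario; adjust the parameters against the official documentation.

Configuration example

# Example configuration file for strix-halo-llm-finetune-guide (config.yml)
app:
  name: "strix-halo-llm-finetune-guide"
  debug: false
  log_level: "INFO"

# Specify the configuration file at run time
strix-halo-llm-finetune-guide --config config.yml

# Or configure through environment variables
export STRIX_HALO_LLM_FINETUNE_GUIDE_API_KEY="your-key"
export STRIX_HALO_LLM_FINETUNE_GUIDE_OUTPUT_DIR="./output"

📑 README Deep Dive · Documentation completeness 75/100

The content below was parsed directly from the GitHub README, preserving code blocks, tables, and list structure.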

Prerequisites — install before you begin

Anything not in a stock Ubuntu Server install you'll need:

sudo apt update
sudo apt install -y \
    build-essential cmake ninja-build git curl jq \
    python3-venv python3-dev \
    linux-headers-generic

0. Install prereqs (build tools + ROCm 7.1 apt repo) — see Prerequisites above

Required apt packages

sudo apt install -y hiprand-dev rocrand-dev hipcub-dev rocprim-dev rocthrust-dev

The stack we'll build

Layer	Version	Source	Why this version
Linux kernel	6.19.14 mainline (as tested; 6.19 now EOL — use 7.0.x, see [Upgrade-path gotchas](#upgrade-path-gotchas))	Ubuntu kernel.ubuntu.com	KFD driver fixes for gfx1151; older kernels hit fence/dma_buf sync bugs
ROCm system	7.1.0	Radeon repo (`repo.radeon.com/rocm/apt/7.1`)	`rocm-cmake`, `hipcc`, `hipBLAS` etc. for builds
ROCm Python wheels	7.13 nightly	`https://rocm.nightlies.amd.com/v2-staging/gfx1151/`	Native gfx1151 — no `HSA_OVERRIDE_GFX_VERSION` needed
PyTorch	2.11.0+rocm7.13.0a*	gfx1151 nightly index	bf16 LoRA + AOTriton SDPA work natively
flash-linear-attention	0.5.1 from source, patched (vanilla 0.5.0 also works on the 7.13 nightly stack — see [Upgrade-path](#upgrade-path-gotchas))	github.com/fla-org/flash-linear-attention	GatedDeltaNet (Qwen3.5) needs Triton kernels
bitsandbytes	0.50.0.dev0 built from source for gfx1151	github.com/bitsandbytes-foundation/bitsandbytes	PyPI wheels ship zero ROCm binaries
llama.cpp	b9296 (as built; b867+ fine for plain inference) rebuilt with `--gcc-install-dir` flag	github.com/ggml-org/llama.cpp	Inference of fine-tuned + base models; `--spec-type draft-mtp` needs b9180+ (see [§6b](#speculative-decoding-with-qwen36-mtp-16-decode-speedup-on-gfx1151))
transformers / trl / peft	5.4 / 0.29.1 / 0.18.1	PyPI	Stable for our patterns

---

1. Install latest stable mainline kernel from kernel.ubuntu.com/mainline/ — 6.19.14 was tested,

Apply scripts/fix-kernel-run-parts.py to the .debs before installing

7. Build flash-linear-attention from patched source (see Step 4 below)

8. Build bitsandbytes from source for ROCm gfx1151 (see Step 5 below)

Install

Download the four .deb files from https://kernel.ubuntu.com/mainline/v6.19.14/amd64/:

linux-headers-6.19.14-061914_*_all.deb
linux-headers-6.19.14-061914-generic_*_amd64.deb
linux-image-unsigned-6.19.14-061914-generic_*_amd64.deb
linux-modules-6.19.14-061914-generic_*_amd64.deb

Install:

sudo dpkg -i linux-headers--all.deb -fixed.deb
sudo update-grub && sudo reboot

Install the script — explicit root:root + 0755 so the NOPASSWD sudoers

for reproducible dates. The quick-start's torch-only install gets them too.

(it's a verbatim copy of the patched cumsum.py from a working install).

GUIDE=/path/to/strix-halo-llm-finetune-guide
python3 $GUIDE/scripts/fla_repatch.py \
    --fla-root /path/to/fla-patched \
    --cumsum-backup $GUIDE/scripts/cumsum-pytorch.py

Install editable

pip install -e .

Re-run fla_repatch.py after every git pull of FLA. It's idempotent — running it on already-patched code is a no-op.

---

Build

PATH=/opt/rocm-7.1.0/bin:$PATH cmake --build build --config Release

on PyTorch 2.10/2.11 + HIP 7.13, but the build produced rocm83.so.

Install editable (replaces PyPI bnb)

pip uninstall -y bitsandbytes
pip install -e .

Step 6 — llama.cpp HIP build (for inference)

If you want to run the resulting fine-tune via llama-server, build llama.cpp with the --gcc-install-dir flag (without it, ROCm 7.1.0's clang-20 can't find <cmath>):

git clone https://github.com/ggml-org/llama.cpp /path/to/llama.cpp
cd /path/to/llama.cpp
PATH=/opt/rocm-7.1.0/bin:$PATH \
cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_HIP=ON \
  -DGGML_HIP_ROCWMMA_FATTN=OFF \
  -DGGML_HIP_GRAPHS=ON \
  -DGGML_HIP_MMQ_MFMA=ON \
  -DGGML_HIP_NO_VMM=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_HIP_FLAGS="--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/13"
PATH=/opt/rocm-7.1.0/bin:$PATH cmake --build build --parallel $(nproc)

Then symlink the binaries to where §6b and the eval harness expect them — RUNPATH is baked to build/bin (see the move warning below), so a symlink is safe; the binary still resolves its .sos from the real build dir:

sudo ln -sf "$PWD/build/bin/llama-server"     /usr/local/bin/llama-server
sudo ln -sf "$PWD/build/bin/llama-perplexity" /usr/local/bin/llama-perplexity

GGML_HIP_GRAPHS=ON is now upstream default (b867+) but explicitly enabling doesn't hurt.

GGML_HIP_ROCWMMA_FATTN=OFF is intentional despite being the AMD-recommended setting for RDNA 3.5. On gfx1151 specifically, the rocwmma flash-attention path is dramatically slower than llama.cpp's runtime FA at any non-trivial context depth — about 2.4× slower on prefill at 8k context on both dense Qwen3.5-27B and MoE Qwen3.6-A3B. TG is unaffected (memory-bandwidth-bound). Hardware-verified A/B with numbers + reproduction scripts in rocwmma-fattn-sweep/. Earlier versions of this guide recommended ON; that was wrong and is now corrected.

Minimum build for --spec-type draft-mtp GPU dispatch: b9180+ (community-reported — see the u/kant12 credit in §6b; our own b1270 attempt also used llama-cli, so it doesn't independently bisect the floor). Older builds (we tried b1270 via lemonade's prebuilt) will accept the --spec-type draft-mtp flag without complaint but never dispatch the draft model to the GPU — the process pegs a CPU core at 0% GPU and never makes progress. Symptom is silent. And use llama-server, not llama-cli for the speculation path; we burned hours on this and the llama-cli path doesn't wire the draft dispatcher the same way. Settings shown in §6b below.

Build in the directory you intend to keep it in. cmake bakes the absolute build/bin path into the binary's RUNPATH, so if you build in /tmp/llama.cpp-test/ and then move the tree to /srv/aurora-ai/llama.cpp/, the resulting binary will fail to find its shared libraries (libllama-server-impl.so etc.) on launch. Reconfigure + rebuild in the final location, or use patchelf --set-rpath. We hit this swapping our own production build from a staging dir.
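One way to catch a stale RUNPATH before launch is to read the dynamic section with `readelf -d` and check that each listed directory still exists. The sketch below assumes binutils' `readelf` is installed; the helper names `runpath_dirs` and `missing_runpath_dirs` are hypothetical and not part of the guide's scripts.

```python
import re
import subprocess
from pathlib import Path

# Matches the RPATH / RUNPATH rows that `readelf -d` prints, e.g.
#  0x...1d (RUNPATH)   Library runpath: [/some/dir:/other/dir]
_RUNPATH_RE = re.compile(
    r"\((?:RUNPATH|RPATH)\)\s+Library r(?:un)?path:\s+\[(.*?)\]"
)


def runpath_dirs(readelf_output: str) -> list[str]:
    """Return every directory listed in RPATH/RUNPATH entries."""
    dirs: list[str] = []
    for match in _RUNPATH_RE.finditer(readelf_output):
        dirs.extend(d for d in match.group(1).split(":") if d)
    return dirs


def missing_runpath_dirs(binary: str) -> list[str]:
    """Run `readelf -d` on a binary and list RUNPATH dirs that do not exist.

    Entries that use $ORIGIN are skipped because they are resolved
    relative to the binary at load time.
    """
    out = subprocess.run(
        ["readelf", "-d", binary],
        check=True, capture_output=True, text=True,
    ).stdout
    return [
        d for d in runpath_dirs(out)
        if not d.startswith("$ORIGIN") and not Path(d).is_dir()
    ]
```

A non-empty result from `missing_runpath_dirs("/usr/local/bin/llama-server")` would suggest the tree was moved after the build and needs a rebuild or a `patchelf --set-rpath`.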

---

Fine-Tuning 27B+ LLMs on AMD Strix Halo — A Home Enthusiast's Guide

A reproducible recipe for fine-tuning Qwen3.5-27B (or larger) hybrid LLMs on a single AMD Strix Halo APU (Ryzen AI MAX+ 395, Radeon 8060S, gfx1151) with 128 GB of unified memory — including the patches, system tuning, and out-of-process evaluation orchestrator that make multi-day training runs survivable on consumer hardware.

Status: Tested on a Corsair AI Workstation 300 (Sixunited AXB35-02 board) running Ubuntu 24.04 LTS, mainline kernel 6.19.14 (as tested; 6.19 now EOL — use 7.0.x, see Upgrade-path gotchas), ROCm 7.13 nightly. The same recipe should work on Framework Desktop, GMKtec EVO-X2, FEVM FA-EX9, Bosgame M5 — any AXB35-02 / Strix Halo system.

---

Quick start (for the impatient)


(see configs/grub-cmdline.example), then sudo update-grub && reboot

6. Set up venv + nightly PyTorch

python3 -m venv /path/to/venv
source /path/to/venv/bin/activate
pip install --pre \
    "torch==2.11.0+rocm7.13.0a20260506" \
    "torchvision==0.26.0+rocm7.13.0a20260506" \
    "torchaudio==2.11.0+rocm7.13.0a20260506" \
    "triton==3.6.0+rocm7.13.0a20260506" \
    --index-url https://rocm.nightlies.amd.com/v2-staging/gfx1151/ \
    --extra-index-url https://pypi.org/simple/
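The four pins above share one ROCm version and one nightly date, and mixing dates is an easy mistake when bumping a single package. A minimal sketch of a consistency check follows; `parse_pin` and `pins_share_nightly` are hypothetical helpers, and the regular expression only covers the `name==X.Y.Z+rocmA.B.CaYYYYMMDD` shape used here.

```python
import re

# name==<base version>+rocm<rocm version>a<YYYYMMDD>
_PIN_RE = re.compile(
    r"^(?P<name>[A-Za-z0-9_.-]+)==(?P<base>\d+(?:\.\d+)*)"
    r"\+rocm(?P<rocm>\d+(?:\.\d+)*)a(?P<date>\d{8})$"
)


def parse_pin(requirement: str) -> dict[str, str]:
    """Split a nightly pin into name, base version, ROCm version, and date."""
    match = _PIN_RE.match(requirement)
    if match is None:
        raise ValueError(f"not a ROCm nightly pin: {requirement!r}")
    return match.groupdict()


def pins_share_nightly(requirements: list[str]) -> bool:
    """True when every pin targets the same ROCm version and nightly date."""
    parsed = [parse_pin(r) for r in requirements]
    return len({(p["rocm"], p["date"]) for p in parsed}) == 1


PINS = [
    "torch==2.11.0+rocm7.13.0a20260506",
    "torchvision==0.26.0+rocm7.13.0a20260506",
    "torchaudio==2.11.0+rocm7.13.0a20260506",
    "triton==3.6.0+rocm7.13.0a20260506",
]

if __name__ == "__main__":
    print(pins_share_nightly(PINS))
```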

9. Set up Telegram alerts (optional — see Step 9 below)

Configure with ROCm 7.1.0 toolchain + gcc-13 for clang's libstdc++ lookup

PATH=/opt/rocm-7.1.0/bin:$PATH \
cmake -G Ninja \
    -DCOMPUTE_BACKEND=hip \
    -DBNB_ROCM_ARCH="gfx1151" \
    -DCMAKE_BUILD_TYPE=Release \
    -DROCM_VERSION=83 \
    -DCMAKE_HIP_FLAGS="--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/13" \
    -S . -B build

Step 6b — Inference settings for Qwen3.5 / Qwen3.6

If you're serving the fine-tune (or any Qwen3.5/3.6 base model) via llama-server for chat or tool-call use, a few runtime settings beyond the build flags matter on this hardware. These are what we run in production:

libhsa-runtime64.so has a null-ptr bug on gfx1151. Point at your venv's

harness use). Adjust python3.X to your venv.

Environment=LD_LIBRARY_PATH=/path/to/venv/lib/python3.12/site-packages/_rocm_sdk_core/lib:/opt/rocm/lib
ExecStart=/usr/local/bin/llama-server \
    -m /path/to/your-qwen35.gguf \
    -ngl 999 \
    -c 32768 \
    --fit off \
    --no-mmap \
    --reasoning-budget 0 \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    --host 0.0.0.0 \
    --port 8080

Per-flag rationale:

--no-mmap is the gfx1151 gotcha — mmap-only loading triggers a ~30 min GPU page-table setup wall on the unified-memory path. Either --no-mmap or --mmap --direct-io together work; mmap alone hangs. Documented across multiple Strix Halo issues; not specific to llama.cpp.
--fit off disables llama-server's auto-fit; we keep it off across the board (with explicit -ngl/-c, the sizing heuristic is unnecessary).
LD_LIBRARY_PATH overlay (the Environment= line above) — stock ROCm 7.1.0's libhsa-runtime64.so has a null-pointer bug on gfx1151 that surfaces as crashes/hangs at model load. Prepend the nightly runtime from PyTorch's _rocm_sdk_core wheel so it wins resolution. Same overlay the repo's benchmark (rocwmma-fattn-sweep/bench.sh) and eval harness (scripts/eval_via_llama_perplexity.py) rely on; the §6b numbers below were measured with it.
--reasoning-budget 0 disables the thinking block. Strongly recommended for tool-call workflows — Qwen3.5/3.6's native chat template emits tool calls inside the <thinking> block, and if the reasoning budget runs out mid-call the response stream looks empty to the client. Leave thinking on only for pure-chat-no-tools workloads where reasoning visibly helps.
Sampling: --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 is the unsloth-recommended set for Qwen3.5/3.6 with reasoning off. Their per-model sampling guidance is worth following — meaningfully better than llama.cpp's defaults for coherence on this family. See unsloth's Qwen3.6 docs for the per-mode (reasoning vs non-reasoning) recommendations.
KV cache quantization (--cache-type-k q4_0 --cache-type-v q4_0) is reported to give measurable memory-bandwidth gains at long context with minimal quality loss on Qwen3.5/3.6. We haven't benched it ourselves yet on this hardware (production is at the F16 cache default, 8k context where the bandwidth pressure is lower) — adding when we do. If you're running long-context (32k+) chat workloads, it's worth trying.
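The flag set above can be assembled programmatically so that the mmap rule is enforced in one place. This is a minimal sketch: `llama_server_args` is a hypothetical helper, and it only uses flags that appear in this guide.

```python
def llama_server_args(
    model_path: str,
    ctx: int = 32768,
    use_mmap: bool = False,
    direct_io: bool = False,
    reasoning_budget: int = 0,
    port: int = 8080,
) -> list[str]:
    """Build a llama-server argv that mirrors the settings listed above."""
    # mmap on its own triggers the slow page-table setup on gfx1151,
    # so only allow it together with direct I/O.
    if use_mmap and not direct_io:
        raise ValueError("use --no-mmap, or --mmap together with --direct-io")

    args = ["llama-server", "-m", model_path,
            "-ngl", "999", "-c", str(ctx), "--fit", "off"]
    args += ["--mmap", "--direct-io"] if use_mmap else ["--no-mmap"]
    args += [
        "--reasoning-budget", str(reasoning_budget),
        "--temp", "1.0", "--top-p", "0.95", "--top-k", "20", "--min-p", "0.00",
        "--host", "0.0.0.0", "--port", str(port),
    ]
    return args


if __name__ == "__main__":
    print(" ".join(llama_server_args("/path/to/your-qwen35.gguf")))
```

The returned list can be passed to `subprocess.Popen` as-is; remember that the `LD_LIBRARY_PATH` overlay still has to be set in the process environment.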

For tool-call agents specifically (Continue, Codex CLI, Roo, OpenClaw, aichat, etc.), also note:

Custom Jinja template required for Qwen3-Coder-Next. The native template emits XML <tool_call><function=...>...</function></tool_call> which trips clients expecting Hermes-style JSON {"name": ..., "arguments": ...}. Swap via --chat-template-file <your-hermes.jinja>. Templates for Qwen3-Coder-Next + Nemotron-3-Super in Hermes format are floating around HuggingFace and the ggml-org/llama.cpp issue tracker.
Disable thinking for tool workflows specifically. Even on models where you want thinking for chat, route tool-call/agent workflows to a separate llama-server instance (or a separate role binding in your client config) with --reasoning-budget 0.

Step 9 — Telegram alerts (optional but nice)

scripts/tg_alert.sh is a 50-line bash helper that sends HTML messages to a Telegram bot. Set up:

Talk to @BotFather on Telegram, create a bot, save the token.
Message @userinfobot /start and it returns your numeric chat ID immediately.
Store the credentials. Quote the values — Telegram tokens contain : and _ and other characters that can confuse a source if the value isn't quoted:

sudo mkdir -p /etc/strix-halo
sudo tee /etc/strix-halo/telegram.env > /dev/null <<EOF
TELEGRAM_BOT_TOKEN="<your-token>"
TELEGRAM_CHAT_ID="<your-chat-id>"
EOF
sudo chown "root:$(whoami)" /etc/strix-halo/telegram.env
sudo chmod 0640 /etc/strix-halo/telegram.env

Test:

./scripts/tg_alert.sh "<b>Test</b> — Strix Halo guide setup OK"
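For training scripts that would rather alert from Python than shell out, the same message can be sent through the Telegram Bot API's `sendMessage` method with `parse_mode` set to `HTML`. The sketch below is not a port of `scripts/tg_alert.sh`; `build_send_message` and `send_alert` are hypothetical names, and only the standard library is used.

```python
import json
import urllib.request

API_BASE = "https://api.telegram.org"


def build_send_message(token: str, chat_id: str, html_text: str) -> urllib.request.Request:
    """Build (but do not send) a Bot API sendMessage request."""
    payload = json.dumps(
        {"chat_id": chat_id, "text": html_text, "parse_mode": "HTML"}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/bot{token}/sendMessage",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(token: str, chat_id: str, html_text: str, timeout: float = 10.0) -> bool:
    """Send the message and return the API's `ok` field."""
    request = build_send_message(token, chat_id, html_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bool(json.load(response).get("ok", False))
```

Keeping request construction separate from sending makes the payload easy to inspect in a dry run before a multi-day job depends on it.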

rocm / rocm-sdk-* are pulled in transitively by torch; they're pinned above only

_rocm_sdk_core/lib (same overlay rocwmma-fattn-sweep/bench.sh + the eval

CRITICAL gotcha — the namespace package shadow

If a previous setup left /path/to/venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm82.so lying around (a symlink to a non-existent file from an older bnb install), Python treats that directory as a namespace package — and silently shadows your editable install. Symptom: import bitsandbytes; print(bitsandbytes.__file__) returns None, no .optim attribute. Cure:

rm -rf /path/to/venv/lib/python3.12/site-packages/bitsandbytes
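The shadow can be detected without importing the package: a namespace package resolves to a module spec with no `origin` but a populated search path. A minimal sketch, assuming Python 3.7 or newer; `is_namespace_shadow` is a hypothetical helper.

```python
import importlib
import importlib.util


def is_namespace_shadow(module_name: str) -> bool:
    """True when the name resolves to a namespace package (no __init__.py)."""
    importlib.invalidate_caches()
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return False  # not importable at all
    # Regular packages and modules have a file origin; namespace packages
    # have origin None and only a list of search locations.
    return spec.origin is None and spec.submodule_search_locations is not None


if __name__ == "__main__":
    if is_namespace_shadow("bitsandbytes"):
        print("bitsandbytes resolves to a leftover directory; remove it")
```

Run it inside the training venv before a job starts; a positive result means the leftover `site-packages/bitsandbytes` directory is still winning resolution.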

ROCm vs Vulkan — backend selection depends on precision

Inference on Strix Halo can run through either of two llama.cpp backends, and the right choice is not the same for every workload:

ROCm/HIP — the production backend this guide builds in Step 6. Used by all the numbers in the table above. Required for training (PyTorch + ROCm 7.13 nightly).
Vulkan (RADV STRIX_HALO) — Mesa's Vulkan driver, with cooperative-matrix path. Built with -DGGML_VULKAN=ON (no HIP). Recipe in vulkan-vs-rocm-sweep/build-vulkan.sh.

Tested on Qwen3.6-35B-A3B at the same source commit (b9296), same hardware, same bench shape:

Q4_K_M (quantized) — Vulkan wins decode by ~22%:

shape	ROCm/HIP (t/s)	Vulkan (t/s)	Winner
pp512 fa=1	1014.32	942.18	ROCm (+7.7%)
tg128 d=0	49.58	60.39	Vulkan (+21.8%)
tg128 d=8392	46.73	57.13	Vulkan (+22.3%)
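The Winner percentages are the faster backend's throughput relative to the slower one. A short sketch that recomputes them from the Q4_K_M rows above; `relative_gain_pct` is a hypothetical helper name.

```python
def relative_gain_pct(faster: float, slower: float) -> float:
    """Percentage by which `faster` exceeds `slower`."""
    return (faster / slower - 1.0) * 100.0


# (shape, ROCm/HIP, Vulkan) copied from the Q4_K_M table above
Q4_K_M_ROWS = [
    ("pp512 fa=1", 1014.32, 942.18),
    ("tg128 d=0", 49.58, 60.39),
    ("tg128 d=8392", 46.73, 57.13),
]

if __name__ == "__main__":
    for shape, rocm, vulkan in Q4_K_M_ROWS:
        winner = "ROCm" if rocm > vulkan else "Vulkan"
        gain = relative_gain_pct(max(rocm, vulkan), min(rocm, vulkan))
        print(f"{shape}: {winner} (+{gain:.1f}%)")
```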

BF16 (full precision) — ROCm wins decode by ~117%:

shape	ROCm/HIP (t/s)	Vulkan (t/s)	Winner
pp512 fa=1	484.01	305.21	ROCm (+58.6%)
tg128 d=0	23.71	10.73	ROCm (+121%) ← over 2×
tg128 d=8392	23.09	10.64	ROCm (+117%)

The reason is visible right in Vulkan's own capability report on launch:

ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | ...
                                                                    ^^^^^^^
                                                                no native BF16

bf16: 0 — RADV STRIX_HALO supports FP16 cooperative matrix natively but not BF16; the Vulkan backend falls back to slower kernels for BF16 ops. ROCm/HIP has BF16 wired through native HIP matmul kernels and dominates anything BF16-bound.
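The capability line is easy to check in a launch script before choosing a backend. The sketch below parses the pipe-separated `key: value` fields; `parse_vulkan_caps` is a hypothetical helper, and it assumes the line keeps the layout shown above.

```python
def parse_vulkan_caps(line: str) -> dict[str, int]:
    """Extract integer capability flags from a ggml_vulkan device line."""
    caps: dict[str, int] = {}
    # The first field is the device name; flags follow after each "|".
    for field in line.split("|")[1:]:
        key, sep, value = field.partition(":")
        if sep and value.strip().isdigit():
            caps[key.strip()] = int(value)
    return caps


LINE = ("ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) "
        "| uma: 1 | fp16: 1 | bf16: 0 | ...")

if __name__ == "__main__":
    caps = parse_vulkan_caps(LINE)
    if caps.get("bf16", 0) == 0:
        print("No native BF16 on Vulkan; prefer ROCm/HIP for BF16 models")
```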

Practical recommendation:

Workload	Backend
Quantized inference (Q4/Q5/Q6/Q8)	Vulkan
Full-precision (BF16) inference	ROCm/HIP
Training (always BF16/FP32)	ROCm/HIP (only path with the PyTorch nightly stack)
Mixed	Whichever your hot path is

Full sweep + per-shape numbers + capability extract + the build recipe for the Vulkan binary in vulkan-vs-rocm-sweep/. Long-form writeup with the methodology, all depths, and the bf16: 0 deep-dive: ROCm vs Vulkan on AMD Strix Halo: when each wins, and why it inverts at the precision boundary. The Vulkan canonical dashboard for Strix Halo (with deeper per-model Vulkan numbers) is bench.ciru.ai; this guide is the canonical ROCm + training reference.

---

Troubleshooting

The failure modes that cost us the most time, indexed. Each links to the step with the full fix.

Symptom	Cause	Fix	Where
Kernel `.deb` install half-configures / `run-parts` errors	Mainline kernel `.deb`s have a double-directory `run-parts` bug across image/modules/headers maintainer scripts	Run `scripts/fix-kernel-run-parts.py` on the `.deb`s before installing — rewrites the trigger scripts to `if [ -d X ]; then … fi` form	[Step 1](#step-1--kernel-61914-mainline)
`'cstdlib' file not found` / `'cmath'` during a HIP build	ROCm 7.1's clang-20 picks the gcc-14 runtime dir, which lacks the C++ headers, on Ubuntu 24.04	Pass `--gcc-install-dir=/usr/lib/gcc/x86_64-linux-gnu/13` — via `CMAKE_HIP_FLAGS` (cmake) or `HIPCC_COMPILE_FLAGS_APPEND` (pip)	[Step 5](#step-5--bitsandbytes-from-source-for-rocm), [Step 6](#step-6--llamacpp-hip-build-for-inference)
`import bitsandbytes` loads the PyPI build, not your source build	Namespace-package shadow — the editable install doesn't win on `sys.path`	See the namespace-shadow fix; verify `bitsandbytes.__file__` resolves into your source tree	[Step 5](#step-5--bitsandbytes-from-source-for-rocm)
System hard-freezes mid-training, needs a power-off	VRAM/unified-pool exhaustion hangs the HIP driver instead of raising `OutOfMemoryError`	`torch.cuda.set_per_process_memory_fraction(0.80)` — on a 128 GB unified APU, `0.80` (≈102 GB) leaves the host enough; `0.90` starves it	[Training contract](#training-script--the-contract)
`llama.cpp` model load hangs ~30 min at GPU page-table setup	mmap-only load on gfx1151 triggers a slow page-table walk	Use `--no-mmap`, or mmap and `direct_io` together — never mmap alone	[Step 6](#step-6--llamacpp-hip-build-for-inference)
Random crashes mid-training, no obvious cause	`/srv` permissions silently regress off `755`	Install the `/srv` perm watchdog cron (defense in depth — the root cause is still unpinned)	[Step 2](#step-2--system-tuning)

FLA kernels error or return wrong results after an FLA version change	Stale Triton autotune cache; patched FLA produces different kernel shapes	`rm -rf ~/.triton/cache` after any FLA change
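The hard-freeze row comes down to simple arithmetic: 0.80 × 128 GB ≈ 102 GB for the process, leaving roughly 26 GB for the host. The sketch below shows that calculation and a guarded call to `torch.cuda.set_per_process_memory_fraction`. `memory_budget_gb` and `apply_memory_cap` are hypothetical helpers, and the fraction applies to whatever total PyTorch reports for the device, which may differ from the full 128 GB.

```python
def memory_budget_gb(total_gb: float, fraction: float) -> float:
    """Memory the process may use under a given per-process fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return total_gb * fraction


def apply_memory_cap(fraction: float = 0.80) -> bool:
    """Apply the cap if PyTorch and a GPU are present; return whether it was set."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.set_per_process_memory_fraction(fraction)
    return True


if __name__ == "__main__":
    for fraction in (0.80, 0.90):
        used = memory_budget_gb(128, fraction)
        print(f"{fraction:.2f}: {used:.1f} GB for training, {128 - used:.1f} GB left for the host")
```

Call `apply_memory_cap()` at the top of the training script, before any tensors are allocated on the device.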

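🇨🇳 Chinese documentation mirror · AI translation · 2026-06-10

The sections below are AI-generated summaries of the English README for quick reading. The full original is under "📑 README Deep Dive" above.

📋 Dependencies

The project requires the following packages on Ubuntu Server: build-essential, cmake, ninja-build, git, curl, jq, python3-venv, python3-dev, and linux-headers-generic. It also requires hiprand-dev, rocrand-dev, hipcub-dev, rocprim-dev, and rocthrust-dev.

🛠 Installation (Docker/pip/source)

The project requires Linux kernel 6.19.14 and ROCm system 7.1.0 on Ubuntu Server. The Radeon repo provides rocm-cmake, hipcc, hipBLAS, and related packages for builds.

🚀 Usage

The project provides a reproducible recipe for fine-tuning Qwen3.5-27B (or larger) hybrid LLMs on an AMD Strix Halo APU, including the patches, system tuning, and out-of-process evaluation orchestrator. The recipe targets multi-day training runs on consumer hardware.

⚙️ Configuration (including MCP / env)

Install torch, torchvision, torchaudio, and triton in a virtual environment. Builds are configured with the ROCm 7.1.0 toolchain plus gcc-13 so that clang can find libstdc++.

🔌 API Notes

The rocm and rocm-sdk-* packages are pulled in transitively by torch. The _rocm_sdk_core/lib overlay is the same one used by rocwmma-fattn-sweep/bench.sh and the eval harness.

🔄 Workflow / Modules

Watch for the namespace-package shadow: make sure an earlier setup has not left an old bitsandbytes directory behind. Also watch for kernel `.deb` install failures and run-parts errors.

❓ FAQ Summary

The project provides an index of failure modes with fixes, covering kernel `.deb` install failures, run-parts errors, the namespace-package shadow, and similar problems.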

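🎯 aiskill88 AI Review · Grade A · 2026-06-09

The project provides an easy-to-use workflow that helps users fine-tune large language models and is worth watching.

📚 Practical Guide (long-tail questions)

Who it is for

Developers and operations staff who need strix-halo-llm-finetune-guide to solve a specific problem

Best practices

Get a minimal case running in a test environment first, then connect production data

Common mistakes

Committing an API key directly to the git repository (use .env and add it to .gitignore)
Python dependency conflicts: isolate the environment with venv / uv

Deployment options

Cloud hosting: can run on PaaS platforms such as Vercel / Railway / Fly.io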

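👥 Intended Audience

Automation engineers and operations staff
Project managers and business analysts
Professionals who want to cut down on repetitive work
Digital transformation teams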

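⚖️ Pros and Cons

✅ Pros

+MIT license, free for commercial use
+Greatly reduces repetitive manual work
+Visual flows that are clear and intuitive
+Highly extensible, supports complex scenarios

⚠️ Cons

−Initial configuration and debugging take some time
−Depends heavily on the stability of external services
−Complex scenarios require some technical background

⚠️ Usage Notice

AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and is not a legal endorsement of the tool's functionality or quality.

Validate thoroughly in a sandbox or test environment before deploying to production, and carry out the necessary security assessment.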

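💡 AI Skill Hub Review

Overall, the AMD Strix Halo LLM Fine-Tuning Guide is a good-quality Agent workflow that is reasonably competitive in its category. AI Skill Hub will keep tracking its updates; bookmark it and adopt it when the timing suits your own use case.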

🌐 Original Information

Original name	`strix-halo-llm-finetune-guide`
Original description	Open-source AI workflow: Home-enthusiast's guide to fine-tuning 27B+ LLMs on AMD Strix Halo (gfx1151, Ryz. ⭐ 23 · Python
Topics	`workflowpython`
GitHub	https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide
License	MIT
Language	Python

🔗 Original Source

🐙 GitHub repository https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide

Indexed: 2026-06-09 · Updated: 2026-06-09 · License: MIT · AI Skill Hub makes no legal endorsement of the accuracy of third-party content.
