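After a curated evaluation by AI Skill Hub, Apple Silicon AI Workflow (苹果芯片AI工作流) is rated "Recommended". This agent workflow scores well on feature completeness, community activity, and ease of use, with an AI score of 7.5, and suits users with some technical background.
Ultra-fast local Universal State Runtime for LLMs on Apple Silicon.
Apple Silicon AI Workflow is a complete AI agent automation workflow solution. Through visual node orchestration, it breaks complex multi-step tasks into clear automated flows that run fully unattended. It supports integration with hundreds of external services and APIs, and is suited to building data-processing pipelines, business automation, and AI-assisted decision systems.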
    # Clone the repository
    git clone https://github.com/wesleysimplicio/ds4-simplicio-apple-v6
    cd ds4-simplicio-apple-v6
    # Read the installation instructions
    cat README.md
    # Once the dependencies listed in the README are installed, the tool is ready to use

    # Show help
    ds4-simplicio-apple-v6 --help
    # Basic run
    ds4-simplicio-apple-v6 [options] <input>
    # For detailed usage, see the documentation
    # https://github.com/wesleysimplicio/ds4-simplicio-apple-v6

    # ds4-simplicio-apple-v6 configuration
    # Show configuration options
    ds4-simplicio-apple-v6 --config-example > config.yml
    # Common configuration keys
    # output_dir: ./output
    # log_level: info
    # workers: 4
    # Environment variable (overrides the config file)
    export DS4_SIMPLICIO_APPLE_V6_CONFIG="/path/to/config.yml"
Universal State Runtime for local LLM inference on Apple Silicon. This README is in English; a Brazilian Portuguese (pt-BR) version is in README.pt-BR.md.

Minimum tools:
Recommended on macOS:
xcode-select --install
brew install cmake ninja node
npm ci
npx playwright install
Recommended on Windows:
npm ci
npx playwright install
On Windows, run native CMake commands from a Visual Studio Developer shell when available.
macOS/Linux:
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
Windows PowerShell:
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
If Ninja is not available, CMake may use your platform default generator. Keep the same build directory.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2.5-coder-7b",
    "messages": [{"role": "user", "content": "fizzbuzz in python"}]
  }'
US4 V6 ships an OpenAI-compatible HTTP endpoint that works as a drop-in replacement for Ollama for any client expecting /v1/chat/completions, /v1/completions, /v1/models, or /v1/embeddings. Chat is served by mlx_lm.server, run as a managed child process; embeddings are served in-process by mlx-embeddings. The whole server is a single-file Python sidecar at scripts/openai_serve.py, with no FastAPI or uvicorn dependency.
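For clients written in Python, the same chat request can be built with the standard library alone. This is a minimal sketch, assuming the default bind address from this README and the usual OpenAI response shape (`choices[0].message.content`); the helper names `build_chat_request` and `chat` are illustrative and not part of the project.

```python
import json
import urllib.request

# Assumption: the sidecar is running on its default host and port.
BASE_URL = "http://127.0.0.1:8080/v1"


def build_chat_request(prompt, model="qwen2.5-coder-7b", base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the text of the first choice.

    Assumes the standard OpenAI chat-completions response layout.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("fizzbuzz in python")` would issue the same request as the curl command. Keeping request construction separate from sending makes the payload easy to inspect without a live server.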
Two ways to run it: the Python sidecar directly (no C++ build required — fastest path, recommended for local LLM use) or the C++ CLI wrapper (us4-cli serve, identical contract once the native build exists).
One-time setup. Use a project venv to avoid the externally-managed-environment error on Homebrew Python:
python3 -m venv .venv
.venv/bin/pip install -r scripts/requirements-serve.txt
Run with defaults (chat + embeddings, bind 127.0.0.1:8080, child mlx-lm on 8081). Always invoke the venv interpreter explicitly — python3 from the system PATH will not see MLX:
.venv/bin/python scripts/openai_serve.py
When running the Python sidecar directly, configure it with environment variables:
| Env var | Default | Meaning |
|---|---|---|
| `US4_SERVE_HOST` | `127.0.0.1` | bind address |
| `US4_SERVE_PORT` | `8080` | public port (mlx-lm child uses PORT + 1) |
| `US4_SERVE_CHAT_BACKEND` | `mlx` | chat upstream selector: `mlx`, `ollama`, or `custom` |
| `US4_SERVE_CHAT_UPSTREAM` | unset | override upstream base URL (e.g. `http://127.0.0.1:11434`) |
| `US4_SERVE_CHAT_MODEL` | `mlx-community/Qwen2.5-Coder-7B-Instruct-4bit` (or `qwen2.5-coder:14b` when backend=ollama) | chat model id |
| `US4_SERVE_EMBED_MODEL` | `mlx-community/embeddinggemma-300m-bf16` | embedding model id |
| `US4_SERVE_DISABLE_CHAT` | unset | set `1` to disable chat backend |
| `US4_SERVE_DISABLE_EMBED` | unset | set `1` to disable embeddings backend |
| `US4_SERVE_LOG_LEVEL` | `INFO` | `DEBUG` / `INFO` / `WARNING` / `ERROR` |
| `US4_SERVE_PROMPT_CACHE_BYTES` | unset | cap KV/prompt cache bytes on the mlx_lm.server child (e.g. `268435456` = 256 MiB). Only honoured when CHAT_BACKEND=mlx. |
| `US4_SERVE_MLX_EXTRA_ARGS` | unset | extra argv appended to mlx_lm.server (shell-style split). Escape hatch for any flag not exposed individually, e.g. `"--max-tokens 256 --prefill-step-size 512"`. Only honoured when CHAT_BACKEND=mlx. |
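To see how these variables combine, the sketch below resolves them into one settings dictionary using the defaults from the table. It is an illustration of the documented defaults, not the sidecar's actual code; `resolve_serve_config` is a hypothetical helper, and falling back to the MLX model id for the `custom` backend is an assumption.

```python
import os

# Defaults copied from the table above.
DEFAULT_CHAT_MODELS = {
    "mlx": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
    "ollama": "qwen2.5-coder:14b",
}
DEFAULT_EMBED_MODEL = "mlx-community/embeddinggemma-300m-bf16"


def resolve_serve_config(env=None):
    """Resolve US4_SERVE_* variables into a plain dict (hypothetical helper)."""
    env = os.environ if env is None else env
    backend = env.get("US4_SERVE_CHAT_BACKEND", "mlx")
    port = int(env.get("US4_SERVE_PORT", "8080"))
    # Assumption: backends without a documented default reuse the MLX model id.
    default_model = DEFAULT_CHAT_MODELS.get(backend, DEFAULT_CHAT_MODELS["mlx"])
    return {
        "host": env.get("US4_SERVE_HOST", "127.0.0.1"),
        "port": port,
        "child_port": port + 1,  # the mlx-lm child uses PORT + 1
        "chat_backend": backend,
        "chat_upstream": env.get("US4_SERVE_CHAT_UPSTREAM"),
        "chat_model": env.get("US4_SERVE_CHAT_MODEL", default_model),
        "embed_model": env.get("US4_SERVE_EMBED_MODEL", DEFAULT_EMBED_MODEL),
        "chat_enabled": env.get("US4_SERVE_DISABLE_CHAT") != "1",
        "embed_enabled": env.get("US4_SERVE_DISABLE_EMBED") != "1",
        "log_level": env.get("US4_SERVE_LOG_LEVEL", "INFO"),
    }
```

Calling `resolve_serve_config({})` yields the all-defaults configuration, which is a quick way to check which ports a given environment would occupy before starting the server.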
Example: pick a smaller 3B chat model on a memory-constrained M1 8 GB:
US4_SERVE_CHAT_MODEL=mlx-community/Qwen2.5-Coder-3B-Instruct-4bit \
US4_SERVE_PORT=8080 \
.venv/bin/python scripts/openai_serve.py
First start downloads model weights from HuggingFace into ~/.cache/huggingface/ (7B 4-bit MLX ≈ 4 GB, 3B 4-bit ≈ 1.7 GB). Subsequent starts reuse the cache.
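The download sizes can be sanity-checked with simple arithmetic: raw weight storage is roughly parameter count times bits per weight, divided by eight. The sketch below uses nominal parameter counts (7e9 and 3e9), which is an assumption; real checkpoints also carry quantization scales and other tensors, so the figures here are lower bounds rather than exact file sizes.

```python
def estimate_weight_bytes(num_params, bits_per_weight):
    """Lower-bound estimate of raw weight storage in bytes."""
    return num_params * bits_per_weight / 8


# Nominal parameter counts; actual models differ somewhat.
for label, params in [("7B", 7e9), ("3B", 3e9)]:
    gb = estimate_weight_bytes(params, 4) / 1e9
    print(f"{label} at 4-bit: at least {gb:.1f} GB of raw weights")
```

This gives 3.5 GB and 1.5 GB, which sits slightly below the approximate download sizes quoted above, as expected for a lower bound.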
Example: front an already-running Ollama daemon (chat goes through Ollama, embeddings stay MLX-local). Useful when you want to reuse models already pulled into Ollama without re-downloading the MLX variant from HuggingFace:
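One way to script this setup from Python is to build the environment and start the sidecar with `subprocess`. This is a sketch using only the variables documented in the table above; the helper names are hypothetical, and setting `US4_SERVE_CHAT_UPSTREAM` explicitly is an assumption made for clarity (the source does not say whether it is required when the backend is `ollama`).

```python
import os
import subprocess


def ollama_fronting_env(base_env=None, upstream="http://127.0.0.1:11434", model=None):
    """Return a copy of the environment configured to route chat through Ollama."""
    env = dict(os.environ if base_env is None else base_env)
    env["US4_SERVE_CHAT_BACKEND"] = "ollama"
    # Assumption: pointing at the local Ollama address explicitly is harmless.
    env["US4_SERVE_CHAT_UPSTREAM"] = upstream
    if model is not None:
        env["US4_SERVE_CHAT_MODEL"] = model  # e.g. a model already pulled into Ollama
    return env


def launch_sidecar(env):
    """Start the sidecar with the venv interpreter, as the README recommends."""
    return subprocess.Popen(
        [".venv/bin/python", "scripts/openai_serve.py"],
        env=env,
    )
```

Run from the repository root, `launch_sidecar(ollama_fronting_env(model="qwen2.5-coder:7b"))` would start the server with chat routed to Ollama while embeddings remain on the local MLX path.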
`simplicio-cli` consumes the same env vars:
```bash
export SIMPLICIO_BASE_URL=http://127.0.0.1:8080/v1
export OPENAI_API_KEY=anything
simplicio task "explain this diff" --stack generic --target README.md
```
The us4-v6 MLX path is unified-memory aware: 4-bit MLX weights occupy real RAM only once and are shared between CPU and GPU without copies. Compared to GGUF/Ollama on the same Apple Silicon machine, RAM headroom is meaningful on small machines:
| Machine | Comfortable chat model (Q4 MLX) | Tight ceiling |
|---|---|---|
| M1 / M2 8 GB | 0.5B–3B (Qwen2.5-Coder-3B-Instruct-4bit) or 7B 3-bit MLX with KV cache cap (see recipe above) | 7B Q4 (4.5 GB active, watch swap) |
| M1 / M2 / M3 16 GB | 7B Q4 (Qwen2.5-Coder-7B-Instruct-4bit) | 13B Q4 |
| M-series Pro/Max 32 GB | 13B–14B Q4 | 32B Q4 |
| M3/M4 Max 64 GB+ | 32B–70B Q4 | 70B Q5/Q6 |
Override US4_SERVE_CHAT_MODEL to match the box you are on. Pulling a 7B model on an 8 GB machine is feasible but expect macOS to swap under load — prefer a 3B for sustained chat.
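The hardware table can be turned into a small lookup for launch scripts. This is a sketch only: `suggest_chat_model` is a hypothetical helper, the thresholds are read from the "comfortable" column, and machines between tiers (for example 24 GB) are conservatively mapped to the tier below. The table names concrete model ids only for the 8 GB and 16 GB rows, so larger tiers return a size class rather than an id.

```python
# (minimum RAM in GB, suggestion), highest tier first.
MODEL_TIERS = [
    (64, "32B–70B Q4"),
    (32, "13B–14B Q4"),
    (16, "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit"),
    (8, "mlx-community/Qwen2.5-Coder-3B-Instruct-4bit"),
]


def suggest_chat_model(ram_gb):
    """Return the comfortable chat model (or size class) for a given RAM size."""
    for min_ram, suggestion in MODEL_TIERS:
        if ram_gb >= min_ram:
            return suggestion
    raise ValueError(f"no suggestion in the table for {ram_gb} GB of RAM")
```

For the 8 GB and 16 GB tiers the returned string can be exported directly as `US4_SERVE_CHAT_MODEL`.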
| Symptom | Cause | Fix |
|---|---|---|
| `mlx-embeddings is not installed` | Running `python3` from system PATH, not the venv | Use `.venv/bin/python scripts/openai_serve.py` |
| `error: externally-managed-environment` on `pip install` | Homebrew Python blocks system installs | Use the venv: `python3 -m venv .venv && .venv/bin/pip install -r scripts/requirements-serve.txt` |
| `OSError: [Errno 48] Address already in use` | Previous serve still bound to 8080 / 8081 | `lsof -nP -iTCP:8080,8081 -sTCP:LISTEN`, then `kill <pid>` |
| `command not found: ./build/us4-cli` | Native binary not built | Either `cmake --build build` first, or use the Python sidecar (section 6.1) |
| `--model ollama/...` flag ignored | Sidecar reads env vars only, no argparse | Set `US4_SERVE_CHAT_MODEL=...` instead |
| Beachball / swap thrashing on 7B chat | Machine RAM too small for chosen model | Drop to a 3B model (see hardware table in 6.5) |
| `chat upstream unreachable: Connection refused` with backend=ollama | Ollama daemon not running | `ollama serve &` (or launch the Ollama.app), then verify with `curl http://127.0.0.1:11434/api/tags` |
| Ollama returns `model "<id>" not found` | Model not pulled into Ollama | `ollama pull qwen2.5-coder:7b` (or whatever `US4_SERVE_CHAT_MODEL` points at) |
| 7B chat OOM-kills the mlx_lm.server child on an 8 GB box | KV cache grew past safe envelope | Switch to the 3-bit quant + `US4_SERVE_PROMPT_CACHE_BYTES=268435456` recipe in 6.1, and add `US4_SERVE_DISABLE_EMBED=1` |
| `invalid US4_SERVE_MLX_EXTRA_ARGS` in serve log | Shell-quoting broke during env-var expansion | Wrap the whole value in double quotes, e.g. `US4_SERVE_MLX_EXTRA_ARGS="--max-tokens 256"` |
| `ignoring blocked tokens in US4_SERVE_MLX_EXTRA_ARGS` in serve log | You tried to override `--host`, `--port`, or `--cors*` via the escape hatch. Network binding is fixed to 127.0.0.1 by us4-v6 to prevent accidental exposure on shared/cloud hosts. | Drop those tokens. If you really need a different bind address, change `US4_SERVE_HOST` (still localhost-validated) instead of smuggling through extra args. |
| `ignoring US4_SERVE_PROMPT_CACHE_BYTES=...: expected a positive integer` | Value is not numeric (e.g. `512m`, `256MB`) | Use a raw byte count: `268435456` for 256 MiB, `536870912` for 512 MiB |
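The last three rows can be caught before launch with a small pre-flight check. The sketch below mirrors the behaviour described in the table (positive integer byte counts, shell-style splitting, blocked `--host`/`--port`/`--cors*` flags); it is not the sidecar's own validation code, and all function names are hypothetical.

```python
import shlex

# Flags the table says are rejected when passed through the escape hatch.
BLOCKED_PREFIXES = ("--host", "--port", "--cors")


def mib(n):
    """Convert mebibytes to a raw byte count."""
    return n * 1024 * 1024


def parse_prompt_cache_bytes(raw):
    """Return a positive int byte count, or None if the value is unusable."""
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        return None  # rejects values such as "512m" or "256MB"
    value = int(text)
    return value if value > 0 else None


def find_blocked_flags(raw):
    """Return tokens in an extra-args string that start with a blocked prefix.

    Raises ValueError if the string cannot be split shell-style
    (for example, an unbalanced quote).
    """
    tokens = shlex.split(raw)
    return [tok for tok in tokens if tok.startswith(BLOCKED_PREFIXES)]
```

For example, `str(mib(256))` produces the raw byte count the table recommends for `US4_SERVE_PROMPT_CACHE_BYTES`. The prefix match is deliberately conservative, so it may also flag unrelated flags that merely begin with one of the blocked names.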
Full contract (endpoints, request/response shapes, env knobs, exit codes, security posture) lives in .specs/runtime/SERVE-OPENAI.md.
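High-performance local AI workflow for Apple Silicon.
This tool does not declare an open-source license. Before commercial use, contact the original author to confirm the scope of authorization and avoid infringement risk.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute any legal endorsement of the tool's functionality or quality.
Validate the tool thoroughly in a sandbox or test environment before deploying it to production, and carry out the necessary security assessment.
AI Skill Hub review: the core functionality of Apple Silicon AI Workflow is complete and of good quality. For automation engineers and operations staff, it is worth adding to a personal toolkit. Try it in a non-production environment first, then roll it out gradually.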
| Field | Value |
|---|---|
| Original name | ds4-simplicio-apple-v6 |
| Original description | Open-source AI workflow: Ultra-fast 100% on-device Universal State Runtime for LLMs on Apple Silicon (M1–. ⭐14 · C++ |
| Topics | AI, Apple Silicon, C++ |
| GitHub | https://github.com/wesleysimplicio/ds4-simplicio-apple-v6 |
| Language | C++ |
Listed: 2026-05-27 · Updated: 2026-05-30 · License: not published · AI Skill Hub makes no legal endorsement of the accuracy of third-party content.