AI Skill Hub 强烈推荐:苹果芯片AI服务 是一款优质的Agent工作流。AI 综合评分 8.0 分,在同类工具中表现稳健。如果你正在寻找可靠的Agent工作流解决方案,这是一个值得深入了解的选择。
苹果芯片AI服务 是一套完整的 AI Agent 自动化工作流方案。通过可视化的节点编排,将复杂的多步骤任务拆解为清晰的自动化流程,实现全程无人值守的智能处理。支持与数百种外部服务和 API 无缝集成,适合构建数据处理管线、业务自动化和 AI 辅助决策系统。
苹果芯片AI服务 是一套完整的 AI Agent 自动化工作流方案。通过可视化的节点编排,将复杂的多步骤任务拆解为清晰的自动化流程,实现全程无人值守的智能处理。支持与数百种外部服务和 API 无缝集成,适合构建数据处理管线、业务自动化和 AI 辅助决策系统。
# 克隆仓库 git clone https://github.com/ddalcu/mlx-serve cd mlx-serve # 查看安装说明 cat README.md # 按 README 完成环境依赖安装后即可使用
# 查看帮助 mlx-serve --help # 基本运行 mlx-serve [options] <input> # 详细使用说明请查阅文档 # https://github.com/ddalcu/mlx-serve
# mlx-serve 配置说明 # 查看配置选项 mlx-serve --config-example > config.yml # 常见配置项 # output_dir: ./output # log_level: info # workers: 4 # 环境变量(覆盖配置文件) export MLX_SERVE_CONFIG="/path/to/config.yml"
**OpenAI- and Anthropic-compatible local inference for Apple Silicon — MLX and GGUF — faster than LM Studio on the same file. No Python. No cloud. No Electron.**
ddalcu.github.io/mlx-serve · Download MLX Core.app · Changelog
★ If mlx-serve saves you from spinning up another Electron app, star the repo — it genuinely helps people find this.
mlx-serve is a native Zig server that runs any LLM on Apple Silicon — MLX-format models and every GGUF on HuggingFace (Qwen, Llama, Mistral, Gemma, DeepSeek V4 Flash, thousands more). It exposes OpenAI-compatible and Anthropic-compatible HTTP APIs out of the box, so the same http://localhost:11234 works with Claude Code, the OpenAI SDK, Continue, Cursor, Open WebUI, and anything else that speaks one of those wires. Ships with MLX Core, a macOS menu-bar app with chat, agent mode, MCP tool calling, and model management.

<img src="docs/appiconb.png" width="48" align="center"> Download MLX Core.app — latest release for macOS (Apple Silicon)
/v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models, streaming SSE, tools, JSON-schema constrained decoding, logprobs./v1/responses with previous_response_id chains, per-event sequence_number, the /v1/responses/compact opaque history blob, and a WebSocket transport on the same endpoint./v1/messages works with Claude Code (ANTHROPIC_BASE_URL=http://localhost:11234) and the Anthropic SDK.--max-concurrent N batches decode requests through one forward pass for ~1.6× throughput at 4-way parallel.image_url content blocks.reasoning_content.pip, no venv. The MLX Core app ships everything signed and notarized.brew install mlx-c webp
brew tap ddalcu/mlx-serve https://github.com/ddalcu/mlx-serve
brew install --cask mlx-core # GUI menu bar app
brew install mlx-serve # CLI server only
./scripts/fetch-llama.sh (only once)
zig build -Doptimize=ReleaseFast
./zig-out/bin/mlx-serve --model ~/.mlx-serve/models/gemma-4-e4b-it-4bit --serve --port 8080
./scripts/fetch-llama.sh (only once)
cd app && SKIP_NOTARIZE=1 bash build.sh
open "MLX Core.app"
Requires APPLE_DEVELOPER_ID and APPLE_TEAM_ID environment variables for code signing.
brew install mlx-serve) and the GUI (brew install --cask mlx-core).If we missed you, please open a PR — happy to add anyone who landed code, fixtures, or a fix here.
The tray has ImageGen and VideoGen buttons that run FLUX.2 and LTX-Video 2.3 through a Python subprocess. Both run natively on MLX — no MPS/diffusers path. This is completely optional — the Zig server itself remains Python-free.
Prerequisite: Python 3 and ffmpeg must be installed on your Mac.
brew install python ffmpeg
Then launch MLX Core, click the ImageGen (or VideoGen) tray icon, and hit Install in the window. The app will:
~/.mlx-serve/venv (does not touch your system Python)Models:
| Feature | Default | Other options | Approx. RAM |
|---|---|---|---|
| Image | FLUX.2-klein 4B 4-bit (mflux, ~5 GB pre-quantized) | FLUX.1-schnell / dev 4-bit and 8-bit | 8 / 12 / 16 GB |
| Video | LTX-Video 2.3 Q4 | — | 24 GB RAM, ~50 GB first-run download (LTX 41 GB + Gemma 8 GB) |
The 41 GB LTX snapshot ships both transformer variants (1-stage distilled + 2-stage dev, ~11 GB each) plus a 7.6 GB distillation LoRA, so you can switch between Fast/Good/Quality/Super offline without re-downloading.
The image path uses mflux for native MLX inference with built-in 4/8-bit quantization. The video path uses ltx-2-mlx with audio generation (muxed via system ffmpeg).
Outputs go to ~/.mlx-serve/generations/images/YYYY-MM-DD/ and .../videos/YYYY-MM-DD/.
The app won't let you start a generation if there isn't enough free RAM. If the mlx-serve server is running and competing for memory, you'll be prompted to stop it first.
| Flag | Default | Description |
|---|---|---|
--model PATH | required | Path to the model directory or a .gguf file |
--serve | off | Start the HTTP server |
--host ADDR | 127.0.0.1 | Host address to bind |
--port N | 11234 | Port for the HTTP server |
--prompt TEXT | "Hello" | Prompt for interactive mode |
--max-tokens N | 100 | Maximum tokens to generate |
--temp F | 0.0 | Sampling temperature (0 = greedy) |
--ctx-size N | auto | Context window size (auto = computed from GPU memory) |
--timeout N | 300 | Request timeout in seconds |
--reasoning-budget N | -1 | Thinking token budget (-1 = unlimited, 0 = no thinking) |
--no-vision | off | Disable vision encoder even if model supports it |
--pld / --no-pld | on | Prompt Lookup Decoding (model-agnostic spec-decode) |
--pld-draft-len N | 5 | Max draft tokens per PLD step |
--pld-key-len N | 3 | N-gram match key length for PLD |
--drafter DIR | none | Gemma 4 assistant drafter checkpoint (e.g. gemma-4-E4B-it-assistant-bf16) |
--draft-block-size N | 4 | Drafts per round for the Gemma 4 drafter |
--kv-quant {off,4,8,turbo2,turbo4} | off | KV-cache quantization scheme (MLX path) |
--llama-kv-quant {off,q8,q4} | off | KV-cache quantization for GGUF (llama.cpp path) |
--llama-cache-entries N | 1 | Multi-session LRU for llama.cpp (warm multi-doc agents) |
--tokenize-cache-entries N | 4 | Chat-template + tokenize cache size |
--max-concurrent N | 1 | Continuous-batch decode parallelism |
--prefix-cache-entries N | auto | Shared-prefix KV cache entry cap |
--prefix-cache-mem N{KB,MB,GB} | 2 GB | Shared-prefix KV cache memory cap |
--model-dir PATH | none | Discover and serve every model in a folder (LRU resident set) |
--log-level | info | Log level (error, warn, info, debug) |
curl http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-serve",
"input": "Write a haiku about programming.",
"stream": true
}'
Stateful chains via previous_response_id, full streaming SSE with per-event sequence_number, schema-conformant envelope with tools / tool_choice / text / reasoning / usage echo. POST /v1/responses/compact returns an opaque base64 history blob that round-trips back as a compaction input item without any LLM call. Same endpoint also accepts an Upgrade: websocket handshake — each text frame is a response.create JSON message, and each SSE event becomes one outbound text frame.
GET /health — health checkGET /v1/models — list loaded models with capabilities + engine infoPOST /v1/completions — text completionsPOST /v1/embeddings — text embeddings (BERT and encoder-only models)GET /v1/responses/{id}, DELETE /v1/responses/{id} — fetch / delete stored responsesAll work — anything that talks the OpenAI chat-completions or Anthropic Messages wire protocol does. mlx-serve also implements the newer OpenAI Responses API (/v1/responses) for clients that want stateful chains via previous_response_id, plus a WebSocket transport on the same endpoint.
| Architecture | model_type | Examples | Chat Format | Vision |
|---|---|---|---|---|
| **Gemma 4** | gemma4 | gemma-4-e2b-it-4bit, gemma-4-e4b-it-8bit, gemma-4-26b-a4b-it-4bit | Gemma turns | SigLIP |
| **Gemma 3** | gemma3 | gemma-3-12b-it-qat-4bit | Gemma turns | -- |
| **Qwen 3 / 3.5 / 3.6** | qwen3, qwen3_5, qwen3_5_moe, qwen3_next | Qwen3-4B, Qwen3.5-4B, Qwen3.6-35B-A3B | ChatML | -- |
| **Nemotron-H** | nemotron_h | Nemotron-3-Nano-4B | ChatML | -- |
| **LFM2** | lfm2 | LFM2.5-350M | ChatML | -- |
| **Llama** | llama | Llama 3, Llama 3.1, Llama 3.2 | Llama-3 | -- |
| **Mistral** | mistral | Mistral 7B | ChatML | -- |
| **DeepSeek V4 Flash** | deepseek_v4 (GGUF) | DeepSeek-V4-Flash | DSV4 | -- |
| **Anything else as GGUF** | via embedded llama.cpp | any .gguf on HuggingFace | per-template | -- |
Any quantized MLX model using one of the above architectures works natively. Anything else can be served as GGUF through the embedded llama.cpp engine — just pick the .gguf file in the Model Browser and the server auto-routes by format. Models with unsupported architectures are flagged in the Model Browser but can still be downloaded.
+35% faster overall (geomean across 18 cells, best mlx-serve vs best LMS, identical 4-bit weights, ctx=4096, temp=0).
| Model | Echo | Code | Free-form |
|---|---|---|---|
| Gemma 4 E2B | **+122%** | **+47%** | +20% |
| Gemma 4 E4B | **+97%** | **+53%** | **+35%** |
| Gemma 4 31B | +20% | +4% | -1% |
| Gemma 4 26B-A4B-MoE | **+66%** | +23% | +31% |
| Qwen 3.6 27B | **+60%** | +24% | +32% |
| Qwen 3.6 35B-A3B-MoE | **+88%** | +20% | +25% |

Reproduce: ./tests/bench.sh --family gemma --lmstudio --omlx (or qwen36). Requires lms, jq, python3, matplotlib; --omlx requires omlx on PATH.
高性能AI工作流,支持苹果芯片
AI Skill Hub 为第三方内容聚合平台,本页面信息基于公开数据整理,不对工具功能和质量作任何法律背书。
建议在沙箱或测试环境中充分验证后,再部署至生产环境,并做好必要的安全评估。
✅ MIT 协议 — 最宽松的开源协议之一,可自由商用、修改、分发,仅需保留版权声明。
总体来看,苹果芯片AI服务 是一款质量优秀的Agent工作流,在同类工具中具备一定竞争力。AI Skill Hub 将持续追踪其更新动态,建议收藏备用,结合自身场景选择合适时机引入使用。
| 原始名称 | mlx-serve |
| Topics | apple-siliconaillmzig |
| GitHub | https://github.com/ddalcu/mlx-serve |
| License | MIT |
| 语言 | Zig |
收录时间:2026-06-20 · 更新时间:2026-06-20 · License:MIT · AI Skill Hub 不对第三方内容的准确性作法律背书。
选择 Agent 类型,复制安装指令后粘贴到对应客户端