Capability tags

AI tool

LlamaCpp Elixir bindings

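Q: How do I install llama_cpp_ex and get started?

Add llama_cpp_ex to the dependencies in your project's mix.exs and follow the README in the GitHub repository. Besides an Elixir installation, building the native code requires a C++17 compiler, CMake 3.14+, and Git.

Q: Is llama_cpp_ex free? What is its license?

llama_cpp_ex is free and open source under the Apache-2.0 license, so anyone may use, modify, and distribute it at no cost.

Q: Who is llama_cpp_ex for?

llama_cpp_ex is aimed at users with a technical background, such as developers, data analysts, and AI engineers.

Q: How active is the llama_cpp_ex community, and how well is the project maintained?

llama_cpp_ex has 7 stars on GitHub. It is at an early stage, is being actively developed, and its community is still small.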

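Built with Elixir · Free and open source, runs locally, your data stays under your control

English name: llama_cpp_ex

⭐ 7 Stars 🍴 1 Fork 💻 Elixir 📄 Apache-2.0 🏷 AI score 7.5

Tags: Elixir, LLM, CUDA, Metal, Vulkan

✦ AI Skill Hub recommendation

AI Skill Hub recommends LlamaCpp Elixir bindings as a good-quality AI tool. Its overall AI score is 7.5, a solid result among comparable tools. If you are looking for a dependable way to run LLMs from Elixir, it is worth a closer look.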

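📚 In-depth analysis

LlamaCpp Elixir bindings is an open-source tool written in Elixir with 7 stars on GitHub, covering the Elixir, LLM, CUDA, and Metal topics. Because the code is fully open, you can audit it for security and adapt it to your own needs.

**Why use an open-source tool instead of commercial SaaS?**
For individual developers and privacy-conscious users, running an open-source tool locally means data never leaves the machine and is not subject to a third-party provider's data policies. Open-source tools also typically have no usage caps or monthly fees: install once and keep using it, so for heavy use the total cost of ownership (TCO) is usually well below that of a subscription product.

**Installation and environment setup**
LlamaCpp Elixir bindings requires an Elixir runtime. Dependencies are declared per project in mix.exs and fetched with Mix, so no Python virtual environment or Node.js version manager is involved. The native code is compiled during the build, which requires a C++17 compiler, CMake 3.14+, and Git.

**Community and maintenance**
GitHub Issues and Discussions are the fastest way to get help. Before asking, check the closed issues, since common problems often already have answers there. When reporting a bug, include your Elixir and OTP versions (elixir --version), the output of mix deps, the full error stack trace, and a minimal reproducible example; this makes it much easier for the maintainer to respond. AI Skill Hub will continue to track new versions of LlamaCpp Elixir bindings and report important feature changes.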

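📋 Tool overview

Elixir bindings for running LLMs, with Metal, CUDA, Vulkan, or CPU acceleration

LlamaCpp Elixir bindings is an open-source tool written in Elixir for running LLMs locally, with GPU acceleration through backends such as CUDA. As a GitHub open-source project, its code is fully open to audit, and it runs locally to keep data private. It can be used in personal projects or integrated into an organization's existing workflow.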

GitHub Stars	⭐ 7
Language	Elixir
Supported platforms	Windows / macOS / Linux
Maintenance status	Lightweight project, updated as needed
License	Apache-2.0
Overall AI score	7.5
Tool type	AI tool
Forks	1

📖 Documentation

The following content was compiled automatically by AI Skill Hub from project information. For the complete original documentation, see "Original source" at the bottom of this page.

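📌 Key features

Free and open source; runs locally, keeping data fully under your control
Developed in the open on GitHub and updated as needed
Documentation and usage examples are provided in the README
Configurable to suit different environments
Can be integrated into an existing stack as a building block or extended through custom development

🎯 Main use cases

Running locally to protect data privacy and meet compliance requirements
Custom integration into existing systems to extend the stack
Building commercial products on top of it as an open-source component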

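The installation commands below were generated automatically from the project's language and type; the official README takes precedence.

Installation commands

# Clone the repository
git clone https://github.com/nyo16/llama_cpp_ex
cd llama_cpp_ex

# Read the installation instructions
cat README.md

# Install the dependencies as described in the README before use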

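📋 Installation steps

Visit the GitHub repository page
Install the dependencies as described in the README
Complete any initial configuration for your system
Follow the official examples or documentation to get started
If you run into problems, search the GitHub Issues for answers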

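The usage notes below were compiled by AI Skill Hub and cover the most common scenario.

Common commands / code examples

# llama_cpp_ex is an Elixir library rather than a standalone command;
# call it from Elixir code, or run one of the scripts in examples/
LLAMA_MODEL_PATH=/path/to/model.gguf mix run examples/structured_output.exs

# For detailed usage, see the documentation
# https://github.com/nyo16/llama_cpp_ex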

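The configuration notes below reflect typical usage; see the official documentation for the exact parameters.

Configuration example

# llama_cpp_ex configuration
# Options are passed as keyword lists to the library's functions

# Model loading
# n_gpu_layers: 999

# Context
# n_ctx: 4096

# Sampling
# temp: 0.7
# top_p: 0.9

# Environment variable set when running the example scripts
export LLAMA_MODEL_PATH="/path/to/model.gguf"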

📑 README in depth · documentation completeness 87/100

The following content was parsed directly from the GitHub README, preserving its code blocks, tables, and lists.

LlamaCppEx

Elixir bindings for llama.cpp — run LLMs locally with Metal, CUDA, Vulkan, or CPU acceleration.

Built with C++ NIFs using fine for ergonomic resource management and elixir_make for the build system.

Features

Load and run GGUF models directly from Elixir
HuggingFace Hub integration — search, list, and download GGUF models
GPU acceleration: Metal (macOS), CUDA (NVIDIA), Vulkan, or CPU
Streaming token generation via lazy Stream
Jinja chat templates with enable_thinking support (Qwen3, Qwen3.5, etc.)
RAII resource management — models, contexts, and samplers are garbage collected by the BEAM
Configurable sampling: temperature, top-k, top-p, min-p, repetition penalty, frequency & presence penalty
Embedding generation with L2 normalization
Grammar-constrained generation (GBNF)
Structured output via JSON Schema (auto-converted to GBNF grammar)
Optional Ecto schema to JSON Schema conversion
Continuous batching server for concurrent inference
Multi-Token Prediction (MTP) speculative decoding — ~2x token-generation speedup on Qwen 3.6 with live acceptance-rate stats
Prefix caching — same-slot KV cache reuse for multi-turn chat (1.23x faster)
Pluggable batching strategies — DecodeMaximal, PrefillPriority, Balanced
Pre-tokenized API — tokenize outside the GenServer for lower contention
Telemetry integration for observability
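
The feature list mentions embeddings with L2 normalization. As an illustration of what that normalization does (not the library's implementation, which is not shown here), the following plain-Elixir sketch scales a vector to unit length. The `L2` module name is hypothetical.

```elixir
defmodule L2 do
  # Hypothetical helper for illustration; not part of llama_cpp_ex.

  # Divides every component by the vector's Euclidean length.
  # A zero vector is returned unchanged to avoid dividing by zero.
  def normalize(vec) do
    norm = :math.sqrt(Enum.reduce(vec, 0.0, fn x, acc -> acc + x * x end))

    if norm == 0.0 do
      vec
    else
      Enum.map(vec, fn x -> x / norm end)
    end
  end
end

IO.inspect(L2.normalize([3.0, 4.0]))
```

After normalization, the dot product of two embeddings equals their cosine similarity, which is why normalized vectors are convenient for retrieval.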

Prerequisites

C++17 compiler (GCC, Clang, or MSVC)
CMake 3.14+
Git (for the llama.cpp submodule)

Hardware requirements

Quantization	RAM / VRAM	File size
Q4_K_M	~20 GB	~19 GB
Q8_0	~37 GB	~36 GB
BF16	~70 GB	~67 GB
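
The table above can be read as a simple lookup. The sketch below, with a hypothetical `QuantPicker` module, returns the largest listed quantization whose approximate RAM / VRAM figure fits a given budget. It assumes the table's rounded figures and leaves no extra headroom for context memory, so treat the result as a starting point only.

```elixir
defmodule QuantPicker do
  # Hypothetical helper for illustration; not part of llama_cpp_ex.
  # Approximate RAM / VRAM needs in GB, copied from the table above,
  # ordered from largest to smallest.
  @requirements [{"BF16", 70}, {"Q8_0", 37}, {"Q4_K_M", 20}]

  # Returns {:ok, name} for the largest quantization that fits, or :none.
  def pick(available_gb) do
    case Enum.find(@requirements, fn {_name, gb} -> gb <= available_gb end) do
      {name, _gb} -> {:ok, name}
      nil -> :none
    end
  end
end

IO.inspect(QuantPicker.pick(24))
IO.inspect(QuantPicker.pick(16))
```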

Installation

Add llama_cpp_ex to your list of dependencies in mix.exs:

def deps do
  [
    {:llama_cpp_ex, "~> 0.7.5"}
  ]
end
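
The `~>` operator in the dependency tuple is a Mix version requirement. As a minimal sketch using only Elixir's standard `Version` module, the check below shows which releases `"~> 0.7.5"` would accept. The candidate version numbers other than 0.7.5 are made up for illustration and may not correspond to published releases.

```elixir
# "~> 0.7.5" allows patch-level updates: >= 0.7.5 and < 0.8.0.
requirement = "~> 0.7.5"

# Hypothetical version numbers, used only to illustrate the rule.
for candidate <- ["0.7.4", "0.7.5", "0.7.9", "0.8.0"] do
  IO.puts("#{candidate}: #{Version.match?(candidate, requirement)}")
end
```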

Build the speculative session once — it owns a target context and a

Install the HuggingFace CLI if needed: pip install huggingface-hub

huggingface-cli download Qwen/Qwen3.5-35B-A3B-GGUF Qwen3.5-35B-A3B-Q4_K_M.gguf --local-dir models/

Quick Start

Usage

Minimal: stream a single response

```elixir
:ok = LlamaCppEx.init()

{:ok, model} =
  LlamaCppEx.load_model(
    Path.expand("~/Downloads/Qwen3.6-35B-A3B-MTP-Q4_K_M.gguf"),
    n_gpu_layers: 999
  )
```

Examples

The examples/ directory contains runnable scripts demonstrating key features:

# Create context and sampler separately
{:ok, ctx} = LlamaCppEx.Context.create(model, n_ctx: 4096)
{:ok, sampler} = LlamaCppEx.Sampler.create(model, temp: 0.7, top_p: 0.9)

# Sample every 200 ms while the generation runs.
Stream.repeatedly(fn ->
  Process.sleep(200)
  s = LlamaCppEx.MTP.stats(mtp)

  IO.puts(
    "iters=#{s.iters} emitted=#{s.tokens_emitted} " <>
      "accept=#{Float.round(s.acceptance_rate * 100, 1)}% " <>
      "tok/s=#{Float.round(s.tokens_per_sec, 1)}"
  )
end)
# Stop polling once the generation task's process has exited.
|> Stream.take_while(fn _ -> Process.alive?(gen_task.pid) end)
|> Stream.run()

Task.await(gen_task, :infinity)


For in-band progress events (no separate process), use `stream_events/3` with `emit_stats_every`:

```elixir
mtp
|> LlamaCppEx.MTP.stream_events("Write a sonnet:",
  max_tokens: 400,
  emit_stats_every: 32
)
|> Enum.each(fn
  {:token, _id, text} -> IO.write(text)
  {:stats, s} -> IO.puts("\n[stats] accept=#{Float.round(s.acceptance_rate * 100, 1)}%")
  {:done, _final} -> IO.puts("\n[done]")
  {:eog, _} -> IO.puts("\n[eog]")
end)
```
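
When the full text is needed rather than incremental output, the same event tuples can be folded into a single string. The sketch below runs on a hand-made event list so that it works without a model. `EventCollector` is a hypothetical helper, and the contents of the `:stats` and `:done` payloads are placeholders, since their exact shape is not shown in this excerpt.

```elixir
defmodule EventCollector do
  # Hypothetical helper for illustration; not part of llama_cpp_ex.

  # Keeps the text of every {:token, id, text} event and ignores the rest.
  def collect(events) do
    events
    |> Enum.reduce([], fn
      {:token, _id, text}, acc -> [text | acc]
      _other, acc -> acc
    end)
    |> Enum.reverse()
    |> IO.iodata_to_binary()
  end
end

# Placeholder events standing in for the output of stream_events/3.
fake_events = [
  {:token, 1, "Hello"},
  {:stats, %{acceptance_rate: 0.8}},
  {:token, 2, " world"},
  {:done, %{}}
]

IO.puts(EventCollector.collect(fake_events))
```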

Options

LlamaCppEx.MTP.init/2:

:n_draft — draft tokens proposed per iteration (default 3). On NVIDIA, 2–4 is the sweet spot. On Apple Silicon, set this to 1 — see the Apple Silicon performance note above.
:n_ctx, :n_threads, :flash_attn, :type_k/:type_v, :offload_kqv, … — any LlamaCppEx.Context option; applied to both target and draft contexts.

LlamaCppEx.MTP.stream/3:

:max_tokens (default 256), plus all sampling options (:temp, :top_k, :top_p, :min_p, :seed, :penalty_*, :grammar).
:emit_stats_every — when set, periodic {:stats, _} events become available via stream_events/3.
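
The `:n_draft` guidance above can be captured in a small helper that builds the options keyword list. This is a sketch under stated assumptions: `DraftConfig` and its platform atoms are hypothetical, and the call to `LlamaCppEx.MTP.init/2` is left out because its first argument is not shown in this excerpt.

```elixir
defmodule DraftConfig do
  # Hypothetical helper for illustration; not part of llama_cpp_ex.

  # Apple Silicon: the text above recommends 1.
  def n_draft(:apple_silicon), do: 1
  # Everything else: the documented default of 3, which also falls in
  # the 2 to 4 range recommended for NVIDIA.
  def n_draft(_other), do: 3

  # Merges the platform default with any extra context options.
  def mtp_opts(platform, extra \\ []) do
    Keyword.merge([n_draft: n_draft(platform)], extra)
  end
end

IO.inspect(DraftConfig.mtp_opts(:apple_silicon, n_ctx: 4096))
IO.inspect(DraftConfig.mtp_opts(:nvidia))
```

Because `Keyword.merge/2` lets the second list win, a caller can still override `:n_draft` explicitly through `extra`.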

```elixir
# Disable thinking via enable_thinking option (uses Jinja chat template kwargs)
{:ok, reply} =
  LlamaCppEx.chat(
    model,
    [%{role: "user", content: "What is the capital of France?"}],
    max_tokens: 256,
    enable_thinking: false,
    temp: 0.7,
    top_p: 0.8,
    top_k: 20,
    min_p: 0.0,
    penalty_present: 1.5
  )
```
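
The messages argument is a plain list of role/content maps, so a multi-turn history can be built with ordinary list operations before each call. The sketch below uses a hypothetical `ChatHistory` helper. Echoing earlier assistant replies back in this shape follows the usual chat-template convention and is an assumption here, not something this excerpt confirms.

```elixir
defmodule ChatHistory do
  # Hypothetical helper for illustration; not part of llama_cpp_ex.

  # Appends one turn, keeping the map shape used in the chat example.
  def add(history, role, content) when role in ["user", "assistant"] do
    history ++ [%{role: role, content: content}]
  end
end

history =
  []
  |> ChatHistory.add("user", "What is the capital of France?")
  |> ChatHistory.add("assistant", "Paris.")
  |> ChatHistory.add("user", "And of Italy?")

IO.inspect(history)
# `history` would then be passed as the second argument to LlamaCppEx.chat/3.
```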

Lower-level API

For fine-grained control over the inference pipeline:

Pre-Tokenized API

Tokenize outside the GenServer to reduce contention under concurrent load:

model = LlamaCppEx.Server.get_model(server)
{:ok, tokens} = LlamaCppEx.Tokenizer.encode(model, prompt)
{:ok, text} = LlamaCppEx.Server.generate_tokens(server, tokens, max_tokens: 100)
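
To show where the contention saving would come from, the sketch below tokenizes several prompts in separate tasks and sends only the token lists to the server. It uses the three calls shown above and nothing else from the library. `PreTokenized` is a hypothetical module, `server` is assumed to be an already-started `LlamaCppEx.Server`, and the concurrency limit of 4 is an arbitrary choice.

```elixir
defmodule PreTokenized do
  # Hypothetical helper for illustration; requires llama_cpp_ex at runtime.

  def generate_all(server, prompts, opts \\ [max_tokens: 100]) do
    model = LlamaCppEx.Server.get_model(server)

    prompts
    |> Task.async_stream(
      fn prompt ->
        # Tokenization runs in the calling task, not inside the GenServer.
        {:ok, tokens} = LlamaCppEx.Tokenizer.encode(model, prompt)
        LlamaCppEx.Server.generate_tokens(server, tokens, opts)
      end,
      max_concurrency: 4,
      timeout: :infinity
    )
    # Task.async_stream/3 wraps each task result in {:ok, result}.
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```

Results come back in the same order as `prompts`, each as the `{:ok, text}` tuple (or error) returned by `generate_tokens/3`.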

Ecto Schema Integration

Convert Ecto schema modules to JSON Schema automatically (requires {:ecto, "~> 3.0"} — optional dependency):

```elixir
defmodule MyApp.Person do
  use Ecto.Schema

  embedded_schema do
    field :name, :string
    field :age, :integer
    field :active, :boolean
    field :tags, {:array, :string}
  end
end
```
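
For orientation, a JSON Schema describing the same four fields might look like the hand-written map below. This is an assumption about the general shape only: the library's conversion function and its exact output (for example, whether it adds a `required` list) are not shown in this excerpt.

```elixir
# Hand-written JSON Schema for the fields of MyApp.Person, as a plain map.
# Illustrative only; not produced by llama_cpp_ex.
person_schema = %{
  "type" => "object",
  "properties" => %{
    "name" => %{"type" => "string"},
    "age" => %{"type" => "integer"},
    "active" => %{"type" => "boolean"},
    "tags" => %{"type" => "array", "items" => %{"type" => "string"}}
  }
}

IO.inspect(Enum.sort(Map.keys(person_schema["properties"])))
```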

# JSON Schema constrained generation + Ecto integration
LLAMA_MODEL_PATH=/path/to/model.gguf mix run examples/structured_output.exs

🇨🇳 README summary · AI-translated 2026-05-29

The README sections are summarized below for quick reading. The full text appears in the README section above.

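📌 Introduction

LlamaCppEx is an Elixir binding library for llama.cpp. It lets developers load and run GGUF models directly from Elixir, bringing local LLM inference into Elixir applications.

⚡ Features

The project loads and runs GGUF models directly from Elixir and integrates with HuggingFace Hub to search, list, and download models. For hardware acceleration it supports Metal (macOS), CUDA (NVIDIA), Vulkan, and CPU. It also streams generated tokens through a lazy Stream and uses RAII resource management so that models, contexts, and samplers are garbage collected by the BEAM.

📋 Requirements

Building requires a C++17 compiler (GCC, Clang, or MSVC), CMake 3.14+, and Git for the llama.cpp submodule. For hardware, reserve enough RAM or VRAM for the chosen quantization (Q4_K_M, Q8_0, or BF16); for example, the Q4_K_M file listed in the README needs about 20 GB.

🛠 Installation steps

Add `{:llama_cpp_ex, "~> 0.7.5"}` to the dependencies in mix.exs and fetch it with `mix deps.get`. To download models, install the HuggingFace CLI (`pip install huggingface-hub`) and use `huggingface-cli download` to save the GGUF file to a local directory.

🚀 Usage

Load a local model with `LlamaCppEx.load_model/2`, using the `n_gpu_layers` option to set how many layers are offloaded to the GPU. For simple conversations, use `LlamaCppEx.chat/3`; its `enable_thinking` option controls whether the Jinja chat template enables the model's thinking output (for models such as Qwen3).

⚙️ Configuration

Options are passed to functions such as `LlamaCppEx.MTP.init/2`. For example, `:n_draft` is best set to 2–4 on NVIDIA GPUs and to 1 on Apple Silicon. Context options such as `:n_ctx`, `:n_threads`, and `:flash_attn` can be tuned to the inference workload.

🔌 API

For high-concurrency or fine-grained control, the project exposes a lower-level API. For example, `LlamaCppEx.Tokenizer.encode/2` tokenizes outside the GenServer to reduce contention under concurrent load, and the resulting tokens are passed to `LlamaCppEx.Server.generate_tokens/3` for inference.

🔄 Workflow / modules

With the optional `ecto` dependency, Ecto schema modules can be converted to JSON Schema automatically. Combined with JSON Schema constrained generation, this keeps the LLM's output within a predefined data structure.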

🎯 aiskill88 AI review · Grade A · 2026-05-29

High-quality Elixir bindings with support for multiple backends

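📚 Practical guide (long-tail questions)

Who it suits

Teams building enterprise knowledge bases or RAG retrieval applications

Best practices

For local deployment, prefer quantized GGUF models to save memory while keeping responses fast

Common mistakes

Committing API keys to the git repository (keep them in .env and add it to .gitignore)
Running out of VRAM (OOM): reduce the context size first, or switch to a smaller quantization

Deployment options

Library: add it as a Mix dependency and call it from Elixir code
Local deployment: memory needs depend on the model and quantization; the README's example table starts at about 20 GB of RAM or VRAM for Q4_K_M
Cloud hosting: any host that can build the C++ NIF and provides enough RAM or VRAM for the chosen model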

👥 Intended audience

AI enthusiasts, researchers and students, developers and engineers, and technical founders

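⚖️ Pros and cons

✅ Pros

+ Apache-2.0 license, free for commercial use
+ Fully open source with no licensing fees
+ Runs locally, keeping data fully under your control
+ Community support is available through GitHub

⚠️ Cons

− Installation and initial setup require some technical background
− Feature coverage is usually narrower than that of mature commercial products
− Support relies mainly on the open-source community, so response times vary

⚠️ Usage notes

AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute a legal endorsement of the tool's features or quality.

We recommend validating the tool thoroughly in a sandbox or test environment, with any necessary security assessment, before deploying it to production.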

❓ FAQ

What is llama_cpp_ex?

llama_cpp_ex is an open-source AI tool written in Elixir: Elixir bindings for llama.cpp that run LLMs locally with Metal, CUDA, Vulkan, or CPU acceleration. ⭐ 7 · Elixir. Its main use case is running LLMs locally.

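💡 AI Skill Hub review

Overall, LlamaCpp Elixir bindings is a good-quality AI tool that is competitive among comparable tools. AI Skill Hub will continue to track its updates; consider bookmarking it and adopting it when it fits your use case.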

🌐 Original information

Original name	`llama_cpp_ex`
Original description	Open-source AI tool: Elixir bindings for llama.cpp — run LLMs locally with Metal, CUDA, Vulkan, or CPU. ⭐7 · Elixir
Topics	`Elixir`, `LLM`, `CUDA`, `Metal`, `Vulkan`
GitHub	https://github.com/nyo16/llama_cpp_ex
License	Apache-2.0
Language	Elixir

🔗 Original source

🐙 GitHub repository https://github.com/nyo16/llama_cpp_ex

Added: 2026-05-29 · Updated: 2026-05-30 · License: Apache-2.0 · AI Skill Hub makes no legal endorsement of the accuracy of third-party content.