GPU Llama3

Java-based · free and open source, deployed locally, with your data fully under your control

Project name: GPULlama3.java

⭐ 260 Stars 🍴 32 Forks 💻 Java 📄 MIT 🏷 AI score 8.0


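✦ Recommended by AI Skill Hub

GPU Llama3 is one of AI Skill Hub's featured AI tools in this issue. Its overall score of 8.0 indicates good overall quality. We strongly recommend adding it to your AI toolkit to help improve productivity.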

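📚 In-Depth Analysis

GPU Llama3 is an open-source, Java-based tool with about 260 stars on GitHub, sitting at the intersection of GPU, Java, AI, and acceleration. The main advantage of an open-source tool is that its code is fully transparent: you can audit every line for security, and you can fork and customize it to fit your own needs.

**Why use an open-source tool instead of a commercial SaaS?**
For individual developers and privacy-conscious users, a locally deployed open-source tool means data never leaves the machine and is not subject to a third-party provider's data policy. Open-source tools also usually have no usage caps or monthly fees: install once and use long term. For high-frequency use, the total cost of ownership (TCO) is far lower than that of subscription-based commercial tools.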

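**Installation and environment setup**
GPU Llama3 depends on a Java runtime: the README requires Java 21 and a TornadoVM SDK with the OpenCL or PTX backend. Manage JDK versions with a version manager such as SDKMAN!, which the README also recommends for installing TornadoVM, rather than changing the system-wide Java installation. If something goes wrong, you can switch or remove that JDK without affecting the rest of the system.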

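**Community and maintenance**
GitHub Issues and Discussions are the fastest ways to get help. Before asking, check the closed issues, since most common problems have already been answered there. When reporting a bug, include the output of java -version and tornado --devices, the full error stack trace, and a minimal reproducible example; this noticeably speeds up the maintainers' response. AI Skill Hub will continue to track GPU Llama3 releases and report important feature changes.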

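📋 Tool Overview

GPU Llama3 is an open-source tool written in Java that focuses on GPU-accelerated AI inference. As a GitHub open-source project, it has community support and ongoing releases, its code is fully transparent and auditable, and it supports local deployment to protect data privacy. It can be used on its own or integrated into an enterprise workflow.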

GitHub Stars: ⭐ 260
Forks: 32
Language: Java
Supported platforms: Windows / macOS / Linux / Android
Maintenance status: lightweight project, updated as needed
License: MIT
AI score: 8.0
Tool type: AI tool

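📖 Documentation

The following content was compiled automatically by AI Skill Hub from project information. For the complete original documentation, see "Original source" at the bottom of the page.

📌 Key Features

Free and open source, with local deployment and full control over your data
An active open-source community on GitHub with ongoing updates
Detailed documentation and usage examples, friendly to newcomers
Custom configuration that adapts to different environments
Usable as a building block in an existing technology stack or as a base for further development

🎯 Main Use Cases

Running locally to protect data privacy and meet compliance requirements
Custom integration into existing systems to extend the technology stack
Commercial development on top of an open-source base component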

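The installation commands below were generated automatically from the project's language and type; the official README takes precedence.

Installation commands

# Clone the repository
git clone https://github.com/beehive-lab/GPULlama3.java
cd GPULlama3.java

# Read the installation instructions
cat README.md

# Once the dependencies described in the README are installed, the tool is ready to use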

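📋 Installation Steps

Visit the GitHub repository page
Install the dependencies as described in the README
Complete the initial configuration for your system
Follow the official examples or documentation to get started
If you run into problems, search GitHub Issues for answers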

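The usage examples below were compiled by AI Skill Hub and cover the most common scenarios.

Common commands / code examples

# Show help
./llama-tornado --help

# Basic run
./llama-tornado --gpu --model <model.gguf> --prompt "<prompt>"

# For detailed usage, see the documentation
# https://github.com/beehive-lab/GPULlama3.java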

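The configuration example below is based on typical usage; adjust the parameters according to the official documentation.

Configuration example

# GPULlama3.java configuration notes
# The README describes configuration through command-line flags rather than a configuration file
./llama-tornado --gpu --model model.gguf --prompt "Hello!" --gpu-memory 15GB --heap-min 20g --heap-max 20g

# Environment variables expected by the TornadoVM SDK
export JAVA_HOME="/path/to/jdk"
export TORNADOVM_HOME="/path/to/tornadovm-sdk"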

📑 README Analysis · documentation completeness: 70/100

The following content was parsed directly from the GitHub README, preserving code blocks, tables, and lists.

Current Features & Roadmap

Support for GGUF format models with full FP16 and partial support for Q8_0 and Q4_0 quantization.
Instruction-following and chat modes for various use cases.
Interactive CLI with --interactive and --instruct modes.
Flexible backend switching: choose OpenCL or PTX at runtime (requires TornadoVM built with both backends enabled).
Cross-platform compatibility:
✅ NVIDIA GPUs (OpenCL & PTX)
✅ Intel GPUs (OpenCL)
✅ Apple GPUs (OpenCL)

Click here to view a more detailed list of the transformer optimizations implemented in TornadoVM.

Click here to see the roadmap of the project.

-----------

Prerequisites

Ensure you have the following installed and configured:

Java 21: Required for Vector API support & TornadoVM.
TornadoVM with OpenCL or PTX backends.
GCC/G++ 13 or newer: Required to build and run TornadoVM native components.
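
As a quick preflight for the first prerequisite, the sketch below compares the running JVM's feature version against the Java 21 minimum. The class and method names are hypothetical, and treating every version from 21 upward as acceptable is an assumption; the README itself publishes separate artifacts for JDK 21 and JDK 25.

```java
// Hypothetical preflight helper; not part of GPULlama3.java.
final class JavaVersionCheck {
    // Minimum feature version taken from the prerequisites list.
    static final int MIN_FEATURE_VERSION = 21;

    // Returns true when the given feature version (e.g. 21 or 25) meets the minimum.
    static boolean isSupported(int featureVersion) {
        return featureVersion >= MIN_FEATURE_VERSION;
    }

    public static void main(String[] args) {
        // Runtime.version().feature() reports the major release of the running JVM.
        int current = Runtime.version().feature();
        String verdict = isSupported(current)
                ? "meets the Java 21 requirement"
                : "is too old; Java 21 or newer is required";
        System.out.println("Java " + current + " " + verdict);
    }
}
```

Running it with the same `java` binary that will launch the model is the useful case, since several JDKs may be installed side by side.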

📦 Maven Dependency

You can add GPULlama3.java directly to your Maven project by including the following dependency in your pom.xml:

JDK 21:

<dependency>
    <groupId>io.github.beehive-lab</groupId>
    <artifactId>gpu-llama3</artifactId>
    <version>0.4.0</version>
</dependency>

JDK 25:

<dependency>
    <groupId>io.github.beehive-lab</groupId>
    <artifactId>gpu-llama3</artifactId>
    <version>0.4.0-jdk25</version>
</dependency>

Prerequisites for JBang

Install JBang: Follow the JBang installation guide
TornadoVM SDK: You still need TornadoVM installed and the TORNADOVM_HOME environment variable set (see the Setup & Configuration section)
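
Because a missing TORNADOVM_HOME tends to surface only as a launch failure, a small check before starting can help. The sketch below reports which expected variables are unset; the class name is hypothetical, and including JAVA_HOME in the list is an assumption based on the SDK installation notes rather than a documented JBang requirement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical environment check; not part of GPULlama3.java.
final class TornadoEnvCheck {
    // Variables this sketch expects to be set before launching.
    static final List<String> REQUIRED = List.of("JAVA_HOME", "TORNADOVM_HOME");

    // Returns the names of required variables that are unset or blank in env.
    static List<String> missingVariables(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingVariables(System.getenv());
        System.out.println(missing.isEmpty()
                ? "Environment looks complete"
                : "Missing environment variables: " + missing);
    }
}
```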

GPU Memory Requirements by Model Size

Model Size	Recommended GPU Memory
1B models	7GB (default)
3-7B models	15GB
8B models	20GB+

Note: If you still encounter memory issues, try:

Using Q4_0 instead of Q8_0 quantization (requires less memory).
Closing other GPU-intensive applications in your system.
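
To make the quantization advice concrete, the sketch below estimates raw weight storage per format and looks up the table's recommendation. The block sizes follow the GGML layouts (32 weights per block, with a 2-byte scale for Q8_0 and Q4_0). The class name is hypothetical, the tier boundaries for sizes the table does not list are an assumption, and real GGUF files mix tensor types, so treat the numbers as rough guides only.

```java
// Hypothetical sizing helper; not part of GPULlama3.java.
final class GpuMemoryGuide {
    // Storage cost of one 32-weight block, in bytes, for each format.
    enum Format {
        FP16(64), // 32 weights x 2 bytes
        Q8_0(34), // 32 one-byte weights + one 2-byte scale
        Q4_0(18); // 32 four-bit weights (16 bytes) + one 2-byte scale

        final int bytesPerBlock;

        Format(int bytesPerBlock) {
            this.bytesPerBlock = bytesPerBlock;
        }
    }

    // Estimates bytes needed for the weights alone (no KV cache or buffers).
    static long estimateWeightBytes(Format format, long parameterCount) {
        long blocks = (parameterCount + 31) / 32; // round up to whole blocks
        return blocks * format.bytesPerBlock;
    }

    // Maps a nominal model size to the table's --gpu-memory value.
    // Assumption: sizes between the listed tiers round up to the next tier.
    static String recommendedGpuMemory(double billionParams) {
        if (billionParams <= 1.0) return "7GB";
        if (billionParams <= 7.0) return "15GB";
        return "20GB"; // the table says "20GB+", so this is a lower bound
    }

    public static void main(String[] args) {
        long params = 1_000_000_000L;
        for (Format f : Format.values()) {
            System.out.printf("%s: ~%.2f GB of weights%n", f, estimateWeightBytes(f, params) / 1e9);
        }
        System.out.println("--gpu-memory " + recommendedGpuMemory(1.0));
    }
}
```

The recommended figures in the table are well above the raw weight sizes, which is expected: the runtime also needs room for activations, the KV cache, and other buffers.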

-----------

GPULlama3.java powered by TornadoVM

-----------

Llama3 models written in native Java, automatically accelerated on GPUs with [TornadoVM](https://github.com/beehive-lab/TornadoVM). GPULlama3.java runs Llama3 inference efficiently using TornadoVM's GPU acceleration.

It currently supports Llama3, Mistral, Devstral 2, Qwen2.5, Qwen3, Phi-3, IBM Granite 3.2+ and IBM Granite 4.0 models in the GGUF format. It is also used as the GPU inference engine in [Quarkus](https://docs.quarkiverse.io/quarkus-langchain4j/dev/gpullama3-chat-model.html) and [LangChain4J](https://docs.langchain4j.dev/integrations/language-models/gpullama3-java).

It builds on [Llama3.java](https://github.com/mukel/llama3.java) by [Alfonso² Peterssen](https://github.com/mukel). The previous integration of TornadoVM and Llama2 can be found in [llama2.tornadovm](https://github.com/mikepapadim/llama2.tornadovm.java).

-----------

Setup & Configuration

Install, Build, and Run

```bash
# Install JBang (if not already installed)
curl -Ls https://sh.jbang.dev | bash -s - app setup

# Or install it as a command
jbang app install gpullama3@beehive-lab
gpullama3 -m model.gguf -p "Hello!"
```

🐳 Docker

You can run GPULlama3.java fully containerized with GPU acceleration enabled via OpenCL or PTX using pre-built Docker images. More information as well as examples to run with the containers are available at docker-gpullama3.java.

📦 Available Docker Images

Backend	Docker Image	Pull Command
OpenCL	[`beehivelab/gpullama3.java-nvidia-openjdk-opencl`](https://hub.docker.com/r/beehivelab/gpullama3.java-nvidia-openjdk-opencl)	`docker pull beehivelab/gpullama3.java-nvidia-openjdk-opencl`
PTX (CUDA)	[`beehivelab/gpullama3.java-nvidia-openjdk-ptx`](https://hub.docker.com/r/beehivelab/gpullama3.java-nvidia-openjdk-ptx)	`docker pull beehivelab/gpullama3.java-nvidia-openjdk-ptx`

Example (OpenCL)

```bash
docker run --rm -it --gpus all \
  -v "$PWD":/data \
  beehivelab/gpullama3.java-nvidia-openjdk-opencl \
  /gpullama3/GPULlama3.java/llama-tornado \
  --gpu --verbose-init \
  --opencl \
  --model /data/Llama-3.2-1B-Instruct.FP16.gguf \
  --prompt "Tell me a joke"
```

Quick Start with JBang

Use from catalog:

```bash
# Basic usage - interactive chat mode
jbang LlamaTornadoCli.java -m beehive-llama-3.2-1b-instruct-fp16.gguf --interactive
```

Usage Examples

#### Basic Inference

Run a model with a text prompt:

./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "Explain the benefits of GPU acceleration."

#### GPU Execution (FP16 Model)

Enable GPU acceleration with an FP16 model:

./llama-tornado --gpu --verbose-init --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke"

Running with `llamaTornado` (Java 25 single-file script)

llamaTornado is a zero-dependency Java 25 single-file script that replaces the Python launcher. It requires Java 25+ on your PATH:

./llamaTornado --gpu --verbose-init --metal --model /Users/abien/work/workspaces/llms/Mistral-7B-Instruct-v0.3.Q8_0.gguf --prompt "what is java"

-----------

Command Line Options

Supported command-line options include:

cmd ➜ llama-tornado --help
usage: llama-tornado [-h] --model MODEL_PATH [--prompt PROMPT] [-sp SYSTEM_PROMPT] [--temperature TEMPERATURE] [--top-p TOP_P] [--seed SEED] [-n MAX_TOKENS]
                     [--stream STREAM] [--echo ECHO] [-i] [--instruct] [--gpu] [--opencl] [--ptx] [--gpu-memory GPU_MEMORY] [--heap-min HEAP_MIN] [--heap-max HEAP_MAX]
                     [--debug] [--profiler] [--profiler-dump-dir PROFILER_DUMP_DIR] [--print-bytecodes] [--print-threads] [--print-kernel] [--full-dump]
                     [--show-command] [--execute-after-show] [--opencl-flags OPENCL_FLAGS] [--max-wait-events MAX_WAIT_EVENTS] [--verbose]

GPU-accelerated LLaMA.java model runner using TornadoVM

options:
  -h, --help            show this help message and exit
  --model MODEL_PATH    Path to the LLaMA model file (e.g., beehive-llama-3.2-8b-instruct-fp16.gguf) (default: None)

LLaMA Configuration:
  --prompt PROMPT       Input prompt for the model (default: None)
  -sp SYSTEM_PROMPT, --system-prompt SYSTEM_PROMPT
                        System prompt for the model (default: None)
  --temperature TEMPERATURE
                        Sampling temperature (0.0 to 2.0) (default: 0.1)
  --top-p TOP_P         Top-p sampling parameter (default: 0.95)
  --seed SEED           Random seed (default: current timestamp) (default: None)
  -n MAX_TOKENS, --max-tokens MAX_TOKENS
                        Maximum number of tokens to generate (default: 512)
  --stream STREAM       Enable streaming output (default: True)
  --echo ECHO           Echo the input prompt (default: False)
  --suffix SUFFIX       Suffix for fill-in-the-middle request (Codestral) (default: None)

Mode Selection:
  -i, --interactive     Run in interactive/chat mode (default: False)
  --instruct            Run in instruction mode (default) (default: True)

Hardware Configuration:
  --gpu                 Enable GPU acceleration (default: False)
  --opencl              Use OpenCL backend (default) (default: None)
  --ptx                 Use PTX/CUDA backend (default: None)
  --gpu-memory GPU_MEMORY
                        GPU memory allocation (default: 7GB)
  --heap-min HEAP_MIN   Minimum JVM heap size (default: 20g)
  --heap-max HEAP_MAX   Maximum JVM heap size (default: 20g)

Debug and Profiling:
  --debug               Enable debug output (default: False)
  --profiler            Enable TornadoVM profiler (default: False)
  --profiler-dump-dir PROFILER_DUMP_DIR
                        Directory for profiler output (default: /home/mikepapadim/repos/gpu-llama3.java/prof.json)

TornadoVM Execution Verbose:
  --print-bytecodes     Print bytecodes (tornado.print.bytecodes=true) (default: False)
  --print-threads       Print thread information (tornado.threadInfo=true) (default: False)
  --print-kernel        Print kernel information (tornado.printKernel=true) (default: False)
  --full-dump           Enable full debug dump (tornado.fullDebug=true) (default: False)
  --verbose-init        Enable timers for TornadoVM initialization (llama.EnableTimingForTornadoVMInit=true) (default: False)

Command Display Options:
  --show-command        Display the full Java command that will be executed (default: False)
  --execute-after-show  Execute the command after showing it (use with --show-command) (default: False)

Advanced Options:
  --opencl-flags OPENCL_FLAGS
                        OpenCL compiler flags (default: -cl-denorms-are-zero -cl-no-signed-zeros -cl-finite-math-only)
  --max-wait-events MAX_WAIT_EVENTS
                        Maximum wait events for TornadoVM event pool (default: 32000)
  --verbose, -v         Verbose output (default: False)
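
When the launcher is driven from Java rather than a shell, the options above can be assembled into an argument list. The sketch below is one hedged way to do that; the class and method names are hypothetical, and only flags that appear in the help output are used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical command builder; not part of GPULlama3.java.
final class LlamaTornadoCommand {
    // Builds the argument list for a one-shot GPU inference run.
    static List<String> build(String launcher, String modelPath, String prompt,
                              boolean usePtx, String gpuMemory) {
        List<String> cmd = new ArrayList<>();
        cmd.add(launcher);
        cmd.add("--gpu");
        cmd.add(usePtx ? "--ptx" : "--opencl"); // backend selection
        cmd.add("--model");
        cmd.add(modelPath);
        cmd.add("--prompt");
        cmd.add(prompt); // kept as one argument, so no shell quoting is needed
        cmd.add("--gpu-memory");
        cmd.add(gpuMemory);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("./llama-tornado", "model.gguf", "tell me a joke", false, "7GB");
        System.out.println(String.join(" ", cmd));
        // To launch it for real, pass the list to ProcessBuilder:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the list to `ProcessBuilder` avoids shell quoting problems with prompts that contain spaces or quotes.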

Debug & Profiling Options

View TornadoVM's internal behavior:

```bash
# Combine debug options
./llama-tornado --gpu --model model.gguf --prompt "..." --print-threads --print-bytecodes --print-kernel
```

Integration with LangChain4j

Since LangChain4j v1.7.1, GPULlama3.java is officially supported as a model provider. You can use GPULlama3.java directly inside your LangChain4j applications, running on your GPU, without extra glue code.

📖 Learn more: LangChain4j Documentation

Example agentic workflows with GPULlama3.java + LangChain4j 🚀

How to use:

```java
GPULlama3ChatModel model = GPULlama3ChatModel.builder()
        .modelPath(modelPath)
        .temperature(0.9)     // more creative
        .topP(0.9)            // more variety
        .maxTokens(2048)
        .onGPU(Boolean.TRUE)  // if false, runs on CPU through a lightweight implementation of llama3.java
        .build();
```
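
The `temperature` and `topP` settings are standard sampling controls, also exposed as `--temperature` and `--top-p` on the command line. The sketch below is a generic illustration of temperature-scaled softmax followed by nucleus (top-p) truncation; it is not GPULlama3.java's sampler, and the class name, method names, and toy logits are all hypothetical.

```java
import java.util.Arrays;

// Generic sampling illustration; not the project's implementation.
final class SamplingSketch {
    // Converts logits to probabilities after dividing by the temperature.
    // The temperature must be greater than zero.
    static double[] softmax(double[] logits, double temperature) {
        double max = Arrays.stream(logits).max().getAsDouble();
        double[] probs = new double[logits.length];
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            // Subtracting the maximum keeps Math.exp from overflowing.
            probs[i] = Math.exp((logits[i] - max) / temperature);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }

    // Counts how many of the most probable tokens are needed for their
    // cumulative probability to reach topP; sampling then draws from that set.
    static int nucleusSize(double[] probs, double topP) {
        double[] sorted = probs.clone();
        Arrays.sort(sorted); // ascending, so walk from the end
        double cumulative = 0.0;
        int kept = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {
            cumulative += sorted[i];
            kept++;
            if (cumulative >= topP) {
                break;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] logits = {2.0, 1.0, 0.0};
        for (double t : new double[] {0.1, 1.0, 2.0}) {
            int kept = nucleusSize(softmax(logits, t), 0.9);
            System.out.println("temperature " + t + " keeps " + kept + " of 3 tokens at top-p 0.9");
        }
    }
}
```

A higher temperature flattens the distribution, so more tokens fall inside the same top-p cutoff; that is the sense in which larger values give more varied output.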

Clone the repository with all submodules

git clone https://github.com/beehive-lab/GPULlama3.java.git


#### Install the TornadoVM SDK on Linux or macOS

Ensure that your JAVA_HOME points to a supported JDK before using the SDK.
TornadoVM is distributed through our [**official website**](https://www.tornadovm.org/downloads) and **SDKMAN!**. Download an SDK package that matches your OS, architecture, and accelerator backend (opencl, ptx).

All TornadoVM SDKs are available on the [SDKMAN! TornadoVM page](https://sdkman.io/sdks/tornadovm/).

#### SDKMAN! Installation (Recommended)

##### Install SDKMAN! if not installed already

curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk version

##### Install TornadoVM via SDKMAN!

sdk install tornadovm


#### Verify TornadoVM is Installed Correctly

tornado --devices

☕ Integration with Your Java Codebase or Tools

To integrate it into your codebase, an IDE (e.g., IntelliJ), or a custom build system (such as Maven or Gradle), use the --show-command flag. This flag prints the exact Java command, with all JVM flags, that is invoked under the hood to enable execution on GPUs with TornadoVM. That makes it simple to replicate or embed the same flags in any external tool or codebase.

llama-tornado --gpu --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke" --show-command

<details> <summary>📋 Click to see the JVM configuration </summary>

/home/mikepapadim/.sdkman/candidates/java/current/bin/java \
    -server \
    -XX:+UnlockExperimentalVMOptions \
    -XX:+EnableJVMCI \
    -Xms20g -Xmx20g \
    --enable-preview \
    -Djava.library.path=/home/mikepapadim/manchester/TornadoVM/bin/sdk/lib \
    -Djdk.module.showModuleResolution=false \
    --module-path .:/home/mikepapadim/manchester/TornadoVM/bin/sdk/share/java/tornado \
    -Dtornado.load.api.implementation=uk.ac.manchester.tornado.runtime.tasks.TornadoTaskGraph \
    -Dtornado.load.runtime.implementation=uk.ac.manchester.tornado.runtime.TornadoCoreRuntime \
    -Dtornado.load.tornado.implementation=uk.ac.manchester.tornado.runtime.common.Tornado \
    -Dtornado.load.annotation.implementation=uk.ac.manchester.tornado.annotation.ASMClassVisitor \
    -Dtornado.load.annotation.parallel=uk.ac.manchester.tornado.api.annotations.Parallel \
    -Dtornado.tvm.maxbytecodesize=65536 \
    -Duse.tornadovm=true \
    -Dtornado.threadInfo=false \
    -Dtornado.debug=false \
    -Dtornado.fullDebug=false \
    -Dtornado.printKernel=false \
    -Dtornado.print.bytecodes=false \
    -Dtornado.device.memory=7GB \
    -Dtornado.profiler=false \
    -Dtornado.log.profiler=false \
    -Dtornado.profiler.dump.dir=/home/mikepapadim/repos/gpu-llama3.java/prof.json \
    -Dtornado.enable.fastMathOptimizations=true \
    -Dtornado.enable.mathOptimizations=false \
    -Dtornado.enable.nativeFunctions=fast \
    -Dtornado.loop.interchange=true \
    -Dtornado.eventpool.maxwaitevents=32000 \
    "-Dtornado.opencl.compiler.flags=-cl-denorms-are-zero -cl-no-signed-zeros -cl-finite-math-only" \
    --upgrade-module-path /home/mikepapadim/manchester/TornadoVM/bin/sdk/share/java/graalJars \
    @/home/mikepapadim/manchester/TornadoVM/bin/sdk/etc/exportLists/common-exports \
    @/home/mikepapadim/manchester/TornadoVM/bin/sdk/etc/exportLists/opencl-exports \
    --add-modules ALL-SYSTEM,tornado.runtime,tornado.annotation,tornado.drivers.common,tornado.drivers.opencl \
    -cp /home/mikepapadim/repos/gpu-llama3.java/target/gpu-llama3-1.0-SNAPSHOT.jar \
    org.beehive.gpullama3.LlamaApp \
    -m beehive-llama-3.2-1b-instruct-fp16.gguf \
    --temperature 0.1 \
    --top-p 0.95 \
    --seed 1746903566 \
    --max-tokens 512 \
    --stream true \
    --echo false \
    -p "tell me a joke" \
    --instruct

</details>
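
To reuse these settings in another tool, it can help to pull the `-D` system properties out of the printed command. The sketch below does that for a list of already-split tokens; the class name is hypothetical, and it assumes every `-D` token belongs to the JVM, which holds for the command shown above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for reusing --show-command output; not part of GPULlama3.java.
final class JvmFlagExtractor {
    // Collects -Dkey=value tokens into an ordered map; a bare -Dkey maps to "".
    static Map<String, String> systemProperties(List<String> tokens) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String token : tokens) {
            if (!token.startsWith("-D")) {
                continue; // skip non-property flags such as -server or -Xmx20g
            }
            int eq = token.indexOf('=');
            if (eq < 0) {
                props.put(token.substring(2), "");
            } else {
                props.put(token.substring(2, eq), token.substring(eq + 1));
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // A few tokens copied from the JVM configuration shown above.
        List<String> tokens = List.of(
                "-server",
                "-Xms20g",
                "-Duse.tornadovm=true",
                "-Dtornado.device.memory=7GB",
                "-Dtornado.eventpool.maxwaitevents=32000");
        systemProperties(tokens).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The resulting map can be turned into build-tool or IDE run-configuration properties while the remaining flags are copied over as-is.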

-----------

The above model can be swapped with one of the other models, such as beehive-llama-3.2-3b-instruct-fp16.gguf or beehive-llama-3.2-8b-instruct-fp16.gguf, depending on your needs. Check models below.

-----------