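After a curated evaluation, AI Skill Hub rates Worker-VLLM as "Recommended". The tool scores well on feature completeness, community activity, and ease of use, with an AI score of 7.5, and it suits users with some technical background.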
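Worker-VLLM is an open-source tool written in Python, listed under topics such as language-model, llm, and runpod. As a GitHub open-source project, it has an active community and ongoing releases, its code is fully open to audit, and it supports local deployment to protect data privacy. Whether for personal use or integration into enterprise workflows, it offers a stable and reliable solution.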
# Option 1: install with pip (recommended)
pip install worker-vllm
# Option 2: install in a virtual environment (recommended for production)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install worker-vllm
# Option 3: install from source (to get the latest features)
git clone https://github.com/runpod-workers/worker-vllm
cd worker-vllm
pip install -e .
# Verify the installation
python -c "import worker_vllm; print('Installation successful')"
# Command-line usage
worker-vllm --help
# Basic usage
worker-vllm input_file -o output_file
# Calling from Python code
import worker_vllm
# Example
result = worker_vllm.process("input")
print(result)
# Example worker-vllm configuration file (config.yml)
app:
  name: "worker-vllm"
  debug: false
  log_level: "INFO"
# Specify the configuration file at run time
worker-vllm --config config.yml
# Or configure through environment variables
export WORKER_VLLM_API_KEY="your-key"
export WORKER_VLLM_OUTPUT_DIR="./output"
🚀 Deploy Guide: Follow our step-by-step deployment guide to deploy using the Runpod Console.
📦 Docker Image: runpod/worker-v1-vllm:<version>
To build an image with the model baked in, you must specify the following docker arguments when building the image.
docker build -t username/image:tag --build-arg MODEL_NAME="openchat/openchat_3.5" --build-arg BASE_PATH="/models" .
To use the latest unreleased vLLM build (installs from the nightly wheel index and transformers from source):
docker build -t username/image:tag --build-arg VLLM_NIGHTLY=true .
You can combine it with other arguments:
docker build -t username/image:tag --build-arg VLLM_NIGHTLY=true --build-arg MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" --build-arg BASE_PATH="/models" .
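When several build arguments are combined, it can help to generate the command from a mapping rather than typing it by hand. The following is a minimal sketch, not part of worker-vllm: the helper name docker_build_command is hypothetical, and it only assembles the command string from the build arguments shown above.

```python
import shlex


def docker_build_command(image_tag, build_args, context="."):
    """Compose a `docker build` command line from a mapping of build args."""
    parts = ["docker", "build", "-t", image_tag]
    for name, value in build_args.items():
        # Each entry becomes one `--build-arg NAME=value` pair.
        parts += ["--build-arg", f"{name}={value}"]
    parts.append(context)
    # shlex.join quotes any part that contains shell-special characters.
    return shlex.join(parts)


command = docker_build_command(
    "username/image:tag",
    {
        "VLLM_NIGHTLY": "true",
        "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct",
        "BASE_PATH": "/models",
    },
)
print(command)
```

The printed string can be pasted into a shell or passed to a build script; the helper itself does not run Docker.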
Python (similar to Node.js, etc.):
1. When initializing the OpenAI client, change the api_key to your Runpod API Key and the base_url to your Runpod Serverless Endpoint URL in the following format: https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1, filling in your deployed endpoint ID. For example, if your Endpoint ID is abc1234, the URL would be https://api.runpod.ai/v2/abc1234/openai/v1.
- Before:
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
- After:
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ.get("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
)
2. Change the model parameter to your deployed model's name whenever using Completions or Chat Completions.
- Before:
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
    temperature=0,
    max_tokens=100,
)
- After:
response = client.chat.completions.create(
    model="<YOUR DEPLOYED MODEL REPO/NAME>",
    messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
    temperature=0,
    max_tokens=100,
)
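The two changes above amount to swapping the API key source and the base URL. As a small sketch, the URL format can be wrapped in a helper so the endpoint ID is filled in one place. The function names runpod_openai_base_url and client_settings are hypothetical, and the openai package is deliberately not imported so the snippet runs on its own.

```python
import os


def runpod_openai_base_url(endpoint_id: str) -> str:
    """Return the OpenAI-compatible base URL for a Runpod Serverless endpoint."""
    return f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1"


def client_settings(endpoint_id: str) -> dict:
    """Collect the keyword arguments that would be passed to OpenAI(...)."""
    return {
        "api_key": os.environ.get("RUNPOD_API_KEY"),
        "base_url": runpod_openai_base_url(endpoint_id),
    }


settings = client_settings("abc1234")
print(settings["base_url"])
# With the openai package installed: client = OpenAI(**settings)
```

Keeping the URL construction in one function avoids repeating the endpoint ID across a codebase that talks to more than one endpoint.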
Using HTTP requests:
1. Change the Authorization header to your Runpod API Key and the url to your Runpod Serverless Endpoint URL in the following format: https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1
- Before:
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "Why is Runpod the best platform?"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
  }'
- After:
curl https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR RUNPOD API KEY>" \
  -d '{
    "model": "<YOUR DEPLOYED MODEL REPO/NAME>",
    "messages": [
      {
        "role": "user",
        "content": "Why is Runpod the best platform?"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
  }'
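For code that cannot depend on the openai package, the same request can be assembled with Python's standard library. This is a rough sketch of the curl call above, assuming the same URL format and payload; build_chat_request is a hypothetical helper, and the request is built but not sent.

```python
import json
import urllib.request


def build_chat_request(endpoint_id, api_key, model, user_message):
    """Build (but do not send) a chat completions request for the endpoint."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0,
        "max_tokens": 100,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "abc1234",
    "<YOUR RUNPOD API KEY>",
    "<YOUR DEPLOYED MODEL REPO/NAME>",
    "Why is Runpod the best platform?",
)
print(request.full_url)
# To send it: urllib.request.urlopen(request) returns the HTTP response.
```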
The vLLM Worker is fully compatible with OpenAI's API, and you can use it with any OpenAI codebase by changing only 3 lines in total. The supported routes are Chat Completions, Models, Responses, and Messages, with both streaming and non-streaming support.
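To illustrate what consuming a streamed Chat Completions response can look like without the openai package, here is a minimal sketch. It assumes the worker's stream follows the server-sent events layout used by OpenAI's API (lines prefixed with data:, terminated by data: [DONE]); the function name iter_stream_text is hypothetical and the sample lines are made up for illustration, not captured from a real endpoint.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style chat completion stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Illustrative sample only; real chunks carry more fields.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample_lines)))
```

When the openai client is used with stream=True, it handles this parsing itself; the sketch only matters for raw HTTP clients.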
First, initialize the OpenAI Client with your Runpod API Key and Endpoint URL:
```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ.get("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
)
```
Configure worker-vllm using environment variables:
| Environment Variable | Description | Default | Options |
|---|---|---|---|
| MODEL_NAME | Path of the model weights | "facebook/opt-125m" | Local folder or Hugging Face repo ID |
| HF_TOKEN | HuggingFace access token for gated/private models | | Your HuggingFace access token |
| MAX_MODEL_LEN | Model's maximum context length | | Integer (e.g., 4096) |
| QUANTIZATION | Quantization method | | "awq", "gptq", "squeezellm", "bitsandbytes" |
| TENSOR_PARALLEL_SIZE | Number of GPUs | 1 | Integer |
| GPU_MEMORY_UTILIZATION | Fraction of GPU memory to use | 0.95 | Float between 0.0 and 1.0 |
| MAX_NUM_SEQS | Maximum number of sequences per iteration | 256 | Integer |
| CUSTOM_CHAT_TEMPLATE | Custom chat template override | | Jinja2 template string |
| ENABLE_AUTO_TOOL_CHOICE | Enable automatic tool selection | false | boolean (true or false) |
| TOOL_CALL_PARSER | Parser for tool calls | | "mistral", "hermes", "llama3_json", "granite", "deepseek_v3", etc. |
| OPENAI_SERVED_MODEL_NAME_OVERRIDE | Override served model name in API | | String |
| MAX_CONCURRENCY | Maximum concurrent requests | 30 | Integer |
Pass any vLLM engine arg not listed above by setting an environment variable with the UPPERCASED field name (same names vLLM uses). The worker auto-discovers all AsyncEngineArgs fields from env. For example:
| Environment Variable | vLLM Engine Arg | Example Value |
|---|---|---|
| MAX_MODEL_LEN | max_model_len | 4096 |
| ENFORCE_EAGER | enforce_eager | true |
| ENABLE_CHUNKED_PREFILL | enable_chunked_prefill | true |
Any env var whose name matches a valid AsyncEngineArgs field (uppercased) is applied automatically. Backward-compat aliases: MODEL_NAME, TOKENIZER_NAME, MAX_CONTEXT_LEN_TO_CAPTURE. This lets you configure any vLLM option without waiting for explicit worker support.
For the complete list of available environment variables, with examples and detailed descriptions, see the Configuration documentation.
If the model you would like to deploy is private or gated, you will need to provide your Hugging Face token at build time as a Docker secret, which keeps it from being exposed in the image and on DockerHub.
export DOCKER_BUILDKIT=1
export HF_TOKEN="your_token_here"
docker build -t username/image:tag --secret id=HF_TOKEN --build-arg MODEL_NAME="openchat/openchat_3.5" .
Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the vLLM Inference Engine on Runpod Serverless with just a few clicks.

Current vLLM version: 0.20.2
Check out our Load Balancer implementation here: vLLM Load Balancer
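The configuration notes describe a convention in which an environment variable named after an uppercased engine-arg field is applied automatically. The sketch below shows one way such a mapping could work; it is not the worker's actual code. EngineArgsSketch is a hypothetical stand-in for vLLM's AsyncEngineArgs with only a few fields, the max_model_len default is a placeholder, and the boolean parsing rule is an assumption.

```python
import dataclasses
import os


@dataclasses.dataclass
class EngineArgsSketch:
    """Hypothetical stand-in for AsyncEngineArgs (the real class has many more fields)."""
    model: str = "facebook/opt-125m"
    max_model_len: int = 2048  # placeholder default, for illustration only
    enforce_eager: bool = False
    gpu_memory_utilization: float = 0.95


def _parse_bool(text):
    # Assumed rule: "1", "true", or "yes" (any case) count as true.
    return text.strip().lower() in ("1", "true", "yes")


PARSERS = {bool: _parse_bool, int: int, float: float, str: str}


def engine_args_from_env(environ=None):
    """Build engine args from env vars named after the uppercased field names."""
    environ = os.environ if environ is None else environ
    overrides = {}
    for field in dataclasses.fields(EngineArgsSketch):
        raw = environ.get(field.name.upper())
        if raw is None:
            continue  # variable not set: keep the field's default
        # Pick the parser from the type of the field's default value.
        overrides[field.name] = PARSERS[type(field.default)](raw)
    return EngineArgsSketch(**overrides)


example_env = {"MAX_MODEL_LEN": "4096", "ENFORCE_EAGER": "true", "UNRELATED": "x"}
args = engine_args_from_env(example_env)
print(args)
```

Variables that do not match a field name, such as UNRELATED here, are ignored, which mirrors the idea that only valid engine-arg names are picked up.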
Path: /openai/v1/responses (full URL: https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/responses)
Supports the OpenAI Responses API request shape. Like other /openai/ routes, this is served directly: use the /openai/ prefix rather than the Runpod native job queue for these calls.
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "input": "Tell me a joke."
}
Using HTTP requests:
curl https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR RUNPOD API KEY>" \
  -d '{
    "model": "<YOUR DEPLOYED MODEL REPO/NAME>",
    "input": "Tell me a joke."
  }'
Path: /openai/v1/messages (full URL: https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/messages)
Supports the Anthropic Messages API format. Served directly, bypassing the Runpod queue.
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "max_tokens": 256,
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}
Using HTTP requests:
curl https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/messages \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR RUNPOD API KEY>" \
  -d '{
    "model": "<YOUR DEPLOYED MODEL REPO/NAME>",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
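A high-quality template for serving large language model endpoints.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute any legal endorsement of the tool's functionality or quality.
We recommend validating the tool thoroughly in a sandbox or test environment, and completing any necessary security assessment, before deploying it to production.
✅ MIT License: one of the most permissive open-source licenses. It allows free commercial use, modification, and distribution, provided the copyright notice is retained.
AI Skill Hub review: Worker-VLLM's core functionality is complete and of good quality. For AI enthusiasts, it is a worthwhile addition to a personal toolkit. We suggest trying it in a non-production environment first and then rolling it out gradually.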
| Field | Value |
|---|---|
| Original name | worker-vllm |
| Topics | language-model, llm, runpod, vllm, python |
| GitHub | https://github.com/runpod-workers/worker-vllm |
| License | MIT |
| Language | Python |
Listed: 2026-06-01 · Updated: 2026-06-01 · License: MIT · AI Skill Hub makes no legal endorsement of the accuracy of third-party content.