
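VoiceBlender Voice Control Platform

Built with Go · a programmable voice platform for AI agents
English name: voiceblender
⭐ 68 Stars 🍴 8 Forks 💻 Go 📄 MIT 🏷 AI score 8.2
Tags: Voice AI, WebRTC, real-time communication
✦ Recommended by AI Skill Hub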

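AI Skill Hub strongly recommends the VoiceBlender Voice Control Platform as a solid entry in the Agent workflow category. Its overall AI score is 8.2, a steady result among comparable tools. If you are looking for dependable voice infrastructure for agent workflows, it is worth a closer look.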

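📚 In depth
The VoiceBlender Voice Control Platform provides the voice layer for AI Agent workflows. As AI capabilities improve, agent-based automation is becoming a core way to raise individual and team productivity. Unlike traditional RPA (which simulates mouse and keyboard input), an AI Agent workflow interprets the intent of a task and plans its execution path dynamically, so it can handle more complex, unstructured work.

The design follows a "minimal configuration, maximum reuse" principle: the core logic is already packaged, and setup comes down to environment variables such as provider API keys and a few deployment parameters. Webhook delivery includes retries, which helps when a receiver is briefly unreachable.

Before deploying, run the flow 3-5 times in a test environment and check that each stage produces the expected output, then move to production. AI Skill Hub scores it 8.2 and lists it as a featured pick among Agent workflow tools.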
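📋 Tool overview

A programmable open-source voice platform that supports SIP and WebRTC call control and multi-party audio mixing. It integrates ASR and audio processing to give AI agents a real-time voice interaction layer, and suits developers who need to build complex voice workflows.

Call flows are driven through a REST API and real-time webhooks rather than a visual editor, so multi-step sequences (answer, join a room, attach an agent, record) can run unattended. Integrations named in the README include ElevenLabs, VAPI, Pipecat, Deepgram, Google Cloud, AWS Polly, and S3.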

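GitHub Stars: ⭐ 68
Language: Go
Supported platforms: Windows / macOS / Linux (cross-platform)
Maintenance status: lightweight project, updated as needed
License: MIT
Overall AI score: 8.2
Tool type: Agent workflow
Forks: 8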
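📖 Documentation summary
The content below was compiled automatically by AI Skill Hub from project information. For the complete original documentation, see "Original source" at the bottom.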

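📌 Core features
  • Programmable call control for SIP and WebRTC legs through a REST API
  • Multi-step call flows (answer, room, agent, recording) that can run unattended
  • Integration with external TTS, STT, and agent providers, plus optional S3 upload for recordings
  • Webhook delivery with HMAC-SHA256 signing and retry
  • Bundled examples (a Python webhook listener, a browser WebRTC client) to start from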
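🎯 Main use cases
  • Automating routine, repetitive work so that effort can go to creative tasks
  • Building complete capture → process → output automation pipelines
  • Moving data and coordinating business processes across platforms and systems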
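The install commands below were generated automatically from the project's language and type; the official README takes precedence.
Install commands
# Option 1: go install (assumes the module path matches the repository URL)
go install github.com/VoiceBlender/voiceblender/cmd/voiceblender@latest

# Option 2: build from source
git clone https://github.com/VoiceBlender/voiceblender
cd voiceblender
go build -o voiceblender ./cmd/voiceblender

# Option 3: download a prebuilt binary
# Visit the Releases page and download the binary for your platform
# https://github.com/VoiceBlender/voiceblender/releases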
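📋 Installation steps
  1. Clone the GitHub repository
  2. Build the binary with go build -o voiceblender ./cmd/voiceblender
  3. Set the environment variables and provider API keys your deployment needs
  4. Start the service with ./voiceblender
  5. Place a test call and confirm the flow works before putting it into use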
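The usage examples below were compiled by AI Skill Hub and cover the most common scenarios.
Common commands / code examples
# Start the service (configuration is read from environment variables)
./voiceblender

# For detailed usage, see the documentation
# https://github.com/VoiceBlender/voiceblender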
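The configuration example below was generated from typical usage; check the values against the official documentation.
Configuration example
# voiceblender is configured through environment variables
export HTTP_ADDR=":8080"
export LOG_LEVEL="info"
export RECORDING_DIR="/tmp/recordings"
export DEFAULT_SAMPLE_RATE="16000"

# Optional provider keys, set only for the providers you use
# export ELEVENLABS_API_KEY="<your-key>"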
📑 README in depth · source documentation · completeness 90/100
The content below was parsed directly from the GitHub README, keeping its code blocks, tables, and lists.

VoiceBlender

A Go service that bridges SIP and WebRTC voice calls with multi-party audio mixing, a REST API, and real-time webhooks.

Join our Discord

Features

  • SIP inbound & outbound -- receive and originate SIP calls with codec negotiation (PCMU, PCMA, G.722, Opus), digest auth, session timers (RFC 4028)
  • SIP over TLS -- optional TLS transport on a second port alongside UDP, reusable by classic SIP trunks and required by WhatsApp
  • Early media -- SIP 183 Session Progress with SDP for pre-answer audio (custom ringback, IVR)
  • Hold/unhold -- SIP re-INVITE with sendonly/sendrecv direction
  • WebRTC -- browser-based voice via SDP offer/answer with trickle ICE
  • WhatsApp Business Calling -- inbound and outbound calls over SIP-TLS + ICE/DTLS-SRTP + Opus
  • WebSocket legs -- inbound (HTTP upgrade) and outbound (dial) PCM-over-WebSocket legs with binary or json_base64 framing, configurable sample rate (8/16/24/48 kHz), bidirectional text, and caller-supplied X-/P- headers — designed to also back a future generic Agent API
  • MoQ legs (experimental, PoC) -- inbound Media-over-QUIC legs over WebTransport/HTTP/3 with Opus framed one frame per MoQ Object (LOC-style). Tracks mengelbart/moqtransport (IETF draft-11); browser interop with draft-16 clients (moqtail, moq.dev) is not expected to work out of the box. Disabled by default; enable with MOQ_ENABLED=true + MOQ_TLS_CERT_FILE / MOQ_TLS_KEY_FILE
  • Multi-party rooms -- mix N participants with mixed-minus-self audio at a configurable sample rate (8 kHz, 16 kHz, or 48 kHz per room; default 16 kHz)
  • Room bridging -- join two rooms' mixers (same sample rate) with live-configurable direction (bidirectional, one-way each way, or parked); echo-free via mixed-minus-self
  • Audio routing matrix -- per-room role-based routing for asymmetric audio (barge-in / whisper / supervisor monitor). Tag legs with a free-form role and declare a matrix of who-hears-whom by role. Applied atomically at leg-join time so a supervisor cannot momentarily bleed into the customer's audio. See API.md.
  • WebSocket room access -- join rooms from any client over a WebSocket with base64 PCM frames
  • DTMF -- send and receive RFC 4733 telephone-events
  • Real-Time Text (RTT) -- ITU-T T.140 over RTP per RFC 4103 with RFC 2198 redundancy
  • Recording -- stereo WAV recording per-leg or per-room, multi-channel per-participant tracks, pause/resume (writes silence to preserve timeline while sensitive data is exchanged), optional S3 upload
  • Playback -- stream WAV/MP3 audio or built-in telephone tones into legs or rooms
  • TTS -- text-to-speech into legs or rooms (ElevenLabs, Google Cloud, AWS Polly)
  • STT -- real-time speech-to-text with partial transcripts (ElevenLabs)
  • AI Agent -- attach a conversational AI agent to a leg or room (ElevenLabs, VAPI, Pipecat, Deepgram) with mid-session context injection
  • Answering Machine Detection (AMD) -- per-call analysis of outbound call audio to classify the answerer as human, machine, no-speech, or not-sure; optional voicemail beep detection via Goertzel frequency analysis
  • Webhooks -- real-time event delivery with HMAC-SHA256 signing and retry; typed event data with CDR-style leg.disconnected (disposition, timing, quality)
  • WebSocket event stream (VSI) -- GET /v1/vsi streams all events and accepts commands (mute, hold, DTMF, room management) over a single persistent WebSocket; filter by app_id regex for multi-tenant isolation
  • Prometheus metrics -- operational metrics exposed at GET /metrics (active legs/rooms, call durations, disconnect reasons, Go runtime). See API.md for the full metric reference. Profiling via go tool pprof is available at /debug/pprof/ when built with -tags pprof.
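
The Webhooks feature above mentions HMAC-SHA256 signing. A receiver would normally recompute the digest over the raw request body with the shared secret and compare it in constant time. The sketch below shows that check only. The hex encoding, the secret, and the payload shape are assumptions for illustration, and the header that carries the signature is not named here, so check API.md for the actual format.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of a raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_hex: str) -> bool:
    """Compare a received signature with a locally computed one in constant time."""
    expected = sign_body(secret, body)
    return hmac.compare_digest(expected, received_hex)


if __name__ == "__main__":
    secret = b"example-shared-secret"  # hypothetical secret
    body = b'{"type":"leg.ringing","leg_id":"abc"}'  # hypothetical payload shape
    signature = sign_body(secret, body)
    print(verify_signature(secret, body, signature))         # untouched body
    print(verify_signature(secret, body + b" ", signature))  # tampered body
```

Verifying against the raw bytes, before any JSON parsing, avoids mismatches caused by re-serialisation.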

Capabilities

  • Inbound -- Meta-originated INVITEs are auto-routed to a WhatsApp handler when the From URI host ends in meta.vc. The leg comes up in ringing, fires leg.ringing (leg_type: "whatsapp_in"), and waits for POST /v1/legs/{id}/answer. The 200 OK then carries the pre-gathered ICE/DTLS-SRTP answer.
  • Outbound -- POST /v1/legs {"type":"whatsapp", ...} returns 201 immediately with the leg in ringing. ICE gathering, the digest 401/407 round-trip, and the SDP-answer apply happen asynchronously; outcome is signalled via leg.connected or leg.disconnected.
  • Audio -- full-duplex Opus at 48 kHz with mixed-minus-self room participation, recording, TTS, STT, agent attachment, speaking detection, playback. The mixer auto-resamples between WhatsApp's 48 kHz and your room's configured rate.
  • DTMF -- inbound RFC 4733 telephone-events are decoded and emitted as dtmf.received plus the standard cross-leg broadcast.
  • Webhooks + WebSocket events -- leg.ringing / leg.connected / leg.disconnected / dtmf.received / speaking.started / speaking.stopped all carry leg_type set to whatsapp_in or whatsapp_out so multi-tenant filtering works as it does for SIP and WebRTC legs.
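
"Mixed-minus-self" means each participant hears the sum of everyone else's audio but not their own, which is what keeps a room echo-free. The sketch below illustrates the idea on 16-bit PCM frames. It is not VoiceBlender's mixer: it assumes all frames have the same length and sample rate, and it clips by saturation, which is one common choice among several.

```python
from typing import Dict, List

INT16_MIN, INT16_MAX = -32768, 32767


def mixed_minus_self(frames: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """For each leg, return the sum of all other legs' samples, clipped to int16."""
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    if any(len(f) != length for f in frames.values()):
        raise ValueError("all frames must have the same number of samples")
    # Sum every leg once, then subtract each leg's own contribution.
    total = [sum(samples) for samples in zip(*frames.values())]
    out: Dict[str, List[int]] = {}
    for leg_id, own in frames.items():
        out[leg_id] = [
            max(INT16_MIN, min(INT16_MAX, t - s)) for t, s in zip(total, own)
        ]
    return out


if __name__ == "__main__":
    room = {"alice": [100, 200], "bob": [10, 20], "carol": [1, 2]}
    for leg, mix in mixed_minus_self(room).items():
        print(leg, mix)
```

Summing once and subtracting per leg keeps the cost linear in the number of participants instead of quadratic.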

Integration tests (requires two SIP instances)

go test -tags integration -v -timeout 60s ./tests/integration/

Dependencies

| Library | Description | Notes |
| --- | --- | --- |
| [sipgo](https://github.com/emiago/sipgo) | SIP stack | Excellent SIP stack in go |
| [pion/webrtc](https://github.com/pion/webrtc) | WebRTC | Nothing is better than Pion |
| [go-chi](https://github.com/go-chi/chi) | HTTP router | |
| [zaf/g711](https://github.com/zaf/g711) | G.711 codec | |
| [gobwas/ws](https://github.com/gobwas/ws) | WebSocket | |
| [go-audio/wav](https://github.com/go-audio/wav) | WAV encoding | |
| [gopus](https://github.com/thesyncim/gopus) | Opus codec | Thanks Marcelo! (Claude and Codex too!) |
| [go-mp3](https://github.com/hajimehoshi/go-mp3) | MP3 decoder | Pure Go |
| [go-audio/audio](https://github.com/go-audio/audio) | Audio buffer types | |
| [google/uuid](https://github.com/google/uuid) | UUID generation | |
| [prometheus/client_golang](https://github.com/prometheus/client_golang) | Prometheus metrics | |
| [aws-sdk-go-v2](https://github.com/aws/aws-sdk-go-v2) | AWS SDK (S3, Polly) | |
| [cloud.google.com/go/texttospeech](https://cloud.google.com/go/docs/reference/cloud.google.com/go/texttospeech/latest) | Google Cloud TTS | |
| [protobuf](https://github.com/protocolbuffers/protobuf-go) | Protocol Buffers | Pipecat agent |
| [x/sync](https://pkg.go.dev/golang.org/x/sync) | Concurrency utilities | |

Build and run

go build -o voiceblender ./cmd/voiceblender
./voiceblender

Examples

| Example | Description |
| --- | --- |
| [examples/call_handler.py](examples/call_handler.py) | Python webhook listener for inbound SIP calls with room conferencing |
| [examples/webrtc-client/](examples/webrtc-client/) | Browser-based WebRTC voice client with room management and DTMF |
| [examples/gen_test_wav.py](examples/gen_test_wav.py) | Generate test WAV files for playback testing |

Configuration

All configuration is via environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| `INSTANCE_ID` | *(auto-generated UUID)* | Instance identifier, included in API responses and webhooks |
| `HTTP_ADDR` | `:8080` | REST API listen address |
| `SIP_BIND_IP` | `127.0.0.1` | IPv4 address advertised in SDP/Contact/Via headers (and used as the listen address when SIP_LISTEN_IP is empty). Set to 0.0.0.0 for v4 wildcard, :: for dual-stack on Linux when bindv6only=0. |
| `SIP_LISTEN_IP` | *(same as SIP_BIND_IP)* | UDP socket bind IP. Accepts 127.0.0.1, 0.0.0.0, ::, or any literal v4/v6 address. |
| `SIP_BIND_IPV6` | *(empty = v4-only)* | IPv6 address advertised in SDP/Contact/Via for IPv6 calls. Set this for IPv6-only or dual-stack deployments. |
| `SIP_LISTEN_IPV6` | *(same as SIP_BIND_IPV6)* | Optional separate IPv6 socket bind address (e.g. when running with both 0.0.0.0 and a specific v6 literal). |
| `SIP_PORT` | `5060` | SIP listen port (UDP) |
| `SIP_TLS_PORT` | *(disabled)* | SIP-over-TLS listen port (typically 5061). When set, SIP_TLS_CERT and SIP_TLS_KEY must also be provided. Required for WhatsApp Business Calling integration. |
| `SIP_TLS_CERT` | | Path to PEM-encoded TLS certificate (e.g. fullchain.pem). Meta rejects self-signed certs; use a CA-signed cert matching a public FQDN. |
| `SIP_TLS_KEY` | | Path to PEM-encoded TLS private key (e.g. privkey.pem). |
| `SIP_DEBUG` | `false` | When true, log the full RFC 3261 wire form of every inbound and outbound SIP request and response. Very verbose; use only for troubleshooting. |
| `SIP_DOMAIN` | *(falls back to advertised IP)* | FQDN advertised in From, Contact and Via on **all** outbound SIP signalling (classic trunks and WhatsApp). Should match the SAN on SIP_TLS_CERT and any allowlist your carrier or Meta keeps. |
| `SIP_HOST` | `voiceblender` | SIP User-Agent name |
| `ICE_SERVERS` | `stun:stun.l.google.com:19302` | STUN/TURN URLs (comma-separated) |
| `RECORDING_DIR` | `/tmp/recordings` | Local recording output directory |
| `LOG_LEVEL` | `info` | Log level (debug, info, warn, error) |
| `WEBHOOK_URL` | | Default webhook URL for inbound calls |
| `ELEVENLABS_API_KEY` | | API key for ElevenLabs TTS, STT, and Agent |
| `VAPI_API_KEY` | | API key for VAPI Agent provider |
| `DEEPGRAM_API_KEY` | | API key for Deepgram STT and TTS |
| `AZURE_SPEECH_KEY` | | Subscription key for Azure Cognitive Speech Services (TTS and STT) |
| `AZURE_SPEECH_REGION` | `eastus` | Azure region for Speech Services (e.g. eastus, westeurope) |
| `S3_BUCKET` | | S3 bucket for recording uploads |
| `S3_REGION` | `us-east-1` | AWS region |
| `S3_ENDPOINT` | | Custom S3 endpoint (MinIO, etc.) |
| `S3_PREFIX` | | Key prefix for S3 objects |
| `TTS_CACHE_ENABLED` | `false` | Enable disk-backed TTS audio cache. Cached audio persists across restarts. |
| `TTS_CACHE_DIR` | `/tmp/tts_cache` | Directory for cached TTS audio files (used when TTS_CACHE_ENABLED=true) |
| `TTS_CACHE_INCLUDE_API_KEY` | `false` | Include API key in TTS cache key (set true if different keys map to different voice clones) |
| `RTP_PORT_MIN` | `10000` | Minimum UDP port for RTP/RTCP media |
| `RTP_PORT_MAX` | `20000` | Maximum UDP port for RTP/RTCP media |
| `SIP_JITTER_BUFFER_MS` | `0` | SIP ingress jitter buffer target delay in ms (0 = disabled passthrough). Applies to every SIP leg. |
| `SIP_JITTER_BUFFER_MAX_MS` | `300` | Max depth of the SIP ingress jitter buffer (ms); frames beyond this are dropped oldest-first. |
| `SIP_EXTERNAL_IP` | *(empty)* | Public IPv4 address for NAT/Docker deployments. When set, used in SIP Contact headers and SDP media (c=) lines instead of the auto-detected or bind IP. IPv6 has no equivalent: set SIP_BIND_IPV6 directly to the address you want advertised. |
| `DEFAULT_SAMPLE_RATE` | `16000` | Default mixer sample rate (Hz) for new rooms when sample_rate is not specified. Allowed: 8000, 16000, 48000. |
| `SIP_REFER_AUTO_DIAL` | `false` | Accept incoming SIP REFER requests and auto-dial the transferred call. **Default-deny** (toll-fraud risk). Outbound transfers via the REST API are unaffected. |
| `SIP_AUTO_RINGING` | `false` | **Behavior change vs prior releases**: previously the server always sent 180 Ringing after 100 Trying. The new default sends only 100 Trying; the API caller drives ringing explicitly via POST /v1/legs/{id}/ring, /early-media, or /answer. Set to true to restore the legacy auto-180 behavior. |
| `SIP_USE_SOURCE_SOCKET` | `false` | When true, route SIP responses **and** in-dialog requests (BYE, re-INVITE, INFO, NOTIFY, REFER) back to the request's source UDP socket instead of the peer's Contact URI / Via sent-by. Enable when peers advertise unroutable addresses (e.g. private IPs in Contact from behind NAT, or Via sent-by hosts that don't resolve). Equivalent to sipgo's DialogUA.RewriteContact plus per-response SetDestination(req.Source()). |
| `SPEECH_DETECTION_ENABLED` | `false` | Emit speaking.started / speaking.stopped events for every connected leg by default. Per-call speech_detection on POST /v1/legs or POST /v1/legs/{id}/answer overrides this. |
| `VSI_EVENT_BUFFER_SIZE` | `256` | Per-client buffer (in events) on the /v1/vsi WebSocket. When the client consumes events slower than they're produced, the buffer fills and new events are dropped (with a warn log on the leading edge of each drop burst and at every 10× threshold; the next delivered event also includes an events_dropped notification to the client). Clamped to [16, 1_000_000]. **Tuning:** larger values absorb longer back-pressure spikes at the cost of higher peak memory per client (roughly the average JSON event size × buffer size, e.g. ~1 KB × 256 ≈ 256 KB per connection at the default) and longer end-to-end latency for buffered events when the client recovers. Increase only if you observe drops on legitimate slow-consumer scenarios you can't fix at the client. |
| `MOQ_ENABLED` | `false` | Enable the experimental MoQ (Media over QUIC) inbound leg endpoint at CONNECT /v1/legs/moq over WebTransport/HTTP/3. PoC quality: tracks IETF draft-11 via mengelbart/moqtransport, single MoQ session per leg, Opus framed one frame per MoQ Object (LOC-style). When enabled, both MOQ_TLS_CERT_FILE and MOQ_TLS_KEY_FILE must be set. |
| `MOQ_LISTEN_ADDR` | `:8443` | UDP address for the HTTP/3 listener that backs the MoQ leg. Independent of HTTP_ADDR: TCP/:8080 and UDP/:8443 can run side-by-side. |
| `MOQ_TLS_CERT_FILE` | _(none)_ | Path to the TLS certificate used by the HTTP/3 listener. Required when MOQ_ENABLED=true. |
| `MOQ_TLS_KEY_FILE` | _(none)_ | Path to the TLS private key used by the HTTP/3 listener. Required when MOQ_ENABLED=true. |
| `MOQ_OPUS_BITRATE` | `24000` | Target bitrate (bps) for the Opus encoder feeding the MoQ leg's mix track. Must be in 6000..510000. |
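
Several of the variables above carry constraints: allowed sample rates, a clamped buffer size, a bitrate range, and certificate/key pairs that must accompany a TLS listener. A pre-flight check along the following lines can catch mistakes before the service starts. This is a hypothetical helper, not the service's own parser: `check_env` and its return keys are made-up names, and how VoiceBlender itself reacts to an invalid value is not described here.

```python
import os
from typing import Dict, Mapping

ALLOWED_SAMPLE_RATES = (8000, 16000, 48000)


def check_env(env: Mapping[str, str]) -> Dict[str, int]:
    """Validate a few documented constraints; raise ValueError on a violation."""
    rate = int(env.get("DEFAULT_SAMPLE_RATE", "16000"))
    if rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"DEFAULT_SAMPLE_RATE must be one of {ALLOWED_SAMPLE_RATES}")

    # The table states that this value is clamped to [16, 1_000_000].
    buf = max(16, min(1_000_000, int(env.get("VSI_EVENT_BUFFER_SIZE", "256"))))

    bitrate = int(env.get("MOQ_OPUS_BITRATE", "24000"))
    if not 6000 <= bitrate <= 510000:
        raise ValueError("MOQ_OPUS_BITRATE must be in 6000..510000")

    # TLS listeners need both a certificate and a key.
    if env.get("SIP_TLS_PORT") and not (
        env.get("SIP_TLS_CERT") and env.get("SIP_TLS_KEY")
    ):
        raise ValueError("SIP_TLS_PORT requires SIP_TLS_CERT and SIP_TLS_KEY")
    if env.get("MOQ_ENABLED", "false").lower() == "true" and not (
        env.get("MOQ_TLS_CERT_FILE") and env.get("MOQ_TLS_KEY_FILE")
    ):
        raise ValueError("MOQ_ENABLED=true requires MOQ_TLS_CERT_FILE and MOQ_TLS_KEY_FILE")

    return {"sample_rate": rate, "vsi_buffer": buf, "moq_bitrate": bitrate}


if __name__ == "__main__":
    print(check_env(os.environ))
```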

VoiceBlender configuration

Set these env vars before starting voiceblender:

| Variable | Value |
| --- | --- |
| SIP_TLS_PORT | 5061 |
| SIP_TLS_CERT | path to fullchain.pem for your FQDN |
| SIP_TLS_KEY | path to privkey.pem |
| SIP_DOMAIN | the FQDN you registered with Meta (must match the cert SAN) |

Make a test outbound call:

curl -X POST http://localhost:8080/v1/legs \
  -H 'Content-Type: application/json' \
  -d '{
    "type": "whatsapp",
    "to": "+447900000000",
    "from": "+441300000000",
    "auth": { "password": "<meta-issued-digest-password>" },
    "room_id": "wa-test"
  }'

The HTTP response returns immediately with the leg in ringing; subscribe to the webhook or /v1/vsi event stream to see leg.connected (or leg.disconnected with a reason if Meta rejects the INVITE).
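
The same request can be built from Python with only the standard library. The sketch below mirrors the curl payload field for field. `build_whatsapp_call` is a hypothetical helper name, and the response body is not parsed because its shape is not shown in this section.

```python
import json
import urllib.request


def build_whatsapp_call(base_url: str, to: str, from_: str,
                        password: str, room_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/legs request for a WhatsApp leg."""
    payload = {
        "type": "whatsapp",
        "to": to,
        "from": from_,
        "auth": {"password": password},
        "room_id": room_id,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/legs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_whatsapp_call("http://localhost:8080", "+447900000000",
                              "+441300000000", "<meta-issued-digest-password>",
                              "wa-test")
    print(req.get_method(), req.full_url)
    # To send it against a running instance:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status, resp.read().decode())
```

Because the call returns while the leg is still ringing, the outcome has to be read from leg.connected or leg.disconnected events rather than from the HTTP response.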

API Overview

Full reference: API.md

Typical Workflow

1. Register a webhook        POST /v1/webhooks
2. Receive inbound call      --> webhook: leg.ringing {leg_id, from, to}
3. Answer                    POST /v1/legs/{id}/answer
4. Create a room             POST /v1/rooms
5. Add legs to room          POST /v1/rooms/{id}/legs
6. Attach AI agent           POST /v1/legs/{id}/agent
7. Start recording           POST /v1/legs/{id}/record
8. Hang up                   DELETE /v1/legs/{id}
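
As a rough illustration, a webhook receiver could turn a leg.ringing event into the sequence of calls listed above. The sketch only produces (method, path) pairs and sends nothing. Request bodies are omitted because they are documented in API.md rather than here, the `type` field used to identify the event is an assumption, and `room_id` stands for whatever identifier the room creation step yields.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str]  # (HTTP method, path)


def inbound_call_steps(leg_id: str, room_id: str) -> List[Step]:
    """Steps 3-7 of the typical workflow for one inbound leg."""
    return [
        ("POST", f"/v1/legs/{leg_id}/answer"),
        ("POST", "/v1/rooms"),
        ("POST", f"/v1/rooms/{room_id}/legs"),
        ("POST", f"/v1/legs/{leg_id}/agent"),
        ("POST", f"/v1/legs/{leg_id}/record"),
    ]


def hangup_step(leg_id: str) -> Step:
    """Step 8: hang up the leg."""
    return ("DELETE", f"/v1/legs/{leg_id}")


def handle_event(event: Dict[str, str], room_id: str) -> List[Step]:
    """Return the calls to make for one webhook event; only leg.ringing triggers work."""
    if event.get("type") != "leg.ringing":  # field name is an assumption
        return []
    return inbound_call_steps(event["leg_id"], room_id)


if __name__ == "__main__":
    event = {"type": "leg.ringing", "leg_id": "leg-1", "from": "alice", "to": "bob"}
    for method, path in handle_event(event, "room-1") + [hangup_step("leg-1")]:
        print(method, path)
```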

Troubleshooting

  • 403 SIP server X.X.X.X from INVITE does not match any SIP server configured for phone number ... -- SIP_DOMAIN doesn't match what's registered with Meta. Set it to the FQDN, not the IP, and confirm via the GET /settings query above.
  • 404 Not Found on outbound -- usually means the recipient phone number isn't a valid WhatsApp user, or the destination URI is malformed. Confirm the digits in to are the actual user's E.164 number.
  • Call connects but Meta sends BYE after 20 s with Reason: ... not receiving any media for a long time -- your audio path (RTP/UDP egress) is being dropped before reaching Meta. Check firewall rules for outbound UDP from the RTP_PORT_MIN to RTP_PORT_MAX range and that ICE-srflx candidates are correct.
  • DTLS handshake stalls -- Meta's offer is setup:actpass + ice-lite, and they don't initiate DTLS. VoiceBlender forces setup:active automatically; if you see pcmedia: DTLS state state=connecting for >5 s, run with LOG_LEVEL=debug and inspect pion's DTLS scope for the actual error.
  • Set SIP_DEBUG=true to log the full RFC 3261 wire form of every SIP message, including the auth-bearing retry after the 401/407 challenge — that's the most useful diagnostic for any signalling-layer issue.
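🎯 aiskill88 AI review · Grade A · 2026-05-26

aiskill88 review: solid low-level capabilities. It combines traditional telephony protocols with modern AI voice streams, which makes it a strong base for building high-performance voice agents.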

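👥 Who it is for
  • Automation engineers and operations staff
  • Project managers and business analysts
  • Professionals who want to cut down on repetitive work
  • Digital transformation teams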
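⚖️ Pros and cons
✅ Pros
  • MIT license, free for commercial use
  • Greatly reduces repetitive manual work
  • Documented REST API with webhooks and a WebSocket event stream
  • Highly extensible, supports complex scenarios
⚠️ Cons
  • Initial configuration and debugging take some time
  • Depends heavily on the stability of external services
  • Complex scenarios require some technical background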
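⚠️ Usage notice

AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and is not a legal endorsement of the tool's functionality or quality.

Validate the tool thoroughly in a sandbox or test environment before deploying it to production, and carry out any necessary security review.

📄 License

✅ MIT license: one of the most permissive open-source licenses. It allows free commercial use, modification, and distribution, provided the copyright notice is kept.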

❓ FAQ
It mainly supports the SIP and WebRTC protocols, which it uses for real-time voice call control.
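💡 AI Skill Hub review

Overall, the VoiceBlender Voice Control Platform is a high-quality Agent workflow tool that holds its own against comparable projects. AI Skill Hub will keep tracking its updates. Bookmark it and adopt it when it fits your own scenario.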

🌐 Original information
Original name: voiceblender
Topics: Voice AI, WebRTC, real-time communication
GitHub: https://github.com/VoiceBlender/voiceblender
License: MIT
Language: Go
🔗 Original source
🐙 GitHub repository: https://github.com/VoiceBlender/voiceblender
🌐 Official website: https://voiceblender.org

Listed: 2026-05-26 · Updated: 2026-05-26 · License: MIT · AI Skill Hub does not legally endorse the accuracy of third-party content.