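AI Skill Hub strongly recommends VoiceBlender, a voice control platform listed in the Agent workflow category. It has an AI composite score of 8.2 and performs steadily among comparable tools. If you are looking for a dependable Agent workflow solution, it is worth a closer look.
VoiceBlender is a programmable, open-source voice platform that supports SIP and WebRTC call control and multi-party audio mixing. It integrates ASR and audio processing, aims to give AI agents a real-time voice interaction infrastructure, and suits developers who need to build complex voice workflows.
VoiceBlender is a complete AI Agent automation workflow solution. Through visual node orchestration, it breaks complex multi-step tasks into clear automated flows and enables fully unattended intelligent processing. It supports seamless integration with hundreds of external services and APIs, and is suited to building data processing pipelines, business automation, and AI-assisted decision systems.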
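```bash
# Option 1: go install (recommended)
# The main package is built from ./cmd/voiceblender, so the install path includes it.
go install github.com/VoiceBlender/voiceblender/cmd/voiceblender@latest

# Option 2: build from source
git clone https://github.com/VoiceBlender/voiceblender
cd voiceblender
go build -o voiceblender ./cmd/voiceblender

# Option 3: download a prebuilt binary
# Download the binary for your platform from the Releases page:
# https://github.com/VoiceBlender/voiceblender/releases
```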
```bash
# Show help
voiceblender --help

# Basic run
voiceblender [options] <input>

# For detailed usage, see the documentation:
# https://github.com/VoiceBlender/voiceblender
```
```bash
# VoiceBlender configuration
# All settings are read from environment variables (see the variable table in this document).

# Common settings, shown with their documented defaults
export HTTP_ADDR=":8080"
export LOG_LEVEL="info"
export RECORDING_DIR="/tmp/recordings"
```
A Go service that bridges SIP and WebRTC voice calls with multi-party audio mixing, a REST API, and real-time webhooks.
- […] `json_base64` framing, configurable sample rate (8/16/24/48 kHz), bidirectional text, and caller-supplied `X-`/`P-` headers; designed to also back a future generic Agent API.
- […] `mengelbart/moqtransport` (IETF draft-11); browser interop with draft-16 clients (moqtail, moq.dev) is not expected to work out of the box. Disabled by default; enable with `MOQ_ENABLED=true` plus `MOQ_TLS_CERT_FILE` / `MOQ_TLS_KEY_FILE`.
- […] `role` and declare a matrix of who-hears-whom by role. Applied atomically at leg-join time so a supervisor cannot momentarily bleed into the customer's audio. See API.md.
- […] `leg.disconnected` (disposition, timing, quality).
- `GET /v1/vsi` streams all events and accepts commands (mute, hold, DTMF, room management) over a single persistent WebSocket; filter by `app_id` regex for multi-tenant isolation.
- `GET /metrics` (active legs/rooms, call durations, disconnect reasons, Go runtime). See API.md for the full metric reference. Profiling via `go tool pprof` is available at `/debug/pprof/` when built with `-tags pprof`.
- […] `meta.vc`. The leg comes up in `ringing`, fires `leg.ringing` (`leg_type: "whatsapp_in"`), and waits for `POST /v1/legs/{id}/answer`. The 200 OK then carries the pre-gathered ICE/DTLS-SRTP answer.
- `POST /v1/legs {"type":"whatsapp", ...}` returns 201 immediately with the leg in `ringing`. ICE gathering, the digest 401/407 round-trip, and the SDP-answer apply happen asynchronously; the outcome is signalled via `leg.connected` or `leg.disconnected`.
- […] `dtmf.received` plus the standard cross-leg broadcast.
- `leg.ringing` / `leg.connected` / `leg.disconnected` / `dtmf.received` / `speaking.started` / `speaking.stopped` all carry `leg_type` set to `whatsapp_in` or `whatsapp_out`, so multi-tenant filtering works as it does for SIP and WebRTC legs.

Integration tests: `go test -tags integration -v -timeout 60s ./tests/integration/`
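Because every event carries `leg_type`, a webhook or VSI consumer can route events by transport. The following is a minimal Python sketch of that idea, not the project's own code. It assumes each event arrives as a decoded JSON object and that the event name is stored under a top-level `type` key; that key name and the `route_event` helper are hypothetical, and API.md remains the authority on the real payload shape.

```python
from typing import Any, Dict

# The two leg_type values named in the feature list above.
WHATSAPP_LEG_TYPES = {"whatsapp_in": "inbound", "whatsapp_out": "outbound"}


def route_event(event: Dict[str, Any]) -> str:
    """Return a routing label for one decoded event.

    Assumption: the event name lives under a "type" key. Only the
    "leg_type" field name is taken from the text above.
    """
    name = event.get("type", "unknown")
    direction = WHATSAPP_LEG_TYPES.get(event.get("leg_type"))
    if direction is None:
        # SIP, WebRTC, or any other leg type: not WhatsApp-specific.
        return f"other:{name}"
    return f"whatsapp-{direction}:{name}"
```

A consumer could use the returned label to pick a queue or handler per tenant or per transport without inspecting the rest of the payload.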
| Library | Description | Notes |
|---|---|---|
| [sipgo](https://github.com/emiago/sipgo) | SIP stack | Excellent SIP stack in go |
| [pion/webrtc](https://github.com/pion/webrtc) | WebRTC | Nothing is better than Pion |
| [go-chi](https://github.com/go-chi/chi) | HTTP router | |
| [zaf/g711](https://github.com/zaf/g711) | G.711 codec | |
| [gobwas/ws](https://github.com/gobwas/ws) | WebSocket | |
| [go-audio/wav](https://github.com/go-audio/wav) | WAV encoding | |
| [gopus](https://github.com/thesyncim/gopus) | Opus codec | Thanks Marcelo! (Claude and Codex too!) |
| [go-mp3](https://github.com/hajimehoshi/go-mp3) | MP3 decoder | Pure Go |
| [go-audio/audio](https://github.com/go-audio/audio) | Audio buffer types | |
| [google/uuid](https://github.com/google/uuid) | UUID generation | |
| [prometheus/client_golang](https://github.com/prometheus/client_golang) | Prometheus metrics | |
| [aws-sdk-go-v2](https://github.com/aws/aws-sdk-go-v2) | AWS SDK (S3, Polly) | |
| [cloud.google.com/go/texttospeech](https://cloud.google.com/go/docs/reference/cloud.google.com/go/texttospeech/latest) | Google Cloud TTS | |
| [protobuf](https://github.com/protocolbuffers/protobuf-go) | Protocol Buffers | Pipecat agent |
| [x/sync](https://pkg.go.dev/golang.org/x/sync) | Concurrency utilities | |
```bash
go build -o voiceblender ./cmd/voiceblender
./voiceblender
```
| Example | Description |
|---|---|
| [examples/call_handler.py](examples/call_handler.py) | Python webhook listener for inbound SIP calls with room conferencing |
| [examples/webrtc-client/](examples/webrtc-client/) | Browser-based WebRTC voice client with room management and DTMF |
| [examples/gen_test_wav.py](examples/gen_test_wav.py) | Generate test WAV files for playback testing |
All configuration is via environment variables:
| Variable | Default | Description |
|---|---|---|
| `INSTANCE_ID` | *(auto-generated UUID)* | Instance identifier, included in API responses and webhooks |
| `HTTP_ADDR` | `:8080` | REST API listen address |
| `SIP_BIND_IP` | `127.0.0.1` | IPv4 address advertised in SDP/Contact/Via headers (and used as the listen address when `SIP_LISTEN_IP` is empty). Set to `0.0.0.0` for v4 wildcard, `::` for dual-stack on Linux when `bindv6only=0`. |
| `SIP_LISTEN_IP` | *(same as `SIP_BIND_IP`)* | UDP socket bind IP. Accepts `127.0.0.1`, `0.0.0.0`, `::`, or any literal v4/v6 address. |
| `SIP_BIND_IPV6` | *(empty = v4-only)* | IPv6 address advertised in SDP/Contact/Via for IPv6 calls. Set this for IPv6-only or dual-stack deployments. |
| `SIP_LISTEN_IPV6` | *(same as `SIP_BIND_IPV6`)* | Optional separate IPv6 socket bind address (e.g. when running with both `0.0.0.0` and a specific v6 literal). |
| `SIP_PORT` | `5060` | SIP listen port (UDP) |
| `SIP_TLS_PORT` | *(disabled)* | SIP-over-TLS listen port (typically 5061). When set, `SIP_TLS_CERT` and `SIP_TLS_KEY` must also be provided. Required for WhatsApp Business Calling integration. |
| `SIP_TLS_CERT` | | Path to PEM-encoded TLS certificate (e.g. `fullchain.pem`). Meta rejects self-signed certs; use a CA-signed cert matching a public FQDN. |
| `SIP_TLS_KEY` | | Path to PEM-encoded TLS private key (e.g. `privkey.pem`). |
| `SIP_DEBUG` | `false` | When `true`, log the full RFC 3261 wire form of every inbound and outbound SIP request and response. Very verbose; use only for troubleshooting. |
| `SIP_DOMAIN` | *(falls back to advertised IP)* | FQDN advertised in From, Contact and Via on **all** outbound SIP signalling (classic trunks and WhatsApp). Should match the SAN on `SIP_TLS_CERT` and any allowlist your carrier or Meta keeps. |
| `SIP_HOST` | `voiceblender` | SIP User-Agent name |
| `ICE_SERVERS` | `stun:stun.l.google.com:19302` | STUN/TURN URLs (comma-separated) |
| `RECORDING_DIR` | `/tmp/recordings` | Local recording output directory |
| `LOG_LEVEL` | `info` | Log level (`debug`, `info`, `warn`, `error`) |
| `WEBHOOK_URL` | | Default webhook URL for inbound calls |
| `ELEVENLABS_API_KEY` | | API key for ElevenLabs TTS, STT, and Agent |
| `VAPI_API_KEY` | | API key for VAPI Agent provider |
| `DEEPGRAM_API_KEY` | | API key for Deepgram STT and TTS |
| `AZURE_SPEECH_KEY` | | Subscription key for Azure Cognitive Speech Services (TTS and STT) |
| `AZURE_SPEECH_REGION` | `eastus` | Azure region for Speech Services (e.g. `eastus`, `westeurope`) |
| `S3_BUCKET` | | S3 bucket for recording uploads |
| `S3_REGION` | `us-east-1` | AWS region |
| `S3_ENDPOINT` | | Custom S3 endpoint (MinIO, etc.) |
| `S3_PREFIX` | | Key prefix for S3 objects |
| `TTS_CACHE_ENABLED` | `false` | Enable disk-backed TTS audio cache. Cached audio persists across restarts. |
| `TTS_CACHE_DIR` | `/tmp/tts_cache` | Directory for cached TTS audio files (used when `TTS_CACHE_ENABLED=true`) |
| `TTS_CACHE_INCLUDE_API_KEY` | `false` | Include API key in TTS cache key (set `true` if different keys map to different voice clones) |
| `RTP_PORT_MIN` | `10000` | Minimum UDP port for RTP/RTCP media |
| `RTP_PORT_MAX` | `20000` | Maximum UDP port for RTP/RTCP media |
| `SIP_JITTER_BUFFER_MS` | `0` | SIP ingress jitter buffer target delay in ms (0 = disabled passthrough). Applies to every SIP leg. |
| `SIP_JITTER_BUFFER_MAX_MS` | `300` | Max depth of the SIP ingress jitter buffer (ms); frames beyond this are dropped oldest-first. |
| `SIP_EXTERNAL_IP` | *(empty)* | Public IPv4 address for NAT/Docker deployments. When set, used in SIP Contact headers and SDP media (`c=`) lines instead of the auto-detected or bind IP. IPv6 has no equivalent: set `SIP_BIND_IPV6` directly to the address you want advertised. |
| `DEFAULT_SAMPLE_RATE` | `16000` | Default mixer sample rate (Hz) for new rooms when `sample_rate` is not specified. Allowed: 8000, 16000, 48000. |
| `SIP_REFER_AUTO_DIAL` | `false` | Accept incoming SIP REFER requests and auto-dial the transferred call. **Default-deny** (toll-fraud risk). Outbound transfers via the REST API are unaffected. |
| `SIP_AUTO_RINGING` | `false` | **Behavior change vs prior releases**: previously the server always sent 180 Ringing after 100 Trying. The new default sends only 100 Trying; the API caller drives ringing explicitly via `POST /v1/legs/{id}/ring`, `/early-media`, or `/answer`. Set to `true` to restore the legacy auto-180 behavior. |
| `SIP_USE_SOURCE_SOCKET` | `false` | When `true`, route SIP responses **and** in-dialog requests (BYE, re-INVITE, INFO, NOTIFY, REFER) back to the request's source UDP socket instead of the peer's Contact URI / Via sent-by. Enable when peers advertise unroutable addresses (e.g. private IPs in Contact from behind NAT, or Via sent-by hosts that don't resolve). Equivalent to sipgo's `DialogUA.RewriteContact` plus per-response `SetDestination(req.Source())`. |
| `SPEECH_DETECTION_ENABLED` | `false` | Emit `speaking.started` / `speaking.stopped` events for every connected leg by default. Per-call `speech_detection` on `POST /v1/legs` or `POST /v1/legs/{id}/answer` overrides this. |
| `VSI_EVENT_BUFFER_SIZE` | `256` | Per-client buffer (in events) on the `/v1/vsi` WebSocket. When the client consumes events slower than they're produced, the buffer fills and new events are dropped (with a warn log on the leading edge of each drop burst and at every 10× threshold; the next delivered event also includes an `events_dropped` notification to the client). Clamped to [16, 1_000_000]. **Tuning:** larger values absorb longer back-pressure spikes at the cost of higher peak memory per client (roughly the average JSON event size × buffer size, e.g. ~1 KB × 256 ≈ 256 KB per connection at the default) and longer end-to-end latency for buffered events when the client recovers. Increase only if you observe drops on legitimate slow-consumer scenarios you can't fix at the client. |
| `MOQ_ENABLED` | `false` | Enable the experimental MoQ (Media over QUIC) inbound leg endpoint at `CONNECT /v1/legs/moq` over WebTransport/HTTP/3. PoC quality: tracks IETF draft-11 via `mengelbart/moqtransport`, single MoQ session per leg, Opus framed one frame per MoQ Object (LOC-style). When enabled, both `MOQ_TLS_CERT_FILE` and `MOQ_TLS_KEY_FILE` must be set. |
| `MOQ_LISTEN_ADDR` | `:8443` | UDP address for the HTTP/3 listener that backs the MoQ leg. Independent of `HTTP_ADDR`: TCP/:8080 and UDP/:8443 can run side-by-side. |
| `MOQ_TLS_CERT_FILE` | _(none)_ | Path to the TLS certificate used by the HTTP/3 listener. Required when `MOQ_ENABLED=true`. |
| `MOQ_TLS_KEY_FILE` | _(none)_ | Path to the TLS private key used by the HTTP/3 listener. Required when `MOQ_ENABLED=true`. |
| `MOQ_OPUS_BITRATE` | `24000` | Target bitrate (bps) for the Opus encoder feeding the MoQ leg's mix track. Must be in 6000..510000. |
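Several rows in the table state constraints between variables (TLS port needs cert and key, MoQ needs both TLS files, bitrate and sample-rate ranges). A small preflight script can check them before the service starts. This is a sketch, not part of VoiceBlender: `check_config` and `vsi_buffer_bytes` are hypothetical helpers, and treating only the literal string `true` as "enabled" is an assumption about how the service parses booleans.

```python
from typing import List, Mapping

ALLOWED_SAMPLE_RATES = {"8000", "16000", "48000"}


def _int_setting(env: Mapping[str, str], name: str, default: int,
                 problems: List[str]) -> int:
    """Parse an integer setting, recording a problem instead of raising."""
    raw = env.get(name, str(default))
    try:
        return int(raw)
    except ValueError:
        problems.append(f"{name} must be an integer, got {raw!r}")
        return default


def check_config(env: Mapping[str, str]) -> List[str]:
    """Return the constraint violations found in a set of settings."""
    problems: List[str] = []

    # SIP over TLS needs both the certificate and the key.
    if env.get("SIP_TLS_PORT"):
        for name in ("SIP_TLS_CERT", "SIP_TLS_KEY"):
            if not env.get(name):
                problems.append(f"{name} is required when SIP_TLS_PORT is set")

    # Assumption: only the literal string "true" enables the MoQ endpoint.
    if env.get("MOQ_ENABLED", "false") == "true":
        for name in ("MOQ_TLS_CERT_FILE", "MOQ_TLS_KEY_FILE"):
            if not env.get(name):
                problems.append(f"{name} is required when MOQ_ENABLED=true")

    if env.get("DEFAULT_SAMPLE_RATE", "16000") not in ALLOWED_SAMPLE_RATES:
        problems.append("DEFAULT_SAMPLE_RATE must be 8000, 16000, or 48000")

    bitrate = _int_setting(env, "MOQ_OPUS_BITRATE", 24000, problems)
    if not 6000 <= bitrate <= 510000:
        problems.append("MOQ_OPUS_BITRATE must be in 6000..510000")

    port_min = _int_setting(env, "RTP_PORT_MIN", 10000, problems)
    port_max = _int_setting(env, "RTP_PORT_MAX", 20000, problems)
    if port_min > port_max:
        problems.append("RTP_PORT_MIN must not exceed RTP_PORT_MAX")

    return problems


def vsi_buffer_bytes(buffer_size: int, avg_event_bytes: int = 1024) -> int:
    """Rough peak memory per /v1/vsi client: clamped buffer size x event size."""
    clamped = max(16, min(buffer_size, 1_000_000))
    return clamped * avg_event_bytes
```

Calling `check_config(os.environ)` (after `import os`) in a deployment script and aborting on a non-empty result is one way to use it. `vsi_buffer_bytes(256)` reproduces the table's estimate of about 256 KB per connection at the default buffer size.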
Set these env vars before starting voiceblender:
| Variable | Value |
|---|---|
| `SIP_TLS_PORT` | `5061` |
| `SIP_TLS_CERT` | path to `fullchain.pem` for your FQDN |
| `SIP_TLS_KEY` | path to `privkey.pem` |
| `SIP_DOMAIN` | the FQDN you registered with Meta (must match the cert SAN) |
Make a test outbound call:
```bash
curl -X POST http://localhost:8080/v1/legs \
  -H 'Content-Type: application/json' \
  -d '{
    "type": "whatsapp",
    "to": "+447900000000",
    "from": "+441300000000",
    "auth": { "password": "<meta-issued-digest-password>" },
    "room_id": "wa-test"
  }'
```
The HTTP response returns immediately with the leg in ringing; subscribe to the webhook or /v1/vsi event stream to see leg.connected (or leg.disconnected with a reason if Meta rejects the INVITE).
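The same outbound call can be prepared from Python. The sketch below only builds the `POST /v1/legs` request with the standard library and does not send it. The JSON fields mirror the curl example above. `build_whatsapp_call` is a hypothetical helper, and rejecting a `to` value that is not in E.164 form before sending is an added convenience, not documented server behavior.

```python
import json
import re
import urllib.request

# E.164: a plus sign, a non-zero leading digit, at most 15 digits in total.
E164 = re.compile(r"\+[1-9]\d{1,14}")


def build_whatsapp_call(base_url: str, to: str, from_: str,
                        password: str, room_id: str) -> urllib.request.Request:
    """Build (but do not send) the outbound WhatsApp leg request."""
    if not E164.fullmatch(to):
        raise ValueError(f"'to' is not an E.164 number: {to!r}")
    body = {
        "type": "whatsapp",
        "to": to,
        "from": from_,
        "auth": {"password": password},
        "room_id": room_id,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/legs",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends the call. As described above, the response only confirms that the leg is ringing, so the final outcome still has to be read from the webhook or the `/v1/vsi` stream.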
Full reference: API.md
1. Register a webhook: `POST /v1/webhooks`
2. Receive inbound call: webhook `leg.ringing {leg_id, from, to}`
3. Answer: `POST /v1/legs/{id}/answer`
4. Create a room: `POST /v1/rooms`
5. Add legs to room: `POST /v1/rooms/{id}/legs`
6. Attach AI agent: `POST /v1/legs/{id}/agent`
7. Start recording: `POST /v1/legs/{id}/record`
8. Hang up: `DELETE /v1/legs/{id}`
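The steps above can be sketched as a function that turns a `leg.ringing` webhook event into the ordered list of API calls for steps 3 to 8. This is an illustration only. It assumes the event name is stored under a `type` key and that the room ID is known in advance (in practice it may come from the `POST /v1/rooms` response). Request bodies are omitted because they are not given here. `plan_for_event` is a hypothetical name.

```python
from typing import Any, Dict, List, Tuple


def plan_for_event(event: Dict[str, Any], room_id: str) -> List[Tuple[str, str]]:
    """Return (method, path) pairs to run for an inbound call, in order.

    Assumption: the event name is under "type". The "leg_id" field is the
    one shown in step 2 of the flow above.
    """
    if event.get("type") != "leg.ringing":
        return []  # Only a ringing inbound leg starts the flow.
    leg_id = event["leg_id"]
    return [
        ("POST", f"/v1/legs/{leg_id}/answer"),    # 3. answer
        ("POST", "/v1/rooms"),                    # 4. create a room
        ("POST", f"/v1/rooms/{room_id}/legs"),    # 5. add legs to room
        ("POST", f"/v1/legs/{leg_id}/agent"),     # 6. attach AI agent
        ("POST", f"/v1/legs/{leg_id}/record"),    # 7. start recording
        ("DELETE", f"/v1/legs/{leg_id}"),         # 8. hang up
    ]
```

Keeping the plan as data makes the ordering easy to test without a running server; an HTTP client would then execute each pair against `HTTP_ADDR`.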
- `403 SIP server X.X.X.X from INVITE does not match any SIP server configured for phone number ...`: `SIP_DOMAIN` doesn't match what's registered with Meta. Set it to the FQDN, not the IP, and confirm via the `GET /settings` query above.
- `404 Not Found` on outbound: usually means the recipient phone number isn't a valid WhatsApp user, or the destination URI is malformed. Confirm the digits in `to` are the actual user's E.164 number.
- `Reason: ... not receiving any media for a long time`: your audio path (RTP/UDP egress) is being dropped before reaching Meta. Check firewall rules for outbound UDP from the `RTP_PORT_MIN`–`RTP_PORT_MAX` range and that ICE-srflx candidates are correct.
- […] `setup:actpass` + `ice-lite`, and they don't initiate DTLS. VoiceBlender forces `setup:active` automatically; if you see `pcmedia: DTLS state state=connecting` for >5 s, run with `LOG_LEVEL=debug` and inspect pion's DTLS scope for the actual error.
- […] `SIP_DEBUG=true` to log the full RFC 3261 wire form of every SIP message, including the auth-bearing retry after the 401/407 challenge; that's the most useful diagnostic for any signalling-layer issue.

aiskill88 review: Solid low-level capabilities. By combining traditional telephony protocols with modern AI voice streams, it is an ideal foundation for building high-performance voice agents.
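AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute any legal endorsement of the tool's functionality or quality.
We recommend validating the tool thoroughly in a sandbox or test environment before deploying it to production, and carrying out the necessary security assessment.
✅ MIT License: one of the most permissive open-source licenses. It allows free commercial use, modification, and distribution, provided the copyright notice is retained.
Overall, VoiceBlender is a high-quality Agent workflow tool that is reasonably competitive among comparable tools. AI Skill Hub will keep tracking its updates; consider bookmarking it and adopting it when it fits your own scenario.
| Field | Value |
|---|---|
| Original name | voiceblender |
| Topics | Voice AI, WebRTC, real-time communication |
| GitHub | https://github.com/VoiceBlender/voiceblender |
| License | MIT |
| Language | Go |
Listed: 2026-05-26 · Updated: 2026-05-26 · License: MIT · AI Skill Hub does not legally endorse the accuracy of third-party content.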