
speech-swift: AI speech synthesis tool documentation

Built with Swift · Free and open source, runs locally, data stays fully under your control
Original name: speech-swift
⭐ 748 Stars 🍴 96 Forks 💻 Swift 📄 Apache-2.0 🏷 AI score 8.3
apple-silicon · asr · coreml · ios · macos · mlx · tts
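✦ AI Skill Hub recommendation

AI Skill Hub strongly recommends speech-swift as a quality AI tool. Its overall AI score of 8.3 places it solidly among comparable tools. If you are looking for a reliable AI tooling solution, it is worth a closer look.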

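📚 In-depth analysis
speech-swift is an open-source tool written in Swift with 748 stars on GitHub, listed under the apple-silicon, asr, coreml, and ios topics. The main advantage of an open-source tool is that the code is fully transparent: you can audit it for security and adapt or extend it to fit your own needs.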

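**Why use an open-source tool instead of a commercial SaaS?**
For individual developers and users with privacy requirements, a locally deployed open-source tool means data never leaves the machine and is not subject to a third-party provider's data policy. Open-source tools also usually have no usage caps or monthly fees: install once and keep using it, so for high-frequency workloads the total cost of ownership (TCO) is far lower than that of subscription-based commercial tools.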

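**Installation and environment setup**
speech-swift requires a Swift toolchain: Swift 6+ and Xcode 16+ (with the Metal Toolchain), on macOS 15+ or iOS 18+ with Apple Silicon. Python and Node.js version managers such as pyenv or nvm do not manage Swift, and no Python virtual environment is involved. Swift Package Manager resolves dependencies per package rather than installing them system-wide, so if a build goes wrong you can delete the package's build directory and rebuild without affecting system stability.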

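**Community and maintenance**
GitHub Issues and Discussions are the fastest channels for getting help. Before asking, search the closed issues first, since most common problems already have answers there. When reporting a bug, include the output of swift --version, your macOS and Xcode versions, the full error output, and a minimal reproducible example; this noticeably speeds up maintainer responses. AI Skill Hub will keep tracking speech-swift releases and flag important feature changes.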
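📋 Tool overview

speech-swift is an open-source tool written in Swift, centered on the apple-silicon, asr, and coreml topics. As a GitHub open-source project it has community support and ongoing releases, its code is fully transparent and auditable, and it runs locally to protect data privacy. It suits both personal use and integration into enterprise workflows.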

GitHub Stars: ⭐ 748
Language: Swift
Supported platforms: macOS / iOS
Maintenance status: actively maintained, community-driven
License: Apache-2.0
AI score: 8.3
Tool type: AI tool
Forks: 96
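📖 Documentation
The following content was compiled automatically by AI Skill Hub from project information. For the complete original documentation, follow the original source links at the bottom of the page.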


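📌 Key features
  • Free and open source, runs locally, data stays fully under your control
  • Active open-source community on GitHub with ongoing updates
  • Documentation and usage examples that are approachable for newcomers
  • Configurable to fit different environments
  • Usable as a building block in an existing stack or as a base for further development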
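🎯 Main use cases
  • Running locally to protect data privacy and meet compliance requirements
  • Custom integration into existing systems to extend the stack
  • Serving as an open-source base component for commercial derivative work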
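The install commands below were generated automatically from the project's language and type; the official README is authoritative.
Install commands
# Clone the repository
git clone https://github.com/soniqo/speech-swift
cd speech-swift

# Read the installation instructions
cat README.md

# Once the dependencies listed in the README are installed, the tool is ready to use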
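📋 Installation steps
  1. Open the GitHub repository page
  2. Install the dependencies as described in the README
  3. Complete the initial configuration for your system
  4. Start from the official examples or documentation
  5. If you run into problems, search GitHub Issues for answers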
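The usage examples below were compiled by AI Skill Hub and cover the most common scenarios. The command name and flags are generic placeholders: the README excerpt on this page documents only make build and speech-server.
Common commands / code examples
# Show help
speech-swift --help

# Basic run
speech-swift [options] <input>

# For detailed usage, see the documentation
# https://github.com/soniqo/speech-swift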
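The configuration example below was generated from typical usage scenarios; adjust the parameters according to the official documentation. The only environment variable documented in the README excerpt on this page is QWEN3_CACHE_DIR.
Configuration example
# speech-swift configuration
# Show configuration options
speech-swift --config-example > config.yml

# Common settings
# output_dir: ./output
# log_level: info
# workers: 4

# Environment variable (overrides the config file)
export SPEECH_SWIFT_CONFIG="/path/to/config.yml"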
📑 README in depth · source document · completeness 75/100
The following content was parsed directly from the GitHub README, preserving code blocks, tables, and list structure.

Speech Swift

AI speech models for Apple Silicon, powered by MLX Swift and CoreML.


On-device speech recognition, synthesis, and understanding for Mac and iOS. Runs locally on Apple Silicon — no cloud, no API keys, no data leaves your device.



Video: Local Speech AI on a MacBook, a 4-minute open-source library tour on YouTube (https://youtu.be/x9zgcaW0gUk)

Use cases: Voice Agents · Transcription · Speech Generation

  • Qwen3-ASR — Speech-to-text (automatic speech recognition, 52 languages, MLX + CoreML)
  • Parakeet TDT — Speech-to-text via CoreML (Neural Engine, NVIDIA FastConformer + TDT decoder, 25 languages)
  • Omnilingual ASR — Speech-to-text (Meta wav2vec2 + CTC, 1,672 languages across 32 scripts, CoreML 300M + MLX 300M/1B/3B/7B)
  • Streaming Dictation — Real-time dictation with partials and end-of-utterance detection (Parakeet-EOU-120M)
  • Nemotron Streaming — Low-latency streaming ASR with native punctuation and capitalization (NVIDIA Nemotron-Speech-Streaming-0.6B, CoreML, English)
  • Qwen3-ForcedAligner — Word-level timestamp alignment (audio + text → timestamps)
  • Qwen3-TTS — Text-to-speech (highest quality, streaming, custom speakers, 10 languages)
  • CosyVoice TTS — Streaming TTS with voice cloning, multi-speaker dialogue, emotion tags (9 languages)
  • VoxCPM2 — 48 kHz studio-quality TTS with voice cloning + instruction-driven voice design (2B, MLX bf16/int8/int4, 30 languages)
  • Kokoro TTS — On-device TTS (82M, CoreML/Neural Engine, 54 voices, iOS-ready, 10 languages)
  • VibeVoice TTS — Long-form / multi-speaker TTS (Microsoft VibeVoice Realtime-0.5B + 1.5B, MLX, up to 90-min podcast/audiobook synthesis, EN/ZH)
  • Qwen3.5-Chat — On-device LLM chat (0.8B, MLX INT4 + CoreML INT8, DeltaNet hybrid, streaming tokens)
  • MADLAD-400 — Many-to-many translation across 400+ languages (3B, MLX INT4 + INT8, T5 v1.1, Apache 2.0)
  • PersonaPlex — Full-duplex speech-to-speech (7B, audio in → audio out, 18 voice presets)
  • DeepFilterNet3 — Real-time noise suppression (2.1M params, 48 kHz)
  • Source Separation — Music source separation via Open-Unmix (UMX-HQ / UMX-L, 4 stems: vocals/drums/bass/other, 44.1 kHz stereo)
  • MAGNeT — Text-to-music generation (Meta MAGNeT Small 300M / Medium 1.5B, MLX INT4/INT8, 30 s clips at 32 kHz mono, masked parallel decoding)
  • FlashSR — Audio super-resolution (FlashSR ICASSP 2025, MLX, 48 kHz mono, 1-step distilled diffusion, INT4 363 MB / INT8 720 MB)
  • Wake-word — On-device keyword spotting (KWS Zipformer 3M, CoreML, 26× real-time, configurable keyword list)
  • VAD — Voice activity detection (Silero streaming, Pyannote offline, FireRedVAD 100+ languages)
  • Speaker Diarization — Who spoke when (Pyannote pipeline, Sortformer end-to-end on Neural Engine)
  • Speaker Embeddings — WeSpeaker ResNet34 (256-dim), CAM++ (192-dim)

Papers: Qwen3-ASR (Alibaba) · Qwen3-TTS (Alibaba) · Omnilingual ASR (Meta) · Parakeet TDT (NVIDIA) · CosyVoice 3 (Alibaba) · Kokoro (StyleTTS 2) · PersonaPlex (NVIDIA) · Mimi (Kyutai) · Sortformer (NVIDIA)

Requirements

  • Swift 6+, Xcode 16+ (with Metal Toolchain)
  • macOS 15+ (Sequoia) or iOS 18+, Apple Silicon (M1/M2/M3/M4)

The macOS 15 / iOS 18 minimum comes from MLState — Apple's persistent ANE state API used by the CoreML pipelines (Qwen3-ASR, Qwen3-Chat, Qwen3-TTS) to keep KV caches resident on the Neural Engine across token steps.

Installation

Build from source

git clone https://github.com/soniqo/speech-swift
cd speech-swift
make build

make build compiles the Swift package and the MLX Metal shader library. The Metal library is required for GPU inference; without it you'll see Failed to load the default metallib at runtime. Use make debug for debug builds and make test for the test suite.

Full build and install guide →

Quick start

Add the package to your Package.swift:

.package(url: "https://github.com/soniqo/speech-swift", branch: "main")

Import only the modules you need — every model is its own SPM library, so you don't pay for what you don't use:

.product(name: "ParakeetStreamingASR", package: "speech-swift"),
.product(name: "SpeechUI",             package: "speech-swift"),  // optional SwiftUI views
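
Put together, a complete manifest might look like the sketch below. The package and target name MyDictationApp is hypothetical, and the platform minimums are taken from the Requirements section; the dependency URL and product names are the ones shown above.

```swift
// swift-tools-version: 6.0
import PackageDescription

let package = Package(
    name: "MyDictationApp", // hypothetical app name
    // Minimums from the Requirements section: macOS 15+ / iOS 18+
    platforms: [.macOS(.v15), .iOS(.v18)],
    dependencies: [
        .package(url: "https://github.com/soniqo/speech-swift", branch: "main")
    ],
    targets: [
        .executableTarget(
            name: "MyDictationApp",
            dependencies: [
                // Pull in only the products this target actually uses
                .product(name: "ParakeetStreamingASR", package: "speech-swift"),
                .product(name: "SpeechUI", package: "speech-swift"),
            ]
        )
    ]
)
```

Tracking branch: "main" follows the README; pinning to a specific revision is a common alternative when reproducible builds matter.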

Transcribe an audio buffer in 3 lines:

import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
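
The snippet above takes audioSamples as given. If your audio arrives as 16-bit PCM, a small conversion step like the one below may be needed first. This is a sketch under an assumption the README does not confirm: that the model expects mono Float samples normalized to the range [-1, 1). The helper name pcm16ToFloat is hypothetical and not part of the library.

```swift
import Foundation

/// Converts signed 16-bit PCM samples to Float values in [-1, 1).
/// Hypothetical helper; the expected input format is an assumption.
func pcm16ToFloat(_ pcm: [Int16]) -> [Float] {
    // Int16 spans -32768...32767, so dividing by 32768 maps it into [-1, 1)
    pcm.map { Float($0) / 32768.0 }
}
```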

Live streaming with partials:

for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.isFinal ? "FINAL: \(partial.text)" : "... \(partial.text)")
}

SwiftUI dictation view in ~10 lines:

import SwiftUI
import ParakeetStreamingASR
import SpeechUI

@MainActor
struct DictateView: View {
    @State private var store = TranscriptionStore()

    var body: some View {
        TranscriptionView(finals: store.finalLines, currentPartial: store.currentPartial)
            .task {
                let model = try? await ParakeetStreamingASRModel.fromPretrained()
                guard let model else { return }
                for await p in model.transcribeStream(audio: samples, sampleRate: 16000) {
                    store.apply(text: p.text, isFinal: p.isFinal)
                }
            }
    }
}

SpeechUI ships only TranscriptionView (finals + partials) and TranscriptionStore (streaming ASR adapter). Use AVFoundation for audio visualization and playback.
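
To make the finals-plus-partial pattern concrete, here is a minimal sketch of how such a store could accumulate streaming results. TranscriptLog is a hypothetical stand-in written for illustration; it is not the library's TranscriptionStore, whose internals the README does not show. It only mirrors the apply(text:isFinal:), finalLines, and currentPartial names used above.

```swift
/// Hypothetical accumulator for streaming transcripts.
struct TranscriptLog {
    /// Completed utterances, in arrival order.
    private(set) var finalLines: [String] = []
    /// The latest in-progress hypothesis, replaced on every partial.
    private(set) var currentPartial: String = ""

    mutating func apply(text: String, isFinal: Bool) {
        if isFinal {
            // A final result closes the utterance and clears the partial
            finalLines.append(text)
            currentPartial = ""
        } else {
            // Partials overwrite each other rather than accumulate
            currentPartial = text
        }
    }
}
```

Keeping partials separate from finals lets a view render committed lines once and redraw only the in-progress line as new hypotheses arrive.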

Available SPM products: Qwen3ASR, Qwen3TTS, Qwen3TTSCoreML, ParakeetASR, ParakeetStreamingASR, NemotronStreamingASR, OmnilingualASR, KokoroTTS, VibeVoiceTTS, CosyVoiceTTS, VoxCPM2TTS, MAGNeTMusicGen, FlashSR, PersonaPlex, SpeechVAD, SpeechEnhancement, SourceSeparation, Qwen3Chat, SpeechCore, SpeechUI, AudioCommon.

Code examples

The snippets below show the minimal path for each domain. Every section links to a full guide on soniqo.audio with configuration options, multiple backends, streaming patterns, and CLI recipes.

Speech-to-Text — [full guide →](https://soniqo.audio/guides/transcribe)

import Qwen3ASR

let model = try await Qwen3ASRModel.fromPretrained()
let text = model.transcribe(audio: audioSamples, sampleRate: 16000)

Alternative backends: Parakeet TDT (CoreML, 32× realtime), Omnilingual ASR (1,672 languages, CoreML or MLX), Streaming dictation (live partials).

Forced Alignment — [full guide →](https://soniqo.audio/guides/align)

import Qwen3ASR

let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)
for word in aligned {
    print("[\(word.startTime)s - \(word.endTime)s] \(word.text)")
}

Text-to-Speech — [full guide →](https://soniqo.audio/guides/speak)

import Qwen3TTS
import AudioCommon

let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello world", language: "english")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)

Alternative TTS engines: CosyVoice3 (streaming + voice cloning + emotion tags), Kokoro-82M (iOS-ready, 54 voices), VibeVoice (long-form podcast / multi-speaker, EN/ZH), Voice cloning.

Speech-to-Speech — [full guide →](https://soniqo.audio/guides/respond)

import PersonaPlex

let model = try await PersonaPlexModel.fromPretrained()
let responseAudio = model.respond(userAudio: userSamples)
// 24 kHz mono Float32 output ready for playback

LLM Chat — [full guide →](https://soniqo.audio/guides/chat)

import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()
chat.chat(messages: [(.user, "Explain MLX in one sentence")]) { token, isFinal in
    print(token, terminator: "")
}

Translation — [full guide →](https://soniqo.audio/guides/translate)

import MADLADTranslation

let translator = try await MADLADTranslator.fromPretrained()
let es = try translator.translate("Hello, how are you?", to: "es")
// → "Hola, ¿cómo estás?"

Voice Activity Detection — [full guide →](https://soniqo.audio/guides/vad)

import SpeechVAD

let vad = try await SileroVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for s in segments { print("\(s.startTime)s → \(s.endTime)s") }
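
VAD output often contains short pauses that split one utterance into several segments. The sketch below merges segments separated by less than a chosen gap. SpeechSegment is a hypothetical stand-in type: the README only shows that segments expose startTime and endTime, so the Double field types and the mergeSegments helper are assumptions.

```swift
/// Hypothetical stand-in for a VAD segment (times in seconds).
struct SpeechSegment: Equatable {
    var startTime: Double
    var endTime: Double
}

/// Merges segments whose gap to the previous segment is at most `maxGap` seconds.
func mergeSegments(_ segments: [SpeechSegment], maxGap: Double) -> [SpeechSegment] {
    var merged: [SpeechSegment] = []
    for seg in segments.sorted(by: { $0.startTime < $1.startTime }) {
        if let last = merged.last, seg.startTime - last.endTime <= maxGap {
            // Close enough to the previous segment: extend it
            merged[merged.count - 1].endTime = max(last.endTime, seg.endTime)
        } else {
            merged.append(seg)
        }
    }
    return merged
}
```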

Speaker Diarization — [full guide →](https://soniqo.audio/guides/diarize)

import SpeechVAD

let diarizer = try await DiarizationPipeline.fromPretrained()
let segments = diarizer.diarize(audio: samples, sampleRate: 16000)
for s in segments { print("Speaker \(s.speakerId): \(s.startTime)s - \(s.endTime)s") }
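
A common next step is to total the speaking time per speaker. The sketch below uses a hypothetical DiarizedSegment type; the README shows only the speakerId, startTime, and endTime property names, so the Int and Double types here are assumptions.

```swift
/// Hypothetical stand-in for a diarization segment (times in seconds).
struct DiarizedSegment {
    var speakerId: Int
    var startTime: Double
    var endTime: Double
}

/// Sums segment durations per speaker.
func talkTime(_ segments: [DiarizedSegment]) -> [Int: Double] {
    var totals: [Int: Double] = [:]
    for s in segments {
        totals[s.speakerId, default: 0] += s.endTime - s.startTime
    }
    return totals
}
```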

Speech Enhancement — [full guide →](https://soniqo.audio/guides/denoise)

import SpeechEnhancement

let denoiser = try await DeepFilterNet3Model.fromPretrained()
let clean = try denoiser.enhance(audio: noisySamples, sampleRate: 48000)

Voice Pipeline (ASR → LLM → TTS) — [full guide →](https://soniqo.audio/voice-agents)

import SpeechCore

let pipeline = VoicePipeline(
    stt: parakeetASR,
    tts: qwen3TTS,
    vad: sileroVAD,
    config: .init(mode: .voicePipeline),
    onEvent: { event in print(event) }
)
pipeline.start()
pipeline.pushAudio(micSamples)

VoicePipeline is the real-time voice-agent state machine (powered by speech-core) with VAD-driven turn detection, interruption handling, and eager STT. It connects any SpeechRecognitionModel + SpeechGenerationModel + StreamingVADProvider.

Demo apps

  • DictateDemo (docs) — macOS menu-bar streaming dictation with live partials, VAD-driven end-of-utterance detection, and one-click copy. Runs as a background agent (Parakeet-EOU-120M + Silero VAD).
  • iOSEchoDemo — iOS echo demo (Parakeet ASR + Kokoro TTS). Device and simulator.
  • PersonaPlexDemo — Conversational voice assistant with mic input, VAD, and multi-turn context. macOS. RTF ~0.94 on M2 Max (faster than real-time).
  • SpeechDemo — Dictation and TTS synthesis in a tabbed interface. macOS.

Each demo's README has build instructions.
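
The RTF figure quoted for PersonaPlexDemo can be read with the usual definition of real-time factor: processing time divided by audio duration, so values below 1 mean faster than real time. The sketch below shows that arithmetic along with the inverse "N× real time" form used elsewhere in the README. The README does not state how its figures were measured, so treat this as the conventional definition rather than the project's benchmark method.

```swift
/// Real-time factor: seconds of compute per second of audio.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    processingSeconds / audioSeconds
}

/// Converts an RTF into a speed multiple (how many times faster than real time).
func speedMultiple(rtf: Double) -> Double {
    1.0 / rtf
}
```

Under this definition, an RTF of 0.94 means 10 s of audio takes about 9.4 s to process, and a model quoted at 32× real time corresponds to an RTF of about 0.031.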

Cache configuration

Model weights download from HuggingFace on first use and cache to ~/Library/Caches/qwen3-speech/. Override with QWEN3_CACHE_DIR (CLI) or cacheDir: (Swift API). All fromPretrained() entry points also accept offlineMode: true to skip network when weights are already cached.

See docs/inference/cache-and-offline.md for full details including sandboxed iOS container paths.
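
The override rule above can be sketched as a small app-level helper that picks a directory to pass as cacheDir:. The function resolveCacheDir is hypothetical and not part of the library; the README says QWEN3_CACHE_DIR applies to the CLI, and whether the Swift API reads it on its own is not stated, so this helper makes the lookup explicit.

```swift
import Foundation

/// Returns QWEN3_CACHE_DIR if set and non-empty, otherwise
/// Library/Caches/qwen3-speech under the given home directory.
/// Hypothetical helper for illustration.
func resolveCacheDir(environment: [String: String], home: URL) -> URL {
    if let override = environment["QWEN3_CACHE_DIR"], !override.isEmpty {
        return URL(fileURLWithPath: override, isDirectory: true)
    }
    return home.appendingPathComponent("Library/Caches/qwen3-speech", isDirectory: true)
}
```

On macOS, calling it with ProcessInfo.processInfo.environment and FileManager.default.homeDirectoryForCurrentUser reproduces the default location described above; sandboxed iOS apps use container paths instead, as the linked document notes.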

HTTP API server

speech-server --port 8080

Exposes every model via HTTP REST + WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at /v1/realtime. See Sources/AudioServer/.
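
For a client on the same machine, the realtime endpoint address can be assembled as below. This sketch only builds the URL; the ws scheme (no TLS) and the localhost default are assumptions, the helper name realtimeURL is hypothetical, and the message format exchanged over the socket is not covered here.

```swift
import Foundation

/// Builds the WebSocket URL for the /v1/realtime endpoint.
/// Defaults match the `speech-server --port 8080` example; scheme and host are assumptions.
func realtimeURL(host: String = "localhost", port: Int = 8080) -> URL? {
    var components = URLComponents()
    components.scheme = "ws"
    components.host = host
    components.port = port
    components.path = "/v1/realtime"
    return components.url
}
```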

Swift Package Manager

dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", branch: "main")
]

Import only what you need — every model is its own SPM target:

import Qwen3ASR             // Speech recognition (MLX)
import ParakeetASR          // Speech recognition (CoreML, batch)
import ParakeetStreamingASR // Streaming dictation with partials + EOU
import NemotronStreamingASR // English streaming ASR with native punctuation (0.6B)
import OmnilingualASR       // 1,672 languages (CoreML + MLX)
import Qwen3TTS             // Text-to-speech
import CosyVoiceTTS         // Text-to-speech with voice cloning
import VoxCPM2TTS           // 48 kHz TTS with voice cloning + voice design (2B)
import KokoroTTS            // Text-to-speech (iOS-ready)
import VibeVoiceTTS         // Long-form / multi-speaker TTS (EN/ZH)
import Qwen3Chat            // On-device LLM chat
import MADLADTranslation    // Many-to-many translation across 400+ languages
import PersonaPlex          // Full-duplex speech-to-speech
import SpeechVAD            // VAD + speaker diarization + embeddings
import SpeechEnhancement    // Noise suppression
import SourceSeparation     // Music source separation (Open-Unmix, 4 stems)
import SpeechUI             // SwiftUI components for streaming transcripts
import AudioCommon          // Shared protocols and utilities
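📚 Practical guide (long-tail questions)
Who it is for
  • Agent developers building multi-agent systems
  • Teams building enterprise knowledge bases or RAG retrieval applications
  • Cross-border businesses and multilingual content teams
  • Developers building voice AI products
Best practices
  • For agent tasks, dry-run the tool-call chain first, then enable autonomous execution
Common mistakes
  • Committing API keys directly to a git repository (use .env and add it to .gitignore)
Deployment options
  • From source: clone the repository and run make build; the project is a Swift package, so it is not installed through npm or pip
  • As a dependency: add the package to Package.swift and import only the modules you need
  • As a local server: run speech-server --port 8080 on an Apple Silicon Mac to expose the models over HTTP REST and WebSocket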
👥 Target audience
  • AI enthusiasts
  • Researchers and students
  • Developers and engineers
  • Technical founders
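⚖️ Pros and cons
✅ Pros
  • Apache-2.0 license, free for commercial use
  • Fully open source with no licensing fees
  • Runs locally, data stays fully under your control
  • Community support: problems can be searched and asked about
⚠️ Cons
  • Installation and initial setup may require some technical background
  • Feature completeness is often behind mature commercial products
  • Support depends mainly on the open-source community, so response times vary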
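⚠️ Usage notes

AI Skill Hub is a third-party content aggregator. The information on this page is compiled from public data and is not a legal endorsement of the tool's functionality or quality.

Validate thoroughly in a sandbox or test environment before deploying to production, and carry out the necessary security review.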

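📄 License

✅ Apache 2.0: a permissive open-source license that allows commercial use, requires keeping the copyright notice and NOTICE file, and includes a patent grant.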

❓ FAQ
speech-swift is an AI tool written in Swift: an AI speech toolkit for Apple Silicon covering ASR, TTS, speech-to-speech, VAD, and diarization, powered by MLX and CoreML.
💡 AI Skill Hub verdict

Overall, speech-swift is a high-quality AI tool that is competitive among comparable tools. AI Skill Hub will keep tracking its updates; bookmark it and adopt it when it fits your own use case.

🌐 Original information
Original name: speech-swift
Original description: AI speech toolkit for Apple Silicon: ASR, TTS, speech-to-speech, VAD, and diarization powered by MLX and CoreML
Topics: apple-silicon · asr · coreml · ios · macos · mlx · tts
GitHub: https://github.com/soniqo/speech-swift
License: Apache-2.0
Language: Swift
🔗 Original source
🐙 GitHub repository: https://github.com/soniqo/speech-swift · 🌐 Official website: https://soniqo.audio

Listed: 2026-05-22 · Updated: 2026-05-22 · License: Apache-2.0 · AI Skill Hub provides no legal endorsement of the accuracy of third-party content.