AI Skill Hub 强烈推荐:speech-swift — AI 语音合成工具中文文档 是一款优质的AI工具。AI 综合评分 8.3 分,在同类工具中表现稳健。如果你正在寻找可靠的AI工具解决方案,这是一个值得深入了解的选择。
speech-swift — AI 语音合成工具中文文档 是一款基于 Swift 开发的开源工具,专注于 apple-silicon、asr、coreml 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
speech-swift — AI 语音合成工具中文文档 是一款基于 Swift 开发的开源工具,专注于 apple-silicon、asr、coreml 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
# 克隆仓库 git clone https://github.com/soniqo/speech-swift cd speech-swift # 查看安装说明 cat README.md # 按 README 完成环境依赖安装后即可使用
# 查看帮助 speech-swift --help # 基本运行 speech-swift [options] <input> # 详细使用说明请查阅文档 # https://github.com/soniqo/speech-swift
# speech-swift 配置说明 # 查看配置选项 speech-swift --config-example > config.yml # 常见配置项 # output_dir: ./output # log_level: info # workers: 4 # 环境变量(覆盖配置文件) export SPEECH_SWIFT_CONFIG="/path/to/config.yml"
AI speech models for Apple Silicon, powered by MLX Swift and CoreML.
📖 Read in: English · 中文 · 日本語 · 한국어 · Español · Deutsch · Français · हिन्दी · Português · Русский
On-device speech recognition, synthesis, and understanding for Mac and iOS. Runs locally on Apple Silicon — no cloud, no API keys, no data leaves your device.
📚 Full Documentation → · 🤗 HuggingFace Models · 📝 Blog · 💬 Discord
<p align="center"> <a href="https://www.producthunt.com/products/speech-swift?embed=true&utm_source=badge-featured&utm_medium=badge&utm_campaign=badge-speech-swift" target="_blank" rel="noopener noreferrer"><img alt="speech-swift - The whole speech stack, on your laptop. | Product Hunt" width="250" height="54" src="https://api.producthunt.com/widgets/embed-image/v1/featured.svg?post_id=1151422&theme=light&t=1779261593657"></a> </p>
<p align="center"> <a href="https://youtu.be/x9zgcaW0gUk"> <img src="https://img.youtube.com/vi/x9zgcaW0gUk/maxresdefault.jpg" width="640" alt="Local Speech AI on a MacBook — watch the 4-minute open-source library tour on YouTube"> </a> </p> <p align="center"><em>Local Speech AI on a MacBook — watch the 4-minute open-source library tour on YouTube</em></p>
Use cases: Voice Agents · Transcription · Speech Generation
Papers: Qwen3-ASR (Alibaba) · Qwen3-TTS (Alibaba) · Omnilingual ASR (Meta) · Parakeet TDT (NVIDIA) · CosyVoice 3 (Alibaba) · Kokoro (StyleTTS 2) · PersonaPlex (NVIDIA) · Mimi (Kyutai) · Sortformer (NVIDIA)
The macOS 15 / iOS 18 minimum comes from MLState — Apple's persistent ANE state API used by the CoreML pipelines (Qwen3-ASR, Qwen3-Chat, Qwen3-TTS) to keep KV caches resident on the Neural Engine across token steps.
git clone https://github.com/soniqo/speech-swift
cd speech-swift
make build
make build compiles the Swift package and the MLX Metal shader library. The Metal library is required for GPU inference — without it you'll see Failed to load the default metallib at runtime. make debug for debug builds, make test for the test suite.
Add the package to your Package.swift:
.package(url: "https://github.com/soniqo/speech-swift", branch: "main")
Import only the modules you need — every model is its own SPM library, so you don't pay for what you don't use:
.product(name: "ParakeetStreamingASR", package: "speech-swift"),
.product(name: "SpeechUI", package: "speech-swift"), // optional SwiftUI views
Transcribe an audio buffer in 3 lines:
import ParakeetStreamingASR
let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
Live streaming with partials:
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
print(partial.isFinal ? "FINAL: \(partial.text)" : "... \(partial.text)")
}
SwiftUI dictation view in ~10 lines:
import SwiftUI
import ParakeetStreamingASR
import SpeechUI
@MainActor
struct DictateView: View {
@State private var store = TranscriptionStore()
var body: some View {
TranscriptionView(finals: store.finalLines, currentPartial: store.currentPartial)
.task {
let model = try? await ParakeetStreamingASRModel.fromPretrained()
guard let model else { return }
for await p in model.transcribeStream(audio: samples, sampleRate: 16000) {
store.apply(text: p.text, isFinal: p.isFinal)
}
}
}
}
SpeechUI ships only TranscriptionView (finals + partials) and TranscriptionStore (streaming ASR adapter). Use AVFoundation for audio visualization and playback.
Available SPM products: Qwen3ASR, Qwen3TTS, Qwen3TTSCoreML, ParakeetASR, ParakeetStreamingASR, NemotronStreamingASR, OmnilingualASR, KokoroTTS, VibeVoiceTTS, CosyVoiceTTS, VoxCPM2TTS, MAGNeTMusicGen, FlashSR, PersonaPlex, SpeechVAD, SpeechEnhancement, SourceSeparation, Qwen3Chat, SpeechCore, SpeechUI, AudioCommon.
The snippets below show the minimal path for each domain. Every section links to a full guide on soniqo.audio with configuration options, multiple backends, streaming patterns, and CLI recipes.
import Qwen3ASR
let model = try await Qwen3ASRModel.fromPretrained()
let text = model.transcribe(audio: audioSamples, sampleRate: 16000)
Alternative backends: Parakeet TDT (CoreML, 32× realtime), Omnilingual ASR (1,672 languages, CoreML or MLX), Streaming dictation (live partials).
import Qwen3ASR
let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
audio: audioSamples,
text: "Can you guarantee that the replacement part will be shipped tomorrow?",
sampleRate: 24000
)
for word in aligned {
print("[\(word.startTime)s - \(word.endTime)s] \(word.text)")
}
import Qwen3TTS
import AudioCommon
let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello world", language: "english")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
Alternative TTS engines: CosyVoice3 (streaming + voice cloning + emotion tags), Kokoro-82M (iOS-ready, 54 voices), VibeVoice (long-form podcast / multi-speaker, EN/ZH), Voice cloning.
import PersonaPlex
let model = try await PersonaPlexModel.fromPretrained()
let responseAudio = model.respond(userAudio: userSamples)
// 24 kHz mono Float32 output ready for playback
import Qwen3Chat
let chat = try await Qwen35MLXChat.fromPretrained()
chat.chat(messages: [(.user, "Explain MLX in one sentence")]) { token, isFinal in
print(token, terminator: "")
}
import MADLADTranslation
let translator = try await MADLADTranslator.fromPretrained()
let es = try translator.translate("Hello, how are you?", to: "es")
// → "Hola, ¿cómo estás?"
import SpeechVAD
let vad = try await SileroVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for s in segments { print("\(s.startTime)s → \(s.endTime)s") }
import SpeechVAD
let diarizer = try await DiarizationPipeline.fromPretrained()
let segments = diarizer.diarize(audio: samples, sampleRate: 16000)
for s in segments { print("Speaker \(s.speakerId): \(s.startTime)s - \(s.endTime)s") }
import SpeechEnhancement
let denoiser = try await DeepFilterNet3Model.fromPretrained()
let clean = try denoiser.enhance(audio: noisySamples, sampleRate: 48000)
import SpeechCore
let pipeline = VoicePipeline(
stt: parakeetASR,
tts: qwen3TTS,
vad: sileroVAD,
config: .init(mode: .voicePipeline),
onEvent: { event in print(event) }
)
pipeline.start()
pipeline.pushAudio(micSamples)
VoicePipeline is the real-time voice-agent state machine (powered by speech-core) with VAD-driven turn detection, interruption handling, and eager STT. It connects any SpeechRecognitionModel + SpeechGenerationModel + StreamingVADProvider.
Each demo's README has build instructions.
Model weights download from HuggingFace on first use and cache to ~/Library/Caches/qwen3-speech/. Override with QWEN3_CACHE_DIR (CLI) or cacheDir: (Swift API). All fromPretrained() entry points also accept offlineMode: true to skip network when weights are already cached.
See docs/inference/cache-and-offline.md for full details including sandboxed iOS container paths.
speech-server --port 8080
Exposes every model via HTTP REST + WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at /v1/realtime. See Sources/AudioServer/.
dependencies: [
.package(url: "https://github.com/soniqo/speech-swift", branch: "main")
]
Import only what you need — every model is its own SPM target:
import Qwen3ASR // Speech recognition (MLX)
import ParakeetASR // Speech recognition (CoreML, batch)
import ParakeetStreamingASR // Streaming dictation with partials + EOU
import NemotronStreamingASR // English streaming ASR with native punctuation (0.6B)
import OmnilingualASR // 1,672 languages (CoreML + MLX)
import Qwen3TTS // Text-to-speech
import CosyVoiceTTS // Text-to-speech with voice cloning
import VoxCPM2TTS // 48 kHz TTS with voice cloning + voice design (2B)
import KokoroTTS // Text-to-speech (iOS-ready)
import VibeVoiceTTS // Long-form / multi-speaker TTS (EN/ZH)
import Qwen3Chat // On-device LLM chat
import MADLADTranslation // Many-to-many translation across 400+ languages
import PersonaPlex // Full-duplex speech-to-speech
import SpeechVAD // VAD + speaker diarization + embeddings
import SpeechEnhancement // Noise suppression
import SourceSeparation // Music source separation (Open-Unmix, 4 stems)
import SpeechUI // SwiftUI components for streaming transcripts
import AudioCommon // Shared protocols and utilities
AI Skill Hub 为第三方内容聚合平台,本页面信息基于公开数据整理,不对工具功能和质量作任何法律背书。
建议在沙箱或测试环境中充分验证后,再部署至生产环境,并做好必要的安全评估。
✅ Apache 2.0 — 宽松开源协议,可商用,需保留版权声明和 NOTICE 文件,含专利授权条款。
总体来看,speech-swift — AI 语音合成工具中文文档 是一款质量优秀的AI工具,在同类工具中具备一定竞争力。AI Skill Hub 将持续追踪其更新动态,建议收藏备用,结合自身场景选择合适时机引入使用。
| 原始名称 | speech-swift |
| 原始描述 | AI speech toolkit for Apple Silicon — ASR, TTS, speech-to-speech, VAD, and diarization powered by MLX and CoreML |
| Topics | apple-siliconasrcoremliosmacosmlxtts |
| GitHub | https://github.com/soniqo/speech-swift |
| License | Apache-2.0 |
| 语言 | Swift |
收录时间:2026-05-22 · 更新时间:2026-05-22 · License:Apache-2.0 · AI Skill Hub 不对第三方内容的准确性作法律背书。