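What is speech-swift?

speech-swift is an AI speech toolkit written in Swift for Apple Silicon. It covers ASR, TTS, speech-to-speech, VAD, and diarization, powered by MLX and CoreML.

How do I install and start using speech-swift?

Clone the GitHub repository and follow the steps in its README. Building from source requires Swift 6+ and Xcode 16+ (with the Metal Toolchain) on macOS 15+ or iOS 18+ with Apple Silicon.

Is speech-swift free? What is its license?

speech-swift is free and open source under the Apache-2.0 license, which permits use, modification, and distribution. Some bundled models carry their own licenses (the README lists MIT and CC-BY-4.0 for individual models).

Who is speech-swift for?

speech-swift targets users with some technical background, including developers, data analysts, and AI engineers.

How active is the community and how well is the project maintained?

speech-swift has 748 stars on GitHub and is under active development.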

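speech-swift: AI speech toolkit documentation

Built with Swift · open source and free, runs locally, data stays under your control

Original name: speech-swift

⭐ 748 Stars 🍴 96 Forks 💻 Swift 📄 Apache-2.0 🏷 AI score 8.5

Topics: speech recognition · text-to-speech · Apple Silicon · offline processing · machine learning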

✦ AI Skill Hub recommendation

AI Skill Hub recommends speech-swift and gives it an overall AI score of 8.5. If you need on-device speech recognition or synthesis on Apple platforms, it is worth a closer look.

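📚 In depth

speech-swift is an open-source toolkit written in Swift with 748 stars on GitHub, covering speech recognition, text-to-speech, and offline processing on Apple Silicon. Because the code is open, you can audit it for security and adapt it to your own needs.

**Why use an open-source tool rather than commercial SaaS?**
For individual developers and privacy-sensitive users, a locally run open-source tool keeps data on the machine and outside third-party data policies. Open-source tools usually have no usage caps or monthly fees, so for high-volume use the total cost of ownership (TCO) is typically lower than a subscription service.

**Installation and environment**
speech-swift depends on the Swift toolchain: Swift 6+ and Xcode 16+ (with the Metal Toolchain) on macOS 15+ or iOS 18+ with Apple Silicon. Dependencies are resolved by Swift Package Manager, and `make build` compiles the package together with the MLX Metal shader library.

**Community and maintenance**
GitHub Issues and Discussions are the quickest way to get help. Search closed issues before asking, since many common questions are already answered there. When reporting a bug, include your Swift, Xcode, and OS versions, the full error output, and a minimal reproducible example. AI Skill Hub will keep tracking speech-swift releases and flag significant changes.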

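📋 Overview

A lightweight speech-processing framework optimized for Apple Silicon that combines ASR (speech recognition), TTS (text-to-speech), speech-to-speech, VAD (voice activity detection), and speaker diarization. Built on MLX and CoreML, it suits iOS and macOS developers building offline speech applications.

speech-swift is an open-source toolkit written in Swift, focused on speech recognition and text-to-speech on Apple Silicon. The code is public and auditable, and everything runs locally, which keeps data private. It can be used on its own or integrated into a larger workflow.

GitHub Stars	748
Language	Swift
Platforms	macOS / iOS
Maintenance	Actively maintained, community driven
License	Apache-2.0
AI score	8.5
Category	AI tool
Forks	96

📖 Documentation

The content below was compiled automatically by AI Skill Hub from project information. For the complete original documentation, see the original source links at the bottom of the page.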

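📌 Key features

Open source and free, runs locally, data stays under your control
Developed in the open on GitHub with ongoing updates
Detailed documentation and usage examples, approachable for newcomers
Configurable to fit different environments
Usable as a building block in an existing stack or as a base for further development

🎯 Main use cases

Running locally to protect data privacy and meet compliance requirements
Custom integration into existing systems
Commercial products built on the open-source components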

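The install commands below were generated automatically from the project's language and type. The official README takes precedence.

Install commands

# Clone the repository
git clone https://github.com/soniqo/speech-swift
cd speech-swift

# Build the Swift package and the MLX Metal shader library (per the README)
make build

📋 Installation steps

Open the GitHub repository page
Install the requirements listed in the README (Swift 6+, Xcode 16+ with the Metal Toolchain)
Run make build from the repository root
Start from the official examples or documentation
If you hit a problem, search GitHub Issues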

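The usage examples below were compiled by AI Skill Hub and cover common scenarios.

Common commands / code examples

# Denoise and dereverberate a recording (48 kHz output)
speech restore noisy.wav -o clean.wav

# Synthesize speech using a cleaned voice-cloning reference
speech speak "Hello" --engine voxcpm2 --voice-sample ref.wav --clean-reference

# Start the HTTP API server
speech-server --port 8080

# For full usage, see the documentation
# https://github.com/soniqo/speech-swift

The configuration examples below reflect the settings documented in the README. Adjust the values to your environment.

Configuration examples

# Store model weights somewhere other than ~/Library/Caches/qwen3-speech/
export QWEN3_CACHE_DIR="/path/to/cache"

# Fetch weights from a HuggingFace mirror when huggingface.co is slow or blocked
export HF_ENDPOINT=https://hf-mirror.com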

📑 README in depth · documentation completeness 75/100

The content below was parsed directly from the GitHub README, preserving code blocks, tables, and lists.

Speech Swift

AI speech models for Apple Silicon, powered by MLX Swift and CoreML.

On-device speech recognition, synthesis, and understanding for Mac and iOS. Runs locally on Apple Silicon — no cloud, no API keys, no data leaves your device.

📚 Full Documentation → · 🤗 HuggingFace Models · 📝 Blog · 💬 Discord

Local Speech AI on a MacBook: watch the 4-minute open-source library tour on YouTube (https://youtu.be/x9zgcaW0gUk)

Use cases: Voice Agents · Transcription · Speech Generation

Capability groups: STT / ASR · Alignment · TTS · LLMs & translation · Speech-to-speech · Enhancement/restoration · Source separation · Music/audio generation · Wake word, VAD, diarization & speaker identity

STT / ASR

Qwen3-ASR — Speech-to-text (automatic speech recognition, 52 languages, MLX + CoreML)
WhisperASR — Whisper Large-v3 Turbo speech-to-text via native CoreML runtime (ANE, multilingual)
Parakeet TDT — Speech-to-text via CoreML (Neural Engine, NVIDIA FastConformer + TDT decoder, 25 languages)
Omnilingual ASR — Speech-to-text (Meta wav2vec2 + CTC, 1,672 languages across 32 scripts, CoreML 300M + MLX 300M/1B/3B/7B)
Streaming Dictation — Real-time dictation with partials and end-of-utterance detection (Parakeet-EOU-120M)
Nemotron Streaming (Multilingual) — Low-latency streaming ASR with native punctuation and capitalization (NVIDIA Nemotron-3.5-ASR-Streaming-0.6B, CoreML + MLX, 40 language-locales)
Nemotron Streaming (English) — Low-latency streaming ASR with native punctuation and capitalization (NVIDIA Nemotron-Speech-Streaming-0.6B, CoreML, English-only, smaller and faster than the multilingual variant)

Alignment

Qwen3-ForcedAligner — Word-level timestamp alignment (audio + text → timestamps)

TTS / Speech Generation

Qwen3-TTS — Text-to-speech (highest quality, streaming, custom speakers, 10 languages)
CosyVoice TTS — Streaming TTS with voice cloning, multi-speaker dialogue, emotion tags (9 languages)
VoxCPM2 — 48 kHz studio-quality TTS with voice cloning + instruction-driven voice design (2B, MLX bf16/int8, 30 languages)
Kokoro TTS — On-device TTS (82M, CoreML/Neural Engine, 54 voices, iOS-ready, 10 languages)
VibeVoice TTS — Long-form / multi-speaker TTS (Microsoft VibeVoice Realtime-0.5B + 1.5B, MLX, up to 90-min podcast/audiobook synthesis, EN/ZH)
Magpie TTS — Multilingual TTS (NVIDIA Magpie-TTS Multilingual 357M, MLX INT8 411 MB or CoreML INT8 342 MB, 9 languages, 5 baked speakers, streaming on MLX)
Supertonic TTS — On-device flow-matching TTS (Supertone Supertonic-3 99M, CoreML/Neural Engine, 31 languages, 10 voices, G2P-free, 44.1 kHz)
Chatterbox TTS — Multilingual TTS with zero-shot voice cloning (Resemble AI Chatterbox Multilingual, MLX fp16 ~1.3 GB, 23 languages, MIT)
OmniVoice TTS — Non-autoregressive diffusion TTS with zero-shot voice cloning (k2-fsa OmniVoice, Qwen3 backbone, MLX int8 ~1 GB / fp16, 600+ languages, Apache-2.0)
Indic-Mio — Hindi/Indic TTS with inline emotion markers and optional reference-voice cloning (MLX, 24 kHz)

LLMs & Translation

Qwen3Chat — On-device LLM chat (Qwen3.5-0.8B MLX/CoreML plus dense Qwen3 4B and Gemma 4 E2B/E4B MLX backends, streaming tokens)
FunctionGemma — On-device LLM for structured function / tool calls (Gemma 3 270M, CoreML 8-bit palettized, Neural Engine, ~252 tok/s)
MADLAD-400 — Many-to-many translation across 400+ languages (3B, MLX INT4 + INT8, T5 v1.1, Apache 2.0)

Speech-to-Speech & Voice Agents

Hibiki Zero-3B — Streaming speech-to-speech translation (FR/ES/PT/DE → EN, MLX INT4 + INT8, Kyutai Moshi/Mimi stack, CC-BY-4.0)
PersonaPlex — Full-duplex speech-to-speech (7B, audio in → audio out, 18 voice presets)

Enhancement, Separation & Audio Generation

DeepFilterNet3 — Real-time noise suppression (2.1M params, 48 kHz). Long-form audio above the 60 s single-shot cap is auto-chunked with crossfade — see enhanceChunked(...)
Source Separation — Music source separation via HTDemucs (Demucs v4) + Open-Unmix (UMX-HQ / UMX-L, 4 stems: vocals/drums/bass/other, 44.1 kHz stereo)
MAGNeT — Text-to-music generation (Meta MAGNeT Small 300M / Medium 1.5B, MLX INT8, 30 s clips at 32 kHz mono, masked parallel decoding)
Stable Audio 3 — Text-to-audio/music generation (Stable Audio 3 Medium, MLX INT8/INT4, 44.1 kHz stereo, variable length)
FlashSR — Audio super-resolution (FlashSR ICASSP 2025, MLX, 48 kHz mono, 1-step distilled diffusion, INT4 363 MB / INT8 720 MB)
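
The DeepFilterNet3 entry above mentions auto-chunking long audio with a crossfade. The sketch below illustrates the general idea of a linear crossfade between two overlapping chunks. It is not the library's `enhanceChunked(...)` implementation; `crossfadeJoin` and the overlap length are hypothetical.

```swift
/// Hypothetical helper: join two chunks whose last/first `overlap` samples
/// cover the same stretch of audio, blending them with a linear ramp.
func crossfadeJoin(_ head: [Float], _ tail: [Float], overlap: Int) -> [Float] {
    precondition(overlap >= 0 && overlap <= head.count && overlap <= tail.count)
    // Keep the part of `head` that is not shared with `tail`.
    var out = Array(head.dropLast(overlap))
    for i in 0..<overlap {
        // The weight moves from favoring `head` to favoring `tail`.
        let t = Float(i + 1) / Float(overlap + 1)
        out.append(head[head.count - overlap + i] * (1 - t) + tail[i] * t)
    }
    // Append the part of `tail` after the shared region.
    out.append(contentsOf: tail.dropFirst(overlap))
    return out
}

// Illustrative data only: a constant chunk fading into silence.
let headChunk: [Float] = [1, 1, 1, 1]
let tailChunk: [Float] = [0, 0, 0, 0]
let joined = crossfadeJoin(headChunk, tailChunk, overlap: 2)
print(joined.count)
```

The joined length is the sum of both chunk lengths minus the overlap, so chunk boundaries stay sample-accurate while the ramp hides discontinuities at the seam.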

Turn Detection, Diarization & Speaker Identity

Wake-word — On-device keyword spotting (KWS Zipformer 3M, CoreML, 26× real-time, configurable keyword list)
VAD — Voice activity detection (Silero streaming, Pyannote offline, FireRedVAD 100+ languages)
Speaker Diarization — Who spoke when (Pyannote pipeline, Sortformer end-to-end on Neural Engine)
Speaker Embeddings — WeSpeaker ResNet34 (256-dim), CAM++ (192-dim)

Papers: Qwen3-ASR (Alibaba) · Qwen3-TTS (Alibaba) · Omnilingual ASR (Meta) · Parakeet TDT (NVIDIA) · CosyVoice 3 (Alibaba) · Kokoro (StyleTTS 2) · PersonaPlex (NVIDIA) · Mimi (Kyutai) · Hibiki (Kyutai) · Sortformer (NVIDIA)

Requirements

Swift 6+, Xcode 16+ (with Metal Toolchain)
macOS 15+ (Sequoia) or iOS 18+, Apple Silicon (M1/M2/M3/M4)

The macOS 15 / iOS 18 minimum comes from MLState — Apple's persistent ANE state API used by the CoreML pipelines (Qwen3-ASR, Qwen3-Chat, Qwen3-TTS) to keep KV caches resident on the Neural Engine across token steps.
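
A minimal sketch of checking the macOS floor stated above at runtime. `meetsMinimum` is a hypothetical helper, not part of speech-swift; it only compares version numbers, and it covers macOS 15 only (iOS 18 would need its own floor value).

```swift
import Foundation

/// Hypothetical helper: true when `current` is at least `required`.
func meetsMinimum(_ current: OperatingSystemVersion,
                  required: OperatingSystemVersion) -> Bool {
    if current.majorVersion != required.majorVersion {
        return current.majorVersion > required.majorVersion
    }
    if current.minorVersion != required.minorVersion {
        return current.minorVersion > required.minorVersion
    }
    return current.patchVersion >= required.patchVersion
}

// macOS 15.0.0 is the floor quoted in the requirements above.
let macOSFloor = OperatingSystemVersion(majorVersion: 15, minorVersion: 0, patchVersion: 0)
let running = ProcessInfo.processInfo.operatingSystemVersion
print(meetsMinimum(running, required: macOSFloor)
      ? "OS floor met"
      : "OS is below the stated macOS 15 floor")
```

In app code, the compile-time form `if #available(macOS 15, iOS 18, *)` is usually the more idiomatic gate; the function form is shown because it is easy to unit test.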

Installation

Build from source

git clone https://github.com/soniqo/speech-swift
cd speech-swift
make build

`make build` compiles the Swift package and the MLX Metal shader library. The Metal library is required for GPU inference; without it you'll see `Failed to load the default metallib` at runtime. Use `make debug` for debug builds and `make test` for the test suite.

Full build and install guide →

Quick start

Add the package to your Package.swift:

.package(url: "https://github.com/soniqo/speech-swift", branch: "main")

Import only the modules you need — every model is its own SPM library, so you don't pay for what you don't use:

.product(name: "ParakeetStreamingASR", package: "speech-swift"),
.product(name: "SpeechUI",             package: "speech-swift"),  // optional SwiftUI views
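
Putting the two fragments above together, a complete manifest might look like the sketch below. The package and target name `MyDictationApp` is hypothetical, and the `platforms` entry is an assumption derived from the stated macOS 15 / iOS 18 floor rather than something the README prescribes.

```swift
// swift-tools-version: 6.0
import PackageDescription

let package = Package(
    name: "MyDictationApp",                 // hypothetical app name
    platforms: [.macOS(.v15), .iOS(.v18)],  // assumption: mirrors the stated OS floor
    dependencies: [
        // Dependency line as given in the README.
        .package(url: "https://github.com/soniqo/speech-swift", branch: "main")
    ],
    targets: [
        .executableTarget(
            name: "MyDictationApp",
            dependencies: [
                // Only the products this target actually uses.
                .product(name: "ParakeetStreamingASR", package: "speech-swift"),
                .product(name: "SpeechUI", package: "speech-swift"),
            ]
        )
    ]
)
```

Listing products per target keeps unrelated models out of the build, which is the point of the one-library-per-model layout described above.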

Transcribe an audio buffer in 3 lines:

import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
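
The snippet above takes `audioSamples` as given. As a rough sketch of preparing such a buffer from raw 16-bit PCM, the code below converts to normalized floats. It assumes the model expects mono Float32 samples in the range -1 to 1, which the README does not state; `floatSamples(fromPCM16:)` is a hypothetical helper. For audio files, the README's speech-translation example uses `AudioFileLoader` from AudioCommon instead.

```swift
import Foundation

/// Hypothetical helper: convert signed 16-bit PCM to Float32.
/// Assumption: the ASR entry points take normalized mono Float32 samples.
func floatSamples(fromPCM16 pcm: [Int16]) -> [Float] {
    // Dividing by 32768 maps Int16.min exactly to -1.0.
    pcm.map { Float($0) / 32768.0 }
}

// Illustrative values: silence, half scale, and negative full scale.
let pcm: [Int16] = [0, 16384, -32768]
let audioSamples = floatSamples(fromPCM16: pcm)
print(audioSamples)
```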

Live streaming with partials:

for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.isFinal ? "FINAL: \(partial.text)" : "... \(partial.text)")
}

SwiftUI dictation view in ~10 lines:

import SwiftUI
import ParakeetStreamingASR
import SpeechUI

@MainActor
struct DictateView: View {
    @State private var store = TranscriptionStore()

    var body: some View {
        TranscriptionView(finals: store.finalLines, currentPartial: store.currentPartial)
            .task {
                let model = try? await ParakeetStreamingASRModel.fromPretrained()
                guard let model else { return }
                for await p in model.transcribeStream(audio: samples, sampleRate: 16000) {
                    store.apply(text: p.text, isFinal: p.isFinal)
                }
            }
    }
}

SpeechUI ships only TranscriptionView (finals + partials) and TranscriptionStore (streaming ASR adapter). Use AVFoundation for audio visualization and playback.

Available SPM products: Qwen3ASR, WhisperASR, Qwen3TTS, Qwen3TTSCoreML, ParakeetASR, ParakeetStreamingASR, NemotronStreamingASR, OmnilingualASR, KokoroTTS, SupertonicTTS, VibeVoiceTTS, CosyVoiceTTS, VoxCPM2TTS, ChatterboxTTS, OmniVoiceTTS, IndicMioTTS, FishAudioTTS, MagpieTTS, MagpieTTSCoreML, MAGNeTMusicGen, StableAudio3MusicGen, FlashSR, PersonaPlex, HibikiTranslate, MADLADTranslation, SpeechVAD, SpeechWakeWord, SpeechEnhancement, SpeechRestoration, SourceSeparation, Qwen3Chat, FunctionGemma, SpeechCore, SpeechUI, AudioCommon.

Code examples

The snippets below show the minimal path for each domain. Every section links to a full guide on soniqo.audio with configuration options, multiple backends, streaming patterns, and CLI recipes.

Speech-to-Text — [full guide →](https://soniqo.audio/guides/transcribe)

import Qwen3ASR

let model = try await Qwen3ASRModel.fromPretrained()
let text = model.transcribe(audio: audioSamples, sampleRate: 16000)

Alternative backends: WhisperASR (Whisper Large-v3 Turbo, native CoreML), Parakeet TDT (CoreML, 32× realtime), Omnilingual ASR (1,672 languages, CoreML or MLX), Streaming dictation (live partials).

Forced Alignment — [full guide →](https://soniqo.audio/guides/align)

import Qwen3ASR

let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)
for word in aligned {
    print("[\(word.startTime)s - \(word.endTime)s] \(word.text)")
}
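
Word-level timestamps like those printed above are commonly turned into subtitles. The sketch below formats one SRT cue per word. `AlignedWord` is a hypothetical stand-in whose field names mirror the snippet above; the aligner's real output type may differ.

```swift
import Foundation

/// Hypothetical stand-in for one element of the aligner's output.
struct AlignedWord {
    let text: String
    let startTime: Double  // seconds
    let endTime: Double    // seconds
}

/// Format seconds as an SRT timestamp (HH:MM:SS,mmm).
func srtTimestamp(_ seconds: Double) -> String {
    let totalMs = Int((seconds * 1000).rounded())
    let ms = totalMs % 1000
    let s = (totalMs / 1000) % 60
    let m = (totalMs / 60_000) % 60
    let h = totalMs / 3_600_000
    return String(format: "%02d:%02d:%02d,%03d", h, m, s, ms)
}

/// Emit one numbered SRT cue per word, separated by blank lines.
func srt(from words: [AlignedWord]) -> String {
    words.enumerated().map { index, word in
        "\(index + 1)\n\(srtTimestamp(word.startTime)) --> \(srtTimestamp(word.endTime))\n\(word.text)\n"
    }.joined(separator: "\n")
}

// Illustrative timings, not real aligner output.
let words = [
    AlignedWord(text: "Can", startTime: 0.12, endTime: 0.31),
    AlignedWord(text: "you", startTime: 0.31, endTime: 0.45),
]
print(srt(from: words))
```

Real subtitles usually group several words per cue; one cue per word keeps the example short.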

Text-to-Speech — [full guide →](https://soniqo.audio/guides/speak)

import Qwen3TTS
import AudioCommon

let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello world", language: "english")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)

Alternative TTS engines: CosyVoice3 (streaming + voice cloning + emotion tags), Kokoro-82M (iOS-ready, 54 voices), VibeVoice (long-form podcast / multi-speaker, EN/ZH), Fish Audio S2 Pro (experimental zero-shot cloning + bracket style markers), Voice cloning.

Speech-to-Speech — [full guide →](https://soniqo.audio/guides/respond)

import PersonaPlex

let model = try await PersonaPlexModel.fromPretrained()
let responseAudio = model.respond(userAudio: userSamples)
// 24 kHz mono Float32 output ready for playback

LLM Chat — [full guide →](https://soniqo.audio/guides/chat)

import Qwen3Chat
import FunctionGemma

let chat = try await Qwen35MLXChat.fromPretrained()
chat.chat(messages: [(.user, "Explain MLX in one sentence")]) { token, isFinal in
    print(token, terminator: "")
}

Translation — [full guide →](https://soniqo.audio/guides/translate)

import MADLADTranslation

let translator = try await MADLADTranslator.fromPretrained()
let es = try translator.translate("Hello, how are you?", to: "es")
// → "Hola, ¿cómo estás?"

Speech Translation — [full guide →](https://soniqo.audio/guides/audio-translate)

import HibikiTranslate
import AudioCommon

let model = try await HibikiTranslateModel.fromPretrained()
let pcm = try AudioFileLoader.load(url: input, targetSampleRate: 24000)
let (englishAudio, textTokens) = model.translate(
    sourceAudio: pcm, sourceLanguage: .fr
)
// Hibiki Zero-3B — FR/ES/PT/DE → EN, on-device, streaming Mimi codec

Voice Activity Detection — [full guide →](https://soniqo.audio/guides/vad)

import SpeechVAD

let vad = try await SileroVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for s in segments { print("\(s.startTime)s → \(s.endTime)s") }
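
VAD output often contains short gaps inside a single utterance. A minimal post-processing sketch that merges segments separated by less than a chosen gap follows. `SpeechSegment` is a hypothetical stand-in for the library's segment type, and the 0.5 s threshold is an arbitrary illustration.

```swift
/// Hypothetical stand-in for a VAD segment (times in seconds).
struct SpeechSegment: Equatable {
    var startTime: Double
    var endTime: Double
}

/// Merge neighbors whose gap is at most `maxGap` seconds.
/// Assumes `segments` is sorted by `startTime`.
func mergeSegments(_ segments: [SpeechSegment], maxGap: Double) -> [SpeechSegment] {
    var merged: [SpeechSegment] = []
    for segment in segments {
        if let last = merged.last, segment.startTime - last.endTime <= maxGap {
            // Extend the previous segment instead of starting a new one.
            merged[merged.count - 1].endTime = max(last.endTime, segment.endTime)
        } else {
            merged.append(segment)
        }
    }
    return merged
}

// Illustrative segments: the first two are 0.25 s apart, the third is 2 s later.
let raw = [
    SpeechSegment(startTime: 0.0, endTime: 1.0),
    SpeechSegment(startTime: 1.25, endTime: 2.0),
    SpeechSegment(startTime: 4.0, endTime: 5.0),
]
let utterances = mergeSegments(raw, maxGap: 0.5)
for u in utterances { print("\(u.startTime)s → \(u.endTime)s") }
```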

Speaker Diarization — [full guide →](https://soniqo.audio/guides/diarize)

import SpeechVAD

let diarizer = try await DiarizationPipeline.fromPretrained()
let segments = diarizer.diarize(audio: samples, sampleRate: 16000)
for s in segments { print("Speaker \(s.speakerId): \(s.startTime)s - \(s.endTime)s") }
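
A common next step after diarization is totaling talk time per speaker. The sketch below does that over a hypothetical `SpeakerTurn` type whose field names mirror the snippet above. Treating `speakerId` as an `Int` is an assumption; the library's actual type is not shown in the README.

```swift
/// Hypothetical stand-in for a diarization segment (times in seconds).
struct SpeakerTurn {
    let speakerId: Int
    let startTime: Double
    let endTime: Double
}

/// Sum the duration of every turn, keyed by speaker.
func talkTime(_ turns: [SpeakerTurn]) -> [Int: Double] {
    var totals: [Int: Double] = [:]
    for turn in turns {
        totals[turn.speakerId, default: 0] += turn.endTime - turn.startTime
    }
    return totals
}

// Illustrative turns, not real diarizer output.
let turns = [
    SpeakerTurn(speakerId: 0, startTime: 0.0, endTime: 2.5),
    SpeakerTurn(speakerId: 1, startTime: 2.5, endTime: 4.0),
    SpeakerTurn(speakerId: 0, startTime: 4.0, endTime: 5.5),
]
let totals = talkTime(turns)
for id in totals.keys.sorted() {
    print("Speaker \(id): \(totals[id]!)s")
}
```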

Speech Enhancement — [full guide →](https://soniqo.audio/guides/denoise)

import SpeechEnhancement

let denoiser = try await DeepFilterNet3Model.fromPretrained()
let clean = try denoiser.enhance(audio: noisySamples, sampleRate: 48000)

Speech Restoration — [full guide →](https://soniqo.audio/guides/restore)

Joint denoise and dereverb with Sidon (w2v-BERT 2.0 predictor + DAC vocoder, Core ML). Unlike a generic noise suppressor, Sidon is trained to preserve speaker identity, so it is well suited to cleaning a noisy or reverberant voice-cloning reference before TTS. Input is 16 kHz; output is 48 kHz mono.

import SpeechRestoration

let restorer = try await SpeechRestorer.fromPretrained()          // .fp16 (default) or .int8
let clean = try restorer.restore(audio: noisySamples, sampleRate: 16000)  // → 48 kHz

From the CLI:

```bash
speech restore noisy.wav -o clean.wav      # denoise + dereverb, 48 kHz output
speech restore noisy.wav --variant int8    # smaller, lower peak RAM
```

Voice Pipeline (ASR → LLM → TTS) — [full guide →](https://soniqo.audio/voice-agents)

import SpeechCore

let pipeline = VoicePipeline(
    stt: parakeetASR,
    tts: qwen3TTS,
    vad: sileroVAD,
    config: .init(mode: .voicePipeline),
    onEvent: { event in print(event) }
)
pipeline.start()
pipeline.pushAudio(micSamples)

VoicePipeline is the real-time voice-agent state machine (powered by speech-core) with VAD-driven turn detection, interruption handling, and eager STT. It connects any SpeechRecognitionModel + SpeechGenerationModel + StreamingVADProvider.
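
To make the idea of a VAD-driven turn state machine with interruption handling concrete, here is a deliberately small sketch. The states, events, and transition table are hypothetical and are not taken from speech-core; the real pipeline is certainly richer.

```swift
/// Hypothetical conversation states for a voice agent.
enum TurnState: Equatable { case idle, userSpeaking, responding }

/// Hypothetical events, as a VAD and a TTS player might emit them.
enum TurnEvent { case speechStarted, speechEnded, responseFinished }

/// Pure transition function: current state plus event gives the next state.
func next(_ state: TurnState, on event: TurnEvent) -> TurnState {
    switch (state, event) {
    case (.idle, .speechStarted):
        return .userSpeaking
    case (.userSpeaking, .speechEnded):
        return .responding      // end of turn: run STT, then generate and speak a reply
    case (.responding, .speechStarted):
        return .userSpeaking    // barge-in: the user interrupts playback
    case (.responding, .responseFinished):
        return .idle
    default:
        return state            // ignore events that do not apply in this state
    }
}

// Walk through one turn followed by an interruption.
var state = TurnState.idle
for event in [TurnEvent.speechStarted, .speechEnded, .speechStarted] {
    state = next(state, on: event)
    print(state)
}
```

Keeping the transition function pure makes the turn logic testable without audio hardware.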

Demo apps

DictateDemo (docs) — macOS menu-bar streaming dictation with live partials, VAD-driven end-of-utterance detection, and one-click copy. Runs as a background agent (Parakeet-EOU-120M + Silero VAD).
iOSEchoDemo — iOS echo demo (Parakeet ASR + Kokoro TTS). Device and simulator.
PersonaPlexDemo — Conversational voice assistant with mic input, VAD, and multi-turn context. macOS. RTF ~0.94 on M2 Max (faster than real-time).
SpeechDemo — Dictation and TTS synthesis in a tabbed interface. macOS.

Each demo's README has build instructions.
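
The RTF figure quoted for PersonaPlexDemo is a real-time factor, conventionally processing time divided by audio duration, so values below 1 mean faster than real time. The arithmetic is sketched below; the 9.4 s and 10 s inputs are illustrative numbers chosen to reproduce an RTF of about 0.94, not measurements.

```swift
/// Real-time factor: seconds of compute per second of audio.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    precondition(audioSeconds > 0, "audio duration must be positive")
    return processingSeconds / audioSeconds
}

// Illustrative inputs only.
let rtf = realTimeFactor(processingSeconds: 9.4, audioSeconds: 10.0)
let speedup = 1.0 / rtf  // the "N× real-time" style of figure is the inverse of RTF
print("RTF:", rtf, "speedup:", speedup)
```

By the same relation, a model described as 32× real time has an RTF of roughly 0.03.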

Cache configuration

Model weights download from HuggingFace on first use and cache to ~/Library/Caches/qwen3-speech/. Override with QWEN3_CACHE_DIR (CLI) or cacheDir: (Swift API). All fromPretrained() entry points also accept offlineMode: true to skip network when weights are already cached.

Users in mainland China (or anywhere huggingface.co is slow/blocked) can fetch from a mirror by setting HF_ENDPOINT, e.g. export HF_ENDPOINT=https://hf-mirror.com.

See docs/inference/cache-and-offline.md for full details including sandboxed iOS container paths.
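
The lookup order described above (environment override first, then the default cache path) can be sketched as follows. This only illustrates the documented behavior; `cacheDirectory(environment:home:)` is a hypothetical helper, and how the library itself resolves the path is not shown in the README.

```swift
import Foundation

/// Hypothetical helper: pick the cache directory from the environment,
/// falling back to the default path named in the README.
func cacheDirectory(environment: [String: String], home: URL) -> URL {
    if let override = environment["QWEN3_CACHE_DIR"], !override.isEmpty {
        return URL(fileURLWithPath: override, isDirectory: true)
    }
    return home.appendingPathComponent("Library/Caches/qwen3-speech", isDirectory: true)
}

let resolved = cacheDirectory(
    environment: ProcessInfo.processInfo.environment,
    home: URL(fileURLWithPath: NSHomeDirectory(), isDirectory: true)
)
print(resolved.path)
```

Passing the environment and home directory as parameters keeps the function testable and avoids assumptions about sandboxed iOS container paths.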

Clean a voice-cloning reference before TTS (opt-in; preserves speaker identity):

```bash
speech speak "Hello" --engine voxcpm2 --voice-sample ref.wav --clean-reference
```

HTTP API server

speech-server --port 8080

Exposes every model via HTTP REST + WebSocket endpoints, including OpenAI-compatible APIs: a Realtime WebSocket at /v1/realtime and a transcription REST endpoint at /v1/audio/transcriptions. See Sources/AudioServer/.
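
As a sketch of calling the transcription endpoint from Swift, the code below builds a multipart upload request. Several details are assumptions: the host and port match the `speech-server --port 8080` example, the form field name `file` follows the OpenAI transcription API convention, and any further fields the server may require (such as a model name) are not covered. `multipartBody` is a hypothetical helper.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on Linux
#endif

/// Hypothetical helper: wrap one audio file in a multipart/form-data body.
/// Assumption: the server reads the upload from a field named "file".
func multipartBody(fileName: String, audio: Data, boundary: String) -> Data {
    var body = Data()
    body.append(Data("--\(boundary)\r\n".utf8))
    body.append(Data("Content-Disposition: form-data; name=\"file\"; filename=\"\(fileName)\"\r\n".utf8))
    body.append(Data("Content-Type: audio/wav\r\n\r\n".utf8))
    body.append(audio)
    body.append(Data("\r\n--\(boundary)--\r\n".utf8))
    return body
}

let boundary = "Boundary-\(UUID().uuidString)"
var request = URLRequest(url: URL(string: "http://localhost:8080/v1/audio/transcriptions")!)
request.httpMethod = "POST"
request.setValue("multipart/form-data; boundary=\(boundary)", forHTTPHeaderField: "Content-Type")
// Empty placeholder payload; load real WAV bytes in practice.
request.httpBody = multipartBody(fileName: "clip.wav", audio: Data(), boundary: boundary)
// Send with URLSession once speech-server is running.
```

Check Sources/AudioServer/ for the fields the server actually accepts before relying on this shape.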

Swift Package Manager

dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", branch: "main")
]

Import only what you need — every model is its own SPM target:

import Qwen3ASR             // Speech recognition (MLX)
import WhisperASR           // Whisper Large-v3 Turbo (CoreML)
import ParakeetASR          // Speech recognition (CoreML, batch)
import ParakeetStreamingASR // Streaming dictation with partials + EOU
import NemotronStreamingASR // Multilingual streaming ASR with native punctuation (0.6B, 40 langs)
import OmnilingualASR       // 1,672 languages (CoreML + MLX)
import Qwen3TTS             // Text-to-speech
import CosyVoiceTTS         // Text-to-speech with voice cloning
import VoxCPM2TTS           // 48 kHz TTS with voice cloning + voice design (2B)
import KokoroTTS            // Text-to-speech (iOS-ready)
import VibeVoiceTTS         // Long-form / multi-speaker TTS (EN/ZH)
import MagpieTTS            // Multilingual TTS (NVIDIA Magpie 357M, MLX, 9 langs)
import MagpieTTSCoreML      // Magpie CoreML backend (hybrid CoreML + MLX, 8 langs)
import FishAudioTTS         // Experimental Fish Audio S2 Pro runtime with voice cloning
import IndicMioTTS          // Hindi/Indic TTS with emotion markers
import Qwen3Chat            // On-device LLM chat
import FunctionGemma        // On-device tool-call LLM
import MADLADTranslation    // Many-to-many translation across 400+ languages
import HibikiTranslate      // Streaming speech-to-speech translation (FR/ES/PT/DE → EN)
import PersonaPlex          // Full-duplex speech-to-speech
import SpeechVAD            // VAD + speaker diarization + embeddings
import SpeechWakeWord       // Wake-word / keyword spotting
import SpeechEnhancement    // Noise suppression
import SpeechRestoration    // Speech restoration — denoise + dereverb (Sidon, CoreML, 48 kHz)
import SourceSeparation     // Music source separation (Open-Unmix, 4 stems)
import StableAudio3MusicGen // Text-to-audio/music generation (Stable Audio 3)
import SpeechUI             // SwiftUI components for streaming transcripts
import AudioCommon          // Shared protocols and utilities

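🇨🇳 Chinese documentation mirror · AI translation · 2026-05-23

Each section of the English README is summarized below for quick reading. The full original is the README section above.

📌 Introduction

Speech Swift is a collection of AI speech models for Apple Silicon, built on MLX Swift and CoreML. It provides on-device speech recognition, synthesis, and understanding for Mac and iOS.

📋 Requirements

Speech Swift requires Swift 6+, Xcode 16+ (with the Metal Toolchain), and macOS 15+ (Sequoia) or iOS 18+ on Apple Silicon (M1/M2/M3/M4).

🛠 Installation (from source)

To build Speech Swift from source, run `git clone`, `cd` into the repository, and run `make build`.

🚀 Usage

Add the package to the dependencies in `Package.swift`, then depend on the products you need, such as `Qwen3ASR` or `ParakeetStreamingASR`.

⚙️ Configuration (environment variables)

Configuration centers on the model cache: set `QWEN3_CACHE_DIR` for the CLI or pass `cacheDir:` in the Swift API.

🔌 API

Speech Swift includes an HTTP API server with REST and WebSocket endpoints, including an OpenAI-compatible Realtime WebSocket at `/v1/realtime` and a transcription REST endpoint at `/v1/audio/transcriptions`.

🔄 Workflow / modules

Speech Swift is distributed through Swift Package Manager (SPM), and every model is its own SPM library, so you import only what you need.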

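🎯 aiskill88 AI review · Grade A · 2026-05-23

aiskill88's assessment: a specialized speech solution for the Apple ecosystem with strong performance, a complete feature set, and solid documentation. Offline processing is its clearest advantage, making it a strong option for iOS and macOS development.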

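📚 Practical guide

Who it is for

Developers building voice AI products
Teams building voice agents on Apple devices (ASR → LLM → TTS)
Cross-border and multilingual content teams that need on-device transcription or translation

Best practices

Import only the SPM products you need, since every model ships as its own library
After the first download, pass offlineMode: true to fromPretrained() to skip network access

Common mistakes

Building without the MLX Metal shader library, which fails at runtime with Failed to load the default metallib (make build compiles it)

Deployment options

Swift package: add speech-swift as a Swift Package Manager dependency in a macOS or iOS app
CLI: build from source with make build, then call the speech command
Local server: run speech-server --port 8080 to expose the models over HTTP REST and WebSocket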

👥 Audience

AI enthusiasts · Researchers and students · Developers and engineers · Technical founders

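⚖️ Pros and cons

✅ Pros

+ Apache-2.0 license, free for commercial use
+ Fully open source with no license fees
+ Runs locally, so data stays under your control
+ Community support for finding answers and asking questions

⚠️ Cons

− Installation and initial setup require some technical background
− Feature coverage may trail mature commercial products
− Support depends on the open-source community, so response times vary

⚠️ Usage notes

AI Skill Hub is a third-party content aggregator. The information on this page is compiled from public data and is not a legal endorsement of the tool's functionality or quality.

Validate the tool in a sandbox or test environment before deploying to production, and carry out any necessary security assessment.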

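❓ FAQ

Does it run offline on device?

Yes. It is built on CoreML and MLX and runs entirely on device with no cloud service. Model weights are downloaded from HuggingFace on first use and cached locally.

Which Apple devices are supported?

Macs and iOS devices with Apple Silicon, running macOS 15+ (Sequoia) or iOS 18+.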

💡 AI Skill Hub verdict

Overall, speech-swift is a well-made AI tool that is competitive among comparable projects. AI Skill Hub will keep tracking its updates.

🌐 Original information

Original name	`speech-swift`
Original description	AI speech toolkit for Apple Silicon: ASR, TTS, speech-to-speech, VAD, and diarization powered by MLX and CoreML
Topics	speech recognition, text-to-speech, Apple Silicon, offline processing, machine learning
GitHub	https://github.com/soniqo/speech-swift
License	Apache-2.0
Language	Swift

🔗 Original source

🐙 GitHub repository https://github.com/soniqo/speech-swift 🌐 Official website https://soniqo.audio

Listed: 2026-05-22 · Updated: 2026-05-30 · License: Apache-2.0 · AI Skill Hub does not legally vouch for the accuracy of third-party content.
