CosyVoice3 TTS
Local text-to-speech using Alibaba's CosyVoice3 on macOS Apple Silicon.
Overview
CosyVoice3 is an advanced TTS system based on large language models, supporting:
- 9 languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
- 18+ Chinese dialects: Cantonese, Sichuan, Dongbei, Shanghai, etc.
- Zero-shot voice cloning: Clone any voice from 3-10 seconds of audio
- Cross-lingual synthesis: Speak Chinese with an English voice, or vice versa
- Fine-grained control: Emotions, speed, volume via text tags
Prerequisites
- macOS with Apple Silicon (M1/M2/M3)
- Python 3.10
- Conda installed
- ~5GB disk space for models
Installation
Run the installation script:
CODEBLOCK0
This will:
- Create conda environment INLINECODE0
- Install PyTorch (CPU version for Apple Silicon)
- Install CosyVoice dependencies
- Download Fun-CosyVoice3-0.5B model (~2GB)
Usage
Quick Start - Basic TTS
Important: CosyVoice3 requires the <|endofprompt|> marker in the reference (prompt) text!
CODEBLOCK1
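Because a missing marker is easy to overlook, a small guard can catch it before synthesis. The following is a minimal sketch, not part of CosyVoice3 itself: the helper name `check_prompt_text` is hypothetical, and it only verifies that the marker is present, without assuming where in the reference text it belongs.

```python
END_OF_PROMPT = "<|endofprompt|>"


def check_prompt_text(prompt_text: str) -> str:
    """Return prompt_text unchanged if it contains the marker, else raise.

    Hypothetical helper: it only checks for presence of the marker and
    makes no assumption about where in the text the marker should sit.
    """
    if END_OF_PROMPT not in prompt_text:
        raise ValueError(
            f"Reference text must contain {END_OF_PROMPT!r}; got: {prompt_text!r}"
        )
    return prompt_text
```

Calling this on the reference text just before it is passed to the model turns a silent quality problem into an explicit error.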
Using the TTS Script
Generate speech from text:
CODEBLOCK2
Available Assets
Reference audio files in cosyvoice3-repo/asset/:
- zero_shot_prompt.wav - Default Chinese female voice
- INLINECODE4 - English prompt for cross-lingual
Advanced Features
Voice Cloning
Clone a voice from 3-10 seconds of reference audio:
CODEBLOCK3
Fine-Grained Control
Control prosody with special tags:
CODEBLOCK4
Dialect Support
Use instruct mode for dialects:
CODEBLOCK5
Troubleshooting
Model not found
If you get "model not found" errors, download models manually:
CODEBLOCK6
Memory issues
For long text, split into sentences:
CODEBLOCK7
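As one possible way to do the splitting, the sketch below uses only the Python standard library. The function name `split_sentences` is illustrative, and the punctuation rules are an assumption: it splits after Chinese full-width sentence marks, and after `.`, `!`, or `?` only when whitespace follows, so decimals such as `3.5` stay intact.

```python
import re

# Split after full-width marks (。！？), or after ASCII marks (.!?)
# when they are followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[。！？])\s*|(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, dropping empty pieces."""
    pieces = _SENTENCE_BOUNDARY.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]
```

Each returned sentence can then be synthesized on its own and the resulting audio segments concatenated, which keeps the memory needed per inference call bounded.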
Audio format
Reference audio requirements:
- Format: WAV or MP3
- Sample rate: 16kHz+ (automatically resampled)
- Duration: 3-10 seconds optimal
- Content: Clear speech, minimal background noise
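The sample-rate and duration requirements above can be checked before a file is used as a reference. This is a minimal sketch using Python's standard `wave` module, so it handles uncompressed WAV files only (MP3 would need a third-party decoder). The function name and default thresholds are illustrative and simply mirror the list above.

```python
import wave


def check_reference_wav(path, min_rate=16000, min_seconds=3.0, max_seconds=10.0):
    """Return a list of problems found in a WAV reference file (empty if none)."""
    with wave.open(str(path), "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate

    problems = []
    if sample_rate < min_rate:
        problems.append(f"sample rate {sample_rate} Hz is below {min_rate} Hz")
    if not min_seconds <= duration <= max_seconds:
        problems.append(
            f"duration {duration:.1f}s is outside {min_seconds}-{max_seconds}s"
        )
    return problems
```

An empty list means the file meets the format-level requirements. Speech clarity and background noise still need to be judged by ear.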
Resources
Scripts
- install.sh - Installation script for macOS
- INLINECODE6 - Main TTS script with CLI interface
- INLINECODE7 - Download pretrained models
References
Model Files
Located in cosyvoice3-repo/pretrained_models/:
- Fun-CosyVoice3-0.5B/ - Main model (recommended)
- INLINECODE10 - Previous version
- INLINECODE11 - Lighter model
- INLINECODE12 - SFT version
- INLINECODE13 - Instruct version
Notes
- First inference takes ~30 seconds (model warmup)
- Subsequent inferences are faster
- Apple Silicon uses CPU mode (no CUDA)
- RTF (real-time factor) ~0.3-0.5 on M-series chips
- Model files are cached locally after first download
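To put the RTF figure in concrete terms: RTF is conventionally the synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below assumes that definition, and the function names are illustrative.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimate_synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate how long it takes to generate audio_seconds of speech."""
    return audio_seconds * rtf
```

At an RTF of 0.3 to 0.5, generating 10 seconds of speech would take roughly 3 to 5 seconds once the model is warmed up.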