# Audio Speaker Tools
Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer.
## Overview
This skill provides three main workflows:
1. **Speaker separation** - Extract per-speaker audio from multi-speaker recordings
2. **Voice comparison** - Measure speaker similarity between two audio files
3. **Audio processing** - Segment extraction and voice isolation
## Prerequisites
### Setup Virtual Environment
Run once to create the venv and install dependencies:
```bash
bash scripts/setup_venv.sh
```
Default venv location: `./.venv`
**Requirements:**
- Python 3.9+
- ffmpeg (`brew install ffmpeg`)
- HuggingFace token (set as env var `HF_TOKEN`)
## Scripts
### 1. Speaker Separation: `diarize_and_slice_mps.py`
Separate speakers from multi-speaker audio:
```bash
# Basic usage
HF_TOKEN=<your-hf-token> \
/path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
--input audio.mp3 \
--outdir /path/to/output \
--prefix MyShow
# With speaker constraints
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
--input audio.mp3 \
--outdir ./out \
--min-speakers 2 \
--max-speakers 5 \
--pad-ms 100
```
**Process:**
1. Converts input to 16kHz mono WAV
2. Runs Demucs vocal/background separation (optional, for cleaner input)
3. Runs pyannote speaker diarization (MPS-accelerated)
4. Extracts concatenated per-speaker WAV files
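Steps 3-4 are a thin wrapper around pyannote's pretrained pipeline. A minimal sketch of the core diarization call, assuming the standard `pyannote.audio` 3.x API (the real slicing and padding logic lives in `diarize_and_slice_mps.py`):
```python
import os
import torch
from pyannote.audio import Pipeline

# Load the gated diarization pipeline; the token must come from the environment.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],
)

# MPS first, CPU fallback.
if torch.backends.mps.is_available():
    pipeline.to(torch.device("mps"))

# Run on the 16kHz mono WAV from step 1; speaker bounds map to the CLI flags.
diarization = pipeline("audio_16k_mono.wav", min_speakers=2, max_speakers=5)

# Each track is a (segment, track_id, speaker_label) triple.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {segment.start:.2f}s - {segment.end:.2f}s")
```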
**Output:**
- `<prefix>_speaker1.wav`, `<prefix>_speaker2.wav`, etc. (one per detected speaker)
- `diarization.rttm` (time-stamped speaker segments)
- `segments.jsonl` (JSON segments metadata)
- `meta.json` (pipeline info and speaker index)
**Important:**
- **Always pass the HF token via the `HF_TOKEN` env var**, never as a CLI arg
- **MPS first, CPU fallback** - the script prefers the Metal GPU and falls back to CPU if MPS is unavailable
- **Default output:** `./separated/` (used when `--outdir` is omitted)
### 2. Voice Comparison: `compare_voices.py`
Measure similarity between two voice samples using Resemblyzer:
```bash
# Basic comparison
python scripts/compare_voices.py \
--audio1 sample1.wav \
--audio2 sample2.wav
# JSON output
python scripts/compare_voices.py \
--audio1 reference.wav \
--audio2 clone.wav \
--threshold 0.85 \
--json
# Exit code = 0 if pass, 1 if fail
```
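Under the hood, the comparison reduces to embedding both files with Resemblyzer and taking the cosine similarity of the two embeddings. A minimal sketch, assuming the standard `resemblyzer` API (the script adds argument parsing, thresholding, and JSON output on top):
```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder("cpu")  # explicit CPU; Resemblyzer has no MPS support

# preprocess_wav resamples to 16kHz, trims long silences, and normalizes volume.
wav1 = preprocess_wav("sample1.wav")
wav2 = preprocess_wav("sample2.wav")

# embed_utterance returns an L2-normalized speaker embedding, so the dot
# product of two embeddings is their cosine similarity.
emb1 = encoder.embed_utterance(wav1)
emb2 = encoder.embed_utterance(wav2)
similarity = float(np.dot(emb1, emb2))

print(f"similarity: {similarity:.3f}")  # interpret against the thresholds below
```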
**Scores:**
- `< 0.75` = Different speakers
- `0.75-0.84` = Likely same speaker
- `0.85+` = Excellent match (ideal for voice cloning validation)
**Use cases:**
- Voice clone quality assessment (compare clone vs. original)
- Speaker verification (authenticate speaker identity)
- Validate speaker separation (confirm separated speakers are distinct)
**See:** `references/scoring-guide.md` for detailed interpretation
### 3. Audio Trimming
Use `ffmpeg` directly for segment extraction:
```bash
# Extract 10-second segment starting at 5 seconds
ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3
# Extract vocals only with Demucs (before diarization)
demucs --two-stems vocals --out ./separated input.mp3
```
## Workflows
### Workflow 1: Extract Clean Voice Sample for Cloning
**Goal:** Get a clean, single-speaker sample for ElevenLabs voice cloning
```bash
# 1. Separate speakers
HF_TOKEN=<your-hf-token> python scripts/diarize_and_slice_mps.py \
--input podcast.mp3 --outdir ./out --prefix Podcast
# 2. Review speaker files (out/Podcast_speaker1.wav, etc.)
# 3. Select best sample (5-30s, clean speech)
ffmpeg -i out/Podcast_speaker2.wav -ss 10 -t 20 -c copy sample.wav
# 4. Upload to ElevenLabs as instant voice clone
```
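Before uploading, a quick sanity check on the trimmed sample helps avoid wasted clone attempts. A small sketch using `soundfile` (assumed to be available, since the audio dependencies commonly pull it in):
```python
import soundfile as sf  # assumed dependency, not part of the skill's scripts

info = sf.info("sample.wav")
print(f"{info.duration:.1f}s, {info.samplerate} Hz, {info.channels} channel(s)")

# Aim for 5-30s of clean, single-speaker speech (see Sample Quality Tips below).
if not 5 <= info.duration <= 30:
    print("Consider re-trimming: 5-30s samples tend to clone best")
```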
**See:** `references/elevenlabs-cloning.md` for best practices
### Workflow 2: Validate Voice Clone Quality
**Goal:** Measure how well a cloned voice matches the original
```bash
# 1. Generate test audio with ElevenLabs clone
# (done via ElevenLabs web UI or API)
# 2. Compare clone vs. reference
python scripts/compare_voices.py \
--audio1 original_sample.wav \
--audio2 elevenlabs_clone.wav \
--threshold 0.85 \
--json
# 3. Interpret score:
# 0.85+ = excellent, publish-ready
# 0.80-0.84 = acceptable, may need tweaking
# < 0.80 = poor, try different sample or settings
```
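Because `compare_voices.py` exits 0 when the score clears `--threshold` and 1 otherwise, the check is easy to automate. A sketch of gating on the exit code from Python (file names are placeholders):
```python
import subprocess

# Exit code 0 = similarity >= threshold, 1 = below threshold (see step 2 above).
result = subprocess.run(
    [
        "python", "scripts/compare_voices.py",
        "--audio1", "original_sample.wav",
        "--audio2", "elevenlabs_clone.wav",
        "--threshold", "0.85",
        "--json",
    ],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    print("Clone passes the 0.85 bar:", result.stdout.strip())
else:
    print("Below threshold; see references/scoring-guide.md for troubleshooting")
```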
**See:** `references/scoring-guide.md` for troubleshooting low scores
### Workflow 3: Multi-Speaker Conversation Analysis
**Goal:** Separate and identify speakers in a conversation
```bash
# 1. Run diarization
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
--input meeting.mp3 --outdir ./out --prefix Meeting
# 2. Check detected speakers (meta.json)
cat out/meta.json
# 3. Compare speaker pairs to confirm separation
python scripts/compare_voices.py \
--audio1 out/Meeting_speaker1.wav \
--audio2 out/Meeting_speaker2.wav
# Expected: < 0.75 if separation worked correctly
```
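The `diarization.rttm` file follows the standard RTTM layout (space-separated `SPEAKER` lines with start time and duration in seconds), so per-speaker talk time can be tallied with a few lines:
```python
from collections import defaultdict

talk_time = defaultdict(float)

# RTTM lines look like: SPEAKER <file> 1 <start> <duration> <NA> <NA> <speaker> <NA> <NA>
with open("out/diarization.rttm") as f:
    for line in f:
        fields = line.split()
        if fields and fields[0] == "SPEAKER":
            duration, speaker = float(fields[4]), fields[7]
            talk_time[speaker] += duration

for speaker, seconds in sorted(talk_time.items()):
    print(f"{speaker}: {seconds:.1f}s of speech")
```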
## Technical Notes
### Device Acceleration
- **pyannote diarization:** MPS (Metal) by default, CPU fallback
- **Resemblyzer:** CPU only (no GPU acceleration)
- **Demucs:** MPS by default when available
To force CPU for diarization: `--device cpu`
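The MPS-first/CPU-fallback behavior boils down to a device check at startup. A minimal sketch of how the `--device` handling can work (the actual flag parsing lives in the script):
```python
import torch

def pick_device(requested: str = "auto") -> torch.device:
    """Prefer MPS (Metal) when available; honor an explicit --device cpu."""
    if requested == "cpu" or not torch.backends.mps.is_available():
        return torch.device("cpu")
    return torch.device("mps")

print(pick_device())  # mps on Apple Silicon with a working PyTorch build, else cpu
```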
### Audio Formats
- **Input:** Any format supported by ffmpeg (wav, mp3, flac, m4a, etc.)
- **Processing:** Internally converted to 16kHz mono WAV for diarization
- **Output:** WAV, with the source sample rate and channel layout preserved (e.g. 44.1kHz stereo)
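The internal 16kHz mono conversion is plain ffmpeg resampling. A sketch of an equivalent call, assuming ffmpeg is on `PATH`:
```python
import subprocess

def to_16k_mono(src: str, dst: str) -> None:
    """Resample any ffmpeg-readable input to the 16kHz mono WAV diarization expects."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst],
        check=True,
    )

to_16k_mono("podcast.mp3", "podcast_16k_mono.wav")
```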
### HuggingFace Token
- **Required for:** pyannote speaker diarization
- **Access:** Must accept gated repo `pyannote/speaker-diarization-3.1` on HF
- **Storage:** Any secure secrets manager
- **Usage:** Always pass via `HF_TOKEN` env var, never CLI arg
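In scripts, read the token from the environment and fail fast when it is missing, rather than accepting it as an argument. A minimal sketch:
```python
import os

# Never accept the token as a CLI arg or log it; read it from the environment only.
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    raise SystemExit("Missing HF token: export HF_TOKEN=<your-token> before running")
```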
### Sample Quality Tips
- **Shorter is better:** 5-30s clean samples often score higher than 60+ second samples
- **Clean audio:** Remove background noise with Demucs `--two-stems vocals`
- **Single speaker:** Ensure isolated voice, not mixed conversation
- **Good recording:** Studio mic > phone mic for voice comparison accuracy
## References
- **elevenlabs-cloning.md** - Best practices for ElevenLabs instant voice cloning (model settings, sample selection, proven configurations)
- **scoring-guide.md** - How to interpret Resemblyzer similarity scores (thresholds, use cases, troubleshooting)
## Common Issues
### "Missing HF token" error
- Export token before running: `export HF_TOKEN=<your-token>`
- Or pass inline: `HF_TOKEN=<your-token> python script.py ...`
### Low voice comparison scores for same speaker
- Try shorter, cleaner samples (5-30s)
- Use Demucs to isolate vocals: `demucs --two-stems vocals input.mp3`
- Ensure consistent recording quality (same mic, environment)
- See `references/scoring-guide.md` troubleshooting section
### Diarization not detecting all speakers
- Adjust `--min-speakers` and `--max-speakers` flags
- Check audio quality (clear speech, minimal overlap)
- Try longer audio (30+ seconds) for better speaker modeling
### MPS/Metal acceleration not working
- Ensure PyTorch with MPS support: `python -c "import torch; print(torch.backends.mps.is_available())"`
- Fallback to CPU: `--device cpu`
- Re-run `setup_venv.sh` to reinstall PyTorch