# Speaker Diarization & Name Identification Design

**Date:** 2026-04-02

## Goal

Extend the transcription pipeline with speaker diarization (pyannote.audio) and automatic speaker name identification (Ollama). Every recording produces three documents: an index, a raw transcript with speaker labels, and a polished summary.

## Architecture

```
WAV
 ├─► Whisper  → segments [(start, end, text), …]
 ├─► pyannote → speaker segments [(start, end, "SPEAKER_00"), …]
 │
 └─► Alignment → [(speaker_label, text), …]
      │
      ├─► Ollama (name prompt) → {"SPEAKER_00": "Thomas", "SPEAKER_01": "Möller"}
      │    └─ Fallback: WS event `speakers_unknown` → UI card → POST /speakers
      │
      ├─► transkript.md (speaker: text, new paragraph per speaker change)
      ├─► zusammenfassung.md (key points, open questions, next steps)
      └─► index.md (TL;DR, speakers, duration, links to both)
```

## Config Schema Extension

```toml
[diarization]
enabled = true
hf_token = "hf_..."  # HuggingFace read token
```

## New Module: diarization.py

```python
class Diarizer:
    def __init__(self, hf_token: str): ...

    async def diarize(self, wav_path: str) -> list[tuple[float, float, str]]:
        # returns [(start_sec, end_sec, "SPEAKER_00"), …]
        ...
```

Uses `pyannote/speaker-diarization-3.1`. The model is loaded lazily on the first call and runs in `loop.run_in_executor` so it does not block the event loop. (A runnable sketch appears after the Speaker Name Identification section.)

## Timestamp Alignment

For each Whisper segment `(start, end, text)`: find the pyannote speaker with the greatest time overlap and assign that speaker label. Consecutive segments with the same speaker are merged into one paragraph (sketched below).

**Remote Whisper path:** request `timestamp_granularities=["segment"]` from the OpenAI-compatible API; the response then includes `segments[].start` and `segments[].end`.

## Speaker Name Identification

Ollama receives the first ~2000 characters of the aligned transcript and a prompt:

> "Analysiere das folgende Gesprächstranskript. Ermittle, welche Namen den Sprechern zugeordnet werden können (z.B. durch direkte Anrede). Antworte NUR mit JSON: `{\"SPEAKER_00\": \"Name oder null\", …}`"

(In English: analyze the following conversation transcript, determine which names can be attributed to the speakers, e.g. through direct address, and answer ONLY with JSON.)

If all values are `null` or parsing fails, emit the `speakers_unknown` WebSocket event. If at least one name is found, apply the known names and leave the unknowns as `Sprecher N`.
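The three steps above are sketched below in order. First the `Diarizer` module: a minimal sketch assuming the pyannote.audio 3.x API (`Pipeline.from_pretrained` with `use_auth_token`, `Annotation.itertracks` for the result). The lazy load and executor hand-off follow the module description; the helper name `_diarize_blocking` is illustrative.

```python
import asyncio

from pyannote.audio import Pipeline


class Diarizer:
    def __init__(self, hf_token: str):
        self._hf_token = hf_token
        self._pipeline = None  # loaded lazily on the first diarize() call

    async def diarize(self, wav_path: str) -> list[tuple[float, float, str]]:
        # pyannote inference is synchronous and heavy, so hand it to the
        # default executor to keep the event loop responsive
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._diarize_blocking, wav_path)

    def _diarize_blocking(self, wav_path: str) -> list[tuple[float, float, str]]:
        if self._pipeline is None:
            self._pipeline = Pipeline.from_pretrained(
                "pyannote/speaker-diarization-3.1",
                use_auth_token=self._hf_token,
            )
        annotation = self._pipeline(wav_path)
        return [
            (turn.start, turn.end, label)
            for turn, _, label in annotation.itertracks(yield_label=True)
        ]
```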
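Next, the alignment rule from the Timestamp Alignment section. Overlap is the length of the intersection of the two time ranges; a segment no speaker overlaps keeps a placeholder label. Function and variable names are illustrative.

```python
def align_segments(
    whisper_segments: list[tuple[float, float, str]],
    speaker_segments: list[tuple[float, float, str]],
) -> list[tuple[str, str]]:
    """Assign each Whisper segment the speaker with the greatest time
    overlap, then merge consecutive segments of the same speaker."""
    paragraphs: list[tuple[str, str]] = []
    for start, end, text in whisper_segments:
        text = text.strip()
        best_label, best_overlap = "SPEAKER_UNKNOWN", 0.0
        for spk_start, spk_end, label in speaker_segments:
            # intersection length of the two time ranges
            overlap = min(end, spk_end) - max(start, spk_start)
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
        if paragraphs and paragraphs[-1][0] == best_label:
            # same speaker as the previous segment → extend the paragraph
            prev_label, prev_text = paragraphs[-1]
            paragraphs[-1] = (prev_label, prev_text + " " + text)
        else:
            paragraphs.append((best_label, text))
    return paragraphs
```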
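Finally, the name-identification call: a sketch assuming the `ollama` Python client and an arbitrary local model name (`llama3.1`, not fixed by this design). The all-`null` and parse-failure fallbacks follow the rules above.

```python
import json

import ollama  # official Ollama Python client

NAME_PROMPT = (
    "Analysiere das folgende Gesprächstranskript. Ermittle, welche Namen "
    "den Sprechern zugeordnet werden können (z.B. durch direkte Anrede). "
    'Antworte NUR mit JSON: {"SPEAKER_00": "Name oder null", …}\n\n'
)


def identify_speakers(aligned_transcript: str) -> dict | None:
    """Returns {"SPEAKER_00": "Thomas", ...}, or None when nothing usable
    came back; the caller then emits the speakers_unknown WS event."""
    reply = ollama.chat(
        model="llama3.1",
        messages=[{
            "role": "user",
            "content": NAME_PROMPT + aligned_transcript[:2000],
        }],
    )
    try:
        names = json.loads(reply["message"]["content"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # unparseable answer → show the naming card
    if not isinstance(names, dict) or all(v is None for v in names.values()):
        return None  # model found no names → show the naming card
    return names
```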
## Frontend: Speaker Naming Card

Triggered by the `speakers_unknown` WS event and shown above the record button. Each speaker gets:

- An excerpt navigator: `‹ "first few sentences…" 1/4 ›` — the arrows cycle through all excerpts (3-4 sentences each) for that speaker
- A text input for the name

Buttons:

- **Übernehmen** ("apply") → `POST /speakers` with `{"SPEAKER_00": "Thomas", …}` → the pipeline writes the three documents and emits `saved`
- **Anonym lassen** ("keep anonymous") → the same POST with empty strings → labels stay as `Sprecher 1` etc.

## New API Endpoint

| Method | Path | Description |
|--------|------|-------------|
| POST | `/speakers` | Receives the speaker name mapping, triggers document writing |

The pipeline pauses after alignment and waits for `/speakers` before writing any output. The pending state is stored in `api/state.py` as `state._pending_speakers`. (A sketch of this handoff appears at the end of this document.)

## Three Output Documents

All three share the same filename base (e.g. `2026-04-02-1430-Meeting`):

**`...-index.md`**

```markdown
# Meeting — 02.04.2026 14:30
**Sprecher:** Thomas, Möller
**Dauer:** 23 min

> [2-3 sentence TL;DR from Ollama]

- [Transkript](…-transkript.md)
- [Zusammenfassung](…-zusammenfassung.md)
```

**`...-transkript.md`** — raw annotated transcript, new paragraph per speaker change:

```markdown
**Thomas:** Gut, dann fangen wir an.

**Möller:** Ich hab das Budget schon vorbereitet…
```

**`...-zusammenfassung.md`** — polished summary document (Ollama):

```markdown
# Meeting-Zusammenfassung — 02.04.2026

## Wichtigste Punkte
…

## Offene Fragen
…

## Nächste Schritte / Ideen
…
```

All three appear in the transcript list; index entries get a `meeting` badge.

## HuggingFace Setup (one-time, per machine)

1. Create an account at huggingface.co
2. Go to https://huggingface.co/pyannote/speaker-diarization-3.1, click "Access repository", and accept the terms of service
3. Go to huggingface.co/settings/tokens and create a token with **Read** access
4. Enter the token in Transkriptor under Einstellungen → Diarisierung ("Settings → Diarization")

## Not in Scope

- Speaker voice profiles / pre-registration
- More than one diarization model
- Windows support
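For reference, a sketch of the `/speakers` pause-and-resume handoff described in the New API Endpoint section, assuming FastAPI. The `asyncio.Event` wiring and handler names are illustrative; only the endpoint path and `_pending_speakers` come from this design.

```python
import asyncio

from fastapi import FastAPI

app = FastAPI()


class PipelineState:
    """Holds the mapping the paused pipeline is waiting for."""

    def __init__(self):
        self._pending_speakers: dict[str, str] | None = None
        self._speakers_ready = asyncio.Event()

    async def wait_for_speakers(self) -> dict[str, str]:
        # pipeline side: called after alignment; resumes once POST /speakers
        # has delivered the name mapping
        await self._speakers_ready.wait()
        self._speakers_ready.clear()
        return self._pending_speakers or {}


state = PipelineState()


@app.post("/speakers")
async def set_speakers(mapping: dict[str, str]) -> dict[str, str]:
    # empty strings ("Anonym lassen") keep the generic Sprecher-N labels
    state._pending_speakers = mapping
    state._speakers_ready.set()
    return {"status": "ok"}
```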