Speaker Diarization & Name Identification Design
Date: 2026-04-02
Goal
Extend the transcription pipeline with speaker diarization (pyannote.audio) and automatic speaker name identification (Ollama). Every recording produces three documents: an index, a raw transcript with speaker labels, and a polished summary.
Architecture
WAV
 ├─► Whisper  → segments [(start, end, text), …]
 ├─► pyannote → speaker segments [(start, end, "SPEAKER_00"), …]
 │
 └─► Alignment → [(speaker_label, text), …]
       │
       ├─► Ollama (name prompt) → {"SPEAKER_00": "Thomas", "SPEAKER_01": "Möller"}
       │     └─ Fallback: WS event `speakers_unknown` → UI card → POST /speakers
       │
       ├─► transkript.md      (speaker: text, new paragraph per speaker change)
       ├─► zusammenfassung.md (key points, open questions, next steps)
       └─► index.md           (TL;DR, speakers, duration, links to both)
Config Schema Extension
[diarization]
enabled = true
hf_token = "hf_..." # HuggingFace read token
New Module: diarization.py
class Diarizer:
    def __init__(self, hf_token: str): ...
    async def diarize(self, wav_path: str) -> list[tuple[float, float, str]]:
        # returns [(start_sec, end_sec, "SPEAKER_00"), …]
        ...
Uses the pyannote/speaker-diarization-3.1 pipeline, loaded lazily on the first diarize() call.
Inference runs via loop.run_in_executor so that it does not block the event loop.
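One way the lazy loading and executor hand-off could look. This is a sketch, not the final module: the `loader` parameter is a hypothetical hook added so the class can be exercised without downloading the model, and `use_auth_token` follows the 3.1 model card (newer pyannote releases may name this argument differently).

```python
import asyncio
from typing import Any, Callable, Optional

MODEL_ID = "pyannote/speaker-diarization-3.1"


def _load_pyannote(hf_token: str) -> Any:
    # Imported here so that importing this module stays cheap.
    from pyannote.audio import Pipeline
    return Pipeline.from_pretrained(MODEL_ID, use_auth_token=hf_token)


class Diarizer:
    def __init__(self, hf_token: str,
                 loader: Callable[[str], Any] = _load_pyannote):
        self._hf_token = hf_token
        self._loader = loader  # hypothetical hook, not part of the spec
        self._pipeline: Optional[Any] = None

    def _diarize_sync(self, wav_path: str) -> list[tuple[float, float, str]]:
        if self._pipeline is None:  # lazy load on first call
            self._pipeline = self._loader(self._hf_token)
        annotation = self._pipeline(wav_path)
        return [
            (turn.start, turn.end, speaker)
            for turn, _, speaker in annotation.itertracks(yield_label=True)
        ]

    async def diarize(self, wav_path: str) -> list[tuple[float, float, str]]:
        loop = asyncio.get_running_loop()
        # None selects the loop's default thread pool executor.
        return await loop.run_in_executor(None, self._diarize_sync, wav_path)
```

Two concurrent first calls could both trigger a load in this sketch; a lock around the load would close that gap if it matters.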
Timestamp Alignment
For each Whisper segment (start, end, text): find the pyannote speaker with the
greatest time overlap → assign that speaker label. Consecutive segments with the same
speaker are merged into one paragraph.
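The alignment rule above can be sketched as follows. Two points are assumptions rather than part of the spec: overlap is summed per speaker when a speaker has several turns inside one Whisper segment, and a segment with no overlapping turn gets the hypothetical label `SPEAKER_UNKNOWN`.

```python
Segment = tuple[float, float, str]  # (start_sec, end_sec, text)
Turn = tuple[float, float, str]     # (start_sec, end_sec, speaker_label)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two intervals, 0.0 if they are disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def align(segments: list[Segment], turns: list[Turn],
          unknown: str = "SPEAKER_UNKNOWN") -> list[tuple[str, str]]:
    paragraphs: list[tuple[str, str]] = []
    for start, end, text in segments:
        totals: dict[str, float] = {}
        for t_start, t_end, speaker in turns:
            shared = overlap(start, end, t_start, t_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else unknown
        text = text.strip()
        if paragraphs and paragraphs[-1][0] == speaker:
            # Same speaker as the previous segment: extend the paragraph.
            paragraphs[-1] = (speaker, paragraphs[-1][1] + " " + text)
        else:
            paragraphs.append((speaker, text))
    return paragraphs
```

The nested loop is O(segments × turns), which should be acceptable for meeting-length recordings; a two-pointer sweep over the sorted lists would be the next step if it is not.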
Remote Whisper path: request timestamp_granularities=["segment"] from the
OpenAI-compatible API — the response includes segments[].start and segments[].end.
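In the OpenAI transcription API, `timestamp_granularities` only takes effect together with `response_format="verbose_json"`; other OpenAI-compatible servers may behave differently. Assuming that response shape, a small hypothetical helper can convert it to the same tuple format the local Whisper path produces:

```python
def segments_from_response(payload: dict) -> list[tuple[float, float, str]]:
    """Convert a verbose_json transcription response to (start, end, text) tuples."""
    segments = []
    for seg in payload.get("segments") or []:
        segments.append((float(seg["start"]), float(seg["end"]), seg["text"].strip()))
    return segments
```

An empty result means the server ignored the granularity request, in which case alignment has no timestamps to work with.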
Speaker Name Identification
Ollama receives the first ~2000 chars of the aligned transcript and a prompt:
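"Analysiere das folgende Gesprächstranskript. Ermittle welche Namen den Sprechern zugeordnet werden können (z.B. durch direkte Anrede). Antworte NUR mit JSON:
{\"SPEAKER_00\": \"Name oder null\", …}"
(English: "Analyze the following conversation transcript. Determine which names can be assigned to the speakers (e.g. from direct address). Reply ONLY with JSON: {\"SPEAKER_00\": \"name or null\", …}")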
If all values are null or parsing fails → emit speakers_unknown WebSocket event.
If at least one name is found → apply known names, leave unknowns as Sprecher N.
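A sketch of the decision logic above. Function names are hypothetical. Because the prompt template contains the text "Name oder null", the model may answer with the string "null" instead of JSON null, so both are treated as "not found". Numbering `Sprecher N` by 1-based position in the label list is an assumption.

```python
import json
from typing import Optional


def parse_speaker_names(reply: str, labels: list[str]) -> Optional[dict[str, str]]:
    """Return {label: name} for identified speakers, or None to trigger the fallback."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    names: dict[str, str] = {}
    for label in labels:
        value = data.get(label)
        if isinstance(value, str) and value.strip() and value.strip().lower() != "null":
            names[label] = value.strip()
    return names or None


def display_names(names: dict[str, str], labels: list[str]) -> dict[str, str]:
    """Fill unidentified speakers with 'Sprecher N' (N = 1-based position in labels)."""
    return {
        label: names.get(label, f"Sprecher {i}")
        for i, label in enumerate(labels, start=1)
    }
```

A `None` result from `parse_speaker_names` is the signal to emit `speakers_unknown`.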
Frontend: Speaker Naming Card
Triggered by speakers_unknown WS event. Shown above the record button.
Each speaker has:
- Excerpt navigator: ‹ "first few sentences…" 1/4 ›. The arrows cycle through all excerpts (3-4 sentences each) for that speaker
- Text input for the name
Buttons:
- Übernehmen → `POST /speakers` with `{"SPEAKER_00": "Thomas", …}` → pipeline writes the three documents and emits `saved`
- Anonym lassen → same POST with empty strings → labels stay as `Sprecher 1` etc.
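The excerpts for the navigator have to come from the aligned transcript. A possible server-side sketch, assuming the excerpts are shipped with the `speakers_unknown` event (the payload shape is not specified here) and that sentences are pooled per speaker across turns. `build_excerpts` is a hypothetical name and the sentence splitter is deliberately naive.

```python
import re


def build_excerpts(paragraphs: list[tuple[str, str]],
                   sentences_per_excerpt: int = 3) -> dict[str, list[str]]:
    """Group each speaker's sentences into excerpts of a fixed size."""
    by_speaker: dict[str, list[str]] = {}
    for speaker, text in paragraphs:
        # Split after ., ! or ? when followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        by_speaker.setdefault(speaker, []).extend(sentences)
    return {
        speaker: [
            " ".join(sentences[i:i + sentences_per_excerpt])
            for i in range(0, len(sentences), sentences_per_excerpt)
        ]
        for speaker, sentences in by_speaker.items()
    }
```

The "1/4" counter in the card is then the current index plus one over the length of that speaker's list.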
New API Endpoint
| Method | Path | Description |
|---|---|---|
| POST | /speakers | Receives speaker name mapping, triggers document writing |
If no names can be identified automatically, the pipeline pauses after alignment and waits for POST /speakers before writing output.
State stored in api/state.py as state._pending_speakers.
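The pause-and-resume hand-off could be built on an `asyncio.Future`. This is a sketch under assumptions: the web framework is left out, `PipelineState` and its method names are hypothetical, and the handler is assumed to run on the same event loop as the pipeline.

```python
import asyncio
from typing import Optional


class PipelineState:
    def __init__(self) -> None:
        self._pending_speakers: Optional[asyncio.Future] = None

    async def wait_for_speakers(self) -> dict[str, str]:
        """Called by the pipeline after emitting speakers_unknown."""
        self._pending_speakers = asyncio.get_running_loop().create_future()
        try:
            return await self._pending_speakers
        finally:
            self._pending_speakers = None

    def submit_speakers(self, mapping: dict[str, str]) -> bool:
        """Called by the POST /speakers handler; False if nothing is waiting."""
        pending = self._pending_speakers
        if pending is None or pending.done():
            return False
        pending.set_result(mapping)
        return True
```

Returning `False` lets the endpoint answer with an error status when a mapping arrives while no recording is waiting for names.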
Three Output Documents
All three share the same filename base (e.g. 2026-04-02-1430-Meeting):
...-index.md
# Meeting — 02.04.2026 14:30
**Sprecher:** Thomas, Möller
**Dauer:** 23 min
> [2-3 sentence TL;DR from Ollama]
- [Transkript](…-transkript.md)
- [Zusammenfassung](…-zusammenfassung.md)
...-transkript.md — Raw annotated transcript, new paragraph per speaker change:
**Thomas:** Gut, dann fangen wir an.
**Möller:** Ich hab das Budget schon vorbereitet…
...-zusammenfassung.md — Polished summary document (Ollama):
# Meeting-Zusammenfassung — 02.04.2026
## Wichtigste Punkte
…
## Offene Fragen
…
## Nächste Schritte / Ideen
…
All three appear in the transcript list. Index entries get a meeting badge.
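A sketch of the shared filename base and the transcript layout shown above. Function names are hypothetical, and the title part of the base ("Meeting") is assumed to be a plain parameter.

```python
from datetime import datetime


def filename_base(started: datetime, title: str = "Meeting") -> str:
    """Build the shared base, e.g. 2026-04-02-1430-Meeting."""
    return f"{started:%Y-%m-%d-%H%M}-{title}"


def output_names(base: str) -> dict[str, str]:
    """Map each document kind to its filename."""
    return {
        kind: f"{base}-{kind}.md"
        for kind in ("index", "transkript", "zusammenfassung")
    }


def render_transcript(paragraphs: list[tuple[str, str]]) -> str:
    """One bold-labelled paragraph per speaker change, separated by blank lines."""
    return "\n\n".join(f"**{name}:** {text}" for name, text in paragraphs) + "\n"
```

The index document can link to the other two with the values from `output_names`, since all three files sit side by side.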
HuggingFace Setup (one-time, per machine)
- Create account at huggingface.co
- Go to https://huggingface.co/pyannote/speaker-diarization-3.1 → click "Access repository" and accept the terms of service
- Go to huggingface.co/settings/tokens → create a token with Read access
- Enter the token in Transkriptor settings → Einstellungen → Diarisierung
Not in Scope
- Speaker voice profiles / pre-registration
- More than one diarization model
- Windows support