# Speaker Diarization & Name Identification Design

**Date:** 2026-04-02

## Goal

Extend the transcription pipeline with speaker diarization (pyannote.audio) and automatic
speaker name identification (Ollama). Every recording produces three documents: an index,
a raw transcript with speaker labels, and a polished summary.
## Architecture

```
WAV
 ├─► Whisper  → segments [(start, end, text), …]
 ├─► pyannote → speaker segments [(start, end, "SPEAKER_00"), …]
 │
 └─► Alignment → [(speaker_label, text), …]
       │
       ├─► Ollama (name prompt) → {"SPEAKER_00": "Thomas", "SPEAKER_01": "Möller"}
       │     └─ Fallback: WS event `speakers_unknown` → UI card → POST /speakers
       │
       ├─► transkript.md (speaker: text, new paragraph per speaker change)
       ├─► zusammenfassung.md (key points, open questions, next steps)
       └─► index.md (TL;DR, speakers, duration, links to both)
```

## Config Schema Extension

```toml
[diarization]
enabled = true
hf_token = "hf_..."   # HuggingFace read token
```
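
For reference, a minimal sketch of reading this section with the standard-library `tomllib`; the config path and helper name are illustrative, not part of the design:

```python
import tomllib


def load_diarization_config(path: str = "config.toml") -> tuple[bool, str | None]:
    # A missing [diarization] table or missing keys leave the feature disabled.
    with open(path, "rb") as f:
        cfg = tomllib.load(f)
    section = cfg.get("diarization", {})
    return bool(section.get("enabled", False)), section.get("hf_token")
```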

## New Module: diarization.py

```python
class Diarizer:
    def __init__(self, hf_token: str): ...

    async def diarize(self, wav_path: str) -> list[tuple[float, float, str]]:
        # returns [(start_sec, end_sec, "SPEAKER_00"), …]
        ...
```

Uses `pyannote/speaker-diarization-3.1`. Loaded lazily on first call.
Runs in `loop.run_in_executor` to avoid blocking the event loop.
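
A sketch of how the lazy load and executor hand-off could fit together; the `Pipeline` calls follow pyannote.audio 3.x, the private helper is illustrative:

```python
import asyncio

from pyannote.audio import Pipeline


class Diarizer:
    def __init__(self, hf_token: str):
        self._hf_token = hf_token
        self._pipeline = None  # model is loaded lazily on first use

    def _run(self, wav_path: str) -> list[tuple[float, float, str]]:
        if self._pipeline is None:
            self._pipeline = Pipeline.from_pretrained(
                "pyannote/speaker-diarization-3.1",
                use_auth_token=self._hf_token,
            )
        annotation = self._pipeline(wav_path)
        return [
            (turn.start, turn.end, label)
            for turn, _, label in annotation.itertracks(yield_label=True)
        ]

    async def diarize(self, wav_path: str) -> list[tuple[float, float, str]]:
        # Model inference is CPU/GPU-heavy; keep it off the event loop.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._run, wav_path)
```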

## Timestamp Alignment

For each Whisper segment `(start, end, text)`: find the pyannote speaker with the
greatest time overlap → assign that speaker label. Consecutive segments with the same
speaker are merged into one paragraph.
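
A sketch of the overlap rule, assuming the tuple shapes from the diagram above (the function name is illustrative):

```python
def align(
    whisper_segs: list[tuple[float, float, str]],
    speaker_segs: list[tuple[float, float, str]],
) -> list[tuple[str, str]]:
    aligned: list[tuple[str, str]] = []
    for start, end, text in whisper_segs:
        # Pick the speaker turn with the largest temporal overlap.
        best = max(
            speaker_segs,
            key=lambda s: max(0.0, min(end, s[1]) - max(start, s[0])),
            default=None,
        )
        speaker = best[2] if best is not None else "SPEAKER_00"
        if aligned and aligned[-1][0] == speaker:
            # Same speaker as the previous segment → merge into one paragraph.
            aligned[-1] = (speaker, aligned[-1][1] + " " + text)
        else:
            aligned.append((speaker, text))
    return aligned
```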

**Remote Whisper path:** request `timestamp_granularities=["segment"]` from the
OpenAI-compatible API — the response includes `segments[].start` and `segments[].end`.
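
With the official `openai` client this could look as follows; base URL, model name, and file name are placeholders, and `verbose_json` is required to get segments back:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # placeholder server

with open("meeting.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-1",                     # placeholder model name
        file=f,
        response_format="verbose_json",        # segments are only in verbose_json
        timestamp_granularities=["segment"],
    )

segments = [(seg.start, seg.end, seg.text) for seg in result.segments]
```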

## Speaker Name Identification

Ollama receives the first ~2000 chars of the aligned transcript and a prompt:

> "Analyze the following conversation transcript. Determine which names can be
> assigned to the speakers (e.g. through direct address). Reply ONLY with JSON:
> `{\"SPEAKER_00\": \"name or null\", …}`"

If all values are `null` or parsing fails → emit the `speakers_unknown` WebSocket event.
If at least one name is found → apply the known names, leave unknowns as `Sprecher N`.
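
A sketch of the call and the fallback decision, assuming Ollama's `/api/generate` endpoint; the model name and URL are assumptions:

```python
import json

import httpx


async def identify_speakers(transcript: str, prompt: str) -> dict[str, str] | None:
    # Send the prompt plus the first ~2000 chars; None means "ask the user".
    payload = {
        "model": "llama3",  # assumption — use the configured Ollama model
        "prompt": f"{prompt}\n\n{transcript[:2000]}",
        "stream": False,
    }
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post("http://localhost:11434/api/generate", json=payload)
    try:
        mapping = json.loads(resp.json()["response"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # unparseable reply → speakers_unknown
    if not isinstance(mapping, dict):
        return None
    names = {k: v for k, v in mapping.items() if isinstance(v, str) and v.strip()}
    return names or None  # all-null reply also falls back to the UI card
```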

## Frontend: Speaker Naming Card

Triggered by the `speakers_unknown` WS event. Shown above the record button.

Each speaker has:
- Excerpt navigator: `‹ "first few sentences…" 1/4 ›` — arrows cycle through all
  excerpts (3-4 sentences each) for that speaker (see the sketch after this list)
- Text input for the name
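
The excerpts themselves come from the backend; a sketch of how they could be cut from the aligned transcript (the helper name and the sentence splitter are illustrative):

```python
import re


def speaker_excerpts(
    aligned: list[tuple[str, str]], max_sentences: int = 4
) -> dict[str, list[str]]:
    # One excerpt (first few sentences of a paragraph), grouped by speaker.
    excerpts: dict[str, list[str]] = {}
    for speaker, text in aligned:
        sentences = re.split(r"(?<=[.!?])\s+", text)
        excerpts.setdefault(speaker, []).append(" ".join(sentences[:max_sentences]))
    return excerpts
```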

Buttons:
- **Übernehmen** (apply) → `POST /speakers` with `{"SPEAKER_00": "Thomas", …}` → pipeline
  writes the three documents and emits `saved`
- **Anonym lassen** (keep anonymous) → same POST with empty strings → labels stay as
  `Sprecher 1` etc.

## New API Endpoint

| Method | Path        | Description                                              |
|--------|-------------|----------------------------------------------------------|
| POST   | `/speakers` | Receives speaker name mapping, triggers document writing |

The pipeline pauses after alignment and waits for `/speakers` before writing output.
State stored in `api/state.py` as `state._pending_speakers`.
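
One way to implement the pause is an `asyncio.Future` that the `POST /speakers` handler resolves; a sketch, with a `SimpleNamespace` standing in for `api/state.py`:

```python
import asyncio
from types import SimpleNamespace

state = SimpleNamespace(_pending_speakers=None)  # stand-in for api/state.py


async def wait_for_speakers() -> dict[str, str]:
    # Pipeline side: park after alignment until the name mapping arrives.
    state._pending_speakers = asyncio.get_running_loop().create_future()
    return await state._pending_speakers


def resolve_speakers(mapping: dict[str, str]) -> None:
    # POST /speakers handler side: wake the waiting pipeline.
    fut = state._pending_speakers
    if fut is not None and not fut.done():
        fut.set_result(mapping)
```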

## Three Output Documents

All three share the same filename base (e.g. `2026-04-02-1430-Meeting`):

**`...-index.md`**
```markdown
# Meeting — 02.04.2026 14:30

**Sprecher:** Thomas, Möller
**Dauer:** 23 min

> [2-3 sentence TL;DR from Ollama]

- [Transkript](…-transkript.md)
- [Zusammenfassung](…-zusammenfassung.md)
```

**`...-transkript.md`** — Raw annotated transcript, new paragraph per speaker change:
```markdown
**Thomas:** Gut, dann fangen wir an.

**Möller:** Ich hab das Budget schon vorbereitet…
```
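
Rendering this file from the aligned pairs and the name mapping is a small step; a sketch (the function name is illustrative):

```python
def render_transkript(aligned: list[tuple[str, str]], names: dict[str, str]) -> str:
    # "**Name:** text", one blank-line-separated paragraph per speaker change.
    return "\n\n".join(
        f"**{names.get(speaker, speaker)}:** {text}" for speaker, text in aligned
    )
```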

**`...-zusammenfassung.md`** — Polished summary document (Ollama):
```markdown
# Meeting-Zusammenfassung — 02.04.2026

## Wichtigste Punkte
…

## Offene Fragen
…

## Nächste Schritte / Ideen
…
```

All three appear in the transcript list. Index entries get a `meeting` badge.

## HuggingFace Setup (one-time, per machine)

1. Create an account at huggingface.co
2. Go to https://huggingface.co/pyannote/speaker-diarization-3.1 → click
   "Access repository" and accept the terms of service
3. Go to huggingface.co/settings/tokens → create a token with **Read** access
4. Enter the token in Transkriptor under Einstellungen → Diarisierung

## Not in Scope

- Speaker voice profiles / pre-registration
- More than one diarization model
- Windows support