diff --git a/docs/plans/2026-04-02-diarization-design.md b/docs/plans/2026-04-02-diarization-design.md
new file mode 100644
index 0000000..2bc916c
--- /dev/null
+++ b/docs/plans/2026-04-02-diarization-design.md
@@ -0,0 +1,143 @@

# Speaker Diarization & Name Identification Design

**Date:** 2026-04-02

## Goal

Extend the transcription pipeline with speaker diarization (pyannote.audio) and automatic
speaker name identification (Ollama). Every recording produces three documents: an index,
a raw transcript with speaker labels, and a polished summary.

## Architecture

```
WAV
 ├─► Whisper → segments [(start, end, text), …]
 ├─► pyannote → speaker segments [(start, end, "SPEAKER_00"), …]
 │
 └─► Alignment → [(speaker_label, text), …]
      │
      ├─► Ollama (name prompt) → {"SPEAKER_00": "Thomas", "SPEAKER_01": "Möller"}
      │    └─ Fallback: WS event `speakers_unknown` → UI card → POST /speakers
      │
      ├─► transkript.md (speaker: text, new paragraph per speaker change)
      ├─► zusammenfassung.md (key points, open questions, next steps)
      └─► index.md (TL;DR, speakers, duration, links to both)
```

## Config Schema Extension

```toml
[diarization]
enabled = true
hf_token = "hf_..."  # HuggingFace read token
```

## New Module: diarization.py

```python
class Diarizer:
    def __init__(self, hf_token: str): ...

    async def diarize(self, wav_path: str) -> list[tuple[float, float, str]]:
        # returns [(start_sec, end_sec, "SPEAKER_00"), …]
        ...
```

Uses `pyannote/speaker-diarization-3.1`. The model is loaded lazily on first call and
executed via `loop.run_in_executor` so it does not block the event loop.

## Timestamp Alignment

For each Whisper segment `(start, end, text)`: find the pyannote speaker with the
greatest time overlap → assign that speaker label. Consecutive segments with the same
speaker are merged into one paragraph.
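The overlap rule above can be sketched in plain Python. This is an illustrative sketch, not the final implementation; the function name `align` and the tuple shapes are assumptions based on the pipeline description:

```python
def align(
    whisper_segs: list[tuple[float, float, str]],
    speaker_segs: list[tuple[float, float, str]],
) -> list[tuple[str, str]]:
    """Assign each Whisper segment the speaker with the greatest time overlap,
    then merge consecutive segments of the same speaker into one paragraph."""
    merged: list[tuple[str, str]] = []
    for start, end, text in whisper_segs:
        # Pick the diarization turn with the largest overlap. Disjoint turns
        # produce a negative value and never beat the 0.0 baseline.
        best_label, best_overlap = "SPEAKER_00", 0.0
        for s_start, s_end, label in speaker_segs:
            overlap = min(end, s_end) - max(start, s_start)
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
        if merged and merged[-1][0] == best_label:
            # Same speaker as the previous segment → extend the paragraph.
            merged[-1] = (best_label, merged[-1][1] + " " + text)
        else:
            merged.append((best_label, text))
    return merged
```

Greatest-overlap assignment is deliberately simple: it tolerates small boundary disagreements between Whisper and pyannote without any tunable threshold.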
**Remote Whisper path:** request `timestamp_granularities=["segment"]` from the
OpenAI-compatible API — the response then includes `segments[].start` and `segments[].end`.

## Speaker Name Identification

Ollama receives the first ~2000 chars of the aligned transcript and a prompt:

> "Analysiere das folgende Gesprächstranskript. Ermittle welche Namen den Sprechern
> zugeordnet werden können (z.B. durch direkte Anrede). Antworte NUR mit JSON:
> `{\"SPEAKER_00\": \"Name oder null\", …}`"

(In English: "Analyze the following conversation transcript. Determine which names can
be assigned to the speakers, e.g. via direct address. Answer ONLY with JSON: …")

If all values are `null` or parsing fails → emit the `speakers_unknown` WebSocket event.
If at least one name is found → apply the known names, leave unknowns as `Sprecher N`.

## Frontend: Speaker Naming Card

Triggered by the `speakers_unknown` WS event. Shown above the record button.

Each speaker has:
- Excerpt navigator: `‹ "first few sentences…" 1/4 ›` — the arrows cycle through all
  excerpts (3-4 sentences each) for that speaker
- Text input for the name

Buttons:
- **Übernehmen** ("apply") → `POST /speakers` with `{"SPEAKER_00": "Thomas", …}` →
  the pipeline writes the three documents and emits `saved`
- **Anonym lassen** ("keep anonymous") → same POST with empty strings → labels stay
  as `Sprecher 1` etc.

## New API Endpoint

| Method | Path | Description |
|--------|------|-------------|
| POST | `/speakers` | Receives the speaker name mapping, triggers document writing |

The pipeline pauses after alignment and waits for `/speakers` before writing output.
State is stored in `api/state.py` as `state._pending_speakers`.

## Three Output Documents

All three share the same filename base (e.g. `2026-04-02-1430-Meeting`):

**`...-index.md`**
```markdown
# Meeting — 02.04.2026 14:30

**Sprecher:** Thomas, Möller
**Dauer:** 23 min

> [2-3 sentence TL;DR from Ollama]

- [Transkript](…-transkript.md)
- [Zusammenfassung](…-zusammenfassung.md)
```

**`...-transkript.md`** — Raw annotated transcript, new paragraph per speaker change:
```markdown
**Thomas:** Gut, dann fangen wir an.

**Möller:** Ich hab das Budget schon vorbereitet…
```

**`...-zusammenfassung.md`** — Polished summary document (Ollama):
```markdown
# Meeting-Zusammenfassung — 02.04.2026

## Wichtigste Punkte
…

## Offene Fragen
…

## Nächste Schritte / Ideen
…
```

All three appear in the transcript list. Index entries get a `meeting` badge.

## HuggingFace Setup (one-time, per machine)

1. Create an account at huggingface.co
2. Go to https://huggingface.co/pyannote/speaker-diarization-3.1, click
   "Access repository", and accept the terms of service
3. Go to huggingface.co/settings/tokens and create a token with **Read** access
4. Enter the token in Transkriptor under Einstellungen → Diarisierung

## Not in Scope

- Speaker voice profiles / pre-registration
- Support for more than one diarization model
- Windows support