# Speaker Diarization & Name Identification Design
**Date:** 2026-04-02
## Goal
Extend the transcription pipeline with speaker diarization (pyannote.audio) and automatic
speaker name identification (Ollama). Every recording produces three documents: an index,
a raw transcript with speaker labels, and a polished summary.
## Architecture
```
WAV
 ├─► Whisper  → segments [(start, end, text), …]
 ├─► pyannote → speaker segments [(start, end, "SPEAKER_00"), …]
 └─► Alignment → [(speaker_label, text), …]
      ├─► Ollama (name prompt) → {"SPEAKER_00": "Thomas", "SPEAKER_01": "Möller"}
      │    └─ Fallback: WS event `speakers_unknown` → UI card → POST /speakers
      ├─► transkript.md (speaker: text, new paragraph per speaker change)
      ├─► zusammenfassung.md (key points, open questions, next steps)
      └─► index.md (TL;DR, speakers, duration, links to both)
## Config Schema Extension
```toml
[diarization]
enabled = true
hf_token = "hf_..." # HuggingFace read token
```
## New Module: diarization.py
```python
class Diarizer:
    def __init__(self, hf_token: str): ...

    async def diarize(self, wav_path: str) -> list[tuple[float, float, str]]:
        # returns [(start_sec, end_sec, "SPEAKER_00"), …]
        ...
```
Uses `pyannote/speaker-diarization-3.1`. The pipeline is loaded lazily on the first
`diarize` call, and inference runs via `loop.run_in_executor` so it does not block the
event loop.
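
One way the lazy load and executor hand-off could look is sketched below. It assumes the pyannote.audio 3.1 API (`Pipeline.from_pretrained(..., use_auth_token=...)` and `itertracks(yield_label=True)`); newer releases may name the token argument differently. The `loader` parameter and the `_diarize_sync` helper are hypothetical additions that keep the heavy import out of module load and make the class testable without the model.

```python
import asyncio
from typing import Any, Callable, Optional

SpeakerSegment = tuple[float, float, str]


def _load_pyannote(hf_token: str) -> Any:
    # Imported here so pyannote is only required when diarization is first used
    from pyannote.audio import Pipeline
    return Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )


class Diarizer:
    def __init__(self, hf_token: str,
                 loader: Callable[[str], Any] = _load_pyannote):
        self._hf_token = hf_token
        self._loader = loader            # hypothetical injection point
        self._pipeline: Optional[Any] = None

    def _diarize_sync(self, wav_path: str) -> list[SpeakerSegment]:
        if self._pipeline is None:       # lazy load on first call
            self._pipeline = self._loader(self._hf_token)
        annotation = self._pipeline(wav_path)
        return [
            (turn.start, turn.end, speaker)
            for turn, _, speaker in annotation.itertracks(yield_label=True)
        ]

    async def diarize(self, wav_path: str) -> list[SpeakerSegment]:
        # Run the blocking model call in the default thread pool
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._diarize_sync, wav_path)
```

This sketch does not guard the lazy load with a lock, so two concurrent first calls could load the pipeline twice.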
## Timestamp Alignment
For each Whisper segment `(start, end, text)`: find the pyannote speaker with the
greatest time overlap → assign that speaker label. Consecutive segments with the same
speaker are merged into one paragraph.
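
The alignment rule can be sketched as follows. This is an illustration, not the final implementation: it interprets "greatest time overlap" as the summed overlap per speaker label, and the `UNKNOWN` fallback for segments with no overlapping speaker turn is an assumption. The function names are hypothetical.

```python
Segment = tuple[float, float, str]  # (start_sec, end_sec, text or speaker label)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, 0.0 if disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def align(whisper: list[Segment], speakers: list[Segment],
          unknown: str = "UNKNOWN") -> list[tuple[str, str]]:
    paragraphs: list[tuple[str, str]] = []
    for start, end, text in whisper:
        # Sum overlap per speaker label across all of that speaker's turns
        totals: dict[str, float] = {}
        for s_start, s_end, label in speakers:
            shared = overlap(start, end, s_start, s_end)
            if shared > 0:
                totals[label] = totals.get(label, 0.0) + shared
        label = max(totals, key=totals.get) if totals else unknown
        text = text.strip()
        # Merge consecutive segments of the same speaker into one paragraph
        if paragraphs and paragraphs[-1][0] == label:
            paragraphs[-1] = (label, paragraphs[-1][1] + " " + text)
        else:
            paragraphs.append((label, text))
    return paragraphs
```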
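**Remote Whisper path:** request `timestamp_granularities=["segment"]` from the
OpenAI-compatible API; the response then includes `segments[].start` and `segments[].end`.
OpenAI's own API honors this parameter only with `response_format="verbose_json"`, and
compatible servers may differ.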
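
On the remote path, the response only needs to be reduced to the same tuple shape the local Whisper path produces. A small sketch, assuming the server returns a JSON body with a `segments` array whose items carry `start`, `end`, and `text`; the function name is hypothetical.

```python
def segments_from_response(payload: dict) -> list[tuple[float, float, str]]:
    """Convert a transcription response body into (start, end, text) tuples."""
    return [
        (float(seg["start"]), float(seg["end"]), seg["text"].strip())
        for seg in payload.get("segments", [])
    ]
```

A response without `segments` yields an empty list, which the caller can treat as "no timestamps available".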
## Speaker Name Identification
Ollama receives the first ~2000 chars of the aligned transcript and a prompt:
> "Analysiere das folgende Gesprächstranskript. Ermittle welche Namen den Sprechern
> zugeordnet werden können (z.B. durch direkte Anrede). Antworte NUR mit JSON:
> `{\"SPEAKER_00\": \"Name oder null\", …}`"
If all values are `null` or parsing fails → emit `speakers_unknown` WebSocket event.
If at least one name is found → apply known names, leave unknowns as `Sprecher N`.
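
The decision logic above could be sketched like this. The helper names are hypothetical, and two details are assumptions: the reply may wrap the JSON in extra prose, and `Sprecher N` uses the 1-based position of the label. An empty result from `parse_name_reply` is the signal to emit `speakers_unknown`.

```python
import json
import re


def parse_name_reply(reply: str) -> dict[str, str]:
    """Return {label: name} for usable names; an empty dict means 'unknown'."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # tolerate text around the JSON
    if not match:
        return {}
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    names: dict[str, str] = {}
    for label, name in data.items():
        # Skip JSON null, empty strings, and a literal "null" string
        if isinstance(name, str) and name.strip() and name.strip().lower() != "null":
            names[label] = name.strip()
    return names


def display_names(labels: list[str], names: dict[str, str]) -> dict[str, str]:
    """Known names win; all other labels become 'Sprecher N'."""
    return {
        label: names.get(label) or f"Sprecher {i}"
        for i, label in enumerate(labels, start=1)
    }
```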
## Frontend: Speaker Naming Card
Triggered by `speakers_unknown` WS event. Shown above the record button.
Each speaker has:
- Excerpt navigator: ` "first few sentences…" 1/4 ` — arrows cycle through all
excerpts (3-4 sentences each) for that speaker
- Text input for the name
Buttons:
- **Übernehmen** → `POST /speakers` with `{"SPEAKER_00": "Thomas", …}` → pipeline
writes the three documents and emits `saved`
- **Anonym lassen** → same POST with empty strings → labels stay as `Sprecher 1` etc.
## New API Endpoint
| Method | Path | Description |
|--------|------|-------------|
| POST | `/speakers` | Receives speaker name mapping, triggers document writing |
When name identification falls back to `speakers_unknown`, the pipeline pauses before
writing output and waits for `POST /speakers`. The pending state is stored in
`api/state.py` as `state._pending_speakers`.
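
One possible shape for the pending state is an `asyncio.Future` that the pipeline awaits and the request handler resolves. This is a framework-agnostic sketch; the class `PendingSpeakers` and its methods are hypothetical and not the actual contents of `api/state.py`.

```python
import asyncio
from typing import Optional


class PendingSpeakers:
    """Holds the future the pipeline awaits until POST /speakers arrives."""

    def __init__(self) -> None:
        self._future: Optional[asyncio.Future] = None

    async def wait_for_names(self) -> dict[str, str]:
        # Called by the pipeline after emitting `speakers_unknown`
        self._future = asyncio.get_running_loop().create_future()
        try:
            return await self._future
        finally:
            self._future = None

    def submit(self, mapping: dict[str, str]) -> bool:
        """Called by the POST /speakers handler; False if nothing is waiting."""
        if self._future is None or self._future.done():
            return False
        self._future.set_result(mapping)
        return True
```

`submit` has to be called from the event loop thread, which is the case for an async request handler. Returning `False` lets the handler answer with an error when no recording is waiting for names.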
## Three Output Documents
All three share the same filename base (e.g. `2026-04-02-1430-Meeting`):
**`...-index.md`**
```markdown
# Meeting — 02.04.2026 14:30
**Sprecher:** Thomas, Möller
**Dauer:** 23 min
> [2-3 sentence TL;DR from Ollama]
- [Transkript](…-transkript.md)
- [Zusammenfassung](…-zusammenfassung.md)
```
**`...-transkript.md`** — Raw annotated transcript, new paragraph per speaker change:
```markdown
**Thomas:** Gut, dann fangen wir an.
**Möller:** Ich hab das Budget schon vorbereitet…
```
**`...-zusammenfassung.md`** — Polished summary document (Ollama):
```markdown
# Meeting-Zusammenfassung — 02.04.2026
## Wichtigste Punkte
## Offene Fragen
## Nächste Schritte / Ideen
```
All three appear in the transcript list. Index entries get a `meeting` badge.
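
As an illustration of the naming scheme and the transcript layout, the sketch below derives the three file names from a shared base and renders aligned paragraphs in the `**Name:** text` form shown above. Both function names are hypothetical, and separating paragraphs with a blank line is an assumption.

```python
def output_paths(base: str) -> dict[str, str]:
    """Map each document kind to its file name for a given base."""
    kinds = ("index", "transkript", "zusammenfassung")
    return {kind: f"{base}-{kind}.md" for kind in kinds}


def render_transcript(paragraphs: list[tuple[str, str]],
                      names: dict[str, str]) -> str:
    """Render (speaker_label, text) paragraphs, one block per speaker change."""
    blocks = [
        f"**{names.get(label, label)}:** {text}"  # fall back to the raw label
        for label, text in paragraphs
    ]
    return "\n\n".join(blocks) + "\n"
```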
## HuggingFace Setup (one-time, per machine)
1. Create account at huggingface.co
2. Go to https://huggingface.co/pyannote/speaker-diarization-3.1 → click
"Access repository" and accept the terms of service
3. Go to huggingface.co/settings/tokens → create a token with **Read** access
4. Enter the token in Transkriptor settings → Einstellungen → Diarisierung
## Not in Scope
- Speaker voice profiles / pre-registration
- More than one diarization model
- Windows support