# Speaker Diarization & Name Identification Design

**Date:** 2026-04-02

## Goal

Extend the transcription pipeline with speaker diarization (pyannote.audio) and automatic
speaker name identification (Ollama). Every recording produces three documents: an index,
a raw transcript with speaker labels, and a polished summary.
## Architecture

```
WAV
 ├─► Whisper  → segments [(start, end, text), …]
 ├─► pyannote → speaker segments [(start, end, "SPEAKER_00"), …]
 │
 └─► Alignment → [(speaker_label, text), …]
       │
       ├─► Ollama (name prompt) → {"SPEAKER_00": "Thomas", "SPEAKER_01": "Möller"}
       │     └─ Fallback: WS event `speakers_unknown` → UI card → POST /speakers
       │
       ├─► transkript.md (speaker: text, new paragraph per speaker change)
       ├─► zusammenfassung.md (key points, open questions, next steps)
       └─► index.md (TL;DR, speakers, duration, links to both)
```

## Config Schema Extension

```toml
[diarization]
enabled = true
hf_token = "hf_..."   # HuggingFace read token
```
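
For reference, a minimal sketch of reading this section with the standard-library `tomllib`; the config path and helper name are illustrative, not part of the design:

```python
import tomllib


def load_diarization_config(path: str = "config.toml") -> tuple[bool, str | None]:
    # A missing [diarization] table or missing keys leave the feature disabled.
    with open(path, "rb") as f:
        cfg = tomllib.load(f)
    section = cfg.get("diarization", {})
    return bool(section.get("enabled", False)), section.get("hf_token")
```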

## New Module: diarization.py

```python
class Diarizer:
    def __init__(self, hf_token: str): ...

    async def diarize(self, wav_path: str) -> list[tuple[float, float, str]]:
        # returns [(start_sec, end_sec, "SPEAKER_00"), …]
        ...
```

Uses `pyannote/speaker-diarization-3.1`. Loaded lazily on first call.
Runs in `loop.run_in_executor` to avoid blocking the event loop.
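
A sketch of how the lazy load and executor hand-off could fit together; the `Pipeline` calls follow pyannote.audio 3.x, the private helper is illustrative:

```python
import asyncio

from pyannote.audio import Pipeline


class Diarizer:
    def __init__(self, hf_token: str):
        self._hf_token = hf_token
        self._pipeline = None  # model is loaded lazily on first use

    def _run(self, wav_path: str) -> list[tuple[float, float, str]]:
        if self._pipeline is None:
            self._pipeline = Pipeline.from_pretrained(
                "pyannote/speaker-diarization-3.1",
                use_auth_token=self._hf_token,
            )
        annotation = self._pipeline(wav_path)
        return [
            (turn.start, turn.end, label)
            for turn, _, label in annotation.itertracks(yield_label=True)
        ]

    async def diarize(self, wav_path: str) -> list[tuple[float, float, str]]:
        # Model inference is CPU/GPU-heavy; keep it off the event loop.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._run, wav_path)
```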

## Timestamp Alignment

For each Whisper segment `(start, end, text)`: find the pyannote speaker with the
greatest time overlap → assign that speaker label. Consecutive segments with the same
speaker are merged into one paragraph.
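
A sketch of the overlap rule, assuming the tuple shapes from the diagram above (the function name is illustrative):

```python
def align(
    whisper_segs: list[tuple[float, float, str]],
    speaker_segs: list[tuple[float, float, str]],
) -> list[tuple[str, str]]:
    aligned: list[tuple[str, str]] = []
    for start, end, text in whisper_segs:
        # Pick the speaker turn with the largest temporal overlap.
        best = max(
            speaker_segs,
            key=lambda s: max(0.0, min(end, s[1]) - max(start, s[0])),
            default=None,
        )
        speaker = best[2] if best is not None else "SPEAKER_00"
        if aligned and aligned[-1][0] == speaker:
            # Same speaker as the previous segment → merge into one paragraph.
            aligned[-1] = (speaker, aligned[-1][1] + " " + text)
        else:
            aligned.append((speaker, text))
    return aligned
```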

**Remote Whisper path:** request `timestamp_granularities=["segment"]` from the
OpenAI-compatible API — the response includes `segments[].start` and `segments[].end`.
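
With the official `openai` client this could look as follows; base URL, model name, and file name are placeholders, and `verbose_json` is required to get segments back:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # placeholder server

with open("meeting.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-1",                     # placeholder model name
        file=f,
        response_format="verbose_json",        # segments are only in verbose_json
        timestamp_granularities=["segment"],
    )

segments = [(seg.start, seg.end, seg.text) for seg in result.segments]
```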

## Speaker Name Identification

Ollama receives the first ~2000 chars of the aligned transcript and a prompt:

> "Analyze the following conversation transcript. Determine which names can be
> assigned to the speakers (e.g. through direct address). Reply ONLY with JSON:
> `{\"SPEAKER_00\": \"name or null\", …}`"

If all values are `null` or parsing fails → emit the `speakers_unknown` WebSocket event.
If at least one name is found → apply the known names, leave unknowns as `Sprecher N`.
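
A sketch of the call and the fallback decision, assuming Ollama's `/api/generate` endpoint; the model name and URL are assumptions:

```python
import json

import httpx


async def identify_speakers(transcript: str, prompt: str) -> dict[str, str] | None:
    # Send the prompt plus the first ~2000 chars; None means "ask the user".
    payload = {
        "model": "llama3",  # assumption — use the configured Ollama model
        "prompt": f"{prompt}\n\n{transcript[:2000]}",
        "stream": False,
    }
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post("http://localhost:11434/api/generate", json=payload)
    try:
        mapping = json.loads(resp.json()["response"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # unparseable reply → speakers_unknown
    if not isinstance(mapping, dict):
        return None
    names = {k: v for k, v in mapping.items() if isinstance(v, str) and v.strip()}
    return names or None  # all-null reply also falls back to the UI card
```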

## Frontend: Speaker Naming Card

Triggered by the `speakers_unknown` WS event. Shown above the record button.

Each speaker has:
- Excerpt navigator: `‹ "first few sentences…" 1/4 ›` — arrows cycle through all
  excerpts (3-4 sentences each) for that speaker (see the sketch after this list)
- Text input for the name
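
The excerpts themselves come from the backend; a sketch of how they could be cut from the aligned transcript (the helper name and the sentence splitter are illustrative):

```python
import re


def speaker_excerpts(
    aligned: list[tuple[str, str]], max_sentences: int = 4
) -> dict[str, list[str]]:
    # One excerpt (first few sentences of a paragraph), grouped by speaker.
    excerpts: dict[str, list[str]] = {}
    for speaker, text in aligned:
        sentences = re.split(r"(?<=[.!?])\s+", text)
        excerpts.setdefault(speaker, []).append(" ".join(sentences[:max_sentences]))
    return excerpts
```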

Buttons:
- **Übernehmen** (apply) → `POST /speakers` with `{"SPEAKER_00": "Thomas", …}` → pipeline
  writes the three documents and emits `saved`
- **Anonym lassen** (keep anonymous) → same POST with empty strings → labels stay as
  `Sprecher 1` etc.

## New API Endpoint

| Method | Path        | Description                                              |
|--------|-------------|----------------------------------------------------------|
| POST   | `/speakers` | Receives speaker name mapping, triggers document writing |

The pipeline pauses after alignment and waits for `/speakers` before writing output.
State stored in `api/state.py` as `state._pending_speakers`.
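
One way to implement the pause is an `asyncio.Future` that the `POST /speakers` handler resolves; a sketch, with a `SimpleNamespace` standing in for `api/state.py`:

```python
import asyncio
from types import SimpleNamespace

state = SimpleNamespace(_pending_speakers=None)  # stand-in for api/state.py


async def wait_for_speakers() -> dict[str, str]:
    # Pipeline side: park after alignment until the name mapping arrives.
    state._pending_speakers = asyncio.get_running_loop().create_future()
    return await state._pending_speakers


def resolve_speakers(mapping: dict[str, str]) -> None:
    # POST /speakers handler side: wake the waiting pipeline.
    fut = state._pending_speakers
    if fut is not None and not fut.done():
        fut.set_result(mapping)
```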

## Three Output Documents

All three share the same filename base (e.g. `2026-04-02-1430-Meeting`):

**`...-index.md`**
```markdown
# Meeting — 02.04.2026 14:30

**Sprecher:** Thomas, Möller
**Dauer:** 23 min

> [2-3 sentence TL;DR from Ollama]

- [Transkript](…-transkript.md)
- [Zusammenfassung](…-zusammenfassung.md)
```

**`...-transkript.md`** — Raw annotated transcript, new paragraph per speaker change:
```markdown
**Thomas:** Gut, dann fangen wir an.

**Möller:** Ich hab das Budget schon vorbereitet…
```
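
Rendering this file from the aligned pairs and the name mapping is a small step; a sketch (the function name is illustrative):

```python
def render_transkript(aligned: list[tuple[str, str]], names: dict[str, str]) -> str:
    # "**Name:** text", one blank-line-separated paragraph per speaker change.
    return "\n\n".join(
        f"**{names.get(speaker, speaker)}:** {text}" for speaker, text in aligned
    )
```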

**`...-zusammenfassung.md`** — Polished summary document (Ollama):
```markdown
# Meeting-Zusammenfassung — 02.04.2026

## Wichtigste Punkte
…

## Offene Fragen
…

## Nächste Schritte / Ideen
…
```

All three appear in the transcript list. Index entries get a `meeting` badge.

## HuggingFace Setup (one-time, per machine)

1. Create an account at huggingface.co
2. Go to https://huggingface.co/pyannote/speaker-diarization-3.1 → click
   "Access repository" and accept the terms of service
3. Go to huggingface.co/settings/tokens → create a token with **Read** access
4. Enter the token in Transkriptor under Einstellungen → Diarisierung

## Not in Scope

- Speaker voice profiles / pre-registration
- More than one diarization model
- Windows support