Speaker Diarization & Name Identification Design

Date: 2026-04-02

Goal

Extend the transcription pipeline with speaker diarization (pyannote.audio) and automatic speaker name identification (Ollama). Every recording produces three documents: an index, a raw transcript with speaker labels, and a polished summary.

Architecture

WAV
 ├─► Whisper       → segments [(start, end, text), …]
 ├─► pyannote      → speaker segments [(start, end, "SPEAKER_00"), …]
 │
 └─► Alignment     → [(speaker_label, text), …]
      │
      ├─► Ollama (name prompt) → {"SPEAKER_00": "Thomas", "SPEAKER_01": "Möller"}
      │    └─ Fallback: WS event `speakers_unknown` → UI card → POST /speakers
      │
      ├─► transkript.md   (speaker: text, new paragraph per speaker change)
      ├─► zusammenfassung.md  (key points, open questions, next steps)
      └─► index.md        (TL;DR, speakers, duration, links to both)

Config Schema Extension

[diarization]
enabled = true
hf_token = "hf_..."   # HuggingFace read token

New Module: diarization.py

class Diarizer:
    def __init__(self, hf_token: str): ...

    async def diarize(self, wav_path: str) -> list[tuple[float, float, str]]:
        """Return [(start_sec, end_sec, "SPEAKER_00"), …] for the given WAV file."""
        ...

Uses the pyannote/speaker-diarization-3.1 pipeline, loaded lazily on the first diarize() call. Inference runs via loop.run_in_executor so that it does not block the event loop.
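The lazy loading and executor offloading could look roughly like the sketch below. The helper names `_load_pipeline` and `_diarize_sync` are hypothetical. The `Pipeline.from_pretrained(..., use_auth_token=...)` call and the `itertracks(yield_label=True)` loop follow the usage shown for pyannote.audio 3.1; other releases may name the token argument differently, so treat that detail as an assumption.

```python
import asyncio


class Diarizer:
    def __init__(self, hf_token: str):
        self._hf_token = hf_token
        self._pipeline = None  # stays None until the first diarize() call

    def _load_pipeline(self):
        # Imported here so that importing this module does not pull in pyannote/torch.
        from pyannote.audio import Pipeline

        return Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=self._hf_token,  # argument name as of pyannote.audio 3.1 (assumption)
        )

    def _diarize_sync(self, wav_path: str) -> list[tuple[float, float, str]]:
        # Blocking part: model load (first call only) plus inference.
        if self._pipeline is None:
            self._pipeline = self._load_pipeline()
        annotation = self._pipeline(wav_path)
        return [
            (turn.start, turn.end, speaker)
            for turn, _, speaker in annotation.itertracks(yield_label=True)
        ]

    async def diarize(self, wav_path: str) -> list[tuple[float, float, str]]:
        # Run the blocking work in the default thread pool.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._diarize_sync, wav_path)
```

This sketch does not guard against two concurrent first calls both loading the model; if recordings can overlap, a lock around the load would be one way to handle that.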

Timestamp Alignment

For each Whisper segment (start, end, text): find the pyannote speaker with the greatest time overlap → assign that speaker label. Consecutive segments with the same speaker are merged into one paragraph.
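The alignment rule above can be sketched as follows. The function name `align` is hypothetical. Two details are assumptions rather than part of this design: overlap is summed per speaker when a speaker has several turns inside one Whisper segment, and a segment with no overlap at all gets a placeholder label (`unknown_label`).

```python
from collections import defaultdict

Segment = tuple[float, float, str]  # (start_sec, end_sec, text or speaker label)


def align(
    whisper_segments: list[Segment],
    speaker_segments: list[Segment],
    unknown_label: str = "UNKNOWN",  # hypothetical placeholder, not defined in the design
) -> list[tuple[str, str]]:
    paragraphs: list[tuple[str, str]] = []
    for w_start, w_end, text in whisper_segments:
        # Total overlap in seconds between this Whisper segment and each speaker.
        totals: dict[str, float] = defaultdict(float)
        for s_start, s_end, speaker in speaker_segments:
            shared = min(w_end, s_end) - max(w_start, s_start)
            if shared > 0:
                totals[speaker] += shared
        speaker = max(totals, key=totals.get) if totals else unknown_label

        text = text.strip()
        if paragraphs and paragraphs[-1][0] == speaker:
            # Same speaker as the previous segment: extend the current paragraph.
            paragraphs[-1] = (speaker, paragraphs[-1][1] + " " + text)
        else:
            paragraphs.append((speaker, text))
    return paragraphs
```

The nested loop is quadratic in the number of segments, which is likely fine for meeting-length recordings; a two-pointer sweep over sorted segments would be the usual optimization if it ever matters.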

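Remote Whisper path: request timestamp_granularities=["segment"] from the OpenAI-compatible API; the response then includes segments[].start and segments[].end. On the OpenAI API this parameter only takes effect together with response_format="verbose_json"; other compatible servers may behave differently.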
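To feed the remote result into the same alignment step as local Whisper, the response can be reduced to `(start, end, text)` tuples. The sketch below assumes the response has already been parsed into a dict and that each entry in `segments` also carries a `text` field, as in OpenAI's verbose JSON format; `segments_from_response` is a hypothetical name.

```python
def segments_from_response(payload: dict) -> list[tuple[float, float, str]]:
    """Convert a parsed transcription response into [(start_sec, end_sec, text), …]."""
    return [
        (float(seg["start"]), float(seg["end"]), str(seg.get("text", "")).strip())
        for seg in payload.get("segments", [])
    ]
```

A response without a `segments` key yields an empty list here, which the caller can treat as "timestamps unavailable".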

Speaker Name Identification

Ollama receives the first ~2000 chars of the aligned transcript and a prompt:

"Analysiere das folgende Gesprächstranskript. Ermittle welche Namen den Sprechern zugeordnet werden können (z.B. durch direkte Anrede). Antworte NUR mit JSON: {\"SPEAKER_00\": \"Name oder null\", …}"

If all values are null or parsing fails → emit speakers_unknown WebSocket event. If at least one name is found → apply known names, leave unknowns as Sprecher N.
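The decision rule above could be implemented along these lines. `parse_speaker_names` is a hypothetical helper. It treats both JSON `null` and the literal string "null" as "no name", because the prompt wording allows either, and it assumes that N in Sprecher N is the 1-based position of the label in the speaker list. A return value of `None` stands for "emit speakers_unknown".

```python
import json
from typing import Optional


def parse_speaker_names(raw: str, labels: list[str]) -> Optional[dict[str, str]]:
    """Map speaker labels to display names, or return None if the UI fallback is needed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # parsing failed
    if not isinstance(data, dict):
        return None

    found: dict[str, str] = {}
    for label in labels:
        value = data.get(label)
        if isinstance(value, str):
            name = value.strip()
            if name and name.lower() != "null":
                found[label] = name
    if not found:
        return None  # all values were null

    # Apply known names; leave the rest as "Sprecher N".
    return {
        label: found.get(label, f"Sprecher {n}")
        for n, label in enumerate(labels, start=1)
    }
```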

Frontend: Speaker Naming Card

Triggered by speakers_unknown WS event. Shown above the record button.

Each speaker has:

  • Excerpt navigator: "first few sentences…" 1/4 — arrows cycle through all excerpts (3-4 sentences each) for that speaker
  • Text input for the name

Buttons:

  • Übernehmen ("Apply") → POST /speakers with {"SPEAKER_00": "Thomas", …} → pipeline writes the three documents and emits saved
  • Anonym lassen ("Leave anonymous") → same POST with empty strings → labels stay as Sprecher 1 etc.
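On the server side, both buttons can share one code path, since "Anonym lassen" is the same request with empty strings. A minimal sketch, assuming a hypothetical helper `resolve_display_names` and assuming that Sprecher N is numbered by the label's 1-based position in the speaker list:

```python
def resolve_display_names(posted: dict[str, str], labels: list[str]) -> dict[str, str]:
    """Turn the POST /speakers body into display names, falling back to 'Sprecher N'."""
    resolved: dict[str, str] = {}
    for n, label in enumerate(labels, start=1):
        name = (posted.get(label) or "").strip()
        resolved[label] = name or f"Sprecher {n}"
    return resolved
```

Labels that are missing from the request body are treated like empty strings here, which is an assumption; rejecting such requests would be an equally reasonable choice.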

New API Endpoint

| Method | Path | Description |
|--------|------|-------------|
| POST | /speakers | Receives speaker name mapping, triggers document writing |

If name identification falls back to speakers_unknown, the pipeline pauses before writing output and waits for POST /speakers. The pending state is stored in api/state.py as state._pending_speakers.
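One way to implement the pause is an `asyncio.Future` that the pipeline awaits and the endpoint handler resolves. This is a sketch under assumptions: only the attribute name `_pending_speakers` comes from this design, while `PipelineState`, `wait_for_speakers`, and `submit_speakers` are hypothetical, and the web framework wiring is left out.

```python
import asyncio
from typing import Optional


class PipelineState:
    def __init__(self) -> None:
        self._pending_speakers: Optional[asyncio.Future] = None

    async def wait_for_speakers(self) -> dict[str, str]:
        """Called by the pipeline: suspend until POST /speakers delivers a mapping."""
        self._pending_speakers = asyncio.get_running_loop().create_future()
        try:
            return await self._pending_speakers
        finally:
            self._pending_speakers = None

    def submit_speakers(self, mapping: dict[str, str]) -> bool:
        """Called by the POST /speakers handler; returns False if nothing is waiting."""
        pending = self._pending_speakers
        if pending is None or pending.done():
            return False
        pending.set_result(mapping)
        return True
```

The boolean return lets the handler answer with an error status when a mapping arrives while no recording is waiting. A timeout around the wait (for example `asyncio.wait_for`) would keep an abandoned naming card from blocking the pipeline indefinitely.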

Three Output Documents

All three share the same filename base (e.g. 2026-04-02-1430-Meeting):

...-index.md

# Meeting — 02.04.2026 14:30

**Sprecher:** Thomas, Möller
**Dauer:** 23 min

> [2-3 sentence TL;DR from Ollama]

- [Transkript](…-transkript.md)
- [Zusammenfassung](…-zusammenfassung.md)

...-transkript.md — Raw annotated transcript, new paragraph per speaker change:

**Thomas:** Gut, dann fangen wir an.

**Möller:** Ich hab das Budget schon vorbereitet…

...-zusammenfassung.md — Polished summary document (Ollama):

# Meeting-Zusammenfassung — 02.04.2026

## Wichtigste Punkte
## Offene Fragen
## Nächste Schritte / Ideen

All three appear in the transcript list. Index entries get a meeting badge.
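The shared filename base and the transcript and index layouts shown above can be sketched as plain string rendering. All function names here are hypothetical, and the summary document is omitted because its body comes from Ollama.

```python
def output_paths(base: str) -> dict[str, str]:
    """Map each document kind to its filename for a shared base name."""
    return {
        kind: f"{base}-{kind}.md"
        for kind in ("index", "transkript", "zusammenfassung")
    }


def render_transcript(paragraphs: list[tuple[str, str]], names: dict[str, str]) -> str:
    """One bold-prefixed paragraph per speaker turn, separated by blank lines."""
    blocks = [f"**{names.get(label, label)}:** {text}" for label, text in paragraphs]
    return "\n\n".join(blocks) + "\n"


def render_index(title: str, speakers: list[str], minutes: int, tldr: str, base: str) -> str:
    """Index page: heading, speakers, duration, TL;DR quote, links to the other two files."""
    paths = output_paths(base)
    lines = [
        f"# {title}",
        "",
        f"**Sprecher:** {', '.join(speakers)}",
        f"**Dauer:** {minutes} min",
        "",
        f"> {tldr}",
        "",
        f"- [Transkript]({paths['transkript']})",
        f"- [Zusammenfassung]({paths['zusammenfassung']})",
        "",
    ]
    return "\n".join(lines)
```

`render_transcript` expects the merged `(speaker_label, text)` paragraphs from the alignment step and falls back to the raw label when a name is missing from the mapping.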

HuggingFace Setup (one-time, per machine)

  1. Create account at huggingface.co
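  2. Go to https://huggingface.co/pyannote/speaker-diarization-3.1 → click "Access repository" and accept the terms of service. The model card also lists accepting the user conditions of pyannote/segmentation-3.0 as a prerequisite, so repeat this at https://huggingface.co/pyannote/segmentation-3.0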
  3. Go to huggingface.co/settings/tokens → create a token with Read access
  4. Enter the token in Transkriptor settings → Einstellungen → Diarisierung

Not in Scope

  • Speaker voice profiles / pre-registration
  • More than one diarization model
  • Windows support