# Design: tüit Transkriptor

**Date:** 2026-04-01
**Status:** Approved
**Platform:** Arch Linux, KDE Plasma (Wayland), AMD RX 6800 XT (RDNA2 / ROCm)

## Goal

A local AI transcription tool that runs as a system tray application, monitors audio input, and produces LLM-refined Markdown transcripts saved directly into the Nextcloud-synced notes folder. It acts as a personal secretary: the user can provide instructions alongside the recording to steer the LLM output.

## Architecture

```
tueit_Transkriptor/
├── main.py                  # Entry point: starts FastAPI + pystray
├── api/
│   ├── router.py            # REST endpoints + WebSocket
│   └── state.py             # Global app state (recording, transcript, ...)
├── audio.py                 # sounddevice → PCM buffer
├── transcription.py         # faster-whisper wrapper (ROCm-capable)
├── llm.py                   # Ollama httpx client
├── output.py                # Render Markdown + write to Nextcloud folder
├── config.py                # TOML config (~/.config/tueit-transcriber/config.toml)
├── frontend/
│   ├── index.html           # Single-page UI (tüit CI: dark mode, #DA251C, #FFD802, Overpass)
│   └── app.js               # WebSocket client for live status
├── install.sh               # Check deps (ROCm, Ollama, Python packages), set up systemd user service
└── requirements.txt
```

## Data Flow

```
SIGUSR1 / Tray click / API POST /toggle
    → sounddevice captures PCM (16 kHz mono)
    → on stop: WAV → faster-whisper → raw text
    → raw text + user instructions (from UI) → Ollama (gemma3:12b via ROCm)
    → Markdown with tüit CI (frontmatter, headings, highlights)
    → file: ~/cloud.shron.de/Hetzner Storagebox/work/YYYY-MM-DD-HHmm-<title>.md
```
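The stop step wraps the captured PCM buffer in a WAV container before handing it to faster-whisper. A minimal stdlib sketch of that step (the function name `pcm_to_wav` is illustrative, not from the codebase):

```python
import io
import wave

SAMPLE_RATE = 16_000  # matches the 16 kHz mono capture in the data flow


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container for faster-whisper."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buf.getvalue()
```

In the real pipeline faster-whisper can also consume a NumPy float array directly, so this conversion may happen in memory rather than via a temporary file.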

## API Endpoints

| Method | Path           | Purpose                                             |
|--------|----------------|-----------------------------------------------------|
| POST   | `/toggle`      | Start/stop recording (also triggered via SIGUSR1)   |
| GET    | `/status`      | Current state: recording / processing / idle        |
| GET    | `/transcripts` | List of recent transcripts                          |
| WS     | `/ws`          | Live updates to frontend                            |
| GET    | `/config`      | Current configuration                               |
| PUT    | `/config`      | Update config (model, output path, ...)             |
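`POST /toggle` and `GET /status` both hinge on a small shared state object that must be safe to touch from the tray thread, the signal handler, and API requests. A hedged sketch of what `api/state.py` might hold (class and field names are assumptions):

```python
import threading
from dataclasses import dataclass, field


@dataclass
class AppState:
    """Shared state behind /toggle and /status (hypothetical shape)."""
    recording: bool = False
    processing: bool = False
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def toggle(self) -> str:
        """Flip recording on/off; called from tray, SIGUSR1 and POST /toggle."""
        with self._lock:
            self.recording = not self.recording
            return "recording" if self.recording else "stopped"

    @property
    def status(self) -> str:
        """The three states exposed by GET /status."""
        if self.recording:
            return "recording"
        return "processing" if self.processing else "idle"
```

The FastAPI handlers would then be thin wrappers that call `toggle()` and return `status`.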

Planned Thunderbird integration: `POST /compose`, which generates a draft email from the transcript. The API foundation for this is already in place.

## UI

A persistent browser window is opened at startup (http://localhost:8765), using dark mode, the tüit CI colors, and the Overpass font.

- Top: Large record toggle button (red when active, grey when idle) + status display
- Middle: Instruction text field — persistent, included as LLM context on every processing run. Examples: "highlight the key points", "create a ticket for this", "draft an offer"
- Bottom: Live transcript preview during processing; list of recent transcripts (clickable → opens file)
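The instruction field feeds into the Ollama prompt on every processing run. One plausible shape for the prompt builder in `llm.py` (the helper name and the prompt wording are illustrative):

```python
def build_prompt(raw_transcript: str, instructions: str) -> str:
    """Combine the raw Whisper text with the persistent instruction
    field into a single prompt for Ollama (hypothetical llm.py helper)."""
    parts = [
        "Refine the following transcript into clean Markdown with a title.",
    ]
    if instructions.strip():
        # The UI's instruction field is optional; only add it when non-empty.
        parts.append(f"Additional instructions: {instructions.strip()}")
    parts.append(f"Transcript:\n{raw_transcript.strip()}")
    return "\n\n".join(parts)
```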

## Trigger Mechanism

- Tray icon click — toggles recording
- SIGUSR1 — toggles recording (Wayland-compatible hotkey workaround)
  - PID written to ~/.local/run/tueit-transcriber.pid
  - KDE custom shortcut: `pkill -USR1 -f main.py`
  - Works on any DE, Wayland-independent

## Ollama Model

The RX 6800 XT has 16 GB GDDR6 VRAM. ROCm supports RDNA2 since ROCm 5.x.

| Component | Model                 | VRAM    |
|-----------|-----------------------|---------|
| LLM       | gemma3:12b (default)  | ~8–9 GB |
| Whisper   | large-v3 (ROCm)       | ~3 GB   |
| Fallback  | Whisper medium        | ~1.5 GB |

Both configurable via config.toml and the settings UI.
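A config file matching this design might look as follows; the section and key names are assumptions for illustration, not a shipped schema:

```toml
# ~/.config/tueit-transcriber/config.toml (key names are illustrative)

[models]
llm = "gemma3:12b"
whisper = "large-v3"      # fall back to "medium" under VRAM pressure

[output]
path = "~/cloud.shron.de/Hetzner Storagebox/work"

[server]
port = 8765
```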

## Output Format

Markdown file with YAML frontmatter:

```markdown
---
date: 2026-04-01T14:32:00
tags: [transkript]
---

# <LLM-generated title>

<LLM-refined transcript content>
```

File naming: `YYYY-MM-DD-HHmm-<slugified-title>.md`
Output path: `~/cloud.shron.de/Hetzner Storagebox/work/`
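The file name is easy to derive from the LLM-generated title; a sketch with illustrative slug rules (the real slugifier may handle umlauts and other non-ASCII characters differently):

```python
import re
from datetime import datetime


def transcript_filename(title: str, when: datetime) -> str:
    """Build YYYY-MM-DD-HHmm-<slugified-title>.md (hypothetical helper)."""
    # Collapse anything that is not a lowercase letter or digit into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{when.strftime('%Y-%m-%d-%H%M')}-{slug}.md"
```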

## Dependencies

- faster-whisper — Whisper inference
- sounddevice — audio capture
- httpx — Ollama API client
- fastapi + uvicorn — local HTTP/WebSocket server
- pystray + Pillow — system tray icon
- tomli / tomllib — TOML config
- Ollama (system, with ROCm)
- ROCm (system, via pacman: rocm-hip-sdk)

## Future Extensions

- Thunderbird integration via `POST /compose`
- Zammad ticket creation via `POST /ticket`
- Template system (e.g. "offer", "reminder", "meeting notes")