# Design: tüit Transkriptor
**Date:** 2026-04-01
**Status:** Approved
**Platform:** Arch Linux, KDE Plasma (Wayland), AMD RX 6800 XT (RDNA2 / ROCm)
## Goal
A local AI transcription tool that runs as a system tray application, monitors audio input, and produces LLM-refined Markdown transcripts saved directly into the Nextcloud-synced notes folder. Designed as a personal secretary — the user can provide instructions alongside the recording to guide the LLM output.
## Architecture
```
tueit_Transkriptor/
├── main.py # Entry point: starts FastAPI + pystray
├── api/
│ ├── router.py # REST endpoints + WebSocket
│ └── state.py # Global app state (recording, transcript, ...)
├── audio.py # sounddevice → PCM buffer
├── transcription.py # faster-whisper wrapper (ROCm-capable)
├── llm.py # Ollama httpx client
├── output.py # Render Markdown + write to Nextcloud folder
├── config.py # TOML config (~/.config/tueit-transcriber/config.toml)
├── frontend/
│ ├── index.html # Single-page UI (tüit CI: dark mode, #DA251C, #FFD802, Overpass)
│ └── app.js # WebSocket client for live status
├── install.sh # Check deps (ROCm, Ollama, Python packages), set up systemd user service
└── requirements.txt
```
## Data Flow
```
SIGUSR1 / Tray click / API POST /toggle
→ sounddevice captures PCM (16kHz mono)
→ on stop: WAV → faster-whisper → raw text
→ raw text + user instructions (from UI) → Ollama (gemma3:12b via ROCm)
→ Markdown with tüit CI (frontmatter, headings, highlights)
→ file: ~/cloud.shron.de/Hetzner Storagebox/work/YYYY-MM-DD-HHmm-<title>.md
```
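
The "on stop: WAV" step can be sketched with the standard library alone. This is a minimal sketch, not the project's `audio.py`: it assumes the buffer holds 16-bit little-endian samples (the design only fixes 16 kHz mono), and the names `pcm_to_wav_bytes` and `duration_seconds` are hypothetical.

```python
import io
import wave

SAMPLE_RATE = 16_000  # 16 kHz, as stated in the data flow
CHANNELS = 1          # mono
SAMPLE_WIDTH = 2      # bytes per sample; 16-bit PCM is an assumption


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw int16 PCM in a WAV container, entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    # wave does not close file objects it did not open itself,
    # so the buffer is still readable here.
    return buffer.getvalue()


def duration_seconds(pcm: bytes) -> float:
    """Recording length implied by the size of the PCM buffer."""
    return len(pcm) / (SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH)
```

The resulting bytes could be written to a temporary file or passed on as a file-like object, depending on what the transcription wrapper expects.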
## API Endpoints
| Method | Path | Purpose |
|--------|------|---------|
| `POST` | `/toggle` | Start/stop recording (also triggered via SIGUSR1) |
| `GET` | `/status` | Current state: recording / processing / idle |
| `GET` | `/transcripts` | List of recent transcripts |
| `WS` | `/ws` | Live updates to frontend |
| `GET` | `/config` | Current configuration |
| `PUT` | `/config` | Update config (model, output path, ...) |
Future Thunderbird integration: `POST /compose` — generates a draft from the transcript. The API foundation is already in place.
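
The three states reported by `/status` and the effect of `/toggle` can be illustrated as a small state machine. This is only a sketch of what `api/state.py` might hold: the class and method names are hypothetical, and ignoring a toggle while processing is an assumption the design does not spell out.

```python
from enum import Enum


class Phase(str, Enum):
    IDLE = "idle"
    RECORDING = "recording"
    PROCESSING = "processing"


class AppState:
    """Hypothetical stand-in for the global app state."""

    def __init__(self) -> None:
        self.phase = Phase.IDLE

    def toggle(self) -> Phase:
        """Shared entry point for POST /toggle, the tray click and SIGUSR1."""
        if self.phase is Phase.IDLE:
            self.phase = Phase.RECORDING
        elif self.phase is Phase.RECORDING:
            # Stopping hands the audio over to Whisper and the LLM.
            self.phase = Phase.PROCESSING
        # Assumption: a toggle that arrives during processing is ignored.
        return self.phase

    def finish_processing(self) -> Phase:
        """Called once the transcript file has been written."""
        if self.phase is Phase.PROCESSING:
            self.phase = Phase.IDLE
        return self.phase

    def status(self) -> dict:
        """A body that GET /status could return."""
        return {"state": self.phase.value}
```

Routing all three triggers through one `toggle()` keeps the tray icon, the signal handler and the REST endpoint from drifting apart.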
## UI
Permanent browser window opened at startup (`http://localhost:8765`). Dark mode, tüit CI colors and Overpass font.
- **Top:** Large record toggle button (red when active, grey when idle) + status display
- **Middle:** Instruction text field — persistent, included as LLM context on every processing run. Examples: "highlight the key points", "create a ticket for this", "draft an offer"
- **Bottom:** Live transcript preview during processing; list of recent transcripts (clickable → opens file)
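
How the instruction field could reach the LLM is sketched below as a request body for Ollama's `/api/generate` endpoint. The system prompt wording, the prompt layout and the function name `build_generate_payload` are assumptions for illustration, not the project's actual `llm.py`.

```python
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

# Assumed wording; the real system prompt is not part of this design.
SYSTEM_PROMPT = "Turn the raw speech transcript into a clean Markdown note."


def build_generate_payload(raw_text: str, instructions: str,
                           model: str = "gemma3:12b") -> dict:
    """Combine the transcript and the persistent UI instructions."""
    parts = []
    if instructions.strip():
        # The instruction field is optional; skip the section when empty.
        parts.append("Instructions from the user:\n" + instructions.strip())
    parts.append("Transcript:\n" + raw_text.strip())
    return {
        "model": model,
        "system": SYSTEM_PROMPT,
        "prompt": "\n\n".join(parts),
        "stream": False,  # ask for one JSON object instead of a stream
    }
```

With `httpx`, the payload could then be sent as `httpx.post(OLLAMA_URL, json=payload, timeout=300)`, with the generated text read from the `response` field of the returned JSON. The timeout value is a guess.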
## Trigger Mechanism
- **Tray icon click** — toggles recording
- **SIGUSR1** — toggles recording (Wayland-compatible hotkey workaround)
  - PID written to `~/.local/run/tueit-transcriber.pid`
  - KDE custom shortcut: `pkill -USR1 -f main.py`
  - Works on any DE, Wayland-independent
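
The SIGUSR1 trigger and the PID file can be sketched with the standard library. The handler body, the `state` dictionary and the function names are hypothetical; the real handler would call the application's toggle instead of counting.

```python
import os
import signal
from pathlib import Path

PID_FILE = Path.home() / ".local" / "run" / "tueit-transcriber.pid"

# Stand-in for the real application state (assumption for this sketch).
state = {"toggles": 0}


def write_pid_file(path: Path = PID_FILE) -> int:
    """Record the current PID so a hotkey command can signal this process."""
    path.parent.mkdir(parents=True, exist_ok=True)
    pid = os.getpid()
    path.write_text(f"{pid}\n")
    return pid


def handle_sigusr1(signum, frame) -> None:
    """Signal handler: the real version would toggle recording."""
    state["toggles"] += 1


def install_handler() -> None:
    """Register the handler; signal.signal must be called from the main thread."""
    signal.signal(signal.SIGUSR1, handle_sigusr1)
```

With the PID file in place, a shortcut could also target the process exactly, for example `kill -USR1 "$(cat ~/.local/run/tueit-transcriber.pid)"`, rather than matching on the command line.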
## Ollama Model
The RX 6800 XT has 16 GB of GDDR6 VRAM. ROCm has supported RDNA2 since ROCm 5.x.
| Component | Model | VRAM |
|-----------|-------|------|
| LLM | `gemma3:12b` (default) | ~8-9 GB |
| Whisper | `large-v3` (ROCm) | ~3 GB |
| Fallback Whisper | `medium` | ~1.5 GB |
The LLM and the Whisper model are both configurable via `config.toml` and the settings UI.
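
A minimal sketch of how model choices in `config.toml` might be merged over defaults. The table and key names (`llm.model`, `whisper.model`, `whisper.fallback_model`, `output.path`) are assumptions; the design does not define the config schema.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API

# Defaults taken from this design; the key names are hypothetical.
DEFAULTS = {
    "llm": {"model": "gemma3:12b"},
    "whisper": {"model": "large-v3", "fallback_model": "medium"},
    "output": {"path": "~/cloud.shron.de/Hetzner Storagebox/work/"},
}


def load_config(toml_text: str) -> dict:
    """Overlay user settings on DEFAULTS, one table at a time."""
    user = tomllib.loads(toml_text)
    merged = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in user.items():
        if isinstance(values, dict):
            merged.setdefault(section, {}).update(values)
        else:
            # Top-level scalar keys are copied through unchanged.
            merged[section] = values
    return merged
```

Note that `tomllib` only reads TOML, so `PUT /config` would need a separate way to write the file back.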
## Output Format
Markdown file with YAML frontmatter:
```markdown
---
date: 2026-04-01T14:32:00
tags: [transkript]
---
# <LLM-generated title>
<LLM-refined transcript content>
```
File naming: `YYYY-MM-DD-HHmm-<slugified-title>.md`
Output path: `~/cloud.shron.de/Hetzner Storagebox/work/`
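
The naming scheme and frontmatter can be sketched as follows. The slug rules (ASCII only, lowercase, hyphen-separated) and the function names are assumptions, since the design only says "slugified".

```python
import re
import unicodedata
from datetime import datetime


def slugify(title: str) -> str:
    """Lowercase ASCII slug; accents are stripped rather than transliterated."""
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    return slug or "untitled"


def build_filename(title: str, when: datetime) -> str:
    """YYYY-MM-DD-HHmm-<slugified-title>.md"""
    return f"{when:%Y-%m-%d-%H%M}-{slugify(title)}.md"


def render_markdown(title: str, body: str, when: datetime) -> str:
    """YAML frontmatter followed by the title and the refined transcript."""
    return (
        "---\n"
        f"date: {when.isoformat(timespec='seconds')}\n"
        "tags: [transkript]\n"
        "---\n"
        f"# {title}\n"
        f"{body.rstrip()}\n"
    )
```

For a recording finished at 14:32 on 2026-04-01 with the title "Meeting mit tüit: Angebot!", `build_filename` yields `2026-04-01-1432-meeting-mit-tuit-angebot.md`. German-style transliteration ("ü" to "ue") would need an extra mapping step.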
## Dependencies
- `faster-whisper` — Whisper inference
- `sounddevice` — audio capture
- `httpx` — Ollama API client
- `fastapi` + `uvicorn` — local HTTP/WebSocket server
- `pystray` + `Pillow` — system tray icon
- `tomli` / `tomllib` — TOML config
- Ollama (system, with ROCm)
- ROCm (system, via pacman: `rocm-hip-sdk`)
## Future Extensions
- Thunderbird integration via `POST /compose`
- Zammad ticket creation via `POST /ticket`
- Template system (e.g. "offer", "reminder", "meeting notes")