From 4c04e17d0678e74259efacc77d2b02bdce4ca03b Mon Sep 17 00:00:00 2001
From: "thomas.kopp"
Date: Wed, 1 Apr 2026 01:58:15 +0200
Subject: [PATCH] =?UTF-8?q?docs:=20initial=20design=20for=20t=C3=BCit=20Tr?=
 =?UTF-8?q?anskriptor=20desktop=20transcription=20tool?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 ...2026-04-01-desktop-transcription-design.md | 116 ++++++++++++++++++
 1 file changed, 116 insertions(+)
 create mode 100644 docs/plans/2026-04-01-desktop-transcription-design.md

diff --git a/docs/plans/2026-04-01-desktop-transcription-design.md b/docs/plans/2026-04-01-desktop-transcription-design.md
new file mode 100644
index 0000000..a60ba20
--- /dev/null
+++ b/docs/plans/2026-04-01-desktop-transcription-design.md
@@ -0,0 +1,116 @@
+# Design: tüit Transkriptor
+
+**Date:** 2026-04-01
+**Status:** Approved
+**Platform:** Arch Linux, KDE Plasma (Wayland), AMD RX 6800 XT (RDNA2 / ROCm)
+
+## Goal
+
+A local AI transcription tool that runs as a system tray application, monitors audio input, and produces LLM-refined Markdown transcripts saved directly into the Nextcloud-synced notes folder. Designed as a personal secretary — the user can provide instructions alongside the recording to guide the LLM output.
+
+## Architecture
+
+```
+tueit_Transkriptor/
+├── main.py              # Entry point: starts FastAPI + pystray
+├── api/
+│   ├── router.py        # REST endpoints + WebSocket
+│   └── state.py         # Global app state (recording, transcript, ...)
+├── audio.py             # sounddevice → PCM buffer
+├── transcription.py     # faster-whisper wrapper (ROCm-capable)
+├── llm.py               # Ollama httpx client
+├── output.py            # Render Markdown + write to Nextcloud folder
+├── config.py            # TOML config (~/.config/tueit-transcriber/config.toml)
+├── frontend/
+│   ├── index.html       # Single-page UI (tüit CI: dark mode, #DA251C, #FFD802, Overpass)
+│   └── app.js           # WebSocket client for live status
+├── install.sh           # Check deps (ROCm, Ollama, Python packages), set up systemd user service
+└── requirements.txt
+```
+
+## Data Flow
+
+```
+SIGUSR1 / Tray click / API POST /toggle
+  → sounddevice captures PCM (16kHz mono)
+  → on stop: WAV → faster-whisper → raw text
+  → raw text + user instructions (from UI) → Ollama (gemma3:12b via ROCm)
+  → Markdown with tüit CI (frontmatter, headings, highlights)
+  → file: ~/cloud.shron.de/Hetzner Storagebox/work/YYYY-MM-DD-HHmm-.md
+```
+
+## API Endpoints
+
+| Method | Path | Purpose |
+|--------|------|---------|
+| `POST` | `/toggle` | Start/stop recording (also triggered via SIGUSR1) |
+| `GET` | `/status` | Current state: recording / processing / idle |
+| `GET` | `/transcripts` | List of recent transcripts |
+| `WS` | `/ws` | Live updates to frontend |
+| `GET` | `/config` | Current configuration |
+| `PUT` | `/config` | Update config (model, output path, ...) |
+
+Future Thunderbird integration: `POST /compose` — generates a draft from the transcript. The API foundation is already in place.
+
+## UI
+
+Permanent browser window opened at startup (`http://localhost:8765`). Dark mode, tüit CI colors and Overpass font.
+
+- **Top:** Large record toggle button (red when active, grey when idle) + status display
+- **Middle:** Instruction text field — persistent, included as LLM context on every processing run. Examples: "highlight the key points", "create a ticket for this", "draft an offer"
+- **Bottom:** Live transcript preview during processing; list of recent transcripts (clickable → opens file)
+
+## Trigger Mechanism
+
+- **Tray icon click** — toggles recording
+- **SIGUSR1** — toggles recording (Wayland-compatible hotkey workaround)
+  - PID written to `~/.local/run/tueit-transcriber.pid`
+  - KDE custom shortcut: `pkill -USR1 -f main.py`
+  - Works on any DE, Wayland-independent
+
+## Ollama Model
+
+The RX 6800 XT has 16 GB GDDR6 VRAM. ROCm supports RDNA2 since ROCm 5.x.
+
+| Component | Model | VRAM |
+|-----------|-------|------|
+| LLM | `gemma3:12b` (default) | ~8–9 GB |
+| Whisper | `large-v3` (ROCm) | ~3 GB |
+| Fallback Whisper | `medium` | ~1.5 GB |
+
+Both configurable via `config.toml` and the settings UI.
+
+## Output Format
+
+Markdown file with YAML frontmatter:
+
+```markdown
+---
+date: 2026-04-01T14:32:00
+tags: [transkript]
+---
+
+# <LLM-generated title>
+
+<LLM-refined transcript content>
+```
+
+File naming: `YYYY-MM-DD-HHmm-<slugified-title>.md`
+Output path: `~/cloud.shron.de/Hetzner Storagebox/work/`
+
+## Dependencies
+
+- `faster-whisper` — Whisper inference
+- `sounddevice` — audio capture
+- `httpx` — Ollama API client
+- `fastapi` + `uvicorn` — local HTTP/WebSocket server
+- `pystray` + `Pillow` — system tray icon
+- `tomli` / `tomllib` — TOML config
+- Ollama (system, with ROCm)
+- ROCm (system, via pacman: `rocm-hip-sdk`)
+
+## Future Extensions
+
+- Thunderbird integration via `POST /compose`
+- Zammad ticket creation via `POST /ticket`
+- Template system (e.g. "offer", "reminder", "meeting notes")
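
Note on the Output Format section: the file naming and frontmatter described there could be produced roughly as follows. This is a minimal sketch and not part of the patch. The helper names `slugify`, `transcript_filename`, and `render_transcript` are hypothetical, and the slug rules (ASCII folding, lowercase, hyphen-separated) are an assumption, since the design only specifies `<slugified-title>`.

```python
import re
import unicodedata
from datetime import datetime


def slugify(title: str) -> str:
    """Hypothetical helper: fold to ASCII, lowercase, join words with hyphens."""
    # NFKD splits "ü" into "u" plus a combining mark; the ASCII encode drops the mark.
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    # Assumption: fall back to a fixed name if nothing usable remains.
    return slug or "untitled"


def transcript_filename(title: str, when: datetime) -> str:
    """Build a name of the form YYYY-MM-DD-HHmm-<slugified-title>.md."""
    return f"{when:%Y-%m-%d-%H%M}-{slugify(title)}.md"


def render_transcript(title: str, body: str, when: datetime) -> str:
    """Render YAML frontmatter, a title heading, and the transcript body."""
    stamp = when.isoformat(timespec="seconds")
    frontmatter = f"---\ndate: {stamp}\ntags: [transkript]\n---\n"
    return f"{frontmatter}\n# {title}\n\n{body}\n"
```

In `output.py`, the rendered text would then presumably be written to the configured output path under the generated file name, with the title and body both coming from the LLM response.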