docs: initial design for tüit Transkriptor desktop transcription tool
# Design: tüit Transkriptor

**Date:** 2026-04-01
**Status:** Approved
**Platform:** Arch Linux, KDE Plasma (Wayland), AMD RX 6800 XT (RDNA2 / ROCm)
## Goal

A local AI transcription tool that runs as a system tray application, monitors audio input, and produces LLM-refined Markdown transcripts saved directly into the Nextcloud-synced notes folder. It is designed as a personal secretary: the user can provide instructions alongside the recording to guide the LLM output.
## Architecture

```
tueit_Transkriptor/
├── main.py          # Entry point: starts FastAPI + pystray
├── api/
│   ├── router.py    # REST endpoints + WebSocket
│   └── state.py     # Global app state (recording, transcript, ...)
├── audio.py         # sounddevice → PCM buffer
├── transcription.py # faster-whisper wrapper (ROCm-capable)
├── llm.py           # Ollama httpx client
├── output.py        # Render Markdown + write to Nextcloud folder
├── config.py        # TOML config (~/.config/tueit-transcriber/config.toml)
├── frontend/
│   ├── index.html   # Single-page UI (tüit CI: dark mode, #DA251C, #FFD802, Overpass)
│   └── app.js       # WebSocket client for live status
├── install.sh       # Check deps (ROCm, Ollama, Python packages), set up systemd user service
└── requirements.txt
```
## Data Flow

```
SIGUSR1 / Tray click / API POST /toggle
  → sounddevice captures PCM (16 kHz mono)
  → on stop: WAV → faster-whisper → raw text
  → raw text + user instructions (from UI) → Ollama (gemma3:12b via ROCm)
  → Markdown with tüit CI (frontmatter, headings, highlights)
  → file: ~/cloud.shron.de/Hetzner Storagebox/work/YYYY-MM-DD-HHmm-<title>.md
```
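
The "on stop: WAV" step in the flow above can be sketched with the standard library alone. This is a minimal sketch, assuming the capture buffer holds 16-bit little-endian samples (the design only specifies 16 kHz mono). `pcm_to_wav_bytes` and `duration_seconds` are hypothetical helper names, not part of the planned `audio.py`.

```python
import io
import struct
import wave

SAMPLE_RATE = 16_000  # 16 kHz, as stated in the data flow
CHANNELS = 1          # mono, as stated in the data flow
SAMPLE_WIDTH = 2      # bytes per sample; 16-bit PCM is an assumption


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw 16-bit little-endian PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()


def duration_seconds(wav_bytes: bytes) -> float:
    """Read the WAV header back and compute the clip length in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# One second of silence: 16 000 samples with value 0.
silence = struct.pack("<16000h", *([0] * 16000))
wav_bytes = pcm_to_wav_bytes(silence)
print(duration_seconds(wav_bytes))
```

Building the WAV in memory keeps the recording off disk until transcription. Whether the real tool writes a temporary file instead is left open by the design.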
## API Endpoints

| Method | Path | Purpose |
|--------|------|---------|
| `POST` | `/toggle` | Start/stop recording (also triggered via SIGUSR1) |
| `GET` | `/status` | Current state: recording / processing / idle |
| `GET` | `/transcripts` | List of recent transcripts |
| `WS` | `/ws` | Live updates to frontend |
| `GET` | `/config` | Current configuration |
| `PUT` | `/config` | Update config (model, output path, ...) |

Future Thunderbird integration: `POST /compose` would generate a draft from the transcript. The API foundation for it is already in place.
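
The three states reported by `/status` and the behaviour of `/toggle` suggest a small state machine. The sketch below is one possible shape for it, not the actual `api/state.py`. The class and method names are hypothetical, and ignoring a toggle while processing is an assumption the design does not spell out.

```python
import threading
from enum import Enum


class Status(str, Enum):
    IDLE = "idle"
    RECORDING = "recording"
    PROCESSING = "processing"


class AppState:
    """Holds the current status; safe to call from tray, signal and HTTP code."""

    def __init__(self) -> None:
        self.status = Status.IDLE
        self._lock = threading.Lock()

    def toggle(self) -> Status:
        """idle -> recording -> processing; a toggle while processing is ignored."""
        with self._lock:
            if self.status is Status.IDLE:
                self.status = Status.RECORDING
            elif self.status is Status.RECORDING:
                self.status = Status.PROCESSING
            return self.status

    def finish_processing(self) -> Status:
        """Called once the transcript has been written; returns to idle."""
        with self._lock:
            if self.status is Status.PROCESSING:
                self.status = Status.IDLE
            return self.status


state = AppState()
print(state.toggle().value)  # first toggle starts a recording
```

Because `Status` subclasses `str`, its values serialize directly as the strings listed in the `/status` row.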
## UI

A browser window opens at startup (`http://localhost:8765`) and stays open. It uses dark mode with the tüit CI colors and the Overpass font.

- **Top:** Large record toggle button (red when active, grey when idle) plus status display
- **Middle:** Instruction text field. It is persistent and included as LLM context on every processing run. Examples: "highlight the key points", "create a ticket for this", "draft an offer"
- **Bottom:** Live transcript preview during processing; list of recent transcripts (clickable → opens file)
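
One way the instruction field could be combined with the raw transcript before it goes to Ollama is sketched below. The request fields (`model`, `system`, `prompt`, `stream`) and the `response` key belong to Ollama's `/api/generate` endpoint. The system prompt wording, the prompt layout and the helper names are assumptions, and the sketch uses `urllib` from the standard library rather than the planned `httpx` client so that it has no third-party dependencies.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if the server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical system prompt, not taken from the design.
SYSTEM_PROMPT = (
    "You turn raw speech transcripts into clean Markdown notes. "
    "Follow the user's instructions if any are given."
)


def build_payload(raw_text: str, instructions: str, model: str = "gemma3:12b") -> dict:
    """Combine the persistent UI instructions with the Whisper output."""
    wanted = instructions.strip() or "(none)"
    prompt = f"Instructions:\n{wanted}\n\nTranscript:\n{raw_text.strip()}"
    return {"model": model, "system": SYSTEM_PROMPT, "prompt": prompt, "stream": False}


def refine(raw_text: str, instructions: str, timeout: float = 300.0) -> str:
    """Send one non-streaming request and return the generated text."""
    body = json.dumps(build_payload(raw_text, instructions)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


payload = build_payload("so the customer wants a quote", "draft an offer")
print(payload["prompt"])
```

Keeping the instructions in the prompt and the fixed behaviour in `system` means the user's text field can change on every run without touching the base prompt.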
## Trigger Mechanism

- **Tray icon click:** toggles recording
- **SIGUSR1:** toggles recording (Wayland-compatible hotkey workaround)
  - PID written to `~/.local/run/tueit-transcriber.pid`
  - KDE custom shortcut: `pkill -USR1 -f main.py`
  - Works on any desktop environment, independent of Wayland
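
The SIGUSR1 path could look roughly like the following. It is a sketch using only the `signal` and `os` modules. `write_pid_file` and `install_toggle_handler` are hypothetical names, and the callback stands in for whatever `main.py` uses to toggle recording.

```python
import os
import signal
from pathlib import Path

# PID file location named in the list above.
PID_FILE = Path.home() / ".local" / "run" / "tueit-transcriber.pid"


def write_pid_file(path: Path = PID_FILE) -> None:
    """Store this process's PID so an external shortcut can signal it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{os.getpid()}\n", encoding="utf-8")


def install_toggle_handler(on_toggle) -> None:
    """Call on_toggle() every time the process receives SIGUSR1."""

    def handler(signum, frame):
        on_toggle()

    # Must be called from the main thread; Python only allows
    # signal handlers to be installed there.
    signal.signal(signal.SIGUSR1, handler)


if __name__ == "__main__":
    write_pid_file()
    install_toggle_handler(lambda: print("toggle requested"))
    signal.pause()  # sleep until a signal arrives
```

With the PID file in place, a shortcut could also run `kill -USR1 "$(cat ~/.local/run/tueit-transcriber.pid)"`. That targets exactly one process, whereas `pkill -f main.py` matches every command line that contains `main.py`.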
## Ollama Model

The RX 6800 XT has 16 GB of GDDR6 VRAM. ROCm has supported RDNA2 since ROCm 5.x.

| Component | Model | VRAM |
|-----------|-------|------|
| LLM | `gemma3:12b` (default) | ~8–9 GB |
| Whisper | `large-v3` (ROCm) | ~3 GB |
| Fallback Whisper | `medium` | ~1.5 GB |

With the defaults, the LLM and Whisper `large-v3` together need roughly 11 to 12 GB, which fits within the 16 GB of VRAM. Both models are configurable via `config.toml` and the settings UI.
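
A possible way to load the model choices from TOML while falling back to the defaults in the table is sketched below. The section and key names (`[llm]`, `[whisper]`, `model`, `fallback_model`) are hypothetical, since the design does not define the config schema.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # older interpreters need the tomli backport
    import tomli as tomllib

# Defaults taken from the model table; the key layout is an assumption.
DEFAULTS = {
    "llm": {"model": "gemma3:12b"},
    "whisper": {"model": "large-v3", "fallback_model": "medium"},
}


def load_config(text: str) -> dict:
    """Parse TOML text and overlay it on a copy of the defaults."""
    merged = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in tomllib.loads(text).items():
        if isinstance(values, dict):
            merged.setdefault(section, {}).update(values)
        else:
            merged[section] = values
    return merged


example = """
[llm]
model = "gemma3:4b"
"""
config = load_config(example)
print(config["llm"]["model"], config["whisper"]["model"])
```

`tomllib` only reads TOML, so the `PUT /config` endpoint would need a separate way to write the file back.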
## Output Format

Markdown file with YAML frontmatter:

```markdown
---
date: 2026-04-01T14:32:00
tags: [transkript]
---

# <LLM-generated title>

<LLM-refined transcript content>
```

File naming: `YYYY-MM-DD-HHmm-<slugified-title>.md`
Output path: `~/cloud.shron.de/Hetzner Storagebox/work/`
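
The naming scheme and frontmatter above can be produced with a few lines of standard-library Python. This is a sketch under stated assumptions: the slug rules (ASCII-only, lowercase, hyphen-separated) are not defined in the design, and `slugify`, `build_filename` and `render_markdown` are hypothetical names rather than the planned `output.py` API.

```python
import re
import unicodedata
from datetime import datetime


def slugify(title: str) -> str:
    """Lowercase ASCII slug: umlauts lose their diacritics, other runs become '-'."""
    normalized = unicodedata.normalize("NFKD", title)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")
    return slug or "untitled"


def build_filename(title: str, when: datetime) -> str:
    """YYYY-MM-DD-HHmm-<slugified-title>.md"""
    return f"{when:%Y-%m-%d-%H%M}-{slugify(title)}.md"


def render_markdown(title: str, body: str, when: datetime) -> str:
    """YAML frontmatter, then the title as an H1, then the refined transcript."""
    return (
        "---\n"
        f"date: {when.isoformat(timespec='seconds')}\n"
        "tags: [transkript]\n"
        "---\n\n"
        f"# {title}\n\n"
        f"{body.strip()}\n"
    )


when = datetime(2026, 4, 1, 14, 32)
print(build_filename("Kick-off Meeting: Q2 Plan", when))
# → 2026-04-01-1432-kick-off-meeting-q2-plan.md
```

Stripping diacritics keeps file names portable across the sync client and other tools. A transliteration such as "ü" to "ue" would be an equally valid choice.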
## Dependencies

- `faster-whisper`: Whisper inference
- `sounddevice`: audio capture
- `httpx`: Ollama API client
- `fastapi` + `uvicorn`: local HTTP/WebSocket server
- `pystray` + `Pillow`: system tray icon
- `tomllib` (standard library from Python 3.11) or `tomli` (older Python versions): TOML config parsing
- Ollama (system, with ROCm)
- ROCm (system, via pacman: `rocm-hip-sdk`)
## Future Extensions

- Thunderbird integration via `POST /compose`
- Zammad ticket creation via `POST /ticket`
- Template system (e.g. "offer", "reminder", "meeting notes")