docs(dross): Phase 2 (voice) design spec

Local faster-whisper on CT 102, record→transcribe→review-send, and a
durable owner-only clip-retention store (transcript in void-db, audio on
a backed-up ZFS dataset — not void-app's ephemeral tier). Encryption-at-
rest noted as future.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
root
2026-06-10 00:45:59 +10:00
parent 2dc9d612de
commit fc1e93a58f

View File

@@ -0,0 +1,64 @@
# Floating Dross Chat — Phase 2 (Voice) Design
**Date:** 2026-06-10
**Status:** Draft (awaiting sign-off)
**Builds on:** `2026-06-09-floating-dross-chat-design.md` (Phase 1 shipped in v2.11.0)
**Goal:** Let the user record a voice clip in the Dross bubble, transcribe it locally, and drop the transcript into the input to review-and-send. Optionally retain each clip paired with its transcript, stored durably and owner-only.
---
## Locked decisions (from the Phase-1 brainstorm + 2026-06-10 follow-up)
1. **STT = local faster-whisper on CT 102** (the Ollama box, RTX A2000). GPU with CPU fallback (per the GPU/CPU-fallback HA rule). English model (`small.en`) for speed/accuracy. OpenAI-compatible HTTP API.
2. **Flow = review-and-send first** (`voiceMode: 'review'`). Record → transcribe → transcript lands in the bubble input → user edits/sends. `handsfree` (auto-send) and `action` (interpret) are later (the setting already exists; only `review` is wired now).
3. **Retention = "Keep voice clips" Dross setting**, default OFF. When ON, each clip is saved paired with its transcript. **Storage:** transcript + metadata in **void-db** (`voice_clips` table — in the Core-4 offsite backup + HA-replicated); audio files on a **dedicated owner-only ZFS dataset** (`localzfs/void-voiceclips`, bind-mounted into void-app at `/var/lib/void/voice-clips`, 0700), **added to the offsite backup + syncoid replication**. NOT on void-app's ephemeral rootfs (it's the rebuildable tier, excluded from backups). Encryption-at-rest is a documented **future** toggle (ZFS native encryption, key in Vaultwarden).
## Non-goals (this phase)
- `handsfree` / `action` voice modes (designed-for; only `review` wired).
- Encryption-at-rest of clips (future).
- Wake-word / always-listening.
---
## Architecture
### Components
| Unit | Responsibility |
|---|---|
| **faster-whisper service** (CT 102, infra) | OpenAI-compatible `/v1/audio/transcriptions` (e.g. `faster-whisper-server`/`speaches`), `small.en`, GPU+CPU fallback, systemd unit, bound to `192.168.1.185:<port>` (LAN-only). |
| `lib/voice/whisper.js` (void-app) | Thin client: POST the audio buffer to the CT-102 service, return `{ text }`. Timeout + error surface. |
| `lib/api/routes/voice.js` (void-app) | `POST /api/voice/transcribe` (owner-only, multipart, ≤25 MB / ≤60 s): transcribe; if `dross.keepClips` is on, persist (Task: retention). Returns `{ text, clip_id? }`. |
| `lib/db/repos/voice_clips.js` + migration | `voice_clips` table (id, transcript, duration_ms, bytes, mime, path, created_at). |
| `public/components/dross_bubble.js` (edit) | Enable the mic: `MediaRecorder` capture (tap start/stop), recording UI (timer/waveform), upload, transcript → input (review-and-send). |
| Settings → Dross (edit) | Add the **"Keep voice clips"** toggle; a small **clips list** (play / delete) when retention is on. |
| Infra | New ZFS dataset + bind-mount; add the dataset to the offsite-backup script + syncoid job. |
### Data flow
**Transcribe:** mic tap → `MediaRecorder` (`audio/webm;codecs=opus`) → on stop, blob → `POST /api/voice/transcribe` (multipart) → `whisper.js` → CT-102 faster-whisper → `{text}` → bubble drops text into the input (review-and-send). Errors never block typing.
**Retention (when `dross.keepClips`):** the transcribe route, after a successful transcript, writes the audio to `/var/lib/void/voice-clips/<uuid>.webm` (0600) and inserts a `voice_clips` row (transcript + metadata + path). `GET /api/voice/clips` lists; `GET /api/voice/clips/:id/audio` streams; `DELETE` removes row + file. Owner-only throughout.
## Error handling
- **Whisper down / GPU absent:** `/transcribe` returns a clear 503; bubble shows "couldn't transcribe — type instead", keeps the typed text. faster-whisper falls back to CPU on a GPU-less node (slower).
- **Mic permission denied / unsupported:** hide recording UI, one-line hint, typing still works.
- **Clip too large/long:** reject at the route (413) with a friendly message.
- **CT 102 disk pressure** (currently 89% full / 6.4 GB free): install lean (CTranslate2, no torch); **may expand the CT disk first**. Flagged as a build risk.
## Testing
- **Unit:** `voice.js` route with a mocked whisper client (returns `{text}`); retention path writes a row + file (temp dir) and lists/deletes; size/duration guard returns 413.
- **Live smoke:** record a short WAV via the CT-300 test harness → `/api/voice/transcribe` → non-empty text from the real CT-102 service.
- **Headless:** mic button enabled; recording UI toggles; (MediaRecorder needs a fake audio device in Chromium — use `--use-fake-device-for-media-stream`).
## Build phases
- **P2a — Transcription path.** faster-whisper on CT 102 + `whisper.js` + `/api/voice/transcribe` (no retention) + enable the mic + record→review-send. Ship-able.
- **P2b — Retention.** ZFS dataset + bind-mount + backup/replication wiring; `voice_clips` table + repo; save on transcribe when `keepClips`; clips list/play/delete UI; the "Keep voice clips" toggle.
## Documentation
Wiki + Gitea per the standing rule; update `project_cradle_chat_floating` memory. Encryption-at-rest recorded as a future toggle.