Floating Dross Chat — Phase 2 (Voice) Design
Date: 2026-06-10
Status: SHIPPED — P2a v2.12.0 (transcribe+mic), P2b v2.13.0 (retention), 2026-06-10. Known gap: clips HA-replicated Z↔Z3 but not yet in the offsite Farm backup. Future: whisper-model selector, configurable storage, encryption-at-rest, LAN-IP mic (https-on-LAN).
Builds on: 2026-06-09-floating-dross-chat-design.md (Phase 1 shipped in v2.11.0)
Goal: Let the user record a voice clip in the Dross bubble, transcribe it locally, and drop the transcript into the input to review-and-send. Optionally retain each clip paired with its transcript, stored durably and owner-only.
Locked decisions (from the Phase-1 brainstorm + 2026-06-10 follow-up)
- STT = local faster-whisper on CT 102 (the Ollama box, RTX A2000). GPU with CPU fallback (per the GPU/CPU-fallback HA rule). English model (small.en) for speed/accuracy. OpenAI-compatible HTTP API.
- Flow = review-and-send first (voiceMode: 'review'). Record → transcribe → transcript lands in the bubble input → user edits/sends. handsfree (auto-send) and action (interpret) are later (the setting already exists; only review is wired now).
- Retention = "Keep voice clips" Dross setting, default OFF. When ON, each clip is saved paired with its transcript. Storage: transcript + metadata in void-db (voice_clips table, in the Core-4 offsite backup + HA-replicated); audio files on a dedicated owner-only ZFS dataset (localzfs/void-voiceclips, bind-mounted into void-app at /var/lib/void/voice-clips, 0700), added to the offsite backup + syncoid replication. NOT on void-app's ephemeral rootfs (it's the rebuildable tier, excluded from backups). Encryption-at-rest is a documented future toggle (ZFS native encryption, key in Vaultwarden).
Non-goals (this phase)
- handsfree / action voice modes (designed-for; only review wired).
- Encryption-at-rest of clips (future).
- Wake-word / always-listening.
Architecture
Components
| Unit | Responsibility |
|---|---|
| faster-whisper service (CT 102, infra) | OpenAI-compatible /v1/audio/transcriptions (e.g. faster-whisper-server/speaches), small.en, GPU+CPU fallback, systemd unit, bound to 192.168.1.185:<port> (LAN-only). |
| lib/voice/whisper.js (void-app) | Thin client: POST the audio buffer to the CT-102 service, return { text }. Timeout + error surface. |
| lib/api/routes/voice.js (void-app) | POST /api/voice/transcribe (owner-only, multipart, ≤25 MB / ≤60 s): transcribe; if dross.keepClips is on, persist (Task: retention). Returns { text, clip_id? }. |
| lib/db/repos/voice_clips.js + migration | voice_clips table (id, transcript, duration_ms, bytes, mime, path, created_at). |
| public/components/dross_bubble.js (edit) | Enable the mic: MediaRecorder capture (tap start/stop), recording UI (timer/waveform), upload, transcript → input (review-and-send). |
| Settings → Dross (edit) | Add the "Keep voice clips" toggle; a small clips list (play / delete) when retention is on. |
| Infra | New ZFS dataset + bind-mount; add the dataset to the offsite-backup script + syncoid job. |
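As an illustration of the thin client row above, here is a minimal sketch of what lib/voice/whisper.js could look like. It is not the shipped code. It assumes Node 18+ (global fetch, FormData, Blob, AbortSignal.timeout) and that the service accepts the usual OpenAI-style multipart fields `file` and `model`. The `WHISPER_URL` environment variable, the 30 s default timeout, and both function names are hypothetical.

```javascript
// Sketch only: WHISPER_URL, the timeout default and the function names are assumptions.
const DEFAULT_MODEL = 'small.en';

// Build the multipart body expected by an OpenAI-compatible
// /v1/audio/transcriptions endpoint: a `file` part and a `model` field.
function buildTranscriptionForm(audioBuffer, { mime = 'audio/webm', model = DEFAULT_MODEL } = {}) {
  const form = new FormData();
  form.append('file', new Blob([audioBuffer], { type: mime }), 'clip.webm');
  form.append('model', model);
  return form;
}

// POST the clip and return { text }. Throws on timeout, network failure,
// or a non-2xx response so the route can surface a clear error.
async function transcribe(audioBuffer, options = {}) {
  const { baseUrl = process.env.WHISPER_URL, timeoutMs = 30000, ...formOptions } = options;
  if (!baseUrl) throw new Error('whisper base URL is not configured');

  const res = await fetch(new URL('/v1/audio/transcriptions', baseUrl), {
    method: 'POST',
    body: buildTranscriptionForm(audioBuffer, formOptions),
    signal: AbortSignal.timeout(timeoutMs), // abort a hung request
  });
  if (!res.ok) throw new Error(`whisper service returned HTTP ${res.status}`);

  const { text } = await res.json();
  return { text };
}
```

Keeping the form builder separate from the network call makes the request shape easy to unit-test without a running service.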
Data flow
Transcribe: mic tap → MediaRecorder (audio/webm;codecs=opus) → on stop, blob → POST /api/voice/transcribe (multipart) → whisper.js → CT-102 faster-whisper → {text} → bubble drops text into the input (review-and-send). Errors never block typing.
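The browser half of this flow can be sketched roughly as follows. This is an assumption-laden outline rather than the actual dross_bubble.js code: the multipart field name `audio`, the fallback to plain `audio/webm`, and the function names are all hypothetical. Only standard MediaRecorder and getUserMedia calls are used.

```javascript
// Sketch only: field name 'audio' and helper names are assumptions.
const PREFERRED_TYPES = ['audio/webm;codecs=opus', 'audio/webm'];

// Pick the first type the browser can record, or '' to use its default.
function pickMimeType(isSupported) {
  return PREFERRED_TYPES.find((type) => isSupported(type)) || '';
}

// First tap: start capturing. Resolves to a stop() function for the second
// tap, which uploads the clip and resolves to the transcript text.
async function startRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mimeType = pickMimeType((type) => MediaRecorder.isTypeSupported(type));
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
  const chunks = [];
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) chunks.push(event.data);
  };
  recorder.start();

  return () => new Promise((resolve, reject) => {
    recorder.onstop = async () => {
      stream.getTracks().forEach((track) => track.stop()); // release the mic
      try {
        const body = new FormData();
        body.append('audio', new Blob(chunks, { type: recorder.mimeType }), 'clip.webm');
        const res = await fetch('/api/voice/transcribe', { method: 'POST', body });
        if (!res.ok) throw new Error(`transcribe failed: HTTP ${res.status}`);
        resolve((await res.json()).text);
      } catch (err) {
        reject(err); // caller shows a hint; the text input is never touched
      }
    };
    recorder.stop();
  });
}
```

On success the caller would place the returned text into the bubble input for review; on rejection it only shows a hint, so a failed upload cannot block typing.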
Retention (when dross.keepClips): the transcribe route, after a successful transcript, writes the audio to /var/lib/void/voice-clips/<uuid>.webm (0600) and inserts a voice_clips row (transcript + metadata + path). GET /api/voice/clips lists; GET /api/voice/clips/:id/audio streams; DELETE removes row + file. Owner-only throughout.
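A rough sketch of the retention step, assuming a repo object with `insert` and `remove` methods (a hypothetical stand-in for lib/db/repos/voice_clips.js) and synchronous file I/O for brevity. The row fields follow the voice_clips column list; everything else is illustrative.

```javascript
const { randomUUID } = require('node:crypto');
const fs = require('node:fs');
const path = require('node:path');

// Sketch only: `repo.insert` / `repo.remove` are hypothetical repo methods.
function saveClip({ dir, audio, transcript, durationMs, mime = 'audio/webm' }, repo) {
  const id = randomUUID();
  const file = path.join(dir, `${id}.webm`);

  // 'wx' refuses to overwrite an existing file; 0o600 keeps it owner-only.
  fs.writeFileSync(file, audio, { mode: 0o600, flag: 'wx' });

  const row = {
    id,
    transcript,
    duration_ms: durationMs,
    bytes: audio.length,
    mime,
    path: file,
    created_at: new Date().toISOString(),
  };
  try {
    repo.insert(row);
  } catch (err) {
    fs.rmSync(file, { force: true }); // do not leave an orphaned audio file
    throw err;
  }
  return row;
}

// DELETE removes the row first, then the file it pointed at.
function deleteClip(row, repo) {
  repo.remove(row.id);
  fs.rmSync(row.path, { force: true });
}
```

Writing the file before the row means a failed insert can be cleaned up locally, at the cost of a brief window where the file exists without a row.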
Error handling
- Whisper down / GPU absent: /transcribe returns a clear 503; bubble shows "couldn't transcribe — type instead", keeps the typed text. faster-whisper falls back to CPU on a GPU-less node (slower).
- Mic permission denied / unsupported: hide recording UI, one-line hint, typing still works.
- Clip too large/long: reject at the route (413) with a friendly message.
- CT 102 disk pressure (currently 89% full / 6.4 GB free): install lean (CTranslate2, no torch); may expand the CT disk first. Flagged as a build risk.
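The status codes in the list above could be produced by two small helpers like these. This is a sketch under stated assumptions: the limits are taken from the route spec (≤25 MB, ≤60 s), "MB" is read as 1024 × 1024 bytes, and the function names, message wording, and error body shape are hypothetical.

```javascript
// Sketch only: names, messages and the MB interpretation are assumptions.
const MAX_BYTES = 25 * 1024 * 1024; // 25 MB
const MAX_DURATION_MS = 60 * 1000;  // 60 s

// Guard run before the clip is sent to the whisper service.
function checkClipLimits({ bytes, durationMs }) {
  if (bytes > MAX_BYTES) {
    return { ok: false, status: 413, message: 'Clip is too large (max 25 MB).' };
  }
  if (durationMs > MAX_DURATION_MS) {
    return { ok: false, status: 413, message: 'Clip is too long (max 60 s).' };
  }
  return { ok: true };
}

// Any failure talking to the whisper service maps to a 503, so the bubble
// can show its hint and leave the typed text alone.
function whisperFailureResponse(err) {
  return {
    status: 503,
    body: { error: 'transcription_unavailable', detail: String(err && err.message ? err.message : err) },
  };
}
```

Rejecting oversized clips at the route keeps large uploads from ever reaching CT 102, which matters given the disk pressure noted above.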
Testing
- Unit: voice.js route with a mocked whisper client (returns {text}); retention path writes a row + file (temp dir) and lists/deletes; size/duration guard returns 413.
- Live smoke: record a short WAV via the CT-300 test harness → /api/voice/transcribe → non-empty text from the real CT-102 service.
- Headless: mic button enabled; recording UI toggles (MediaRecorder needs a fake audio device in Chromium; use --use-fake-device-for-media-stream).
Build phases
- P2a: Transcription path. faster-whisper on CT 102 + whisper.js + /api/voice/transcribe (no retention) + enable the mic + record→review-send. Ship-able.
- P2b: Retention. ZFS dataset + bind-mount + backup/replication wiring; voice_clips table + repo; save on transcribe when keepClips; clips list/play/delete UI; the "Keep voice clips" toggle.
Documentation
Wiki + Gitea per the standing rule; update project_cradle_chat_floating memory. Encryption-at-rest recorded as a future toggle.