Void-Homelab/docs/superpowers/specs/2026-06-10-dross-voice-phase2-design.md
root fc1e93a58f docs(dross): Phase 2 (voice) design spec
Local faster-whisper on CT 102, record→transcribe→review-send, and a
durable owner-only clip-retention store (transcript in void-db, audio on
a backed-up ZFS dataset — not void-app's ephemeral tier). Encryption-at-
rest noted as future.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 00:45:59 +10:00


Floating Dross Chat — Phase 2 (Voice) Design

Date: 2026-06-10
Status: Draft (awaiting sign-off)
Builds on: 2026-06-09-floating-dross-chat-design.md (Phase 1 shipped in v2.11.0)
Goal: Let the user record a voice clip in the Dross bubble, transcribe it locally, and drop the transcript into the input to review-and-send. Optionally retain each clip paired with its transcript, stored durably and owner-only.


Locked decisions (from the Phase-1 brainstorm + 2026-06-10 follow-up)

  1. STT = local faster-whisper on CT 102 (the Ollama box, RTX A2000). GPU with CPU fallback (per the GPU/CPU-fallback HA rule). English model (small.en) for speed/accuracy. OpenAI-compatible HTTP API.
  2. Flow = review-and-send first (voiceMode: 'review'). Record → transcribe → transcript lands in the bubble input → user edits/sends. The handsfree (auto-send) and action (interpret) modes come later: the voiceMode setting already exists, but only review is wired in this phase.
  3. Retention = "Keep voice clips" Dross setting, default OFF. When ON, each clip is saved paired with its transcript. Storage: transcript + metadata in void-db (voice_clips table — in the Core-4 offsite backup + HA-replicated); audio files on a dedicated owner-only ZFS dataset (localzfs/void-voiceclips, bind-mounted into void-app at /var/lib/void/voice-clips, 0700), added to the offsite backup + syncoid replication. NOT on void-app's ephemeral rootfs (it's the rebuildable tier, excluded from backups). Encryption-at-rest is a documented future toggle (ZFS native encryption, key in Vaultwarden).

Non-goals (this phase)

  • handsfree / action voice modes (designed-for; only review wired).
  • Encryption-at-rest of clips (future).
  • Wake-word / always-listening.

Architecture

Components

| Unit | Responsibility |
| --- | --- |
| faster-whisper service (CT 102, infra) | OpenAI-compatible /v1/audio/transcriptions (e.g. faster-whisper-server/speaches), small.en, GPU+CPU fallback, systemd unit, bound to 192.168.1.185:<port> (LAN-only). |
| lib/voice/whisper.js (void-app) | Thin client: POST the audio buffer to the CT-102 service, return { text }. Timeout + error surface. |
| lib/api/routes/voice.js (void-app) | POST /api/voice/transcribe (owner-only, multipart, ≤25 MB / ≤60 s): transcribe; if dross.keepClips is on, persist (Task: retention). Returns { text, clip_id? }. |
| lib/db/repos/voice_clips.js + migration | voice_clips table (id, transcript, duration_ms, bytes, mime, path, created_at). |
| public/components/dross_bubble.js (edit) | Enable the mic: MediaRecorder capture (tap start/stop), recording UI (timer/waveform), upload, transcript → input (review-and-send). |
| Settings → Dross (edit) | Add the "Keep voice clips" toggle; a small clips list (play / delete) when retention is on. |
| Infra | New ZFS dataset + bind-mount; add the dataset to the offsite-backup script + syncoid job. |
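
As an illustration of the thin client described for lib/voice/whisper.js, here is a minimal sketch, not the final implementation. It assumes Node 18+ (global fetch, FormData, Blob) and uses the multipart `file` and `model` fields of the OpenAI transcription API. The names WHISPER_URL, WHISPER_MODEL, WhisperError, the 30 s timeout, and the injectable fetchImpl are all assumptions made for the example; the service port is left unspecified, as it is in this spec.

```javascript
// Sketch of lib/voice/whisper.js. Requires Node 18+ for global fetch, FormData, Blob.
// WHISPER_URL and WHISPER_MODEL are hypothetical environment variables.

class WhisperError extends Error {
  constructor(message, cause) {
    super(message);
    this.name = 'WhisperError';
    this.cause = cause;
  }
}

// POST one audio buffer to the OpenAI-compatible endpoint and return { text }.
async function transcribe(audio, opts = {}) {
  const {
    baseUrl = process.env.WHISPER_URL,  // e.g. http://192.168.1.185:<port>
    model = process.env.WHISPER_MODEL,  // model id as the service names it (assumption)
    mime = 'audio/webm',
    timeoutMs = 30000,                  // assumed budget; the spec only says "timeout"
    fetchImpl = fetch,                  // injectable so unit tests can mock the service
  } = opts;
  if (!baseUrl) throw new WhisperError('whisper base URL not configured');

  const form = new FormData();
  form.append('file', new Blob([audio], { type: mime }), 'clip.webm');
  if (model) form.append('model', model);

  let res;
  try {
    res = await fetchImpl(`${baseUrl}/v1/audio/transcriptions`, {
      method: 'POST',
      body: form,
      signal: AbortSignal.timeout(timeoutMs), // aborts the request when the budget is spent
    });
  } catch (err) {
    throw new WhisperError('whisper service unreachable or timed out', err);
  }
  if (!res.ok) throw new WhisperError(`whisper service returned HTTP ${res.status}`);

  const data = await res.json();
  return { text: String(data.text ?? '').trim() };
}
```

Every failure mode surfaces as a single WhisperError, so the route only has to map one error type to a 503.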

Data flow

Transcribe: mic tap → MediaRecorder (audio/webm;codecs=opus) → on stop, blob → POST /api/voice/transcribe (multipart) → whisper.js → CT-102 faster-whisper → {text} → bubble drops text into the input (review-and-send). Errors never block typing.
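
The capture-and-upload half of this flow could look roughly like the browser-side sketch below. The helper names pickMime, recordClip and uploadClip, and the multipart field name `audio`, are assumptions made for illustration and are not fixed by this spec.

```javascript
// Browser-side sketch for public/components/dross_bubble.js.
// pickMime, recordClip and uploadClip are hypothetical helper names.

const CANDIDATE_MIMES = ['audio/webm;codecs=opus', 'audio/webm'];

// Return the first type the browser can record, or '' to use the browser default.
function pickMime(isSupported) {
  return CANDIDATE_MIMES.find((m) => isSupported(m)) || '';
}

// Start recording; resolves to { stop }, where stop() resolves to the recorded Blob.
async function recordClip() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mimeType = pickMime((m) => MediaRecorder.isTypeSupported(m));
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
  const chunks = [];
  recorder.addEventListener('dataavailable', (e) => {
    if (e.data.size > 0) chunks.push(e.data);
  });
  recorder.start();
  return {
    stop: () => new Promise((resolve) => {
      recorder.addEventListener('stop', () => {
        stream.getTracks().forEach((t) => t.stop()); // release the microphone
        resolve(new Blob(chunks, { type: recorder.mimeType }));
      }, { once: true });
      recorder.stop();
    }),
  };
}

// Upload the clip and return the transcript text; throws on any non-2xx response.
async function uploadClip(blob, fetchImpl = fetch) {
  const form = new FormData();
  form.append('audio', blob, 'clip.webm'); // field name 'audio' is an assumption
  const res = await fetchImpl('/api/voice/transcribe', { method: 'POST', body: form });
  if (!res.ok) throw new Error(`transcribe failed: HTTP ${res.status}`);
  const { text } = await res.json();
  return text;
}
```

A caller would wrap uploadClip in try/catch and, on failure, show the one-line hint while leaving the input field untouched, so typing is never blocked.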

Retention (when dross.keepClips): the transcribe route, after a successful transcript, writes the audio to /var/lib/void/voice-clips/<uuid>.webm (0600) and inserts a voice_clips row (transcript + metadata + path). GET /api/voice/clips lists; GET /api/voice/clips/:id/audio streams; DELETE removes row + file. Owner-only throughout.
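
A minimal sketch of the persistence step, assuming CommonJS and Node's built-in fs/promises. The insertRow and deleteRow callbacks stand in for the real repo in lib/db/repos/voice_clips.js, whose actual signatures are not defined here; saveClip and deleteClip are hypothetical names. Writing the file before the row, and removing it again if the insert fails, is one possible ordering and not a decision recorded in this spec.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');
const { randomUUID } = require('node:crypto');

// Persist one clip: audio file first (mode 0600), then the voice_clips row.
// `insertRow` is a hypothetical stand-in for the voice_clips repo.
async function saveClip({ audio, transcript, durationMs, mime }, { clipsDir, insertRow }) {
  const id = randomUUID();
  const filePath = path.join(clipsDir, `${id}.webm`);
  // 'wx' makes the write fail instead of overwriting an existing file.
  await fs.writeFile(filePath, audio, { mode: 0o600, flag: 'wx' });

  const row = {
    id,
    transcript,
    duration_ms: durationMs,
    bytes: audio.length,
    mime,
    path: filePath,
    created_at: new Date().toISOString(),
  };
  try {
    await insertRow(row);
  } catch (err) {
    await fs.rm(filePath, { force: true }); // do not leave an orphaned audio file
    throw err;
  }
  return row;
}

// Remove the row, then the file; a file that is already missing is not an error.
async function deleteClip(row, { deleteRow }) {
  await deleteRow(row.id);
  await fs.rm(row.path, { force: true });
}
```

In production clipsDir would be /var/lib/void/voice-clips; the unit test can point it at a temp directory instead.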

Error handling

  • Whisper down / GPU absent: /transcribe returns a clear 503; bubble shows "couldn't transcribe — type instead", keeps the typed text. faster-whisper falls back to CPU on a GPU-less node (slower).
  • Mic permission denied / unsupported: hide recording UI, one-line hint, typing still works.
  • Clip too large/long: reject at the route (413) with a friendly message.
  • CT 102 disk pressure (currently 89% full / 6.4 GB free): install a lean stack (CTranslate2, no torch); the CT disk may need expanding first. Flagged as a build risk.
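
The 413 and 503 cases above can be sketched as a framework-agnostic handler core. This is an illustration under assumptions: checkClipLimits and handleTranscribe are hypothetical names, "25 MB" is read as 25 MiB, and the error strings are placeholders, not agreed copy.

```javascript
const MAX_BYTES = 25 * 1024 * 1024; // "≤25 MB" read as 25 MiB (assumption)
const MAX_DURATION_MS = 60 * 1000;  // "≤60 s"

// Returns null when the clip is acceptable, otherwise a { status, error } rejection.
function checkClipLimits({ bytes, durationMs }) {
  if (bytes > MAX_BYTES) return { status: 413, error: 'Clip is too large (25 MB max).' };
  if (durationMs > MAX_DURATION_MS) return { status: 413, error: 'Clip is too long (60 s max).' };
  return null;
}

// Core of POST /api/voice/transcribe, independent of the web framework.
// `transcribe` is the injected whisper client; any failure from it becomes a 503.
async function handleTranscribe(clip, { transcribe }) {
  const rejected = checkClipLimits(clip);
  if (rejected) return { status: rejected.status, body: { error: rejected.error } };
  try {
    const { text } = await transcribe(clip.audio);
    return { status: 200, body: { text } };
  } catch {
    return { status: 503, body: { error: 'Transcription service unavailable.' } };
  }
}
```

Because the limits are checked before the client is called, an oversized clip never reaches CT 102.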

Testing

  • Unit: voice.js route with a mocked whisper client (returns {text}); retention path writes a row + file (temp dir) and lists/deletes; size/duration guard returns 413.
  • Live smoke: record a short WAV via the CT-300 test harness → /api/voice/transcribe → non-empty text from the real CT-102 service.
  • Headless: mic button enabled; recording UI toggles. MediaRecorder needs a fake audio device in Chromium; use --use-fake-device-for-media-stream.
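
For the headless check, the Chromium switches can be collected in one small helper, sketched below. fakeMicArgs is a hypothetical name. The two extra switches are real Chromium flags but go beyond what this spec calls for: one auto-accepts the permission prompt, the other feeds a WAV file as microphone input.

```javascript
// Chromium switches for exercising MediaRecorder without a real microphone.
function fakeMicArgs(wavPath) {
  const args = [
    '--use-fake-device-for-media-stream', // synthetic capture devices
    '--use-fake-ui-for-media-stream',     // auto-accept the mic permission prompt
  ];
  // Optionally play a WAV file as the fake microphone input.
  if (wavPath) args.push(`--use-file-for-fake-audio-capture=${wavPath}`);
  return args;
}
```

The resulting array would be passed as the `args` launch option of whichever browser driver the headless suite uses.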

Build phases

  • P2a — Transcription path. faster-whisper on CT 102 + whisper.js + /api/voice/transcribe (no retention) + enable the mic + record→review-send. Ship-able.
  • P2b — Retention. ZFS dataset + bind-mount + backup/replication wiring; voice_clips table + repo; save on transcribe when keepClips; clips list/play/delete UI; the "Keep voice clips" toggle.

Documentation

Wiki + Gitea per the standing rule; update project_cradle_chat_floating memory. Encryption-at-rest recorded as a future toggle.