root fc1e93a58f docs(dross): Phase 2 (voice) design spec

Local faster-whisper on CT 102, record→transcribe→review-send, and a
durable owner-only clip-retention store (transcript in void-db, audio on
a backed-up ZFS dataset — not void-app's ephemeral tier). Encryption-at-
rest noted as future.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-10 00:45:59 +10:00

Floating Dross Chat — Phase 2 (Voice) Design

Date: 2026-06-10
Status: Draft (awaiting sign-off)
Builds on: 2026-06-09-floating-dross-chat-design.md (Phase 1 shipped in v2.11.0)
Goal: Let the user record a voice clip in the Dross bubble, transcribe it locally, and drop the transcript into the input to review-and-send. Optionally retain each clip paired with its transcript, stored durably and owner-only.

Locked decisions (from the Phase-1 brainstorm + 2026-06-10 follow-up)

STT = local faster-whisper on CT 102 (the Ollama box, RTX A2000). GPU with CPU fallback (per the GPU/CPU-fallback HA rule). English model (small.en) for speed/accuracy. OpenAI-compatible HTTP API.
Flow = review-and-send first (voiceMode: 'review'). Record → transcribe → transcript lands in the bubble input → user edits/sends. The handsfree (auto-send) and action (interpret) modes are deferred to a later phase: the setting already exists, but only review is wired now.
Retention = "Keep voice clips" Dross setting, default OFF. When ON, each clip is saved paired with its transcript. Storage: transcript + metadata in void-db (voice_clips table — in the Core-4 offsite backup + HA-replicated); audio files on a dedicated owner-only ZFS dataset (localzfs/void-voiceclips, bind-mounted into void-app at /var/lib/void/voice-clips, 0700), added to the offsite backup + syncoid replication. NOT on void-app's ephemeral rootfs (it's the rebuildable tier, excluded from backups). Encryption-at-rest is a documented future toggle (ZFS native encryption, key in Vaultwarden).

Non-goals (this phase)

handsfree / action voice modes (designed-for; only review wired).
Encryption-at-rest of clips (future).
Wake-word / always-listening.

Architecture

Components

Unit	Responsibility
faster-whisper service (CT 102, infra)	OpenAI-compatible `/v1/audio/transcriptions` (e.g. `faster-whisper-server`/`speaches`), `small.en`, GPU+CPU fallback, systemd unit, bound to `192.168.1.185:<port>` (LAN-only).
`lib/voice/whisper.js` (void-app)	Thin client: POST the audio buffer to the CT-102 service, return `{ text }`. Timeout + error surface.
`lib/api/routes/voice.js` (void-app)	`POST /api/voice/transcribe` (owner-only, multipart, ≤25 MB / ≤60 s): transcribe; if `dross.keepClips` is on, persist (Task: retention). Returns `{ text, clip_id? }`.
`lib/db/repos/voice_clips.js` + migration	`voice_clips` table (id, transcript, duration_ms, bytes, mime, path, created_at).
`public/components/dross_bubble.js` (edit)	Enable the mic: `MediaRecorder` capture (tap start/stop), recording UI (timer/waveform), upload, transcript → input (review-and-send).
Settings → Dross (edit)	Add the "Keep voice clips" toggle; a small clips list (play / delete) when retention is on.
Infra	New ZFS dataset + bind-mount; add the dataset to the offsite-backup script + syncoid job.
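
As a rough illustration of the thin client described for `lib/voice/whisper.js`, the sketch below posts a buffer to the OpenAI-compatible endpoint and returns `{ text }`. It assumes Node 18+ (global `fetch`, `FormData`, `Blob`, `AbortSignal.timeout`). The `WHISPER_URL` environment variable, the `WhisperError` class, the 30 s timeout, and the injectable `fetchImpl` option are all hypothetical, and the exact model identifier the server accepts may differ from `small.en`.

```javascript
// Sketch only: names below (WhisperError, WHISPER_URL, fetchImpl) are assumptions, not part of the spec.
class WhisperError extends Error {
  constructor(message, status) {
    super(message);
    this.name = 'WhisperError';
    this.status = status; // HTTP status the route could surface to the bubble
  }
}

// Build the multipart body an OpenAI-compatible transcription endpoint expects:
// a `file` part with the audio and a `model` field.
function buildForm(audioBuffer, mime, model) {
  const form = new FormData();
  form.append('file', new Blob([audioBuffer], { type: mime }), 'clip.webm');
  form.append('model', model);
  return form;
}

async function transcribe(audioBuffer, opts = {}) {
  const {
    baseUrl = process.env.WHISPER_URL, // hypothetical env var pointing at the CT-102 service
    model = 'small.en',                // assumed identifier; depends on the server in use
    mime = 'audio/webm',
    timeoutMs = 30_000,                // illustrative timeout
    fetchImpl = fetch,                 // injectable so unit tests can mock the service
  } = opts;

  let res;
  try {
    res = await fetchImpl(`${baseUrl}/v1/audio/transcriptions`, {
      method: 'POST',
      body: buildForm(audioBuffer, mime, model),
      signal: AbortSignal.timeout(timeoutMs),
    });
  } catch (err) {
    // Network failure or timeout: report the service as unavailable.
    throw new WhisperError(`whisper unreachable: ${err.message}`, 503);
  }
  if (!res.ok) {
    throw new WhisperError(`whisper returned HTTP ${res.status}`, 503);
  }
  const body = await res.json();
  return { text: String(body.text ?? '').trim() };
}
```

Injecting `fetchImpl` is one way to get the "mocked whisper client" unit test without a network; the route could equally mock the whole module.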

Data flow

Transcribe: mic tap → MediaRecorder (audio/webm;codecs=opus) → on stop, blob → POST /api/voice/transcribe (multipart) → whisper.js → CT-102 faster-whisper → {text} → bubble drops text into the input (review-and-send). Errors never block typing.
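
The capture half of this flow might look roughly like the browser-side sketch below. The helper names (`pickMimeType`, `startRecording`, `transcribeClip`), the fallback MIME types, and the multipart field name `audio` are assumptions for illustration; only the endpoint path and the preferred `audio/webm;codecs=opus` format come from the spec.

```javascript
// Browser-side sketch; all function names and the 'audio' field name are hypothetical.

// Prefer the format named in the spec, then fall back to other common recorder formats.
function pickMimeType(isTypeSupported) {
  const candidates = ['audio/webm;codecs=opus', 'audio/webm', 'audio/mp4'];
  return candidates.find((t) => isTypeSupported(t)) ?? '';
}

// Start capturing from the mic. Resolves to a stop() function that
// ends the recording and resolves with the recorded Blob.
async function startRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mimeType = pickMimeType((t) => MediaRecorder.isTypeSupported(t));
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
  const chunks = [];
  recorder.addEventListener('dataavailable', (e) => {
    if (e.data.size > 0) chunks.push(e.data);
  });
  recorder.start();

  return function stop() {
    return new Promise((resolve) => {
      recorder.addEventListener('stop', () => {
        stream.getTracks().forEach((track) => track.stop()); // release the mic
        resolve(new Blob(chunks, { type: recorder.mimeType }));
      }, { once: true });
      recorder.stop();
    });
  };
}

// Upload the clip. Any failure yields null so the caller can leave the
// input untouched and typing is never blocked.
async function transcribeClip(blob) {
  try {
    const form = new FormData();
    form.append('audio', blob, 'clip.webm');
    const res = await fetch('/api/voice/transcribe', { method: 'POST', body: form });
    if (!res.ok) return null;
    const { text } = await res.json();
    return text || null;
  } catch {
    return null;
  }
}
```

Returning `null` instead of throwing keeps the "errors never block typing" rule in one place: the bubble only writes to the input when it gets a non-empty string back.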

Retention (when dross.keepClips): the transcribe route, after a successful transcript, writes the audio to /var/lib/void/voice-clips/<uuid>.webm (0600) and inserts a voice_clips row (transcript + metadata + path). GET /api/voice/clips lists; GET /api/voice/clips/:id/audio streams; DELETE removes row + file. Owner-only throughout.
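
A minimal sketch of the retention step, assuming the repo exposes insert and delete functions (passed in here as the hypothetical `insertRow` and `deleteRow`). The row fields mirror the `voice_clips` columns listed in the components table; the cleanup-on-failure behaviour is a suggestion, not something the spec mandates.

```javascript
const fs = require('node:fs');
const path = require('node:path');
const { randomUUID } = require('node:crypto');

// Write the audio file (mode 0600), then insert the row.
// `insertRow` stands in for the voice_clips repo; its signature is an assumption.
function persistClip({ dir, audio, transcript, durationMs, mime, insertRow }) {
  const id = randomUUID();
  const filePath = path.join(dir, `${id}.webm`);
  // 'wx' refuses to overwrite an existing file; the mode keeps the clip owner-only.
  fs.writeFileSync(filePath, audio, { mode: 0o600, flag: 'wx' });
  try {
    insertRow({
      id,
      transcript,
      duration_ms: durationMs,
      bytes: audio.length,
      mime,
      path: filePath,
      created_at: new Date().toISOString(),
    });
  } catch (err) {
    fs.rmSync(filePath, { force: true }); // avoid leaving an orphaned audio file
    throw err;
  }
  return id;
}

// Remove the row first, then the file. `deleteRow` is likewise hypothetical.
function deleteClip({ row, deleteRow }) {
  deleteRow(row.id);
  fs.rmSync(row.path, { force: true });
}
```

Writing the file before the row means a failed insert can be rolled back by deleting the file, so a listed clip always has audio behind it.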

Error handling

Whisper down / GPU absent: /transcribe returns a clear 503; bubble shows "couldn't transcribe — type instead", keeps the typed text. faster-whisper falls back to CPU on a GPU-less node (slower).
Mic permission denied / unsupported: hide recording UI, one-line hint, typing still works.
Clip too large/long: reject at the route (413) with a friendly message.
CT 102 disk pressure (currently 89% full / 6.4 GB free): keep the install lean (CTranslate2 backend, no PyTorch); the CT disk may need to be expanded first. Flagged as a build risk.
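
The size and duration guard could be a small pure function, sketched below under a few assumptions: 25 MB is read as 25 × 1024 × 1024 bytes, the duration is already known to the route (reported by the client or probed server-side), and the function name and message strings are illustrative only.

```javascript
// Limits from the spec (at most 25 MB and 60 s). The binary-megabyte reading is an assumption.
const MAX_BYTES = 25 * 1024 * 1024;
const MAX_DURATION_MS = 60_000;

// Returns null when the clip is acceptable, otherwise the { status, message }
// the route should send back. Hypothetical helper, not named in the spec.
function checkClip({ bytes, durationMs }) {
  if (bytes > MAX_BYTES) {
    return { status: 413, message: 'Clip is too large (max 25 MB).' };
  }
  if (durationMs > MAX_DURATION_MS) {
    return { status: 413, message: 'Clip is too long (max 60 s).' };
  }
  return null;
}
```

Keeping the check free of request and response objects makes the "size/duration guard returns 413" unit test a plain function call.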

Testing

Unit: voice.js route with a mocked whisper client (returns {text}); retention path writes a row + file (temp dir) and lists/deletes; size/duration guard returns 413.
Live smoke: record a short WAV via the CT-300 test harness → /api/voice/transcribe → non-empty text from the real CT-102 service.
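Headless: the mic button is enabled and the recording UI toggles. MediaRecorder needs a fake audio device in Chromium: launch with --use-fake-device-for-media-stream (typically alongside --use-fake-ui-for-media-stream so the permission prompt is skipped).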

Build phases

P2a — Transcription path. faster-whisper on CT 102 + whisper.js + /api/voice/transcribe (no retention) + enable the mic + record→review-send. Ship-able.
P2b — Retention. ZFS dataset + bind-mount + backup/replication wiring; voice_clips table + repo; save on transcribe when keepClips; clips list/play/delete UI; the "Keep voice clips" toggle.

Documentation

Wiki + Gitea per the standing rule; update project_cradle_chat_floating memory. Encryption-at-rest recorded as a future toggle.
