From fc1e93a58f9471694060b7aebeb887b6c142b878 Mon Sep 17 00:00:00 2001 From: root Date: Wed, 10 Jun 2026 00:45:59 +1000 Subject: [PATCH] docs(dross): Phase 2 (voice) design spec MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Local faster-whisper on CT 102, record→transcribe→review-send, and a durable owner-only clip-retention store (transcript in void-db, audio on a backed-up ZFS dataset — not void-app's ephemeral tier). Encryption-at- rest noted as future. Co-Authored-By: Claude Opus 4.8 --- .../2026-06-10-dross-voice-phase2-design.md | 64 +++++++++++++++++++ 1 file changed, 64 insertions(+) create mode 100644 docs/superpowers/specs/2026-06-10-dross-voice-phase2-design.md diff --git a/docs/superpowers/specs/2026-06-10-dross-voice-phase2-design.md b/docs/superpowers/specs/2026-06-10-dross-voice-phase2-design.md new file mode 100644 index 0000000..6036d1e --- /dev/null +++ b/docs/superpowers/specs/2026-06-10-dross-voice-phase2-design.md @@ -0,0 +1,64 @@ +# Floating Dross Chat — Phase 2 (Voice) Design + +**Date:** 2026-06-10 +**Status:** Draft (awaiting sign-off) +**Builds on:** `2026-06-09-floating-dross-chat-design.md` (Phase 1 shipped in v2.11.0) +**Goal:** Let the user record a voice clip in the Dross bubble, transcribe it locally, and drop the transcript into the input to review-and-send. Optionally retain each clip paired with its transcript, stored durably and owner-only. + +--- + +## Locked decisions (from the Phase-1 brainstorm + 2026-06-10 follow-up) + +1. **STT = local faster-whisper on CT 102** (the Ollama box, RTX A2000). GPU with CPU fallback (per the GPU/CPU-fallback HA rule). English model (`small.en`) for speed/accuracy. OpenAI-compatible HTTP API. +2. **Flow = review-and-send first** (`voiceMode: 'review'`). Record → transcribe → transcript lands in the bubble input → user edits/sends. `handsfree` (auto-send) and `action` (interpret) are later (the setting already exists; only `review` is wired now). +3. **Retention = "Keep voice clips" Dross setting**, default OFF. When ON, each clip is saved paired with its transcript. **Storage:** transcript + metadata in **void-db** (`voice_clips` table — in the Core-4 offsite backup + HA-replicated); audio files on a **dedicated owner-only ZFS dataset** (`localzfs/void-voiceclips`, bind-mounted into void-app at `/var/lib/void/voice-clips`, 0700), **added to the offsite backup + syncoid replication**. NOT on void-app's ephemeral rootfs (it's the rebuildable tier, excluded from backups). Encryption-at-rest is a documented **future** toggle (ZFS native encryption, key in Vaultwarden). + +## Non-goals (this phase) + +- `handsfree` / `action` voice modes (designed-for; only `review` wired). +- Encryption-at-rest of clips (future). +- Wake-word / always-listening. + +--- + +## Architecture + +### Components + +| Unit | Responsibility | +|---|---| +| **faster-whisper service** (CT 102, infra) | OpenAI-compatible `/v1/audio/transcriptions` (e.g. `faster-whisper-server`/`speaches`), `small.en`, GPU+CPU fallback, systemd unit, bound to `192.168.1.185:` (LAN-only). | +| `lib/voice/whisper.js` (void-app) | Thin client: POST the audio buffer to the CT-102 service, return `{ text }`. Timeout + error surface. | +| `lib/api/routes/voice.js` (void-app) | `POST /api/voice/transcribe` (owner-only, multipart, ≤25 MB / ≤60 s): transcribe; if `dross.keepClips` is on, persist (Task: retention). Returns `{ text, clip_id? }`. | +| `lib/db/repos/voice_clips.js` + migration | `voice_clips` table (id, transcript, duration_ms, bytes, mime, path, created_at). | +| `public/components/dross_bubble.js` (edit) | Enable the mic: `MediaRecorder` capture (tap start/stop), recording UI (timer/waveform), upload, transcript → input (review-and-send). | +| Settings → Dross (edit) | Add the **"Keep voice clips"** toggle; a small **clips list** (play / delete) when retention is on. | +| Infra | New ZFS dataset + bind-mount; add the dataset to the offsite-backup script + syncoid job. | + +### Data flow + +**Transcribe:** mic tap → `MediaRecorder` (`audio/webm;codecs=opus`) → on stop, blob → `POST /api/voice/transcribe` (multipart) → `whisper.js` → CT-102 faster-whisper → `{text}` → bubble drops text into the input (review-and-send). Errors never block typing. + +**Retention (when `dross.keepClips`):** the transcribe route, after a successful transcript, writes the audio to `/var/lib/void/voice-clips/.webm` (0600) and inserts a `voice_clips` row (transcript + metadata + path). `GET /api/voice/clips` lists; `GET /api/voice/clips/:id/audio` streams; `DELETE` removes row + file. Owner-only throughout. + +## Error handling + +- **Whisper down / GPU absent:** `/transcribe` returns a clear 503; bubble shows "couldn't transcribe — type instead", keeps the typed text. faster-whisper falls back to CPU on a GPU-less node (slower). +- **Mic permission denied / unsupported:** hide recording UI, one-line hint, typing still works. +- **Clip too large/long:** reject at the route (413) with a friendly message. +- **CT 102 disk pressure** (currently 89% full / 6.4 GB free): install lean (CTranslate2, no torch); **may expand the CT disk first**. Flagged as a build risk. + +## Testing + +- **Unit:** `voice.js` route with a mocked whisper client (returns `{text}`); retention path writes a row + file (temp dir) and lists/deletes; size/duration guard returns 413. +- **Live smoke:** record a short WAV via the CT-300 test harness → `/api/voice/transcribe` → non-empty text from the real CT-102 service. +- **Headless:** mic button enabled; recording UI toggles; (MediaRecorder needs a fake audio device in Chromium — use `--use-fake-device-for-media-stream`). + +## Build phases + +- **P2a — Transcription path.** faster-whisper on CT 102 + `whisper.js` + `/api/voice/transcribe` (no retention) + enable the mic + record→review-send. Ship-able. +- **P2b — Retention.** ZFS dataset + bind-mount + backup/replication wiring; `voice_clips` table + repo; save on transcribe when `keepClips`; clips list/play/delete UI; the "Keep voice clips" toggle. + +## Documentation + +Wiki + Gitea per the standing rule; update `project_cradle_chat_floating` memory. Encryption-at-rest recorded as a future toggle.