docs(dross): Phase 2 (voice) design spec
Local faster-whisper on CT 102, record→transcribe→review-send, and a durable owner-only clip-retention store (transcript in void-db, audio on a backed-up ZFS dataset, not void-app's ephemeral tier). Encryption-at-rest noted as future.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -0,0 +1,64 @@
# Floating Dross Chat — Phase 2 (Voice) Design
**Date:** 2026-06-10
**Status:** Draft (awaiting sign-off)
**Builds on:** `2026-06-09-floating-dross-chat-design.md` (Phase 1 shipped in v2.11.0)
**Goal:** Let the user record a voice clip in the Dross bubble, transcribe it locally, and drop the transcript into the input to review-and-send. Optionally retain each clip paired with its transcript, stored durably and owner-only.

---
## Locked decisions (from the Phase-1 brainstorm + 2026-06-10 follow-up)
1. **STT = local faster-whisper on CT 102** (the Ollama box, RTX A2000). GPU with CPU fallback (per the GPU/CPU-fallback HA rule). English model (`small.en`) for speed/accuracy. OpenAI-compatible HTTP API.
2. **Flow = review-and-send first** (`voiceMode: 'review'`). Record → transcribe → transcript lands in the bubble input → user edits/sends. `handsfree` (auto-send) and `action` (interpret) come later: the setting already exists, but only `review` is wired now.
3. **Retention = "Keep voice clips" Dross setting**, default OFF. When ON, each clip is saved paired with its transcript. **Storage:** transcript + metadata in **void-db** (`voice_clips` table, covered by the Core-4 offsite backup and HA-replicated); audio files on a **dedicated owner-only ZFS dataset** (`localzfs/void-voiceclips`, bind-mounted into void-app at `/var/lib/void/voice-clips`, 0700), **added to the offsite backup + syncoid replication**. NOT on void-app's ephemeral rootfs, which is the rebuildable tier and is excluded from backups. Encryption-at-rest is a documented **future** toggle (ZFS native encryption, key in Vaultwarden).
## Non-goals (this phase)
- `handsfree` / `action` voice modes (designed-for; only `review` wired).
- Encryption-at-rest of clips (future).
- Wake-word / always-listening.

---
## Architecture
### Components
| Unit | Responsibility |
|---|---|
| **faster-whisper service** (CT 102, infra) | OpenAI-compatible `/v1/audio/transcriptions` (e.g. `faster-whisper-server`/`speaches`), `small.en`, GPU+CPU fallback, systemd unit, bound to `192.168.1.185:<port>` (LAN-only). |
| `lib/voice/whisper.js` (void-app) | Thin client: POST the audio buffer to the CT-102 service, return `{ text }`. Timeout + error surface. |
| `lib/api/routes/voice.js` (void-app) | `POST /api/voice/transcribe` (owner-only, multipart, ≤25 MB / ≤60 s): transcribe; if `dross.keepClips` is on, persist (Task: retention). Returns `{ text, clip_id? }`. |
| `lib/db/repos/voice_clips.js` + migration | `voice_clips` table (id, transcript, duration_ms, bytes, mime, path, created_at). |
| `public/components/dross_bubble.js` (edit) | Enable the mic: `MediaRecorder` capture (tap start/stop), recording UI (timer/waveform), upload, transcript → input (review-and-send). |
| Settings → Dross (edit) | Add the **"Keep voice clips"** toggle; a small **clips list** (play / delete) when retention is on. |
| Infra | New ZFS dataset + bind-mount; add the dataset to the offsite-backup script + syncoid job. |
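
As an illustration of the thin client in the table above, here is a minimal sketch of what `lib/voice/whisper.js` might look like. It assumes Node 18+ (global `fetch`, `FormData`, `Blob`) and the usual OpenAI-style multipart fields (`file`, `model`). The `baseUrl` option, the `WhisperError` class, the 30 s default timeout, and the `fetchImpl` injection point are hypothetical names chosen for this sketch. The model id is also an assumption: some OpenAI-compatible servers expect a longer repository-style id rather than `small.en`.

```javascript
'use strict';
// Sketch of lib/voice/whisper.js (not the shipped code). Assumes Node 18+.

const DEFAULT_TIMEOUT_MS = 30_000; // assumed value; the spec only says "timeout"

// Hypothetical error type so the route can tell "service unavailable" from other failures.
class WhisperError extends Error {
  constructor(message, { status, cause } = {}) {
    super(message, { cause });
    this.name = 'WhisperError';
    this.status = status; // upstream HTTP status, undefined for network errors/timeouts
  }
}

// POST an audio buffer to the OpenAI-compatible endpoint and return { text }.
async function transcribe(audio, opts) {
  const {
    baseUrl,                       // e.g. the CT-102 address; the port is not fixed in the spec
    mime = 'audio/webm',
    model = 'small.en',            // assumption: the server accepts this id as-is
    timeoutMs = DEFAULT_TIMEOUT_MS,
    fetchImpl = fetch,             // injectable for unit tests
  } = opts;

  const form = new FormData();
  form.append('file', new Blob([audio], { type: mime }), 'clip.webm');
  form.append('model', model);

  let res;
  try {
    res = await fetchImpl(`${baseUrl}/v1/audio/transcriptions`, {
      method: 'POST',
      body: form,
      signal: AbortSignal.timeout(timeoutMs), // aborts the request after timeoutMs
    });
  } catch (err) {
    throw new WhisperError('whisper service unreachable or timed out', { cause: err });
  }
  if (!res.ok) {
    throw new WhisperError(`whisper service returned ${res.status}`, { status: res.status });
  }
  const body = await res.json();
  return { text: String(body.text ?? '').trim() };
}

module.exports = { transcribe, WhisperError };
```

Injecting `fetchImpl` keeps the client testable without a live CT-102 service, which matches the mocked-client unit tests described under Testing.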
### Data flow
**Transcribe:** mic tap → `MediaRecorder` (`audio/webm;codecs=opus`) → on stop, blob → `POST /api/voice/transcribe` (multipart) → `whisper.js` → CT-102 faster-whisper → `{text}` → bubble drops text into the input (review-and-send). Errors never block typing.
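
The transcribe flow above could be sketched on the bubble side roughly as follows. This is an illustrative outline, not the component's actual code: `uploadClip`, `startRecording`, the `audio` form-field name, and the `{ text, error }` result shape are all assumptions made for the example. Failures resolve to a result object instead of throwing, so a failed upload cannot interrupt typing.

```javascript
// Sketch for public/components/dross_bubble.js (hypothetical helper names).

// Upload one recorded blob; never throws, so errors cannot block typing.
async function uploadClip(blob, { fetchImpl = fetch } = {}) {
  const form = new FormData();
  form.append('audio', blob, 'clip.webm'); // field name 'audio' is an assumption
  try {
    const res = await fetchImpl('/api/voice/transcribe', { method: 'POST', body: form });
    if (!res.ok) return { text: null, error: `transcribe failed (${res.status})` };
    const { text } = await res.json();
    return { text, error: null };
  } catch (err) {
    return { text: null, error: 'network error' };
  }
}

// Start capturing from the mic; returns a function that stops the recording.
// onResult receives the { text, error } object once the upload finishes.
async function startRecording(onResult) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });
  const chunks = [];
  recorder.ondataavailable = (e) => { if (e.data.size > 0) chunks.push(e.data); };
  recorder.onstop = async () => {
    stream.getTracks().forEach((t) => t.stop()); // release the microphone
    onResult(await uploadClip(new Blob(chunks, { type: recorder.mimeType })));
  };
  recorder.start();
  return () => recorder.stop();
}
```

A caller would append a non-null `text` to whatever is already in the input rather than replacing it, and show the one-line hint when `error` is set.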
**Retention (when `dross.keepClips`):** the transcribe route, after a successful transcript, writes the audio to `/var/lib/void/voice-clips/<uuid>.webm` (0600) and inserts a `voice_clips` row (transcript + metadata + path). `GET /api/voice/clips` lists; `GET /api/voice/clips/:id/audio` streams; `DELETE` removes row + file. Owner-only throughout.
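
One way the retention write and delete could look is sketched below, under stated assumptions. The `repo` object with `insert`, `get`, and `remove` methods is a hypothetical stand-in for `lib/db/repos/voice_clips.js`, and removing the audio file when the row insert fails is a design choice of this sketch, not something the spec requires.

```javascript
'use strict';
const fs = require('node:fs/promises');
const path = require('node:path');
const { randomUUID } = require('node:crypto');

// Save one clip: write <uuid>.webm with mode 0600, then insert the voice_clips row.
// `repo.insert` is a hypothetical repo method.
async function saveClip({ dir, audio, transcript, durationMs, mime, repo }) {
  const id = randomUUID();
  const file = path.join(dir, `${id}.webm`);
  // 'wx' refuses to overwrite an existing file; the mode applies at creation time.
  await fs.writeFile(file, audio, { mode: 0o600, flag: 'wx' });
  try {
    await repo.insert({
      id,
      transcript,
      duration_ms: durationMs,
      bytes: audio.length,
      mime,
      path: file,
      created_at: new Date().toISOString(),
    });
  } catch (err) {
    await fs.rm(file, { force: true }); // avoid leaving an orphaned audio file
    throw err;
  }
  return id;
}

// Delete the row and its audio file; returns false when the id is unknown.
// `repo.get` and `repo.remove` are hypothetical repo methods.
async function deleteClip({ id, repo }) {
  const row = await repo.get(id);
  if (!row) return false;
  await repo.remove(id);
  await fs.rm(row.path, { force: true });
  return true;
}
```

Writing the file before the row means a listed clip always has audio behind it; the opposite order would need a cleanup pass for rows whose file never landed.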
## Error handling
- **Whisper down / GPU absent:** `/transcribe` returns a clear 503; bubble shows "couldn't transcribe — type instead", keeps the typed text. faster-whisper falls back to CPU on a GPU-less node (slower).
- **Mic permission denied / unsupported:** hide recording UI, one-line hint, typing still works.
- **Clip too large/long:** reject at the route (413) with a friendly message.
- **CT 102 disk pressure** (currently 89% full / 6.4 GB free): install lean (CTranslate2, no torch); **may expand the CT disk first**. Flagged as a build risk.
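
The route-level cases above (413 for oversized clips, 503 when transcription is unavailable) could be expressed as a small framework-free handler core, sketched here. The function names, the dependency-injected `whisper` and `saveClip` arguments, and the error strings are hypothetical. Two further assumptions: "25 MB" is read as 25 × 1024 × 1024 bytes, and the clip duration is supplied by the caller, since the spec does not say how the server measures it. Returning the transcript without a `clip_id` when saving fails is also this sketch's choice.

```javascript
'use strict';
const MAX_BYTES = 25 * 1024 * 1024; // spec: <= 25 MB (binary megabytes assumed)
const MAX_DURATION_MS = 60_000;     // spec: <= 60 s

// Returns null when the clip is acceptable, else a 413 description.
function checkClipLimits({ bytes, durationMs }) {
  if (bytes > MAX_BYTES) return { status: 413, error: 'Clip is too large (max 25 MB).' };
  if (durationMs > MAX_DURATION_MS) return { status: 413, error: 'Clip is too long (max 60 s).' };
  return null;
}

// Core of POST /api/voice/transcribe, independent of any web framework.
async function handleTranscribe({ audio, durationMs, whisper, keepClips, saveClip }) {
  const rejected = checkClipLimits({ bytes: audio.length, durationMs });
  if (rejected) return { status: rejected.status, body: { error: rejected.error } };

  let text;
  try {
    ({ text } = await whisper.transcribe(audio));
  } catch (err) {
    // Whisper down or timed out: a clear 503 so the bubble can fall back to typing.
    return { status: 503, body: { error: 'Transcription is unavailable.' } };
  }

  const body = { text };
  if (keepClips) {
    try {
      body.clip_id = await saveClip({ audio, transcript: text, durationMs });
    } catch (err) {
      // Assumption: a failed save should not cost the user their transcript.
    }
  }
  return { status: 200, body };
}
```

Keeping the limits check first means an oversized upload never reaches the CT-102 service.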
## Testing
- **Unit:** `voice.js` route with a mocked whisper client (returns `{text}`); retention path writes a row + file (temp dir) and lists/deletes; size/duration guard returns 413.
- **Live smoke:** record a short WAV via the CT-300 test harness → `/api/voice/transcribe` → non-empty text from the real CT-102 service.
- **Headless:** mic button enabled; recording UI toggles. MediaRecorder needs a fake audio device in Chromium, so launch it with `--use-fake-device-for-media-stream`.
## Build phases
- **P2a — Transcription path.** faster-whisper on CT 102 + `whisper.js` + `/api/voice/transcribe` (no retention) + enable the mic + record→review-send. Ship-able.
- **P2b — Retention.** ZFS dataset + bind-mount + backup/replication wiring; `voice_clips` table + repo; save on transcribe when `keepClips`; clips list/play/delete UI; the "Keep voice clips" toggle.
## Documentation
Wiki + Gitea per the standing rule; update `project_cradle_chat_floating` memory. Encryption-at-rest recorded as a future toggle.