# Void 2.0 — Plan 3 Design Spec: Capture pipeline + hybrid search
**Date:** 2026-06-01
**Builds on:** Plan 1 (Foundation, complete) and Plan 2 (API + UI shell, complete, version 2.0.0-alpha.2).
**Master spec:** `docs/superpowers/specs/2026-05-31-void-v2-design.md` — many decisions inherit from there.
## Goal
Wire the Plan 2 SPA's stub Capture button to a real ingest pipeline. Add a pg-boss-backed job queue, capture entry points (URL POST + Karakeep webhook + drag-drop attachment), a URL worker that turns links into `refs`, an embeddings worker that writes vectors into the existing `embedding` columns, and a hybrid FTS+vector search that replaces the Plan 2 FTS-only `/api/search`.
## Out of scope (Plan 4 and later)
- Whisper transcription, Tesseract OCR, yt-dlp video ingestion, scanned-PDF OCR.
- The Python `void-workers` service. Plan 3 stays single-process Node.
- AI Space/Project suggestion on capture (defer; capture takes explicit `space_id`).
- Embedding chunks table — Plan 3 uses one whole-doc embedding per entity row; chunks land later once we can measure recall on a real corpus.
- MCP server surface. Plan 5+.
## Decisions locked by brainstorm
| Question | Answer |
|---|---|
| Plan 3 slice | Node-side: pg-boss + `/api/capture` POST + Karakeep webhook + URL worker + embed.text worker + hybrid search + Jobs panel. Defers ML-heavy ingest to Plan 4. |
| Capture entry points | `/api/capture` POST + Karakeep webhook + drag-drop upload. Inbound email skipped. |
| Embedding granularity | Whole-doc per entity row. Add chunks table later. |
| Search rollout | `/api/search` replaced in-place with hybrid (FTS + vector via RRF). Vector branch graceful-degrades to FTS-only if Ollama is down or the row lacks an embedding. |
| AI Space/Project suggestion | Deferred. Capture requires `space_id`. SPA preselects the user's last-used space from `localStorage`. |
| Jobs visibility | `GET /api/jobs?state=` + `POST /api/jobs/:id/retry` + `DELETE /api/jobs/:id` + a minimal `#/jobs` SPA view (table grouped by state, 10 s polling, retry/delete per row). |
| Sequencing | Phase A → B → C → D (matches Plan 2 phasing). Each phase ends green and demoable. |
## Architecture
```
                    ┌──────────────────────────────────────────┐
                    │  void-server (CT 311, Node, single proc) │
                    │                                          │
  /api/capture ───▶ │  routes/capture.js                       │
  /api/ingest/      │  routes/ingest.js (Karakeep webhook)     │
    karakeep ─────▶ │      │                                   │
  drag-drop ──────▶ │      ▼                                   │
                    │  jobs/queue.js (pg-boss client)          │
                    │      │                                   │
                    │      ▼                                   │
                    │  workers/ (in-process pollers)           │
                    │   ├─ url.js                              │
                    │   ├─ karakeep.js                         │
                    │   ├─ embed.js (Ollama HTTP)              │
                    │   └─ blob.js (drag-drop attachments)     │
                    │      │                                   │
                    │      ▼                                   │
                    │  lib/db/repos/ (existing) + repos/jobs.js│
                    │      │                                   │
                    └──────┼───────────────────────────────────┘
             ┌─────────────┼──────────────┐
             ▼             ▼              ▼
       ┌──────────┐ ┌──────────────┐ ┌──────────────┐
       │ Postgres │ │ Ollama       │ │ Blob FS      │
       │ (CT 310, │ │ (CT 102,     │ │ /var/lib/    │
       │ pgvector │ │ nomic-       │ │ void/blobs/  │
       │ + pgboss │ │ embed-text)  │ │              │
       │ tables)  │ └──────────────┘ └──────────────┘
       └──────────┘
```
**Process model.** Workers and HTTP handlers share the void-server Node process. pg-boss polls Postgres on its own interval; HTTP requests enqueue jobs and return immediately with a `job_id`. No separate worker process — that's Plan 4 when the Python service arrives.
**External dependencies.** Postgres (already there), Ollama on CT 102 at `http://192.168.1.185:11434` (running, `nomic-embed-text` pulled, 768-dim embeddings verified 2026-06-01). Graceful-degrade still applies if it goes down later. Blob storage is local FS on CT 311's root pool, content-addressed.
**No new entity tables.** refs / pages / source_docs / attachments are reused. The `embedding vector(1024)` columns exist from Plan 1 (migration 002 + 004). pg-boss creates its own schema (`pgboss.*`) on first run.
## Phase A — Queue + worker harness + Jobs API
**New files:**
- `lib/jobs/queue.js` — singleton pg-boss client; `start()`, `enqueue(name, data, opts)`, `subscribe(name, handler, opts)`.
- `lib/jobs/index.js` — registers all worker handlers on start; called from `server.js` boot.
- `lib/jobs/workers/echo.js` — trivial worker used to prove the harness. Removed at end of Phase D.
- `lib/api/routes/jobs.js` — `GET /api/jobs?state=`, `GET /api/jobs/:id`, `POST /api/jobs/:id/retry`, `DELETE /api/jobs/:id`. Owner-only.
- `tests/jobs/queue.test.js` — pg-boss roundtrip: enqueue → handler runs → result.
- `tests/api/jobs.test.js` — list/retry/delete via HTTP.
**Modify:**
- `server.js` — call `jobs.start()` on boot, `jobs.shutdown()` on SIGTERM.
- `package.json` — add `pg-boss@^10`.
- `lib/api/index.js` — mount `/api/jobs`.
- `public/router.js` + `public/app.js` + add `public/views/jobs.js` — minimal Jobs view (placeholder for now; fleshed in Phase D).
**pg-boss config.** One pg-boss instance per process. Uses the existing `DATABASE_URL` and pg-boss's default schema name (`pgboss`). `newJobCheckIntervalSeconds: 2` (alpha-tier; tighten later if needed). `archiveCompletedAfterSeconds: 86_400` (completed jobs move to the archive after 1 day). `deleteAfterDays: 7`.
**Concurrency limits** per the master spec, surfaced via `subscribe(name, handler, {teamSize, teamConcurrency})`:
| Worker name | Team size | Reason |
|---|---|---|
| `ingest.url` | 4 | Network-bound |
| `ingest.karakeep` | 4 | Network-bound |
| `ingest.blob` | 2 | Disk + sha256 hashing |
| `embed.text` | 2 | Ollama-bound (single GPU on CT 102) |
**Retry policy.** Per-worker `retryLimit: 5`, `retryBackoff: true`, `retryDelay: 10` (seconds). Effective backoff sequence: 10 s, 20 s, 40 s, 80 s, 160 s (310 s cumulative), then dead-letter. The master spec called for 10 s / 60 s / 5 min, but pg-boss only exposes exponential backoff with a base delay; the resulting curve is close enough.
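The arithmetic behind that sequence can be checked with a short sketch. It assumes the delay simply doubles from `retryDelay` on each attempt with no jitter, which is what the sequence above implies; `backoffSchedule` is a hypothetical helper for illustration, not part of pg-boss.

```javascript
// Hypothetical helper: delay (in seconds) before each retry, assuming plain
// doubling from the base delay with no jitter.
function backoffSchedule(retryLimit, retryDelaySeconds) {
  const delays = [];
  for (let attempt = 0; attempt < retryLimit; attempt++) {
    delays.push(retryDelaySeconds * 2 ** attempt);
  }
  return delays;
}

const delays = backoffSchedule(5, 10);
const total = delays.reduce((sum, d) => sum + d, 0);
console.log(delays, total); // [ 10, 20, 40, 80, 160 ] 310
```

The five delays add up to 310 s, which is where the "≈ 5 min cumulative" figure for dead-lettering comes from.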
**Dead-letter.** pg-boss's archive table (`pgboss.archive`) keeps failed jobs. `/api/jobs?state=failed` queries it. Manual retry copies to active.
**Commit:** `feat(jobs): pg-boss harness + Jobs API`.
## Phase B — Capture API + URL worker + blob storage
**Capture POST.** `POST /api/capture` (owner or agent with write tier):
```json
{
"space_id": "uuid",
"url": "https://example.com/article",
"hint": { "project_id": "uuid?", "title": "string?", "tags": ["string"] }
}
```
Response 202 with `{ job_id, idempotency_key, ref_id?: uuid }`. Idempotency key is `sha256(space_id + url)`. If a ref already exists for that key, the response carries the existing `ref_id` and `job_id: null` (no new job enqueued).
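A minimal sketch of the key derivation, using Node's built-in `crypto`. The helper name `urlIdempotencyKey` and the hex encoding are assumptions; the spec only fixes `sha256(space_id + url)`.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: sha256 over the plain concatenation space_id + url.
// Hex output is an assumption; the spec does not name an encoding.
function urlIdempotencyKey(spaceId, url) {
  return createHash('sha256').update(spaceId + url, 'utf8').digest('hex');
}

const spaceA = '00000000-0000-4000-8000-000000000001'; // made-up UUIDs
const spaceB = '00000000-0000-4000-8000-000000000002';
const a = urlIdempotencyKey(spaceA, 'https://example.com/article');
const b = urlIdempotencyKey(spaceA, 'https://example.com/article');
const c = urlIdempotencyKey(spaceB, 'https://example.com/article');
console.log(a === b, a === c); // true false
```

Because the URL is hashed as given, `https://example.com/article` and `https://example.com/article/` would produce different keys in this sketch; whether to normalise URLs first is left open here.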
**URL worker.** `lib/jobs/workers/url.js` for `ingest.url`:
1. Compute idempotency key. If a `refs` row already exists with `source_kind='url'` and `external_id=<key>`, return its id.
2. `fetch(url)` with `User-Agent: void-ingest/2.0` and 15 s timeout.
3. Run readability extraction (npm `@mozilla/readability` + `jsdom`). Pull `title`, `byline`, `excerpt`, `textContent`, `siteName`.
4. Insert a `refs` row: `kind='url'`, `source_url=url`, `title=readability.title`, `summary=readability.excerpt`, `body_text=readability.textContent` (truncate to 200 kB), `source_kind='url'`, `external_id=<idempotency_key>`, `metadata={ site_name, byline, content_length }`.
5. Return the ref. Embedding is handled by Phase C's repo-level trigger that wraps `refs.create`; in Phase B alone the ref simply lacks an embedding until Phase C ships.
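Step 4 can be sketched as a pure mapping from a Readability-style article object to the row shape, which keeps it testable without `jsdom`. The helper names, the reading of "200 kB" as 200,000 UTF-8 bytes, and `content_length` as a character count are all assumptions.

```javascript
const MAX_BODY_BYTES = 200 * 1000; // assumption: "200 kB" read as 200,000 UTF-8 bytes

// Cut a string to at most maxBytes of UTF-8 without splitting a multi-byte character.
function truncateUtf8(text, maxBytes) {
  const buf = Buffer.from(text, 'utf8');
  if (buf.length <= maxBytes) return text;
  let end = maxBytes;
  // 0b10xxxxxx marks a continuation byte; back up to the start of that character.
  while (end > 0 && (buf[end] & 0xc0) === 0x80) end--;
  return buf.subarray(0, end).toString('utf8');
}

// Hypothetical mapper from an article object to the refs row described in step 4.
function buildUrlRefRow({ url, idempotencyKey, article }) {
  const text = article.textContent || '';
  return {
    kind: 'url',
    source_url: url,
    title: article.title,
    summary: article.excerpt,
    body_text: truncateUtf8(text, MAX_BODY_BYTES),
    source_kind: 'url',
    external_id: idempotencyKey,
    metadata: {
      site_name: article.siteName,
      byline: article.byline,
      content_length: text.length, // assumption: characters, before truncation
    },
  };
}

const row = buildUrlRefRow({
  url: 'https://example.com/article',
  idempotencyKey: 'abc123', // placeholder, not a real digest
  article: { title: 'T', excerpt: 'E', byline: null, siteName: 'Example', textContent: '\u00e9'.repeat(150000) },
});
console.log(Buffer.byteLength(row.body_text, 'utf8')); // 200000
```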
**Drag-drop.** `POST /api/capture/upload` (multipart, owner or agent write):
- Field `file` — the binary.
- Field `space_id` — required.
- Field `meta` (json) — optional `{ title, kind, tags }`.
Multer stages uploads in `/var/lib/void/uploads-tmp/` (size cap 100 MB per file) and the worker moves the file into the content-addressed blob store on success.
Worker `ingest.blob`:
1. Stream the upload to a temp file. Hash with sha256 as it streams.
2. If `/var/lib/void/blobs/<sha-prefix>/<sha>` exists, this is a duplicate; reuse the existing path.
3. Otherwise move the temp file into place.
4. Determine `kind` from `Content-Type` / extension: `image` for image/*, `pdf` for application/pdf, `file` for everything else. Video/audio fall through to `file` in Plan 3 (Plan 4 picks them up).
5. Build a `refs` row: `kind=<derived>`, `blob_path=<path>`, `title=filename || sha`, plus metadata.
6. Insert it via `refs.create`; Phase C's trigger picks up the embed automatically. In Phase B, no embed runs.
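The `kind` derivation in step 4 might look like the sketch below. `deriveKind` and the extension list are hypothetical; the spec only fixes the three outcomes (`image`, `pdf`, `file`).

```javascript
const path = require('node:path');

// Illustrative extension list only; the real set is not specified.
const IMAGE_EXTENSIONS = new Set(['.png', '.jpg', '.jpeg', '.gif', '.webp']);

// Hypothetical helper: prefer the MIME type, fall back to the file extension
// when the type is missing or generic.
function deriveKind(contentType, filename = '') {
  const mime = (contentType || '').split(';')[0].trim().toLowerCase();
  if (mime.startsWith('image/')) return 'image';
  if (mime === 'application/pdf') return 'pdf';
  if (mime === '' || mime === 'application/octet-stream') {
    const ext = path.extname(filename).toLowerCase();
    if (IMAGE_EXTENSIONS.has(ext)) return 'image';
    if (ext === '.pdf') return 'pdf';
  }
  return 'file'; // video/* and audio/* land here in Plan 3
}

console.log(deriveKind('image/png'), deriveKind('video/mp4', 'clip.mp4')); // image file
```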
**Blob storage.** New directory `/var/lib/void/blobs/` on CT 311, owned by `void:void`, mode 750. Layout `<first-2-chars-of-sha>/<full-sha>`. Deploy bootstrap step adds the dir creation. Already on `localzfs` so replication picks it up.
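The layout can be illustrated with a small path resolver. `blobPathFor` and `sha256Hex` are hypothetical names, and the sketch hashes an in-memory Buffer for brevity, whereas the worker would feed stream chunks into `hash.update()`.

```javascript
const { createHash } = require('node:crypto');
const path = require('node:path');

const BLOB_ROOT = '/var/lib/void/blobs';

// Hypothetical helper: <root>/<first-2-chars-of-sha>/<full-sha>.
function blobPathFor(sha, root = BLOB_ROOT) {
  if (!/^[0-9a-f]{64}$/.test(sha)) throw new Error('expected a hex sha256 digest');
  return path.posix.join(root, sha.slice(0, 2), sha);
}

// Buffer input keeps the example short; a stream would call update() per chunk.
function sha256Hex(data) {
  return createHash('sha256').update(data).digest('hex');
}

const sha = sha256Hex(Buffer.from('hello'));
console.log(blobPathFor(sha));
```

Validating the digest before building the path also keeps arbitrary strings (for example ones containing `../`) out of the blob directory.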
**Files:**
- `lib/api/routes/capture.js` — both endpoints + multer config.
- `lib/jobs/workers/url.js`, `lib/jobs/workers/blob.js`.
- `lib/ingest/readability.js` — wraps `@mozilla/readability` for testability.
- `lib/ingest/blob_store.js` — sha + path resolution + write.
- `tests/api/capture.test.js`, `tests/jobs/workers/url.test.js`, `tests/jobs/workers/blob.test.js`.
**Deps to add:** `@mozilla/readability`, `jsdom`, `multer` (`pg-boss` was already added in Phase A).
**Commit:** `feat(jobs): capture API + URL + blob workers`.
## Phase C — Embeddings + hybrid search
**Ollama client.** `lib/ai/ollama.js`:
```js
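const OLLAMA_URL = process.env.OLLAMA_URL || 'http://192.168.1.185:11434';

// Carries the HTTP status so callers can tell Ollama failures from other errors.
class OllamaError extends Error {
  constructor(status, body) {
    super(`Ollama responded ${status}: ${body}`);
    this.name = 'OllamaError';
    this.status = status;
  }
}

async function embedText(text, model = 'nomic-embed-text') {
  const res = await fetch(`${OLLAMA_URL}/api/embeddings`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt: text }),
    signal: AbortSignal.timeout(60_000), // abort the request after 60 s
  });
  if (!res.ok) throw new OllamaError(res.status, await res.text());
  const j = await res.json();
  return j.embedding; // 768-dim for nomic-embed-text
}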
```
`OLLAMA_URL` env var, default `http://192.168.1.185:11434`. The 768-dim vector is zero-padded to 1024 to match the `vector(1024)` column (per master spec, eases later model swap).
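The zero-padding step is small enough to sketch in full. `zeroPad` and `toVectorLiteral` are hypothetical helper names; the bracketed text form is the literal format pgvector accepts for `vector` values.

```javascript
const TARGET_DIM = 1024;

// Hypothetical helper: right-pad an embedding with zeros up to the column width.
function zeroPad(embedding, targetDim = TARGET_DIM) {
  if (embedding.length > targetDim) {
    throw new RangeError(`embedding has ${embedding.length} dims, column holds ${targetDim}`);
  }
  const padded = new Array(targetDim).fill(0);
  for (let i = 0; i < embedding.length; i++) padded[i] = embedding[i];
  return padded;
}

// Serialise to the bracketed text form, e.g. '[0.1,0.2,0]'.
function toVectorLiteral(vec) {
  return `[${vec.join(',')}]`;
}

console.log(zeroPad([0.5, -1, 2], 5)); // [ 0.5, -1, 2, 0, 0 ]
```

Padding with zeros leaves dot products and norms unchanged, so cosine distance between two padded vectors equals the distance between the original 768-dim vectors.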
**Embed worker.** `embed.text` job payload `{ entity_type, entity_id }`. Worker:
1. Load the entity row.
2. Build the embedding string:
- `page`: `${title}\n\n${body_md}`, truncated to ~6 k characters (≈ 1.5 k tokens; well under nomic's 8 k context).
- `ref`: `${title || ''}\n${summary || ''}\n${body_text || ''}`, same truncation.
- `source_doc`: `${name}\n${body_text || ''}`.
- `conversation`: `${title || ''}\n${summary || ''}` — short by design; conversations get richer treatment in Plan 5.
3. Call `embedText`. On `OllamaError` or fetch timeout, throw — pg-boss retry kicks in with exponential backoff.
4. Zero-pad to 1024, UPDATE the entity's `embedding` column.
5. Emit an audit log entry `(actor_kind='worker', action='update', entity_type, entity_id, diff={embedding:'updated'})`.
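Step 2 of the worker can be sketched as one builder per entity type. The function names are hypothetical, and applying the 6 k-character cut to every type (not only `page` and `ref`) is an assumption made for simplicity.

```javascript
const MAX_CHARS = 6000; // "~6 k characters" from step 2

// Hypothetical builders keyed by entity_type; field names follow step 2.
const builders = new Map([
  ['page', (e) => `${e.title}\n\n${e.body_md}`],
  ['ref', (e) => `${e.title || ''}\n${e.summary || ''}\n${e.body_text || ''}`],
  ['source_doc', (e) => `${e.name}\n${e.body_text || ''}`],
  ['conversation', (e) => `${e.title || ''}\n${e.summary || ''}`],
]);

function buildEmbeddingInput(entityType, entity) {
  const build = builders.get(entityType);
  if (!build) throw new Error(`no embedding builder for ${entityType}`);
  // Assumption: the same truncation applies to every entity type.
  return build(entity).slice(0, MAX_CHARS);
}

console.log(JSON.stringify(buildEmbeddingInput('ref', { title: 'T', summary: null })));
// "T\n\n"
```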
**Re-embed triggers.** Write paths (`repo.create`, `repo.update`) for pages/refs/source_docs already exist. Add a small `lib/jobs/triggers.js` that wraps these — after a successful create/update of an embeddable entity, enqueue `embed.text` with a singleton key `${entity_type}:${entity_id}` so rapid re-edits coalesce. The trigger is called from repo level so MCP and cron paths get it too.
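One possible shape for the wrapper is sketched below. `withEmbedTrigger` is a hypothetical name, and both `repo` and `enqueue` are injected so the sketch does not depend on the real repos or on pg-boss; the option name `singletonKey` is taken from this spec.

```javascript
// Hypothetical wrapper: after a successful write, enqueue embed.text with a
// per-entity singleton key so rapid re-edits coalesce.
function withEmbedTrigger(repo, entityType, enqueue, methods = ['create', 'update']) {
  const wrapped = Object.create(repo); // untouched methods fall through to the repo
  for (const method of methods) {
    if (typeof repo[method] !== 'function') continue;
    wrapped[method] = async (...args) => {
      const row = await repo[method](...args);
      // Reached only if the write succeeded; a throw above skips the enqueue.
      await enqueue(
        'embed.text',
        { entity_type: entityType, entity_id: row.id },
        { singletonKey: `${entityType}:${row.id}` },
      );
      return row;
    };
  }
  return wrapped;
}
```

Wrapping at the repo boundary means callers never enqueue embeds themselves, which is what lets MCP and cron write paths pick up the behaviour without changes.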
**Hybrid search.** Rewrite `lib/db/repos/search.js::fts` into `search.hybrid({ q, space_id?, kinds?, limit, offset })`:
1. FTS branch — current Plan 2 query unchanged, returns up to `limit * 3` results with `ts_rank`.
2. Vector branch — embed `q` via Ollama (with a 5 s timeout — search must stay snappy). For each kind, run an ANN query against the matching table's `embedding` column using HNSW (`<=>` cosine distance). Returns up to `limit * 3` per kind. If Ollama times out or errors, skip this branch entirely — log a `search.vector_skipped` event and continue with FTS-only.
3. RRF fusion — for each unique `(kind, id)`, sum `1 / (60 + rank_fts) + 1 / (60 + rank_vec)`. The `60` constant matches the canonical RRF paper. Sort by fused score, descending, then slice to `[offset, offset + limit)`.
4. Vector-only rows (no FTS match) and FTS-only rows (no embedding yet) both participate; missing rank is treated as infinity, giving `1 / inf = 0` from that branch.
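The graceful-degrade rule in step 2 could be isolated in a small guard like the one below. Everything here is hypothetical: `embedQuery` and `runAnn` stand in for the Ollama call and the ANN query, and `log.warn` stands in for the Pino logger.

```javascript
// Hypothetical guard: return vector-branch rows, or [] if embedding the query
// fails or exceeds the timeout. Errors from the ANN query itself still propagate.
async function vectorBranchOrSkip(embedQuery, runAnn, { timeoutMs = 5000, log = console } = {}) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('query embedding timed out')), timeoutMs);
  });
  let queryVec;
  try {
    queryVec = await Promise.race([embedQuery(), timeout]);
  } catch (err) {
    log.warn({ event: 'search.vector_skipped', reason: err.message });
    return []; // caller continues with FTS-only results
  } finally {
    clearTimeout(timer); // avoid keeping the process alive on the fast path
  }
  return runAnn(queryVec);
}
```

Returning an empty list rather than throwing lets the fusion step treat a skipped vector branch exactly like a branch with no matches.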
Result shape unchanged: `{ kind, id, space_id, title_or_snippet, rank }`. The `rank` field now carries the fused RRF score.
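A minimal sketch of the fusion step follows. `rrfFuse` is a hypothetical name, both inputs are assumed to be ordered best-first, and ranks are assumed to be 1-based; a row missing from one branch simply receives no contribution from it, which matches the "rank = infinity" rule.

```javascript
const RRF_K = 60;

// Fuse two ranked lists of { kind, id, ... } rows by reciprocal rank fusion.
function rrfFuse(ftsRows, vecRows, { limit = 20, offset = 0 } = {}) {
  const fused = new Map();
  const addBranch = (rows) => {
    rows.forEach((row, index) => {
      const key = `${row.kind}:${row.id}`;
      // Any branch-specific rank (e.g. ts_rank) is replaced by the fused score.
      const entry = fused.get(key) || { ...row, rank: 0 };
      entry.rank += 1 / (RRF_K + index + 1); // assumption: 1-based ranks
      fused.set(key, entry);
    });
  };
  addBranch(ftsRows);
  addBranch(vecRows); // pass [] when the vector branch was skipped
  return [...fused.values()]
    .sort((x, y) => y.rank - x.rank)
    .slice(offset, offset + limit);
}

const fusedExample = rrfFuse(
  [{ kind: 'ref', id: 'a' }, { kind: 'ref', id: 'b' }],
  [{ kind: 'ref', id: 'b' }, { kind: 'page', id: 'c' }],
);
console.log(fusedExample.map((r) => r.id)); // [ 'b', 'a', 'c' ]
```

In the example, `b` appears in both branches (1/62 + 1/61) and so outranks `a` (1/61) and `c` (1/62).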
**Files:**
- `lib/ai/ollama.js` (new).
- `lib/jobs/workers/embed.js` (new).
- `lib/jobs/triggers.js` (new).
- `lib/db/repos/search.js` (rewrite).
- `tests/ai/ollama.test.js` — fetch mock.
- `tests/jobs/workers/embed.test.js` — fetch mock; verifies zero-pad + audit.
- `tests/repos/search.test.js` (existing) — extended with vector-fixture rows + RRF assertions.
**Embedding-test strategy.** Tests insert fixture vectors directly (no Ollama needed). One integration test under `tests/integration/embed_live.test.js` hits a real Ollama, marked `skip()` if `OLLAMA_URL` is unreachable.
**Repos that emit triggers:** pages.create, pages.update, refs.create, refs.update, refs.upsertByExternal, source_docs.create, source_docs.update. Conversation embeds are summary-only and re-fire when `setSummary` is called.
**Commit:** `feat(jobs): embed worker + hybrid search`.
## Phase D — Karakeep webhook + drag-drop UI + Jobs UI
**Karakeep webhook.** `POST /api/ingest/karakeep`. Authenticated by `X-Karakeep-Signature: sha256=<hex>` HMAC of the raw body with `KARAKEEP_WEBHOOK_SECRET` env. If the signature is missing or wrong: 401.
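The signature check can be sketched with Node's `crypto` module. `verifyKarakeepSignature` is a hypothetical name; the sketch assumes the header format stated above and that the route has access to the raw request bytes before JSON parsing.

```javascript
const { createHmac, timingSafeEqual } = require('node:crypto');

// Hypothetical verifier for the "sha256=<hex>" header format.
// rawBody must be the exact bytes received, not a re-serialised object.
function verifyKarakeepSignature(rawBody, header, secret) {
  if (typeof header !== 'string' || !header.startsWith('sha256=')) return false;
  const expected = createHmac('sha256', secret).update(rawBody).digest();
  const given = Buffer.from(header.slice('sha256='.length), 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

A route would answer 401 whenever this returns `false`. Using `timingSafeEqual` instead of `===` avoids leaking how many leading bytes of the signature matched.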
Payload (Karakeep's webhook shape, normalized): `{ event, bookmark_id, tags }`.
For `event === 'bookmark.created'`:
1. Look up the existing space-mapping from env: `KARAKEEP_DEFAULT_SPACE_ID` (a UUID). Future work: per-tag space routing.
2. Enqueue `ingest.karakeep` with `{ bookmark_id, space_id }`.
`ingest.karakeep` worker:
1. Fetch the bookmark via Karakeep's API: `GET {KARAKEEP_API_URL}/api/v1/bookmarks/{bookmark_id}` (`KARAKEEP_API_URL` is `https://karakeep.hynesy.com` in this deployment), authenticated with `KARAKEEP_API_TOKEN`.
2. Build the same payload an `ingest.url` job would use (URL + title + tags) and call the URL handler directly. Tags propagate to the `entity_tags` table via repo.
3. If Karakeep returns 404 (bookmark deleted), mark the job done — no error.
**Drag-drop UI.** `public/components/dropzone.js` — wraps a target element, intercepts drag events, POSTs each file to `/api/capture/upload`, shows toast progress. Wire onto `<main>` so dropping anywhere in the main area works. Pre-fills `space_id` with `localStorage.last_space_id` (set when the user navigates to a space view).
**Jobs UI fill-in.** Expand `public/views/jobs.js`:
- Group rows by `state` (active / completed / failed).
- Each row: `id (8 chars)`, `name`, `state`, relative `created_at`, `last_error?`, action buttons.
- Polls `/api/jobs?state=active,failed` every 10 s.
- Retry button sends `POST /api/jobs/:id/retry`; delete button sends `DELETE /api/jobs/:id`.
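The grouping and the 8-character id could be handled by a small pure helper, sketched here. `groupJobsByState` and the `short_id` field are hypothetical; the row fields follow the list above.

```javascript
const STATES = ['active', 'completed', 'failed'];

// Hypothetical view helper: bucket job rows by state and add a short display id.
function groupJobsByState(jobs) {
  const groups = Object.fromEntries(STATES.map((state) => [state, []]));
  for (const job of jobs) {
    if (!STATES.includes(job.state)) continue; // ignore states the view does not show
    groups[job.state].push({ ...job, short_id: job.id.slice(0, 8) });
  }
  return groups;
}

const grouped = groupJobsByState([
  { id: '0123456789abcdef', name: 'embed.text', state: 'failed' },
  { id: 'fedcba9876543210', name: 'ingest.url', state: 'active' },
]);
console.log(grouped.failed[0].short_id); // 01234567
```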
**Files:**
- `lib/api/routes/ingest.js`.
- `lib/jobs/workers/karakeep.js`.
- `lib/karakeep/client.js` — thin wrapper.
- `public/components/dropzone.js`.
- `public/views/jobs.js` (expand).
- `tests/api/ingest.test.js` — HMAC check, valid/invalid signature.
- `tests/jobs/workers/karakeep.test.js` — Karakeep API mocked via fetch interceptor.
**Commit:** `feat(jobs): Karakeep webhook + drag-drop + Jobs UI`.
## Error handling & idempotency
- **Idempotency keys.** URL and Karakeep workers compute `sha256(space_id + url)` (URL) or `sha256(space_id + 'karakeep:' + bookmark_id)` (Karakeep). Stored as `refs.external_id` with `source_kind` set to `'url'` or `'karakeep'`. The unique index `idx_refs_external_unique` already enforces this from Plan 1. A duplicate ingest finds the existing ref and short-circuits.
- **Singleton embed jobs.** pg-boss `singletonKey: '${entity_type}:${entity_id}'` so rapid edits coalesce into one pending embed. If a job is already in-flight when a new edit lands, a follow-up is enqueued.
- **Capture rate limit.** Out of scope. The `agentOrOwner` gate is enough at single-user scale.
- **Ollama down.** Embed jobs throw and retry under pg-boss backoff. After dead-letter (≈ 5 min cumulative), the entity stays without an embedding; hybrid search falls back to FTS for those rows. Once the operator restores Ollama, they either call `POST /api/jobs/:id/retry` or wait for the periodic re-embed cron planned for a future phase.
- **Karakeep down.** The webhook still accepts and enqueues. If the Karakeep API stays unreachable, the worker dead-letters, and the operator replays the failed jobs (and with them the tag mapping) manually.
- **Blob upload partial.** Stream to temp; rename on success only. Failed uploads leave a temp file; a daily cron in Plan 4 sweeps temp files older than 24 h.
## Observability
- Pino structured logs already in place. New log keys: `job_id`, `job_name`, `entity_type`, `entity_id`, `idempotency_key`, `outcome`.
- `/api/jobs` is the operator surface; the SPA Jobs view fronts it.
- pg-boss's archive table is the source of truth for completed/failed jobs; no separate audit needed for job lifecycle (the audit log captures entity-level changes the workers cause).
## Testing strategy
- **Unit:** workers and the Ollama client get unit tests with `fetch` mocked (vitest's `vi.fn`).
- **Repo:** `tests/repos/search.test.js` extended; new `tests/repos/jobs.test.js` covers `pg-boss`-backed list/retry helpers.
- **API:** capture, ingest, jobs routes via supertest. HMAC signature pass/fail. Idempotency on second capture of the same URL.
- **Integration (gated):** one test that hits real Ollama; auto-skipped if `OLLAMA_URL` is unreachable. Real pg-boss roundtrips happen inside the existing test DB, using `resetDb` plus `await boss.stop()` (on the pg-boss instance) between suites to avoid cross-talk.
- **No new vitest config.** `fileParallelism: false` already in place from Plan 1 — pg-boss is happier serialized too.
## Migrations
- **No new SQL migrations from Void.** pg-boss creates its own schema on first `start()`.
- One-time CT 311 ops: create `/var/lib/void/blobs/` and chown `void:void`.
## Deploy delta
- `.env` adds `OLLAMA_URL`, `KARAKEEP_WEBHOOK_SECRET`, `KARAKEEP_API_TOKEN`, `KARAKEEP_API_URL`, `KARAKEEP_DEFAULT_SPACE_ID`. Documented in `deploy/README.md`.
- `deploy/push.sh` unchanged (rsync still works).
- Snapshot CT 310 + 311 before deploying Plan 3 (standing rule). The Phase A first-deploy is the "major update" — pg-boss creates new tables in the shared DB.
## Known follow-ups (not Plan 3)
- AI Space/Project suggestion on capture.
- Embedding chunks table.
- pdf-text-extract for born-digital PDFs (Plan 4 likely handles this with Tesseract too).
- Per-tag Karakeep → Space routing instead of one default space.
- Recurring re-embed cron for rows where `embedding IS NULL`.
- Real-time Jobs UI via `pg LISTEN/NOTIFY` instead of polling.
## Open items for the user
- **Karakeep secrets.** Plan 3 Phase D needs `KARAKEEP_API_TOKEN` (issued from Karakeep settings) and a chosen `KARAKEEP_DEFAULT_SPACE_ID`. Surfaceable when the phase starts.
- **The 29-day-old `knowledge_pipeline` memory** (Karakeep → Qdrant → MCP) is now superseded by Void 2.0's pgvector-only architecture. After Plan 3 ships, that memory should be marked obsolete or deleted to avoid future-me reading it as authoritative.