# Void 2.0 - Plan 3 Design Spec: Capture pipeline + hybrid search

**Date:** 2026-06-01

**Builds on:** Plan 1 (Foundation, complete) and Plan 2 (API + UI shell, complete, version 2.0.0-alpha.2).

**Master spec:** `docs/superpowers/specs/2026-05-31-void-v2-design.md`; many decisions inherit from there.

## Goal

Wire the Plan 2 SPA's stub Capture button to a real ingest pipeline. Add a pg-boss-backed job queue, capture entry points (URL POST + Karakeep webhook + drag-drop attachment), a URL worker that turns links into `refs`, an embeddings worker that writes vectors into the existing `embedding` columns, and a hybrid FTS+vector search that replaces the Plan 2 FTS-only `/api/search`.

## Out of scope (Plan 4 and later)

- Whisper transcription, Tesseract OCR, yt-dlp video ingestion, scanned-PDF OCR.
- The Python `void-workers` service. Plan 3 stays single-process Node.
- AI Space/Project suggestion on capture (defer; capture takes explicit `space_id`).
- Embedding chunks table. Plan 3 uses one whole-doc embedding per entity row; chunks land later once we can measure recall on a real corpus.
- MCP server surface. Plan 5+.

## Decisions locked by brainstorm

| Question | Answer |
|---|---|
| Plan 3 slice | Node-side: pg-boss + `/api/capture` POST + Karakeep webhook + URL worker + embed.text worker + hybrid search + Jobs panel. Defers ML-heavy ingest to Plan 4. |
| Capture entry points | `/api/capture` POST + Karakeep webhook + drag-drop upload. Inbound email skipped. |
| Embedding granularity | Whole-doc per entity row. Add chunks table later. |
| Search rollout | `/api/search` replaced in-place with hybrid (FTS + vector via RRF). Vector branch graceful-degrades to FTS-only if Ollama is down or the row lacks an embedding. |
| AI Space/Project suggestion | Deferred. Capture requires `space_id`. SPA preselects the user's last-used space from `localStorage`. |
| Jobs visibility | `GET /api/jobs?state=` + `POST /api/jobs/:id/retry` + `DELETE /api/jobs/:id` + a minimal `#/jobs` SPA view (table grouped by state, 10 s polling, retry/delete per row). |
| Sequencing | Phase A → B → C → D (matches Plan 2 phasing). Each phase ends green and demoable. |

## Architecture

```
                   ┌──────────────────────────────────────────┐
                   │  void-server (CT 311, Node, single proc) │
                   │                                          │
/api/capture ───▶  │  routes/capture.js                       │
/api/ingest/       │  routes/ingest.js (Karakeep webhook)     │
  karakeep ─────▶  │      │                                   │
drag-drop ─────▶   │      ▼                                   │
                   │  jobs/queue.js (pg-boss client)          │
                   │      │                                   │
                   │      ▼                                   │
                   │  workers/ (in-process pollers)           │
                   │   ├─ url.js                              │
                   │   ├─ karakeep.js                         │
                   │   ├─ embed.js (Ollama HTTP)              │
                   │   └─ blob.js (drag-drop attachments)     │
                   │      │                                   │
                   │      ▼                                   │
                   │  lib/db/repos/ (existing) + repos/jobs.js│
                   │      │                                   │
                   └──────┼───────────────────────────────────┘
                          │
            ┌─────────────┼──────────────┐
            ▼             ▼              ▼
    ┌──────────┐  ┌──────────────┐  ┌──────────────┐
    │ Postgres │  │ Ollama       │  │ Blob FS      │
    │ (CT 310, │  │ (CT 102,     │  │ /var/lib/    │
    │ pgvector │  │  nomic-      │  │ void/blobs/  │
    │ + pgboss │  │  embed-text) │  │              │
    │ tables)  │  └──────────────┘  └──────────────┘
    └──────────┘
```

**Process model.** Workers and HTTP handlers share the void-server Node process. pg-boss polls Postgres on its own interval; HTTP requests enqueue jobs and return immediately with a `job_id`. There is no separate worker process; that arrives in Plan 4 with the Python service.

**External dependencies.** Postgres (already there), Ollama on CT 102 at `http://192.168.1.185:11434` (running, `nomic-embed-text` pulled, 768-dim embeddings verified 2026-06-01). Graceful-degrade still applies if it goes down later. Blob storage is local FS on CT 311's root pool, content-addressed.

**No new entity tables.** refs / pages / source_docs / attachments are reused. The `embedding vector(1024)` columns exist from Plan 1 (migrations 002 + 004). pg-boss creates its own schema (`pgboss.*`) on first run.
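The 768-dim vectors from `nomic-embed-text` land in `vector(1024)` columns by zero-padding (detailed in Phase C). The following is a minimal sketch of why that padding should not disturb cosine-based ranking. `zeroPad` and `cosineSimilarity` are hypothetical helper names used only for illustration, and the short 3-element vectors stand in for real embeddings.

```javascript
// Hypothetical helpers for illustration; names are not from the Void codebase.

// Pad a vector with trailing zeros up to `dim` entries.
function zeroPad(vec, dim = 1024) {
  if (vec.length > dim) {
    throw new RangeError(`vector has ${vec.length} dims, column holds ${dim}`);
  }
  return [...vec, ...new Array(dim - vec.length).fill(0)];
}

// Plain cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for 768-dim embeddings.
const a = [1, 2, 3];
const b = [2, 0, 1];

// Trailing zeros add nothing to the dot product or to either norm,
// so the similarity is the same before and after padding.
const before = cosineSimilarity(a, b);
const after = cosineSimilarity(zeroPad(a, 8), zeroPad(b, 8));
console.log(before, after);
```

Because pgvector's `<=>` operator measures cosine distance, padded vectors should rank the same as unpadded ones, as long as every row in a column is padded the same way.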
## Phase A: Queue + worker harness + Jobs API

**New files:**

- `lib/jobs/queue.js`: singleton pg-boss client; `start()`, `enqueue(name, data, opts)`, `subscribe(name, handler, opts)`.
- `lib/jobs/index.js`: registers all worker handlers on start; called from `server.js` boot.
- `lib/jobs/workers/echo.js`: trivial worker used to prove the harness. Removed at end of Phase D.
- `lib/api/routes/jobs.js`: `GET /api/jobs?state=`, `GET /api/jobs/:id`, `POST /api/jobs/:id/retry`, `DELETE /api/jobs/:id`. Owner-only.
- `tests/jobs/queue.test.js`: pg-boss roundtrip: enqueue → handler runs → result.
- `tests/api/jobs.test.js`: list/retry/delete via HTTP.

**Modify:**

- `server.js`: call `jobs.start()` on boot, `jobs.shutdown()` on SIGTERM.
- `package.json`: add `pg-boss@^10`.
- `lib/api/index.js`: mount `/api/jobs`.
- `public/router.js` + `public/app.js` + add `public/views/jobs.js`: minimal Jobs view (placeholder for now; fleshed out in Phase D).

**pg-boss config.** One pg-boss instance per process. Uses the existing `DATABASE_URL`. Default `pgboss` schema name. `newJobCheckIntervalSeconds: 2` (alpha-tier; tighten later if needed). `archiveCompletedAfterSeconds: 86_400` (1 day archive). `deleteAfterDays: 7`.

**Concurrency limits** per the master spec, surfaced via `subscribe(name, handler, {teamSize, teamConcurrency})`:

| Worker name | Team size | Reason |
|---|---|---|
| `ingest.url` | 4 | Network-bound |
| `ingest.karakeep` | 4 | Network-bound |
| `ingest.blob` | 2 | Disk + sha256 hashing |
| `embed.text` | 2 | Ollama-bound (single GPU on CT 102) |

**Retry policy.** Per-worker `retryLimit: 5`, `retryBackoff: true`, `retryDelay: 10` (seconds). Effective backoff sequence: 10 s, 20 s, 40 s, 80 s, 160 s (310 s in total), then dead-letter. The master spec called for 10 s / 60 s / 5 m, but pg-boss only exposes exponential backoff with a base delay; the resulting curve is close enough.

**Dead-letter.** pg-boss's archive table (`pgboss.archive`) keeps failed jobs.
`/api/jobs?state=failed` queries it. Manual retry copies to active.

**Commit:** `feat(jobs): pg-boss harness + Jobs API`.

## Phase B: Capture API + URL worker + blob storage

**Capture POST.** `POST /api/capture` (owner or agent with write tier):

```json
{
  "space_id": "uuid",
  "url": "https://example.com/article",
  "hint": { "project_id": "uuid?", "title": "string?", "tags": ["string"] }
}
```

Response 202 with `{ job_id, idempotency_key, ref_id?: uuid }`. Idempotency key is `sha256(space_id + url)`. If a ref already exists for that key, the response carries the existing `ref_id` and `job_id: null` (no new job enqueued).

**URL worker.** `lib/jobs/workers/url.js` for `ingest.url`:

1. Compute idempotency key. If a `refs` row already exists with `source_kind='url'` and `external_id=`, return its id.
2. `fetch(url)` with `User-Agent: void-ingest/2.0` and 15 s timeout.
3. Run readability extraction (npm `@mozilla/readability` + `jsdom`). Pull `title`, `byline`, `excerpt`, `textContent`, `siteName`.
4. Insert a `refs` row: `kind='url'`, `source_url=url`, `title=readability.title`, `summary=readability.excerpt`, `body_text=readability.textContent` (truncate to 200 kB), `source_kind='url'`, `external_id=`, `metadata={ site_name, byline, content_length }`.
5. Return the ref. Embedding is handled by Phase C's repo-level trigger that wraps `refs.create`; in Phase B alone the ref simply lacks an embedding until Phase C ships.

**Drag-drop.** `POST /api/capture/upload` (multipart, owner or agent write):

- Field `file`: the binary.
- Field `space_id`: required.
- Field `meta` (json): optional `{ title, kind, tags }`.

Multer stages uploads in `/var/lib/void/uploads-tmp/` (size cap 100 MB per file) and the worker moves the file into the content-addressed blob store on success.

Worker `ingest.blob`:

1. Stream the upload to a temp file. Hash with sha256 as it streams.
2. If `/var/lib/void/blobs//` exists, this is a duplicate; reuse the existing path.
3. Otherwise move the temp file into place.
4. Determine `kind` from `Content-Type` / extension: `image` for image/*, `pdf` for application/pdf, `file` for everything else. Video/audio fall through to `file` in Plan 3 (Plan 4 picks them up).
5. Insert a `refs` row: `kind=`, `blob_path=`, `title=filename || sha`, plus metadata.
6. The insert goes through `refs.create`, so Phase C's trigger picks up the embed automatically. In Phase B, no embed runs.

**Blob storage.** New directory `/var/lib/void/blobs/` on CT 311, owned by `void:void`, mode 750. Layout `/`. The deploy bootstrap step creates the directory. It is already on `localzfs`, so replication picks it up.

**Files:**

- `lib/api/routes/capture.js`: both endpoints + multer config.
- `lib/jobs/workers/url.js`, `lib/jobs/workers/blob.js`.
- `lib/ingest/readability.js`: wraps `@mozilla/readability` for testability.
- `lib/ingest/blob_store.js`: sha + path resolution + write.
- `tests/api/capture.test.js`, `tests/jobs/workers/url.test.js`, `tests/jobs/workers/blob.test.js`.

**Deps to add:** `@mozilla/readability`, `jsdom`, `multer` (`pg-boss` is already added in Phase A).

**Commit:** `feat(jobs): capture API + URL + blob workers`.

## Phase C: Embeddings + hybrid search

**Ollama client.** `lib/ai/ollama.js`:

```js
// Base URL comes from the environment; the default is the CT 102 instance.
const OLLAMA_URL = process.env.OLLAMA_URL ?? 'http://192.168.1.185:11434';

// Error shape is an assumption: carries the HTTP status alongside the body text.
class OllamaError extends Error {
  constructor(status, body) { super(`Ollama ${status}: ${body}`); this.status = status; }
}

async function embedText(text, model = 'nomic-embed-text') {
  const res = await fetch(`${OLLAMA_URL}/api/embeddings`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt: text }),
    signal: AbortSignal.timeout(60_000) // abort after 60 s
  });
  if (!res.ok) throw new OllamaError(res.status, await res.text());
  const j = await res.json();
  return j.embedding; // 768-dim
}
```

`OLLAMA_URL` env var, default `http://192.168.1.185:11434`. The 768-dim vector is zero-padded to 1024 to match the `vector(1024)` column (per master spec, eases later model swap).

**Embed worker.** `embed.text` job payload `{ entity_type, entity_id }`. Worker:

1. Load the entity row.
2. Build the embedding string:
   - `page`: `${title}\n\n${body_md}`, truncated to ~6 k characters (≈ 1.5 k tokens; well under nomic's 8 k context).
   - `ref`: `${title || ''}\n${summary || ''}\n${body_text || ''}`, same truncation.
   - `source_doc`: `${name}\n${body_text || ''}`.
   - `conversation`: `${title || ''}\n${summary || ''}` (short by design; conversations get richer treatment in Plan 5).
3. Call `embedText`. On `OllamaError` or fetch timeout, throw; pg-boss retry kicks in with exponential backoff.
4. Zero-pad to 1024, UPDATE the entity's `embedding` column.
5. Emit an audit log entry `(actor_kind='worker', action='update', entity_type, entity_id, diff={embedding:'updated'})`.

**Re-embed triggers.** Write paths (`repo.create`, `repo.update`) for pages/refs/source_docs already exist. Add a small `lib/jobs/triggers.js` that wraps these: after a successful create/update of an embeddable entity, enqueue `embed.text` with a singleton key `${entity_type}:${entity_id}` so rapid re-edits coalesce. The trigger is called from repo level so MCP and cron paths get it too.

**Hybrid search.** Rewrite `lib/db/repos/search.js::fts` into `search.hybrid({ q, space_id?, kinds?, limit, offset })`:

1. FTS branch: current Plan 2 query unchanged, returns up to `limit * 3` results with `ts_rank`.
2. Vector branch: embed `q` via Ollama (with a 5 s timeout, since search must stay snappy). For each kind, run an ANN query against the matching table's `embedding` column using HNSW (`<=>` cosine distance). Returns up to `limit * 3` per kind. If Ollama times out or errors, skip this branch entirely: log a `search.vector_skipped` event and continue with FTS-only.
3. RRF fusion: for each unique `(kind, id)`, sum `1 / (60 + rank_fts) + 1 / (60 + rank_vec)`. The `60` constant matches the canonical RRF paper. Sort, then slice to the half-open range `[offset, offset+limit)`.
Vector-only rows (no FTS match) and FTS-only rows (no embedding yet) both participate; missing rank is treated as infinity, giving `1 / inf = 0` from that branch.

Result shape unchanged: `{ kind, id, space_id, title_or_snippet, rank }`. The `rank` field now carries the fused RRF score.

**Files:**

- `lib/ai/ollama.js` (new).
- `lib/jobs/workers/embed.js` (new).
- `lib/jobs/triggers.js` (new).
- `lib/db/repos/search.js` (rewrite).
- `tests/ai/ollama.test.js`: fetch mock.
- `tests/jobs/workers/embed.test.js`: fetch mock; verifies zero-pad + audit.
- `tests/repos/search.test.js` (existing): extended with vector-fixture rows + RRF assertions.

**Embedding-test strategy.** Tests insert fixture vectors directly (no Ollama needed). One integration test under `tests/integration/embed_live.test.js` hits a real Ollama, marked `skip()` if `OLLAMA_URL` is unreachable.

**Repos that emit triggers:** pages.create, pages.update, refs.create, refs.update, refs.upsertByExternal, source_docs.create, source_docs.update. Conversation embeds are summary-only and re-fire when `setSummary` is called.

**Commit:** `feat(jobs): embed worker + hybrid search`.

## Phase D: Karakeep webhook + drag-drop UI + Jobs UI

**Karakeep webhook.** `POST /api/ingest/karakeep`. Authenticated by an `X-Karakeep-Signature: sha256=` header carrying an HMAC of the raw body, keyed with the `KARAKEEP_WEBHOOK_SECRET` env var. If the signature is missing or wrong: 401.

Payload (Karakeep's webhook shape, normalized): `{ event, bookmark_id, tags }`. For `event === 'bookmark.created'`:

1. Look up the existing space-mapping from env: `KARAKEEP_DEFAULT_SPACE_ID` (a UUID). Future work: per-tag space routing.
2. Enqueue `ingest.karakeep` with `{ bookmark_id, space_id }`.

`ingest.karakeep` worker:

1. Fetch the bookmark via Karakeep's API: `GET https://karakeep.hynesy.com/api/v1/bookmarks/{bookmark_id}` with `KARAKEEP_API_TOKEN`.
2. Build the same payload an `ingest.url` job would use (URL + title + tags) and call the URL handler directly.
Tags propagate to the `entity_tags` table via repo.
3. If Karakeep returns 404 (bookmark deleted), mark the job done; no error.

**Drag-drop UI.** `public/components/dropzone.js` wraps a target element, intercepts drag events, POSTs each file to `/api/capture/upload`, and shows toast progress. Wire onto `
` so dropping anywhere in the main area works. Pre-fills `space_id` with `localStorage.last_space_id` (set when the user navigates to a space view).

**Jobs UI fill-in.** Expand `public/views/jobs.js`:

- Group rows by `state` (active / completed / failed).
- Each row: `id (8 chars)`, `name`, `state`, relative `created_at`, `last_error?`, action buttons.
- Polls `/api/jobs?state=active,failed` every 10 s.
- The retry button sends `POST /api/jobs/:id/retry`; the delete button sends `DELETE /api/jobs/:id`.

**Files:**

- `lib/api/routes/ingest.js`.
- `lib/jobs/workers/karakeep.js`.
- `lib/karakeep/client.js`: thin wrapper.
- `public/components/dropzone.js`.
- `public/views/jobs.js` (expand).
- `tests/api/ingest.test.js`: HMAC check, valid/invalid signature.
- `tests/jobs/workers/karakeep.test.js`: Karakeep API mocked via fetch interceptor.

**Commit:** `feat(jobs): Karakeep webhook + drag-drop + Jobs UI`.

## Error handling & idempotency

- **Idempotency keys.** URL and Karakeep workers compute `sha256(space_id + url)` (URL) or `sha256(space_id + 'karakeep:' + bookmark_id)` (Karakeep). Stored as `refs.external_id` with `source_kind` set to `'url'` or `'karakeep'`. The unique index `idx_refs_external_unique` already enforces this from Plan 1. A duplicate ingest finds the existing ref and short-circuits.
- **Singleton embed jobs.** pg-boss `singletonKey: '${entity_type}:${entity_id}'` so rapid edits coalesce into one pending embed. If a job is already in-flight when a new edit lands, a follow-up is enqueued.
- **Capture rate limit.** Out of scope. The `agentOrOwner` gate is enough at single-user scale.
- **Ollama down.** Embed jobs throw and retry under pg-boss backoff. After dead-letter (≈ 5 min cumulative), the entity stays without an embedding; hybrid search falls back to FTS for those rows. Once the operator restores Ollama, either `POST /api/jobs/:id/retry` or wait for the periodic re-embed cron in a future phase.
- **Karakeep down.** Webhook still accepts.
The worker dead-letters; tag mapping is replayed manually by the operator.
- **Blob upload partial.** Stream to temp; rename on success only. Failed uploads leave a temp file; a daily cron in Plan 4 sweeps temp files older than 24 h.

## Observability

- Pino structured logs already in place. New log keys: `job_id`, `job_name`, `entity_type`, `entity_id`, `idempotency_key`, `outcome`.
- `/api/jobs` is the operator surface; the SPA Jobs view fronts it.
- pg-boss's archive table is the source of truth for completed/failed jobs; no separate audit needed for job lifecycle (the audit log captures entity-level changes the workers cause).

## Testing strategy

- **Unit:** workers and the Ollama client get unit tests with `fetch` mocked (vitest's `vi.fn`).
- **Repo:** `tests/repos/search.test.js` extended; new `tests/repos/jobs.test.js` covers `pg-boss`-backed list/retry helpers.
- **API:** capture, ingest, jobs routes via supertest. HMAC signature pass/fail. Idempotency on second capture of the same URL.
- **Integration (gated):** one test that hits real Ollama; auto-skipped if `OLLAMA_URL` is unreachable. Real pg-boss roundtrips happen inside the existing test DB, using `resetDb` and stopping the pg-boss instance (`await boss.stop()`) between suites to avoid cross-talk.
- **No new vitest config.** `fileParallelism: false` is already in place from Plan 1; pg-boss is happier serialized too.

## Migrations

- **No new SQL migrations from Void.** pg-boss creates its own schema on first `start()`.
- One-time CT 311 ops: create `/var/lib/void/blobs/` and chown `void:void`.

## Deploy delta

- `.env` adds `OLLAMA_URL`, `KARAKEEP_WEBHOOK_SECRET`, `KARAKEEP_API_TOKEN`, `KARAKEEP_API_URL`, `KARAKEEP_DEFAULT_SPACE_ID`. Documented in `deploy/README.md`.
- `deploy/push.sh` unchanged (rsync still works).
- Snapshot CT 310 + 311 before deploying Plan 3 (standing rule). The Phase A first-deploy is the "major update": pg-boss creates new tables in the shared DB.

## Known follow-ups (not Plan 3)

- AI Space/Project suggestion on capture.
- Embedding chunks table.
- pdf-text-extract for born-digital PDFs (Plan 4 likely handles this with Tesseract too).
- Per-tag Karakeep → Space routing instead of one default space.
- Recurring re-embed cron for rows where `embedding IS NULL`.
- Real-time Jobs UI via `pg LISTEN/NOTIFY` instead of polling.

## Open items for the user

- **Karakeep secrets.** Plan 3 Phase D needs `KARAKEEP_API_TOKEN` (issued from Karakeep settings) and a chosen `KARAKEEP_DEFAULT_SPACE_ID`. Both can be supplied when the phase starts.
- **The 29-day-old `knowledge_pipeline` memory** (Karakeep → Qdrant → MCP) is now superseded by Void 2.0's pgvector-only architecture. After Plan 3 ships, that memory should be marked obsolete or deleted to avoid future-me reading it as authoritative.
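For reference, the RRF fusion step described in Phase C can be sketched in a few lines. This is a sketch under stated assumptions, not the actual `search.js` code: `rrfFuse` is a hypothetical name, and each branch is assumed to arrive as a single array of `{ kind, id }` rows sorted best-first, so a row's rank is its 1-based position.

```javascript
// Hypothetical sketch of Reciprocal Rank Fusion; not the actual search.js code.
const RRF_K = 60; // the constant named in the hybrid search step

function rrfFuse(ftsRows, vecRows, { limit = 10, offset = 0 } = {}) {
  const scores = new Map(); // "kind:id" -> { kind, id, rank }

  const addBranch = (rows) => {
    rows.forEach((row, i) => {
      const key = `${row.kind}:${row.id}`;
      const entry = scores.get(key) ?? { kind: row.kind, id: row.id, rank: 0 };
      // Position i is rank i + 1. A row absent from a branch adds nothing,
      // which matches treating its missing rank as infinity.
      entry.rank += 1 / (RRF_K + i + 1);
      scores.set(key, entry);
    });
  };

  addBranch(ftsRows);
  addBranch(vecRows); // pass [] when the vector branch was skipped

  return [...scores.values()]
    .sort((a, b) => b.rank - a.rank) // highest fused score first
    .slice(offset, offset + limit); // half-open window
}

// Toy inputs: "page:b" appears in both branches, so it should win.
const fused = rrfFuse(
  [{ kind: 'ref', id: 'a' }, { kind: 'page', id: 'b' }],
  [{ kind: 'page', id: 'b' }, { kind: 'ref', id: 'c' }],
);
console.log(fused.map((r) => `${r.kind}:${r.id}`));
// → [ 'page:b', 'ref:a', 'ref:c' ]
```

Passing an empty array for the vector branch reproduces the FTS-only fallback: the ordering is then exactly the FTS ordering, with scores of `1 / (60 + rank_fts)`.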