Void 2.0 — Plan 3 Design Spec: Capture pipeline + hybrid search

Date: 2026-06-01
Builds on: Plan 1 (Foundation, complete) and Plan 2 (API + UI shell, complete, version 2.0.0-alpha.2).
Master spec: docs/superpowers/specs/2026-05-31-void-v2-design.md (many decisions inherit from there).

Goal

Wire the Plan 2 SPA's stub Capture button to a real ingest pipeline. Add a pg-boss-backed job queue, capture entry points (URL POST + Karakeep webhook + drag-drop attachment), a URL worker that turns links into refs, an embeddings worker that writes vectors into the existing embedding columns, and a hybrid FTS+vector search that replaces the Plan 2 FTS-only /api/search.

Out of scope (Plan 4 and later)

  • Whisper transcription, Tesseract OCR, yt-dlp video ingestion, scanned-PDF OCR.
  • The Python void-workers service. Plan 3 stays single-process Node.
  • AI Space/Project suggestion on capture (defer; capture takes explicit space_id).
  • Embedding chunks table — Plan 3 uses one whole-doc embedding per entity row; chunks land later once we can measure recall on a real corpus.
  • MCP server surface. Plan 5+.

Decisions locked by brainstorm

| Question | Answer |
| --- | --- |
| Plan 3 slice | Node-side: pg-boss + /api/capture POST + Karakeep webhook + URL worker + embed.text worker + hybrid search + Jobs panel. Defers ML-heavy ingest to Plan 4. |
| Capture entry points | /api/capture POST + Karakeep webhook + drag-drop upload. Inbound email skipped. |
| Embedding granularity | Whole-doc per entity row. Add chunks table later. |
| Search rollout | /api/search replaced in-place with hybrid (FTS + vector via RRF). Vector branch graceful-degrades to FTS-only if Ollama is down or the row lacks an embedding. |
| AI Space/Project suggestion | Deferred. Capture requires space_id. SPA preselects the user's last-used space from localStorage. |
| Jobs visibility | GET /api/jobs?state= + POST /api/jobs/:id/retry + DELETE /api/jobs/:id + a minimal #/jobs SPA view (table grouped by state, 10 s polling, retry/delete per row). |
| Sequencing | Phase A → B → C → D (matches Plan 2 phasing). Each phase ends green and demoable. |

Architecture

                     ┌──────────────────────────────────────────┐
                     │  void-server  (CT 311, Node, single proc)│
                     │                                          │
   /api/capture ───▶ │  routes/capture.js                       │
   /api/ingest/      │  routes/ingest.js (Karakeep webhook)     │
     karakeep ─────▶ │      │                                   │
   drag-drop  ─────▶ │      ▼                                   │
                     │  jobs/queue.js (pg-boss client)          │
                     │      │                                   │
                     │      ▼                                   │
                     │  workers/  (in-process pollers)          │
                     │   ├─ url.js                              │
                     │   ├─ karakeep.js                         │
                     │   ├─ embed.js   (Ollama HTTP)            │
                     │   └─ blob.js    (drag-drop attachments)  │
                     │      │                                   │
                     │      ▼                                   │
                     │  lib/db/repos/ (existing) + repos/jobs.js│
                     │      │                                   │
                     └──────┼───────────────────────────────────┘
                            │
              ┌─────────────┼──────────────┐
              ▼             ▼              ▼
       ┌──────────┐  ┌──────────────┐  ┌──────────────┐
       │ Postgres │  │  Ollama      │  │ Blob FS      │
       │ (CT 310, │  │  (CT 102,    │  │ /var/lib/    │
       │ pgvector │  │   nomic-     │  │  void/blobs/ │
       │ + pgboss │  │   embed-text)│  │              │
       │ tables)  │  └──────────────┘  └──────────────┘
       └──────────┘

Process model. Workers and HTTP handlers share the void-server Node process. pg-boss polls Postgres on its own interval; HTTP requests enqueue jobs and return immediately with a job_id. No separate worker process — that's Plan 4 when the Python service arrives.

External dependencies. Postgres (already there), Ollama on CT 102 at http://192.168.1.185:11434 (running, nomic-embed-text pulled, 768-dim embeddings verified 2026-06-01). Graceful-degrade still applies if it goes down later. Blob storage is local FS on CT 311's root pool, content-addressed.

No new entity tables. refs / pages / source_docs / attachments are reused. The embedding vector(1024) columns exist from Plan 1 (migration 002 + 004). pg-boss creates its own schema (pgboss.*) on first run.

Phase A — Queue + worker harness + Jobs API

New files:

  • lib/jobs/queue.js — singleton pg-boss client; start(), enqueue(name, data, opts), subscribe(name, handler, opts).
  • lib/jobs/index.js — registers all worker handlers on start; called from server.js boot.
  • lib/jobs/workers/echo.js — trivial worker used to prove the harness. Removed at end of Phase D.
  • lib/api/routes/jobs.js — GET /api/jobs?state=, GET /api/jobs/:id, POST /api/jobs/:id/retry, DELETE /api/jobs/:id. Owner-only.
  • tests/jobs/queue.test.js — pg-boss roundtrip: enqueue → handler runs → result.
  • tests/api/jobs.test.js — list/retry/delete via HTTP.

Modify:

  • server.js — call jobs.start() on boot, jobs.shutdown() on SIGTERM.
  • package.json — add pg-boss@^10.
  • lib/api/index.js — mount /api/jobs.
  • public/router.js + public/app.js + add public/views/jobs.js — minimal Jobs view (placeholder for now; fleshed in Phase D).

pg-boss config. One pg-boss instance per process. Uses the existing DATABASE_URL. Default pg-boss schema name. newJobCheckIntervalSeconds: 2 (alpha-tier; tighten later if needed). archiveCompletedAfterSeconds: 86_400 (1 day archive). deleteAfterDays: 7.

Concurrency limits per the master spec, surfaced via subscribe(name, handler, {teamSize, teamConcurrency}):

| Worker name | Team size | Reason |
| --- | --- | --- |
| ingest.url | 4 | Network-bound |
| ingest.karakeep | 4 | Network-bound |
| ingest.blob | 2 | Disk + sha256 hashing |
| embed.text | 2 | Ollama-bound (single GPU on CT 102) |

Retry policy. Per-worker retryLimit: 5, retryBackoff: true, retryDelay: 10 (seconds). Effective backoff sequence: 10 s, 20 s, 40 s, 80 s, 160 s (310 s cumulative), then dead-letter. The master spec called out 10 s / 60 s / 5 min, but pg-boss only exposes exponential backoff from a base delay; the resulting curve is close enough.
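The backoff sequence above is the base delay doubled on each retry. The sketch below reproduces that arithmetic, assuming pure doubling with no jitter; backoffSchedule is a hypothetical helper for illustration, not a pg-boss API, and pg-boss's actual scheduling may differ slightly.

```javascript
// Hypothetical helper: lists the delay before each retry, assuming the
// delay doubles from the base value on every attempt (no jitter).
function backoffSchedule(retryLimit, retryDelaySeconds) {
  const delays = [];
  for (let attempt = 0; attempt < retryLimit; attempt += 1) {
    delays.push(retryDelaySeconds * 2 ** attempt);
  }
  return delays;
}

const delays = backoffSchedule(5, 10);
const total = delays.reduce((sum, d) => sum + d, 0);
console.log(delays, total);
```

With retryLimit 5 and retryDelay 10 the delays sum to 310 s, which is where the "≈ 5 min cumulative" figure for dead-lettering comes from.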

Dead-letter. pg-boss's archive table (pgboss.archive) keeps failed jobs. /api/jobs?state=failed queries it. A manual retry copies the job back to the active queue.

Commit: feat(jobs): pg-boss harness + Jobs API.

Phase B — Capture API + URL worker + blob storage

Capture POST. POST /api/capture (owner or agent with write tier):

{
  "space_id": "uuid",
  "url": "https://example.com/article",
  "hint": { "project_id": "uuid?", "title": "string?", "tags": ["string"] }
}

Response 202 with { job_id, idempotency_key, ref_id?: uuid }. Idempotency key is sha256(space_id + url). If a ref already exists for that key, the response carries the existing ref_id and job_id: null (no new job enqueued).
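A minimal sketch of the key derivation, using Node's built-in crypto module. The function names are hypothetical, and the sketch assumes the URL is hashed exactly as received (no normalization), since the spec does not call for any. The Karakeep variant follows the formula given under "Error handling & idempotency".

```javascript
const { createHash } = require('node:crypto');

// sha256 over the plain concatenation, returned as 64 hex characters.
function sha256Hex(input) {
  return createHash('sha256').update(input, 'utf8').digest('hex');
}

// Key for a URL capture: sha256(space_id + url).
function urlIdempotencyKey(spaceId, url) {
  return sha256Hex(spaceId + url);
}

// Key for a Karakeep bookmark: sha256(space_id + 'karakeep:' + bookmark_id).
function karakeepIdempotencyKey(spaceId, bookmarkId) {
  return sha256Hex(`${spaceId}karakeep:${bookmarkId}`);
}

const spaceA = '11111111-1111-4111-8111-111111111111'; // example UUIDs only
const spaceB = '22222222-2222-4222-8222-222222222222';
console.log(urlIdempotencyKey(spaceA, 'https://example.com/article'));
```

Because space_id is part of the hash input, the same URL captured into two different spaces produces two distinct keys and therefore two refs.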

URL worker. lib/jobs/workers/url.js for ingest.url:

  1. Compute idempotency key. If a refs row already exists with source_kind='url' and external_id=<key>, return its id.
  2. fetch(url) with User-Agent: void-ingest/2.0 and 15 s timeout.
  3. Run readability extraction (npm @mozilla/readability + jsdom). Pull title, byline, excerpt, textContent, siteName.
  4. Insert a refs row: kind='url', source_url=url, title=readability.title, summary=readability.excerpt, body_text=readability.textContent (truncate to 200 kB), source_kind='url', external_id=<idempotency_key>, metadata={ site_name, byline, content_length }.
  5. Return the ref. Embedding is handled by Phase C's repo-level trigger that wraps refs.create; in Phase B alone the ref simply lacks an embedding until Phase C ships.
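Step 4 can be sketched as a pure mapping from the readability result to a refs row. Several things here are assumptions rather than decisions from this spec: the helper names, reading "200 kB" as 200,000 UTF-8 bytes, the presence of a space_id column on the row, and using the extractor's length field for content_length.

```javascript
const MAX_BODY_BYTES = 200 * 1000; // assumption: "200 kB" means 200,000 UTF-8 bytes

// Cuts text to at most maxBytes of UTF-8 without splitting a multi-byte character.
function truncateUtf8(text, maxBytes) {
  const buf = Buffer.from(text, 'utf8');
  if (buf.length <= maxBytes) return text;
  let end = maxBytes;
  // If the first excluded byte is a continuation byte (10xxxxxx), the cut
  // would land mid-character, so back up to the start of that character.
  while (end > 0 && (buf[end] & 0xc0) === 0x80) end -= 1;
  return buf.subarray(0, end).toString('utf8');
}

// Hypothetical mapper: `article` is the object returned by the readability wrapper.
function buildUrlRefRow({ spaceId, url, key, article }) {
  return {
    space_id: spaceId, // assumed column
    kind: 'url',
    source_url: url,
    title: article.title,
    summary: article.excerpt,
    body_text: truncateUtf8(article.textContent || '', MAX_BODY_BYTES),
    source_kind: 'url',
    external_id: key,
    metadata: {
      site_name: article.siteName,
      byline: article.byline,
      content_length: article.length, // assumed source for content_length
    },
  };
}
```

Keeping the mapping free of I/O means tests/jobs/workers/url.test.js can cover it without mocking fetch.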

Drag-drop. POST /api/capture/upload (multipart, owner or agent write):

  • Field file — the binary.
  • Field space_id — required.
  • Field meta (json) — optional { title, kind, tags }.

Multer stages uploads in /var/lib/void/uploads-tmp/ (size cap 100 MB per file) and the worker moves the file into the content-addressed blob store on success.

Worker ingest.blob:

  1. Read the staged upload back as a stream, hashing it with sha256 as it streams.
  2. If /var/lib/void/blobs/<sha-prefix>/<sha> exists, this is a duplicate; reuse the existing path.
  3. Otherwise move the temp file into place.
  4. Determine kind from Content-Type / extension: image for image/*, pdf for application/pdf, file for everything else. Video/audio fall through to file in Plan 3 (Plan 4 picks them up).
  5. Insert a refs row via refs.create: kind=<derived>, blob_path=<path>, title=filename || sha, plus metadata.
  6. Because the insert goes through refs.create, Phase C's trigger picks up the embed automatically. In Phase B, no embed runs.
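The kind derivation in step 4 could look roughly like this. The function name, the list of image extensions, and the choice to consult the extension only when the Content-Type is missing or generic are all assumptions made for illustration.

```javascript
const path = require('node:path');

// Assumed extension list; the spec only says "Content-Type / extension".
const IMAGE_EXTENSIONS = new Set(['.png', '.jpg', '.jpeg', '.gif', '.webp']);

function deriveKind(contentType, filename) {
  // Drop parameters such as "; charset=binary" before comparing.
  const type = (contentType || '').toLowerCase().split(';')[0].trim();
  if (type.startsWith('image/')) return 'image';
  if (type === 'application/pdf') return 'pdf';

  // Fall back to the extension only when the type tells us nothing.
  if (type === '' || type === 'application/octet-stream') {
    const ext = path.extname(filename || '').toLowerCase();
    if (IMAGE_EXTENSIONS.has(ext)) return 'image';
    if (ext === '.pdf') return 'pdf';
  }

  // Video, audio, and everything else stay 'file' in Plan 3.
  return 'file';
}

console.log(deriveKind('video/mp4', 'clip.mp4'));
```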

Blob storage. New directory /var/lib/void/blobs/ on CT 311, owned by void:void, mode 750. Layout: <first-2-chars-of-sha>/<full-sha>. The deploy bootstrap step creates the directory. The path is already on localzfs, so replication picks it up.
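The content-addressed layout can be sketched as follows. The helper names are hypothetical, and the sketch hashes an in-memory buffer for brevity; a streaming worker would call update() on the same hash object once per chunk.

```javascript
const { createHash } = require('node:crypto');
const path = require('node:path');

const BLOB_ROOT = '/var/lib/void/blobs';

// Hex sha256 of a buffer.
function sha256Hex(buffer) {
  return createHash('sha256').update(buffer).digest('hex');
}

// Resolves <root>/<first-2-chars-of-sha>/<full-sha>.
function blobPathFor(sha, root = BLOB_ROOT) {
  if (!/^[0-9a-f]{64}$/.test(sha)) {
    throw new Error('expected a lowercase hex sha256 digest');
  }
  return path.posix.join(root, sha.slice(0, 2), sha);
}

const sha = sha256Hex(Buffer.from('hello'));
console.log(blobPathFor(sha));
```

Two uploads with identical bytes resolve to the same path, which is what makes the duplicate check in the worker a simple existence test.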

Files:

  • lib/api/routes/capture.js — both endpoints + multer config.
  • lib/jobs/workers/url.js, lib/jobs/workers/blob.js.
  • lib/ingest/readability.js — wraps @mozilla/readability for testability.
  • lib/ingest/blob_store.js — sha + path resolution + write.
  • tests/api/capture.test.js, tests/jobs/workers/url.test.js, tests/jobs/workers/blob.test.js.

Deps to add: @mozilla/readability, jsdom, multer (pg-boss was already added in Phase A).

Commit: feat(jobs): capture API + URL + blob workers.

Phase C — Embed worker + hybrid search

Ollama client. lib/ai/ollama.js:

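// Base URL of the Ollama HTTP API; override with the OLLAMA_URL env var.
const OLLAMA_URL = process.env.OLLAMA_URL || 'http://192.168.1.185:11434';

class OllamaError extends Error {
  constructor(status, body) {
    super(`Ollama responded ${status}: ${body}`);
    this.status = status; // lets callers tell HTTP failures apart
  }
}

// Returns the embedding for `text` as an array of 768 floats (nomic-embed-text).
async function embedText(text, model = 'nomic-embed-text') {
  const res = await fetch(`${OLLAMA_URL}/api/embeddings`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt: text }),
    signal: AbortSignal.timeout(60_000), // abort after 60 s
  });
  if (!res.ok) throw new OllamaError(res.status, await res.text());
  const j = await res.json();
  return j.embedding;
}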

OLLAMA_URL env var, default http://192.168.1.185:11434. The 768-dim vector is zero-padded to 1024 to match the vector(1024) column (per master spec, eases later model swap).
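A short sketch of the zero-padding step, with a check that it does not disturb cosine similarity: the padded positions contribute zero to both the dot product and the norms. The names zeroPad and cosine are hypothetical helpers for illustration.

```javascript
const TARGET_DIM = 1024;

// Appends zeros until the vector matches the vector(1024) column.
function zeroPad(vector, dim = TARGET_DIM) {
  if (vector.length > dim) {
    throw new Error(`embedding has ${vector.length} dims, column holds ${dim}`);
  }
  return [...vector, ...new Array(dim - vector.length).fill(0)];
}

// Plain cosine similarity, used here only to demonstrate the invariance.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / Math.sqrt(normA * normB);
}

const a = [1, 2, 3];
const b = [4, 5, 6];
console.log(cosine(a, b), cosine(zeroPad(a), zeroPad(b)));
```

The invariance only holds while every stored vector comes from the same 768-dim model; after a model swap, old and new rows are not comparable and need re-embedding.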

Embed worker. embed.text job payload { entity_type, entity_id }. Worker:

  1. Load the entity row.
  2. Build the embedding string:
    • page: ${title}\n\n${body_md}, truncated to ~6 k characters (≈ 1.5 k tokens; well under nomic's 8 k context).
    • ref: ${title || ''}\n${summary || ''}\n${body_text || ''}, same truncation.
    • source_doc: ${name}\n${body_text || ''}.
    • conversation: ${title || ''}\n${summary || ''} — short by design; conversations get richer treatment in Plan 5.
  3. Call embedText. On OllamaError or fetch timeout, throw — pg-boss retry kicks in with exponential backoff.
  4. Zero-pad to 1024, UPDATE the entity's embedding column.
  5. Emit an audit log entry (actor_kind='worker', action='update', entity_type, entity_id, diff={embedding:'updated'}).
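Step 2 of the worker could be sketched as one function that builds the text for each entity type. The function name is hypothetical. The spec states truncation only for page and ref; applying the same 6,000-character cap to every type is an assumption made here for simplicity.

```javascript
const MAX_CHARS = 6000; // "~6 k characters" from the spec

// Builds the string that is sent to the embedding model.
function buildEmbeddingText(entityType, row) {
  let text;
  switch (entityType) {
    case 'page':
      text = `${row.title}\n\n${row.body_md}`;
      break;
    case 'ref':
      text = `${row.title || ''}\n${row.summary || ''}\n${row.body_text || ''}`;
      break;
    case 'source_doc':
      text = `${row.name}\n${row.body_text || ''}`;
      break;
    case 'conversation':
      text = `${row.title || ''}\n${row.summary || ''}`;
      break;
    default:
      throw new Error(`entity type is not embeddable: ${entityType}`);
  }
  // Assumption: cap every type, not only page and ref.
  return text.slice(0, MAX_CHARS);
}

console.log(buildEmbeddingText('ref', { title: 'T', body_text: 'B' }));
```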

Re-embed triggers. Write paths (repo.create, repo.update) for pages/refs/source_docs already exist. Add a small lib/jobs/triggers.js that wraps these — after a successful create/update of an embeddable entity, enqueue embed.text with a singleton key ${entity_type}:${entity_id} so rapid re-edits coalesce. The trigger is called from repo level so MCP and cron paths get it too.
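One way the wrapper in lib/jobs/triggers.js might look. The name withEmbedTrigger is hypothetical; enqueue is assumed to have the (name, data, opts) shape of the queue module from Phase A and is passed in so the sketch stays free of a real pg-boss connection.

```javascript
// Wraps a repo write so that a successful create/update enqueues embed.text.
function withEmbedTrigger(entityType, repoFn, enqueue) {
  return async function wrapped(...args) {
    const row = await repoFn(...args); // a failed write throws before any enqueue
    await enqueue(
      'embed.text',
      { entity_type: entityType, entity_id: row.id },
      { singletonKey: `${entityType}:${row.id}` } // coalesces rapid re-edits
    );
    return row;
  };
}

// Usage with stand-ins for the real repo and queue.
const enqueued = [];
const fakeEnqueue = async (name, data, opts) => { enqueued.push({ name, data, opts }); };
const createRef = withEmbedTrigger('ref', async (input) => ({ id: 'r1', ...input }), fakeEnqueue);
```

In this sketch an enqueue failure propagates to the caller. Logging the failure and returning the row anyway is an equally defensible choice, since a missing embedding only degrades search to FTS for that row.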

Hybrid search. Rewrite lib/db/repos/search.js::fts into search.hybrid({ q, space_id?, kinds?, limit, offset }):

  1. FTS branch — current Plan 2 query unchanged, returns up to limit * 3 results with ts_rank.
  2. Vector branch — embed q via Ollama (with a 5 s timeout — search must stay snappy). For each kind, run an ANN query against the matching table's embedding column using HNSW (<=> cosine distance). Returns up to limit * 3 per kind. If Ollama times out or errors, skip this branch entirely — log a search.vector_skipped event and continue with FTS-only.
  3. RRF fusion — for each unique (kind, id), sum 1 / (60 + rank_fts) + 1 / (60 + rank_vec). The 60 constant matches the canonical RRF paper. Sort, slice to [offset, offset+limit].
  4. Vector-only rows (no FTS match) and FTS-only rows (no embedding yet) both participate; missing rank is treated as infinity, giving 1 / inf = 0 from that branch.

Result shape unchanged: { kind, id, space_id, title_or_snippet, rank }. The rank field now carries the fused RRF score.
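The fusion step can be sketched as below. The function name is hypothetical, and the sketch assumes each branch arrives as an array already sorted best-first (for the vector branch, that means the per-kind ANN results have been merged by distance beforehand). A row missing from one branch simply receives no contribution from it, which is the "1 / inf = 0" case.

```javascript
const RRF_K = 60;

// Fuses two ranked lists of { kind, id, ... } rows by Reciprocal Rank Fusion.
function rrfFuse(ftsRows, vecRows, { limit = 20, offset = 0 } = {}) {
  const fused = new Map();

  const addBranch = (rows) => {
    rows.forEach((row, index) => {
      const key = `${row.kind}:${row.id}`;
      // The branch-specific score is replaced by the fused score in `rank`.
      const entry = fused.get(key) || { ...row, rank: 0 };
      entry.rank += 1 / (RRF_K + index + 1); // ranks are 1-based
      fused.set(key, entry);
    });
  };

  addBranch(ftsRows);
  addBranch(vecRows);

  return [...fused.values()]
    .sort((x, y) => y.rank - x.rank)
    .slice(offset, offset + limit);
}

const fts = [{ kind: 'ref', id: 'a' }, { kind: 'ref', id: 'b' }];
const vec = [{ kind: 'ref', id: 'b' }, { kind: 'page', id: 'c' }];
console.log(rrfFuse(fts, vec).map((r) => r.id));
```

When the vector branch is skipped, passing an empty array for vecRows leaves the FTS order intact, so the FTS-only fallback needs no separate code path.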

Files:

  • lib/ai/ollama.js (new).
  • lib/jobs/workers/embed.js (new).
  • lib/jobs/triggers.js (new).
  • lib/db/repos/search.js (rewrite).
  • tests/ai/ollama.test.js — fetch mock.
  • tests/jobs/workers/embed.test.js — fetch mock; verifies zero-pad + audit.
  • tests/repos/search.test.js (existing) — extended with vector-fixture rows + RRF assertions.

Embedding-test strategy. Tests insert fixture vectors directly (no Ollama needed). One integration test under tests/integration/embed_live.test.js hits a real Ollama, marked skip() if OLLAMA_URL is unreachable.

Repos that emit triggers: pages.create, pages.update, refs.create, refs.update, refs.upsertByExternal, source_docs.create, source_docs.update. Conversation embeds are summary-only and re-fire when setSummary is called.

Commit: feat(jobs): embed worker + hybrid search.

Phase D — Karakeep webhook + drag-drop UI + Jobs UI

Karakeep webhook. POST /api/ingest/karakeep. Authenticated by X-Karakeep-Signature: sha256=<hex> HMAC of the raw body with KARAKEEP_WEBHOOK_SECRET env. If the signature is missing or wrong: 401.
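A minimal sketch of the signature check using Node's crypto module. The function name is hypothetical, and rawBody is assumed to be the unparsed request bytes (a Buffer), since the HMAC must be computed before any JSON parsing changes the payload.

```javascript
const { createHmac, timingSafeEqual } = require('node:crypto');

// Returns true only if the header carries a valid sha256 HMAC of rawBody.
function verifyKarakeepSignature(rawBody, signatureHeader, secret) {
  if (typeof signatureHeader !== 'string' || !signatureHeader.startsWith('sha256=')) {
    return false;
  }
  const expected = createHmac('sha256', secret).update(rawBody).digest();
  const provided = Buffer.from(signatureHeader.slice('sha256='.length), 'hex');
  // timingSafeEqual throws on unequal lengths, so reject those first.
  if (provided.length !== expected.length) return false;
  return timingSafeEqual(provided, expected);
}

// Example values only.
const secret = 'test-secret';
const body = Buffer.from('{"event":"bookmark.created","bookmark_id":"bm_1","tags":[]}');
const goodHeader = `sha256=${createHmac('sha256', secret).update(body).digest('hex')}`;
console.log(verifyKarakeepSignature(body, goodHeader, secret));
```

The route would respond 401 whenever this returns false, covering both the missing-header and wrong-signature cases with one check.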

Payload (Karakeep's webhook shape, normalized): { event, bookmark_id, tags }.

For event === 'bookmark.created':

  1. Look up the existing space-mapping from env: KARAKEEP_DEFAULT_SPACE_ID (a UUID). Future work: per-tag space routing.
  2. Enqueue ingest.karakeep with { bookmark_id, space_id }.

ingest.karakeep worker:

  1. Fetch the bookmark via Karakeep's API: GET https://karakeep.hynesy.com/api/v1/bookmarks/{bookmark_id} with KARAKEEP_API_TOKEN.
  2. Build the same payload an ingest.url job would use (URL + title + tags) and call the URL handler directly. Tags propagate to the entity_tags table via repo.
  3. If Karakeep returns 404 (bookmark deleted), mark the job done — no error.

Drag-drop UI. public/components/dropzone.js — wraps a target element, intercepts drag events, POSTs each file to /api/capture/upload, shows toast progress. Wire onto <main> so dropping anywhere in the main area works. Pre-fills space_id with localStorage.last_space_id (set when the user navigates to a space view).

Jobs UI fill-in. Expand public/views/jobs.js:

  • Group rows by state (active / completed / failed).
  • Each row: id (8 chars), name, state, relative created_at, last_error?, action buttons.
  • Polls /api/jobs?state=active,failed every 10 s.
  • Retry button POSTs /api/jobs/:id/retry; delete button DELETE /api/jobs/:id.

Files:

  • lib/api/routes/ingest.js.
  • lib/jobs/workers/karakeep.js.
  • lib/karakeep/client.js — thin wrapper.
  • public/components/dropzone.js.
  • public/views/jobs.js (expand).
  • tests/api/ingest.test.js — HMAC check, valid/invalid signature.
  • tests/jobs/workers/karakeep.test.js — Karakeep API mocked via fetch interceptor.

Commit: feat(jobs): Karakeep webhook + drag-drop + Jobs UI.

Error handling & idempotency

  • Idempotency keys. URL and Karakeep workers compute sha256(space_id + url) (URL) or sha256(space_id + 'karakeep:' + bookmark_id) (Karakeep). Stored as refs.external_id with source_kind set to 'url' or 'karakeep'. The unique index idx_refs_external_unique already enforces this from Plan 1. A duplicate ingest finds the existing ref and short-circuits.
  • Singleton embed jobs. pg-boss singletonKey: '${entity_type}:${entity_id}' so rapid edits coalesce into one pending embed. If a job is already in-flight when a new edit lands, a follow-up is enqueued.
  • Capture rate limit. Out of scope. The agentOrOwner gate is enough at single-user scale.
  • Ollama down. Embed jobs throw and retry under pg-boss backoff. After dead-letter (≈ 5 min cumulative), the entity stays without an embedding; hybrid search falls back to FTS for those rows. Once the operator restores Ollama, they either POST /api/jobs/:id/retry or wait for the periodic re-embed cron planned for a future phase.
  • Karakeep down. The webhook still accepts events. The worker's job dead-letters, and the operator replays it (tag mapping included) manually.
  • Blob upload partial. Stream to temp; rename on success only. Failed uploads leave a temp file behind; a daily cron in Plan 4 sweeps temp files older than 24 h.

Observability

  • Pino structured logs already in place. New log keys: job_id, job_name, entity_type, entity_id, idempotency_key, outcome.
  • /api/jobs is the operator surface; the SPA Jobs view fronts it.
  • pg-boss's archive table is the source of truth for completed/failed jobs; no separate audit needed for job lifecycle (the audit log captures entity-level changes the workers cause).

Testing strategy

  • Unit: workers and the Ollama client get unit tests with fetch mocked (vitest's vi.fn).
  • Repo: tests/repos/search.test.js extended; new tests/repos/jobs.test.js covers pg-boss-backed list/retry helpers.
  • API: capture, ingest, jobs routes via supertest. HMAC signature pass/fail. Idempotency on second capture of the same URL.
  • Integration (gated): one test that hits real Ollama; auto-skipped if OLLAMA_URL is unreachable. Real pg-boss roundtrips happen inside the existing test DB, using resetDb and stopping the pg-boss instance (await boss.stop()) between suites to avoid cross-talk.
  • No new vitest config. fileParallelism: false already in place from Plan 1 — pg-boss is happier serialized too.

Migrations

  • No new SQL migrations from Void. pg-boss creates its own schema on first start().
  • One-time CT 311 ops: create /var/lib/void/blobs/ and chown void:void.

Deploy delta

  • .env adds OLLAMA_URL, KARAKEEP_WEBHOOK_SECRET, KARAKEEP_API_TOKEN, KARAKEEP_API_URL, KARAKEEP_DEFAULT_SPACE_ID. Documented in deploy/README.md.
  • deploy/push.sh unchanged (rsync still works).
  • Snapshot CT 310 + 311 before deploying Plan 3 (standing rule). The Phase A first-deploy is the "major update" — pg-boss creates new tables in the shared DB.

Known follow-ups (not Plan 3)

  • AI Space/Project suggestion on capture.
  • Embedding chunks table.
  • pdf-text-extract for born-digital PDFs (Plan 4 likely handles this with Tesseract too).
  • Per-tag Karakeep → Space routing instead of one default space.
  • Recurring re-embed cron for rows where embedding IS NULL.
  • Real-time Jobs UI via pg LISTEN/NOTIFY instead of polling.

Open items for the user

  • Karakeep secrets. Plan 3 Phase D needs KARAKEEP_API_TOKEN (issued from Karakeep settings) and a chosen KARAKEEP_DEFAULT_SPACE_ID. Both can be supplied when the phase starts.
  • The 29-day-old knowledge_pipeline memory (Karakeep → Qdrant → MCP) is now superseded by Void 2.0's pgvector-only architecture. After Plan 3 ships, that memory should be marked obsolete or deleted to avoid future-me reading it as authoritative.