All Void 2.0 superpowers specs and implementation plans now live at
docs/superpowers/{specs,plans}/ inside the repo. Previously they were
at /project/docs/superpowers/ which was not under git.
Void 2.0 — Plan 3 Design Spec: Capture pipeline + hybrid search
Date: 2026-06-01
Builds on: Plan 1 (Foundation, complete) and Plan 2 (API + UI shell, complete, version 2.0.0-alpha.2).
Master spec: docs/superpowers/specs/2026-05-31-void-v2-design.md — many decisions inherit from there.
Goal
Wire the Plan 2 SPA's stub Capture button to a real ingest pipeline. Add a pg-boss-backed job queue, capture entry points (URL POST + Karakeep webhook + drag-drop attachment), a URL worker that turns links into refs, an embeddings worker that writes vectors into the existing embedding columns, and a hybrid FTS+vector search that replaces the Plan 2 FTS-only /api/search.
Out of scope (Plan 4 and later)
- Whisper transcription, Tesseract OCR, yt-dlp video ingestion, scanned-PDF OCR.
- The Python `void-workers` service. Plan 3 stays single-process Node.
- AI Space/Project suggestion on capture (deferred; capture takes an explicit `space_id`).
- Embedding chunks table. Plan 3 uses one whole-doc embedding per entity row; chunks land later once we can measure recall on a real corpus.
- MCP server surface. Plan 5+.
Decisions locked by brainstorm
| Question | Answer |
|---|---|
| Plan 3 slice | Node-side: pg-boss + /api/capture POST + Karakeep webhook + URL worker + embed.text worker + hybrid search + Jobs panel. Defers ML-heavy ingest to Plan 4. |
| Capture entry points | /api/capture POST + Karakeep webhook + drag-drop upload. Inbound email skipped. |
| Embedding granularity | Whole-doc per entity row. Add chunks table later. |
| Search rollout | /api/search replaced in-place with hybrid (FTS + vector via RRF). Vector branch graceful-degrades to FTS-only if Ollama is down or the row lacks an embedding. |
| AI Space/Project suggestion | Deferred. Capture requires space_id. SPA preselects the user's last-used space from localStorage. |
| Jobs visibility | GET /api/jobs?state= + POST /api/jobs/:id/retry + DELETE /api/jobs/:id + a minimal #/jobs SPA view (table grouped by state, 10 s polling, retry/delete per row). |
| Sequencing | Phase A → B → C → D (matches Plan 2 phasing). Each phase ends green and demoable. |
Architecture
                      ┌──────────────────────────────────────────┐
                      │  void-server (CT 311, Node, single proc) │
                      │                                          │
    /api/capture ───▶ │  routes/capture.js                       │
    /api/ingest/      │  routes/ingest.js (Karakeep webhook)     │
      karakeep ─────▶ │      │                                   │
    drag-drop  ─────▶ │      ▼                                   │
                      │  jobs/queue.js (pg-boss client)          │
                      │      │                                   │
                      │      ▼                                   │
                      │  workers/ (in-process pollers)           │
                      │   ├─ url.js                              │
                      │   ├─ karakeep.js                         │
                      │   ├─ embed.js (Ollama HTTP)              │
                      │   └─ blob.js (drag-drop attachments)     │
                      │      │                                   │
                      │      ▼                                   │
                      │  lib/db/repos/ (existing) + repos/jobs.js│
                      │      │                                   │
                      └──────┼───────────────────────────────────┘
                             │
               ┌─────────────┼──────────────┐
               ▼             ▼              ▼
         ┌──────────┐ ┌──────────────┐ ┌──────────────┐
         │ Postgres │ │   Ollama     │ │   Blob FS    │
         │ (CT 310, │ │  (CT 102,    │ │  /var/lib/   │
         │ pgvector │ │   nomic-     │ │  void/blobs/ │
         │ + pgboss │ │   embed-text)│ │              │
         │ tables)  │ └──────────────┘ └──────────────┘
         └──────────┘
Process model. Workers and HTTP handlers share the void-server Node process. pg-boss polls Postgres on its own interval; HTTP requests enqueue jobs and return immediately with a job_id. No separate worker process — that's Plan 4 when the Python service arrives.
External dependencies. Postgres (already there), Ollama on CT 102 at http://192.168.1.185:11434 (running, nomic-embed-text pulled, 768-dim embeddings verified 2026-06-01). Graceful-degrade still applies if it goes down later. Blob storage is local FS on CT 311's root pool, content-addressed.
No new entity tables. refs / pages / source_docs / attachments are reused. The embedding vector(1024) columns exist from Plan 1 (migration 002 + 004). pg-boss creates its own schema (pgboss.*) on first run.
Phase A — Queue + worker harness + Jobs API
New files:
- `lib/jobs/queue.js`: singleton pg-boss client; `start()`, `enqueue(name, data, opts)`, `subscribe(name, handler, opts)`.
- `lib/jobs/index.js`: registers all worker handlers on start; called from `server.js` boot.
- `lib/jobs/workers/echo.js`: trivial worker used to prove the harness. Removed at end of Phase D.
- `lib/api/routes/jobs.js`: `GET /api/jobs?state=`, `GET /api/jobs/:id`, `POST /api/jobs/:id/retry`, `DELETE /api/jobs/:id`. Owner-only.
- `tests/jobs/queue.test.js`: pg-boss roundtrip (enqueue → handler runs → result).
- `tests/api/jobs.test.js`: list/retry/delete via HTTP.
Modify:
- `server.js`: call `jobs.start()` on boot, `jobs.shutdown()` on SIGTERM.
- `package.json`: add `pg-boss@^10`.
- `lib/api/index.js`: mount `/api/jobs`.
- `public/router.js` + `public/app.js` + add `public/views/jobs.js`: minimal Jobs view (placeholder for now; fleshed out in Phase D).
pg-boss config. One pg-boss instance per process. Uses the existing DATABASE_URL. Default pg-boss schema name. newJobCheckIntervalSeconds: 2 (alpha-tier; tighten later if needed). archiveCompletedAfterSeconds: 86_400 (1 day archive). deleteAfterDays: 7.
Concurrency limits per the master spec, surfaced via subscribe(name, handler, {teamSize, teamConcurrency}):
| Worker name | Team size | Reason |
|---|---|---|
| ingest.url | 4 | Network-bound |
| ingest.karakeep | 4 | Network-bound |
| ingest.blob | 2 | Disk + sha256 hashing |
| embed.text | 2 | Ollama-bound (single GPU on CT 102) |
Retry policy. Per-worker retryLimit: 5, retryBackoff: true, retryDelay: 10 (seconds). Effective backoff sequence: 10 s, 20 s, 40 s, 80 s, 160 s, then dead-letter. The spec called out 10 s / 60 s / 5 m but pg-boss only exposes exponential backoff with a base delay; the resulting curve is close enough.
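The backoff arithmetic above can be checked with a few lines. This is a minimal sketch, assuming the delay simply doubles on each retry and ignoring any jitter the queue may add; `backoffSchedule` is a hypothetical helper, not part of pg-boss.

```javascript
// Nominal retry delays for exponential backoff with a base delay.
// Assumption: delay = base * 2^attempt, with no jitter applied.
function backoffSchedule(retryDelaySeconds, retryLimit) {
  const delays = [];
  for (let attempt = 0; attempt < retryLimit; attempt += 1) {
    delays.push(retryDelaySeconds * 2 ** attempt);
  }
  return delays;
}

const delays = backoffSchedule(10, 5);
const totalSeconds = delays.reduce((sum, d) => sum + d, 0);
// delays is [10, 20, 40, 80, 160]; totalSeconds is 310 (a little over 5 minutes)
```

The 310 s total is where the "≈ 5 min cumulative" figure used later for the dead-letter window comes from.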
Dead-letter. pg-boss's archive table (pgboss.archive) keeps failed jobs. /api/jobs?state=failed queries it. Manual retry copies to active.
Commit: feat(jobs): pg-boss harness + Jobs API.
Phase B — Capture API + URL worker + blob storage
Capture POST. POST /api/capture (owner or agent with write tier):
    {
      "space_id": "uuid",
      "url": "https://example.com/article",
      "hint": { "project_id": "uuid?", "title": "string?", "tags": ["string"] }
    }
Response 202 with { job_id, idempotency_key, ref_id?: uuid }. Idempotency key is sha256(space_id + url). If a ref already exists for that key, the response carries the existing ref_id and job_id: null (no new job enqueued).
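A minimal sketch of the key derivation and the two response shapes described above. It assumes the key is the hex digest of the plain string concatenation `space_id + url` with no separator; `idempotencyKey` and `captureResponse` are hypothetical helper names.

```javascript
const crypto = require('node:crypto');

// sha256 over space_id + url, returned as 64 lowercase hex characters.
function idempotencyKey(spaceId, url) {
  return crypto.createHash('sha256').update(spaceId + url, 'utf8').digest('hex');
}

// Body of the 202 response: an existing ref means no new job was enqueued.
function captureResponse({ key, existingRefId, newJobId }) {
  return existingRefId
    ? { job_id: null, idempotency_key: key, ref_id: existingRefId }
    : { job_id: newJobId, idempotency_key: key };
}
```

Because the key includes `space_id`, capturing the same URL into two different spaces yields two distinct refs.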
URL worker. lib/jobs/workers/url.js for ingest.url:
- Compute idempotency key. If a `refs` row already exists with `source_kind='url'` and `external_id=<key>`, return its id.
- `fetch(url)` with `User-Agent: void-ingest/2.0` and 15 s timeout.
- Run readability extraction (npm `@mozilla/readability` + `jsdom`). Pull `title`, `byline`, `excerpt`, `textContent`, `siteName`.
- Insert a `refs` row: `kind='url'`, `source_url=url`, `title=readability.title`, `summary=readability.excerpt`, `body_text=readability.textContent` (truncated to 200 kB), `source_kind='url'`, `external_id=<idempotency_key>`, `metadata={ site_name, byline, content_length }`.
- Return the ref. Embedding is handled by Phase C's repo-level trigger that wraps `refs.create`; in Phase B alone the ref simply lacks an embedding until Phase C ships.
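The URL worker steps can be sketched as one handler with its collaborators injected, which keeps it testable without a database or network. Everything on `deps` is hypothetical: `refs.findByExternal` and `refs.create` stand in for the repo, `extract` stands in for the readability wrapper. Reading "200 kB" as 200,000 bytes and passing `space_id` to the insert are also assumptions.

```javascript
const crypto = require('node:crypto');

const MAX_BODY_BYTES = 200_000; // assumption: "200 kB" means 200,000 bytes

// Truncate so the UTF-8 encoding fits in maxBytes without splitting a character.
function truncateUtf8(text, maxBytes) {
  const buf = Buffer.from(text, 'utf8');
  if (buf.length <= maxBytes) return text;
  let end = maxBytes;
  // Back up while the first dropped byte is a continuation byte (10xxxxxx).
  while (end > 0 && (buf[end] & 0xc0) === 0x80) end -= 1;
  return buf.subarray(0, end).toString('utf8');
}

// Default fetcher: User-Agent and 15 s timeout as listed in the steps.
async function fetchPage(url) {
  const res = await fetch(url, {
    headers: { 'User-Agent': 'void-ingest/2.0' },
    signal: AbortSignal.timeout(15_000),
  });
  if (!res.ok) throw new Error(`fetch ${url} failed: ${res.status}`);
  return res.text();
}

async function handleIngestUrl(job, deps) {
  const { space_id, url } = job.data;
  const key = crypto.createHash('sha256').update(space_id + url).digest('hex');

  // Duplicate ingest: short-circuit with the existing ref id.
  const existing = await deps.refs.findByExternal('url', key);
  if (existing) return existing.id;

  const html = await (deps.fetchPage || fetchPage)(url);
  const article = deps.extract(html, url);
  const bodyText = truncateUtf8(article.textContent || '', MAX_BODY_BYTES);

  const ref = await deps.refs.create({
    space_id,
    kind: 'url',
    source_url: url,
    title: article.title,
    summary: article.excerpt,
    body_text: bodyText,
    source_kind: 'url',
    external_id: key,
    metadata: {
      site_name: article.siteName,
      byline: article.byline,
      content_length: Buffer.byteLength(bodyText, 'utf8'),
    },
  });
  return ref.id;
}
```

The sketch returns the ref id in both branches so callers see one result type; returning the full row on insert would work equally well.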
Drag-drop. POST /api/capture/upload (multipart, owner or agent write):
- Field `file`: the binary.
- Field `space_id`: required.
- Field `meta` (JSON): optional `{ title, kind, tags }`.
Multer stages uploads in /var/lib/void/uploads-tmp/ (size cap 100 MB per file) and the worker moves the file into the content-addressed blob store on success.
Worker ingest.blob:
- Stream the upload to a temp file. Hash with sha256 as it streams.
- If `/var/lib/void/blobs/<sha-prefix>/<sha>` exists, this is a duplicate; reuse the existing path.
- Otherwise move the temp file into place.
- Determine `kind` from `Content-Type` / extension: `image` for image/*, `pdf` for application/pdf, `file` for everything else. Video/audio fall through to `file` in Plan 3 (Plan 4 picks them up).
- Insert a `refs` row: `kind=<derived>`, `blob_path=<path>`, `title=filename || sha`, plus metadata.
- Insert via `refs.create`; Phase C's trigger picks up the embed automatically. In Phase B, no embed runs.
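The kind-derivation step can be sketched as a small pure function. `kindFor` is a hypothetical name, and the extension list used as a fallback is an assumption (the spec only says "Content-Type / extension").

```javascript
const path = require('node:path');

// Assumed fallback list for uploads that arrive with a generic content type.
const IMAGE_EXTENSIONS = ['.png', '.jpg', '.jpeg', '.gif', '.webp'];

function kindFor(contentType, filename = '') {
  // Drop parameters such as "; charset=binary" before comparing.
  const type = (contentType || '').toLowerCase().split(';')[0].trim();
  if (type.startsWith('image/')) return 'image';
  if (type === 'application/pdf') return 'pdf';

  const ext = path.extname(filename).toLowerCase();
  if (IMAGE_EXTENSIONS.includes(ext)) return 'image';
  if (ext === '.pdf') return 'pdf';

  // Video, audio and everything else stay generic in Plan 3.
  return 'file';
}
```

Content-Type wins over the extension here; a sniffing step on the file's magic bytes would be stricter but is outside what the spec describes.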
Blob storage. New directory /var/lib/void/blobs/ on CT 311, owned by void:void, mode 750. Layout <first-2-chars-of-sha>/<full-sha>. Deploy bootstrap step adds the dir creation. Already on localzfs so replication picks it up.
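A minimal sketch of the content-addressed layout and of hashing a stream chunk by chunk. `blobPathFor` and `sha256OfStream` are hypothetical helpers; the hex-format check is an added safeguard, not something the spec requires.

```javascript
const crypto = require('node:crypto');
const path = require('node:path');

const BLOB_ROOT = '/var/lib/void/blobs';

// <root>/<first-2-chars-of-sha>/<full-sha>
function blobPathFor(sha, root = BLOB_ROOT) {
  // Rejecting anything but a hex digest also rules out path traversal.
  if (!/^[0-9a-f]{64}$/.test(sha)) throw new Error('expected a lowercase hex sha256');
  return path.posix.join(root, sha.slice(0, 2), sha);
}

// Feed each chunk to the hash as it arrives, so the file is never held in memory.
async function sha256OfStream(readable) {
  const hash = crypto.createHash('sha256');
  for await (const chunk of readable) hash.update(chunk);
  return hash.digest('hex');
}
```

In the worker, the same loop could also write each chunk to the temp file, so hashing and staging happen in a single pass.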
Files:
- `lib/api/routes/capture.js`: both endpoints + multer config.
- `lib/jobs/workers/url.js`, `lib/jobs/workers/blob.js`.
- `lib/ingest/readability.js`: wraps `@mozilla/readability` for testability.
- `lib/ingest/blob_store.js`: sha + path resolution + write.
- `tests/api/capture.test.js`, `tests/jobs/workers/url.test.js`, `tests/jobs/workers/blob.test.js`.
Deps to add: @mozilla/readability, jsdom, multer (pg-boss is already added in Phase A).
Commit: feat(jobs): capture API + URL + blob workers.
Phase C — Embeddings + hybrid search
Ollama client. lib/ai/ollama.js:
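    const OLLAMA_URL = process.env.OLLAMA_URL || 'http://192.168.1.185:11434';

    // Carries the HTTP status so callers can tell Ollama failures from other errors.
    class OllamaError extends Error {
      constructor(status, body) {
        super(`Ollama responded ${status}: ${body}`);
        this.status = status;
      }
    }

    async function embedText(text, model = 'nomic-embed-text') {
      const res = await fetch(`${OLLAMA_URL}/api/embeddings`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ model, prompt: text }),
        signal: AbortSignal.timeout(60_000)
      });
      if (!res.ok) throw new OllamaError(res.status, await res.text());
      const j = await res.json();
      return j.embedding; // 768-dim
    }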
OLLAMA_URL env var, default http://192.168.1.185:11434. The 768-dim vector is zero-padded to 1024 to match the vector(1024) column (per master spec, eases later model swap).
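Zero-padding leaves cosine similarity unchanged, because the extra dimensions add nothing to either the dot product or the norms. A minimal sketch, with `zeroPad`, `cosine` and `toVectorLiteral` as hypothetical helper names:

```javascript
const TARGET_DIM = 1024;

// Append zeros until the vector matches the column width.
function zeroPad(vector, targetDim = TARGET_DIM) {
  if (vector.length > targetDim) {
    throw new RangeError(`embedding has ${vector.length} dims, column holds ${targetDim}`);
  }
  return vector.concat(new Array(targetDim - vector.length).fill(0));
}

// Cosine similarity of two equal-length vectors.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / Math.sqrt(normA * normB);
}

// pgvector accepts a bracketed, comma-separated text literal such as '[1,2,3]'.
function toVectorLiteral(vector) {
  return `[${vector.join(',')}]`;
}
```

One caveat for the later model swap: vectors from different models are not comparable even at the same width, so a swap still means re-embedding every row.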
Embed worker. embed.text job payload { entity_type, entity_id }. Worker:
- Load the entity row.
- Build the embedding string:
  - page: `${title}\n\n${body_md}`, truncated to ~6 k characters (≈ 1.5 k tokens; well under nomic's 8 k context).
  - ref: `${title || ''}\n${summary || ''}\n${body_text || ''}`, same truncation.
  - source_doc: `${name}\n${body_text || ''}`.
  - conversation: `${title || ''}\n${summary || ''}`; short by design, since conversations get richer treatment in Plan 5.
- Call `embedText`. On `OllamaError` or fetch timeout, throw; pg-boss retry kicks in with exponential backoff.
- Zero-pad to 1024, UPDATE the entity's `embedding` column.
- Emit an audit log entry `(actor_kind='worker', action='update', entity_type, entity_id, diff={embedding:'updated'})`.
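The string-building step can be sketched as a lookup of per-entity templates. `buildEmbeddingText` is a hypothetical name; applying the 6,000-character cap to every entity type (the spec states it only for pages and refs) is an assumption.

```javascript
const MAX_CHARS = 6000; // "~6 k characters"

// One template per embeddable entity type, mirroring the list above.
const builders = {
  page: (r) => `${r.title}\n\n${r.body_md}`,
  ref: (r) => `${r.title || ''}\n${r.summary || ''}\n${r.body_text || ''}`,
  source_doc: (r) => `${r.name}\n${r.body_text || ''}`,
  conversation: (r) => `${r.title || ''}\n${r.summary || ''}`,
};

function buildEmbeddingText(entityType, row) {
  // Object.hasOwn avoids matching inherited keys such as "toString".
  if (!Object.hasOwn(builders, entityType)) {
    throw new Error(`not an embeddable entity type: ${entityType}`);
  }
  return builders[entityType](row).slice(0, MAX_CHARS);
}
```

Slicing by characters is cheap but approximate; a token-aware cut would track the model's context limit more closely.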
Re-embed triggers. Write paths (repo.create, repo.update) for pages/refs/source_docs already exist. Add a small lib/jobs/triggers.js that wraps these — after a successful create/update of an embeddable entity, enqueue embed.text with a singleton key ${entity_type}:${entity_id} so rapid re-edits coalesce. The trigger is called from repo level so MCP and cron paths get it too.
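A minimal sketch of such a wrapper, assuming each repo write resolves to a row with an `id` and that `enqueue` has the `(name, data, opts)` shape listed for `lib/jobs/queue.js`. `withEmbedTrigger` is a hypothetical name.

```javascript
// Wrap a repo write so a successful call also enqueues an embed job.
function withEmbedTrigger(entityType, repoFn, enqueue) {
  return async (...args) => {
    const row = await repoFn(...args); // a failed write throws before anything is enqueued
    await enqueue(
      'embed.text',
      { entity_type: entityType, entity_id: row.id },
      { singletonKey: `${entityType}:${row.id}` }, // lets rapid re-edits coalesce
    );
    return row;
  };
}
```

Usage would look like `refs.create = withEmbedTrigger('ref', refs.create, queue.enqueue)`. In this sketch an enqueue failure propagates to the caller; catching and logging it instead would keep the write path independent of the queue.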
Hybrid search. Rewrite lib/db/repos/search.js::fts into search.hybrid({ q, space_id?, kinds?, limit, offset }):
- FTS branch: current Plan 2 query unchanged, returns up to `limit * 3` results with `ts_rank`.
- Vector branch: embed `q` via Ollama (with a 5 s timeout, because search must stay snappy). For each kind, run an ANN query against the matching table's `embedding` column using HNSW (`<=>` cosine distance). Returns up to `limit * 3` per kind. If Ollama times out or errors, skip this branch entirely: log a `search.vector_skipped` event and continue with FTS-only.
- RRF fusion: for each unique `(kind, id)`, sum `1 / (60 + rank_fts) + 1 / (60 + rank_vec)`. The `60` constant matches the canonical RRF paper. Sort by fused score descending, then slice to `[offset, offset+limit]`.
- Vector-only rows (no FTS match) and FTS-only rows (no embedding yet) both participate; a missing rank is treated as infinity, giving `1 / inf = 0` from that branch.
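The fusion step can be sketched as below. It assumes each branch hands over an array already sorted best-first (for the vector branch, the per-kind results merged into one list by distance) and that ranks are 1-based. `rrfFuse` is a hypothetical name.

```javascript
const RRF_K = 60;

// Reciprocal Rank Fusion over two ranked lists of { kind, id, ... } rows.
function rrfFuse(ftsRows, vecRows, { limit = 20, offset = 0, k = RRF_K } = {}) {
  const fused = new Map();

  const addBranch = (rows) => {
    rows.forEach((row, index) => {
      const key = `${row.kind}:${row.id}`;
      const entry = fused.get(key) || { ...row, rank: 0 };
      entry.rank += 1 / (k + index + 1); // index 0 is rank 1
      fused.set(key, entry);
    });
  };

  // A row absent from a branch simply gets no contribution from it,
  // which is the "missing rank is infinity" rule.
  addBranch(ftsRows);
  addBranch(vecRows);

  return [...fused.values()]
    .sort((a, b) => b.rank - a.rank)
    .slice(offset, offset + limit);
}
```

For example, with FTS order `[a, b, c]` and vector order `[b, d]`, row `b` scores `1/62 + 1/61` and ranks first, followed by `a` (`1/61`), `d` (`1/62`) and `c` (`1/63`).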
Result shape unchanged: { kind, id, space_id, title_or_snippet, rank }. The rank field now carries the fused RRF score.
Files:
- `lib/ai/ollama.js` (new).
- `lib/jobs/workers/embed.js` (new).
- `lib/jobs/triggers.js` (new).
- `lib/db/repos/search.js` (rewrite).
- `tests/ai/ollama.test.js`: fetch mock.
- `tests/jobs/workers/embed.test.js`: fetch mock; verifies zero-pad + audit.
- `tests/repos/search.test.js` (existing): extended with vector-fixture rows + RRF assertions.
Embedding-test strategy. Tests insert fixture vectors directly (no Ollama needed). One integration test under tests/integration/embed_live.test.js hits a real Ollama, marked skip() if OLLAMA_URL is unreachable.
Repos that emit triggers: pages.create, pages.update, refs.create, refs.update, refs.upsertByExternal, source_docs.create, source_docs.update. Conversation embeds are summary-only and re-fire when setSummary is called.
Commit: feat(jobs): embed worker + hybrid search.
Phase D — Karakeep webhook + drag-drop UI + Jobs UI
Karakeep webhook. POST /api/ingest/karakeep. Authenticated by X-Karakeep-Signature: sha256=<hex> HMAC of the raw body with KARAKEEP_WEBHOOK_SECRET env. If the signature is missing or wrong: 401.
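A minimal sketch of the signature check, using the header format the spec describes. `verifyKarakeepSignature` is a hypothetical name; the route is assumed to have access to the raw, unparsed request body, since re-serialized JSON would not hash to the same value.

```javascript
const crypto = require('node:crypto');

// Returns true only when `header` is "sha256=<hex>" and the hex matches
// HMAC-SHA256(secret, rawBody).
function verifyKarakeepSignature(rawBody, header, secret) {
  if (typeof header !== 'string' || !header.startsWith('sha256=')) return false;
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const given = Buffer.from(header.slice('sha256='.length), 'hex');
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return given.length === expected.length && crypto.timingSafeEqual(given, expected);
}
```

The route would answer 401 whenever this returns false. The constant-time comparison avoids leaking how many leading bytes of a forged signature were correct.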
Payload (Karakeep's webhook shape, normalized): { event, bookmark_id, tags }.
For event === 'bookmark.created':
- Look up the existing space-mapping from env: `KARAKEEP_DEFAULT_SPACE_ID` (a UUID). Future work: per-tag space routing.
- Enqueue `ingest.karakeep` with `{ bookmark_id, space_id }`.
ingest.karakeep worker:
- Fetch the bookmark via Karakeep's API: `GET https://karakeep.hynesy.com/api/v1/bookmarks/{bookmark_id}` with `KARAKEEP_API_TOKEN`.
- Build the same payload an `ingest.url` job would use (URL + title + tags) and call the URL handler directly. Tags propagate to the `entity_tags` table via repo.
- If Karakeep returns 404 (bookmark deleted), mark the job done; no error.
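The worker steps above can be sketched with injected collaborators. Several details here are assumptions rather than facts about Karakeep: the Bearer auth scheme, the 15 s timeout, and the `url` / `title` / `tags` field names on the bookmark. `deps.fetch`, `deps.apiUrl`, `deps.apiToken` and `deps.ingestUrl` are hypothetical.

```javascript
async function handleIngestKarakeep(job, deps) {
  const { bookmark_id, space_id } = job.data;
  const endpoint = `${deps.apiUrl}/bookmarks/${encodeURIComponent(bookmark_id)}`;

  const res = await deps.fetch(endpoint, {
    headers: { Authorization: `Bearer ${deps.apiToken}` }, // auth scheme is an assumption
    signal: AbortSignal.timeout(15_000), // timeout value is an assumption
  });

  // Bookmark deleted upstream: finish the job without raising.
  if (res.status === 404) return { skipped: 'bookmark_deleted' };
  // Any other failure throws so the queue's retry policy applies.
  if (!res.ok) throw new Error(`Karakeep responded ${res.status}`);

  const bookmark = await res.json();
  // Field names on `bookmark` are assumed, not taken from Karakeep's schema.
  return deps.ingestUrl({
    space_id,
    url: bookmark.url,
    hint: { title: bookmark.title, tags: bookmark.tags || [] },
  });
}
```

Passing `fetch` in through `deps` is what lets the worker test mock the Karakeep API without a network interceptor.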
Drag-drop UI. public/components/dropzone.js — wraps a target element, intercepts drag events, POSTs each file to /api/capture/upload, shows toast progress. Wire onto <main> so dropping anywhere in the main area works. Pre-fills space_id with localStorage.last_space_id (set when the user navigates to a space view).
Jobs UI fill-in. Expand public/views/jobs.js:
- Group rows by `state` (active / completed / failed).
- Each row: `id` (8 chars), `name`, `state`, relative `created_at`, `last_error?`, action buttons.
- Polls `/api/jobs?state=active,failed` every 10 s.
- Retry button POSTs to `/api/jobs/:id/retry`; delete button sends DELETE to `/api/jobs/:id`.
Files:
- `lib/api/routes/ingest.js`.
- `lib/jobs/workers/karakeep.js`.
- `lib/karakeep/client.js`: thin wrapper.
- `public/components/dropzone.js`.
- `public/views/jobs.js` (expand).
- `tests/api/ingest.test.js`: HMAC check, valid/invalid signature.
- `tests/jobs/workers/karakeep.test.js`: Karakeep API mocked via fetch interceptor.
Commit: feat(jobs): Karakeep webhook + drag-drop + Jobs UI.
Error handling & idempotency
- Idempotency keys. URL and Karakeep workers compute `sha256(space_id + url)` (URL) or `sha256(space_id + 'karakeep:' + bookmark_id)` (Karakeep). Stored as `refs.external_id` with `source_kind` set to `'url'` or `'karakeep'`. The unique index `idx_refs_external_unique` already enforces this from Plan 1. A duplicate ingest finds the existing ref and short-circuits.
- Singleton embed jobs. pg-boss `singletonKey: '${entity_type}:${entity_id}'` so rapid edits coalesce into one pending embed. If a job is already in-flight when a new edit lands, a follow-up is enqueued.
- Capture rate limit. Out of scope. The `agentOrOwner` gate is enough at single-user scale.
- Ollama down. Embed jobs throw and retry under pg-boss backoff. After dead-letter (≈ 5 min cumulative), the entity stays without an embedding; hybrid search falls back to FTS for those rows. The operator restores Ollama, then either calls `POST /api/jobs/:id/retry` or waits for the periodic re-embed cron in a future phase.
- Karakeep down. The webhook still accepts requests. The worker dead-letters; the operator replays the tag mapping manually.
- Blob upload partial. Stream to temp; rename on success only. Failed uploads leave a temp file; a daily cron in Plan 4 sweeps temp files older than 24 h.
Observability
- Pino structured logs already in place. New log keys: `job_id`, `job_name`, `entity_type`, `entity_id`, `idempotency_key`, `outcome`.
- `/api/jobs` is the operator surface; the SPA Jobs view fronts it.
- pg-boss's archive table is the source of truth for completed/failed jobs; no separate audit needed for job lifecycle (the audit log captures entity-level changes the workers cause).
Testing strategy
- Unit: workers and the Ollama client get unit tests with `fetch` mocked (vitest's `vi.fn`).
- Repo: `tests/repos/search.test.js` extended; new `tests/repos/jobs.test.js` covers pg-boss-backed list/retry helpers.
- API: capture, ingest, jobs routes via supertest. HMAC signature pass/fail. Idempotency on second capture of the same URL.
- Integration (gated): one test that hits real Ollama; auto-skipped if `OLLAMA_URL` is unreachable. Real pg-boss roundtrips happen inside the existing test DB, using `resetDb` and awaiting pg-boss's `stop()` between suites to avoid cross-talk.
- No new vitest config. `fileParallelism: false` is already in place from Plan 1; pg-boss is happier serialized too.
Migrations
- No new SQL migrations from Void. pg-boss creates its own schema on first `start()`.
- One-time CT 311 ops: create `/var/lib/void/blobs/` and chown `void:void`.
Deploy delta
- `.env` adds `OLLAMA_URL`, `KARAKEEP_WEBHOOK_SECRET`, `KARAKEEP_API_TOKEN`, `KARAKEEP_API_URL`, `KARAKEEP_DEFAULT_SPACE_ID`. Documented in `deploy/README.md`.
- `deploy/push.sh` unchanged (rsync still works).
- Snapshot CT 310 + 311 before deploying Plan 3 (standing rule). The Phase A first-deploy is the "major update": pg-boss creates new tables in the shared DB.
Known follow-ups (not Plan 3)
- AI Space/Project suggestion on capture.
- Embedding chunks table.
- pdf-text-extract for born-digital PDFs (Plan 4 likely handles this with Tesseract too).
- Per-tag Karakeep → Space routing instead of one default space.
- Recurring re-embed cron for rows where `embedding IS NULL`.
- Real-time Jobs UI via `pg LISTEN/NOTIFY` instead of polling.
Open items for the user
- Karakeep secrets. Plan 3 Phase D needs `KARAKEEP_API_TOKEN` (issued from Karakeep settings) and a chosen `KARAKEEP_DEFAULT_SPACE_ID`. Surfaceable when the phase starts.
- The 29-day-old `knowledge_pipeline` memory (Karakeep → Qdrant → MCP) is now superseded by Void 2.0's pgvector-only architecture. After Plan 3 ships, that memory should be marked obsolete or deleted to avoid future-me reading it as authoritative.