docs: move void-v2 specs + plans into the repo
All Void 2.0 superpowers specs and implementation plans now live at
docs/superpowers/{specs,plans}/ inside the repo. Previously they were
at /project/docs/superpowers/ which was not under git.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
295
docs/superpowers/specs/2026-06-01-void-v2-plan3-capture.md
Normal file
@@ -0,0 +1,295 @@
# Void 2.0 — Plan 3 Design Spec: Capture pipeline + hybrid search

**Date:** 2026-06-01
**Builds on:** Plan 1 (Foundation, complete) and Plan 2 (API + UI shell, complete, version 2.0.0-alpha.2).
**Master spec:** `docs/superpowers/specs/2026-05-31-void-v2-design.md` — many decisions inherit from there.

## Goal

Wire the Plan 2 SPA's stub Capture button to a real ingest pipeline. Add a pg-boss-backed job queue, capture entry points (URL POST + Karakeep webhook + drag-drop attachment), a URL worker that turns links into `refs`, an embeddings worker that writes vectors into the existing `embedding` columns, and a hybrid FTS+vector search that replaces the Plan 2 FTS-only `/api/search`.

## Out of scope (Plan 4 and later)

- Whisper transcription, Tesseract OCR, yt-dlp video ingestion, scanned-PDF OCR.
- The Python `void-workers` service. Plan 3 stays single-process Node.
- AI Space/Project suggestion on capture (defer; capture takes explicit `space_id`).
- Embedding chunks table — Plan 3 uses one whole-doc embedding per entity row; chunks land later once we can measure recall on a real corpus.
- MCP server surface. Plan 5+.

## Decisions locked by brainstorm

| Question | Answer |
|---|---|
| Plan 3 slice | Node-side: pg-boss + `/api/capture` POST + Karakeep webhook + URL worker + embed.text worker + hybrid search + Jobs panel. Defers ML-heavy ingest to Plan 4. |
| Capture entry points | `/api/capture` POST + Karakeep webhook + drag-drop upload. Inbound email skipped. |
| Embedding granularity | Whole-doc per entity row. Add chunks table later. |
| Search rollout | `/api/search` replaced in-place with hybrid (FTS + vector via RRF). Vector branch graceful-degrades to FTS-only if Ollama is down or the row lacks an embedding. |
| AI Space/Project suggestion | Deferred. Capture requires `space_id`. SPA preselects the user's last-used space from `localStorage`. |
| Jobs visibility | `GET /api/jobs?state=` + `POST /api/jobs/:id/retry` + `DELETE /api/jobs/:id` + a minimal `#/jobs` SPA view (table grouped by state, 10 s polling, retry/delete per row). |
| Sequencing | Phase A → B → C → D (matches Plan 2 phasing). Each phase ends green and demoable. |

## Architecture

```
                  ┌──────────────────────────────────────────┐
                  │  void-server (CT 311, Node, single proc) │
                  │                                          │
/api/capture ───▶ │  routes/capture.js                       │
/api/ingest/      │  routes/ingest.js (Karakeep webhook)     │
  karakeep ─────▶ │      │                                   │
drag-drop ──────▶ │      ▼                                   │
                  │  jobs/queue.js (pg-boss client)          │
                  │      │                                   │
                  │      ▼                                   │
                  │  workers/ (in-process pollers)           │
                  │  ├─ url.js                               │
                  │  ├─ karakeep.js                          │
                  │  ├─ embed.js (Ollama HTTP)               │
                  │  └─ blob.js (drag-drop attachments)      │
                  │      │                                   │
                  │      ▼                                   │
                  │  lib/db/repos/ (existing) + repos/jobs.js│
                  │      │                                   │
                  └──────┼───────────────────────────────────┘
                         │
           ┌─────────────┼──────────────┐
           ▼             ▼              ▼
     ┌──────────┐ ┌──────────────┐ ┌──────────────┐
     │ Postgres │ │ Ollama       │ │ Blob FS      │
     │ (CT 310, │ │ (CT 102,     │ │ /var/lib/    │
     │ pgvector │ │  nomic-      │ │  void/blobs/ │
     │ + pgboss │ │  embed-text) │ │              │
     │ tables)  │ └──────────────┘ └──────────────┘
     └──────────┘
```

**Process model.** Workers and HTTP handlers share the void-server Node process. pg-boss polls Postgres on its own interval; HTTP requests enqueue jobs and return immediately with a `job_id`. No separate worker process — that's Plan 4 when the Python service arrives.

**External dependencies.** Postgres (already there), Ollama on CT 102 at `http://192.168.1.185:11434` (running, `nomic-embed-text` pulled, 768-dim embeddings verified 2026-06-01). Graceful-degrade still applies if it goes down later. Blob storage is local FS on CT 311's root pool, content-addressed.

**No new entity tables.** refs / pages / source_docs / attachments are reused. The `embedding vector(1024)` columns exist from Plan 1 (migration 002 + 004). pg-boss creates its own schema (`pgboss.*`) on first run.

## Phase A — Queue + worker harness + Jobs API

**New files:**
- `lib/jobs/queue.js` — singleton pg-boss client; `start()`, `enqueue(name, data, opts)`, `subscribe(name, handler, opts)`.
- `lib/jobs/index.js` — registers all worker handlers on start; called from `server.js` boot.
- `lib/jobs/workers/echo.js` — trivial worker used to prove the harness. Removed at end of Phase D.
- `lib/api/routes/jobs.js` — `GET /api/jobs?state=`, `GET /api/jobs/:id`, `POST /api/jobs/:id/retry`, `DELETE /api/jobs/:id`. Owner-only.
- `tests/jobs/queue.test.js` — pg-boss roundtrip: enqueue → handler runs → result.
- `tests/api/jobs.test.js` — list/retry/delete via HTTP.

**Modify:**
- `server.js` — call `jobs.start()` on boot, `jobs.shutdown()` on SIGTERM.
- `package.json` — add `pg-boss@^10`.
- `lib/api/index.js` — mount `/api/jobs`.
- `public/router.js` + `public/app.js` + add `public/views/jobs.js` — minimal Jobs view (placeholder for now; fleshed out in Phase D).

**pg-boss config.** One pg-boss instance per process. Uses the existing `DATABASE_URL`. Default `pgboss` schema name. `newJobCheckIntervalSeconds: 2` (alpha-tier; tighten later if needed). `archiveCompletedAfterSeconds: 86_400` (1 day archive). `deleteAfterDays: 7`.

**Concurrency limits** per the master spec, surfaced via `subscribe(name, handler, {teamSize, teamConcurrency})`:

| Worker name | Team size | Reason |
|---|---|---|
| `ingest.url` | 4 | Network-bound |
| `ingest.karakeep` | 4 | Network-bound |
| `ingest.blob` | 2 | Disk + sha256 hashing |
| `embed.text` | 2 | Ollama-bound (single GPU on CT 102) |

**Retry policy.** Per-worker `retryLimit: 5`, `retryBackoff: true`, `retryDelay: 10` (seconds). Effective backoff sequence: 10 s, 20 s, 40 s, 80 s, 160 s, then dead-letter. The spec called out 10 s / 60 s / 5 m but pg-boss only exposes exponential backoff with a base delay; the resulting curve is close enough.
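
The arithmetic behind that sequence can be checked in a few lines. This is a sketch that assumes the delay before retry `n` is `retryDelay * 2^n` and ignores any jitter the queue may add; `backoffSchedule` is a hypothetical helper, not a pg-boss API.

```javascript
// Hypothetical helper: nominal exponential backoff schedule, in seconds.
// Assumption: delay before retry n (0-based) is retryDelay * 2^n, no jitter.
function backoffSchedule(retryDelaySeconds, retryLimit) {
  const delays = [];
  for (let attempt = 0; attempt < retryLimit; attempt++) {
    delays.push(retryDelaySeconds * 2 ** attempt);
  }
  return delays;
}

const delays = backoffSchedule(10, 5);
const totalSeconds = delays.reduce((sum, d) => sum + d, 0);
console.log(delays);        // [ 10, 20, 40, 80, 160 ]
console.log(totalSeconds);  // 310
```

The nominal total of 310 s is where the "≈ 5 min cumulative" figure in the error-handling section comes from.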

**Dead-letter.** pg-boss's archive table (`pgboss.archive`) keeps failed jobs. `/api/jobs?state=failed` queries it. A manual retry copies the job back to the active queue.

**Commit:** `feat(jobs): pg-boss harness + Jobs API`.

## Phase B — Capture API + URL worker + blob storage

**Capture POST.** `POST /api/capture` (owner or agent with write tier):

```json
{
  "space_id": "uuid",
  "url": "https://example.com/article",
  "hint": { "project_id": "uuid?", "title": "string?", "tags": ["string"] }
}
```

Response 202 with `{ job_id, idempotency_key, ref_id?: uuid }`. Idempotency key is `sha256(space_id + url)`. If a ref already exists for that key, the response carries the existing `ref_id` and `job_id: null` (no new job enqueued).
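
A minimal sketch of that key, assuming plain string concatenation, UTF-8 encoding, hex output, and no URL normalization (none of which the spec pins down). `urlIdempotencyKey` is a hypothetical name.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: hex sha256 over the concatenation space_id + url.
// Assumption: the URL is hashed exactly as submitted, with no normalization.
function urlIdempotencyKey(spaceId, url) {
  return crypto
    .createHash('sha256')
    .update(spaceId + url, 'utf8')
    .digest('hex');
}

const key = urlIdempotencyKey(
  '00000000-0000-0000-0000-000000000001', // placeholder space id
  'https://example.com/article'
);
console.log(key.length); // 64 hex characters
```

Because the key includes `space_id`, capturing the same URL into two different spaces produces two distinct refs, while a repeat capture into the same space resolves to the existing one.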

**URL worker.** `lib/jobs/workers/url.js` for `ingest.url`:

1. Compute idempotency key. If a `refs` row already exists with `source_kind='url'` and `external_id=<key>`, return its id.
2. `fetch(url)` with `User-Agent: void-ingest/2.0` and 15 s timeout.
3. Run readability extraction (npm `@mozilla/readability` + `jsdom`). Pull `title`, `byline`, `excerpt`, `textContent`, `siteName`.
4. Insert a `refs` row: `kind='url'`, `source_url=url`, `title=readability.title`, `summary=readability.excerpt`, `body_text=readability.textContent` (truncate to 200 kB), `source_kind='url'`, `external_id=<idempotency_key>`, `metadata={ site_name, byline, content_length }`.
5. Return the ref. Embedding is handled by Phase C's repo-level trigger that wraps `refs.create`; in Phase B alone the ref simply lacks an embedding until Phase C ships.
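
The 200 kB cap in step 4 needs a little care if it is measured in bytes, because a naive byte slice can cut a multi-byte UTF-8 character in half. The sketch below assumes the cap is 200,000 bytes of UTF-8 (the spec does not say whether it means bytes or characters); `truncateUtf8` and `MAX_BODY_BYTES` are hypothetical names.

```javascript
// Assumption: "200 kB" is read as 200,000 bytes of UTF-8.
const MAX_BODY_BYTES = 200_000;

// Hypothetical helper: cap a string at maxBytes without splitting a character.
function truncateUtf8(text, maxBytes) {
  const buf = Buffer.from(text, 'utf8');
  if (buf.length <= maxBytes) return text;
  let end = maxBytes;
  // buf[end] is the first excluded byte. If it is a continuation byte
  // (10xxxxxx), the cut falls mid-character, so step back to a boundary.
  while (end > 0 && (buf[end] & 0xc0) === 0x80) end--;
  return buf.subarray(0, end).toString('utf8');
}

console.log(truncateUtf8('aé', 2)); // "a" (the 2-byte "é" does not fit)
```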

**Drag-drop.** `POST /api/capture/upload` (multipart, owner or agent write):

- Field `file` — the binary.
- Field `space_id` — required.
- Field `meta` (json) — optional `{ title, kind, tags }`.

Multer stages uploads in `/var/lib/void/uploads-tmp/` (size cap 100 MB per file) and the worker moves the file into the content-addressed blob store on success.

Worker `ingest.blob`:

1. Stream the upload to a temp file. Hash with sha256 as it streams.
2. If `/var/lib/void/blobs/<sha-prefix>/<sha>` exists, this is a duplicate; reuse the existing path.
3. Otherwise move the temp file into place.
4. Determine `kind` from `Content-Type` / extension: `image` for image/*, `pdf` for application/pdf, `file` for everything else. Video/audio fall through to `file` in Plan 3 (Plan 4 picks them up).
5. Insert a `refs` row: `kind=<derived>`, `blob_path=<path>`, `title=filename || sha`, plus metadata.
6. The insert goes through `refs.create`, so Phase C's trigger enqueues the embed automatically. In Phase B alone, no embed runs.
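
Step 4 can be sketched as a small pure function. The rule for when the file extension is consulted (only when the MIME type is missing or generic) and the extension list are assumptions for illustration; `deriveKind` is a hypothetical name.

```javascript
const path = require('node:path');

// Assumption: extensions treated as images when no usable MIME type arrives.
const IMAGE_EXTENSIONS = new Set(['.png', '.jpg', '.jpeg', '.gif', '.webp']);

// Hypothetical helper: map Content-Type / filename to 'image' | 'pdf' | 'file'.
function deriveKind(contentType, filename) {
  const mime = (contentType || '').split(';')[0].trim().toLowerCase();
  if (mime.startsWith('image/')) return 'image';
  if (mime === 'application/pdf') return 'pdf';
  // Fall back to the extension only when the MIME type says nothing useful.
  if (mime === '' || mime === 'application/octet-stream') {
    const ext = path.extname(filename || '').toLowerCase();
    if (IMAGE_EXTENSIONS.has(ext)) return 'image';
    if (ext === '.pdf') return 'pdf';
  }
  return 'file'; // includes video/* and audio/* in Plan 3
}

console.log(deriveKind('video/mp4', 'clip.mp4')); // "file"
```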

**Blob storage.** New directory `/var/lib/void/blobs/` on CT 311, owned by `void:void`, mode 750. Layout `<first-2-chars-of-sha>/<full-sha>`. The deploy bootstrap step creates the directory. It already sits on `localzfs`, so replication picks it up.
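
A rough sketch of the content-addressed layout and the dedupe-on-write behaviour. It assumes the staged temp file and the blob root share a filesystem, so `rename` is a cheap move; `blobPath` and `storeBlob` are hypothetical names, not the actual `blob_store.js` API.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const path = require('node:path');

// Layout: <root>/<first-2-chars-of-sha>/<full-sha>
function blobPath(root, sha) {
  return path.join(root, sha.slice(0, 2), sha);
}

// Hash a staged temp file, then move it into the store unless the blob exists.
async function storeBlob(root, tmpFile) {
  const hash = crypto.createHash('sha256');
  for await (const chunk of fs.createReadStream(tmpFile)) {
    hash.update(chunk);
  }
  const sha = hash.digest('hex');
  const dest = blobPath(root, sha);
  if (fs.existsSync(dest)) {
    await fs.promises.unlink(tmpFile); // duplicate: keep the existing blob
    return { sha, path: dest, duplicate: true };
  }
  await fs.promises.mkdir(path.dirname(dest), { recursive: true, mode: 0o750 });
  // Assumption: same filesystem, so rename does not need a copy fallback.
  await fs.promises.rename(tmpFile, dest);
  return { sha, path: dest, duplicate: false };
}
```

Two uploads with identical bytes resolve to the same path, so the second call reports `duplicate: true` and only removes its temp file.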

**Files:**
- `lib/api/routes/capture.js` — both endpoints + multer config.
- `lib/jobs/workers/url.js`, `lib/jobs/workers/blob.js`.
- `lib/ingest/readability.js` — wraps `@mozilla/readability` for testability.
- `lib/ingest/blob_store.js` — sha + path resolution + write.
- `tests/api/capture.test.js`, `tests/jobs/workers/url.test.js`, `tests/jobs/workers/blob.test.js`.

**Deps to add:** `@mozilla/readability`, `jsdom`, `multer` (`pg-boss` was already added in Phase A).

**Commit:** `feat(jobs): capture API + URL + blob workers`.

## Phase C — Embeddings + hybrid search

**Ollama client.** `lib/ai/ollama.js`:

```js
const OLLAMA_URL = process.env.OLLAMA_URL || 'http://192.168.1.185:11434';

// Carries the HTTP status so callers can tell Ollama failures from other errors.
class OllamaError extends Error {
  constructor(status, body) {
    super(`Ollama responded ${status}: ${body}`);
    this.status = status;
  }
}

async function embedText(text, model = 'nomic-embed-text') {
  const res = await fetch(`${OLLAMA_URL}/api/embeddings`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt: text }),
    signal: AbortSignal.timeout(60_000)
  });
  if (!res.ok) throw new OllamaError(res.status, await res.text());
  const j = await res.json();
  return j.embedding; // 768-dim
}
```

`OLLAMA_URL` env var, default `http://192.168.1.185:11434`. The 768-dim vector is zero-padded to 1024 to match the `vector(1024)` column (per master spec, eases later model swap).
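
The padding step is small enough to sketch directly. `zeroPad` and `TARGET_DIM` are hypothetical names; the guard against oversized vectors is an assumption about how a future, wider model should fail.

```javascript
const TARGET_DIM = 1024; // width of the vector(1024) column

// Hypothetical helper: append trailing zeros up to the column width.
function zeroPad(embedding, dim = TARGET_DIM) {
  if (embedding.length > dim) {
    throw new RangeError(
      `embedding has ${embedding.length} dims, column holds ${dim}`
    );
  }
  return embedding.concat(new Array(dim - embedding.length).fill(0));
}

const padded = zeroPad(new Array(768).fill(0.5));
console.log(padded.length); // 1024
```

Trailing zeros change neither dot products nor vector norms, so cosine distance between two padded vectors equals the distance between the original 768-dim vectors.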

**Embed worker.** `embed.text` job payload `{ entity_type, entity_id }`. Worker:

1. Load the entity row.
2. Build the embedding string:
   - `page`: `${title}\n\n${body_md}`, truncated to ~6 k characters (≈ 1.5 k tokens; well under nomic's 8 k context).
   - `ref`: `${title || ''}\n${summary || ''}\n${body_text || ''}`, same truncation.
   - `source_doc`: `${name}\n${body_text || ''}`.
   - `conversation`: `${title || ''}\n${summary || ''}` — short by design; conversations get richer treatment in Plan 5.
3. Call `embedText`. On `OllamaError` or fetch timeout, throw — pg-boss retry kicks in with exponential backoff.
4. Zero-pad to 1024, UPDATE the entity's `embedding` column.
5. Emit an audit log entry `(actor_kind='worker', action='update', entity_type, entity_id, diff={embedding:'updated'})`.
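
Step 2 can be sketched as a lookup table of per-entity builders. The templates are taken from the list above; applying the 6,000-character cap to every entity type (the spec only states it for `page` and `ref`) is an assumption, and `buildEmbeddingText` is a hypothetical name.

```javascript
const MAX_CHARS = 6000; // "~6 k characters"

// One template per embeddable entity type, as listed in step 2.
const builders = {
  page: (row) => `${row.title}\n\n${row.body_md}`,
  ref: (row) => `${row.title || ''}\n${row.summary || ''}\n${row.body_text || ''}`,
  source_doc: (row) => `${row.name}\n${row.body_text || ''}`,
  conversation: (row) => `${row.title || ''}\n${row.summary || ''}`,
};

// Hypothetical helper: build the string handed to the embedding model.
// Assumption: the same character cap applies to every entity type.
function buildEmbeddingText(entityType, row) {
  const build = builders[entityType];
  if (!build) throw new Error(`not an embeddable entity type: ${entityType}`);
  return build(row).slice(0, MAX_CHARS);
}

console.log(buildEmbeddingText('ref', { title: 'T', summary: null, body_text: 'B' }));
```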

**Re-embed triggers.** Write paths (`repo.create`, `repo.update`) for pages/refs/source_docs already exist. Add a small `lib/jobs/triggers.js` that wraps these — after a successful create/update of an embeddable entity, enqueue `embed.text` with a singleton key `${entity_type}:${entity_id}` so rapid re-edits coalesce. The trigger is called from repo level so MCP and cron paths get it too.
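
The wrapping and coalescing behaviour can be illustrated without a database. The sketch below uses an in-memory stand-in for the queue, not pg-boss itself; `FakeQueue` and `withEmbedTrigger` are hypothetical names, and the stand-in only models the behaviour this spec describes (one pending job per singleton key, a follow-up allowed once the job has started).

```javascript
// In-memory stand-in for the job queue, for illustration only.
class FakeQueue {
  constructor() {
    this.pending = new Map(); // singleton key -> job waiting to run
    this.nextId = 1;
  }
  enqueue(name, data, { singletonKey }) {
    if (this.pending.has(singletonKey)) return null; // coalesced
    const job = { id: this.nextId++, name, data };
    this.pending.set(singletonKey, job);
    return job.id;
  }
  // Simulates a worker picking the job up, which frees the key again.
  start(singletonKey) {
    const job = this.pending.get(singletonKey);
    this.pending.delete(singletonKey);
    return job;
  }
}

// Wrap a repo write so each successful create/update enqueues an embed job.
function withEmbedTrigger(queue, entityType, write) {
  return async (...args) => {
    const row = await write(...args);
    queue.enqueue(
      'embed.text',
      { entity_type: entityType, entity_id: row.id },
      { singletonKey: `${entityType}:${row.id}` }
    );
    return row;
  };
}
```

Three quick edits to the same page leave a single pending `embed.text` job; an edit that lands after the worker has picked that job up enqueues a follow-up.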

**Hybrid search.** Rewrite `lib/db/repos/search.js::fts` into `search.hybrid({ q, space_id?, kinds?, limit, offset })`:

1. FTS branch — current Plan 2 query unchanged, returns up to `limit * 3` results with `ts_rank`.
2. Vector branch — embed `q` via Ollama (with a 5 s timeout — search must stay snappy). For each kind, run an ANN query against the matching table's `embedding` column using HNSW (`<=>` cosine distance). Returns up to `limit * 3` per kind. If Ollama times out or errors, skip this branch entirely — log a `search.vector_skipped` event and continue with FTS-only.
3. RRF fusion — for each unique `(kind, id)`, sum `1 / (60 + rank_fts) + 1 / (60 + rank_vec)`. The `60` constant matches the canonical RRF paper. Sort by fused score descending, then slice to the half-open range `[offset, offset + limit)`.
4. Vector-only rows (no FTS match) and FTS-only rows (no embedding yet) both participate; missing rank is treated as infinity, giving `1 / inf = 0` from that branch.
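
Steps 3 and 4 can be sketched in memory. The input shape (two arrays of `{ kind, id }`, best match first) and the 1-based rank convention are assumptions; `rrfFuse` is a hypothetical name, and the real query may do the fusion in SQL instead.

```javascript
const RRF_K = 60;

// ftsHits and vecHits: arrays of { kind, id }, best match first.
function rrfFuse(ftsHits, vecHits, { limit = 20, offset = 0 } = {}) {
  const fused = new Map();
  for (const hits of [ftsHits, vecHits]) {
    hits.forEach((hit, index) => {
      const key = `${hit.kind}:${hit.id}`;
      const entry = fused.get(key) || { kind: hit.kind, id: hit.id, rank: 0 };
      // Ranks are 1-based. A row absent from a branch adds nothing from it,
      // which is the "missing rank is infinity" rule.
      entry.rank += 1 / (RRF_K + index + 1);
      fused.set(key, entry);
    });
  }
  return [...fused.values()]
    .sort((a, b) => b.rank - a.rank)
    .slice(offset, offset + limit);
}

const fts = [{ kind: 'ref', id: 'a' }, { kind: 'ref', id: 'b' }];
const vec = [{ kind: 'ref', id: 'b' }, { kind: 'page', id: 'c' }];
console.log(rrfFuse(fts, vec).map((r) => r.id)); // [ 'b', 'a', 'c' ]
```

Row `b` appears in both branches and wins with `1/62 + 1/61`; `a` scores `1/61` and `c` scores `1/62`. Passing an empty `vecHits` array reproduces the FTS order, which is the degraded path when Ollama is unavailable.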

Result shape unchanged: `{ kind, id, space_id, title_or_snippet, rank }`. The `rank` field now carries the fused RRF score.

**Files:**
- `lib/ai/ollama.js` (new).
- `lib/jobs/workers/embed.js` (new).
- `lib/jobs/triggers.js` (new).
- `lib/db/repos/search.js` (rewrite).
- `tests/ai/ollama.test.js` — fetch mock.
- `tests/jobs/workers/embed.test.js` — fetch mock; verifies zero-pad + audit.
- `tests/repos/search.test.js` (existing) — extended with vector-fixture rows + RRF assertions.

**Embedding-test strategy.** Tests insert fixture vectors directly (no Ollama needed). One integration test under `tests/integration/embed_live.test.js` hits a real Ollama, marked `skip()` if `OLLAMA_URL` is unreachable.

**Repos that emit triggers:** pages.create, pages.update, refs.create, refs.update, refs.upsertByExternal, source_docs.create, source_docs.update. Conversation embeds are summary-only and re-fire when `setSummary` is called.

**Commit:** `feat(jobs): embed worker + hybrid search`.

## Phase D — Karakeep webhook + drag-drop UI + Jobs UI

**Karakeep webhook.** `POST /api/ingest/karakeep`. Authenticated by `X-Karakeep-Signature: sha256=<hex>` HMAC of the raw body with `KARAKEEP_WEBHOOK_SECRET` env. If the signature is missing or wrong: 401.
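
A sketch of the signature check using Node's built-in `crypto`. It assumes the route has access to the raw request bytes (captured before any JSON parsing) and that the header carries a lowercase or uppercase hex digest; `verifyKarakeepSignature` is a hypothetical name.

```javascript
const crypto = require('node:crypto');

// rawBody must be the exact bytes received, not a re-serialized object.
function verifyKarakeepSignature(rawBody, header, secret) {
  if (typeof header !== 'string' || !header.startsWith('sha256=')) return false;
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(header.slice('sha256='.length), 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return (
    received.length === expected.length &&
    crypto.timingSafeEqual(received, expected)
  );
}

const body = Buffer.from('{"event":"bookmark.created","bookmark_id":"b1"}');
const sig =
  'sha256=' + crypto.createHmac('sha256', 'test-secret').update(body).digest('hex');
console.log(verifyKarakeepSignature(body, sig, 'test-secret')); // true
```

A route handler would respond 401 whenever this returns `false`. The constant-time comparison avoids leaking how many leading bytes of a forged signature were correct.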

Payload (Karakeep's webhook shape, normalized): `{ event, bookmark_id, tags }`.

For `event === 'bookmark.created'`:
1. Look up the existing space-mapping from env: `KARAKEEP_DEFAULT_SPACE_ID` (a UUID). Future work: per-tag space routing.
2. Enqueue `ingest.karakeep` with `{ bookmark_id, space_id }`.

`ingest.karakeep` worker:
1. Fetch the bookmark via Karakeep's API: `GET https://karakeep.hynesy.com/api/v1/bookmarks/{bookmark_id}` with `KARAKEEP_API_TOKEN`.
2. Build the same payload an `ingest.url` job would use (URL + title + tags) and call the URL handler directly. Tags propagate to the `entity_tags` table via repo.
3. If Karakeep returns 404 (bookmark deleted), mark the job done — no error.

**Drag-drop UI.** `public/components/dropzone.js` — wraps a target element, intercepts drag events, POSTs each file to `/api/capture/upload`, shows toast progress. Wire onto `<main>` so dropping anywhere in the main area works. Pre-fills `space_id` with `localStorage.last_space_id` (set when the user navigates to a space view).

**Jobs UI fill-in.** Expand `public/views/jobs.js`:
- Group rows by `state` (active / completed / failed).
- Each row: `id (8 chars)`, `name`, `state`, relative `created_at`, `last_error?`, action buttons.
- Polls `/api/jobs?state=active,failed` every 10 s.
- Retry button sends `POST /api/jobs/:id/retry`; delete button sends `DELETE /api/jobs/:id`.

**Files:**
- `lib/api/routes/ingest.js`.
- `lib/jobs/workers/karakeep.js`.
- `lib/karakeep/client.js` — thin wrapper.
- `public/components/dropzone.js`.
- `public/views/jobs.js` (expand).
- `tests/api/ingest.test.js` — HMAC check, valid/invalid signature.
- `tests/jobs/workers/karakeep.test.js` — Karakeep API mocked via fetch interceptor.

**Commit:** `feat(jobs): Karakeep webhook + drag-drop + Jobs UI`.

## Error handling & idempotency

- **Idempotency keys.** URL and Karakeep workers compute `sha256(space_id + url)` (URL) or `sha256(space_id + 'karakeep:' + bookmark_id)` (Karakeep). Stored as `refs.external_id` with `source_kind` set to `'url'` or `'karakeep'`. The unique index `idx_refs_external_unique` already enforces this from Plan 1. A duplicate ingest finds the existing ref and short-circuits.
- **Singleton embed jobs.** pg-boss `singletonKey: '${entity_type}:${entity_id}'` so rapid edits coalesce into one pending embed. If a job is already in-flight when a new edit lands, a follow-up is enqueued.
- **Capture rate limit.** Out of scope. The `agentOrOwner` gate is enough at single-user scale.
- **Ollama down.** Embed jobs throw and retry under pg-boss backoff. After dead-letter (≈ 5 min cumulative), the entity stays without an embedding; hybrid search falls back to FTS for those rows. Once the operator restores Ollama, they either call `POST /api/jobs/:id/retry` or wait for the periodic re-embed cron planned for a future phase.
- **Karakeep down.** The webhook still accepts requests. The worker dead-letters, and the operator replays the failed jobs (and their tag mapping) manually.
- **Blob upload partial.** Stream to temp; rename on success only. Failed uploads leave a temp file; a daily cron in Plan 4 sweeps temp files older than 24 h.

## Observability

- Pino structured logs already in place. New log keys: `job_id`, `job_name`, `entity_type`, `entity_id`, `idempotency_key`, `outcome`.
- `/api/jobs` is the operator surface; the SPA Jobs view fronts it.
- pg-boss's archive table is the source of truth for completed/failed jobs; no separate audit needed for job lifecycle (the audit log captures entity-level changes the workers cause).

## Testing strategy

- **Unit:** workers and the Ollama client get unit tests with `fetch` mocked (vitest's `vi.fn`).
- **Repo:** `tests/repos/search.test.js` extended; new `tests/repos/jobs.test.js` covers `pg-boss`-backed list/retry helpers.
- **API:** capture, ingest, jobs routes via supertest. HMAC signature pass/fail. Idempotency on second capture of the same URL.
- **Integration (gated):** one test that hits real Ollama; auto-skipped if `OLLAMA_URL` is unreachable. Real pg-boss roundtrips happen inside the existing test DB, using `resetDb` and stopping the pg-boss instance (`await boss.stop()`) between suites to avoid cross-talk.
- **No new vitest config.** `fileParallelism: false` already in place from Plan 1 — pg-boss is happier serialized too.

## Migrations

- **No new SQL migrations from Void.** pg-boss creates its own schema on first `start()`.
- One-time CT 311 ops: create `/var/lib/void/blobs/` and chown `void:void`.

## Deploy delta

- `.env` adds `OLLAMA_URL`, `KARAKEEP_WEBHOOK_SECRET`, `KARAKEEP_API_TOKEN`, `KARAKEEP_API_URL`, `KARAKEEP_DEFAULT_SPACE_ID`. Documented in `deploy/README.md`.
- `deploy/push.sh` unchanged (rsync still works).
- Snapshot CT 310 + 311 before deploying Plan 3 (standing rule). The Phase A first-deploy is the "major update" — pg-boss creates new tables in the shared DB.

## Known follow-ups (not Plan 3)

- AI Space/Project suggestion on capture.
- Embedding chunks table.
- pdf-text-extract for born-digital PDFs (Plan 4 likely handles this with Tesseract too).
- Per-tag Karakeep → Space routing instead of one default space.
- Recurring re-embed cron for rows where `embedding IS NULL`.
- Real-time Jobs UI via `pg LISTEN/NOTIFY` instead of polling.

## Open items for the user

- **Karakeep secrets.** Plan 3 Phase D needs `KARAKEEP_API_TOKEN` (issued from Karakeep settings) and a chosen `KARAKEEP_DEFAULT_SPACE_ID`. Surfaceable when the phase starts.
- **The 29-day-old `knowledge_pipeline` memory** (Karakeep → Qdrant → MCP) is now superseded by Void 2.0's pgvector-only architecture. After Plan 3 ships, that memory should be marked obsolete or deleted to avoid future-me reading it as authoritative.