docs: security sweep, code review, Yerin design, Plan 6 brainstorm brief

- security-sweep-2026-06-01.md: fresh sweep of alpha.6 (1 fixed, findings + carry-overs) - code-review-2026-06-01.md: optimisation/cleanliness notes (pool error handler, O(n) bcrypt token scan, FTS index alignment, dup auth parsing) - yerin-security-agent.md: security-agent design + tool roadmap + Orthos role proposal - plan-6-brainstorm-brief.md: Sacred Valley widget inventory + open design questions - security-followups.md: marked the pending_changes CHECK finding RESOLVED Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 23:26:46 +10:00
parent 6c393d8069
commit afbf075d26
4 changed files with 323 additions and 0 deletions
--- a/docs/code-review-2026-06-01.md
+++ b/docs/code-review-2026-06-01.md
@@ -0,0 +1,62 @@
+# Void 2.0 — Code Review & Optimisation Notes (2026-06-01, alpha.6)
+
+Full read-through of `lib/` (~4.2k LOC), the public SPA, and the Python workers.
+Overall the codebase is in good shape: ~1:1 test-to-code ratio, consistent repo +
+audit pattern, clean tool registry, disciplined safe-DOM and SSRF handling. Items
+below are improvements, ordered by value. Severity ≠ urgency — most are "before
+scale / before more agents", not "broken now".
+
+## Correctness / robustness
+
+1. **`pool.js` has no error handler or statement timeout.** `lib/db/pool.js`
+   creates a `pg.Pool` with no `pool.on('error', …)`. An error on an *idle*
+   pooled client (e.g. DB restart, replication failover on the .215 cluster)
+   emits an `error` event with no listener → **process crash**. Add a logging
+   error handler. Also consider `statement_timeout` / `query_timeout` so a
+   pathological query can't pin a connection indefinitely (the pool is only
+   `max: 10`).
+
+2. **`applyPendingChange`'s `upsert` arm assumes a refs-shaped repo.** After
+   today's fix, `case 'upsert'` calls `repo.upsertByExternal(...)`. Only
+   `refsRepo` has that method; the action only originates from the refs route so
+   it's safe today, but it's an implicit coupling. Cheap guard: assert
+   `typeof repo.upsertByExternal === 'function'` and throw a clear
+   `ValidationError` otherwise, so a future entity emitting `upsert` fails loud,
+   not with a bare `TypeError`.
+
+## Performance / scale
+
+3. **`verifyToken` is O(n·bcrypt) per request.** (Also in the security sweep.)
+   Loads every non-revoked token and bcrypt-compares each. Fine at 1–2 agents,
+   quadratic-feeling as agents grow. Selector+verifier split makes it O(1). This
+   is the single highest-value optimisation once Yerin/others get their own tokens.
+
+4. **Hybrid-search FTS uses inline `to_tsvector(...)` expressions.** `search.js`
+   computes `to_tsvector('english', title || …)` at query time per branch. This
+   only avoids a seq scan if a **GIN index on the identical expression** exists
+   for each table. Verify the expression indexes match these exact expressions
+   (incl. the `coalesce(...)` shapes); a drift means silent seq scans. Worth a
+   `EXPLAIN` check on prod-sized data.
+
+## Cleanliness / maintainability
+
+5. **`owner.js` and `agent_auth.js` duplicate bearer parsing + owner check.**
+   Both split `Authorization`, both compare `OWNER_TOKEN`. Now that both use
+   `timingSafeStrEqual`, factor the "parse bearer + is-owner?" step into one
+   helper to keep the two in lockstep.
+
+6. **Doc/code name drift.** Completion docs reference `divertToPending` /
+   `applyPendingChange` — those are now the real names (good), but earlier notes
+   also referenced a `n()` export. Keep `docs/plan-*-complete.md` symbol names in
+   sync with the code; stale symbol names cost a grep on every revisit.
+
+7. **`context` tool `SELECT *`.** Project an explicit column list (see security
+   sweep LOW) — also makes the tool's output contract stable for the UI.
+
+## Things that are good (keep doing)
+- Tool registry (`createRegistry`) is a clean extension point — the new
+  `securityRegistry` reused it with zero changes.
+- Audit emission + recursive redaction is consistent across repos.
+- SSRF `safeFetch` is genuinely well built (DNS-pin TOCTOU defeat + per-hop
+  re-validation).
+- Test isolation on `void_test` with a prod-guard in `tests/helpers/setup.js`.
--- a/docs/plan-6-brainstorm-brief.md
+++ b/docs/plan-6-brainstorm-brief.md
@@ -0,0 +1,64 @@
+# Plan 6 — Sacred Valley Widgets — Brainstorming Brief
+
+> **Status: PREP, not a plan.** Plan 6 is design-led. Per
+> `[[void-v2-frontend-design-plugin]]`, the actual brainstorm must run **with you**
+> and the `frontend-design` skill invoked at the brainstorm stage so it shapes the
+> spec, not the implementation. This brief assembles the raw material so that
+> session is fast — it does **not** make the design decisions.
+
+## What Sacred Valley was in Void 1.x (the inventory to port)
+
+From the live Void 1.x dashboard (CT 301), gridstack cards (`gs-id`):
+
+| Card | Source / data | Void 2.0 status |
+|------|---------------|-----------------|
+| `clock` | client clock, NY + Melbourne | Trivial port; pure client. (Note: your TZ is Melbourne — drop NY-primary.) |
+| `weather` | Open-Meteo, Melbourne, 15-min cache, no key | Port as-is (add `/api/weather` or keep client-side). |
+| `speedtest` | hourly `speedtest-cli` cron → history bar chart | Needs a worker/cron + a results table in v2. Medium. |
+| `host-perf` | `/proc` CPU/RAM/disk/net, 30s refresh | Port `lib/resources.js`; new `/api/host` endpoint. |
+| `yerin-alerts` | 5-min service-health cron | **Becomes the Yerin security agent surface** — see `docs/yerin-security-agent.md`. |
+| `littleblue-nightly` | 2 AM heal/diagnostics | Needs the Little Blue agent + cron in v2. Defer. |
+| `claude` | Claude usage (today/week/session) | v2 companion runs on the **Max subscription via CLI** — there's no API usage meter. Rethink: show CLI turn/cost from the `result` events instead. |
+| `briefing` | 7 AM Dross briefing (Claude) | Dross exists in v2; needs a scheduled briefing pass + card. |
+| `orthos-council` | 6:50 AM Orthos briefing (Ollama) | **Becomes the Orthos Advisor surface** — local-model reflective digest. |
+| dynamic service cards | per-YAML health check | v2 has a real `resources` table + status — render from that, not YAML. |
+
+## What Void 2.0's *new* backend unlocks (v2-native widgets)
+
+These didn't exist in v1 and are the more exciting half of Plan 6:
+
+- **Capture queue / Jobs** — live pg-boss job states (ingest.blob → extract.pdf →
+  embed). A "what's flowing in" card. Data already in the Jobs view/API.
+- **Hybrid search spotlight** — a search-box card over the FTS+vector RRF endpoint.
+- **Inbox / pending changes** — agent-proposed drafts awaiting approval, as a card
+  (shares the rail's approve/reject path).
+- **Knowledge connections (Orthos)** — embedding-graph "things converging" card.
+- **Security pulse (Yerin)** — recent audit highlights + alerts card.
+- **Space switcher / recent entities** — v2 is Space-scoped; v1 wasn't.
+
+## Open design questions for the brainstorm (decide WITH frontend-design)
+
+1. **Grid library.** v1 used gridstack v10. Keep it, or go CSS-grid + a lighter
+   DnD? (Affects bundle + the no-build-step constraint.)
+2. **Card taxonomy & sizing.** Status cards vs. interactive cards vs. agent-output
+   cards — do they share one chrome or differentiate?
+3. **Theme tokens.** v1 is blackflame (`--accent: #ff4f2e`). The companion rail
+   already shipped orange (mockup wanted violet — `--rail-accent` deferred). Plan 6
+   is the moment to settle the token system (accent, rail-accent, per-agent hues:
+   Dross / Yerin / Orthos identity colours?).
+4. **Real-time.** Cards poll (v1: 10–30s) vs. an SSE/event bus (v2 already streams
+   companion turns). Worth a shared `space-active`-style event channel?
+5. **Per-agent cards as a pattern.** If Dross/Yerin/Orthos each own a card, define
+   one "agent output card" component once.
+6. **Persistence.** v1 stored layout in localStorage. Per-Space layouts in the DB?
+
+## Suggested first slice (to propose at the brainstorm, not commit)
+A thin vertical: **host-perf + weather + clock** (pure ports, prove the grid +
+theme tokens) → then **Jobs card + Inbox card** (prove v2-native data) → then the
+**agent cards** (Yerin pulse, Orthos council) once those agents are wired. This
+sequences design risk before agent-integration risk.
+
+## Pre-reqs that aren't UI
+- Orthos & Yerin agents need seeding + wiring (see `docs/yerin-security-agent.md`)
+  before their cards have data.
+- speedtest/host-perf need their v2 data endpoints.
--- a/docs/security-sweep-2026-06-01.md
+++ b/docs/security-sweep-2026-06-01.md
@@ -0,0 +1,97 @@
+# Void 2.0 — Security Sweep (2026-06-01, alpha.6)
+
+Fresh security pass over the live alpha.6 surfaces. Severity is for the current
+deployment context: **single owner token, owner-only authn, one suggest-tier
+companion agent, LAN + Cloudflare-Access perimeter**. Several items move up in
+severity once additional agents or wider exposure land.
+
+Legend: ✅ fixed this pass · 🔧 recommended (needs your call / migration) · ℹ️ note.
+
+---
+
+## ✅ FIXED — Owner token compared in non-constant time  (was MEDIUM)
+`lib/auth/owner.js` and `lib/api/middleware/agent_auth.js` compared the bearer
+token with `===` / `!==`. String `===` short-circuits on the first differing
+byte, leaking token length and prefix through response timing — over enough
+samples this enables byte-by-byte recovery of `OWNER_TOKEN`.
+
+**Fix:** new `lib/auth/timingSafeStrEqual` (constant-time via
+`crypto.timingSafeEqual` with a length pre-check so it never throws on a
+length mismatch). Both auth paths now use it. Tests: `tests/auth/safe_compare.test.js`.
+
+---
+
+## 🔧 HIGH — `verifyToken` does an O(n) bcrypt scan over every token
+`lib/db/repos/agents.js::verifyToken` loads **all** non-revoked agent tokens and
+runs `bcrypt.compare` against each (cost factor 12 ≈ 250 ms/compare).
+
+- **Auth latency scales linearly with token count.** With N agents/tokens, every
+  authenticated request pays N bcrypt comparisons.
+- **DoS lever:** an attacker who can hit an authenticated endpoint with a `vk_`-
+  prefixed token forces a full-table bcrypt scan per request.
+
+**Recommended fix (needs a migration — left for your sign-off):** give each token
+a non-secret lookup key. Store `token_id = first 8 chars of the random body`
+(or a separate indexed `selector` column), index it, and `bcrypt.compare` exactly
+the one row it points at. Keeps bcrypt's offline-cracking resistance while making
+verification O(1). This is the standard "selector + verifier" split.
+
+---
+
+## 🔧 HIGH — `void` DB role still has SUPERUSER  (carried over)
+Documented in `security-followups.md`: the `void` role was granted SUPERUSER so
+the test harness could `CREATE EXTENSION`. On prod (CT 311 → CT 310 DB) this is
+far more privilege than the app needs. Revoke on prod; create extensions once as
+a superuser during bootstrap, then run the app as a non-superuser role.
+
+---
+
+## 🔧 MEDIUM — Companion subprocess inherits the full server environment
+`lib/ai/claude_cli.js` clones `process.env` for the `claude` child and only
+deletes `ANTHROPIC_API_KEY` / `ANTHROPIC_AUTH_TOKEN`. The child therefore also
+inherits `OWNER_TOKEN`, `DB_PASS`/connection strings, and the Karakeep secrets.
+
+Today the companion is constrained to `mcp__void__*` tools only (built-ins like
+Bash/Read are stripped via `--tools`), so it has no primitive to read its own
+env — contained in practice. But it is one config slip (re-enabling a built-in)
+away from full secret exposure.
+
+**Recommended (defense in depth):** pass an explicit allow-list env to the child
+(HOME, PATH, the few `CLAUDE_*` / `VOID_CLAUDE_HOME` vars, and only the MCP
+server's own needs) rather than the whole environment.
+
+---
+
+## ℹ️ LOW — `context` tool returns `SELECT *` of the active entity
+`lib/ai/agent/tools/context.js` returns every column of the active row to the
+agent. For `resources` that includes `monitoring`/`metadata` JSON, which may hold
+connection hints or `vault_path` pointers. Not a secret-value leak today (the
+resolver keeps values out of the row), but project a column allow-list before
+Yerin (or any future agent) queries resource rows broadly.
+
+---
+
+## ✅ Reviewed and sound (no action)
+- **SSRF guard** (`lib/ingest/safe_fetch.js`): http/https-only, blocks loopback /
+  RFC1918 / link-local / CGNAT(100.64) / 0.0.0.0 / IPv6 ULA+link-local+v4-mapped,
+  validates **all** DNS records, pins resolved IPs into the undici dispatcher to
+  defeat TOCTOU rebinding, and re-validates every redirect hop. Solid.
+- **Karakeep webhook HMAC** (`lib/api/routes/ingest.js`): `timingSafeEqual` over
+  the raw body, guarded against length-mismatch throw, fails closed on missing
+  secret/sig. Good.
+- **Audit redaction** (`lib/db/repos/audit.js`): redacts token/password/api_key/
+  secret/authorization recursively at write time.
+- **Capability model** (`lib/auth/capability.js`): agents default-deny; mutations
+  require `write`+scope or fall to `suggest`; unknown actions deny.
+- **XSS**: safe-DOM invariant in `public/dom.js`; assistant markdown only via the
+  DOMPurify-sanitized path.
+- **Bearer-only API** (no cookies) ⇒ CSRF not applicable.
+
+---
+
+## Carried-over from `security-followups.md` (design decisions, your call)
+- **[HIGH]** Polymorphic `entity_tags`/`entity_links`/`attachments` lack `space_id`
+  (cross-tenant linkage possible at the DB layer). Defensible for single-tenant
+  alpha; gate at the REST/MCP layer until multi-tenant is load-bearing.
+- **[MEDIUM]** `tags.name` UNIQUE is global, not per-space.
+- **[MEDIUM]** No cascade on polymorphic parent delete.
--- a/docs/yerin-security-agent.md
+++ b/docs/yerin-security-agent.md
@@ -0,0 +1,100 @@
+# Yerin — the Void 2.0 Security Agent
+
+> Cradle note: Yerin is the sword artist — fast, vigilant, the one who *notices the
+> threat first and calls it*. In Void 1.x she already owns the 5-minute alert
+> cron. In Void 2.0 she becomes a first-class **read-only security/observability
+> agent**: she watches, reports, and proposes — she never silently acts.
+
+## Design principles
+
+1. **Read-only by capability.** Yerin's agent record gets `{ read: true }` and
+   **no** `write`/`suggest`. Anything she wants changed she raises to you; she
+   cannot mutate state. (If you later want her to be able to *propose* a
+   remediation, add `suggest: true` and the existing pending-change flow gives
+   you an approval gate for free.)
+2. **Own toolset, own registry.** Security tools live in
+   `lib/ai/agent/tools/security/` behind `securityRegistry` — separate from
+   Dross's `companionRegistry`, so she gets investigative tools, not
+   `propose_change`.
+3. **No secret material, ever.** Tools project explicit columns and rely on the
+   audit layer's write-time redaction. `agent_inventory` deliberately never does
+   `SELECT *`.
+
+## Built this pass (TDD, on `main`, not deployed)
+
+`lib/ai/agent/tools/security/` + `securityRegistry`, tests in
+`tests/ai/security_tools.test.js`:
+
+| Tool | What it answers |
+|------|-----------------|
+| `audit_log` | "Who did what, newest first?" Filter by `actor_kind` / `actor_id`. Reads the redacted audit trail. Capped at 200. |
+| `agent_inventory` | "Which agents exist and what is each allowed to do?" Returns id/slug/name/kind/model + capabilities + scopes. Never returns token material. |
+
+## Roadmap — tools to add next (designed, not yet built)
+
+Each is read-only and maps to an existing repo, so each is a small TDD task:
+
+- **`pending_review`** — list `pending_changes` awaiting approval (the queue of
+  agent-proposed mutations). Wraps `pendingChanges.listPending`. Security-relevant
+  because it's exactly where a prompt-injected agent's intent surfaces.
+- **`resource_exposure`** — inventory of resources with `url`/`host`/`status`
+  (attack surface / what's reachable). Wraps `resources.listBySpace` across spaces.
+- **`token_audit`** — agent tokens with `label`, `last_used`, `revoked_at`
+  (NOT the hash) to spot stale/unused credentials. Needs a small repo read.
+- **`recent_captures`** — newly ingested refs/source_docs (untrusted external
+  content entering the system), so Yerin can flag suspicious inbound material.
+
+Stretch (needs new plumbing, your call):
+- **`tls_expiry` / `service_health`** — port Void 1.x's `lib/security.js` cert
+  checks + the Yerin alert cron into a tool she can call on demand.
+- **Active probes** (e.g. "is this CT's admin port exposed?") would require
+  network egress — must go through `safeFetch` and be owner-gated. Defer.
+
+## Wiring Yerin up (the remaining integration steps — left for you)
+
+These touch live agent seeding / MCP config, so they're documented rather than
+done unsupervised:
+
+1. **Seed the agent.** A migration (010) inserting a `yerin` agent row with
+   `capabilities = {"read": true}`, `kind: 'claude'` (or `'ollama'` if you want
+   her cheap/local), and a persona prompt. Mirror migration 007's companion seed.
+2. **Expose `securityRegistry` over MCP.** `lib/mcp/companion-stdio.js` currently
+   hardcodes `companionRegistry`. Parameterise it (env `VOID_TOOL_REGISTRY=security`
+   selects the registry) so a Yerin session spawns with her tools. ~10 lines.
+3. **A Yerin entry point.** Either a route (`POST /api/security/ask`) reusing the
+   `claude_cli` driver with Yerin's persona + `securityRegistry`, or a scheduled
+   cron that runs a standing "anything suspicious in the last 24h?" pass and files
+   the result as a Sacred Valley card (ties into Plan 6).
+
+## Adjacent agent roles (so the roster stays coherent)
+
+- **Dross** — companion/chat (Lindon's own spirit-AI; the right-rail assistant).
+  Read + suggest. Already live.
+- **Yerin** — security/vigilance (this doc). Read-only.
+- **Orthos** — see the role proposal below.
+
+### Proposed role: Orthos — the local-model Advisor / Council
+
+Orthos in Cradle is the ancient dragon-turtle: a **powerful ally with long
+experience and a deep spiritual (Path of Black Flame) connection** — exactly your
+framing. That maps cleanly onto a specific, useful Void 2.0 role:
+
+**Orthos = the reflective Advisor, running on the local Ollama model (free,
+always-on), not Claude.** He plays to a local model's strengths (summarisation &
+synthesis, not precise tool-use):
+
+- **The long view / "council briefing."** A scheduled pass that fuses recent
+  captures, open tasks, and audit highlights into a short reflective digest —
+  the Void 1.x "Orthos council briefing" reborn as a Sacred Valley widget.
+- **"Spiritual connection" = the knowledge graph.** Let Orthos traverse the
+  embedding/RRF links to surface *non-obvious connections* across Spaces
+  ("these three captures in two different Spaces are converging on X"). That's
+  the literal mechanisation of his deep-sight: semantic connection-finding over
+  the vector store.
+- **Why local:** it's high-frequency, low-stakes, cost-sensitive reflection —
+  perfect for `llama3.1:8b` on `192.168.1.185`, keeping Claude budget for Dross.
+  Capability: `{ read: true }`, `kind: 'ollama'`, `model: 'llama3.1:8b'`.
+
+So the trio reads naturally: **Dross** converses, **Yerin** guards, **Orthos**
+reflects — and two of the three (Yerin alerts, Orthos council) become Plan 6
+Sacred Valley widgets. See `docs/plan-6-brainstorm-brief.md`.