The prod venv at /opt/void-workers/venv was being deleted on every
push because rsync --delete saw no matching dir in the source (which
has .venv/, not venv/). Now both names are excluded.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two real findings from the security reviewer:
1. urllib auto-follows 3xx redirects via the default HTTPRedirectHandler,
so the previous code's hop loop never ran: urllib silently followed.
Replaced with http.client + a manual hop loop. Every hop re-runs
_validate_url, so an open-redirect to 127.0.0.1 / RFC1918 / metadata
gets caught on the second hop.
2. DNS TOCTOU: _resolve() validated the address, but urllib.request
re-resolved the hostname on connect. Now the connection is pinned to the
validated IP via a PinnedHTTPConn / PinnedHTTPSConn subclass that
overrides connect() to call socket.create_connection((addr, port)).
For HTTPS, TLS
server_hostname is set to the original host so SNI + cert
verification still work against the named host while the TCP
destination is the pinned IP.
Tests added: redirect-to-loopback short-circuits at validation;
too-many-redirects exhausts max_hops; 2xx returns body; non-2xx raises.
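
A minimal sketch of the manual hop loop from finding 1, under stated
assumptions: fetch_once and validate_url are hypothetical stand-ins for
the real http.client request and _validate_url, and the
(status, location, body) tuple is an illustrative shape rather than the
actual return type. How max_hops is counted is also an assumption.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}

def fetch_with_hops(url, fetch_once, validate_url, max_hops=5):
    """Follow redirects by hand, validating every hop before connecting."""
    for _ in range(max_hops + 1):
        # Raises on loopback / RFC1918 / metadata targets (hypothetical hook).
        validate_url(url)
        status, location, body = fetch_once(url)
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # Location may be relative
            continue
        if 200 <= status < 300:
            return body
        raise RuntimeError(f"unexpected status {status} for {url}")
    raise RuntimeError(f"too many redirects (max_hops={max_hops})")
```

Because validation runs at the top of each iteration, a redirect target
is rejected before any connection to it is attempted.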
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
node-cron schedules runSync at 03:00 local time; runSync enqueues
sync.source_doc for every source_docs row with sync_source='url'.
Started from server.js's CLI gate alongside the job queue.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fetches upstream URL via safe_fetch, sha256-diffs against the prior
body_sha stored in metadata, updates body_text + last_synced only when
content changed. Unchanged syncs just touch last_synced.
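
The diff-then-update decision above can be sketched as a pure function.
This is an illustration only: plan_sync_update and the shape of the
returned dict are hypothetical, and UTF-8 encoding of the body before
hashing is an assumption.

```python
import hashlib
from datetime import datetime, timezone

def plan_sync_update(prior_sha, body, now=None):
    """Return the columns to write for one sync pass (hypothetical shape)."""
    now = now or datetime.now(timezone.utc)
    new_sha = hashlib.sha256(body.encode("utf-8")).hexdigest()
    updates = {"last_synced": now}  # always touched, changed or not
    if new_sha != prior_sha:
        # Content changed: rewrite the body and remember the new hash.
        updates["body_text"] = body
        updates["body_sha"] = new_sha
    return updates
```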
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors lib/ingest/safe_fetch.js. Same scheme + IP-range checks and
VOID_INGEST_ALLOW_PRIVATE env gate. Used by sync.source_doc and any
future Python workers that fetch user-controlled URLs.
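
A rough sketch of the kind of IP-range check described here, using the
stdlib ipaddress module. is_blocked_ip is a hypothetical name, the exact
set of ranges in the real module may differ, and treating the value "1"
as "enabled" for VOID_INGEST_ALLOW_PRIVATE is an assumption.

```python
import ipaddress
import os

# Carrier-grade NAT space; not covered by ipaddress's is_private flag.
CGNAT = ipaddress.ip_network("100.64.0.0/10")

def is_blocked_ip(addr, env=os.environ):
    """True when addr must not be fetched (hypothetical helper)."""
    if env.get("VOID_INGEST_ALLOW_PRIVATE") == "1":  # assumed flag value
        return False
    ip = ipaddress.ip_address(addr)
    # Unwrap ::ffff:a.b.c.d so IPv4-mapped IPv6 cannot dodge the checks.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    if ip.version == 4 and ip in CGNAT:
        return True
    # Link-local covers the 169.254.169.254 metadata endpoint.
    return (ip.is_loopback or ip.is_private
            or ip.is_link_local or ip.is_unspecified)
```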
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The url passed to yt-dlp is user-controllable (via /api/capture). Any
string starting with '-' would be parsed as a flag (e.g.
--config-location=/etc/passwd). Mitigations:
1. Validate scheme is http(s) and hostname is present before subprocess.
2. Pass `--` to yt-dlp so it stops flag parsing before the positional
URL.
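
The two mitigations can be sketched together as follows. build_ytdlp_argv
is a hypothetical helper, and the real worker passes additional yt-dlp
options that are omitted here.

```python
from urllib.parse import urlparse

def build_ytdlp_argv(url):
    """Validate a user-supplied URL and build a flag-safe argv."""
    parsed = urlparse(url)
    # Mitigation 1: only http(s) URLs with a hostname reach the subprocess.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"refusing to pass {url!r} to yt-dlp")
    # Mitigation 2: `--` ends option parsing, so the URL stays positional.
    return ["yt-dlp", "--", url]
```

Passing the result to subprocess.run as a list (no shell) keeps the URL
a single argument.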
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
POST /api/capture with a youtube.com / youtu.be / vimeo.com URL
enqueues ingest.video (Python worker) instead of ingest.url
(Node worker). Detection by URL hostname; idempotency_key + response
shape unchanged.
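
A sketch of hostname-based routing, written in Python for illustration
although the route itself lives in the Node server. job_name_for is a
hypothetical name, and matching subdomains such as www.youtube.com is an
assumption about the real detection.

```python
from urllib.parse import urlparse

VIDEO_HOSTS = ("youtube.com", "youtu.be", "vimeo.com")

def job_name_for(url):
    """Pick the queue name from the URL's hostname only."""
    host = (urlparse(url).hostname or "").lower()
    # Exact match or a true subdomain; "notyoutube.com" must not match.
    is_video = any(host == h or host.endswith("." + h) for h in VIDEO_HOSTS)
    return "ingest.video" if is_video else "ingest.url"
```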
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
yt-dlp pulls metadata (title, description, uploader, thumbnail) and
bestaudio (opus). faster-whisper transcribes; audio file removed after.
Creates a refs row with kind='video' and source_kind='youtube' for
YouTube URLs, generic 'video' otherwise. Idempotent on
sha256(space_id + url) via refs.external_id.
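
For illustration, the idempotency key could be derived like this.
video_external_id is a hypothetical name; plain concatenation with no
separator and UTF-8 encoding are assumptions read off the
sha256(space_id + url) shorthand.

```python
import hashlib

def video_external_id(space_id, url):
    """Stable key: the same (space_id, url) always maps to the same id."""
    return hashlib.sha256(f"{space_id}{url}".encode("utf-8")).hexdigest()
```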
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rsync ran as root over SSH so files landed root-owned, but workers run
as voidworkers — the service couldn't even reach the venv binary.
Now: chown -R voidworkers after rsync, run venv create + pip install
under `su voidworkers -c`. Also excludes .env, .gitignore, .pytest_cache
so they survive across deploys.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After creating a ref, the Node-side ingest.blob worker enqueues a
follow-up job for the Python void-workers (Plan 4) to OCR / extract
text. Other kinds (file) get no follow-up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftotext first; falls back to per-page pdftoppm rasterization +
Tesseract OCR when the extracted text is < 200 chars. Updates
refs.body_text + metadata.extract.{method,chars} via the repo shim;
audit entry emitted with actor_kind='worker'.
born_digital.pdf fixture padded so pdftotext yields > 200 chars and
the test exercises the pdftotext path, not the OCR fallback.
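
The fallback rule can be sketched with the two extractors injected as
callables. run_pdftotext and run_ocr are hypothetical stand-ins for the
real subprocess wrappers, the method labels are assumptions, and whether
the real check strips whitespace before counting is not stated.

```python
MIN_CHARS = 200  # threshold named in the commit message

def extract_pdf_text(pdf_path, run_pdftotext, run_ocr, min_chars=MIN_CHARS):
    """Try pdftotext first; fall back to OCR when the text is too short."""
    text = run_pdftotext(pdf_path)
    method = "pdftotext"
    if len(text) < min_chars:
        # Likely a scanned PDF: rasterize pages and OCR them instead.
        text = run_ocr(pdf_path)
        method = "ocr"
    return {"body_text": text,
            "extract": {"method": method, "chars": len(text)}}
```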
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Deploy README extended with workers bootstrap + note on the void2-db
SQL_ASCII cluster requiring client_encoding=UTF8 on Python clients.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the Boss class: SELECT … FOR UPDATE SKIP LOCKED claims a job
atomically, and an UPDATE records state on completion. Retry semantics
match pg-boss:
exponential backoff via retry_count / retry_delay / retry_backoff.
Forces client_encoding=UTF8 on every connection. The void2-db cluster
was initialized as SQL_ASCII so psycopg refuses to decode text by
default; UTF8 client_encoding works because the data is already UTF-8.
Node's pg lib is more forgiving and didn't surface this.
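
A generic sketch of how the three retry fields could combine. This is
plain doubling per attempt; the exact formula (and any jitter or cap)
used by pg-boss or the Boss class is not shown here, so treat
next_retry_delay as hypothetical.

```python
def next_retry_delay(retry_count, retry_delay, retry_backoff):
    """Seconds to wait before the next attempt (illustrative formula)."""
    if not retry_backoff:
        return retry_delay  # fixed delay between attempts
    # Assumed exponential schedule: delay doubles with each prior retry.
    return retry_delay * (2 ** retry_count)
```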
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Plan 4 Phase A scaffolding. void-workers package at /workers/, sibling
of /lib/. pyproject.toml pins Python 3.12 with separate extras for
pdf / image / video / test.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
All Void 2.0 superpowers specs and implementation plans now live at
docs/superpowers/{specs,plans}/ inside the repo. Previously they were
at /project/docs/superpowers/ which was not under git.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Calling cb(null, address, family) made undici v6 fail with
"Invalid IP address: undefined". Returning the full records array
(each {address, family}) gives undici what it expects and lets it pick
the best family.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Files dropped onto #main are POSTed to /api/capture/upload one at a
time, with space_id pre-filled from localStorage.last_space_id (set
whenever the space view renders).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
POST /api/ingest/karakeep accepts Karakeep webhook payloads. The HMAC
signature is verified against the raw body captured by express.json's
verify hook.
Mounted on app before mountApi so it bypasses agentOrOwner — the
shared secret IS the auth.
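
The raw-body check can be sketched in Python (the route itself is Node).
SHA-256 and hex encoding of the signature are assumptions, since the
commit names only "HMAC", and verify_signature is a hypothetical helper.

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, signature_hex: str, secret: bytes) -> bool:
    """Constant-time check of a hex HMAC over the exact bytes received."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Compare as bytes so non-ASCII input cannot raise inside compare_digest.
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature_hex.encode("utf-8"))
```

The signature has to be computed over the raw bytes rather than the
re-serialized JSON, which is why the body is captured before parsing.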
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces FTS-only /api/search in place. RRF (k=60) fuses ts_rank and
pgvector cosine distance rankings. Vector branch silently skipped when
Ollama times out / errors, keeping search snappy and resilient.
Messages have no embeddings in Plan 3, so they participate in the FTS
branch only.
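
Reciprocal rank fusion itself is small enough to sketch. Each input is a
list of ids ordered best-first; rrf_fuse and the example ids are
hypothetical, and 1-based ranks are assumed.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

fts_hits = ["a", "b", "c"]   # ordered by ts_rank
vector_hits = ["b", "d"]     # ordered by cosine distance
fused = rrf_fuse([fts_hits, vector_hits])
```

Passing an empty list for the vector branch leaves the FTS order
unchanged, which matches the skip-on-timeout behavior described above.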
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the validate-then-call-fetch pattern (which left a TOCTOU
window where the OS resolver could return a different IP at connect
time) with an undici Agent dispatcher whose lookup() returns the IP we
already validated. Same hardening on every redirect hop.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
create/update on embeddable repos enqueue embed.text with a singleton
key that coalesces rapid edits. No-op when the queue is not running
(server tests construct createApp without booting pg-boss).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Zero-pads nomic-embed-text's 768-dim vectors to 1024 dims so a later
1024-dim model swap is a re-embed, not a migration (per master spec).
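
A sketch of the padding step; pad_embedding is a hypothetical name.
Appending zeros changes neither dot products nor vector norms, so cosine
distances between padded vectors equal those between the originals.

```python
def pad_embedding(vec, target_dims=1024):
    """Right-pad an embedding with zeros up to target_dims."""
    if len(vec) > target_dims:
        raise ValueError(f"embedding has {len(vec)} dims, max {target_dims}")
    return list(vec) + [0.0] * (target_dims - len(vec))
```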
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
safe_fetch.js validates URLs before fetch: rejects non-http(s), literal
or DNS-resolved loopback / RFC1918 / link-local / CGNAT / metadata
addresses; follows redirects manually with the same checks on each hop.
The check is gated by VOID_INGEST_ALLOW_PRIVATE so offline test
fixtures can hit 127.0.0.1.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Unifies pgboss.job (current, per-queue partitioned) and pgboss.archive
under one SELECT for operator views. retry promotes archived rows back
into the active partition.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Job queue starts only in the CLI gate (not inside createApp), so tests
manage their own queue lifecycle. waitForJob() takes a (name, id) pair
to match pg-boss v10's getJobById signature.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per-name ensureQueue promise dedup so concurrent enqueue+subscribe
on the same queue do not race createQueue (Postgres deadlock).
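
The same dedup idea, sketched with asyncio as an analog of the
per-name promise map (the real code is Node). QueueEnsurer and
create_queue are hypothetical names.

```python
import asyncio

class QueueEnsurer:
    """Share one in-flight create call per queue name."""

    def __init__(self, create_queue):
        self._create_queue = create_queue  # async callable taking a name
        self._pending = {}

    def ensure(self, name):
        task = self._pending.get(name)
        if task is None:
            # First caller starts the work; later callers await the same task.
            task = asyncio.ensure_future(self._create_queue(name))
            self._pending[name] = task
        return task
```

ensure() must be called from inside a running event loop, since it
schedules the task immediately.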
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Search view: read ?q from hash, call /api/search, group hits by kind
with rank + space_id; sidebar filters for kinds and space_id; updates
on Enter or filter change.
Bumps package.json + server.js VERSION to 2.0.0-alpha.2 and pins the
/health version assertion to match.
CHANGELOG: full Plan 2 entry covering API surface, capability tiering,
audit chain extension (approve/reject events), and the SPA shell.
Security: adds safeHref() to dom.js and applies it everywhere an
API-supplied URL becomes href / src (reference media block + reference
source_url anchor + resource url anchor). javascript: and other
non-http(s)/mailto schemes from agent-suggested content can no longer
execute in the owner's browser.
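
The allow-list idea behind safeHref(), sketched in Python for
illustration (the real helper is browser JavaScript in dom.js).
Returning "#" as the fallback and rejecting scheme-less relative URLs
are assumptions, not confirmed behavior.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https", "mailto"}

def safe_href(url, fallback="#"):
    """Return url only when its scheme is on the allow-list."""
    candidate = url.strip()
    try:
        scheme = urlsplit(candidate).scheme  # lowercased by urlsplit
    except ValueError:
        return fallback
    return candidate if scheme in ALLOWED_SCHEMES else fallback
```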
Plan 2 surface is feature-complete: 22/22 tasks landed, 185 tests
across 43 files, SPA renders end-to-end including the suggest -> approve
agent flow.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>