Commit Graph

13 Commits

Author SHA1 Message Date
root
3988425e67 chore: gitignore python build artifacts (untrack pycache/egg-info)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 22:23:06 +10:00
root
16497bd9db chore: version 2.0.0-alpha.6 — companion on Claude CLI subprocess (Max subscription)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 22:22:53 +10:00
root
a8b2cddcf5 fix(workers): safe_fetch pins IP + manual redirect re-validation
Two real findings from the security reviewer:

1. urllib auto-follows 3xx redirects via the default HTTPRedirectHandler.
   The previous code's hop loop never ran — urllib silently followed.
   Replaced with http.client + a manual hop loop. Every hop re-runs
   _validate_url, so an open-redirect to 127.0.0.1 / RFC1918 / metadata
   gets caught on the second hop.

2. DNS TOCTOU — _resolve() validated but urllib.request re-resolved on
   connect. Now the connection is pinned to the validated IP via a
   PinnedHTTPConn / PinnedHTTPSConn subclass that overrides connect() to
   bind socket.create_connection to (addr, port). For HTTPS, TLS
   server_hostname is set to the original host so SNI + cert
   verification still work against the named host while the TCP
   destination is the pinned IP.

Tests added: redirect-to-loopback short-circuits at validation;
too-many-redirects exhausts max_hops; 2xx returns body; non-2xx raises.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:28:55 +10:00
root
8fa7f71694 feat(workers): sync.source_doc with sha256 diff
Fetches upstream URL via safe_fetch, sha256-diffs against the prior
body_sha stored in metadata, updates body_text + last_synced only when
content changed. Unchanged syncs just touch last_synced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:13:27 +10:00
root
cd1d69c689 feat(workers): safe_fetch Python port
Mirrors lib/ingest/safe_fetch.js. Same scheme + IP-range checks and
VOID_INGEST_ALLOW_PRIVATE env gate. Used by sync.source_doc and any
future Python workers that fetch user-controlled URLs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:12:47 +10:00
root
65fd71dc0d fix(workers): yt-dlp argv injection — scheme check + -- separator
The url passed to yt-dlp is user-controllable (via /api/capture). Any
string starting with '-' would be parsed as a flag (e.g.
--config-location=/etc/passwd). Mitigations:
1. Validate scheme is http(s) and hostname is present before subprocess.
2. Pass `--` to yt-dlp so it stops flag parsing before the positional
   URL.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:11:57 +10:00
root
1ba7aae439 feat(workers): ingest.video via yt-dlp + Whisper
yt-dlp pulls metadata (title, description, uploader, thumbnail) and
bestaudio (opus). faster-whisper transcribes; audio file removed after.
Creates a refs row with kind='video' and source_kind='youtube' for
YouTube URLs, generic 'video' otherwise. Idempotent on
sha256(space_id + url) via refs.external_id.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:07:33 +10:00
root
e64f1345f6 feat(workers): whisper loader with CUDA detect + CPU fallback
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:06:50 +10:00
root
f2035c1de6 feat(workers): extract.image via Tesseract
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 05:00:21 +10:00
root
1f0e9a5f1b feat(workers): extract.pdf with Tesseract fallback
pdftotext first; falls back to per-page pdftoppm rasterization +
Tesseract OCR when the extracted text is < 200 chars. Updates
refs.body_text + metadata.extract.{method,chars} via the repo shim;
audit entry emitted with actor_kind='worker'.

born_digital.pdf fixture padded so pdftotext yields > 200 chars and
the test exercises the pdftotext path, not the OCR fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:59:53 +10:00
root
bbb08a677e test(workers): pdf/image test fixtures
born_digital.pdf (pdftotext extractable), scanned.pdf (image-only, OCR
fallback target), eng_text.png (clear Tesseract-readable text).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:57:41 +10:00
root
fba1ce48e4 feat(workers): runner loop + echo handler
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:43:52 +10:00
root
3e1dcbb7f8 feat(workers): pgboss claim/complete/fail via psycopg
Adds the Boss class — SELECT … FOR UPDATE SKIP LOCKED to atomically
claim, UPDATE state on completion. Retry semantics match pg-boss:
exponential backoff via retry_count / retry_delay / retry_backoff.

Forces client_encoding=UTF8 on every connection. The void2-db cluster
was initialized as SQL_ASCII so psycopg refuses to decode text by
default; UTF8 client_encoding works because the data is already UTF-8.
Node's pg lib is more forgiving and didn't surface this.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:43:26 +10:00