16 Commits

Author SHA1 Message Date
root
a9191cee00 feat(workers): free Ollama VRAM before loading Whisper on the GPU
Whisper (CT 311) and Ollama (CT 102) share one A2000. Before loading
Whisper on CUDA, ask Ollama to unload its models (GET /api/ps then POST
/api/generate keep_alive:0) and wait for the card to clear, so the GPU
load has headroom. Best-effort and stdlib-only; Ollama reloads
cooperatively, and the existing CUDA->CPU fallback covers any failure.
Toggle via OLLAMA_FREE_BEFORE_STT; endpoint via OLLAMA_URL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 21:12:05 +10:00
root
3c028fed5a fix(workers): graceful GPU→CPU fallback for Whisper at load time
cuda_available() only covers "no GPU present". On a shared card the GPU
can exist but fail to load the model (VRAM exhausted by another process
e.g. Ollama). Try CUDA first, fall back to a CPU model on any load
error instead of crashing the transcription job. Supports HA portability
(node without GPU) and a contended GPU. Adds GPU-path + fallback tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 08:04:14 +10:00
root
3988425e67 chore: gitignore python build artifacts (untrack pycache/egg-info)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 22:23:06 +10:00
root
16497bd9db chore: version 2.0.0-alpha.6 — companion on Claude CLI subprocess (Max subscription)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 22:22:53 +10:00
root
a8b2cddcf5 fix(workers): safe_fetch pins IP + manual redirect re-validation
Two real findings from the security reviewer:

1. urllib auto-follows 3xx redirects via the default HTTPRedirectHandler.
   The previous code's hop loop never ran — urllib silently followed.
   Replaced with http.client + a manual hop loop. Every hop re-runs
   _validate_url, so an open-redirect to 127.0.0.1 / RFC1918 / metadata
   gets caught on the second hop.

2. DNS TOCTOU — _resolve() validated but urllib.request re-resolved on
   connect. Now the connection is pinned to the validated IP via a
   PinnedHTTPConn / PinnedHTTPSConn subclass that overrides connect() to
   bind socket.create_connection to (addr, port). For HTTPS, TLS
   server_hostname is set to the original host so SNI + cert
   verification still work against the named host while the TCP
   destination is the pinned IP.

Tests added: redirect-to-loopback short-circuits at validation;
too-many-redirects exhausts max_hops; 2xx returns body; non-2xx raises.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:28:55 +10:00
root
8fa7f71694 feat(workers): sync.source_doc with sha256 diff
Fetches upstream URL via safe_fetch, sha256-diffs against the prior
body_sha stored in metadata, updates body_text + last_synced only when
content changed. Unchanged syncs just touch last_synced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:13:27 +10:00
root
cd1d69c689 feat(workers): safe_fetch Python port
Mirrors lib/ingest/safe_fetch.js. Same scheme + IP-range checks and
VOID_INGEST_ALLOW_PRIVATE env gate. Used by sync.source_doc and any
future Python workers that fetch user-controlled URLs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:12:47 +10:00
root
65fd71dc0d fix(workers): yt-dlp argv injection — scheme check + -- separator
The url passed to yt-dlp is user-controllable (via /api/capture). Any
string starting with '-' would be parsed as a flag (e.g.
--config-location=/etc/passwd). Mitigations:
1. Validate scheme is http(s) and hostname is present before subprocess.
2. Pass `--` to yt-dlp so it stops flag parsing before the positional
   URL.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:11:57 +10:00
root
1ba7aae439 feat(workers): ingest.video via yt-dlp + Whisper
yt-dlp pulls metadata (title, description, uploader, thumbnail) and
bestaudio (opus). faster-whisper transcribes; audio file removed after.
Creates a refs row with kind='video' and source_kind='youtube' for
YouTube URLs, generic 'video' otherwise. Idempotent on
sha256(space_id + url) via refs.external_id.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:07:33 +10:00
root
e64f1345f6 feat(workers): whisper loader with CUDA detect + CPU fallback
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:06:50 +10:00
root
f2035c1de6 feat(workers): extract.image via Tesseract
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 05:00:21 +10:00
root
1f0e9a5f1b feat(workers): extract.pdf with Tesseract fallback
pdftotext first; falls back to per-page pdftoppm rasterization +
Tesseract OCR when the extracted text is < 200 chars. Updates
refs.body_text + metadata.extract.{method,chars} via the repo shim;
audit entry emitted with actor_kind='worker'.

born_digital.pdf fixture padded so pdftotext yields > 200 chars and
the test exercises the pdftotext path, not the OCR fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:59:53 +10:00
root
bbb08a677e test(workers): pdf/image test fixtures
born_digital.pdf (pdftotext extractable), scanned.pdf (image-only, OCR
fallback target), eng_text.png (clear Tesseract-readable text).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:57:41 +10:00
root
fba1ce48e4 feat(workers): runner loop + echo handler
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:43:52 +10:00
root
3e1dcbb7f8 feat(workers): pgboss claim/complete/fail via psycopg
Adds the Boss class — SELECT … FOR UPDATE SKIP LOCKED to atomically
claim, UPDATE state on completion. Retry semantics match pg-boss:
exponential backoff via retry_count / retry_delay / retry_backoff.

Forces client_encoding=UTF8 on every connection. The void2-db cluster
was initialized as SQL_ASCII so psycopg refuses to decode text by
default; UTF8 client_encoding works because the data is already UTF-8.
Node's pg lib is more forgiving and didn't surface this.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:43:26 +10:00
root
6e3798f6d1 feat(workers): Python skeleton + config + structlog
Plan 4 Phase A scaffolding. void-workers package at /workers/, sibling
of /lib/. pyproject.toml pins Python 3.12 with separate extras for
pdf / image / video / test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:41:33 +10:00