
root 7707b7eb00 chore: version 2.0.0-alpha.4 + changelog + plan-4 completion doc

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-06-01 10:25:31 +10:00

4.2 KiB


Plan 4 — Complete

Date: 2026-06-01
Version: 2.0.0-alpha.4
Tests: 17 Python + 247 Node + 1 gated-skipped (full suite green when run cleanly)
Snapshots: plan4_pre_resize_<ts>, plan4_phase_c_<ts>, plan4_complete_<ts> on CT 310 + 311.

Scope delivered

Phase A — Python harness

workers/ package with pyproject.toml (Python ≥3.12; CT 311 runs 3.13).
boss.py — SELECT ... FOR UPDATE SKIP LOCKED LIMIT 1 claim, atomic complete/fail, retry semantics matching pg-boss v10 (retry_count, retry_delay, retry_backoff). Forces client_encoding=UTF8 because void2-db is SQL_ASCII.
runner.py — ThreadPoolExecutor per registered handler, signal handling, once=True mode for tests.
echo handler proved the harness end-to-end (Node enqueue → Python claim → output back).
deploy/void-workers.service (systemd, MemoryMax=6G, runs as voidworkers).
deploy/push-workers.sh — rsync, chown to voidworkers, venv create + pip install -e ".[all]" under su voidworkers -c, restart unit. Excludes .env, .gitignore, .pytest_cache, tests/ so deploys are idempotent.
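
The claim step described for boss.py can be sketched roughly as below. This is a minimal illustration, not the project's code: the column names (state, start_after, started_on, priority, created_on) are assumed from the pg-boss v10 pgboss.job table and should be checked against the deployed schema, psycopg 3 is an assumed driver, and claim_one and connect_utf8 are hypothetical names. Retry bookkeeping (retry_count, retry_delay, retry_backoff) is left out to keep the sketch short.

```python
# Sketch only: column names are assumed from the pg-boss v10 pgboss.job table.
CLAIM_SQL = """
WITH next AS (
    SELECT id
    FROM pgboss.job
    WHERE name = %s
      AND state IN ('created', 'retry')
      AND start_after <= now()
    ORDER BY priority DESC, created_on, id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE pgboss.job AS j
SET state = 'active', started_on = now()
FROM next
WHERE j.id = next.id
RETURNING j.id, j.data
"""


def connect_utf8(dsn):
    """Open a connection that forces UTF8 even when the database is SQL_ASCII."""
    import psycopg  # assumed driver (psycopg 3); imported lazily

    return psycopg.connect(dsn, client_encoding="utf8")


def claim_one(conn, queue):
    """Claim at most one job; return (id, data), or None if nothing is ready."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, (queue,))
        row = cur.fetchone()
    conn.commit()  # ends the transaction; the job row is now marked active
    return row
```

Because SKIP LOCKED makes concurrent workers pass over rows another transaction has already locked, two workers polling the same queue should not receive the same job.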

Phase B — PDF + image OCR

lib/jobs/workers/blob.js (Node) — after creating a PDF/image ref, enqueues extract.pdf or extract.image with {ref_id, blob_path}.
extract.pdf — pdftotext -layout first; per-page pdftoppm rasterize + Tesseract OCR fallback when extraction < 200 chars.
extract.image — Tesseract OCR via pytesseract.image_to_string with English data.
repo.update_ref — UPDATE refs + emit audit_log row with actor_kind='worker'.
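
The extract.pdf fallback above could look something like the following. It is a sketch under stated assumptions: the function names, the 300 dpi rasterization setting, and the choice to strip whitespace before measuring length are illustrative and not taken from the handler itself. Only the 200-character threshold and the tool order (pdftotext -layout, then pdftoppm plus Tesseract) come from the plan.

```python
import subprocess
import tempfile
from pathlib import Path

MIN_TEXT_CHARS = 200  # threshold named in the plan


def pdftotext_layout(pdf_path):
    """Extract the embedded text layer with poppler's pdftotext."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],  # "-" writes to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_pages(pdf_path, lang="eng", dpi=300):
    """Rasterize each page with pdftoppm, then OCR the PNGs with Tesseract."""
    import pytesseract  # imported lazily so the text-only path needs no OCR stack

    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)],
            check=True,
        )
        # pdftoppm zero-pads page numbers, so a lexical sort keeps page order.
        pages = sorted(Path(tmp).glob("page-*.png"))
        return "\n".join(
            pytesseract.image_to_string(str(p), lang=lang) for p in pages
        )


def extract_pdf_text(pdf_path, text_fn=pdftotext_layout, ocr_fn=ocr_pages):
    """Prefer the text layer; fall back to OCR when it is nearly empty."""
    text = text_fn(pdf_path)
    if len(text.strip()) >= MIN_TEXT_CHARS:
        return text
    return ocr_fn(pdf_path)
```

Passing text_fn and ocr_fn as parameters is a design choice for this sketch: it lets the threshold logic be tested without poppler or Tesseract installed.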

Phase C — Whisper + yt-dlp + GPU

CT 311 resized from 4 cores / 4 GB to 6 cores / 8 GB.
GPU passthrough — /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm, /dev/nvidia-uvm-tools, /dev/nvidia-caps/nvidia-cap1 bind-mounted into CT 311 (shared with CT 102's Ollama).
model.py — faster-whisper loader. cuda_available() probes ctranslate2.get_cuda_device_count(); uses CUDA + float16 when present, CPU + int8 otherwise. Model cache at /var/lib/void/whisper-models.
ingest.video — yt-dlp -J for metadata + yt-dlp -x --audio-format opus for audio. faster-whisper transcribes; audio file deleted. Creates a refs row (kind='video', source_kind='youtube' or 'video') idempotent on sha256(space_id + url).
lib/api/routes/capture.js (Node) — detects youtube.com / youtu.be / vimeo.com URLs and enqueues ingest.video instead of ingest.url.
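
The device selection in model.py might be sketched as below. The names pick_device, cuda_device_count, and load_model are hypothetical, and treating any probe failure as "no GPU" is an assumption of this sketch. The CUDA + float16 / CPU + int8 pairing, the small.en default, and the cache path come from this document.

```python
def pick_device(device_count):
    """Map a CUDA device count to a (device, compute_type) pair."""
    if device_count > 0:
        return "cuda", "float16"
    return "cpu", "int8"


def cuda_device_count():
    """Probe CTranslate2 for CUDA devices; treat any failure as no GPU."""
    try:
        import ctranslate2  # imported lazily; may be missing on dev machines

        return ctranslate2.get_cuda_device_count()
    except Exception:
        return 0


def load_model(name="small.en", cache_dir="/var/lib/void/whisper-models"):
    """Load a faster-whisper model on the best available device."""
    from faster_whisper import WhisperModel  # heavy import, kept lazy

    device, compute_type = pick_device(cuda_device_count())
    return WhisperModel(
        name,
        device=device,
        compute_type=compute_type,
        download_root=cache_dir,
    )
```

Keeping pick_device free of imports means the fallback rule can be checked on a machine without a GPU or the Whisper dependencies.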

Phase D — Source-doc sync + alpha-4

safe_fetch.py — Python port of lib/ingest/safe_fetch.js (scheme check, IP-range blocklist, redirect re-validation, VOID_INGEST_ALLOW_PRIVATE gate).
sync.source_doc — safe_fetch upstream + sha256 diff against prior body_sha in metadata; updates body_text only on change.
lib/cron/sync_source_docs.js + lib/cron/index.js (Node) — node-cron schedules runSync at 03:00 local time, enqueueing sync.source_doc for every row with sync_source='url'.
Version bumped to 2.0.0-alpha.4 in package.json, server.js, and the /health test assertion. CHANGELOG appended.
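
The scheme check and IP-range blocklist in safe_fetch.py could be sketched along these lines. This is not the ported code: is_blocked_ip and check_target are hypothetical names, the exact set of blocked ranges is an assumption, and DNS resolution is left to the caller so that the same check can be repeated on every redirect hop.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_blocked_ip(addr):
    """Return True for addresses a server-side fetch should not reach."""
    ip = ipaddress.ip_address(addr)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it is judged as IPv4.
    ip = getattr(ip, "ipv4_mapped", None) or ip
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def check_target(url, resolved_addrs, allow_private=False):
    """Raise ValueError if the URL or any resolved address is not allowed.

    allow_private mirrors the VOID_INGEST_ALLOW_PRIVATE gate (assumed semantics).
    """
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    if allow_private:
        return
    for addr in resolved_addrs:
        if is_blocked_ip(addr):
            raise ValueError(f"blocked address: {addr}")
```

A caller would resolve the host, pass every returned address to check_target, and run the same check again on each redirect Location before following it.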

Security findings handled inline

Finding	Source	Resolution
yt-dlp argv flag smuggling in `video.py`	reviewer	`_validate_url` checks scheme is http(s); `--` passed before positional URL to stop flag parsing.
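
The resolution in the table can be illustrated with a short sketch. The function names and the output-template parameter are hypothetical; the two ideas it shows (an http(s) scheme check, and `--` placed before the positional URL) are the ones named above.

```python
from urllib.parse import urlsplit


def validate_url(url):
    """Accept only absolute http(s) URLs; anything else raises ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"refusing non-http(s) URL: {url!r}")
    return url


def audio_argv(url, out_template):
    """Build the yt-dlp argv for audio extraction as a list (no shell)."""
    return [
        "yt-dlp",
        "-x", "--audio-format", "opus",
        "-o", out_template,
        "--",  # end of options: the URL can no longer be parsed as a flag
        validate_url(url),
    ]
```

A value such as `--exec=...` fails the scheme check before it reaches the argument list, and the `--` separator acts as a second layer if validation is ever loosened.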

UI smoke

Plan 4 ships no SPA changes. The existing Plan 3 Jobs view shows extract.pdf / extract.image / ingest.video jobs alongside Node-side ones — both sides write to the same pgboss.job rows.

Open items for the user

alpha-4 deploy. Standing rule per Plans 2/3: won't deploy without your explicit OK. alpha-3 stays live until then.
WHISPER_MODEL default is small.en. Bump to medium.en once you've stress-tested transcription quality.
yt-dlp cookies for age-gated content — add YT_DLP_COOKIES_FILE env when wanted (small handler tweak).
Tesseract languages beyond English — install via tesseract-ocr-<lang> packages on CT 311 and pass lang="..." to image_to_string.
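
The two environment-driven items above could be wired in roughly as follows. This is a sketch of the proposed handler tweak, not existing code: whisper_model_name and cookie_args are hypothetical helpers, and YT_DLP_COOKIES_FILE is the variable name suggested in this list rather than one the workers read today.

```python
import os


def whisper_model_name(env=os.environ):
    """Return the configured Whisper model, defaulting to small.en."""
    return env.get("WHISPER_MODEL", "small.en")


def cookie_args(env=os.environ):
    """Return extra yt-dlp arguments when a cookies file is configured."""
    path = env.get("YT_DLP_COOKIES_FILE")
    return ["--cookies", path] if path else []
```

The list returned by cookie_args would be spliced into the yt-dlp argument list ahead of the `--` separator, so an unset variable leaves the command unchanged.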

What's left after Plan 4

Plan 5 — Companion chat in right rail.
Plan 6 — Sacred Valley widgets ported from Void 1.x.
