Void-Homelab/docs/plan-4-complete.md
2026-06-01 10:25:31 +10:00

Plan 4 — Complete

Date: 2026-06-01
Version: 2.0.0-alpha.4
Tests: 17 Python + 247 Node + 1 gated-skipped (full suite green when run cleanly)
Snapshots: plan4_pre_resize_<ts>, plan4_phase_c_<ts>, plan4_complete_<ts> on CT 310 + 311.

Scope delivered

Phase A — Python harness

  • workers/ package with pyproject.toml (Python ≥3.12; CT 311 runs 3.13).
  • boss.py: SELECT ... FOR UPDATE SKIP LOCKED LIMIT 1 claim, atomic complete/fail, retry semantics matching pg-boss v10 (retry_count, retry_delay, retry_backoff). Forces client_encoding=UTF8 because void2-db is SQL_ASCII.
  • runner.py: ThreadPoolExecutor per registered handler, signal handling, once=True mode for tests.
  • echo handler proved the harness end-to-end (Node enqueue → Python claim → output back).
  • deploy/void-workers.service (systemd, MemoryMax=6G, runs as voidworkers).
  • deploy/push-workers.sh — rsync, chown to voidworkers, venv create + pip install -e ".[all]" under su voidworkers -c, restart unit. Excludes .env, .gitignore, .pytest_cache, tests/ so deploys are idempotent.
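
The claim step in boss.py can be sketched roughly as follows. This is a minimal illustration, not the shipped code: the WHERE and ORDER BY clauses, the `claim_one` name, and the use of a psycopg-style connection are assumptions, and the column names are taken from the pg-boss v10 schema as best understood here.

```python
# Sketch only: the filter and ordering clauses are assumptions about how a
# pg-boss v10 compatible claim might look, not a copy of boss.py.
CLAIM_SQL = """
WITH next AS (
    SELECT id
    FROM pgboss.job
    WHERE name = %s
      AND state < 'active'
      AND start_after <= now()
    ORDER BY priority DESC, created_on, id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE pgboss.job AS j
SET state = 'active', started_on = now()
FROM next
WHERE j.id = next.id
RETURNING j.id, j.data
"""


def claim_one(conn, queue):
    """Claim at most one job from `queue`; return (id, data) or None.

    `conn` is any DB-API connection whose cursor works as a context
    manager (psycopg does). Rows locked by another worker are skipped
    rather than waited on, so concurrent workers never claim the same job.
    """
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, (queue,))
        row = cur.fetchone()
    conn.commit()
    return row
```

Selecting and updating in one statement keeps the claim atomic: the row lock taken by SKIP LOCKED is held until the state flips to active and the transaction commits.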

Phase B — PDF + image OCR

  • lib/jobs/workers/blob.js (Node) — after creating a PDF/image ref, enqueues extract.pdf or extract.image with {ref_id, blob_path}.
  • extract.pdf: pdftotext -layout first; per-page pdftoppm rasterize + Tesseract OCR fallback when extraction < 200 chars.
  • extract.image — Tesseract OCR via pytesseract.image_to_string with English data.
  • repo.update_ref — UPDATE refs + emit audit_log row with actor_kind='worker'.
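
The text-layer-first strategy for extract.pdf could look something like the sketch below. The function names, the 300 DPI setting, and the choice to measure the threshold on whitespace-stripped text are assumptions for illustration; only the tools and the 200-character cutoff come from the list above.

```python
import subprocess
import tempfile
from pathlib import Path

MIN_TEXT_CHARS = 200  # below this, treat the PDF as scanned and OCR it


def pdftotext_layout(pdf_path):
    """Run `pdftotext -layout <pdf> -` and return the text from stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def ocr_pages(pdf_path, dpi=300):
    """Rasterize every page with pdftoppm, then OCR each PNG with Tesseract."""
    import pytesseract  # lazy import: the text-layer path needs no OCR deps

    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)],
            check=True,
        )
        # pdftoppm zero-pads page numbers, so a lexical sort keeps page order.
        pages = sorted(Path(tmp).glob("page-*.png"))
        return "\n".join(
            pytesseract.image_to_string(str(page), lang="eng") for page in pages
        )


def extract_pdf_text(pdf_path, text_layer=pdftotext_layout, ocr=ocr_pages):
    """Prefer the embedded text layer; fall back to OCR when it is too short."""
    text = text_layer(pdf_path)
    if len(text.strip()) >= MIN_TEXT_CHARS:
        return text
    return ocr(pdf_path)
```

Passing the two extractors as parameters is a testing convenience in this sketch: the fallback decision can be exercised without poppler or Tesseract installed.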

Phase C — Whisper + yt-dlp + GPU

  • CT 311 resized from 4 cores / 4 GB to 6 cores / 8 GB.
  • GPU passthrough: /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm, /dev/nvidia-uvm-tools, /dev/nvidia-caps/nvidia-cap1 bind-mounted into CT 311 (shared with CT 102's Ollama).
  • model.py — faster-whisper loader. cuda_available() probes ctranslate2.get_cuda_device_count(); uses CUDA + float16 when present, CPU + int8 otherwise. Model cache at /var/lib/void/whisper-models.
  • ingest.video: yt-dlp -J for metadata, then yt-dlp -x --audio-format opus for audio. faster-whisper transcribes the audio, and the audio file is deleted afterwards. Creates a refs row (kind='video', source_kind='youtube' or 'video') that is idempotent on sha256(space_id + url).
  • lib/api/routes/capture.js (Node) — detects youtube.com / youtu.be / vimeo.com URLs and enqueues ingest.video instead of ingest.url.
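
Two pieces of Phase C are small enough to sketch: the CUDA/CPU selection in model.py and the idempotency key for video refs. The helper names below are hypothetical, and plain string concatenation of space_id and url before hashing is an assumption about how sha256(space_id + url) is computed.

```python
import hashlib


def cuda_device_count():
    """Number of CUDA devices CTranslate2 can see; 0 if it is not installed."""
    try:
        import ctranslate2
    except ImportError:
        return 0
    return ctranslate2.get_cuda_device_count()


def pick_device(device_count):
    """Return (device, compute_type): GPU + float16 when present, else CPU + int8."""
    if device_count > 0:
        return "cuda", "float16"
    return "cpu", "int8"


def video_ref_key(space_id, url):
    """Stable key so re-ingesting the same URL in the same space is a no-op.

    Assumption: the two values are concatenated as strings before hashing.
    """
    return hashlib.sha256(f"{space_id}{url}".encode("utf-8")).hexdigest()
```

The resulting pair would then be handed to the model constructor, for example `WhisperModel("small.en", device=device, compute_type=compute_type, download_root="/var/lib/void/whisper-models")`.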

Phase D — Source-doc sync + alpha-4

  • safe_fetch.py — Python port of lib/ingest/safe_fetch.js (scheme check, IP-range blocklist, redirect re-validation, VOID_INGEST_ALLOW_PRIVATE gate).
  • sync.source_doc: fetches the upstream via safe_fetch and diffs its sha256 against the prior body_sha in metadata; updates body_text only on change.
  • lib/cron/sync_source_docs.js + lib/cron/index.js (Node) — node-cron schedules runSync at 03:00 local time, enqueueing sync.source_doc for every row with sync_source='url'.
  • Version bumped to 2.0.0-alpha.4 in package.json, server.js, and the /health test assertion. CHANGELOG appended.
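
The checks listed for safe_fetch.py might be sketched as below. This is an illustrative reduction, not the port itself: `check_url`, `resolve_redirects`, and the `head` callable are hypothetical names, treating the value "1" as enabling VOID_INGEST_ALLOW_PRIVATE is an assumption, and the blocklist here relies on the standard library's address classification rather than whatever ranges the real module lists.

```python
import ipaddress
import os
import socket
from urllib.parse import urljoin, urlsplit


def _resolve(host):
    """Return the IP addresses for `host`, skipping DNS for IP literals."""
    try:
        return [ipaddress.ip_address(host)]
    except ValueError:
        infos = socket.getaddrinfo(host, None)
        return [ipaddress.ip_address(info[4][0]) for info in infos]


def _is_blocked(ip):
    """True for addresses that should never be fetched from a worker."""
    return (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
    )


def check_url(url, allow_private=None):
    """Validate scheme and destination address; return the URL or raise ValueError."""
    if allow_private is None:
        # Assumption: "1" switches the gate on.
        allow_private = os.environ.get("VOID_INGEST_ALLOW_PRIVATE") == "1"
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    if not allow_private:
        for ip in _resolve(parts.hostname):
            if _is_blocked(ip):
                raise ValueError(f"blocked address: {ip}")
    return url


def resolve_redirects(url, head, max_hops=5, allow_private=None):
    """Follow redirects, re-validating every hop before it is requested.

    `head(url)` is a hypothetical callable that returns the Location target
    of a redirect, or None when the response is final.
    """
    for _ in range(max_hops + 1):
        check_url(url, allow_private=allow_private)
        target = head(url)
        if target is None:
            return url
        url = urljoin(url, target)
    raise ValueError("too many redirects")
```

Re-running the same check on each hop is what stops a public URL from redirecting the worker to an internal address.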

Security findings handled inline

| Finding | Source | Resolution |
| --- | --- | --- |
| yt-dlp argv flag smuggling in video.py | reviewer | _validate_url checks that the scheme is http(s); -- is passed before the positional URL to stop flag parsing. |
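
A minimal sketch of that mitigation, assuming the handler builds argv lists for subprocess rather than shell strings. The function names and the `out_template` parameter are hypothetical; the yt-dlp flags are the ones named in this document plus the standard `-o` output option.

```python
from urllib.parse import urlsplit


def validate_url(url):
    """Reject anything that is not an absolute http(s) URL."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"refusing non-http(s) URL: {url!r}")
    return url


def metadata_argv(url):
    """argv for the metadata probe; `--` ends option parsing before the URL."""
    return ["yt-dlp", "-J", "--", validate_url(url)]


def audio_argv(url, out_template):
    """argv for audio extraction; the URL is always the last, positional item."""
    return [
        "yt-dlp", "-x", "--audio-format", "opus",
        "-o", out_template,
        "--", validate_url(url),
    ]
```

The two defences overlap on purpose: a value such as `--exec=...` has no http(s) scheme and fails validation, and even if validation were loosened later, the `--` separator would still make yt-dlp treat it as a URL rather than a flag.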

UI smoke

Plan 4 ships no SPA changes. The existing Plan 3 Jobs view shows extract.pdf / extract.image / ingest.video jobs alongside Node-side ones — both sides write to the same pgboss.job rows.

Open items for the user

  • alpha-4 deploy. Standing rule per Plans 2/3: won't deploy without your explicit OK. alpha-3 stays live until then.
  • WHISPER_MODEL default is small.en. Bump to medium.en once you've stress-tested transcription quality.
  • yt-dlp cookies for age-gated content — add YT_DLP_COOKIES_FILE env when wanted (small handler tweak).
  • Tesseract languages beyond English — install via tesseract-ocr-<lang> packages on CT 311 and pass lang="..." to image_to_string.

What's left after Plan 4

  • Plan 5 — Companion chat in right rail.
  • Plan 6 — Sacred Valley widgets ported from Void 1.x.