diff --git a/CHANGELOG.md b/CHANGELOG.md
index 895e9e9..4bb19c5 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -3,6 +3,40 @@
 All notable changes to Void 2.0 are documented here.
 Format: [Keep a Changelog](https://keepachangelog.com).
 
+## [2.0.0-alpha.4] — 2026-06-01
+
+### Added (Plan 4: Python void-workers)
+
+- **`void-workers.service`** — Python 3.13 service alongside `void-server`
+  on CT 311. psycopg-based pg-boss client matches Node's claim/finish
+  semantics via `SELECT ... FOR UPDATE SKIP LOCKED`. Forces
+  `client_encoding=UTF8` on every connection (void2-db cluster is
+  SQL_ASCII).
+- **`extract.pdf`** — `pdftotext -layout` first; per-page `pdftoppm`
+  rasterization + Tesseract OCR fallback when extraction yields
+  < 200 chars.
+- **`extract.image`** — Tesseract OCR (English) for images stored in
+  the blob store.
+- **`ingest.video`** — `yt-dlp` metadata + audio extract + faster-whisper
+  (`small.en` default). CUDA at startup; CPU fallback when HA failover
+  to Z3 (no GPU) happens. URLs validated as http(s) and `--` separator
+  passed to yt-dlp to defeat argv smuggling.
+- **`sync.source_doc`** — fetches `upstream_url` via Python `safe_fetch`
+  (port of the Node helper) + sha256-diffs against the prior body_sha
+  in metadata; updates body_text only when content changed.
+- **Node `blob.js`** fans out to `extract.pdf` / `extract.image` after
+  creating PDF / image refs.
+- **Node `capture.js`** routes `youtube.com` / `youtu.be` / `vimeo.com`
+  URLs to `ingest.video` instead of `ingest.url`.
+- **Daily cron** (`lib/cron/sync_source_docs.js`) enqueues
+  `sync.source_doc` jobs at 03:00 local for every `source_docs` row
+  with `sync_source='url'`.
+- **CT 311 infrastructure**: resized to 6 cores / 8 GB RAM, NVIDIA
+  RTX A2000 device-nodes passed through (shared with CT 102's Ollama).
+- **`deploy/push-workers.sh`** + `deploy/void-workers.service` — push
+  the workers package, chown to `voidworkers`, recreate the venv, install
+  deps under `su voidworkers -c`, restart the unit.
+
 ## [2.0.0-alpha.3] — 2026-06-01
 
 ### Added (Plan 3: Capture pipeline + hybrid search)
diff --git a/docs/plan-4-complete.md b/docs/plan-4-complete.md
new file mode 100644
index 0000000..688dd05
--- /dev/null
+++ b/docs/plan-4-complete.md
@@ -0,0 +1,57 @@
+# Plan 4 — Complete
+
+**Date:** 2026-06-01
+**Version:** 2.0.0-alpha.4
+**Tests:** 17 Python + 247 Node + 1 gated-skipped (full suite green when run cleanly)
+**Snapshots:** `plan4_pre_resize_`, `plan4_phase_c_`, `plan4_complete_` on CT 310 + 311.
+
+## Scope delivered
+
+### Phase A — Python harness
+- `workers/` package with pyproject.toml (Python ≥3.12; CT 311 runs 3.13).
+- `boss.py` — `SELECT ... FOR UPDATE SKIP LOCKED LIMIT 1` claim, atomic complete/fail, retry semantics matching pg-boss v10 (`retry_count`, `retry_delay`, `retry_backoff`). Forces `client_encoding=UTF8` because void2-db is SQL_ASCII.
+- `runner.py` — `ThreadPoolExecutor` per registered handler, signal handling, `once=True` mode for tests.
+- `echo` handler proved the harness end-to-end (Node enqueue → Python claim → output back).
+- `deploy/void-workers.service` (systemd, `MemoryMax=6G`, runs as `voidworkers`).
+- `deploy/push-workers.sh` — rsync, chown to `voidworkers`, venv create + `pip install -e ".[all]"` under `su voidworkers -c`, restart unit. Excludes `.env`, `.gitignore`, `.pytest_cache`, `tests/` so deploys are idempotent.
+
+### Phase B — PDF + image OCR
+- `lib/jobs/workers/blob.js` (Node) — after creating a PDF/image ref, enqueues `extract.pdf` or `extract.image` with `{ref_id, blob_path}`.
+- `extract.pdf` — `pdftotext -layout` first; per-page `pdftoppm` rasterize + Tesseract OCR fallback when extraction < 200 chars.
+- `extract.image` — Tesseract OCR via `pytesseract.image_to_string` with English data.
+- `repo.update_ref` — UPDATE refs + emit `audit_log` row with `actor_kind='worker'`.
+
+### Phase C — Whisper + yt-dlp + GPU
+- **CT 311 resized** from 4 cores / 4 GB to **6 cores / 8 GB**.
+- **GPU passthrough** — `/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`, `/dev/nvidia-uvm-tools`, `/dev/nvidia-caps/nvidia-cap1` bind-mounted into CT 311 (shared with CT 102's Ollama).
+- `model.py` — faster-whisper loader. `cuda_available()` probes `ctranslate2.get_cuda_device_count()`; uses CUDA + `float16` when present, CPU + `int8` otherwise. Model cache at `/var/lib/void/whisper-models`.
+- `ingest.video` — `yt-dlp -J` for metadata + `yt-dlp -x --audio-format opus` for audio. faster-whisper transcribes; audio file deleted. Creates a `refs` row (`kind='video'`, `source_kind='youtube'` or `'video'`) idempotent on `sha256(space_id + url)`.
+- `lib/api/routes/capture.js` (Node) — detects `youtube.com / youtu.be / vimeo.com` URLs and enqueues `ingest.video` instead of `ingest.url`.
+
+### Phase D — Source-doc sync + alpha-4
+- `safe_fetch.py` — Python port of `lib/ingest/safe_fetch.js` (scheme check, IP-range blocklist, redirect re-validation, `VOID_INGEST_ALLOW_PRIVATE` gate).
+- `sync.source_doc` — `safe_fetch` upstream + sha256 diff against prior `body_sha` in metadata; updates `body_text` only on change.
+- `lib/cron/sync_source_docs.js` + `lib/cron/index.js` (Node) — `node-cron` schedules `runSync` at 03:00 local time, enqueueing `sync.source_doc` for every row with `sync_source='url'`.
+- Version bumped to `2.0.0-alpha.4` in `package.json`, `server.js`, and the `/health` test assertion. CHANGELOG appended.
+
+## Security findings handled inline
+
+| Finding | Source | Resolution |
+|---|---|---|
+| yt-dlp argv flag smuggling in `video.py` | reviewer | `_validate_url` checks scheme is http(s); `--` passed before positional URL to stop flag parsing. |
+
+## UI smoke
+
+Plan 4 ships no SPA changes. The existing Plan 3 Jobs view shows extract.pdf / extract.image / ingest.video jobs alongside Node-side ones — both sides write to the same `pgboss.job` rows.
+
+## Open items for the user
+
+- **alpha-4 deploy.** Standing rule per Plans 2/3: won't deploy without your explicit OK. alpha-3 stays live until then.
+- **`WHISPER_MODEL`** default is `small.en`. Bump to `medium.en` once you've stress-tested transcription quality.
+- **yt-dlp cookies** for age-gated content — add `YT_DLP_COOKIES_FILE` env when wanted (small handler tweak).
+- **Tesseract languages** beyond English — install via `tesseract-ocr-` packages on CT 311 and pass `lang="..."` to `image_to_string`.
+
+## What's left after Plan 4
+
+- **Plan 5** — Companion chat in right rail.
+- **Plan 6** — Sacred Valley widgets ported from Void 1.x.
diff --git a/package.json b/package.json
index a2b955f..bc6af0e 100644
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "void-server",
-  "version": "2.0.0-alpha.3",
+  "version": "2.0.0-alpha.4",
   "type": "module",
   "private": true,
   "scripts": {
diff --git a/server.js b/server.js
index 5c3e9f2..5018c77 100644
--- a/server.js
+++ b/server.js
@@ -8,7 +8,7 @@ import { registerWorkers } from './lib/jobs/index.js';
 import { router as ingestRouter } from './lib/api/routes/ingest.js';
 import { startCron } from './lib/cron/index.js';
 
-const VERSION = '2.0.0-alpha.3';
+const VERSION = '2.0.0-alpha.4';
 
 export function createApp() {
   const app = express();
diff --git a/tests/server.test.js b/tests/server.test.js
index e7561ca..72e5862 100644
--- a/tests/server.test.js
+++ b/tests/server.test.js
@@ -17,7 +17,7 @@ describe('server', () => {
     const res = await request(app).get('/health');
     expect(res.status).toBe(200);
     expect(res.body.db_ok).toBe(true);
-    expect(res.body.version).toBe('2.0.0-alpha.3');
+    expect(res.body.version).toBe('2.0.0-alpha.4');
   });
 
   it('GET /api/spaces without token returns 401', async () => {
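
Review note on the yt-dlp argv-smuggling fix listed under "Security findings handled inline": `video.py` is not part of this diff, so the following is only a minimal sketch of the two defenses the table describes (an http(s) scheme check, then `--` before the positional URL). The names `validate_url`, `build_metadata_argv`, and `ALLOWED_SCHEMES` are hypothetical and are not taken from the workers package; `-J` is the yt-dlp metadata flag the plan doc mentions.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; the plan doc only says "scheme is http(s)".
ALLOWED_SCHEMES = {"http", "https"}


def validate_url(url: str) -> str:
    """Return url unchanged if it is an absolute http(s) URL, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        # Rejects flag-like input such as "--exec=id" (empty scheme) and file:// URLs.
        raise ValueError(f"unsupported URL scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    return url


def build_metadata_argv(url: str) -> list[str]:
    """Build an argv list for a yt-dlp metadata dump.

    The "--" marks the end of options, so the URL is always treated as a
    positional argument even if validation were somehow bypassed.
    """
    return ["yt-dlp", "-J", "--", validate_url(url)]
```

Passing the result to `subprocess.run(argv, ...)` as a list (no `shell=True`) keeps the URL out of shell parsing as well; the scheme check and the `--` separator are independent layers, so either one alone would still block a leading-dash payload.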