chore: version 2.0.0-alpha.4 + changelog + plan-4 completion doc

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
root
2026-06-01 10:25:31 +10:00
parent 13fac102dd
commit 7707b7eb00
5 changed files with 94 additions and 3 deletions


@@ -3,6 +3,40 @@
All notable changes to Void 2.0 are documented here.
Format: [Keep a Changelog](https://keepachangelog.com).
## [2.0.0-alpha.4] — 2026-06-01
### Added (Plan 4: Python void-workers)
- **`void-workers.service`** — Python 3.13 service alongside `void-server`
on CT 311. psycopg-based pg-boss client matches Node's claim/finish
semantics via `SELECT ... FOR UPDATE SKIP LOCKED`. Forces
`client_encoding=UTF8` on every connection (void2-db cluster is
SQL_ASCII).
- **`extract.pdf`** — `pdftotext -layout` first; per-page `pdftoppm`
rasterization + Tesseract OCR fallback when extraction yields
< 200 chars.
- **`extract.image`** — Tesseract OCR (English) for images stored in
the blob store.
- **`ingest.video`** — `yt-dlp` metadata + audio extract + faster-whisper
(`small.en` default). Uses CUDA when detected at startup; falls back to
CPU after an HA failover to Z3 (no GPU). URLs are validated as http(s),
and a `--` separator is passed to yt-dlp to defeat argv smuggling.
- **`sync.source_doc`** — fetches `upstream_url` via Python `safe_fetch`
(a port of the Node helper), sha256-diffs the result against the prior
`body_sha` in metadata, and updates `body_text` only when the content
changed.
- **Node `blob.js`** fans out to `extract.pdf` / `extract.image` after
creating PDF / image refs.
- **Node `capture.js`** routes `youtube.com` / `youtu.be` / `vimeo.com`
URLs to `ingest.video` instead of `ingest.url`.
- **Daily cron** (`lib/cron/sync_source_docs.js`) enqueues
`sync.source_doc` jobs at 03:00 local for every `source_docs` row
with `sync_source='url'`.
- **CT 311 infrastructure**: resized to 6 cores / 8 GB RAM, NVIDIA
RTX A2000 device-nodes passed through (shared with CT 102's Ollama).
- **`deploy/push-workers.sh`** + `deploy/void-workers.service` — push
the workers package, chown to `voidworkers`, recreate the venv, install
deps under `su voidworkers -c`, restart the unit.
## [2.0.0-alpha.3] — 2026-06-01
### Added (Plan 3: Capture pipeline + hybrid search)

docs/plan-4-complete.md Normal file

@@ -0,0 +1,57 @@
# Plan 4 — Complete
**Date:** 2026-06-01
**Version:** 2.0.0-alpha.4
**Tests:** 17 Python + 247 Node + 1 gated-skipped (full suite green when run cleanly)
**Snapshots:** `plan4_pre_resize_<ts>`, `plan4_phase_c_<ts>`, `plan4_complete_<ts>` on CT 310 + 311.
## Scope delivered
### Phase A — Python harness
- `workers/` package with pyproject.toml (Python ≥3.12; CT 311 runs 3.13).
- `boss.py` — `SELECT ... FOR UPDATE SKIP LOCKED LIMIT 1` claim, atomic complete/fail, retry semantics matching pg-boss v10 (`retry_count`, `retry_delay`, `retry_backoff`). Forces `client_encoding=UTF8` because void2-db is SQL_ASCII.
- `runner.py` — `ThreadPoolExecutor` per registered handler, signal handling, `once=True` mode for tests.
- `echo` handler proved the harness end-to-end (Node enqueue → Python claim → output back).
- `deploy/void-workers.service` (systemd, `MemoryMax=6G`, runs as `voidworkers`).
- `deploy/push-workers.sh` — rsync, chown to `voidworkers`, venv create + `pip install -e ".[all]"` under `su voidworkers -c`, restart unit. Excludes `.env`, `.gitignore`, `.pytest_cache`, `tests/` so deploys are idempotent.
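
The claim step described for `boss.py` could look roughly like the sketch below. It is not the project's code: the column names, the state values, and the `claim_one` helper are assumptions based on a reading of the pg-boss v10 schema, and `conn` is assumed to behave like a psycopg 3 connection.

```python
from typing import Any, Optional

# Column and state names below follow a reading of the pg-boss v10 schema and
# are assumptions; verify them against the real pgboss.job table.
CLAIM_SQL = """
WITH next AS (
    SELECT id
    FROM pgboss.job
    WHERE name = %s
      AND state IN ('created', 'retry')
      AND start_after <= now()
    ORDER BY priority DESC, created_on, id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE pgboss.job AS j
SET state = 'active', started_on = now()
FROM next
WHERE j.id = next.id
RETURNING j.id, j.data
"""


def claim_one(conn: Any, queue: str) -> Optional[tuple]:
    """Claim at most one runnable job from `queue`; return (id, data) or None.

    `conn` is expected to behave like a psycopg 3 connection: execute()
    returns a cursor, and commit() ends the transaction.
    """
    row = conn.execute(CLAIM_SQL, (queue,)).fetchone()
    conn.commit()  # release the row lock; the job is now marked active
    return row
```

Because locked rows are skipped rather than waited on, two workers polling the same queue should each get a different job or nothing, which is what lets the Python and Node sides share `pgboss.job`.
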
### Phase B — PDF + image OCR
- `lib/jobs/workers/blob.js` (Node) — after creating a PDF/image ref, enqueues `extract.pdf` or `extract.image` with `{ref_id, blob_path}`.
- `extract.pdf` — `pdftotext -layout` first; per-page `pdftoppm` rasterize + Tesseract OCR fallback when extraction < 200 chars.
- `extract.image` — Tesseract OCR via `pytesseract.image_to_string` with English data.
- `repo.update_ref` — UPDATE refs + emit `audit_log` row with `actor_kind='worker'`.
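
As an illustration of the `extract.pdf` fallback rule, here is a minimal sketch. The function names are hypothetical, the `ocr` callable stands in for the per-page `pdftoppm` + Tesseract path, and measuring the 200-character threshold on whitespace-stripped text is an assumption.

```python
import subprocess
from typing import Callable, Optional

MIN_CHARS = 200  # threshold from the doc; applying it to stripped text is an assumption


def pdftotext_layout(pdf_path: str) -> str:
    """Run `pdftotext -layout <pdf> -` and return the text written to stdout."""
    out = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return out.stdout.decode("utf-8", errors="replace")


def extract_pdf_text(
    pdf_path: str,
    text_layer: Callable[[str], str] = pdftotext_layout,
    ocr: Optional[Callable[[str], str]] = None,
    min_chars: int = MIN_CHARS,
) -> tuple[str, str]:
    """Return (text, method), where method is "pdftotext" or "ocr".

    The embedded text layer is tried first; OCR runs only when that layer
    yields fewer than `min_chars` non-whitespace-trimmed characters.
    """
    text = text_layer(pdf_path)
    if len(text.strip()) >= min_chars or ocr is None:
        return text, "pdftotext"
    return ocr(pdf_path), "ocr"
```

Passing the two extractors in as callables keeps the decision logic testable without Poppler or Tesseract installed.
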
### Phase C — Whisper + yt-dlp + GPU
- **CT 311 resized** from 4 cores / 4 GB to **6 cores / 8 GB**.
- **GPU passthrough** — `/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`, `/dev/nvidia-uvm-tools`, `/dev/nvidia-caps/nvidia-cap1` bind-mounted into CT 311 (shared with CT 102's Ollama).
- `model.py` — faster-whisper loader. `cuda_available()` probes `ctranslate2.get_cuda_device_count()`; uses CUDA + `float16` when present, CPU + `int8` otherwise. Model cache at `/var/lib/void/whisper-models`.
- `ingest.video` — `yt-dlp -J` for metadata + `yt-dlp -x --audio-format opus` for audio. faster-whisper transcribes; audio file deleted. Creates a `refs` row (`kind='video'`, `source_kind='youtube'` or `'video'`) idempotent on `sha256(space_id + url)`.
- `lib/api/routes/capture.js` (Node) — detects `youtube.com / youtu.be / vimeo.com` URLs and enqueues `ingest.video` instead of `ingest.url`.
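
The device selection and idempotency key from Phase C can be sketched as follows. The helper names are hypothetical, and treating `space_id` as a string concatenated directly with the URL and encoded as UTF-8 is an assumption about how `sha256(space_id + url)` is computed.

```python
import hashlib
from typing import Callable


def probe_cuda_device_count() -> int:
    """Return the CUDA device count reported by ctranslate2, or 0 if unavailable."""
    try:
        import ctranslate2  # imported lazily so CPU-only hosts without it still work
        return ctranslate2.get_cuda_device_count()
    except Exception:
        return 0


def choose_device(probe: Callable[[], int] = probe_cuda_device_count) -> tuple[str, str]:
    """Pick (device, compute_type): CUDA + float16 when a GPU is visible, else CPU + int8."""
    return ("cuda", "float16") if probe() > 0 else ("cpu", "int8")


def video_ref_key(space_id: str, url: str) -> str:
    """Hex sha256 over space_id + url, used to make ref creation idempotent."""
    return hashlib.sha256((space_id + url).encode("utf-8")).hexdigest()
```

The resulting pair would typically be passed to faster-whisper as `WhisperModel(name, device=device, compute_type=compute_type, download_root="/var/lib/void/whisper-models")`.
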
### Phase D — Source-doc sync + alpha-4
- `safe_fetch.py` — Python port of `lib/ingest/safe_fetch.js` (scheme check, IP-range blocklist, redirect re-validation, `VOID_INGEST_ALLOW_PRIVATE` gate).
- `sync.source_doc` — `safe_fetch` upstream + sha256 diff against prior `body_sha` in metadata; updates `body_text` only on change.
- `lib/cron/sync_source_docs.js` + `lib/cron/index.js` (Node) — `node-cron` schedules `runSync` at 03:00 local time, enqueueing `sync.source_doc` for every row with `sync_source='url'`.
- Version bumped to `2.0.0-alpha.4` in `package.json`, `server.js`, and the `/health` test assertion. CHANGELOG appended.
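
Two pieces of Phase D lend themselves to a short sketch: the sha256 change check and an IP-range gate of the kind `safe_fetch` applies. Both functions are hypothetical, and the exact set of blocked ranges in the real helper is not stated here, so the categories below are assumptions.

```python
import hashlib
import ipaddress
from typing import Optional


def diff_body(prior_sha: Optional[str], fetched: bytes) -> tuple[bool, str]:
    """Return (changed, new_sha) for a freshly fetched upstream body.

    A missing prior hash counts as a change, so the first sync always writes.
    """
    new_sha = hashlib.sha256(fetched).hexdigest()
    return new_sha != prior_sha, new_sha


def is_blocked_ip(host_ip: str, allow_private: bool = False) -> bool:
    """Return True when a resolved address should not be fetched.

    `allow_private` mirrors the idea of the VOID_INGEST_ALLOW_PRIVATE gate;
    the blocked categories are an assumed, conservative set.
    """
    if allow_private:
        return False
    ip = ipaddress.ip_address(host_ip)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```

A handler would call `diff_body` with the stored `body_sha` and skip the `body_text` update whenever `changed` is false. Redirect re-validation means running the address check again on every hop, not only on the first URL.
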
## Security findings handled inline
| Finding | Source | Resolution |
|---|---|---|
| yt-dlp argv flag smuggling in `video.py` | reviewer | `_validate_url` checks scheme is http(s); `--` passed before positional URL to stop flag parsing. |
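
The resolution in the table can be sketched as below. This is an illustrative reimplementation, not the project's `_validate_url`; the function names and the extra non-empty host check are assumptions.

```python
from urllib.parse import urlsplit


def validate_url(url: str) -> str:
    """Accept only http(s) URLs with a host; raise ValueError otherwise."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"unsupported URL: {url!r}")
    return url


def metadata_argv(url: str) -> list[str]:
    """Build the argv for a `yt-dlp -J` metadata call.

    The `--` marks the end of options, so a value starting with `-` can
    never be parsed as a flag even if validation were bypassed.
    """
    return ["yt-dlp", "-J", "--", validate_url(url)]
```

Passing the list straight to `subprocess.run` without `shell=True` keeps the URL as a single argument, so the two checks together cover both flag injection and shell interpretation.
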
## UI smoke
Plan 4 ships no SPA changes. The existing Plan 3 Jobs view shows extract.pdf / extract.image / ingest.video jobs alongside Node-side ones — both sides write to the same `pgboss.job` rows.
## Open items for the user
- **alpha-4 deploy.** Standing rule per Plans 2/3: won't deploy without your explicit OK. alpha-3 stays live until then.
- **`WHISPER_MODEL`** default is `small.en`. Bump to `medium.en` once you've stress-tested transcription quality.
- **yt-dlp cookies** for age-gated content — add `YT_DLP_COOKIES_FILE` env when wanted (small handler tweak).
- **Tesseract languages** beyond English — install via `tesseract-ocr-<lang>` packages on CT 311 and pass `lang="..."` to `image_to_string`.
## What's left after Plan 4
- **Plan 5** — Companion chat in right rail.
- **Plan 6** — Sacred Valley widgets ported from Void 1.x.


@@ -1,6 +1,6 @@
{
"name": "void-server",
-"version": "2.0.0-alpha.3",
+"version": "2.0.0-alpha.4",
"type": "module",
"private": true,
"scripts": {


@@ -8,7 +8,7 @@ import { registerWorkers } from './lib/jobs/index.js';
import { router as ingestRouter } from './lib/api/routes/ingest.js';
import { startCron } from './lib/cron/index.js';
-const VERSION = '2.0.0-alpha.3';
+const VERSION = '2.0.0-alpha.4';
export function createApp() {
const app = express();


@@ -17,7 +17,7 @@ describe('server', () => {
const res = await request(app).get('/health');
expect(res.status).toBe(200);
expect(res.body.db_ok).toBe(true);
-expect(res.body.version).toBe('2.0.0-alpha.3');
+expect(res.body.version).toBe('2.0.0-alpha.4');
});
it('GET /api/spaces without token returns 401', async () => {