chore: version 2.0.0-alpha.4 + changelog + plan-4 completion doc
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
34
CHANGELOG.md
@@ -3,6 +3,40 @@
All notable changes to Void 2.0 are documented here.
Format: [Keep a Changelog](https://keepachangelog.com).

## [2.0.0-alpha.4] — 2026-06-01

### Added (Plan 4: Python void-workers)

- **`void-workers.service`** — Python 3.13 service alongside `void-server`
  on CT 311. psycopg-based pg-boss client matches Node's claim/finish
  semantics via `SELECT ... FOR UPDATE SKIP LOCKED`. Forces
  `client_encoding=UTF8` on every connection (void2-db cluster is
  SQL_ASCII).
- **`extract.pdf`** — `pdftotext -layout` first; per-page `pdftoppm`
  rasterization + Tesseract OCR fallback when extraction yields
  < 200 chars.
- **`extract.image`** — Tesseract OCR (English) for images stored in
  the blob store.
- **`ingest.video`** — `yt-dlp` metadata + audio extract + faster-whisper
  (`small.en` default). CUDA at startup; CPU fallback when HA failover
  to Z3 (no GPU) happens. URLs validated as http(s) and `--` separator
  passed to yt-dlp to defeat argv smuggling.
- **`sync.source_doc`** — fetches `upstream_url` via Python `safe_fetch`
  (port of the Node helper) + sha256-diffs against the prior body_sha
  in metadata; updates body_text only when content changed.
- **Node `blob.js`** fans out to `extract.pdf` / `extract.image` after
  creating PDF / image refs.
- **Node `capture.js`** routes `youtube.com` / `youtu.be` / `vimeo.com`
  URLs to `ingest.video` instead of `ingest.url`.
- **Daily cron** (`lib/cron/sync_source_docs.js`) enqueues
  `sync.source_doc` jobs at 03:00 local for every `source_docs` row
  with `sync_source='url'`.
- **CT 311 infrastructure**: resized to 6 cores / 8 GB RAM, NVIDIA
  RTX A2000 device-nodes passed through (shared with CT 102's Ollama).
- **`deploy/push-workers.sh`** + `deploy/void-workers.service` — push
  the workers package, chown to `voidworkers`, recreate the venv, install
  deps under `su voidworkers -c`, restart the unit.

## [2.0.0-alpha.3] — 2026-06-01

### Added (Plan 3: Capture pipeline + hybrid search)

57
docs/plan-4-complete.md
Normal file
@@ -0,0 +1,57 @@
# Plan 4 — Complete

**Date:** 2026-06-01
**Version:** 2.0.0-alpha.4
**Tests:** 17 Python + 247 Node + 1 gated-skipped (full suite green when run cleanly)
**Snapshots:** `plan4_pre_resize_<ts>`, `plan4_phase_c_<ts>`, `plan4_complete_<ts>` on CT 310 + 311.

## Scope delivered

### Phase A — Python harness

- `workers/` package with pyproject.toml (Python ≥3.12; CT 311 runs 3.13).
- `boss.py` — `SELECT ... FOR UPDATE SKIP LOCKED LIMIT 1` claim, atomic complete/fail, retry semantics matching pg-boss v10 (`retry_count`, `retry_delay`, `retry_backoff`). Forces `client_encoding=UTF8` because void2-db is SQL_ASCII.
- `runner.py` — `ThreadPoolExecutor` per registered handler, signal handling, `once=True` mode for tests.
- `echo` handler proved the harness end-to-end (Node enqueue → Python claim → output back).
- `deploy/void-workers.service` (systemd, `MemoryMax=6G`, runs as `voidworkers`).
- `deploy/push-workers.sh` — rsync, chown to `voidworkers`, venv create + `pip install -e ".[all]"` under `su voidworkers -c`, restart unit. Excludes `.env`, `.gitignore`, `.pytest_cache`, `tests/` so deploys are idempotent.
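
The claim and retry behaviour listed for `boss.py` can be sketched roughly as below. This is an illustration under stated assumptions, not the shipped code: `CLAIM_SQL`, `retry_delay_seconds` and `state_after_failure` are hypothetical names, the column names and state values are assumed from the pg-boss job table, the `%s` placeholder assumes psycopg-style parameters, and the doubling rule is a stand-in for whatever backoff pg-boss v10 actually applies.

```python
# Sketch only: names, columns and the backoff rule are assumptions,
# not taken from the real boss.py.

# Claim at most one queued job. SKIP LOCKED lets concurrent workers pass
# over rows that another transaction has already locked instead of waiting.
CLAIM_SQL = """
WITH next AS (
    SELECT id
    FROM pgboss.job
    WHERE name = %s
      AND state IN ('created', 'retry')
      AND start_after <= now()
    ORDER BY created_on
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE pgboss.job AS j
SET state = 'active', started_on = now()
FROM next
WHERE j.id = next.id
RETURNING j.id, j.data
"""


def retry_delay_seconds(retry_count: int, retry_delay: int, retry_backoff: bool) -> int:
    """Seconds to wait before the next attempt.

    Without backoff the delay is constant. With backoff this sketch doubles
    the delay per completed retry, which is an assumed rule.
    """
    if retry_count < 0 or retry_delay < 0:
        raise ValueError("retry_count and retry_delay must be non-negative")
    if not retry_backoff:
        return retry_delay
    return retry_delay * 2 ** retry_count


def state_after_failure(retry_count: int, retry_limit: int) -> str:
    """A failed job goes back to 'retry' until its retry budget is spent."""
    return "retry" if retry_count < retry_limit else "failed"
```

Keeping the delay and state rules as pure functions means they can be unit-tested without a database, while the SQL itself is exercised only against a real Postgres instance.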

### Phase B — PDF + image OCR

- `lib/jobs/workers/blob.js` (Node) — after creating a PDF/image ref, enqueues `extract.pdf` or `extract.image` with `{ref_id, blob_path}`.
- `extract.pdf` — `pdftotext -layout` first; per-page `pdftoppm` rasterize + Tesseract OCR fallback when extraction < 200 chars.
- `extract.image` — Tesseract OCR via `pytesseract.image_to_string` with English data.
- `repo.update_ref` — UPDATE refs + emit `audit_log` row with `actor_kind='worker'`.
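
The text-layer-first strategy of `extract.pdf` might look something like the sketch below. The function names are hypothetical, the 300 dpi resolution and the use of the `tesseract` command line (rather than `pytesseract`) are assumptions made to keep the example dependency-free, and whether the 200-character threshold applies to stripped text is also a guess.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable

MIN_CHARS = 200  # below this, the PDF is treated as scanned


def pdftotext_layout(pdf_path: str) -> str:
    """Read the embedded text layer; '-' sends pdftotext output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_pages(pdf_path: str) -> str:
    """Rasterize each page with pdftoppm, then OCR the images in page order."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        subprocess.run(
            ["pdftoppm", "-r", "300", "-png", pdf_path, prefix], check=True
        )
        pages = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            out = subprocess.run(
                ["tesseract", str(image), "stdout", "-l", "eng"],
                capture_output=True, text=True, check=True,
            )
            pages.append(out.stdout)
    return "\n".join(pages)


def extract_pdf_text(
    pdf_path: str,
    text_layer: Callable[[str], str] = pdftotext_layout,
    ocr: Callable[[str], str] = ocr_pages,
) -> tuple[str, str]:
    """Return (text, method), falling back to OCR for near-empty text layers."""
    text = text_layer(pdf_path)
    if len(text.strip()) >= MIN_CHARS:
        return text, "pdftotext"
    return ocr(pdf_path), "ocr"
```

Passing the two extractors in as callables lets a test substitute fakes, so the fallback decision can be checked on a machine without Poppler or Tesseract installed.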

### Phase C — Whisper + yt-dlp + GPU

- **CT 311 resized** from 4 cores / 4 GB to **6 cores / 8 GB**.
- **GPU passthrough** — `/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`, `/dev/nvidia-uvm-tools`, `/dev/nvidia-caps/nvidia-cap1` bind-mounted into CT 311 (shared with CT 102's Ollama).
- `model.py` — faster-whisper loader. `cuda_available()` probes `ctranslate2.get_cuda_device_count()`; uses CUDA + `float16` when present, CPU + `int8` otherwise. Model cache at `/var/lib/void/whisper-models`.
- `ingest.video` — `yt-dlp -J` for metadata + `yt-dlp -x --audio-format opus` for audio. faster-whisper transcribes; audio file deleted. Creates a `refs` row (`kind='video'`, `source_kind='youtube'` or `'video'`) idempotent on `sha256(space_id + url)`.
- `lib/api/routes/capture.js` (Node) — detects `youtube.com / youtu.be / vimeo.com` URLs and enqueues `ingest.video` instead of `ingest.url`.
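
As a rough illustration of the device probe in `model.py` and the idempotency key used by `ingest.video`, consider the sketch below. The helper names (`cuda_device_count`, `pick_device`, `video_ref_key`) are hypothetical, treating a missing `ctranslate2` install as "no GPU" is an assumption, and so is the plain string concatenation and UTF-8 encoding behind the hash.

```python
import hashlib
from typing import Callable


def cuda_device_count() -> int:
    """Number of CUDA devices CTranslate2 can see, or 0 if it is not installed."""
    try:
        import ctranslate2  # imported lazily so CPU-only hosts still start
    except ImportError:
        return 0
    return ctranslate2.get_cuda_device_count()


def pick_device(probe: Callable[[], int] = cuda_device_count) -> tuple[str, str]:
    """Return (device, compute_type): GPU with float16, otherwise CPU with int8."""
    if probe() > 0:
        return "cuda", "float16"
    return "cpu", "int8"


def video_ref_key(space_id: str, url: str) -> str:
    """Stable key for a (space, URL) pair, so re-ingesting does not duplicate the ref."""
    return hashlib.sha256((space_id + url).encode("utf-8")).hexdigest()
```

The resulting pair could then be handed to `faster_whisper.WhisperModel("small.en", device=device, compute_type=compute_type, download_root="/var/lib/void/whisper-models")`. Probing once per model load rather than once per process is what would let a failover host without a GPU fall back to CPU.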

### Phase D — Source-doc sync + alpha-4

- `safe_fetch.py` — Python port of `lib/ingest/safe_fetch.js` (scheme check, IP-range blocklist, redirect re-validation, `VOID_INGEST_ALLOW_PRIVATE` gate).
- `sync.source_doc` — `safe_fetch` upstream + sha256 diff against prior `body_sha` in metadata; updates `body_text` only on change.
- `lib/cron/sync_source_docs.js` + `lib/cron/index.js` (Node) — `node-cron` schedules `runSync` at 03:00 local time, enqueueing `sync.source_doc` for every row with `sync_source='url'`.
- Version bumped to `2.0.0-alpha.4` in `package.json`, `server.js`, and the `/health` test assertion. CHANGELOG appended.
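
Two of the Phase D ideas, the IP-range blocklist and the hash-based change check, can be sketched with the standard library alone. Everything here is illustrative: `is_blocked_address` and `plan_sync` are hypothetical names, the exact set of blocked ranges in the real `safe_fetch.py` is not known, and the `allow_private` parameter merely stands in for the `VOID_INGEST_ALLOW_PRIVATE` gate.

```python
import hashlib
import ipaddress
from typing import Optional


def is_blocked_address(address: str, allow_private: bool = False) -> bool:
    """True when a resolved IP should not be fetched (SSRF guard).

    Raises ValueError if the string is not an IP address at all.
    """
    ip = ipaddress.ip_address(address)
    if allow_private:
        return False
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def plan_sync(metadata: dict, fetched_body: str) -> Optional[dict]:
    """Return the fields to update, or None when the upstream body is unchanged."""
    new_sha = hashlib.sha256(fetched_body.encode("utf-8")).hexdigest()
    if metadata.get("body_sha") == new_sha:
        return None  # same content: leave body_text untouched
    return {
        "body_text": fetched_body,
        "metadata": {**metadata, "body_sha": new_sha},
    }
```

For redirect re-validation, a check like `is_blocked_address` would need to run again on the resolved address of every redirect hop, not only on the first URL. Returning `None` from `plan_sync` gives the caller a cheap way to skip the database write entirely.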

## Security findings handled inline

| Finding | Source | Resolution |
|---|---|---|
| yt-dlp argv flag smuggling in `video.py` | reviewer | `_validate_url` checks scheme is http(s); `--` passed before positional URL to stop flag parsing. |
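
The resolution in the table combines two independent guards, which a minimal sketch makes concrete. The names `validate_url` and `metadata_argv` are hypothetical (the real helper is `_validate_url` in `video.py`, whose body is not shown here), and requiring a non-empty host is an extra assumption.

```python
from urllib.parse import urlsplit


def validate_url(url: str) -> str:
    """Accept only absolute http(s) URLs; anything else raises ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"unsupported URL: {url!r}")
    return url


def metadata_argv(url: str) -> list[str]:
    """Build the yt-dlp metadata command as an argv list (never a shell string).

    The '--' marker ends option parsing, so even a value that starts with '-'
    would be treated as a positional URL rather than as a flag.
    """
    return ["yt-dlp", "-J", "--", validate_url(url)]
```

Either guard alone would stop a payload such as `--exec=...` from being read as an option: the scheme check rejects it before the command is built, and the `--` separator covers any value that slips past validation.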

## UI smoke

Plan 4 ships no SPA changes. The existing Plan 3 Jobs view shows extract.pdf / extract.image / ingest.video jobs alongside Node-side ones — both sides write to the same `pgboss.job` rows.

## Open items for the user

- **alpha-4 deploy.** Standing rule per Plans 2/3: won't deploy without your explicit OK. alpha-3 stays live until then.
- **`WHISPER_MODEL`** default is `small.en`. Bump to `medium.en` once you've stress-tested transcription quality.
- **yt-dlp cookies** for age-gated content — add `YT_DLP_COOKIES_FILE` env when wanted (small handler tweak).
- **Tesseract languages** beyond English — install via `tesseract-ocr-<lang>` packages on CT 311 and pass `lang="..."` to `image_to_string`.
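
The cookies item could be a small change along the following lines. This is a sketch of one possible shape, not the planned implementation: `audio_argv` and its `out_template` parameter are hypothetical, and the URL is assumed to have been validated already.

```python
import os
from typing import Mapping


def audio_argv(
    url: str,
    out_template: str,
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Build the yt-dlp audio-extraction command, adding cookies when configured."""
    argv = ["yt-dlp", "-x", "--audio-format", "opus", "-o", out_template]
    cookies_file = env.get("YT_DLP_COOKIES_FILE")
    if cookies_file:
        # --cookies points yt-dlp at a Netscape-format cookie file.
        argv += ["--cookies", cookies_file]
    # '--' stays last before the URL so the URL can never be parsed as a flag.
    argv += ["--", url]
    return argv
```

Reading the variable at call time rather than at import time means the unit only needs a restart, not a code change, once the cookie file is in place.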

## What's left after Plan 4

- **Plan 5** — Companion chat in right rail.
- **Plan 6** — Sacred Valley widgets ported from Void 1.x.

@@ -1,6 +1,6 @@
 {
   "name": "void-server",
-  "version": "2.0.0-alpha.3",
+  "version": "2.0.0-alpha.4",
   "type": "module",
   "private": true,
   "scripts": {

@@ -8,7 +8,7 @@ import { registerWorkers } from './lib/jobs/index.js';
 import { router as ingestRouter } from './lib/api/routes/ingest.js';
 import { startCron } from './lib/cron/index.js';

-const VERSION = '2.0.0-alpha.3';
+const VERSION = '2.0.0-alpha.4';

 export function createApp() {
   const app = express();

@@ -17,7 +17,7 @@ describe('server', () => {
     const res = await request(app).get('/health');
     expect(res.status).toBe(200);
     expect(res.body.db_ok).toBe(true);
-    expect(res.body.version).toBe('2.0.0-alpha.3');
+    expect(res.body.version).toBe('2.0.0-alpha.4');
   });

   it('GET /api/spaces without token returns 401', async () => {