chore: version 2.0.0-alpha.4 + changelog + plan-4 completion doc
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
34
CHANGELOG.md
@@ -3,6 +3,40 @@
 All notable changes to Void 2.0 are documented here.
 Format: [Keep a Changelog](https://keepachangelog.com).
 
+## [2.0.0-alpha.4] — 2026-06-01
+
+### Added (Plan 4: Python void-workers)
+
+- **`void-workers.service`** — Python 3.13 service alongside `void-server`
+  on CT 311. psycopg-based pg-boss client matches Node's claim/finish
+  semantics via `SELECT ... FOR UPDATE SKIP LOCKED`. Forces
+  `client_encoding=UTF8` on every connection (void2-db cluster is
+  SQL_ASCII).
+- **`extract.pdf`** — `pdftotext -layout` first; per-page `pdftoppm`
+  rasterization + Tesseract OCR fallback when extraction yields
+  < 200 chars.
+- **`extract.image`** — Tesseract OCR (English) for images stored in
+  the blob store.
+- **`ingest.video`** — `yt-dlp` metadata + audio extract + faster-whisper
+  (`small.en` default). CUDA at startup; CPU fallback when HA failover
+  to Z3 (no GPU) happens. URLs validated as http(s) and `--` separator
+  passed to yt-dlp to defeat argv smuggling.
+- **`sync.source_doc`** — fetches `upstream_url` via Python `safe_fetch`
+  (port of the Node helper) + sha256-diffs against the prior body_sha
+  in metadata; updates body_text only when content changed.
+- **Node `blob.js`** fans out to `extract.pdf` / `extract.image` after
+  creating PDF / image refs.
+- **Node `capture.js`** routes `youtube.com` / `youtu.be` / `vimeo.com`
+  URLs to `ingest.video` instead of `ingest.url`.
+- **Daily cron** (`lib/cron/sync_source_docs.js`) enqueues
+  `sync.source_doc` jobs at 03:00 local for every `source_docs` row
+  with `sync_source='url'`.
+- **CT 311 infrastructure**: resized to 6 cores / 8 GB RAM, NVIDIA
+  RTX A2000 device-nodes passed through (shared with CT 102's Ollama).
+- **`deploy/push-workers.sh`** + `deploy/void-workers.service` — push
+  the workers package, chown to `voidworkers`, recreate the venv, install
+  deps under `su voidworkers -c`, restart the unit.
+
 ## [2.0.0-alpha.3] — 2026-06-01
 
 ### Added (Plan 3: Capture pipeline + hybrid search)
57
docs/plan-4-complete.md
Normal file
@@ -0,0 +1,57 @@
# Plan 4 — Complete

**Date:** 2026-06-01
**Version:** 2.0.0-alpha.4
**Tests:** 17 Python + 247 Node + 1 gated-skipped (full suite green when run cleanly)
**Snapshots:** `plan4_pre_resize_<ts>`, `plan4_phase_c_<ts>`, `plan4_complete_<ts>` on CT 310 + 311.

## Scope delivered

### Phase A — Python harness
- `workers/` package with pyproject.toml (Python ≥3.12; CT 311 runs 3.13).
- `boss.py` — `SELECT ... FOR UPDATE SKIP LOCKED LIMIT 1` claim, atomic complete/fail, retry semantics matching pg-boss v10 (`retry_count`, `retry_delay`, `retry_backoff`). Forces `client_encoding=UTF8` because void2-db is SQL_ASCII.
- `runner.py` — `ThreadPoolExecutor` per registered handler, signal handling, `once=True` mode for tests.
- `echo` handler proved the harness end-to-end (Node enqueue → Python claim → output back).
- `deploy/void-workers.service` (systemd, `MemoryMax=6G`, runs as `voidworkers`).
- `deploy/push-workers.sh` — rsync, chown to `voidworkers`, venv create + `pip install -e ".[all]"` under `su voidworkers -c`, restart unit. Excludes `.env`, `.gitignore`, `.pytest_cache`, `tests/` so deploys are idempotent.
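
A minimal sketch of the claim step described for `boss.py`, not the shipped code. The column names (`state`, `start_after`, `created_on`, `started_on`, `data`), the state values, and the helper name `claim_one` are assumptions for illustration; only `pgboss.job` and the `FOR UPDATE SKIP LOCKED` claim come from the notes above.

```python
from typing import Any, Optional

# Assumed pgboss.job columns and state values; adjust to the real schema.
CLAIM_SQL = """
WITH next AS (
    SELECT id
      FROM pgboss.job
     WHERE name = %s
       AND state IN ('created', 'retry')
       AND start_after <= now()
     ORDER BY created_on
     LIMIT 1
       FOR UPDATE SKIP LOCKED
)
UPDATE pgboss.job AS j
   SET state = 'active', started_on = now()
  FROM next
 WHERE j.id = next.id
RETURNING j.id, j.data
"""


def claim_one(conn: Any, queue: str) -> Optional[tuple]:
    """Claim at most one job from `queue`; return (id, data) or None.

    `conn` is any DB-API style connection (psycopg in the real service).
    Rows locked by another worker are skipped instead of waited on, so two
    workers polling the same queue should not claim the same job.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (queue,))
        row = cur.fetchone()
    finally:
        cur.close()
    conn.commit()  # make the 'active' transition visible to other workers
    return row
```

Selecting and updating in one statement keeps the claim atomic: the row lock taken by the inner `SELECT` is held until the commit, by which point the job is already marked active.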

### Phase B — PDF + image OCR
- `lib/jobs/workers/blob.js` (Node) — after creating a PDF/image ref, enqueues `extract.pdf` or `extract.image` with `{ref_id, blob_path}`.
- `extract.pdf` — `pdftotext -layout` first; per-page `pdftoppm` rasterize + Tesseract OCR fallback when extraction < 200 chars.
- `extract.image` — Tesseract OCR via `pytesseract.image_to_string` with English data.
- `repo.update_ref` — UPDATE refs + emit `audit_log` row with `actor_kind='worker'`.
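
The `extract.pdf` fallback can be sketched roughly as below. This is an illustration under assumptions, not the handler itself: the function names, the 300 dpi setting, and measuring the threshold on whitespace-stripped text are all hypothetical choices. The 200-character cutoff and the tools (`pdftotext -layout`, `pdftoppm`, `pytesseract.image_to_string`) are the ones named above.

```python
import glob
import os
import subprocess
import tempfile

MIN_TEXT_CHARS = 200  # fallback threshold quoted in the notes


def pdftotext_layout(pdf_path: str) -> str:
    """Return the PDF's embedded text via `pdftotext -layout` ('-' = stdout)."""
    done = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return done.stdout


def ocr_pages(pdf_path: str, dpi: int = 300) -> str:
    """Rasterize each page with pdftoppm, then OCR the PNGs with Tesseract."""
    import pytesseract  # lazy imports: the text-layer path works without them
    from PIL import Image

    chunks = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "page")
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix], check=True
        )
        # pdftoppm zero-pads page numbers, so a lexical sort keeps page order.
        for png in sorted(glob.glob(prefix + "-*.png")):
            with Image.open(png) as img:
                chunks.append(pytesseract.image_to_string(img, lang="eng"))
    return "\n".join(chunks)


def extract_pdf_text(pdf_path, text_layer=pdftotext_layout, ocr=ocr_pages):
    """Try the text layer first; fall back to OCR when it yields too little."""
    text = text_layer(pdf_path)
    if len(text.strip()) >= MIN_TEXT_CHARS:
        return text, "pdftotext"
    return ocr(pdf_path), "ocr"
```

Passing the two stages in as callables keeps the threshold decision testable without Poppler or Tesseract installed.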

### Phase C — Whisper + yt-dlp + GPU
- **CT 311 resized** from 4 cores / 4 GB to **6 cores / 8 GB**.
- **GPU passthrough** — `/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`, `/dev/nvidia-uvm-tools`, `/dev/nvidia-caps/nvidia-cap1` bind-mounted into CT 311 (shared with CT 102's Ollama).
- `model.py` — faster-whisper loader. `cuda_available()` probes `ctranslate2.get_cuda_device_count()`; uses CUDA + `float16` when present, CPU + `int8` otherwise. Model cache at `/var/lib/void/whisper-models`.
- `ingest.video` — `yt-dlp -J` for metadata + `yt-dlp -x --audio-format opus` for audio. faster-whisper transcribes; audio file deleted. Creates a `refs` row (`kind='video'`, `source_kind='youtube'` or `'video'`) idempotent on `sha256(space_id + url)`.
- `lib/api/routes/capture.js` (Node) — detects `youtube.com / youtu.be / vimeo.com` URLs and enqueues `ingest.video` instead of `ingest.url`.
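
Two of the Phase C decisions are small enough to sketch: the device/compute-type choice in `model.py` and the idempotency key for `ingest.video`. The helper names `pick_device` and `video_ref_key` are hypothetical, and plain string concatenation of `space_id` and `url` before hashing is an assumption about how `sha256(space_id + url)` is computed.

```python
import hashlib


def cuda_device_count() -> int:
    """Number of CUDA devices CTranslate2 can see; 0 if it is not installed."""
    try:
        import ctranslate2
        return ctranslate2.get_cuda_device_count()
    except Exception:
        return 0


def pick_device(device_count: int) -> tuple:
    """CUDA + float16 when a GPU is present, CPU + int8 otherwise."""
    if device_count > 0:
        return ("cuda", "float16")
    return ("cpu", "int8")


def video_ref_key(space_id, url: str) -> str:
    """Stable key so re-ingesting the same URL in a space reuses one row."""
    raw = f"{space_id}{url}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


if __name__ == "__main__":
    device, compute_type = pick_device(cuda_device_count())
    print(device, compute_type)
```

The chosen pair would then be passed to faster-whisper's `WhisperModel(..., device=device, compute_type=compute_type, download_root="/var/lib/void/whisper-models")`. Probing at load time rather than hard-coding the device is what lets the same unit start on a host without a GPU.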

### Phase D — Source-doc sync + alpha-4
- `safe_fetch.py` — Python port of `lib/ingest/safe_fetch.js` (scheme check, IP-range blocklist, redirect re-validation, `VOID_INGEST_ALLOW_PRIVATE` gate).
- `sync.source_doc` — `safe_fetch` upstream + sha256 diff against prior `body_sha` in metadata; updates `body_text` only on change.
- `lib/cron/sync_source_docs.js` + `lib/cron/index.js` (Node) — `node-cron` schedules `runSync` at 03:00 local time, enqueueing `sync.source_doc` for every row with `sync_source='url'`.
- Version bumped to `2.0.0-alpha.4` in `package.json`, `server.js`, and the `/health` test assertion. CHANGELOG appended.
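
A rough sketch of two Phase D pieces: the IP-range check inside `safe_fetch` and the sha256 change detection in `sync.source_doc`. Function names, the exact set of blocked ranges, the dict shape used for a source doc, and the way the `VOID_INGEST_ALLOW_PRIVATE` gate maps to `allow_private` are assumptions; the real port follows `lib/ingest/safe_fetch.js`.

```python
import hashlib
import ipaddress


def is_blocked_ip(address: str, allow_private: bool = False) -> bool:
    """True when a resolved address falls in a range the fetcher should refuse.

    `allow_private` stands in for the VOID_INGEST_ALLOW_PRIVATE gate (assumed
    semantics: when set, the blocklist is skipped entirely).
    """
    if allow_private:
        return False
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
    )


def apply_sync(doc: dict, body: bytes) -> bool:
    """Update `doc` in place only when the fetched body's sha256 changed.

    `doc` is a plain dict with `body_text` and a `metadata` dict holding the
    previous `body_sha`; returns True when an update happened.
    """
    metadata = doc.setdefault("metadata", {})
    new_sha = hashlib.sha256(body).hexdigest()
    if new_sha == metadata.get("body_sha"):
        return False  # unchanged upstream: skip the write
    doc["body_text"] = body.decode("utf-8", errors="replace")
    metadata["body_sha"] = new_sha
    return True
```

The address check needs to run again after every redirect, since a public URL can redirect to an internal address. Comparing digests instead of full bodies means an unchanged document costs one hash and no write.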

## Security findings handled inline

| Finding | Source | Resolution |
|---|---|---|
| yt-dlp argv flag smuggling in `video.py` | reviewer | `_validate_url` checks scheme is http(s); `--` passed before positional URL to stop flag parsing. |
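
The resolution in the table can be illustrated as follows. This is a sketch rather than the contents of `video.py`: `validate_url`, `metadata_argv`, and `audio_argv` are hypothetical names, and the `-o` output template is an added assumption. The yt-dlp flags (`-J`, `-x`, `--audio-format opus`) are the ones listed under Phase C.

```python
from urllib.parse import urlsplit


def validate_url(url: str) -> str:
    """Accept only http(s) URLs with a host; anything else raises ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"unsupported URL: {url!r}")
    return url


def metadata_argv(url: str) -> list:
    """argv for the metadata call; `--` ends option parsing before the URL."""
    return ["yt-dlp", "-J", "--", validate_url(url)]


def audio_argv(url: str, out_template: str) -> list:
    """argv for the audio extraction call, again with `--` before the URL."""
    return [
        "yt-dlp", "-x", "--audio-format", "opus",
        "-o", out_template,
        "--", validate_url(url),
    ]
```

The two defenses overlap on purpose. The scheme check rejects a value such as `--exec=...` outright, and the `--` separator ensures that even a value slipping past validation is treated as a positional argument. Passing the list to `subprocess.run` without `shell=True` avoids shell interpretation as well.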

## UI smoke

Plan 4 ships no SPA changes. The existing Plan 3 Jobs view shows extract.pdf / extract.image / ingest.video jobs alongside Node-side ones — both sides write to the same `pgboss.job` rows.

## Open items for the user

- **alpha-4 deploy.** Standing rule per Plans 2/3: won't deploy without your explicit OK. alpha-3 stays live until then.
- **`WHISPER_MODEL`** default is `small.en`. Bump to `medium.en` once you've stress-tested transcription quality.
- **yt-dlp cookies** for age-gated content — add `YT_DLP_COOKIES_FILE` env when wanted (small handler tweak).
- **Tesseract languages** beyond English — install via `tesseract-ocr-<lang>` packages on CT 311 and pass `lang="..."` to `image_to_string`.
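
For the configuration items above, the handler tweaks might look something like this. None of it exists yet: the helper names are hypothetical, `YT_DLP_COOKIES_FILE` is the proposed variable rather than an implemented one, and mapping it to yt-dlp's `--cookies` option is an assumption about how the tweak would be done.

```python
import os
from typing import Mapping, Sequence


def whisper_model_name(env: Mapping[str, str] = os.environ) -> str:
    """WHISPER_MODEL override, falling back to the documented default."""
    return env.get("WHISPER_MODEL", "small.en")


def cookie_args(env: Mapping[str, str] = os.environ) -> list:
    """Extra yt-dlp arguments when a cookies file is configured, else none."""
    path = env.get("YT_DLP_COOKIES_FILE")
    return ["--cookies", path] if path else []


def tesseract_lang(codes: Sequence[str] = ("eng",)) -> str:
    """Join language codes the way Tesseract expects, e.g. 'eng+deu'."""
    return "+".join(codes)
```

The cookie arguments would go before the `--` separator in the yt-dlp argv, and the joined language string is what `image_to_string(..., lang=...)` takes once the matching `tesseract-ocr-<lang>` packages are installed.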

## What's left after Plan 4

- **Plan 5** — Companion chat in right rail.
- **Plan 6** — Sacred Valley widgets ported from Void 1.x.
@@ -1,6 +1,6 @@
 {
   "name": "void-server",
-  "version": "2.0.0-alpha.3",
+  "version": "2.0.0-alpha.4",
   "type": "module",
   "private": true,
   "scripts": {
@@ -8,7 +8,7 @@ import { registerWorkers } from './lib/jobs/index.js';
 import { router as ingestRouter } from './lib/api/routes/ingest.js';
 import { startCron } from './lib/cron/index.js';
 
-const VERSION = '2.0.0-alpha.3';
+const VERSION = '2.0.0-alpha.4';
 
 export function createApp() {
   const app = express();
@@ -17,7 +17,7 @@ describe('server', () => {
     const res = await request(app).get('/health');
     expect(res.status).toBe(200);
     expect(res.body.db_ok).toBe(true);
-    expect(res.body.version).toBe('2.0.0-alpha.3');
+    expect(res.body.version).toBe('2.0.0-alpha.4');
   });
 
   it('GET /api/spaces without token returns 401', async () => {