Files

root 1f0e9a5f1b feat(workers): extract.pdf with Tesseract fallback

pdftotext first; falls back to per-page pdftoppm rasterization +
Tesseract OCR when the extracted text is < 200 chars. Updates
refs.body_text + metadata.extract.{method,chars} via the repo shim;
audit entry emitted with actor_kind='worker'.

born_digital.pdf fixture padded so pdftotext yields > 200 chars and
the test exercises the pdftotext path, not the OCR fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-06-01 04:59:53 +10:00
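
The fallback described in this commit message can be sketched roughly as below. This is a minimal illustration, not the code in void_workers: the function names, the returned dict shape, the method labels "pdftotext" and "tesseract", the 300 dpi setting, and the choice to strip whitespace before counting are all assumptions. Only the 200-character threshold and the tool order (pdftotext, then pdftoppm plus Tesseract) come from the commit message. The repo shim write and the audit entry are left out.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable

MIN_CHARS = 200  # threshold named in the commit message


def run_pdftotext(pdf_path: str) -> str:
    """Extract the embedded text layer; '-' sends pdftotext output to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def run_ocr(pdf_path: str) -> str:
    """Rasterize each page with pdftoppm, then OCR each image with Tesseract."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        # Assumed resolution; writes page-1.png, page-2.png, ... (zero-padded)
        subprocess.run(["pdftoppm", "-r", "300", "-png", pdf_path, prefix], check=True)
        pages = []
        for image in sorted(Path(tmp).glob("page-*.png")):
            # 'stdout' as the output base makes tesseract print the text
            out = subprocess.run(
                ["tesseract", str(image), "stdout"],
                capture_output=True, text=True, check=True,
            )
            pages.append(out.stdout)
        return "\n".join(pages)


def extract_text(
    pdf_path: str,
    text_extractor: Callable[[str], str] = run_pdftotext,
    ocr_extractor: Callable[[str], str] = run_ocr,
    min_chars: int = MIN_CHARS,
) -> dict:
    """Try the text layer first; fall back to OCR when it is too short."""
    text = text_extractor(pdf_path)
    # Assumption: whitespace-only output (e.g. form feeds) should not count
    if len(text.strip()) >= min_chars:
        return {"body_text": text, "method": "pdftotext", "chars": len(text)}
    text = ocr_extractor(pdf_path)
    return {"body_text": text, "method": "tesseract", "chars": len(text)}
```

Passing the two extractors as parameters keeps the decision logic testable without Poppler or Tesseract installed, which mirrors the point about the padded born_digital.pdf fixture: a test needs input long enough to stay on the pdftotext path.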

Name            Last commit                                           Date
tests           feat(workers): extract.pdf with Tesseract fallback    2026-06-01 04:59:53 +10:00
void_workers    feat(workers): extract.pdf with Tesseract fallback    2026-06-01 04:59:53 +10:00
.gitignore      feat(workers): Python skeleton + config + structlog   2026-06-01 04:41:33 +10:00
pyproject.toml  feat(workers): Python skeleton + config + structlog   2026-06-01 04:41:33 +10:00
README.md       feat(workers): Python skeleton + config + structlog   2026-06-01 04:41:33 +10:00

README.md

void-workers

Python ML ingest service that runs alongside void-server (Node). It lives in workers/, a sibling of lib/ in the void-v2 repo.

Local dev

cd workers
python3.12 -m venv .venv
. .venv/bin/activate
pip install -e ".[all]"
export DATABASE_URL="postgres://..."
python -m void_workers.runner

Tests

pip install -e ".[test,all]"
DATABASE_URL="postgres://..." pytest -v

See ../docs/superpowers/plans/2026-06-01-void-v2-plan4-workers.md for the full plan and ../docs/superpowers/specs/2026-06-01-void-v2-plan4-workers.md for the design.