Commit Graph

5 Commits

Author SHA1 Message Date
root
1f0e9a5f1b feat(workers): extract.pdf with Tesseract fallback
pdftotext first; falls back to per-page pdftoppm rasterization +
Tesseract OCR when the extracted text is < 200 chars. Updates
refs.body_text + metadata.extract.{method,chars} via the repo shim;
audit entry emitted with actor_kind='worker'.

born_digital.pdf fixture padded so pdftotext yields > 200 chars and
the test exercises the pdftotext path, not the OCR fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:59:53 +10:00
root
bbb08a677e test(workers): pdf/image test fixtures
born_digital.pdf (pdftotext extractable), scanned.pdf (image-only, OCR
fallback target), eng_text.png (clear Tesseract-readable text).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:57:41 +10:00
root
fba1ce48e4 feat(workers): runner loop + echo handler
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:43:52 +10:00
root
3e1dcbb7f8 feat(workers): pgboss claim/complete/fail via psycopg
Adds the Boss class — SELECT … FOR UPDATE SKIP LOCKED to atomically
claim, UPDATE state on completion. Retry semantics match pg-boss:
exponential backoff via retry_count / retry_delay / retry_backoff.

Forces client_encoding=UTF8 on every connection. The void2-db cluster
was initialized as SQL_ASCII so psycopg refuses to decode text by
default; UTF8 client_encoding works because the data is already UTF-8.
Node's pg lib is more forgiving and didn't surface this.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:43:26 +10:00
root
6e3798f6d1 feat(workers): Python skeleton + config + structlog
Plan 4 Phase A scaffolding. void-workers package at /workers/, sibling
of /lib/. pyproject.toml pins Python 3.12 with separate extras for
pdf / image / video / test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:41:33 +10:00