feat(workers): extract.pdf with Tesseract fallback

Runs pdftotext first; falls back to per-page pdftoppm rasterization + Tesseract OCR when the extracted text is < 200 chars. Updates refs.body_text and metadata.extract.{method,chars} via the repo shim; an audit entry is emitted with actor_kind='worker'.

The born_digital.pdf fixture is padded so that pdftotext yields > 200 chars and the test exercises the pdftotext path, not the OCR fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:59:53 +10:00
parent bbb08a677e
commit 1f0e9a5f1b
5 changed files with 206 additions and 1 deletion
--- a/workers/void_workers/handlers/__init__.py
+++ b/workers/void_workers/handlers/__init__.py
@@ -1,5 +1,6 @@
-from . import echo
+from . import echo, pdf

 REGISTRY = {
     echo.NAME: echo.handle,
+    pdf.NAME: pdf.handle,
 }
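
For readers without the rest of the diff at hand, the fallback described in the commit message could look roughly like the sketch below. This is not the handler from this commit: the function names (`run_pdftotext`, `run_ocr`, `extract_text`), the method labels `"pdftotext"` and `"tesseract"`, the 300 dpi rasterization, and stripping whitespace before the length check are all assumptions made for illustration. Only the 200-char threshold and the `method`/`chars` metadata keys come from the message.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable

MIN_CHARS = 200  # threshold stated in the commit message


def run_pdftotext(pdf_path: str) -> str:
    """Extract the embedded text layer; "-" sends pdftotext output to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def run_ocr(pdf_path: str) -> str:
    """Rasterize each page with pdftoppm, then OCR each image with Tesseract."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        # Writes page-<n>.png files (zero-padded page numbers) into tmp.
        subprocess.run(
            ["pdftoppm", "-r", "300", "-png", pdf_path, prefix],
            check=True,
        )
        for image in sorted(Path(tmp).glob("page-*.png")):
            # "stdout" as the output base makes tesseract print the text.
            out = subprocess.run(
                ["tesseract", str(image), "stdout"],
                capture_output=True, text=True, check=True,
            )
            pages.append(out.stdout)
    return "\n".join(pages)


def extract_text(
    pdf_path: str,
    text_extractor: Callable[[str], str] = run_pdftotext,
    ocr_extractor: Callable[[str], str] = run_ocr,
    min_chars: int = MIN_CHARS,
) -> dict:
    """Try the text layer first; fall back to OCR when it is too short."""
    text = text_extractor(pdf_path)
    method = "pdftotext"  # hypothetical label
    # Assumption: whitespace-only output counts as empty.
    if len(text.strip()) < min_chars:
        text = ocr_extractor(pdf_path)
        method = "tesseract"  # hypothetical label
    return {"body_text": text, "extract": {"method": method, "chars": len(text)}}
```

Passing the extractors in as parameters keeps the branch logic testable without Poppler or Tesseract installed, which matches the intent of padding the fixture so the test stays on the pdftotext path.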