feat(workers): extract.pdf with Tesseract fallback

pdftotext first; falls back to per-page pdftoppm rasterization +
Tesseract OCR when the extracted text is < 200 chars. Updates
refs.body_text + metadata.extract.{method,chars} via the repo shim;
audit entry emitted with actor_kind='worker'.
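
The fallback described above can be sketched roughly as follows. This is a
minimal illustration, not the worker code from this commit: the function
names, the returned dict shape, the 'pdftotext'/'tesseract' method labels,
the 300 dpi setting, and stripping whitespace before the length check are
all assumptions. Only the 200-char threshold and the tool order come from
the message.

```python
import subprocess
import tempfile
from pathlib import Path

MIN_CHARS = 200  # threshold stated in the commit message


def pdftotext_extract(pdf_path):
    """Read the PDF text layer with poppler's pdftotext ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_extract(pdf_path, dpi=300):
    """Rasterize each page with pdftoppm, then OCR each image with Tesseract."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # pdftoppm writes page-1.png, page-2.png, ... (zero-padded as needed)
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)],
            check=True,
        )
        for image in sorted(Path(tmp).glob("page-*.png")):
            # 'stdout' as the output base makes tesseract print the text
            out = subprocess.run(
                ["tesseract", str(image), "stdout"],
                capture_output=True, text=True, check=True,
            )
            pages.append(out.stdout)
    return "\n".join(pages)


def extract_pdf(pdf_path, text_extractor=pdftotext_extract,
                ocr_extractor=ocr_extract, min_chars=MIN_CHARS):
    """Try the text layer first; fall back to OCR when it is too short."""
    text = text_extractor(pdf_path)
    method = "pdftotext"  # hypothetical label for metadata.extract.method
    # Assumption: whitespace is ignored when comparing against the threshold
    if len(text.strip()) < min_chars:
        text = ocr_extractor(pdf_path)
        method = "tesseract"  # hypothetical label
    return {"body_text": text, "extract": {"method": method, "chars": len(text)}}
```

The extractors are passed in as parameters so the threshold logic can be
tested without poppler or Tesseract installed.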

born_digital.pdf fixture padded so pdftotext yields > 200 chars and
the test exercises the pdftotext path, not the OCR fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

root

2026-06-01 04:59:53 +10:00

parent bbb08a677e

commit 1f0e9a5f1b

5 changed files with 206 additions and 1 deletion

BIN
workers/tests/fixtures/born_digital.pdf vendored

Binary file not shown.
