Tries pdftotext first; falls back to per-page pdftoppm rasterization +
Tesseract OCR when the extracted text is shorter than 200 chars. Updates
refs.body_text + metadata.extract.{method,chars} via the repo shim;
emits an audit entry with actor_kind='worker'.
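
The fallback order described above can be sketched roughly as follows. This is an illustration, not the worker's actual code: the function names, the "ocr" method label, the 200 dpi rasterization setting, and the choice to strip whitespace before counting characters are all assumptions. The repo shim and the audit entry are not modelled.

```python
import subprocess
import tempfile
from pathlib import Path

MIN_CHARS = 200  # fallback threshold stated in the commit message


def run_pdftotext(pdf_path):
    # "-" as the output file makes pdftotext write to stdout.
    res = subprocess.run(["pdftotext", pdf_path, "-"],
                         capture_output=True, text=True, check=True)
    return res.stdout


def run_ocr(pdf_path):
    # Rasterize each page with pdftoppm, then OCR the pages in order.
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        subprocess.run(["pdftoppm", "-r", "200", "-png", pdf_path, prefix],
                       check=True)
        texts = []
        for png in sorted(Path(tmp).glob("page*.png")):
            # "stdout" as the output base makes tesseract print the text.
            res = subprocess.run(["tesseract", str(png), "stdout"],
                                 capture_output=True, text=True, check=True)
            texts.append(res.stdout)
    return "\n".join(texts)


def extract_text(pdf_path, text_fn=run_pdftotext, ocr_fn=run_ocr,
                 min_chars=MIN_CHARS):
    # Hypothetical entry point: the extractors are injectable so the
    # decision logic can be tested without poppler or Tesseract installed.
    text = text_fn(pdf_path)
    method = "pdftotext"
    # Assumption: whitespace (e.g. form feeds) does not count toward the limit.
    if len(text.strip()) < min_chars:
        text = ocr_fn(pdf_path)
        method = "ocr"  # assumed label; the real value is not given here
    return {"body_text": text,
            "extract": {"method": method, "chars": len(text)}}
```

Passing the extractors as parameters keeps the threshold decision separate from the external tools, so both paths can be exercised with stub functions.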
born_digital.pdf fixture padded so pdftotext yields > 200 chars and
the test exercises the pdftotext path, not the OCR fallback.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Test fixtures
Used by tests/test_pdf.py and tests/test_image.py. Three invariants:
- born_digital.pdf: must contain the literal string "void-workers" when extracted via pdftotext. Generated from /tmp/text.ps, then ps2pdf.
- scanned.pdf: pdftotext must return near-empty output (the OCR fallback test depends on this). Generated by converting eng_text.png to a single-page image-only PDF: convert -density 200 eng_text.png scanned.pdf
- eng_text.png: must contain the literal string "blackflame palette", rendered clearly enough for Tesseract to read it. Generated with: convert -size 800x200 xc:white -font DejaVu-Sans -pointsize 36 -fill black -annotate +50+100 "blackflame palette" eng_text.png
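
A small checker along these lines could confirm the three invariants after regenerating the fixtures. It is a sketch only: violated_invariants is a hypothetical helper, it works on text that has already been extracted, and reusing 200 chars as the "near-empty" bound is an assumption borrowed from the worker's fallback threshold.

```python
def violated_invariants(born_digital_text, scanned_text, image_ocr_text,
                        near_empty_max=200):
    """Return a list of human-readable problems; empty means all three hold.

    born_digital_text: pdftotext output for born_digital.pdf
    scanned_text:      pdftotext output for scanned.pdf
    image_ocr_text:    Tesseract output for eng_text.png
    """
    problems = []
    if "void-workers" not in born_digital_text:
        problems.append("born_digital.pdf: missing 'void-workers'")
    # Assumed bound: anything under the fallback threshold counts as near-empty.
    if len(scanned_text.strip()) >= near_empty_max:
        problems.append("scanned.pdf: pdftotext output is not near-empty")
    if "blackflame palette" not in image_ocr_text:
        problems.append("eng_text.png: OCR did not find 'blackflame palette'")
    return problems
```

Feeding it the real pdftotext and Tesseract output for each fixture gives a quick sanity check before running the full test suite.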
Regenerate via the snippets in ../../docs/superpowers/plans/2026-06-01-void-v2-plan4-workers.md Task B1.