Files

root 1f0e9a5f1b feat(workers): extract.pdf with Tesseract fallback

pdftotext first; falls back to per-page pdftoppm rasterization +
Tesseract OCR when the extracted text is < 200 chars. Updates
refs.body_text + metadata.extract.{method,chars} via the repo shim;
audit entry emitted with actor_kind='worker'.

born_digital.pdf fixture padded so pdftotext yields > 200 chars and
the test exercises the pdftotext path, not the OCR fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-06-01 04:59:53 +10:00

born_digital.pdf  feat(workers): extract.pdf with Tesseract fallback  2026-06-01 04:59:53 +10:00
eng_text.png      test(workers): pdf/image test fixtures              2026-06-01 04:57:41 +10:00
README.md         test(workers): pdf/image test fixtures              2026-06-01 04:57:41 +10:00
scanned.pdf       test(workers): pdf/image test fixtures              2026-06-01 04:57:41 +10:00

README.md

Test fixtures

Used by tests/test_pdf.py and tests/test_image.py. Three invariants:

born_digital.pdf: must contain the literal string void-workers when extracted via pdftotext, and pdftotext must yield more than 200 characters so the test exercises the pdftotext path rather than the OCR fallback. Generated by running ps2pdf on /tmp/text.ps.
scanned.pdf: pdftotext must return near-empty output (the OCR fallback test depends on this). Generated by converting eng_text.png to a single-page image-only PDF: convert -density 200 eng_text.png scanned.pdf.
eng_text.png: must contain the literal string blackflame palette, rendered clearly enough for Tesseract to read it. Generated with convert -size 800x200 xc:white -font DejaVu-Sans -pointsize 36 -fill black -annotate +50+100 "blackflame palette" eng_text.png.
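
These invariants tie back to the fallback rule in the extraction commit: pdftotext runs first, and OCR is used only when the extracted text is under 200 characters. The following is a minimal Python sketch of that rule, not the worker's actual code. The names extract_pdf, needs_ocr, and MIN_CHARS, the 200 dpi rasterization, the "tesseract" method label, and stripping whitespace before counting are all assumptions made for illustration.

```python
import subprocess
import tempfile
from pathlib import Path

MIN_CHARS = 200  # threshold from the commit message; the constant name is hypothetical


def needs_ocr(text: str, min_chars: int = MIN_CHARS) -> bool:
    """Return True when the pdftotext output is too short to trust.

    Stripping whitespace before counting is an assumption.
    """
    return len(text.strip()) < min_chars


def extract_pdf(path, run=subprocess.run) -> dict:
    """Try pdftotext first, then fall back to pdftoppm + Tesseract.

    `run` is injectable so the branch logic can be exercised without
    the Poppler and Tesseract binaries installed.
    """
    # "-" tells pdftotext to write the extracted text to stdout.
    text = run(
        ["pdftotext", str(path), "-"],
        capture_output=True, text=True, check=True,
    ).stdout
    if not needs_ocr(text):
        return {"method": "pdftotext", "chars": len(text), "text": text}

    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # Rasterize every page to PNG files named <prefix>-<n>.png.
        run(["pdftoppm", "-r", "200", "-png", str(path), str(prefix)], check=True)
        pages = []
        for png in sorted(Path(tmp).glob("page*.png")):
            # The output base "stdout" makes tesseract print the recognized text.
            ocr = run(
                ["tesseract", str(png), "stdout"],
                capture_output=True, text=True, check=True,
            ).stdout
            pages.append(ocr)
        text = "\n".join(pages)
    return {"method": "tesseract", "chars": len(text), "text": text}
```

With the fixtures above, born_digital.pdf should stay on the pdftotext branch and scanned.pdf should reach the Tesseract branch, provided Poppler and Tesseract are installed.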

Regenerate via the snippets in ../../docs/superpowers/plans/2026-06-01-void-v2-plan4-workers.md Task B1.
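
The plan document is not reproduced here. As a rough sketch, the commands quoted in the invariants above could be wrapped in Python as follows, assuming ImageMagick's convert and Ghostscript's ps2pdf are on PATH. The function names fixture_commands and regenerate are hypothetical, and the contents of /tmp/text.ps are not specified in this README, so the sketch only references that path.

```python
import subprocess
from pathlib import Path


def fixture_commands(out_dir: Path = Path("."),
                     ps_source: Path = Path("/tmp/text.ps")) -> list:
    """Build the argv lists for the three fixtures, in dependency order.

    eng_text.png comes first because scanned.pdf is converted from it.
    """
    png = str(out_dir / "eng_text.png")
    return [
        # Render the OCR target string onto a white 800x200 canvas.
        ["convert", "-size", "800x200", "xc:white",
         "-font", "DejaVu-Sans", "-pointsize", "36", "-fill", "black",
         "-annotate", "+50+100", "blackflame palette", png],
        # Wrap the PNG in a single-page, image-only PDF.
        ["convert", "-density", "200", png, str(out_dir / "scanned.pdf")],
        # Convert the PostScript source into the born-digital PDF.
        ["ps2pdf", str(ps_source), str(out_dir / "born_digital.pdf")],
    ]


def regenerate(out_dir: Path = Path("."), run=subprocess.run) -> None:
    """Run each command and stop at the first failure."""
    for cmd in fixture_commands(out_dir):
        run(cmd, check=True)
```

Calling regenerate(Path("tests/fixtures")) would rebuild all three files in that directory; the directory path here is only an example.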