Void-Homelab

Hynes/Void-Homelab

Commit Graph

Author

SHA1

Message

Date

root

1f0e9a5f1b

feat(workers): extract.pdf with Tesseract fallback

pdftotext first; falls back to per-page pdftoppm rasterization +
Tesseract OCR when the extracted text is < 200 chars. Updates
refs.body_text + metadata.extract.{method,chars} via the repo shim;
audit entry emitted with actor_kind='worker'.

born_digital.pdf fixture padded so pdftotext yields > 200 chars and
the test exercises the pdftotext path, not the OCR fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-06-01 04:59:53 +10:00
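
The fallback described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the repository's worker code: the function name `extract_pdf`, the injectable `run` helper, the 300 dpi rasterization setting, the method labels, and the choice to strip whitespace before comparing against the 200-character threshold are all assumptions. Only the `pdftotext`, `pdftoppm`, and `tesseract` command lines reflect the real CLI tools. The repo shim update and the audit entry are omitted.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Sequence

MIN_CHARS = 200  # threshold taken from the commit message


def _run(cmd: Sequence[str]) -> str:
    """Run a command and return its stdout as text (raises on non-zero exit)."""
    result = subprocess.run(list(cmd), check=True, capture_output=True, text=True)
    return result.stdout


def extract_pdf(
    pdf_path: str,
    run: Callable[[Sequence[str]], str] = _run,  # hypothetical hook, eases testing
    min_chars: int = MIN_CHARS,
) -> dict:
    """Try pdftotext first; fall back to per-page OCR when little text comes back."""
    # "-" tells pdftotext to write the extracted text to stdout.
    text = run(["pdftotext", pdf_path, "-"])
    method = "pdftotext"

    # Assumption: whitespace and form feeds do not count toward the threshold.
    if len(text.strip()) < min_chars:
        with tempfile.TemporaryDirectory() as tmp:
            prefix = str(Path(tmp) / "page")
            # pdftoppm writes one PNG per page: page-1.png, page-2.png, ...
            run(["pdftoppm", "-r", "300", "-png", pdf_path, prefix])
            pages = sorted(Path(tmp).glob("page-*.png"))
            # "stdout" as the output base makes tesseract print the recognized text.
            text = "\n".join(run(["tesseract", str(p), "stdout"]) for p in pages)
        method = "tesseract"

    # Shape loosely mirrors refs.body_text + metadata.extract.{method,chars}.
    return {"body_text": text, "method": method, "chars": len(text)}
```

Passing the command runner as a parameter keeps the decision logic testable without Poppler or Tesseract installed; a caller with both tools on the PATH could simply call `extract_pdf("scanned.pdf")`.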

root

bbb08a677e

test(workers): pdf/image test fixtures

born_digital.pdf (pdftotext extractable), scanned.pdf (image-only, OCR
fallback target), eng_text.png (clear Tesseract-readable text).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-06-01 04:57:41 +10:00

2 Commits