Commit Graph

4 Commits

Author SHA1 Message Date
root
1ba7aae439 feat(workers): ingest.video via yt-dlp + Whisper
yt-dlp pulls metadata (title, description, uploader, thumbnail) and
bestaudio (opus). faster-whisper transcribes; audio file removed after.
Creates a refs row with kind='video' and source_kind='youtube' for
YouTube URLs, generic 'video' otherwise. Idempotent on
sha256(space_id + url) via refs.external_id.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:07:33 +10:00
root
f2035c1de6 feat(workers): extract.image via Tesseract
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 05:00:21 +10:00
root
1f0e9a5f1b feat(workers): extract.pdf with Tesseract fallback
pdftotext first; falls back to per-page pdftoppm rasterization +
Tesseract OCR when the extracted text is < 200 chars. Updates
refs.body_text + metadata.extract.{method,chars} via the repo shim;
audit entry emitted with actor_kind='worker'.

born_digital.pdf fixture padded so pdftotext yields > 200 chars and
the test exercises the pdftotext path, not the OCR fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:59:53 +10:00
root
fba1ce48e4 feat(workers): runner loop + echo handler
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 04:43:52 +10:00