The url passed to yt-dlp is user-controllable (via /api/capture). Any
string starting with '-' would be parsed as a flag (e.g.
--config-location=/etc/passwd). Mitigations:
1. Validate scheme is http(s) and hostname is present before subprocess.
2. Pass `--` to yt-dlp so it stops flag parsing before the positional
URL.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
yt-dlp pulls metadata (title, description, uploader, thumbnail) and
bestaudio (opus). faster-whisper transcribes; audio file removed after.
Creates a refs row with kind='video' and source_kind='youtube' for
YouTube URLs, generic 'video' otherwise. Idempotent on
sha256(space_id + url) via refs.external_id.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftotext first; falls back to per-page pdftoppm rasterization +
Tesseract OCR when the extracted text is < 200 chars. Updates
refs.body_text + metadata.extract.{method,chars} via the repo shim;
audit entry emitted with actor_kind='worker'.
born_digital.pdf fixture padded so pdftotext yields > 200 chars and
the test exercises the pdftotext path, not the OCR fallback.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>