Whisper (CT 311) and Ollama (CT 102) share one A2000. Before loading
Whisper on CUDA, ask Ollama to unload its models (GET /api/ps then POST
/api/generate keep_alive:0) and wait for the card to clear, so the GPU
load has headroom. Best-effort and stdlib-only; Ollama reloads its
models on the next request, and the existing CUDA->CPU fallback covers
any failure.
Toggle via OLLAMA_FREE_BEFORE_STT; endpoint via OLLAMA_URL.
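
A minimal sketch of the unload step described above, not the committed
code. The function name free_ollama_vram, the injectable http helper, the
timeouts, the default URL, and the convention that
OLLAMA_FREE_BEFORE_STT=0 disables the step are all assumptions made for
illustration. Only GET /api/ps and POST /api/generate with keep_alive 0
come from the message itself.

```python
import json
import os
import time
import urllib.request

# Assumed default; the real default endpoint is not stated in the commit.
OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434")


def _http_json(url, payload=None, timeout=5):
    """GET when payload is None, otherwise POST the payload as JSON."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read() or b"{}")


def free_ollama_vram(base_url=OLLAMA_URL, http=_http_json,
                     wait_s=10.0, poll_s=0.5):
    """Ask Ollama to unload its models; return the names asked to unload.

    Best-effort: any error is swallowed and [] is returned, so the caller
    can still rely on the CUDA->CPU fallback.
    """
    # Assumption: "0" turns the feature off, anything else leaves it on.
    if os.environ.get("OLLAMA_FREE_BEFORE_STT", "1") == "0":
        return []
    try:
        loaded = http(f"{base_url}/api/ps").get("models", [])
        names = [m["name"] for m in loaded]
        for name in names:
            # keep_alive 0 asks Ollama to drop the model immediately.
            http(f"{base_url}/api/generate",
                 {"model": name, "keep_alive": 0})
        # Wait (bounded) until /api/ps reports nothing loaded.
        deadline = time.monotonic() + wait_s
        while names and time.monotonic() < deadline:
            if not http(f"{base_url}/api/ps").get("models"):
                break
            time.sleep(poll_s)
        return names
    except Exception:
        return []
```

Passing the HTTP helper as a parameter keeps the function testable
without a running Ollama instance; the bounded wait means a stuck unload
delays transcription by at most wait_s seconds.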
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cuda_available() only covers the "no GPU present" case. On a shared card
the GPU can exist but still fail to load the model (VRAM exhausted by
another process, e.g. Ollama). Try CUDA first, then fall back to a CPU
model on any load error instead of crashing the transcription job.
Supports HA portability (node without GPU) and a contended GPU. Adds
GPU-path + fallback tests.
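
The fallback described above could look roughly like this. It is a
sketch under assumptions, not the committed code: load_with_fallback and
its load callable (which takes a device string and returns a model) are
hypothetical names, and the Whisper library actually used is not named
in the message, so no library-specific call appears here.

```python
import logging

log = logging.getLogger("stt")


def load_with_fallback(load, cuda_available):
    """Return (model, device), preferring CUDA and falling back to CPU.

    load: hypothetical callable, load(device) -> model.
    cuda_available: callable returning True when a GPU is present.
    """
    if cuda_available():
        try:
            return load("cuda"), "cuda"
        except Exception as exc:
            # The GPU exists but the load failed, e.g. VRAM held by
            # another process. Log it and continue on CPU.
            log.warning("CUDA load failed, falling back to CPU: %s", exc)
    return load("cpu"), "cpu"
```

Catching any exception on the CUDA path is deliberately broad, matching
the "any load error" wording; a CPU load failure still propagates, since
there is nothing left to fall back to.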
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>