feat(workers): free Ollama VRAM before loading Whisper on the GPU

Whisper (CT 311) and Ollama (CT 102) share one A2000. Before loading Whisper on CUDA, ask Ollama to unload its models (GET /api/ps, then POST /api/generate with keep_alive: 0) and wait for the card to clear, so the GPU load has headroom.

Best-effort and stdlib-only; Ollama reloads cooperatively, and the existing CUDA->CPU fallback covers any failure.

Toggle via OLLAMA_FREE_BEFORE_STT; endpoint via OLLAMA_URL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -32,6 +32,12 @@ def whisper_model():
     # another process sharing the card). HA portability + a shared GPU
     # mean this must degrade gracefully, never hard-fail a transcription.
     if cuda_available():
+        # Make room on the shared GPU first (best-effort; never raises).
+        try:
+            from . import gpu
+            gpu.free_ollama_vram()
+        except Exception as e:
+            log.info("ollama_free_skipped", err=str(e))
         try:
             _whisper_model = _load_whisper("cuda", "float16")
         except Exception as e:
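
The gpu module itself is not part of this hunk. The following is a minimal sketch of how gpu.free_ollama_vram() might implement the steps named in the commit message (GET /api/ps, POST /api/generate with keep_alive 0, then wait), using only the standard library. Several details are assumptions and not taken from the commit: the default endpoint http://localhost:11434, the toggle defaulting to on, the wait_s and poll_s parameters, the injectable http and sleep hooks, and treating an empty /api/ps listing as "the card is clear" (the real code may check VRAM some other way).

```python
import json
import os
import time
import urllib.request


def _http_json(method, url, payload=None, timeout=5.0):
    """Send a JSON request with the standard library and decode the JSON reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8") or "{}")


def _enabled():
    # Assumption: the toggle is on unless explicitly switched off.
    value = os.environ.get("OLLAMA_FREE_BEFORE_STT", "1").strip().lower()
    return value not in ("0", "false", "no", "off")


def free_ollama_vram(base_url=None, wait_s=10.0, poll_s=0.5,
                     http=_http_json, sleep=time.sleep):
    """Ask Ollama to unload its models; return True once /api/ps lists none.

    Best-effort: every error is swallowed and reported as False, so the
    caller can go straight on to its CUDA load and CPU fallback.
    """
    if not _enabled():
        return False
    # Assumption: localhost:11434 as the fallback endpoint.
    base = (base_url or os.environ.get("OLLAMA_URL", "http://localhost:11434")).rstrip("/")
    try:
        loaded = http("GET", base + "/api/ps").get("models", [])
        for entry in loaded:
            name = entry.get("model") or entry.get("name")
            if name:
                # keep_alive=0 with no prompt asks Ollama to unload the model.
                http("POST", base + "/api/generate",
                     {"model": name, "keep_alive": 0, "stream": False})
        # Poll until nothing is listed as loaded, or give up at the deadline.
        deadline = time.monotonic() + wait_s
        while True:
            if not http("GET", base + "/api/ps").get("models", []):
                return True
            if time.monotonic() >= deadline:
                return False
            sleep(poll_s)
    except Exception:
        return False
```

The function returns a boolean and does not raise, which matches the "best-effort; never raises" comment in the hunk. The caller's own try/except then only has to guard the import.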