Stable Diffusion WebUI Critical CUDA Out of Memory Detection Rule

🎯 Overview

This PR adds a detection rule for CUDA out-of-memory failures in Stable Diffusion WebUI, one of the most common and disruptive problems in AUTOMATIC1111 Stable Diffusion deployments. The rule identifies CUDA memory exhaustion that escalates to complete WebUI service failure requiring manual intervention.

CRE Playground Links

CRE-2025-0130 Playground: Test Rule

🚨 Problem Statement

High-Severity Issue: Stable Diffusion WebUI CUDA failures cause:

Complete service interruption - WebUI becomes unresponsive and requires manual restart
Loss of current image generation progress and any queued generation tasks
Potential CUDA context corruption requiring process restart to recover
User experience degradation with failed image generations and error messages
System instability in multi-user deployments where one user’s OOM affects others
Cascading failures where recovery attempts also fail due to memory constraints

Why This Matters: Stable Diffusion CUDA failures are particularly dangerous because:

High-resolution image generation (1024x1024 and above) requires large amounts of GPU VRAM
Failures often occur mid-generation, so all progress on the in-flight image is lost
AUTOMATIC1111 WebUI has millions of users globally
Issues manifest as generic crashes, making diagnosis difficult
Memory fragmentation can prevent allocation of the contiguous blocks a request needs, even when enough total VRAM is free
Requires immediate intervention to restore service functionality
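
To give a rough sense of why resolution matters so much for VRAM, here is a back-of-the-envelope sketch. It assumes an SD 1.x-style setup (VAE downsampling factor of 8, 8 attention heads, fp16 values) and a naive self-attention that materializes the full score matrix. These parameters and the function name are assumptions for illustration, not measurements from this PR; memory-efficient attention kernels avoid most of this cost.

```python
def attention_scores_bytes(width, height, heads=8, bytes_per_value=2, vae_factor=8):
    """Bytes needed for one layer's self-attention score matrices.

    Assumes attention runs over every latent position, so the token count is
    (width / vae_factor) * (height / vae_factor) and each head stores a
    tokens x tokens matrix.
    """
    tokens = (width // vae_factor) * (height // vae_factor)
    return heads * tokens * tokens * bytes_per_value


for side in (512, 768, 1024):
    mib = attention_scores_bytes(side, side) / 2**20
    print(f"{side}x{side}: {mib:.0f} MiB for one attention layer's scores")
```

Under these assumptions the cost grows with the square of the pixel count, so doubling each side from 512 to 1024 multiplies this term by 16.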

Rule Performance

Detection Rate: 2 critical hits with sequence matching
Processing Speed: 64.52K lines/s
Window: 30 seconds, long enough to capture the failure cascade
False Positive Rate: Expected to be low, since the rule matches specific PyTorch CUDA error patterns
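
For readers unfamiliar with windowed sequence matching, the idea can be sketched in a few lines of Python. This is only an illustration of the technique, not how preq evaluates the rule: the three patterns are picked from the example error patterns in this PR, the `(timestamp, line)` input shape is an assumption, and the actual patterns and ordering live in stable-diffusion-cuda-oom.yaml.

```python
import re
from datetime import datetime, timedelta

# Ordered steps of the failure cascade (hypothetical selection for illustration).
SEQUENCE = [
    re.compile(r"torch\.cuda\.OutOfMemoryError: CUDA out of memory"),
    re.compile(r"Recovery failed"),
    re.compile(r"manual intervention required"),
]
WINDOW = timedelta(seconds=30)


def find_sequences(events):
    """Return (start, end) timestamps for each in-order match that fits in WINDOW.

    `events` is an iterable of (timestamp, line) tuples sorted by timestamp.
    Simplification: a partial match is not restarted if the first pattern
    appears again before the window expires.
    """
    hits = []
    start = None  # timestamp of the first matched step
    step = 0      # index of the next pattern to look for
    for ts, line in events:
        # Abandon a partial match once the window has elapsed.
        if step > 0 and ts - start > WINDOW:
            start, step = None, 0
        if SEQUENCE[step].search(line):
            if step == 0:
                start = ts
            step += 1
            if step == len(SEQUENCE):
                hits.append((start, ts))
                start, step = None, 0
    return hits


# Hypothetical sample events built from the error patterns listed in this PR.
t0 = datetime(2025, 8, 27, 13, 17, 0)
sample = [
    (t0, "torch.cuda.OutOfMemoryError: CUDA out of memory"),
    (t0 + timedelta(seconds=5), "Recovery failed - WebUI requires restart"),
    (t0 + timedelta(seconds=12), "Complete service failure - manual intervention required"),
]
print(find_sequences(sample))
```

Requiring the steps in order and within a bounded window is what keeps a lone OOM message, which the WebUI may survive, from being reported as a full service failure.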

📊 Stable Diffusion Issues Covered

#	Issue Type	Example Error Pattern
1	CUDA Memory Exhaustion	`torch.cuda.OutOfMemoryError: CUDA out of memory`
2	Model Loading Failures	Failed to allocate tensor on device
3	Generation Process Crashes	Fatal error during image generation
4	WebUI Unresponsiveness	Gradio interface becoming unresponsive
5	Recovery Failures	Recovery failed - WebUI requires restart
6	CUDA Context Corruption	CUDA context may be corrupted
7	Complete Service Failure	Complete service failure - manual intervention required
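
As a quick way to sanity-check a log against the table above outside of preq, the example patterns can be turned into a small classifier. This is a minimal sketch: the issue names and message strings are copied from the table, while the helper names `classify` and `summarize` and the choice to match each message as a literal substring regex are assumptions for illustration.

```python
import re
from collections import Counter

# Issue types and example patterns copied from the table; matching them as
# literal regexes is an illustrative choice, not the rule's actual logic.
ISSUE_PATTERNS = [
    ("CUDA Memory Exhaustion", re.compile(r"torch\.cuda\.OutOfMemoryError: CUDA out of memory")),
    ("Model Loading Failures", re.compile(r"Failed to allocate tensor on device")),
    ("Generation Process Crashes", re.compile(r"Fatal error during image generation")),
    ("WebUI Unresponsiveness", re.compile(r"Gradio interface becoming unresponsive")),
    ("Recovery Failures", re.compile(r"Recovery failed - WebUI requires restart")),
    ("CUDA Context Corruption", re.compile(r"CUDA context may be corrupted")),
    ("Complete Service Failure", re.compile(r"Complete service failure - manual intervention required")),
]


def classify(line):
    """Return the first issue type whose pattern occurs in `line`, else None."""
    for issue, pattern in ISSUE_PATTERNS:
        if pattern.search(line):
            return issue
    return None


def summarize(lines):
    """Count how many lines fall under each issue type."""
    return Counter(issue for issue in map(classify, lines) if issue is not None)
```

For example, passing an open file handle for logs/sd-webui-cuda-oom.log to `summarize` would give a per-issue count to compare against the hits preq reports.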

🧪 Testing & Validation

CRE Rule Testing

cd stable-diffusion-demo
cat logs/sd-webui-cuda-oom.log | preq -r ../rules/cre-2025-0130/stable-diffusion-cuda-oom.yaml -d

Test Results: Screenshot from 2025-08-27 13-17-40

🎬 Demo Environment

Repo link (private invitation already sent): https://github.com/MAVRICK-1/cuda-oom

https://github.com/user-attachments/assets/321e1dd6-49da-4139-8c3c-9bf9f2164f89

./start-demo.sh
cat logs/roop-cuda-oom.log | preq -r stable-diffusion-cuda-oom.yaml -d

Fixes #130 /claim #130