New Rule: CoreDNS Unavailable

Details

/claim #41

Reproducible test setup: Uses kind cluster with CoreDNS scaling reproduction
Test data in a test.log file: Real CoreDNS failure timeline with early warning signals
CRE Rule: CRE-2025-0071 - CoreDNS unavailable detection with severity 1 (high)
Local Testing: ✅ Verified with preq CLI - rule fires correctly on all patterns

CRE Details

CRE ID: CRE-2025-0071
Severity: 1 (High)
Impact Score: 9/10
Title: CoreDNS unavailable
Category: kubernetes-problem

Problem Detection

This rule provides early warning for CoreDNS failures by detecting:

Scaling to zero: Scaled down replica set coredns-.+ from [1-9]+ to 0
Container termination: Stopping container coredns
Readiness probe failures: Readiness probe failed.+connection refused

Why This Matters

CoreDNS unavailability is a critical failure that can break all service discovery in the cluster, cause cascading readiness probe failures, lead to complete cluster outage, and is commonly cited in production post-mortems.

Commands for Sample Data

A full reproduction repo is available here with a README and script that automates the below simplified commands

The test.log file associated with CRE-2025-0071 was generated by reproducing CoreDNS failure scenarios. The core commands executed to produce the log entries are:

1. To reproduce CoreDNS scaling failure:

# Scale CoreDNS to zero (triggers immediate detection)
kubectl -n kube-system scale deployment/coredns --replicas=0

# Capture the scaling event
kubectl -n kube-system get events --sort-by=.lastTimestamp | grep coredns

2. To capture readiness probe failures:

# Monitor pod termination and readiness failures
kubectl -n kube-system get events --watch --field-selector reason=Killing,reason=Unhealthy | grep coredns

3. To collect timeline for test.log:

# Collect events with timestamps
kubectl -n kube-system get events --sort-by=.lastTimestamp -o custom-columns=TIME:.lastTimestamp,TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,MESSAGE:.message | grep coredns

The test.log contains real failure events that demonstrate how this rule detects CoreDNS unavailability before DNS queries start timing out, providing critical early warning for cluster-wide DNS outages.

Video

https://github.com/user-attachments/assets/e07ed205-2354-4124-8c7d-7d01d56a0ade

Details

CRE Details

Problem Detection

Why This Matters

Commands for Sample Data

Video

Claim

Contributors

Sponsors