/claim #41
This rule provides early warning for CoreDNS failures by detecting:
Scaled down replica set coredns-.+ from [1-9]+ to 0
Stopping container coredns
Readiness probe failed.+connection refused
CoreDNS unavailability is a critical failure that can break all service discovery in the cluster, cause cascading readiness probe failures, lead to complete cluster outage, and is commonly cited in production post-mortems.
A full reproduction repo is available here with a README and script that automates the below simplified commands
The test.log
file associated with CRE-2025-0071 was generated by reproducing CoreDNS failure scenarios. The core commands executed to produce the log entries are:
1. To reproduce CoreDNS scaling failure:
# Scale CoreDNS to zero (triggers immediate detection)
kubectl -n kube-system scale deployment/coredns --replicas=0
# Capture the scaling event
kubectl -n kube-system get events --sort-by=.lastTimestamp | grep coredns
2. To capture readiness probe failures:
# Monitor pod termination and readiness failures
kubectl -n kube-system get events --watch --field-selector reason=Killing,reason=Unhealthy | grep coredns
3. To collect timeline for test.log:
# Collect events with timestamps
kubectl -n kube-system get events --sort-by=.lastTimestamp -o custom-columns=TIME:.lastTimestamp,TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,MESSAGE:.message | grep coredns
The test.log
contains real failure events that demonstrate how this rule detects CoreDNS unavailability before DNS queries start timing out, providing critical early warning for cluster-wide DNS outages.
https://github.com/user-attachments/assets/e07ed205-2354-4124-8c7d-7d01d56a0ade
Nicolas Yarosz
@yarosz
Prequel
@prequel-dev