🔧 Fix Summary
This PR resolves the sporadic SSH ‘Permission denied (publickey)’ errors reported in #7724.
🎯 Root Cause Analysis
The issue was caused by three interconnected problems in SSH multiplexing:
- Incomplete SSH Key Validation - Only checked that the key file exists, not whether its content is valid or its permissions are correct
- Missing Retry Logic - Failed connections weren’t retried, leaving stale multiplexed sockets
- No Connection Recovery - Health checks detected failures but didn’t auto-recover
✅ Fixes Implemented
Fix 1: Enhanced SSH Key Validation
- ✅ Verify file permissions (must be 0600; OpenSSH refuses private keys that are accessible to group or others)
- ✅ Validate PEM format with regex check
- ✅ Test key accessibility with ssh-keygen
- ✅ Detailed logging for diagnostics
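The checks in Fix 1 can be sketched roughly as follows. This is an illustrative Python sketch, not the code in this PR: the function names (`check_permissions`, `looks_like_pem`, `key_is_loadable`, `validate_key`) and the exact regex are assumptions. The only external call is `ssh-keygen -y -f <key>`, which derives the public key from a private key and exits non-zero when it cannot read it.

```python
import os
import re
import stat
import subprocess

# PEM-style private key armor, e.g. "-----BEGIN OPENSSH PRIVATE KEY-----".
# The pattern is an assumption; the PR's actual regex may differ.
PEM_HEADER = re.compile(r"^-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----\r?$", re.MULTILINE)
PEM_FOOTER = re.compile(r"^-----END (?:[A-Z0-9]+ )*PRIVATE KEY-----\r?$", re.MULTILINE)


def check_permissions(path: str) -> bool:
    """Return True when the file mode is exactly 0600."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o600


def looks_like_pem(text: str) -> bool:
    """Return True when both the BEGIN and END armor lines are present."""
    return bool(PEM_HEADER.search(text)) and bool(PEM_FOOTER.search(text))


def key_is_loadable(path: str) -> bool:
    """Ask ssh-keygen to derive the public key from the private key."""
    try:
        result = subprocess.run(
            ["ssh-keygen", "-y", "-f", path],
            stdin=subprocess.DEVNULL,   # never block on a passphrase prompt
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False  # ssh-keygen is not installed or not executable
    return result.returncode == 0


def validate_key(path: str) -> list[str]:
    """Return a list of problems; an empty list means the key passed."""
    if not os.path.isfile(path):
        return ["key file not found"]
    problems = []
    if not check_permissions(path):
        problems.append("permissions are not 0600")
    with open(path, errors="replace") as handle:
        if not looks_like_pem(handle.read()):
            problems.append("content is not a PEM private key")
    # Only shell out once the cheap checks have passed.
    if not problems and not key_is_loadable(path):
        problems.append("ssh-keygen could not load the key")
    return problems
```

Returning a list of problems rather than a boolean keeps the diagnostics available for logging.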
Fix 2: Automatic Retry with Recovery
- ✅ Up to 3 retry attempts on connection failures
- ✅ Re-validate key before each retry
- ✅ Clean up stale mux sockets between attempts
- ✅ Progressive backoff with 1s sleep
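The retry flow in Fix 2 could look something like the sketch below. It is a hedged illustration in Python rather than the PR's implementation: `connect_with_retry` and its callback parameters are hypothetical names, and the delay schedule (1s, then 2s) is one assumed reading of "progressive backoff with 1s sleep".

```python
import logging
import time
from typing import Callable

log = logging.getLogger("ssh.retry")


def connect_with_retry(
    connect: Callable[[], bool],
    validate_key: Callable[[], bool],
    cleanup_socket: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to connect up to max_attempts times, recovering between attempts."""
    for attempt in range(1, max_attempts + 1):
        # Re-validate the key before every attempt, including the first.
        if not validate_key():
            log.warning("attempt %d: key validation failed", attempt)
        elif connect():
            return True
        else:
            log.warning("attempt %d: connection failed", attempt)

        if attempt < max_attempts:
            # Drop any stale multiplexed socket before trying again.
            cleanup_socket()
            # Assumed schedule: base_delay * attempt (1s, 2s, ...).
            sleep(base_delay * attempt)
    return False
```

Passing `sleep` as a parameter lets tests run without real delays.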
Fix 3: Self-Healing Health Checks
- ✅ Auto-recover connections on health check failures
- ✅ Rebuild multiplexed connection when unhealthy
- ✅ Log all recovery attempts
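One way to picture the self-healing check in Fix 3 is the Python sketch below. It is an assumption-laden illustration, not this PR's code: the helper names and the `10m` persist value are hypothetical. It relies only on standard OpenSSH options: `ssh -O check` and `ssh -O exit` to query and stop a control master, and `-fN` with `ControlMaster`, `ControlPath`, and `ControlPersist` to start a new one.

```python
import logging
import os
import subprocess
from typing import Callable, List

log = logging.getLogger("ssh.health")


def mux_command(action: str, control_path: str, target: str) -> List[str]:
    """Build an `ssh -O <action>` command for an existing control socket."""
    return ["ssh", "-O", action, "-o", f"ControlPath={control_path}", target]


def master_command(control_path: str, target: str, persist: str = "10m") -> List[str]:
    """Build the command that starts a background control master."""
    return ["ssh", "-fN", "-o", "ControlMaster=auto",
            "-o", f"ControlPath={control_path}",
            "-o", f"ControlPersist={persist}", target]


def run_quiet(cmd: List[str]) -> int:
    """Run a command and return its exit code.

    Output goes to DEVNULL rather than pipes, because a backgrounded
    `ssh -f` would keep captured pipes open and block the caller.
    """
    return subprocess.run(cmd, stdin=subprocess.DEVNULL,
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode


def ensure_healthy(control_path: str, target: str,
                   run: Callable[[List[str]], int] = run_quiet) -> bool:
    """Check the multiplexed connection and rebuild it when unhealthy."""
    if run(mux_command("check", control_path, target)) == 0:
        return True
    log.warning("mux connection to %s unhealthy, rebuilding", target)
    run(mux_command("exit", control_path, target))  # best effort; may fail
    try:
        os.remove(control_path)  # remove a leftover socket file, if any
    except FileNotFoundError:
        pass
    if run(master_command(control_path, target)) != 0:
        log.error("could not re-establish mux connection to %s", target)
        return False
    return run(mux_command("check", control_path, target)) == 0
```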
📊 Testing Recommendations
- Monitor SSH connection logs for retry attempts
- Change the key file's permissions and verify that the problem is detected and automatically fixed
- Test with network interruptions to verify recovery
- Verify that no ‘Permission denied (publickey)’ errors appear in the next deployment
💡 Impact
This should eliminate the sporadic authentication failures by:
- Ensuring SSH keys are always valid and accessible
- Automatically recovering from temporary connection issues
- Providing detailed diagnostics for any remaining issues
Fixes: #7724
/claim #7724