A week or two ago, a number of our servers became unreachable via Zabbix. Both servers in each affected pair (we run everything in pairs) showed the same problem, and no other hosts did.
The Zabbix log showed this rather cryptic error:
SSL_shutdown() with 172.31.16.32 set result code to 6
That’s not very helpful, and several hours of head-scratching went by before we finally stumbled across what was happening.
The X.509 (or SSL) certificate on the target machine is issued by Let’s Encrypt, who have started signing new certificates with a new key. Within Zabbix, we check that the certificate presented by the monitored host when we connect was issued by a specific issuer. Since the issuer had changed, the Zabbix server was refusing to connect.
How did we fix it? Really easily – by going into the host configuration in Zabbix and setting the issuer to:
CN=R3,O=Let's Encrypt,C=US
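If you want to check what issuer a certificate actually carries before typing it into Zabbix, here is a minimal sketch in Python. It assumes you already have the certificate's issuer in the shape that the standard library's ssl.SSLSocket.getpeercert() returns it; the helper names issuer_to_string and issuer_matches are ours, invented for illustration, and the formatting is a simplification that does not escape special characters the way a full distinguished-name formatter would.

```python
# Map the long attribute names used by Python's ssl module to the short
# names that appear in an issuer string such as "CN=R3,O=Let's Encrypt,C=US".
SHORT_NAMES = {
    "commonName": "CN",
    "organizationName": "O",
    "organizationalUnitName": "OU",
    "countryName": "C",
    "stateOrProvinceName": "ST",
    "localityName": "L",
}


def issuer_to_string(issuer):
    """Render the 'issuer' value from getpeercert() as a single string.

    getpeercert() lists the name from least to most specific (C, O, CN),
    so the order is reversed here to put CN first.
    Simplification: special characters in values are not escaped.
    """
    parts = []
    for rdn in reversed(issuer):
        for name, value in rdn:
            parts.append(f"{SHORT_NAMES.get(name, name)}={value}")
    return ",".join(parts)


def issuer_matches(issuer, expected):
    """Return True if the rendered issuer equals the expected string."""
    return issuer_to_string(issuer) == expected


# Example issuer in the shape getpeercert() returns it (illustrative data).
example = (
    (("countryName", "US"),),
    (("organizationName", "Let's Encrypt"),),
    (("commonName", "R3"),),
)

print(issuer_to_string(example))
print(issuer_matches(example, "CN=R3,O=Let's Encrypt,C=US"))
```

Running this against the example data prints the same issuer string we entered in the host configuration, followed by True.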
Ridiculously straightforward. The clue was there all along, too: if you hover over the red ‘ZBX’ status box, you’ll see an error saying that the wrong issuer was found on the certificate.
That’s a few hours we’ll never get back, but it’s great to have solved the problem. And as it happens, there was a correlation between the servers affected – two pairs were built at the same time, and the remaining pair was built just about 90 days before the others. Let’s Encrypt certificates have a lifetime of 90 days, so all three pairs renewed their certificates at around the same time and picked up the new issuer together.
We’re expecting further servers to drop off Zabbix, but at least we know why and we can fix it.
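Since every host has to renew before its current certificate expires, the issue date plus 90 days gives the latest date by which a host will be presenting a certificate from the new issuer. A rough sketch of that arithmetic follows; the host names and issue dates are made up for illustration, and in practice a host will usually renew some time before the deadline.

```python
from datetime import date, timedelta

# Let's Encrypt certificate lifetime, as described above.
LIFETIME = timedelta(days=90)


def latest_switch_date(issued_on):
    """Latest date by which a host must have renewed its certificate."""
    return issued_on + LIFETIME


# Hypothetical hosts and certificate issue dates, purely for illustration.
issued = {
    "web-a": date(2020, 12, 1),
    "web-b": date(2020, 11, 15),
}

# List hosts in the order we would expect them to drop off Zabbix at the latest.
deadlines = sorted(
    ((host, latest_switch_date(day)) for host, day in issued.items()),
    key=lambda item: item[1],
)
for host, switch_by in deadlines:
    print(f"{host}: new issuer expected by {switch_by.isoformat()}")
```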