All services unavailable
Resolved
Mar 26 at 09:51am CET
From 9:50 AM CET to 11:27 AM CET, our Mercury service experienced a major incident caused by failing health checks on Hetzner Cloud load balancers. The incident response team began mitigation at 10:02 AM CET.
Timeline of Events:
9:50 AM: First alert received through our incident management platform.
10:02 AM: Failing load balancer identified; mitigation efforts began.
10:14 AM: A new load balancer was created to reroute traffic, while work continued to restore the original load balancer.
10:16 AM: The failed load balancer reported passing health checks; connectivity was temporarily restored.
10:20 AM: Load balancer health checks began failing again.
10:53 AM: The failing load balancer was moved to the hel1 datacenter.
11:27 AM: Traffic was rerouted, and the health check configuration was updated; connectivity to load balancer targets was restored.
Future mitigation steps:
Establishing a fallback address for our services to reduce response time if traffic rerouting becomes necessary.
We are currently testing an alternative load balancing solution outside Hetzner Cloud. Once we have finished testing this redundancy improvement, we will send an update through our customer success team.
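For readers who want to picture how a fallback address can shorten rerouting, here is a minimal Python sketch. It is an illustration only and does not describe our production setup: the hostnames, the `pick_endpoint` helper, and the probe function are all hypothetical, and a real probe would typically issue an HTTP or TCP check with a short timeout.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose health probe passes, or None.

    Endpoints are tried in order, so the primary address goes first
    and the fallback address is only used when the primary fails.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises (timeout, refused connection, ...)
            # is treated the same as a failed health check.
            continue
    return None


# Hypothetical addresses, not real service hostnames.
ENDPOINTS = ["primary.example.invalid", "fallback.example.invalid"]


def simulated_probe(endpoint: str) -> bool:
    """Stand-in probe that simulates the primary load balancer failing."""
    return endpoint.startswith("fallback")


chosen = pick_endpoint(ENDPOINTS, simulated_probe)
print(chosen)
```

Passing the probe in as a function keeps the selection logic independent of how health is actually measured, so the same ordering rule works with any check.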
Affected services