All services unreachable
Resolved
Jun 27 at 02:38pm CEST
Thursday, Jun 27, 2024 - Incident post mortem
By the Mercury Infrastructure Team
Today we experienced an outage in our Kubernetes infrastructure. This preliminary incident report details the nature of the outage and our response.
We understand that this issue severely impacted our users, and we apologize sincerely to everyone who was affected.
Issue Summary
From 1:32 PM to 2:38 PM CEST, requests to all of our applications resulted in PR_END_OF_FILE_ERROR messages (on Firefox) and no response (on Chrome).
Timeline (all times CEST)
1:32 PM: Our monitoring tools alert us to the error
until 1:32 PM: Our infrastructure team starts investigating the error
until 1:33 PM: First user reports
until 1:35 PM: We start investigating possible culprits, such as certificate validity and other SSL-related issues
until 1:52 PM: We identify a LoadBalancer as a possible root cause for the problem
until 2:06 PM: We start replacing one of our control plane nodes based on log anomalies
until 2:12 PM: We manage to briefly recover from the error by rescaling the LoadBalancer
until 2:18 PM: The LoadBalancer loses all connections to relevant services again
until 2:17 PM: We run our cluster deployment pipeline to replace the faulty node
until 2:24 PM: While monitoring the deployment pipeline, we notice something is off with another node
until 2:28 PM: The replacement node comes online
until 2:34 PM: We ensure that no negative side-effects will occur upon replacing another node
until 2:38 PM: We remove the second node from the control plane
2:38 PM: All services are responsive again
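One of the first checks in the timeline, certificate validity, can be sketched in a few lines of Python. This is only an illustration of that kind of check, not the tooling the team actually used; the function names and the `example.com` hostname in the usage note are hypothetical.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the 'notAfter' field of the certificate served at host:port.

    The default context verifies the certificate, so an already expired or
    otherwise invalid certificate raises ssl.SSLCertVerificationError here.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a 'notAfter' string into the number of days left until expiry."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400
```

For example, `days_until_expiry(fetch_not_after("example.com"))` would report the remaining lifetime of the certificate served by that host. A negative value, or a verification error during the handshake, would point to a certificate problem.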
Root Cause
A faulty control plane node.
How one out of five control plane nodes could cause this downtime needs further investigation.
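To illustrate why this is surprising, here is a minimal sketch of the usual quorum arithmetic. It assumes that each control plane node hosts one etcd member (a stacked topology), which this report does not state; the function names are hypothetical.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    nodes = 5  # control plane size mentioned in this report
    print(f"quorum: {quorum(nodes)}, tolerated failures: {fault_tolerance(nodes)}")
```

Under that assumption, a five-node control plane should keep its quorum with up to two members down, so a single faulty node would not be expected to take everything offline on quorum grounds alone.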
Resolution and recovery
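We replaced the faulty control plane node through our cluster deployment pipeline and removed a second node from the Kubernetes control plane, after which all services became responsive again.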
Corrective and Preventative Measures
TBA
Sincerely,
The Mercury Infrastructure Team
Posted by Toby Irmer, MD
Created
Jun 27 at 01:32pm CEST
All services are unreachable