Downtime

All services unreachable

Jun 27 at 01:32pm CEST
Affected services
Mercury
MMM
Ad Schedule
Data Connection Service

Resolved
Jun 27 at 02:38pm CEST

Thursday, Jun 27, 2024 - Incident post mortem

By the Mercury Infrastructure Team

Today we experienced an outage in our Kubernetes infrastructure. This preliminary incident report covers the outage of Jun 27, 2024, and our response to it.
We understand this issue has severely impacted our users and apologize sincerely to everyone who was affected.

Issue Summary

From 1:32 PM to 2:38 PM CEST, requests to all of our applications resulted in PR_END_OF_FILE_ERROR messages (on Firefox) and no response (on Chrome).

Timeline (all times CEST)

1:32 PM: Our monitoring tools alert us to the error
until 1:32 PM: Our infrastructure team starts investigating the error
until 1:33 PM: First user reports
until 1:35 PM: We start investigating possible culprits, such as certificate validity and other SSL-related issues
until 1:52 PM: We identify a LoadBalancer as a possible root cause for the problem
until 2:06 PM: We start replacing one of our control plane nodes based on log anomalies
until 2:12 PM: We manage to briefly recover from the error by rescaling the LoadBalancer
until 2:18 PM: The LoadBalancer loses all connections to relevant services again
until 2:17 PM: We run our cluster deployment pipeline to replace the faulty node
until 2:24 PM: While monitoring the deployment pipeline, we notice something is off with another node
until 2:28 PM: The replacement node comes online
until 2:34 PM: We ensure that no negative side-effects will occur upon replacing another node
until 2:38 PM: We remove the second node from the control plane
2:38 PM: All services are responsive again
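
One of the early checks in the timeline above was certificate validity. A minimal sketch of such a check follows, assuming a publicly reachable HTTPS endpoint; the function names and the host passed to them are hypothetical and are not taken from our actual tooling.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days left until a certificate expires.

    `not_after` uses the format found in ssl.getpeercert()["notAfter"],
    for example "Jul 27 12:00:00 2024 GMT". `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the server certificate's notAfter field.

    With the default context, an already expired or otherwise invalid
    certificate makes the handshake raise ssl.SSLError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling days_until_expiry(fetch_not_after("status.example.com"), datetime.now(timezone.utc)) would report the remaining validity for that (placeholder) host.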

Root Cause

A faulty control plane node.
How one out of five control plane nodes could cause this downtime needs further investigation.
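
To illustrate why this is surprising: assuming each control plane node runs one etcd member (a stacked topology, which this report does not confirm), the cluster keeps quorum as long as a majority of members is healthy. The sketch below only shows that arithmetic; the function names are illustrative.

```python
def quorum(members: int) -> int:
    """Smallest number of healthy members that still forms a majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that can fail while a majority remains."""
    return members - quorum(members)


# Assumption: five control plane nodes, one etcd member on each.
control_plane_nodes = 5
print(quorum(control_plane_nodes))              # majority size
print(tolerated_failures(control_plane_nodes))  # failures that keep quorum
```

Under that assumption, five members need three for a majority and tolerate two failures, so a single failed node alone would not be expected to cost the cluster its quorum.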

Resolution and recovery

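We replaced the faulty control plane node and then removed a second suspect node from the control plane, after which all services became responsive again.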

Corrective and Preventative Measures

To be announced.

Sincerely,
The Mercury Infrastructure Team

Posted by Toby Irmer, MD