Incidents | Mercury Media Technology GmbH & Co. KG
Incidents reported on the status page for Mercury Media Technology GmbH & Co. KG
https://status.onmercury.io/

Service up/down events (newest first):

- Ad Schedule recovered (Mon, 15 Dec 2025 14:51:29 +0000)
- Ad Schedule went down (Mon, 15 Dec 2025 14:51:04 +0000)
- Ad Schedule recovered (Thu, 11 Dec 2025 19:12:02 +0000)
- MMM recovered (Thu, 11 Dec 2025 19:12:00 +0000)
- MMM went down (Thu, 11 Dec 2025 19:11:28 +0000)
- Ad Schedule went down (Thu, 11 Dec 2025 19:11:28 +0000)
- Data Connection Service recovered (Wed, 19 Nov 2025 12:09:25 +0000)
- MMM recovered (Wed, 19 Nov 2025 12:09:15 +0000)
- Ad Schedule recovered (Wed, 19 Nov 2025 12:09:12 +0000)
- Data Connection Service went down (Wed, 19 Nov 2025 12:08:56 +0000)
- Ad Schedule went down (Wed, 19 Nov 2025 12:08:45 +0000)
- MMM went down (Wed, 19 Nov 2025 12:08:44 +0000)
- Mercury recovered (Tue, 18 Nov 2025 10:51:37 +0000)
- Mercury went down (Tue, 18 Nov 2025 10:51:07 +0000)
- Data Connection Service recovered (Wed, 12 Nov 2025 09:42:35 +0000)
- Ad Schedule recovered (Wed, 12 Nov 2025 09:42:26 +0000)
- MMM recovered (Wed, 12 Nov 2025 09:42:24 +0000)
- Mercury recovered (Wed, 12 Nov 2025 09:42:16 +0000)
- Data Connection Service went down (Wed, 12 Nov 2025 09:41:21 +0000)
- Ad Schedule went down (Wed, 12 Nov 2025 09:41:17 +0000)
- MMM went down (Wed, 12 Nov 2025 09:41:15 +0000)
- Mercury went down (Wed, 12 Nov 2025 09:41:06 +0000)
- Ad Schedule recovered (Mon, 27 Oct 2025 09:52:34 +0000)
- Data Connection Service recovered (Mon, 27 Oct 2025 09:52:33 +0000)
- MMM recovered (Mon, 27 Oct 2025 09:52:32 +0000)
- Data Connection Service went down (Mon, 27 Oct 2025 09:52:27 +0000)
- Ad Schedule went down (Mon, 27 Oct 2025 09:52:11 +0000)
- MMM went down (Mon, 27 Oct 2025 09:52:11 +0000)
- MMM recovered (Mon, 22 Sep 2025 11:06:56 +0000)
- Ad Schedule recovered (Mon, 22 Sep 2025 11:06:56 +0000)
- Ad Schedule went down (Mon, 22 Sep 2025 11:06:47 +0000)
- MMM went down (Mon, 22 Sep 2025 11:06:47 +0000)
- MMM recovered (Fri, 19 Sep 2025 08:28:28 +0000)
- MMM went down (Fri, 19 Sep 2025 08:28:00 +0000)
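For anyone who wants to consume these updates programmatically rather than read the page, a minimal sketch using the third-party feedparser library is shown below. The exact feed URL is an assumption; replace it with whatever RSS/Atom endpoint https://status.onmercury.io/ actually exposes.

```python
# Minimal sketch: read the status feed and print recent events.
# The feed path below is a hypothetical placeholder, not a confirmed endpoint.
import feedparser  # pip install feedparser

FEED_URL = "https://status.onmercury.io/feed.rss"  # assumed path

def recent_events(limit: int = 10) -> None:
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries[:limit]:
        # Each item carries a title ("Ad Schedule recovered"), a link,
        # and a publication timestamp, as in the listing above.
        print(f"{entry.get('published', 'unknown time')}  {entry.title}")
        print(f"  {entry.link}")

if __name__ == "__main__":
    recent_events()
```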
All services unavailable
https://status.onmercury.io/incident/535485
Wed, 26 Mar 2025 08:51:00 -0000

From 9:50 AM CET to 11:27 AM CET our Mercury service experienced a major incident due to failing health checks on Hetzner Cloud load balancers. The incident response team began mitigation at 10:02 AM CET.

Timeline of events:
- 9:50 AM: First alert received through our incident management platform.
- 10:02 AM: Failing load balancer identified; mitigation efforts began.
- 10:14 AM: A new load balancer was created to reroute traffic, while work continued to restore the original load balancer.
- 10:16 AM: The failed load balancer reported passing health checks; connectivity was temporarily restored.
- 10:20 AM: Load balancer health checks began failing again.
- 10:53 AM: The failing load balancer was moved to the hel1 datacenter.
- 11:27 AM: Traffic was rerouted and the health check configuration was updated; connectivity to load balancer targets was restored.

Future mitigation steps:
- Establishing a fallback address for our services to reduce response time if traffic rerouting becomes necessary.
- We are currently testing an alternative load balancing solution outside Hetzner Cloud and will send an update through our customer success team once we have finished testing the redundancy improvement.
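The fallback-address step above can be pictured with a small client-side probe: check the primary endpoint the way a health check would, and switch to a pre-provisioned fallback only after repeated failures. The following is a minimal sketch under hypothetical hostnames and paths, not Mercury's actual implementation.

```python
# Minimal sketch of a fallback-address probe. Both hostnames and the probe
# path are hypothetical placeholders.
import time
import urllib.request

PRIMARY = "https://app.primary.invalid/healthz"    # placeholder
FALLBACK = "https://app.fallback.invalid/healthz"  # placeholder

def healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers URLError, HTTPError (4xx/5xx), connection resets and timeouts.
        return False

def pick_endpoint(retries: int = 3, pause: float = 1.0) -> str:
    """Prefer the primary; fall back only after several consecutive failures."""
    for _ in range(retries):
        if healthy(PRIMARY):
            return PRIMARY
        time.sleep(pause)
    return FALLBACK if healthy(FALLBACK) else PRIMARY

if __name__ == "__main__":
    print("active endpoint:", pick_endpoint())
```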
Mercury and other applications down for 1h37m due to cloud provider incident
https://status.onmercury.io/incident/530859
Wed, 19 Mar 2025 15:16:00 -0000

The application became unavailable at 1:59 PM after the LoadBalancer failed due to an incident at our cloud provider, Hetzner, which led to config changes negatively affecting LoadBalancer performance. From our investigation, we deduced that an autoscaling event in the cluster at 1:59 PM propagated a config change to the LoadBalancer, which then led to the failure. See https://status.hetzner.com/incident/b8246859-8fdd-4c44-b35c-d3c9a6cad3a7

We then redeployed the LoadBalancer and deactivated auto-scaling in the cluster to prevent further problems associated with the Hetzner incident. We will keep auto-scaling deactivated until we have ensured in a test setup that the problem mitigation at Hetzner is reliable.

Mercury database failover incident
https://status.onmercury.io/incident/483801
Wed, 18 Dec 2024 10:33:00 -0000

We want to inform you about two brief service outages affecting Mercury today.

## Incident Details

### Duration
- First outage: 10:21 AM to 10:34 AM CET (12 minutes)
- Second outage: 10:46 AM to 10:54 AM CET (7 minutes)

### Impact
During these periods, the service returned a "503 Service Temporarily Unavailable" error.

## What Happened
Our preliminary investigation indicates that the outages were caused by a sudden loss of connectivity within our database cluster. The automatic failover mechanism did not select a replica to serve the application, resulting in temporary service unavailability.

## Current Status
The service is currently operational, and we are closely monitoring it.
Our team is actively investigating the root cause to ensure the reliability of our failover mechanisms and prevent similar incidents in the future. We sincerely apologize for any inconvenience this may have caused.

User unable to log in
https://status.onmercury.io/incident/467287
Sun, 17 Nov 2024 10:05:00 -0000

Missing update

User unable to log in
https://status.onmercury.io/incident/467287
Sun, 17 Nov 2024 09:31:00 -0000

On Sunday, Nov 17, 2024, we experienced login problems with our Mercury application. Because this happened outside office hours and was missed by our monitoring, it was only fixed at around 8:15 AM on Nov 18, 2024, after users had started reporting that they were unable to log in. We are currently working on monitoring pages that are behind the login to avoid such problems in the future.
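The follow-up above mentions monitoring pages behind the login. A synthetic login check is one common way to do that; the sketch below assumes a hypothetical login endpoint, form fields, and environment variables for test credentials, and is not the monitoring the Mercury team deployed.

```python
# Minimal sketch of a synthetic login check: log in with a dedicated test
# account and alert if the flow breaks. URL, form fields, and environment
# variable names are hypothetical placeholders.
import os
import sys
import requests  # pip install requests

LOGIN_URL = "https://app.example.invalid/login"  # placeholder endpoint

def login_works() -> bool:
    session = requests.Session()
    resp = session.post(
        LOGIN_URL,
        data={
            "username": os.environ["SYNTHETIC_USER"],      # test account, placeholder
            "password": os.environ["SYNTHETIC_PASSWORD"],
        },
        timeout=10,
        allow_redirects=True,
    )
    # Heuristic: an authenticated page usually offers a logout link.
    return resp.ok and "logout" in resp.text.lower()

if __name__ == "__main__":
    if not login_works():
        # In practice this would page the on-call engineer instead of exiting.
        sys.exit("synthetic login check failed")
```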
Temporary load balancer outage
https://status.onmercury.io/incident/424502
Mon, 02 Sep 2024 08:05:00 -0000

We experienced a brief outage in our production environment today due to a load balancer failover. We are actively working to improve the redundancy and stability of our load balancers.

Temporary load balancer outage
https://status.onmercury.io/incident/424499
Wed, 28 Aug 2024 08:04:00 -0000

We experienced a brief outage in our production environment today due to a load balancer failover. We are actively working to improve the redundancy and stability of our load balancers.

Temporary outage
https://status.onmercury.io/incident/410880
Thu, 08 Aug 2024 14:21:00 -0000

We experienced a brief outage in our production environment today due to maintenance work by our infrastructure provider on their load balancers. We are actively working to improve our system resiliency to mitigate potential outages in the future.

Scheduled maintenance to replace LoadBalancer
https://status.onmercury.io/incident/397403
Fri, 12 Jul 2024 20:58:00 -0000

In order to mitigate problems with intermittent downtimes in the range of a few minutes per week, we are upscaling a LoadBalancer. Maintenance started at 10:10 PM and finished at 10:41 PM. All systems are back up and running.
All services unreachable
https://status.onmercury.io/incident/390352
Thu, 27 Jun 2024 12:38:00 -0000

### Thursday, Jun 27, 2024 - Incident post mortem
By the Mercury Infrastructure Team

Today we experienced an outage in our Kubernetes infrastructure. The following is a preliminary incident report detailing the nature of the outage and our response. We understand this issue severely impacted our users and apologize sincerely to everyone who was affected.

### Issue Summary
From 1:32 PM to 2:38 PM CEST, requests to all of our applications resulted in PR_END_OF_FILE_ERROR messages (on Firefox) and no response (on Chrome).

### Timeline (all times CEST)
- 1:32 PM: Our monitoring tools alert us to the error
- until 1:32 PM: Our infrastructure team starts investigating the error
- until 1:33 PM: First user reports
- until 1:35 PM: We start investigating possible culprits, such as certificate validity and other SSL-related issues
- until 1:52 PM: We identify a LoadBalancer as a possible root cause of the problem
- until 2:06 PM: We start replacing one of our control plane nodes based on log anomalies
- until 2:12 PM: We manage to briefly recover from the error by rescaling the LoadBalancer
- until 2:18 PM: The LoadBalancer loses all connections to relevant services again
- until 2:17 PM: We run our cluster deployment pipeline to replace the faulty node
- until 2:24 PM: While monitoring the deployment pipeline, we notice something is off with another node
- until 2:28 PM: The replacement node comes online
- until 2:34 PM: We ensure that no negative side effects will occur upon replacing another node
- until 2:38 PM: We remove the second node from the control plane
- 2:38 PM: All services are responsive again

### Root Cause
A faulty control plane node. How one out of five control plane nodes could cause this downtime needs further investigation.

### Resolution and recovery
Replacing control plane nodes in Kubernetes.

### Corrective and Preventative Measures
TBA

Sincerely,
The Mercury Infrastructure Team

Posted by Toby Irmer, MD

All services unreachable
https://status.onmercury.io/incident/390352
Thu, 27 Jun 2024 11:32:00 -0000

All services are unreachable.
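The post mortem above attributes the outage to a faulty control plane node. As a purely illustrative sketch (not the team's actual tooling), node conditions can be inspected with the Kubernetes Python client to flag control plane nodes reporting an unhealthy state, assuming kubeconfig-based access to the cluster.

```python
# Minimal sketch: flag control plane nodes with unhealthy conditions, in the
# spirit of the root-cause hunt described above. Illustrative only.
from kubernetes import client, config  # pip install kubernetes

def unhealthy_control_plane_nodes():
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    flagged = []
    for node in v1.list_node().items:
        labels = node.metadata.labels or {}
        is_control_plane = (
            "node-role.kubernetes.io/control-plane" in labels
            or "node-role.kubernetes.io/master" in labels
        )
        if not is_control_plane:
            continue
        for cond in node.status.conditions or []:
            # A healthy node reports Ready=True and all pressure conditions False.
            bad_ready = cond.type == "Ready" and cond.status != "True"
            bad_pressure = cond.type != "Ready" and cond.status == "True"
            if bad_ready or bad_pressure:
                flagged.append((node.metadata.name, cond.type, cond.status, cond.reason))
    return flagged

if __name__ == "__main__":
    for name, ctype, status, reason in unhealthy_control_plane_nodes():
        print(f"{name}: {ctype}={status} ({reason})")
```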