Starting at 17:19 UTC on June 1, 2022, Square’s Merchant Data service experienced a disruption that impacted several Square products and prevented some sellers from taking payments.
In this post-mortem recap, we’ll explain the root cause of the disruption, document the steps we took to diagnose and resolve it, and share our analysis and the actions we are taking to protect our customers from service interruptions like this in the future.
Prior to the incident, several Square engineering teams were manually running routine exercises that resulted in network traffic routing predominantly to a single datacenter. These exercises are performed at both peak and off-peak times as a best practice to ensure ongoing network stability.
[17:19] (start of impact): An automated routine maintenance activity redirected database traffic for a small number of services to new server hosts, which increased latency and temporarily interrupted database traffic.
[17:20] Sellers began experiencing issues in our POS applications, limiting their ability to take payments.
[17:26] (start of investigation): Engineers were alerted of degraded performance across several Square services, including Login and Payments. They began investigating.
[17:40] The team continued to investigate, starting by ruling out possible triggers such as bug fixes, feature releases, and shifts in traffic patterns.
[17:52] Engineers attempted to shed load on affected services, which did not immediately improve the degraded performance.
[18:02] Engineering correlated the automated maintenance activity with the time our services became degraded.
[18:10] The decision was made to revert the automated maintenance activity for the impacted services, routing traffic back to our original server hosts.
[18:13] The engineering teams successfully rolled back the automated maintenance activity.
[18:15-18:25] Engineers observed that the majority of Square’s services, including payments and logins, recovered.
[19:04] (end of impact): All services had recovered to healthy levels; engineers continued to monitor metrics.
This incident was caused by the combination of routine automated maintenance and manual exercises. Together, these activities shifted network traffic in a way that inadvertently caused a significant spike in query latency on a critical database. This load slowed response times, quickly exhausted application database connections, and caused a significant number of requests to time out. The spike unexpectedly triggered a response mechanism that killed long-running queries, which put additional load on the database. Load was then further compounded as upstream services began retrying. The situation was stabilized by a combination of shedding load and shifting traffic back to the original database host.
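Retry amplification of this kind is commonly mitigated by giving clients a bounded retry budget with capped, jittered backoff, so that failing callers back off and spread out rather than hammering a recovering database in lockstep. The following is an illustrative Python sketch of that general pattern, not Square’s actual implementation; the function names and parameters are hypothetical.

```python
import random
import time


def backoff_delays(max_retries=3, base=0.1, cap=2.0):
    """Yield capped exponential backoff delays with full jitter.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so clients that failed at the same moment spread their retries out
    instead of stampeding the database again simultaneously.
    """
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retry(operation, max_retries=3, base=0.1, cap=2.0):
    """Run `operation`, retrying failures with jittered backoff.

    The small, fixed retry budget bounds how much extra load any one
    client can add during an outage; unbounded retries would compound
    the load on an already-degraded database instead.
    """
    delays = list(backoff_delays(max_retries, base, cap))
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # budget exhausted: fail fast rather than pile on
            time.sleep(delays[attempt])
```

A caller that times out against the database would wrap its query in `call_with_retry`; once the budget is exhausted, the error propagates so the request fails fast instead of adding further load.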
Immediately after the incident was resolved, Square engineers identified several contributing factors and addressed them. Engineering is continuing to analyze the incident from multiple angles, including how a failure in one system was able to cascade into a widespread outage. As part of that investigation, additional actions to reduce future risk will be identified and assigned to the appropriate teams.
Lastly, several process improvements have been prioritized, including tasks to increase transparency for our sellers during an incident. We are streamlining our internal communications process to accelerate our ability to share information with our sellers and are investigating a framework for proactively notifying sellers.