Square status - United States

Degraded Performance: Multiple Services
Incident Report for Square US
Postmortem

Incident Summary

Starting at 17:19 UTC on June 1, 2022, Square’s Merchant Data service experienced a disruption. The disruption affected several Square products and prevented some sellers from taking payments.

In this postmortem recap, we’ll communicate the root cause of this disruption, document the steps that we took to diagnose and resolve it, and share our analysis and the actions we are taking to better protect our customers from service interruptions like this in the future.

Timeline (UTC)

Prior to the incident, several Square engineering teams were manually running routine exercises, which resulted in network traffic routing predominantly to a single datacenter. These exercises are done at both peak and off-peak times as a best practice to ensure ongoing network stability.

[17:19] (start of impact): An automated routine maintenance activity redirected database traffic for a small number of services to new server hosts, which increased latency and temporarily interrupted database traffic.

[17:20] Sellers began experiencing issues in our POS applications, limiting the ability to take payments. 

[17:26] (start of investigation): Engineers were alerted to degraded performance across several Square services, including Login and Payments. They began investigating.

[17:40] The team continued to investigate, starting by ruling out possible triggers such as bug fixes, feature releases, and shifts in traffic patterns.

[17:52] Engineers attempted to shed load on affected services, which did not immediately improve the degraded performance.

[18:02] Engineering made the correlation between the automated maintenance activity and when our services became degraded.

[18:10] The decision was made to revert the automated maintenance activity for the impacted services, which routed traffic back to our original server hosts.

[18:13] The engineering teams successfully rolled back the automated maintenance activity.

[18:15-18:25] Engineers observed that the majority of Square’s services, including payments and logins, had recovered.

[19:04] (end of impact): All services had recovered to healthy levels; engineers continued to monitor metrics.

Analysis

This incident was caused by the combination of routine automated maintenance and manual exercises. The combined activity created a shift in network traffic, which inadvertently caused a significant spike in query latency on a critical database. This spike caused slower response times, quickly exhausted application database connections, and resulted in a significant number of requests timing out. The spike also unexpectedly triggered a response mechanism to kill long-running queries, which put additional load on the database. Load was then further compounded as upstream services began retrying. The situation was stabilized by a combination of shedding load and shifting traffic back to the original database host.

Immediately after the incident was resolved, Square engineers identified several contributing factors and addressed them. Engineering is continuing to analyze the incident from multiple angles, including how a failure in one system was able to cascade into a widespread outage. As part of that investigation, additional actions to reduce future risk will be identified and assigned to the appropriate teams.
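
As a general illustration of why retries can compound load on a struggling database, and not a description of Square's actual systems, here is a minimal Python sketch of a client that caps its retry attempts and waits a randomized, exponentially growing delay between them. The function name call_with_retries and its parameters (max_attempts, base, cap, sleep) are hypothetical and chosen only for this example.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation(), retrying on TimeoutError with capped, jittered backoff.

    Hypothetical helper for illustration only. The delay before retry k
    (k = 0, 1, ...) is drawn uniformly from [0, min(cap, base * 2**k)] seconds,
    so that many clients failing together do not all retry at the same instant.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # attempts exhausted: give up instead of adding more load
            ceiling = min(cap, base * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise last_error
```

Without a cap on attempts and without a delay, every timed-out request would immediately turn into another request against the same overloaded database, which is the kind of compounding described above.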

Lastly, several process improvements have been prioritized, including tasks to increase transparency for our sellers during an incident. We are streamlining our internal communications process to accelerate our ability to share information with our sellers and are investigating a framework for proactively notifying sellers.

Posted Jun 08, 2022 - 10:24 PDT

Resolved
We’ve confirmed with our Engineering Teams that the earlier issues have been resolved, and all previously impacted services have returned to normal.

We understand how important it is for all of our services to be running for your business, and our Engineering team is actively discussing how to prevent future disruptions like this one. Thank you so much for your patience as we worked through this.
Posted Jun 01, 2022 - 12:58 PDT
Monitoring
Our engineering team has pushed out a fix for the impacted services, and they are currently in the process of stabilizing.

We will continue to watch this closely and will update you to confirm when completely resolved.
Posted Jun 01, 2022 - 12:38 PDT
Update
Our engineering team has pushed out a fix for the impacted services, and they are currently in the process of stabilizing.

We will continue to watch this closely and will update you to confirm when completely resolved.
Posted Jun 01, 2022 - 12:36 PDT
Update
We are continuing to work on a fix for a disruption currently impacting multiple Square services. We understand how important it is for all of our services to be running for your business, and we’ll be back with further updates as they come to hand.

Thank you for your continued patience with us as we work to resolve this issue.
Posted Jun 01, 2022 - 11:56 PDT
Identified
Our engineering team has located the source of the issue affecting the impacted services and is working to implement a fix.

We appreciate your continued patience during this time. Please check back here for updates.
Posted Jun 01, 2022 - 11:51 PDT
Update
Our teams are continuing to investigate issues impacting multiple Square-related services. As soon as we have more information to share regarding a fix, we will do so here.

We know how important full account accessibility is to your business - and we hope to be back with a fix soon.
Posted Jun 01, 2022 - 11:20 PDT
Update
We are receiving confirmed reports of issues that are impacting multiple Square services. We are working with our teams to confirm & determine the potential breadth of impact.

We’ll be back to update as soon as we can once we receive more information from our Engineering teams.
Posted Jun 01, 2022 - 11:13 PDT
Update
We are receiving confirmed reports of issues that are impacting multiple Square services. We are working with our teams to confirm & determine the potential breadth of impact.

We’ll be back to update as soon as we can once we receive more information from our Engineering teams.
Posted Jun 01, 2022 - 11:09 PDT
Update
We are receiving confirmed reports of issues that are impacting multiple Square services. We are working with our teams to confirm & determine the potential breadth of impact.

We’ll be back to update as soon as we can once we receive more information from our Engineering teams.
Posted Jun 01, 2022 - 11:08 PDT
Investigating
We are receiving confirmed reports of issues that are impacting multiple Square services. We are working with our teams to confirm & determine the potential breadth of impact.

We’ll be back to update as soon as we can once we receive more information from our Engineering teams.
Posted Jun 01, 2022 - 10:46 PDT
This incident affected: Payment Acceptance, Money & Transfers, Developer Services, eCommerce, Appointments, Customer Management, Team Management, Square for Restaurants, Dashboard, Capital, Instant Transfers, Online Store, Account Accessibility, Square Services, and Orders.