Square status - United States

Degraded Performance: Square Services
Incident Report for Square US
Postmortem

Incident Summary: 2023-09-07

A timeline of the events of the outage and steps for remediation

Summary

Last week, Square experienced a multi-hour outage across our services. We understand that you rely on our systems to power your business and that’s a responsibility we take seriously. We apologize for letting you down and for the length of time it took for us to get our systems back up and running.

Beginning at 1:54 PM ET on September 7, 2023, Square products and services were unavailable. At 2:05 AM ET on September 8, systems began to recover, and merchants were able to access restored payment services by 5:19 AM ET. For sellers on a supported configuration that used offline mode, Square completed processing offline payments by 1:57 PM ET on September 8 or, if the device came online later, shortly after it came online. Square Online websites remained available; however, Square Online customers were unable to process payments during the outage.

As we previously shared, this outage was caused by a failure in a key part of our infrastructure: our DNS servers. Now that we’ve completed a root cause analysis, we want to share an overview of the incident and steps for remediation.

Service Impact

We’re going to start with an overview of how Square’s systems work together. Square operates in multiple data center regions. Square services use DNS and mesh-based routing infrastructure to find service dependencies and serve requests. Without DNS, Square products, internal tools, and services cannot communicate, which results in service disruption. In this incident, an unrelated change to our host-based firewalls combined with a DNS service upgrade put unexpected load on our internal DNS servers and caused them to fail. Once node-based DNS caches expired, services could not communicate with their dependencies, which caused external requests to fail.
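
As an illustration only, the delayed failure described above (services keep working until node-based DNS caches expire) can be sketched with a toy cache. This is not Square's implementation: the `NodeDnsCache` class, the 60-second TTL, the service name, and the address are all hypothetical.

```python
from dataclasses import dataclass


class UpstreamUnavailable(Exception):
    """Raised when the (hypothetical) upstream DNS server cannot answer."""


@dataclass
class CacheEntry:
    address: str
    expires_at: float  # simulated clock time after which the entry is stale


class NodeDnsCache:
    """Toy node-local DNS cache with a fixed TTL (assumption, not Square's code)."""

    def __init__(self, upstream, ttl_seconds):
        self.upstream = upstream        # callable: name -> address, may raise
        self.ttl_seconds = ttl_seconds
        self.entries = {}

    def resolve(self, name, now):
        entry = self.entries.get(name)
        if entry is not None and now < entry.expires_at:
            return entry.address        # served locally, upstream not contacted
        # Cache miss or expired entry: only the upstream server can answer.
        address = self.upstream(name)
        self.entries[name] = CacheEntry(address, now + self.ttl_seconds)
        return address


def simulate():
    upstream_healthy = True

    def upstream(name):
        if not upstream_healthy:
            raise UpstreamUnavailable(name)
        return "10.0.0.7"               # made-up address

    cache = NodeDnsCache(upstream, ttl_seconds=60)
    results = [cache.resolve("payments.internal", now=0)]   # fills the cache

    upstream_healthy = False            # upstream DNS fails at this point
    results.append(cache.resolve("payments.internal", now=30))  # still cached
    try:
        results.append(cache.resolve("payments.internal", now=61))
    except UpstreamUnavailable:
        results.append("FAILED")        # TTL expired and upstream is down
    return results


if __name__ == "__main__":
    print(simulate())
```

In this sketch the lookup at `now=30` still succeeds even though the upstream server is already down, and the failure only becomes visible once the cached entry expires.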

Square’s host-based firewall policy is managed by a central service that pushes firewall policies to nodes in Square data centers, which then expand the policy into firewall rules. This service uses an accelerated rollout strategy to quickly adapt to changing environment state. In this case, however, a small policy change expanded to a much larger ruleset. This large ruleset caused node instability and, when combined with the traffic pattern of DNS, caused DNS to start failing requests.
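
To make the expansion effect concrete, here is a minimal sketch of how a compact policy might turn into many on-node rules. It is a toy model under stated assumptions, not Square's policy format: the `expand_policy` function, the group names, the host counts, and the ports are all invented for illustration.

```python
def expand_policy(policy, group_members):
    """Expand a compact policy into one rule per (peer host, port) pair.

    Hypothetical model: each node allows traffic from every host in every
    allowed group, on every listed port.
    """
    rules = []
    for group in policy["allow_from_groups"]:
        for host in group_members[group]:
            for port in policy["ports"]:
                rules.append(f"allow from {host} to port {port}")
    return rules


# Made-up group sizes; real inventories would differ.
group_members = {
    "region-a": [f"a-host-{i}" for i in range(500)],
    "region-b": [f"b-host-{i}" for i in range(2000)],
}

# The policy change is a single added group name...
policy_before = {"allow_from_groups": ["region-a"], "ports": [53, 443, 8443]}
policy_after = {"allow_from_groups": ["region-a", "region-b"], "ports": [53, 443, 8443]}

# ...but the expanded ruleset on each node grows with the group's host count.
rules_before = expand_policy(policy_before, group_members)
rules_after = expand_policy(policy_after, group_members)

if __name__ == "__main__":
    print(len(rules_before), "rules before the change")
    print(len(rules_after), "rules after the change")
```

With these invented numbers, a one-token policy edit takes the per-node ruleset from 1,500 to 7,500 rules, which is why the size of the policy diff is a poor predictor of the size of the change on each node.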

Square uses a microservices environment both for services that handle external requests and for many of the internal systems we use to manage our services. In this case, many services used for troubleshooting and recovery were also impacted, which resulted in an extended outage.

Based on a forensic analysis of the incident, we’ve ruled out a cyberattack as the cause of this incident, and there’s no evidence of a data breach or loss.

Timeline

September 7, 2023

  • 11:04 AM ET - Host-based firewall rule change deployed to enable region communication, increasing on-node firewall rule size.
  • 1:56 PM ET - DNS zone change.
  • 2:02 PM ET - Engineers are notified of infrastructure issues and incident response begins, starting with a DNS investigation.
  • 2:47 PM ET - issquareup.com incident created.
  • 2:52 PM ET - Work begins to recover internal access and tooling.
  • 3:56 PM ET - Shed networking traffic to our DNS servers. Started manual work to bring up new DNS servers.
  • 6:00 PM ET - DNS service capacity is increased, but this does not help. Started manual deployment of networking changes to re-enable our authorization and access services.
  • 6:29 PM ET - Internal access services recover. This allows engineers to start working in parallel to recover the authorization and control plane services.
  • 7:00 PM ET - Started manual deployment of networking changes to all data centers.
  • 8:36 PM ET - Square deployment pipeline recovers.
  • 10:06 PM ET - Rebuild of our DNS servers.
  • 11:52 PM ET - New configuration based on reverted ruleset is built and configuration begins to be pushed to DNS hosts.

September 8, 2023

  • 12:06 AM ET - Some DNS hosts are healthy and more internal tooling recovers.
  • 12:55 AM ET - All DNS servers are healthy.
  • 1:30 AM ET - Partial recovery of internal service to service connectivity. Partial recovery of our edge routing infrastructure.
  • 2:05 AM ET - Some Square systems begin recovery.
  • 2:40 AM ET - Payment traffic has fully recovered.
  • 3:12 AM ET - Edge routing infrastructure fully recovered.
  • 4:18 AM ET - Majority of Square products and services are recovered. issquareup.com incident is updated to note that we’ve implemented a series of fixes.
  • 5:19 AM ET - issquareup.com incident is resolved.
  • 6:59 AM ET - Additional DNS capacity is added.
  • 9:52 AM ET - Background processing of offline payments begins.
  • 1:57 PM ET - Uploaded offline payments have been fully processed.
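
As a worked check of the durations implied by the times above, the following sketch computes the elapsed time from the reported start of the outage (1:54 PM ET, taken from the Summary) to several milestones. The `elapsed` helper and the milestone labels are illustrative; all times are treated as ET on a 24-hour clock, with no daylight saving change inside the window.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # 24-hour clock, all times in ET


def elapsed(start, end):
    """Return the time between two timestamps as 'Hh MMm'."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


OUTAGE_START = "2023-09-07 13:54"  # 1:54 PM ET, from the Summary

# Milestone times copied from the timeline; labels are paraphrased.
MILESTONES = {
    "systems begin recovery": "2023-09-08 02:05",
    "payment traffic fully recovered": "2023-09-08 02:40",
    "incident resolved": "2023-09-08 05:19",
    "offline payments fully processed": "2023-09-08 13:57",
}

if __name__ == "__main__":
    for label, timestamp in MILESTONES.items():
        print(f"{label}: {elapsed(OUTAGE_START, timestamp)} after outage start")
```

Under these assumptions, systems began recovering 12h 11m after the outage started, and the incident was marked resolved after 15h 25m.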

Service Improvements

The incident has highlighted a number of opportunities to improve our infrastructure, and we're working on making these changes, which are designed to prevent future incidents:

  • Transitioning our DNS infrastructure to isolated infrastructure.
  • Additional monitoring and optimizations for critical networking infrastructure.
  • Optimizing dependencies between our deploy and platform infrastructure where feasible.

Many sellers utilized Offline Mode in order to continue accepting payments. As a precautionary measure, we deferred processing offline payments for a number of hours. We are expanding support for and improving our communication regarding the availability of Offline Mode.

In Closing

We apologize for the disruption our outage might have created for you, your customers, and your employees. We know this situation was made more difficult by our communication frequency and the delayed support response some of you experienced. We will learn from this event and improve our systems and processes.

We appreciate your business and we are committed to doing better to regain your trust.

Posted Sep 15, 2023 - 17:03 PDT

Resolved
We can now confirm that the disruption impacting Square services has been resolved.
Please be aware that sellers may encounter delays in updates to certain products and services:

- Offline Mode Payments: Payments are being uploaded, but there will be a slight delay before they appear as completed.
Any new Offline Mode Payments will be completed as normal in the coming hours.

- Square Reporting Tools: There is a possibility of delays in updating new billing and transaction information across all Square reporting tools, including those in all Square Point of Sale apps and the Dashboard.

- Transfers: We anticipate slight delays for some transfers on Friday, September 8.
However, we want to reassure all our US sellers that you will receive your transfers by 12:00 PM PDT on Friday, September 8.

We understand how important it is to have your business tools fully operational, and for this reason, our engineering team is currently engaged in discussions to prevent similar disruptions from happening in the future.

We sincerely thank you for your patience as our team worked to resolve this issue, and we apologize for any inconvenience this disruption may have caused to your business.

Once this disruption has been fully investigated, we plan to publish a full review of this issue and determine what steps we can take to prevent it from happening again.
Posted Sep 08, 2023 - 06:42 PDT
Update
Your continued patience and support mean a lot to us as our engineers oversee the implemented solution. Services are steadily regaining their functionality, and we will share any additional updates on this platform as soon as they become available.
Posted Sep 08, 2023 - 05:21 PDT
Update
We are actively observing the recovery of all Square systems and will continue to post live updates here. Thanks again for your patience.
For instant answers to common questions, visit our Support Center at squareup.com/help or our Seller Community at sellercommunity.com.
Posted Sep 08, 2023 - 04:19 PDT
Update
We are continuing to monitor the situation and the fix implemented by our engineering teams.
Thank you for bearing with us.

As a reminder, as a result of this disruption affecting multiple Square services, we anticipate slight delays for some transfers on Friday, September 8.
However, we want to reassure all our US sellers that you will receive your transfers by 12:00 PM PDT on Friday, September 8.
Posted Sep 08, 2023 - 03:44 PDT
Update
We appreciate your ongoing patience and support as our engineers continue to monitor the solution implemented. We are continuing to see services regain functionality and we'll post any further updates here as we have them.
Posted Sep 08, 2023 - 03:21 PDT
Monitoring
Our engineering team is continuing to monitor the results of the fix implemented and Square services are continuing to recover.
As a reminder, for instant answers to common questions, visit our Support Center at squareup.com/help or our Seller Community at sellercommunity.com. Thank you.
Posted Sep 08, 2023 - 03:21 PDT
This incident affected: Payment Acceptance, Transfers, Appointments, Team Management, Phone Support, Point of Sale, Square for Restaurants, Square for Retail, Virtual Terminal, Invoices, Dashboard, Capital, Square Card, Online Store, Loyalty, Marketing, Payroll, Developer Sandbox, Developer Payments, Seller Community, and Square Hardware.