Square status - United States

Multiple Service Issue
Incident Report for Square US
Postmortem

Incident Summary

Starting on Sep 18, 2021 at 16:05 UTC, Square’s servers experienced a service disruption that impacted Square products. For approximately 3 hours, the issue prevented some sellers from signing in to their accounts or accepting tips.

In this postmortem, we’ll explain the root cause of this disruption, document the steps we took to diagnose and resolve it, and share our analysis and the actions we are taking to better protect our customers from service interruptions like this in the future.

Timeline (UTC)

16:05 Beginning of impact: APIs for fetching account & device settings suffered increasing latency and declining request success rate

16:05 Automated detectors paged oncall engineers

16:06 Oncall engineer began the investigation

16:16 Oncall engineer expanded the investigation across multiple teams

16:29 Incident summary and investigation document created

16:41 Engineers identified the worst-performing database queries

16:56 Customer Success set issquareup.com to “Investigating” status

17:15 Engineers rate limited incoming traffic to reduce load on the database

17:30 Engineers killed the longest-running queries to reduce load on the database

17:36 Engineers shed all incoming traffic to allow the database to recover

17:38 Engineers re-enabled incoming traffic

17:40 Database performance degraded further

17:45 Engineers switched the database to other hardware

18:13 Engineers set a lower rate limit on incoming traffic

18:24 Engineers identified the subset of queries causing the performance degradation

18:30 Engineers proposed a database operation to optimize those queries

18:43 Engineers backed up the database to avoid data loss

18:46 Engineers performed the database operation

19:00 End of impact: Request success rates returned to their pre-incident baseline

19:21 Customer Success set issquareup.com to “Monitoring” status

21:07 Customer Success set issquareup.com to “Resolved” status
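
For readers unfamiliar with the rate limiting and traffic shedding steps in the timeline (17:15, 17:36, 18:13), the general technique can be sketched with a token bucket. This is an illustration only, not Square's actual implementation; the class name `TokenBucket`, its parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Admit about `rate` requests per second, with bursts up to `capacity`.

    Hypothetical sketch: names and parameters are illustrative only.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def _refill(self):
        # Add tokens in proportion to elapsed time, never above capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def allow(self):
        """Return True if the request may proceed, False if it should be shed."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def set_rate(self, rate):
        """Change the limit at runtime; a rate of 0 sheds all traffic once the bucket is empty."""
        self._refill()  # settle elapsed time at the old rate first
        self.rate = rate
```

Lowering the rate reduces the load that reaches the database, and setting it to zero corresponds to shedding all incoming traffic. A caller would typically answer a shed request with an error that clients can retry later.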

Analysis

This incident revealed areas for improvement in both our technical infrastructure and our engineering processes, several of which have already been implemented.

A new usage pattern for Square APIs generated an unexpected volume of data in the database that stores mobile application settings. Queries that were not optimized for this scale slowed down, leading to timeouts when retrieving settings from this database. The APIs that our mobile Point of Sale applications depend on to retrieve account information from this database began returning responses that excluded the unavailable information.

As a result, the iOS mobile applications unintentionally reverted to incorrect values for some application settings; for example, tipping was disabled. Disabling tipping negatively impacted both sellers and their employees, and sellers were unable to re-enable tipping because periodic syncs kept reapplying the incorrect setting.
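
The failure mode described above, where a missing value is treated as a default, can be contrasted with a sync that keeps the last known value. The following is a minimal sketch under stated assumptions, not the applications' actual code: the function `merge_settings`, the setting names, and the convention that `None` or an absent key means "the server could not supply this setting" are all hypothetical.

```python
def merge_settings(cached, fetched):
    """Return the settings to apply after a sync.

    cached:  last known good settings on the device (name -> value).
    fetched: settings from the latest sync; a value of None, or an absent
             key, is assumed to mean the server could not supply it.
    """
    merged = dict(cached)  # start from what the device already trusts
    for name, value in fetched.items():
        if value is not None:  # False is a real value; only None is "missing"
            merged[name] = value
    return merged


# Hypothetical example: the tipping value is unavailable during a disruption.
cached = {"tipping_enabled": True, "currency": "USD"}
fetched = {"tipping_enabled": None, "currency": "USD", "receipt_footer": "Thanks"}
applied = merge_settings(cached, fetched)
```

In this sketch `applied["tipping_enabled"]` stays `True`, because an unavailable value never overwrites a cached one, while a seller who explicitly turns tipping off (`False`) is still respected.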

Sellers who use device codes to log in to mobile applications encountered a different issue: our system treated any request missing those settings as entirely invalid. Because the login flow requires successful retrieval of account information, these sellers could not log in and were blocked from taking payments.

To prevent this issue from recurring, our engineering teams will audit each setting to ensure that, during a service disruption, a local cache of settings remains in effect so that settings stay unchanged. We’ll also explore other ways to ensure that tipping stability is highly prioritized. Finally, we’ve refined our metric detectors to forecast alarming trends before performance degradation occurs.
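
One simple way a detector can forecast a trend, shown here purely as an illustration and not as a description of Square's detectors, is to fit a straight line to recent samples of a metric and estimate when it would cross an alert threshold. The function name `seconds_until_threshold`, the sample format, and the choice of a least-squares line are assumptions made for this sketch.

```python
def seconds_until_threshold(samples, threshold):
    """Estimate how long until a rising metric reaches `threshold`.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    Returns 0.0 if the latest value is already at or above the threshold,
    None if there is too little data or the trend is flat or falling,
    otherwise the estimated seconds after the last sample.
    """
    n = len(samples)
    if n < 2:
        return None
    last_t, last_v = samples[-1]
    if last_v >= threshold:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp; no trend to fit
    # Ordinary least-squares slope and intercept of value against time.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing_t = (threshold - intercept) / slope
    return max(0.0, crossing_t - last_t)
```

For example, a metric that reads 100, 110, and 120 at one-minute intervals is rising by 10 per minute, so with a threshold of 150 the estimate is about 180 seconds after the last sample. A detector could alert when that estimate drops below a chosen lead time.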

Posted Sep 23, 2021 - 15:32 PDT

Resolved
Good news! We’ve confirmed with our Engineering Teams that the earlier issues have been resolved, and all previously impacted services have returned to normal. We’re sorry for any inconvenience this may have caused, and thank you for your patience as we worked through the issue.
Posted Sep 18, 2021 - 14:07 PDT
Monitoring
We have released a fix for the issues experienced earlier, and sellers should see services returning to full functionality. We’ll continue to watch this closely, and will update you to confirm that this is completely resolved. Thank you for being patient with us!
Posted Sep 18, 2021 - 12:21 PDT
Update
Our Engineering Teams are actively working on resolving the current issues. We still recommend you remain logged in until we have more information to provide. We’ll report back once we’ve implemented a fix for this issue. We appreciate your continued patience.
Posted Sep 18, 2021 - 12:05 PDT
Identified
Our Engineering Teams have identified the root cause of the issue, and are actively working on a solution. We still recommend you remain logged in until we have more information to provide. We’ll report back once we’ve implemented a fix for this issue. Thank you for your patience.
Posted Sep 18, 2021 - 11:40 PDT
Investigating
We are still investigating issues with multiple services and systems being down. Our Engineering Teams are aware and we will update issquareup.com as more info becomes available. If you are currently experiencing this issue, please remain logged in until we have more information to provide. Thank you for your patience.
Posted Sep 18, 2021 - 09:56 PDT
This incident affected: Point of Sale, Square for Restaurants, and Square for Retail.