Starting on Sep 18, 2021 at 16:05 UTC, Square’s servers experienced a service disruption that impacted Square products. For approximately 3 hours, the issue prevented some sellers from signing in to their accounts or accepting tips.
In this post-mortem recap, we’ll communicate the root cause of this disruption, document the steps that we took to diagnose and resolve it, and share our analysis and the actions we are taking to better protect our customers from service interruptions like this in the future.
16:05 Beginning of impact: APIs for fetching account & device settings suffered increasing latency and declining request success rate
16:05 Automated detectors paged oncall engineers
16:06 Oncall engineer began the investigation
16:16 Oncall engineer expanded the investigation across multiple teams
16:29 Incident summary and investigation document was created
16:41 Engineers identified the worst-performing database queries
16:56 Customer Success set issquareup.com to “Investigating” status
17:15 Engineers rate-limited incoming traffic to reduce load on the database
17:30 Engineers killed the longest-running queries to reduce load on the database
17:36 Engineers shed all incoming traffic to allow database recovery
17:38 Engineers re-enabled incoming traffic
17:40 Database performance degraded further
17:45 Engineers switched the database to other hardware
18:13 Engineers set a lower rate limit on incoming traffic
18:24 Engineers identified the subset of queries causing the performance degradation
18:30 Engineers proposed a database operation to optimize those queries
18:43 Engineers backed up the database to avoid data loss
18:46 Engineers performed the database operation
19:00 End of impact: Request success rates returned to their pre-incident baseline
19:21 Customer Success set issquareup.com to “Monitoring” status
21:07 Customer Success set issquareup.com to “Resolved” status
This incident revealed areas for improvement in both our technical infrastructure and our engineering processes. Several of these improvements have already been implemented.
A new usage pattern for Square APIs generated an unexpected volume of data within the database storing mobile application settings. Queries that were not optimized for this scale degraded in performance, leading to timeouts when retrieving settings from this database. The APIs that our mobile Point of Sale applications depend on to retrieve account information from this database began omitting the unavailable information from their responses.
As a result, the iOS mobile applications unintentionally reverted to incorrect values for application settings, such as disabling tipping. Disabling tipping negatively impacted both sellers and their employees, and sellers were unable to re-enable tipping because periodic syncs kept reapplying that setting.
Sellers using device codes to log in to mobile applications encountered a different issue, because our system treated all requests missing those settings as entirely invalid. The log-in flow requires successful retrieval of account information, so these sellers were blocked from logging in and taking payments.
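To make the settings reversion described above concrete, here is a minimal Python sketch of two ways a client could apply a partial settings response. It is an illustration under stated assumptions, not Square's actual client code: the setting names, the `DEFAULTS` table, and both function names are hypothetical.

```python
from typing import Any, Dict, Mapping

# Hypothetical built-in defaults; the real setting names and values are not
# given in this post.
DEFAULTS: Dict[str, Any] = {"tipping_enabled": False, "receipt_printing": False}


def naive_sync(cached: Mapping[str, Any], response: Mapping[str, Any]) -> Dict[str, Any]:
    """Failure mode: a field omitted from the response falls back to its default,
    silently overwriting whatever the seller had configured."""
    return {key: response.get(key, DEFAULTS[key]) for key in DEFAULTS}


def preserving_sync(cached: Mapping[str, Any], response: Mapping[str, Any]) -> Dict[str, Any]:
    """Alternative: treat an omitted field as "unknown" and keep the last known
    value from the local cache; use the default only if nothing was ever cached."""
    merged = dict(cached)
    for key in DEFAULTS:
        if key in response:
            merged[key] = response[key]   # the server explicitly sent a value
        elif key not in merged:
            merged[key] = DEFAULTS[key]   # first run, nothing cached yet
    return merged


cached = {"tipping_enabled": True, "receipt_printing": True}
partial_response = {"receipt_printing": True}  # tipping field omitted by the API

print(naive_sync(cached, partial_response))       # tipping reverts to the default
print(preserving_sync(cached, partial_response))  # tipping keeps the cached value
```

The design choice in the second function is to distinguish "the server said this is off" from "the server said nothing", so that an explicit change from the server still takes effect while an omission does not.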
To prevent the same issue from occurring again, our engineering teams will conduct an audit of each setting to ensure that, in the event of a service disruption, a local cache of settings remains in effect so that settings stay unchanged. We’ll also explore other options for ensuring that tipping stability is highly prioritized in the future. Finally, we’ve refined our metric detectors to forecast alarming trends before performance degradation occurs.
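As an illustration of what a forecasting detector can look like, the Python sketch below fits a straight line to recent, evenly spaced samples of a metric and estimates how many sampling intervals remain before a threshold is crossed. This is a generic least-squares trend extrapolation and only one possible approach; it is not a description of our production detectors, and the function names, sample values, and threshold are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(samples: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = intercept + slope * i over sample index i."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def intervals_until_breach(samples: Sequence[float], threshold: float) -> Optional[float]:
    """Estimate the number of sampling intervals until the fitted trend reaches
    the threshold. Returns 0.0 if the fitted current value already meets it,
    and None if the trend is flat or decreasing."""
    slope, intercept = fit_line(samples)
    current = intercept + slope * (len(samples) - 1)
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope


# Hypothetical example: a metric sampled once per interval, alerting at 200.
recent = [100, 110, 120, 130, 140]
print(intervals_until_breach(recent, threshold=200))
```

A detector built this way could page when the estimate drops below some lead time, giving engineers room to act while request success rates are still healthy.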