Write-up
Payments Disruption
Incident Summary

Starting at 2024-04-01 14:36 UTC, applications running on Android or Square Hardware's Android-based operating system experienced a service disruption. Other Square platforms, such as iOS, experienced an intermittent service disruption as a secondary result of that incident. The root cause was a new feature configuration that Android mobile applications could not interpret correctly. The affected applications crashed, and their repeated attempts to recover generated significant traffic volumes.


In this postmortem recap, we’ll communicate the root cause of this disruption, document the steps that we took to diagnose and resolve the disruption, and share our analysis and actions to ensure that we are properly defending our customers from service interruptions like this in the future.

Timeline (UTC)

14:36 Square introduced a new feature configuration for an upcoming feature

14:36 Beginning of impact: Android and Square hardware began emitting crashes

14:36 Android and Square Hardware's Android-based operating system increased traffic volume to attempt recovery

14:37 Square APIs initiated auto-scaling as traffic volume increased due to retries

14:43 Mobile crash monitoring alerted engineering of an escalating issue

14:44 Traffic volume monitoring alerted engineering of an escalating issue

14:47 Investigation of mobile crash reports was initiated

14:59 Square posted an update to IsSquareUp.com to notify Sellers

14:59 Investigation of elevated traffic volumes against Square APIs was initiated

15:08 Square posted an update to IsSquareUp.com to notify Sellers

15:20 Square posted an update to social media accounts to notify Sellers

15:25 Square removed the new feature configuration

15:25 End of impact: Android and Square Hardware's Android-based operating system successfully reconnected to APIs

15:29 Traffic volume returned to the baseline before the incident began

Analysis

This incident revealed areas of improvement for both our technical infrastructure and our engineering processes, several of which have already been implemented.


The introduction of the new feature configuration resulted in malformed JSON being sent to Square’s mobile applications. Each platform handled that network response differently. iOS properly identified the issue and rejected the latest response content. Applications running on Android or Square Hardware's Android-based operating system emitted a crash report and closed the application. Square hardware immediately relaunched the application. Android devices waited for users to relaunch the application.


The frequent relaunches from the affected platforms drove traffic on specific APIs significantly above normal patterns. Those Square APIs initiated auto-scaling procedures to handle the increased traffic. Square will further refine its auto-scaling policies based on what we learned from this incident.
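
As an illustration of how relaunch-driven traffic of this kind is commonly dampened on the client side, here is a minimal sketch of exponential backoff with full jitter. This write-up does not say that Square uses or plans to use this technique. The function name `relaunch_delay` and its parameters are hypothetical and chosen only for this example.

```python
import random


def relaunch_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Return how many seconds to wait before relaunch number `attempt` (0-based).

    Hypothetical helper, not taken from the incident report.
    Full jitter: pick a uniform delay between 0 and an exponentially
    growing ceiling, so that many devices do not reconnect in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... capped at `cap`
    return rng() * ceiling


# Example: print a possible delay schedule for five consecutive relaunches.
if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(relaunch_delay(attempt), 2))
```

Spreading relaunches over a growing, randomized window trades slower recovery on each device for a smaller spike on the APIs it reconnects to. Whether that trade is acceptable on payment hardware is a product decision that this sketch does not address.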


In response to the root cause, Square will improve the handling of the feature configuration to avoid this particular issue. Additionally, Square will adjust Android to match iOS when handling invalid responses, falling back to cached values until an incident is resolved.
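
The fallback behavior described above can be sketched roughly as follows. This is a simplified illustration in Python, not Square's Android or iOS code. The class `FeatureConfigStore`, its methods, and the `new_checkout` flag are hypothetical names invented for this example. The idea is that a response which fails to parse is rejected and the last known-good configuration stays in effect.

```python
import json


class FeatureConfigStore:
    """Keeps the last feature configuration that parsed cleanly (hypothetical)."""

    def __init__(self, initial):
        self._cached = dict(initial)

    def apply_response(self, body):
        """Adopt a new configuration, or keep the cached one if `body` is invalid.

        Returns True if the new configuration was accepted, False otherwise.
        """
        try:
            parsed = json.loads(body)
        except json.JSONDecodeError:
            # Malformed JSON: reject the response instead of crashing.
            return False
        if not isinstance(parsed, dict):
            # Valid JSON but not the expected object shape: also reject.
            return False
        self._cached = parsed
        return True

    def current(self):
        """Return a copy of the configuration currently in effect."""
        return dict(self._cached)


# Usage: a truncated payload is rejected and the cached value survives.
if __name__ == "__main__":
    store = FeatureConfigStore({"new_checkout": False})
    store.apply_response('{"new_checkout": tru')
    print(store.current())
```

A real client would likely also validate individual fields and report the rejected payload to monitoring, so that a bad configuration is noticed even though the application keeps running.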