On February 2nd, 2021, between 21:36 UTC and 22:29 UTC, PagerDuty experienced an incident that impacted the web interface, mobile application, REST API, and outbound webhooks. During this time, some users may have intermittently received 5xx errors.
PagerDuty uses MySQL as one of our primary datastores, with multiple clusters serving the needs of various services within our ecosystem. We use an industry-standard replication topology with primary nodes and numerous read replicas to serve traffic.
On February 2nd, 2021, around 21:35 UTC, an EC2 node failure affected the primary node of one of our clusters, causing the node to immediately become unavailable. Within minutes, our engineers initiated a failover process to replace the failed host. A bug in our cluster management tooling slowed the failover and delayed resolution. Once the failover completed, our engineers worked to rebuild the part of our replication topology that had been connected to the failed host. These actions resolved the incident for our customers.
We’ve conducted an internal review of this incident and identified several actions to reduce both the risk of a recurrence and its impact should one occur. Please see the action items below:
Finally, we’d like to apologize for the impact that this had on our customers. If you have any further questions, please reach out to support@pagerduty.com.