Intermittent 500 errors
Incident Report for PagerDuty
Postmortem

# Summary

On February 2nd, 2021, between 21:36 UTC and 22:29 UTC, PagerDuty experienced an incident that impacted the web interface, mobile application, REST API, and outbound webhooks. During this time, some users may have intermittently received 5xx errors.

# What Happened

PagerDuty uses MySQL as one of our primary datastores; multiple clusters serve the needs of various services within our ecosystem. We use an industry-standard replication topology with primary nodes and numerous read replicas to serve traffic.
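
PagerDuty has not published the details of its data-access layer, but as a rough illustration of the topology described above, here is a minimal sketch of sending writes to a primary and spreading reads across replicas. The hostnames, credentials, and the choice of the pymysql driver are assumptions made for illustration, not our actual tooling.

```python
# Hypothetical sketch of primary/replica routing for a MySQL cluster.
# Hostnames, credentials, and the pymysql driver are illustrative
# assumptions, not PagerDuty's actual setup.
import random
import pymysql

PRIMARY_HOST = "mysql-primary.internal.example"   # assumed hostname
REPLICA_HOSTS = [                                  # assumed hostnames
    "mysql-replica-1.internal.example",
    "mysql-replica-2.internal.example",
]

def connect(host: str) -> pymysql.connections.Connection:
    """Open a connection to the given MySQL host."""
    return pymysql.connect(host=host, user="app", password="secret",
                           database="app", autocommit=True)

def run_write(sql: str, params=()):
    """Writes always go to the primary node."""
    conn = connect(PRIMARY_HOST)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
    finally:
        conn.close()

def run_read(sql: str, params=()):
    """Reads are spread across the read replicas."""
    conn = connect(random.choice(REPLICA_HOSTS))
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()
```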

On February 2nd, 2021, around 21:35 UTC, an EC2 node failure took down the primary node of one of our clusters, making it immediately unavailable. Within minutes, our engineers initiated a failover to replace the failed host. The failover was delayed by a bug in our cluster management tooling, which slowed our recovery. Once the failover completed, our engineers rebuilt the part of our replication topology that had been connected to the failed host. These actions resolved the incident for our customers.
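
The internals of our cluster management tooling are not covered here, but as a conceptual sketch only, a failover of this kind typically promotes a healthy replica and repoints the remaining replicas at it. The hostnames, credentials, and the use of GTID-based replication below are assumptions for illustration.

```python
# Conceptual sketch of a MySQL failover after a primary failure.
# This is NOT PagerDuty's tooling; hostnames, credentials, and GTID-based
# replication are assumptions made for illustration.
import pymysql

NEW_PRIMARY = "mysql-replica-1.internal.example"    # replica chosen for promotion
OTHER_REPLICAS = ["mysql-replica-2.internal.example"]

def execute(host: str, statements):
    """Run a list of SQL statements against one host."""
    conn = pymysql.connect(host=host, user="admin", password="secret",
                           autocommit=True)
    try:
        with conn.cursor() as cur:
            for stmt in statements:
                cur.execute(stmt)
    finally:
        conn.close()

# 1. Promote the chosen replica: stop replicating, clear its replica
#    configuration, and allow writes.
execute(NEW_PRIMARY, [
    "STOP SLAVE",
    "RESET SLAVE ALL",
    "SET GLOBAL read_only = OFF",
])

# 2. Repoint the surviving replicas at the new primary (GTID auto-positioning).
for replica in OTHER_REPLICAS:
    execute(replica, [
        "STOP SLAVE",
        "CHANGE MASTER TO MASTER_HOST = 'mysql-replica-1.internal.example', "
        "MASTER_USER = 'repl', MASTER_PASSWORD = 'secret', "
        "MASTER_AUTO_POSITION = 1",
        "START SLAVE",
    ])

# 3. Application traffic would then be redirected to NEW_PRIMARY
#    (for example via DNS or a proxy layer), which is outside this sketch.
```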

# What We Are Doing About This

We’ve conducted an internal review of this incident and identified several actions that will reduce both the likelihood of a recurrence and its impact if one does occur. Please see the action items below:

  • We’ve completed an investigation into the bug we encountered during the failover that delayed our recovery, and we have identified several configuration changes that will prevent situations like this in the future.
  • We’re conducting a separate review of an intermediary service and its handling of the MySQL node failure.

Finally, we’d like to apologize for the impact this had on our customers. If you have any further questions, please reach out to support@pagerduty.com.

Posted Feb 12, 2021 - 18:50 UTC

Resolved
We have fully recovered from this incident.
Posted Feb 02, 2021 - 22:30 UTC
Monitoring
We are seeing recovery from this incident and will continue to monitor it.
Posted Feb 02, 2021 - 22:24 UTC
Update
We have identified the source of the 500 errors and are seeing them decrease. We are actively working on repairing the issue.
Posted Feb 02, 2021 - 22:13 UTC
Identified
We are seeing intermittent 500 errors in our UI and mobile app. Event ingestion is not affected.
Posted Feb 02, 2021 - 22:00 UTC
This incident affected: REST API, Web Application, and Mobile Application.