Incident creation delays
Incident Report for PagerDuty
Postmortem

Summary

On August 27, 2018, at 18:54 UTC, PagerDuty experienced a 51-minute performance degradation that delayed incident creation through HTTP (Events API) and email-based integrations, as well as notification delivery. During this window, incidents were created more slowly than usual, and notifications for those incidents were sent later than they otherwise would have been.

What Happened?

During the final steps of our procedure to update our web application’s data storage schema, the deployment of new application code that coincided with the schema migration failed our canary deploy tests, and the deployment was canceled. However, the schema versioning information, which tells the web application hosts which schema cache fileset to use when rebuilding data models, had already been updated. This left our application in an inconsistent state: the versioning information referenced a schema that was no longer available. The impacted hosts experienced elevated system resource usage and other adverse effects.
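
To make the failure mode concrete, here is a minimal sketch of the coupling described above, written in Python; the paths, names, and error handling are hypothetical illustrations, not PagerDuty's actual implementation.

    from pathlib import Path

    SCHEMA_CACHE_DIR = Path("/var/app/schema_cache")  # hypothetical cache location
    VERSION_FILE = Path("/var/app/schema_version")    # hypothetical version record


    def resolve_schema_cache() -> Path:
        """Return the schema cache fileset referenced by the recorded version."""
        version = VERSION_FILE.read_text().strip()
        fileset = SCHEMA_CACHE_DIR / version
        if not fileset.exists():
            # This is the inconsistent state from the incident: the recorded
            # version points at a cache fileset that the canceled deploy never
            # left in place. The sketch raises here; in the incident, hosts
            # instead tried to rebuild data models without the cache, driving
            # up system resource usage.
            raise FileNotFoundError(f"schema cache fileset not found: {fileset}")
        return fileset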

Due to a miscommunication while reverting the deployment, the schema version information was not reverted. Furthermore, the reversion was applied in a way that triggered a deploy to the entire fleet. The effects that had originally been limited to the canary hosts were now exhibited fleet-wide, as hosts across the fleet began trying to rebuild data models without the updated schema cache that would normally be generated during a schema migration.

Once these missteps were realized, the version information was reverted and the web application was restarted. This corrected the version mismatch between the recorded version information and the latest schema cache fileset on the hosts’ filesystems. Within minutes, PagerDuty finished working through the backlog of tasks that had accumulated in its event processing data pipeline, and our services fully recovered.

What Are We Doing About This?

Our approach to mitigating risk in the migration process is twofold. First, we are improving our internal documentation around schema migrations, including procedures for the remedial actions to take when key steps of a migration do not succeed. Second, we are investigating ways to safely automate the final steps of the migration process, as well as safeguards against the condition that caused the service degradation in this incident.
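
As one illustration of the kind of safeguard being considered, a host could refuse to record a new schema version until the matching cache fileset is confirmed to exist. The sketch below reuses the hypothetical layout from the earlier example and is not the actual design.

    from pathlib import Path

    SCHEMA_CACHE_DIR = Path("/var/app/schema_cache")  # hypothetical cache location
    VERSION_FILE = Path("/var/app/schema_version")    # hypothetical version record


    def record_schema_version(new_version: str) -> None:
        """Record a new schema version only if its cache fileset is present."""
        fileset = SCHEMA_CACHE_DIR / new_version
        if not fileset.exists():
            # Abort rather than leave the application pointing at a schema
            # cache that does not exist, which is the condition behind this
            # incident.
            raise RuntimeError(
                f"refusing to record schema version {new_version!r}: "
                f"cache fileset {fileset} not found"
            )
        VERSION_FILE.write_text(new_version + "\n")

With a check like this, a canceled deploy would leave the version record untouched, keeping the recorded version and the on-disk schema cache consistent.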

Posted Aug 31, 2018 - 18:40 UTC

Resolved
PagerDuty has recovered from the incident creation and notification delivery delays.
Posted Aug 27, 2018 - 19:45 UTC
Update
Our event processing systems are recovering, but notification delivery is experiencing a minor delay related to the recovery.
Posted Aug 27, 2018 - 19:37 UTC
Identified
We have identified the cause of the issue, and our remedial actions have helped mitigate the impact.
Posted Aug 27, 2018 - 19:19 UTC
Investigating
PagerDuty is currently experiencing a delay in incident creation via email and the Events API. We are aware of the issue and are actively taking steps to mitigate its impact.
Posted Aug 27, 2018 - 19:08 UTC
This incident affected: Events API and Notification Delivery.