Issue Processing Events on our Events API
Incident Report for PagerDuty
Postmortem

Summary

On May 3, 2017 at 20:01 UTC, PagerDuty suffered a service degradation affecting our Events API; this incident lasted for three hours. During this time, customers would have experienced difficulties sending events to the PagerDuty Events API. We apologize to any customers who were affected by the outage.

What Happened?

At 19:12 UTC, PagerDuty began maintenance on one of the Cassandra-based services responsible for processing events from the Events API. During this maintenance, the Cassandra cluster became unstable while engineers increased the capacity of the overall system. PagerDuty engineers were immediately alerted to the issue and worked to bring the cluster back into a stable state. At 22:10 UTC, the cluster was stable and the API was able to process events.

What Are We Doing About This?

To avoid future issues like this one, we have put additional checks into place around how we scale our Cassandra cluster. We sincerely apologize if this degradation negatively impacted your team's usage of PagerDuty. If you have questions or concerns, please contact us at support@pagerduty.com.

Posted May 16, 2017 - 23:10 UTC

Resolved
We are now processing events and notifications normally. All systems are functional.
Posted May 03, 2017 - 22:10 UTC
Monitoring
We believe we have resolved the root cause of the issue with event ingestion and are working on processing the current backlog of events. We are monitoring the situation closely until we are fully recovered.
Posted May 03, 2017 - 21:46 UTC
Identified
We have identified the issue with event processing on our Events API and are currently working on a resolution.
Posted May 03, 2017 - 21:06 UTC
Investigating
We are currently experiencing an issue processing events on our Events API. We are actively investigating the issue.
Posted May 03, 2017 - 20:33 UTC