Delayed Notifications
Incident Report for PagerDuty

Summary

On Thursday, April 13, between 8:30 UTC and 10:30 UTC, PagerDuty suffered a degradation in our notification delivery pipeline during which some users could not receive any new notifications on at least one notification channel. All affected notifications fell out of SLA, and some were never sent.

What happened

A kernel upgrade caused some of our Cassandra hosts to momentarily lose visibility of other Cassandra hosts, which left some of our notifications in a corrupted state. The corrupted notifications clogged their respective contact channels, which stopped users from receiving any notifications on those channels. Our engineers had to manually clear the affected contact channels so that notifications could continue processing.
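The clogging described above can be pictured as head-of-line blocking in a first-in, first-out queue. The following is a minimal illustrative sketch, not PagerDuty's actual pipeline: the `Notification` class, the `corrupted` flag, and the `drain` and `clear_corrupted` functions are all hypothetical names, and it assumes each contact channel is processed strictly in order.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Notification:
    id: int
    corrupted: bool = False  # hypothetical stand-in for the corrupted state


def drain(channel: Deque[Notification]) -> List[int]:
    """Deliver notifications in FIFO order, stopping at the first corrupted one."""
    delivered = []
    while channel:
        if channel[0].corrupted:
            break  # head-of-line blocking: nothing queued behind this entry is sent
        delivered.append(channel.popleft().id)
    return delivered


def clear_corrupted(channel: Deque[Notification]) -> int:
    """Manual remediation: drop corrupted entries so delivery can resume."""
    before = len(channel)
    kept = [n for n in channel if not n.corrupted]
    channel.clear()
    channel.extend(kept)
    return before - len(kept)


# One hypothetical contact channel with a corrupted entry in the middle.
sms = deque([Notification(1), Notification(2, corrupted=True), Notification(3)])

first_pass = drain(sms)         # only the entry ahead of the corrupted one is delivered
removed = clear_corrupted(sms)  # the corrupted entry is discarded and never sent
second_pass = drain(sms)        # delivery resumes for the remaining entry
```

In this sketch, notification 3 is delayed until the channel is cleared and notification 2 is never delivered, which mirrors the two outcomes reported in the summary.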

What are we doing about this?

PagerDuty takes notification delivery very seriously and has already taken steps to reduce the risk of this happening again, such as moving away from Cassandra and towards Kafka as a queueing service. We know we let our customers down by dropping messages with no way to make up for the lost notifications, and for that we are deeply sorry. We are using this experience as an opportunity to look more closely at our infrastructure and make it more resilient without compromising quality.

Again, we apologize for any inconvenience this issue caused. If you have any questions, please do not hesitate to contact us at support@pagerduty.com.

Posted Oct 26, 2017 - 18:05 UTC

Resolved
This incident has been resolved.
Posted Apr 13, 2017 - 22:53 UTC
Update
Webhook delivery has recovered at this time.
Posted Apr 13, 2017 - 22:36 UTC
Update
Notification delivery has recovered; webhooks are still impacted. Investigation is still ongoing.
Posted Apr 13, 2017 - 22:24 UTC
Update
We are still investigating the issue affecting delivery of notifications and webhooks.
Posted Apr 13, 2017 - 21:47 UTC
Update
We are still investigating the current issue affecting notification delivery and webhook delivery.
Posted Apr 13, 2017 - 20:42 UTC
Investigating
We are currently experiencing an issue causing delays in notification delivery for a small number of accounts.
Posted Apr 13, 2017 - 20:00 UTC