On Thursday, April 13, between 08:30 and 10:30 UTC, PagerDuty suffered a degradation in our notification delivery pipeline. During this time, some users could not receive any new notifications on at least one notification channel. All affected notifications fell out of SLA, and in some cases were never sent.
A kernel upgrade caused some of our Cassandra hosts to momentarily lose visibility of other Cassandra hosts, which left some of our notifications in a corrupted state. The corrupted notifications clogged their respective contact channels, which stopped users from receiving any notifications on those channels. Our engineers had to manually clear the affected contact channels so that notifications could resume processing.
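To illustrate the kind of clogging described above, here is a minimal sketch of an in-order channel queue in which one undeliverable record holds up everything behind it until it is set aside. This is not PagerDuty's implementation: the names `send`, `drain`, `CorruptNotification`, and `max_attempts`, and the use of a `None` payload to model corruption, are all assumptions made for illustration.

```python
from collections import deque


class CorruptNotification(Exception):
    """Hypothetical error raised when a queued notification cannot be processed."""


def send(notification):
    # Stand-in for real delivery; a payload of None models a corrupted record.
    if notification["payload"] is None:
        raise CorruptNotification(notification["id"])
    return notification["id"]


def drain(queue, max_attempts=3):
    """Deliver notifications in order, quarantining any that keep failing."""
    delivered, quarantined = [], []
    attempts = 0
    while queue:
        head = queue[0]
        try:
            delivered.append(send(head))
        except CorruptNotification:
            attempts += 1
            if attempts < max_attempts:
                continue  # retry the same head; later items wait behind it
            quarantined.append(head["id"])  # give up and set the record aside
        queue.popleft()
        attempts = 0
    return delivered, quarantined


channel = deque([
    {"id": "n1", "payload": "disk full"},
    {"id": "n2", "payload": None},  # corrupted record
    {"id": "n3", "payload": "cpu high"},
])
delivered, quarantined = drain(channel)
```

Without the `max_attempts` limit, the loop would retry `n2` forever and `n3` would never be sent, which is the blocking behavior a manual clear would otherwise have to resolve.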
PagerDuty takes notification delivery very seriously and has already taken steps to mitigate the risk of this happening in the future, such as moving away from Cassandra and towards Kafka as a queueing service. We know we let our customers down by dropping messages without any way to make up for the lost notifications, and for that we are deeply sorry. We are using this experience as an opportunity to look deeper at our infrastructure and make it even more resilient without compromising quality.
Again, we apologize for any inconvenience this issue caused. If you have any questions, please do not hesitate to contact us at firstname.lastname@example.org.