On the morning of March 7, from approximately 4:45AM UTC to 6AM UTC, PagerDuty experienced an incident related to event processing and incident creation. During this time, customers saw issues with our Events API, as well as delays in incident creation and, as a result, in notification processing.
Our engineers were alerted to notification processing delays at 4:50AM UTC. As an initial remediation step, the associated service was redeployed; however, we did not see recovery after this action. We isolated the issue to one of our Kafka clusters, which was operating in a degraded state due to underlying hardware issues affecting our servers. As soon as we discovered this, we began the process of replacing the failing servers. Once the underlying server issues recovered, customer impact stopped. We continued the replacement process to completion in order to bring the Kafka cluster back to a healthy state.
Customers would have seen issues for approximately 1 hour from when the hardware degradation began. There was no further customer impact after this point, even as we continued replacing the failing servers.
Our engineering team is investigating why these underlying instance issues impacted our event-to-notification pipeline. We expect to be resilient to failures of this kind, and we will determine why we were not in this case and what we can do to remedy this deficiency going forward.
We would like to express our sincere regret for the service degradation. For any questions, comments, or concerns, please contact us at support@pagerduty.com.