Event Processing Issues
Incident Report for PagerDuty

Summary

On August 2, 2017, from 07:52 to 12:09 UTC, we experienced issues in event processing that affected a small number of customers. Over the course of this incident, 30 minutes' worth of inbound events were lost for the affected accounts.

What Happened?

A small portion of customer accounts experienced degraded event processing caused by issues in an upgraded system that had been put in place. At 09:33 UTC, the affected customers were reverted to the previous system to mitigate further damage. Unfortunately, some inbound events received between about 09:00 and 09:30 UTC could not be recovered during the revert.

What Are We Doing About This?

Our engineers have reverted all accounts to the previous event processing system and are currently investigating solutions to prevent similar issues from occurring in the upgraded system. Additionally, we have made improvements to identify such issues more quickly and minimize their impact should they recur, and to significantly reduce the risk of losing events.

We understand the importance of maintaining the reliability of our systems and would like to apologize for any inconvenience caused by this incident. If you have any questions, concerns, or comments, please do not hesitate to reach out to support@pagerduty.com.

Posted Aug 16, 2017 - 21:45 UTC

Resolved
From 08:00 until 09:41 UTC, there was an issue in processing events for a small percentage of customers. The issue has since been resolved and all systems are operational.
Posted Aug 02, 2017 - 12:10 UTC