Delays in Event Processing
Incident Report for PagerDuty
Postmortem

Summary

On March 23rd, 2018 at 15:40 UTC, the portion of PagerDuty’s inbound integration event processing service that handles email integrations suffered a performance degradation that lasted 74 minutes. During this time, email events were delayed by more than five minutes, which in turn delayed the triggering or resolving of incidents.

What Happened

A portion of our event processing cluster experienced network issues. Our email processing service was configured to retry processing and transmitting events fewer times than other components of the service before restarting the affected hosts. As a result, many email processing hosts restarted, while the other components were able to retry enough times to ride out the network issues. These continual restarts delayed email processing and ultimately created a backlog of email events.
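
To illustrate the difference a retry limit makes, here is a minimal sketch in Python. It is not PagerDuty's actual code: the names process_with_retries, flaky_sender, HostRestart, and max_retries are hypothetical, and the network issue is simulated by a sender that fails a fixed number of times before succeeding.

```python
from typing import Callable


class HostRestart(Exception):
    """Hypothetical signal that the retry limit was hit and the host restarts."""


def process_with_retries(send: Callable[[], None], max_retries: int) -> int:
    """Call send() until it succeeds and return the number of retries used.

    Raises HostRestart once max_retries retries have all failed.
    """
    retries = 0
    while True:
        try:
            send()
            return retries
        except ConnectionError:
            if retries >= max_retries:
                raise HostRestart(f"gave up after {retries} retries")
            retries += 1


def flaky_sender(failures: int) -> Callable[[], None]:
    """Build a send() that fails `failures` times, then succeeds (simulated outage)."""
    remaining = [failures]

    def send() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionError("simulated network issue")

    return send
```

With an outage that causes four consecutive failures, a limit of five retries lets the call succeed on the fifth attempt, while a limit of two raises HostRestart. The same network issue is survived by one configuration and triggers a restart in the other.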

To recover, PagerDuty engineers temporarily doubled the number of hosts in the email processing service. With this adjustment in place, the backlog of stuck email events could be processed faster, and thus event processing was able to catch up with new event submissions.
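
A rough way to see why adding hosts helps a backlog drain is the arithmetic below, sketched in Python. All names and numbers are hypothetical and are not taken from the incident. The sketch assumes throughput scales linearly with host count and that new events keep arriving at a steady rate while the backlog is worked off.

```python
def drain_minutes(backlog: float, arrival_rate: float,
                  per_host_rate: float, hosts: int) -> float:
    """Estimate the minutes needed to clear a backlog.

    backlog       -- events already queued
    arrival_rate  -- new events arriving per minute
    per_host_rate -- events one host processes per minute (assumed constant)
    hosts         -- number of hosts processing events
    """
    capacity = per_host_rate * hosts
    surplus = capacity - arrival_rate  # capacity left over for the backlog
    if surplus <= 0:
        return float("inf")  # the backlog never shrinks
    return backlog / surplus


# Hypothetical figures: 6000 queued events, 100 new events per minute,
# 12 events per minute per host.
print(drain_minutes(6000, 100, 12, hosts=10))  # 300.0
print(drain_minutes(6000, 100, 12, hosts=20))  # about 42.9
```

Only the capacity above the arrival rate goes toward the backlog, so under these assumptions doubling the hosts shortens the drain time by much more than half.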

What are we doing about this?

We are taking steps to update the hosts to be more resilient to the type of network issue that we experienced, so that they can keep making forward progress when such issues occur, as our other services do.

Additionally, since most of the time spent recovering was in processing the backlog of events that accumulated during the service degradation, we will be investigating improved methods for processing very large backlogs of events.

We regret if this affected your team’s ability to receive alerts in a timely manner. As always, if you have any questions or concerns, feel free to contact us at support@pagerduty.com.

Posted Apr 09, 2018 - 21:34 UTC

Resolved
We've recovered and all events, including email events, are being processed normally.
Posted Mar 23, 2018 - 16:51 UTC
Monitoring
We've resolved the issue causing delays in event processing. Email events are catching up but other event types are being processed normally.
Posted Mar 23, 2018 - 16:37 UTC
Investigating
Our engineering team is currently investigating an issue causing delays in event processing.
Posted Mar 23, 2018 - 16:24 UTC