Event Ingestion Delays
Incident Report for PagerDuty


On November 2nd, 2017, from 14:59 to 16:21 UTC, PagerDuty experienced a degradation in our ability to process events. As a result, incident creation and incident notification delivery were delayed for some customers.

What Happened?

PagerDuty engineers were performing preventative maintenance on one of our internal datastores to add capacity. Due to a bug in the datastore technology, adding capacity made the datastore unavailable to internal PagerDuty services for about 7 minutes. Due to a separate problem with our tooling, a further period of partial availability for the same datastore followed, lasting approximately 22 minutes. After these events, our event ingestion pipeline had to process the backlog of pending events that had accumulated. A small subset of these events took longer than expected to process.

What Are We Doing About This?

We will soon be upgrading our datastore software version to ensure that we are no longer vulnerable to the bug that initiated this degradation. We have also improved our tooling and documentation to address the tooling problem.

For our event ingestion pipeline, we have improved the parallelism of our event processing. This will allow us to recover much more quickly in the future.

We would like to express our regret for the service degradation. For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted Feb 27, 2018 - 00:06 UTC

Event ingestion and notifications are back to normal operational levels. All systems should be fully operational.
Posted Nov 02, 2017 - 16:21 UTC
We have identified the source of the issue and have been recovering for approximately the last 15 minutes.
Posted Nov 02, 2017 - 15:59 UTC
We are investigating potential delays in event ingestion. The REST API, web application, and webhooks are all functional.
Posted Nov 02, 2017 - 15:36 UTC