Delayed Notifications and Incident Log Entries
Incident Report for PagerDuty
Postmortem

Summary

On May 21, 2018, from 09:30 to 12:30 UTC, PagerDuty experienced a degradation in our ability to process events. As a result, incident creation and incident notification delivery were delayed for some customers.

What Happened?

An instance of the datastore software that serves as part of our event processing pipeline experienced localized network issues. Once our engineers identified this, they removed the affected instance from service. Our systems then recovered, and we declared the incident resolved at 10:50 UTC. This initial impact was recorded as a separate incident on this page: https://status.pagerduty.com/incidents/6sq06fbfxffg

However, while we were replacing the removed instance, a bug in the datastore software caused additional complications, which led to further service impact beginning at around 12:03 UTC. Once we identified this, we reverted the deployment of the replacement instance and applied a manual workaround.

In each case, once the impact to the service ended, our event ingestion pipeline processed the backlog of pending events that had accumulated behind the affected part of the pipeline. After the second period of impact, our systems fully recovered at 12:31 UTC.

What Are We Doing About This?

We have upgraded our datastore software so that we are no longer vulnerable to the bug that initiated this degradation. We have also been making architectural changes that improve availability during network issues, and we test configuration changes extensively as part of our deployment pipeline.

We would like to express our regret for this service interruption. For any questions, comments, or concerns, please reach out to support@pagerduty.com

Posted Apr 03, 2019 - 23:21 UTC

Resolved
We have identified the issue and taken steps to remediate it. Notifications and log entries are being processed normally. All systems are now operational. We are continuing to monitor the situation for any recurrence.
Posted May 21, 2018 - 12:31 UTC
Update
We are continuing to investigate this issue.
Posted May 21, 2018 - 12:28 UTC
Investigating
We are investigating an issue that is causing notifications and log entries to be delayed.
Posted May 21, 2018 - 12:03 UTC
This incident affected: Notification Delivery.