Several services currently running in degraded state
Incident Report for PagerDuty


On August 5, 2018, beginning at 12:02 UTC, PagerDuty experienced a degradation across multiple services that lasted 38 minutes, causing delays in incident creation, notifications, and the display of incident details.

What Happened?

A relatively new piece of our infrastructure, responsible for internal communication in some of our data processing pipelines, experienced a networking regression shortly after it was upgraded. The regression caused the service to disconnect, which halted the replication and forwarding of messages.

Once the problem had been identified, PagerDuty engineers manually restarted the service, and the backlog of messages that had accumulated upstream in the data pipeline was then processed successfully.

What Are We Doing About This?

To recover more quickly from similar incidents, we have automated the process of restarting the service and put monitoring in place for the condition that led to the incident. We have also confirmed that the networking regression is fixed in a newer version of the datastore, and we will be upgrading the affected system.
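In spirit, an automated restart like the one described above can be sketched as a small watchdog: periodically check replication lag, and trigger a restart once the backlog crosses a threshold. The service name, lag threshold, and health check below are illustrative assumptions, not PagerDuty's actual implementation.

```python
# Minimal watchdog sketch: restart a service when replication lag grows too large.
# All names and thresholds here are hypothetical.

SERVICE_NAME = "pipeline-datastore"  # hypothetical service identifier
MAX_LAG_MESSAGES = 1000              # hypothetical backlog threshold


def replication_healthy(lag_messages: int, max_lag: int = MAX_LAG_MESSAGES) -> bool:
    """Treat replication as healthy while the message backlog stays below the threshold."""
    return lag_messages < max_lag


def restart_service(name: str) -> None:
    """Placeholder restart hook.

    In a real deployment this might shell out to an init system, e.g.
    subprocess.run(["systemctl", "restart", name], check=True).
    """
    print(f"restarting {name}")


def monitor_once(current_lag: int) -> bool:
    """Run one check cycle; return True if a restart was triggered."""
    if not replication_healthy(current_lag):
        restart_service(SERVICE_NAME)
        return True
    return False
```

In production, a loop like this would typically run on a schedule (cron, a supervisor, or the monitoring system itself) and page a human if restarts repeat, rather than restarting indefinitely.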

Posted Aug 29, 2018 - 00:36 UTC

All PagerDuty services have recovered.
Posted Aug 05, 2018 - 12:46 UTC
PagerDuty engineers have identified the cause, taken action to correct the issues, and our services are in the process of recovery.
Posted Aug 05, 2018 - 12:41 UTC
PagerDuty is currently experiencing delays in notifications and display of incident details; our engineers are currently investigating.
Posted Aug 05, 2018 - 12:33 UTC
This incident affected: Notification Delivery.