Several services currently running in degraded state
Incident Report for PagerDuty
Postmortem

Summary

On August 5, 2018, beginning at 12:02 UTC, PagerDuty experienced a degradation across multiple services that lasted 38 minutes, causing delays in incident creation, notifications, and the display of incident details.

What Happened?

A relatively new piece of our infrastructure, responsible for internal communication in some of our data processing pipelines, experienced a networking regression shortly after it was upgraded. The regression caused it to disconnect, which halted replication and the forwarding of messages.

Once the problem had been identified, PagerDuty engineers manually restarted the service, after which the backlog of messages that had accumulated upstream in the data pipeline was processed successfully.
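
For illustration, here is a minimal sketch of how an operator might confirm that such a backlog has drained after a restart. The metric source, threshold, and function names are hypothetical; the report does not describe PagerDuty's actual tooling:

    import time

    BACKLOG_DRAINED_THRESHOLD = 0   # illustrative: no unprocessed messages remaining
    POLL_INTERVAL_SECONDS = 30

    def get_upstream_backlog() -> int:
        # Hypothetical hook: return the number of unprocessed messages
        # queued upstream of the affected service. The real source of
        # this metric is not described in the report.
        raise NotImplementedError("wire up to your pipeline's metrics")

    def wait_for_backlog_to_drain() -> None:
        # After a restart, poll the upstream queue depth until the
        # accumulated messages have been processed.
        while (depth := get_upstream_backlog()) > BACKLOG_DRAINED_THRESHOLD:
            print(f"backlog still draining: {depth} messages remaining")
            time.sleep(POLL_INTERVAL_SECONDS)
        print("backlog drained; pipeline has caught up")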

What Are We Doing About This?

To recover more quickly from similar incidents, we have automated the process of restarting the service and have put monitoring in place for the condition that led to the incident. We have also identified that the networking regression is fixed in a newer version of the datastore, and we will be upgrading the affected system.
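
A minimal sketch of what that automation could look like, assuming a systemd-managed service and a simple TCP connectivity probe as the health check. The service name, host, port, and check logic are all illustrative stand-ins, not PagerDuty's actual remediation:

    import socket
    import subprocess
    import time

    SERVICE_NAME = "pipeline-datastore"            # illustrative service name
    DATASTORE_ADDR = ("datastore.internal", 9042)  # illustrative host and port
    CHECK_INTERVAL_SECONDS = 60

    def replication_connected() -> bool:
        # Stand-in health check: treat a successful TCP connect as
        # "connected". The actual condition that halted replication
        # is not detailed in the report.
        try:
            with socket.create_connection(DATASTORE_ADDR, timeout=5):
                return True
        except OSError:
            return False

    def monitor_and_restart() -> None:
        # Watch for the condition that led to the incident and restart
        # the service automatically instead of waiting on manual action.
        while True:
            if not replication_connected():
                subprocess.run(["systemctl", "restart", SERVICE_NAME], check=True)
            time.sleep(CHECK_INTERVAL_SECONDS)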

Posted Aug 29, 2018 - 00:36 UTC

Resolved
All PagerDuty services have recovered.
Posted Aug 05, 2018 - 12:46 UTC
Monitoring
PagerDuty engineers have identified the cause, taken action to correct the issues, and our services are in the process of recovery.
Posted Aug 05, 2018 - 12:41 UTC
Investigating
PagerDuty is currently experiencing delays in notifications and display of incident details; our engineers are currently investigating.
Posted Aug 05, 2018 - 12:33 UTC
This incident affected: Notification Delivery.