Delayed notifications and incident log entries
Incident Report for PagerDuty
Postmortem

Summary

From 22:47 UTC on April 24th, 2018 to 00:04 UTC on April 25th, PagerDuty’s web UI was intermittently unavailable and incident notification delivery was delayed, affecting all customers. Notification delivery and the web UI then returned to full availability, but from 00:04 until 01:22 UTC some customers experienced event processing delays and status 500 responses from the events API.

What Happened?

On April 24th, 2018, a change to the IPSec configuration in PagerDuty infrastructure was deployed automatically. The change prevented IPSec tunnels from being renewed, which caused a gradually increasing connectivity disruption between multiple components of the PagerDuty platform. An incident response team identified the issue and deployed a corrected configuration, restoring full connectivity.
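The gradual nature of the disruption can be illustrated with a small model. The sketch below is illustrative only and is not based on PagerDuty's actual topology or tooling: the tunnel names, the expiry times, and the `live_fraction` helper are all hypothetical. It assumes that each tunnel's current keys expire at a different time, so when renewal stops working, tunnels drop out one by one rather than all at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tunnel:
    name: str            # hypothetical label for a link between two components
    expires_at_min: int  # minutes after the deploy at which the current keys expire


def live_fraction(tunnels, minute, renewal_works):
    """Return the fraction of tunnels still usable at the given minute."""
    if renewal_works:
        # With working renewal, every tunnel is refreshed before it expires.
        return 1.0
    live = sum(1 for t in tunnels if t.expires_at_min > minute)
    return live / len(tunnels)


# Hypothetical fleet: four tunnels whose keys expire at staggered times.
TUNNELS = [
    Tunnel("a-b", 15),
    Tunnel("a-c", 30),
    Tunnel("b-c", 45),
    Tunnel("c-d", 60),
]

if __name__ == "__main__":
    for minute in (0, 20, 40, 50, 60):
        # Connectivity degrades step by step as each tunnel expires unrenewed.
        print(minute, live_fraction(TUNNELS, minute, renewal_works=False))
```

In this model connectivity falls in steps as each tunnel expires, which matches the "gradually increasing" pattern described above and explains why such a failure may not be visible immediately after the deploy.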

What Are We Doing About This?

We will be expanding our automated infrastructure testing of configuration changes in the pre-deploy phase. For any questions, comments, or concerns, please reach out to support@pagerduty.com.
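As a rough illustration of what a pre-deploy check on a configuration change could look like, here is a minimal sketch. It is an assumption-laden example, not PagerDuty's actual test suite: the configuration layout and the keys `tunnels`, `renewal_enabled`, `lifetime_s`, and `renew_before_s` are hypothetical and do not correspond to any real IPSec implementation's settings.

```python
def check_tunnel_renewal(config):
    """Return a list of problems found in a candidate config (empty means OK).

    The config layout and key names are hypothetical, chosen for illustration.
    """
    problems = []
    for name, tunnel in config.get("tunnels", {}).items():
        lifetime = tunnel.get("lifetime_s")
        renew_before = tunnel.get("renew_before_s")
        if not tunnel.get("renewal_enabled", False):
            problems.append(f"{name}: renewal is disabled")
        elif lifetime is None or renew_before is None:
            problems.append(f"{name}: lifetime_s and renew_before_s are required")
        elif not 0 < renew_before < lifetime:
            problems.append(f"{name}: renew_before_s must fall inside lifetime_s")
    return problems


def pre_deploy_gate(config):
    """Allow the deploy only when the candidate config passes every check."""
    return len(check_tunnel_renewal(config)) == 0


if __name__ == "__main__":
    candidate = {
        "tunnels": {
            "a-b": {"renewal_enabled": True, "lifetime_s": 3600, "renew_before_s": 300},
            "a-c": {"renewal_enabled": False, "lifetime_s": 3600, "renew_before_s": 300},
        }
    }
    for problem in check_tunnel_renewal(candidate):
        print("blocked:", problem)
```

A static check like this only catches mistakes it was written to look for, so it would complement, not replace, testing the change against running infrastructure before it is rolled out.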

Posted Apr 30, 2018 - 20:56 UTC

Resolved
Events are now processing normally. All systems are now operational.
Posted Apr 25, 2018 - 01:22 UTC
Update
Notifications and display of incident timelines have recovered, but we are now observing a delay in event processing affecting some customers. All other systems are operational.
Posted Apr 25, 2018 - 00:56 UTC
Update
Notifications and incident details are being processed but are delayed. Our web UI has recovered. We are continuing to monitor and assess our systems' status and determine appropriate next steps.
Posted Apr 25, 2018 - 00:04 UTC
Identified
PagerDuty engineers have identified the issue and are taking remedial action.
Posted Apr 24, 2018 - 23:02 UTC
Investigating
We're currently experiencing delayed notifications and display of incident details and timeline entries. PagerDuty engineers are currently investigating.
Posted Apr 24, 2018 - 22:47 UTC