Problem with Events API
Incident Report for PagerDuty
Postmortem

Summary

On July 16, 2018, from 14:55 to 17:48 UTC, PagerDuty experienced an outage affecting event ingestion, event processing, and notification delivery. During this time, incidents could not be triggered through the Events API, and notifications were not sent.

What Happened?

During a network outage at one of our cloud infrastructure providers, part of the network was partitioned off and became unreachable. When this happened, hosts in PagerDuty's fleet that run the datastore software used by the event and notification processing pipelines lost contact with a leader node in the affected partition.

Normally, the hosts would fail over to an in-sync replica of the leader. Instead, for reasons that are not yet explained, they treated all of the in-sync replicas of the unreachable node as missing too. As a result, the event processing and notification services halted, and the Events API soon stopped accepting new events and began returning HTTP 500 responses.
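The expected failover behavior described above can be illustrated with a minimal sketch. It is not the datastore's actual implementation, which is not named in this report. The `Partition` type, the `elect_leader` function, and the node names are all hypothetical, and the reachability set is assumed to come from some external health check.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Partition:
    """Hypothetical model of one replicated unit of data."""
    leader: str
    # Replicas known to be fully caught up with the leader (assumed ordering: preference).
    in_sync_replicas: List[str] = field(default_factory=list)


def elect_leader(partition: Partition, reachable: Set[str]) -> Optional[str]:
    """Return the node that should lead, or None if no safe candidate is reachable."""
    # Keep the current leader while it can still be contacted.
    if partition.leader in reachable:
        return partition.leader
    # Otherwise promote the first in-sync replica that is still reachable.
    for node in partition.in_sync_replicas:
        if node != partition.leader and node in reachable:
            return node
    # No reachable in-sync replica: processing for this partition must halt.
    return None
```

In this sketch, losing only the leader still yields a new leader as long as one in-sync replica remains reachable. The behavior seen during the incident corresponds to the last branch: the in-sync replicas were treated as missing, so no candidate was available and processing stopped.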

Once the issue was identified, the nodes in the problematic network partition were decommissioned, and the rest of the cluster was reconfigured to take over and resume work. This allowed event and notification processing to resume.

What Are We Doing About This?

We have updated the configuration of our datastore software to enable prompt failover in the event of future major network disruptions. Investigation into the unexpected behavior of the datastore software is still ongoing. We are also continuing to make architectural changes that will improve availability and reduce the impact of network issues on our services.

We would like to express our regret for the service interruption. For any questions, comments, or concerns, please reach out to support@pagerduty.com.

Posted 19 days ago. Jul 26, 2018 - 22:45 UTC

Resolved
Our systems have fully recovered.
Posted 29 days ago. Jul 16, 2018 - 17:50 UTC
Update
Event intake has been restored. However, notifications are still delayed and in the process of recovering.
Posted 29 days ago. Jul 16, 2018 - 17:31 UTC
Update
We are continuing to monitor for any further issues.
Posted 29 days ago. Jul 16, 2018 - 16:56 UTC
Monitoring
We have seen partial recovery of event processing for events arriving through the email integration. Events sent through the Events API are still not being processed. Notifications are still not being processed. We are continuing to monitor the situation and will update as our components recover.
Posted 29 days ago. Jul 16, 2018 - 16:51 UTC
Update
We are continuing to work toward a resolution. We have not yet seen signs of recovery. We will continue to update as we have more information.
Posted 29 days ago. Jul 16, 2018 - 15:59 UTC
Identified
We are currently investigating an issue that is preventing events from processing. This is causing a delay in downstream services such as notifications. We have identified the problem and are working to remediate the situation. We will update as we have more information.
Posted 29 days ago. Jul 16, 2018 - 15:25 UTC
This incident affected: Events API and Notification Delivery.