Service Degradation
Incident Report for PagerDuty
Postmortem

Summary

On October 18, 2017, from 20:36 UTC to 22:26 UTC, one of PagerDuty’s datacenters suffered a network degradation. This resulted in delayed notifications for 20% of customers, as well as 5xx errors from our events endpoint affecting approximately 15% of customers.

What happened?

At 20:36 UTC, our monitoring systems detected an increase in TCP retransmission rates, as well as network interruptions for both ingress and egress traffic, in the affected datacenter. Approximately 10 minutes later, we initiated our incident response process.

At 21:15 UTC, our internal networking metrics showed that there were no longer any active networking problems in the affected datacenter. With the network recovered, our backlog of incoming events and outgoing notifications began to drain.

At 22:26 UTC, the remaining backlog, which affected 20% of our customers, was fully processed and all systems were operational.

What are we doing about this?

During the networking event, we found that some of our production systems, as well as some of our internal tooling, were not sufficiently resilient to this type of network degradation. While we do test network degradation on a per-host basis, we had not been testing it at the datacenter level. For our events endpoint, we will be implementing changes so that the loss of an entire datacenter does not prevent requests from reaching other datacenters. For the internal tooling that prevented us from taking certain recovery actions during the networking event, we will be investing time to run those tools in either a multi-datacenter or a failover model.
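To illustrate the kind of change described above, here is a minimal sketch of client-side failover across datacenters for event submission. The datacenter hostnames and the `send` callback are hypothetical and purely illustrative; this is not PagerDuty's actual implementation.

```python
# Hypothetical list of datacenter endpoints, tried in order.
DATACENTERS = ["dc1.events.example.com", "dc2.events.example.com"]

def submit_event(event, send, datacenters=DATACENTERS):
    """Try each datacenter in turn and return the first successful response.

    `send(host, event)` is a caller-supplied function that returns a
    response on success and raises an exception on failure (timeouts,
    5xx errors, network interruptions, and so on).
    """
    last_error = None
    for host in datacenters:
        try:
            return send(host, event)
        except Exception as err:  # this datacenter is degraded or unreachable
            last_error = err
    # Every datacenter failed; surface the last error to the caller.
    raise last_error
```

With a design like this, a full datacenter loss degrades only latency for affected requests rather than returning 5xx errors, as long as at least one other datacenter remains healthy.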

During the recovery period, we discovered a bottleneck in one of our databases. We will be looking into vertically scaling this database to ensure that we have adequate capacity to respond as quickly as possible when backlogs build up upstream.

We sincerely apologize for any inconvenience this has caused. Please contact us at support@pagerduty.com if you have any questions.

Posted Oct 31, 2017 - 20:49 UTC

Resolved
Our systems have recovered.
Posted Oct 18, 2017 - 22:29 UTC
Monitoring
We are on the path to recovery and are monitoring the situation.
Posted Oct 18, 2017 - 21:50 UTC
Identified
We are still investigating delays in notifications for a subset of accounts.
Posted Oct 18, 2017 - 21:43 UTC
Investigating
We are experiencing degradation in some of our services and are currently investigating.
Posted Oct 18, 2017 - 21:12 UTC