Issue displaying incident details
Incident Report for PagerDuty

Summary

Beginning on March 10th at 07:38 UTC, we experienced two periods of partial degradation of our web and mobile application UIs. The first period lasted 2 hours 46 minutes, and the second lasted 23 minutes. During these periods, the incident timeline details and infrastructure health components of the UI were affected.

No notifications were lost or delayed during this time. Full functionality was restored at 11:28 UTC. We sincerely apologize for any inconvenience this incident caused.

What Happened?

PagerDuty uses third-party software to deploy and manage internal services that support the PagerDuty application. This software includes systems that can automatically run services on a cluster of machines.

During the incident, the management software for one cluster entered a corrupted state. It then terminated the services running on this cluster and could not recover them correctly. The affected services included some that populate the incident timeline details and one that serves the infrastructure health component.

Our monitoring automatically detected the missing services, and PagerDuty engineers were immediately aware of the issue. However, several attempts to recover the management software were unsuccessful. The engineers eventually shut down the management software and configured the cluster manually to run the services.
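
As an illustration only, the kind of check that detects missing services can be sketched as a comparison between the services a cluster is expected to run and the services actually observed running. This is a minimal sketch under stated assumptions, not a description of PagerDuty's monitoring: the `ClusterSnapshot` type, the function names, and the service names are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ClusterSnapshot:
    """Hypothetical view of one cluster at a single point in time."""
    expected: FrozenSet[str]  # services that should be running
    running: FrozenSet[str]   # services actually observed running


def missing_services(snapshot: ClusterSnapshot) -> List[str]:
    """Return the expected services that are not running, sorted by name."""
    return sorted(snapshot.expected - snapshot.running)


def should_alert(snapshot: ClusterSnapshot) -> bool:
    """Alert whenever at least one expected service is missing."""
    return len(missing_services(snapshot)) > 0


if __name__ == "__main__":
    # Service names below are invented for illustration.
    snapshot = ClusterSnapshot(
        expected=frozenset({"incident-timeline", "infrastructure-health", "other-service"}),
        running=frozenset({"other-service"}),
    )
    print(missing_services(snapshot))
    print(should_alert(snapshot))
```

A check of this shape depends only on an independent list of expected services, so it can keep reporting missing services even when the software that manages the cluster is itself in a bad state.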

What Are We Doing About This?

Resolution was delayed by the unexpected corruption and by the difficulty of diagnosing and resolving the issues in the cluster management software. First, as an immediate fix, we performed a full reset of the management software once we had stabilized the services for our customers. Second, to reduce the impact of this issue if it recurs, we have updated our process so that we can detect and work around it quickly. Third, we will investigate the root cause and decide whether configuration changes or other fixes can be applied. Finally, we are investigating alternative options for the management software.

We would like to again apologize for any inconvenience this issue caused. If you have any questions, do not hesitate to contact us at support@pagerduty.com.

Posted Mar 14, 2017 - 20:01 UTC

Resolved
We have resolved these issues and the web and mobile apps no longer display missing data when viewing incident details. All systems are operational at this time.
Posted Mar 10, 2017 - 10:31 UTC
Identified
We are still experiencing issues affecting the display of data for incident details in our web and mobile applications. Notifications to customers remain fully operational. We are continuing our investigation and working to resolve this issue as quickly as possible.
Posted Mar 10, 2017 - 10:06 UTC
Investigating
There is currently an issue affecting the display of data for incident details in our web and mobile applications. We are actively investigating. Notifications to customers are fully operational.
Posted Mar 10, 2017 - 09:43 UTC