On March 10th, beginning at 07:38 UTC, we experienced two periods of partial degradation of our web and mobile application UIs. The first period lasted 2 hours 46 minutes, and the second lasted 23 minutes. During these periods, the incident timeline details and infrastructure health components of the UI were affected.
No notifications were lost or delayed during this time. Full functionality was restored at 11:28 UTC. We sincerely apologize for any inconvenience this incident caused.
PagerDuty uses third-party software to deploy and manage internal services that support the PagerDuty application. This software includes systems that can automatically run services on a cluster of machines.
During the incident, the management software for one cluster entered a corrupted state. It then terminated services on this cluster and could not recover them correctly. The affected services included some that populate the Incident Timeline details and one that serves the Infrastructure Health component.
Our monitoring automatically detected the missing services, and PagerDuty engineers immediately became aware of the issue. However, their attempts to recover the management software were unsuccessful, and the engineers eventually shut it down and configured the cluster manually to run the services.
Resolution was delayed by the unexpected corruption and by the difficulty of diagnosing and resolving the issues in the cluster management software. First, as an immediate fix, once we had stabilized the services for our customers, we performed a full reset of the management software. Second, to reduce the impact of this issue if it recurs, we have updated our process so that we can detect and work around it quickly. Third, we will investigate the root cause of this issue and decide whether there are configuration changes or other fixes that can be applied. Finally, we are also investigating alternative options for the management software.
We would like to again apologize for any inconvenience this issue caused. If you have any questions, do not hesitate to contact us at email@example.com.