On March 1st, 2017, starting at 16:59 UTC, PagerDuty suffered a service degradation lasting 22 minutes. During this time, approximately 0.043% of notifications were delayed, and no alerts were lost.
The degradation was caused by an update we released to one of the services powering our event pipeline. The release contained a bug that our testing did not detect. The bug caused errors in the application and prevented events from being processed correctly. As soon as the issue was discovered, we reverted the change, and event processing immediately started to recover.
We will be increasing our test coverage and improving the testing stages of our deployment pipeline in order to detect these types of failures before they reach our production systems. Additionally, we will be reassessing our rollback strategy in order to mitigate long recovery times when reverting a recent deployment becomes necessary.
We apologize to all of our customers for any inconvenience that this delay may have caused. If there are any questions, comments, or concerns regarding this issue, please reach out to us at firstname.lastname@example.org.