On Oct 27, 2020, at 6:26 PM UTC, PagerDuty experienced a major incident due to database replication lag. The lag delayed the appearance of up-to-date information in the UI and caused on-call handoff notifications (OCHONs) to be sent in the middle of some shifts rather than at the beginning or end. The underlying cause of the lag was load placed on the primary database server by a one-time job to realign schedules for the then-upcoming shift off of daylight saving time.
A script to fix schedules for DST-related changes created greater-than-expected load on our infrastructure, which resulted in delayed replication. The delayed replication further exacerbated the issue by creating a gap in how we measured the start and end times of shifts altered by the script. To restore functionality, we canceled the script and brought in more hosts to reduce replication lag.
We are currently addressing multiple contributing factors for this issue. The steps that are planned or already in progress are:
We’d like to apologize for the impact that this had on our customers. If you have any further questions, please reach out to support@pagerduty.com.