Teem by Eptura Detailed Root Cause Analysis | April 11, 2024
S2 Google Calendar Service not Synchronizing
We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
Customers using the Google Calendar service found that calendar events were not syncing automatically. During the incident, a workaround was provided: forcing a manual sync, which updated the calendars.
Type of Event:
Functionality Issue
Services/Modules impacted:
Production / Google Calendar Service
Timeline (Reported MST):
At approximately 3:50pm on April 11, 2024, multiple customers reported that their Google Calendar Service was not automatically syncing calendar events. Customers were given a temporary workaround to manually force a sync of their calendars, and all customers were notified of the Severity 2 incident via the Teem Status Page. The investigation continued through April 19, 2024, when the CloudOps team identified the root cause of the issue. On April 22, 2024, at approximately 11:08am, all customers were notified via the Status Page that the fix had been implemented and that the incident had moved into a monitoring phase. After continuous monitoring, with no additional reports of Google Calendar sync issues and with customers confirming that their calendar events were syncing automatically, the Severity 2 incident was marked as resolved on April 29, 2024, at 10:23am.
Total Duration of Event:
17 days, 18 hours, 33 minutes
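The duration above follows from the first report and the resolution timestamps given in the timeline. As a minimal sketch of the arithmetic, assuming both timestamps are in the same time zone (the variable names are illustrative only):

```python
from datetime import datetime

# Timestamps taken from the timeline; both are treated as the same local zone.
first_report = datetime(2024, 4, 11, 15, 50)   # April 11, 2024, ~3:50pm
resolved = datetime(2024, 4, 29, 10, 23)       # April 29, 2024, 10:23am

elapsed = resolved - first_report

# timedelta stores whole days plus the leftover seconds within the last day.
days = elapsed.days
hours, leftover_seconds = divmod(elapsed.seconds, 3600)
minutes = leftover_seconds // 60

duration_text = f"{days} days, {hours} hours, {minutes} minutes"
print(duration_text)
```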
Root Cause:
We observed that the PgBouncer and PgBouncer_ro services could not run simultaneously on the job managers. Because of how the startup script launched them, it was unclear which of the two services was running at any given time; in effect, the last service to start "won". After an instance restart, a different service could win, causing further inconsistency. We also discovered that three of our job managers were running outdated code.
Remediation:
The two services shared a Unix socket directory. Giving each service its own Unix socket directory allowed both to run simultaneously and removed the inconsistency. This eliminated significant errors on the job managers.
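The shared socket directory conflict can be illustrated with a small check. This is a hedged sketch, not the tooling our team used: it assumes each PgBouncer instance has an INI-style configuration with `unix_socket_dir` and `listen_port` in its `[pgbouncer]` section, that both instances used the same listen port, and that the socket file is named `.s.PGSQL.<port>` inside the socket directory. The directory paths, port, and helper names (`socket_path`, `find_collisions`) are hypothetical.

```python
import configparser
import posixpath
from collections import defaultdict

# Hypothetical configurations; the real paths and ports are not in this report.
RW_CONFIG = """
[pgbouncer]
listen_port = 6432
unix_socket_dir = /var/run/pgbouncer
"""

RO_SHARED = """
[pgbouncer]
listen_port = 6432
unix_socket_dir = /var/run/pgbouncer
"""

RO_SEPARATE = """
[pgbouncer]
listen_port = 6432
unix_socket_dir = /var/run/pgbouncer_ro
"""


def socket_path(ini_text):
    """Return the Unix socket path one instance would listen on."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser["pgbouncer"]
    # Fall back to PgBouncer's documented defaults when a setting is absent.
    directory = section.get("unix_socket_dir", "/tmp")
    port = section.get("listen_port", "6432")
    return posixpath.join(directory, f".s.PGSQL.{port}")


def find_collisions(configs):
    """Map each socket path claimed by more than one service to those services."""
    claimed = defaultdict(list)
    for service_name, ini_text in configs.items():
        claimed[socket_path(ini_text)].append(service_name)
    return {path: names for path, names in claimed.items() if len(names) > 1}


before = find_collisions({"pgbouncer": RW_CONFIG, "pgbouncer_ro": RO_SHARED})
after = find_collisions({"pgbouncer": RW_CONFIG, "pgbouncer_ro": RO_SEPARATE})
print("before fix:", before)
print("after fix:", after)
```

Under these assumptions, the first call reports both services claiming the same socket path, while the second reports no collisions once each service has its own directory. A check of this kind could run before service startup to catch the conflict early.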
Preventative Action:
Our team is dedicated to continuously improving the Google Calendar Service by enhancing our current processes and implementing robust monitoring systems. We appreciate your patience and cooperation during this disruption.