Teem by Eptura detailed Root Cause Analysis | September 5, 2024
S2 | Office365 & Google Calendar | Reservations not syncing
We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
Some TEEM customers experienced difficulties syncing calendar events with both Office365 and Google Calendar. Calendar events were not syncing automatically from the source calendar to TEEM, unless a manual Force Sync was initiated. In some cases, even using Force Sync did not resolve the issue, leading to confusion when using the reservation module.
While both Office365 and Google Calendar showed similar symptoms of calendar syncing failures, the underlying causes were different for each service and required separate solutions to resolve.
Type of Event:
Functionality Issue
Services/Modules impacted:
Calendar Service / Production
Timeline:
September 5, 2024, Reported MDT
10:30 AM: Customers began to report that they were unable to sync their calendar events for Google (410 errors) and O365 (Missing Initialization errors). An internal Fire Alarm was raised, and all customers were notified via the status page that we were investigating the issue.
September 6, 2024, Reported MDT
11:53 AM: The engineering team identified the issue and continued to work toward a resolution. Customers were notified that the incident had moved from the investigating phase to the identified phase. In the meantime, engineers mitigated the issue by running a script four times a day to manually sync calendars for all customers who had reported the problem. This was intended to improve the reliability of calendar events until full resolution.
September 10, 2024, Reported MDT
9:34 AM: All customers were notified that a solution had been implemented for the sync issue affecting Google (410 errors) and Office365 (Missing Initialization errors). To ensure stability and performance, our engineering team oversaw the process to confirm that the issue had been fully resolved, with monitoring continuing over the following several days.
September 17, 2024, Reported MDT
8:02 AM: All customers were notified that our engineering team had completed the necessary actions and verified that the service was functioning normally, as no additional customers had reported issues related to Google (410 errors) or O365 (Missing Initialization errors). The status page was updated to a resolved state.
Total Duration of Event:
11 days, 21 hours, 32 minutes
Office365 Root Cause: The issue occurred due to two concurrent requests attempting to refresh the Office 365 access token at the same time. This created a situation where the system, under certain conditions, returned a null token (None), which was then passed to the API client. As a result, calendar syncing was interrupted.
Office365 Remediation:
We have updated the system to ensure that, when multiple requests are made, the current access token is used if it’s valid. This prevents the token from being set to None and ensures the API client always receives a valid token.
Office365 Preventative Measures: In addition to the fix, we’ve ensured that the system will no longer return a null token in any situation. We have also added logging to monitor the token refresh process closely, allowing us to better detect and resolve any future issues quickly.
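The behavior described in the Office365 remediation and preventative measures can be illustrated with a minimal Python sketch. It is not Teem's actual code: the names `Token`, `TokenManager`, and `fetch_new_token` are hypothetical, and the use of a thread lock to serialize refreshes is an assumption made for illustration. The sketch reuses the current token while it is valid, logs each refresh, and raises an error rather than handing `None` to the caller.

```python
import logging
import threading
import time
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("token_refresh")


@dataclass
class Token:
    """Hypothetical access token record (not a real Office 365 type)."""
    value: str
    expires_at: float  # Unix timestamp, in seconds


class TokenManager:
    """Hypothetical holder that serializes refreshes and never returns None."""

    def __init__(self, fetch_new_token: Callable[[], Optional[Token]],
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch_new_token  # assumed callable that contacts the token endpoint
        self._clock = clock
        self._lock = threading.Lock()
        self._token: Optional[Token] = None

    def _is_valid(self, token: Optional[Token]) -> bool:
        return token is not None and token.expires_at > self._clock()

    def get_token(self) -> str:
        # Only one caller at a time may inspect or refresh the token.
        with self._lock:
            # A concurrent request may already have refreshed it: reuse it.
            if self._is_valid(self._token):
                return self._token.value
            logger.info("Access token missing or expired; refreshing")
            new_token = self._fetch()
            if new_token is None or not new_token.value:
                # Fail loudly instead of passing None to the API client.
                logger.error("Token refresh returned no token")
                raise RuntimeError("token refresh returned no token")
            self._token = new_token
            return new_token.value
```

With this shape, two requests that arrive at the same moment trigger a single refresh. The second request waits for the lock and then receives the token the first one stored.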
Google Calendar Root Cause: The system was making multiple attempts to delete the same event from Google Calendar, even when that event had already been deleted. Each repeated attempt caused a 410 error, indicating the resource was no longer available. The issue occurred because the system did not verify whether an event still existed before attempting to delete it.
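One common way to make a delete tolerant of this condition is to treat HTTP 410 (Gone) as confirmation that the event is already deleted. The Python sketch below illustrates that idea only and is not the fix Teem deployed. `ApiError`, `delete_event_idempotent`, and the `delete_call` parameter are hypothetical names, and no real Google client library is used.

```python
from typing import Callable

HTTP_GONE = 410  # standard HTTP status: the resource no longer exists


class ApiError(Exception):
    """Hypothetical error carrying the HTTP status of a failed API call."""

    def __init__(self, status: int) -> None:
        super().__init__(f"API call failed with HTTP {status}")
        self.status = status


def delete_event_idempotent(delete_call: Callable[[str], None], event_id: str) -> bool:
    """Delete an event, treating 'already deleted' as a non-error.

    Returns True if this call deleted the event, and False if the event
    was already gone. Any other failure is re-raised to the caller.
    """
    try:
        delete_call(event_id)  # assumed wrapper around the calendar delete request
        return True
    except ApiError as err:
        if err.status == HTTP_GONE:
            return False  # already deleted: nothing left to do, no retry needed
        raise
```

A caller can use the boolean result to stop scheduling further delete attempts for that event, rather than retrying a request that can never succeed.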
Google Calendar Remediation:
Google Calendar Preventative Measure: To prevent similar issues in the future, Google watchers will be updated using dedicated cron jobs, ensuring synchronization happens in a controlled and consistent manner. The lock mechanism will also ensure that API calls are handled sequentially and without conflict.