Teem by Eptura detailed Root Cause Analysis | August 20, 2024
S2 O365 Multiple Users Marked Inactive
We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
Customers using O365 were unable to access the Teem platform. When trying to log in, customers were met with an error message.
Type of Event:
Functionality Issue
Services/Modules impacted:
Production / Office 365
Timeline (Reported MST):
On August 20, 2024, at approximately 8:20 AM MST, our support team began receiving reports from end users who had been inadvertently marked as Inactive in the Teem platform. We promptly informed all Teem customers of the incident via our Status Page.
At 10:24 AM MST, we marked the Status Page as resolved. However, recognizing the importance of addressing this issue thoroughly, our engineering team continued to collaborate closely with our Support team to find a comprehensive solution. To prevent further impact, the engineering team implemented a nightly script to revert the status of users marked as Inactive.
The investigation continued until November 1, 2024, when our engineering team released an initial HotFix to address the issue. Although this HotFix did not fully resolve the problem, the team enhanced our logging capabilities to better track and understand the behavior.
On November 11, 2024, the additional logs provided valuable insights, enabling our engineering team to resolve the issue in our QA environment. The final HotFix was released on November 14, 2024, and customers have confirmed the resolution.
Total Duration of Event:
83 Days
Root Cause:
The issue arose from the concurrent processing of multiple user batches. During this process, one thread completed its task earlier than expected and inadvertently deleted all cache keys, including the main Sync_Key and associated batch_keys. This led to subsequent threads receiving an empty batch list, which resulted in the deprovisioning or deactivation of users.
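The failure mode described above can be illustrated with a simplified, single-threaded simulation. This is a sketch under stated assumptions, not the Teem implementation: the cache is modeled as a plain dictionary, and the function names, key names, and user names are all hypothetical. It only shows how clearing every cache key after one batch leaves the next batch with an empty list.

```python
# Simplified model of the race: one worker finishes first and clears the whole cache.
# All names here (sync_batch, buggy_cleanup, the keys, the users) are hypothetical.

def sync_batch(batch_key, expected_users, cache, directory):
    """Mark each expected user Active if found in the cached batch, else Inactive."""
    synced = set(cache.get(batch_key, []))  # a missing key yields an empty batch list
    for user in expected_users:
        directory[user] = "Active" if user in synced else "Inactive"

def buggy_cleanup(cache):
    """Deletes every key, including the main sync key and other workers' batch keys."""
    cache.clear()

def simulate():
    cache = {
        "sync_key": ["batch_1", "batch_2"],
        "batch_1": ["alice", "bob"],
        "batch_2": ["carol", "dave"],
    }
    directory = {}
    # Worker 1 completes its batch early, then wipes all cache keys.
    sync_batch("batch_1", ["alice", "bob"], cache, directory)
    buggy_cleanup(cache)
    # Worker 2 now reads an empty batch list and deactivates its users.
    sync_batch("batch_2", ["carol", "dave"], cache, directory)
    return directory
```

Calling `simulate()` leaves the first batch's users Active and the second batch's users Inactive, even though all four users were present in the original sync data.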
Remediation:
To resolve this issue, we have implemented database row-level locking for batch processing. This ensures that batches are processed sequentially rather than concurrently, avoiding conflicts between threads.
Preventative Action:
To prevent recurrence, we have made the following improvements:
Added extensive logging in CloudWatch to monitor and better understand process behavior.
Streamlined the user synchronization process to ensure that all O365 users are successfully synced with the Teem directory.
These enhancements will significantly improve the reliability and performance of our system. Thank you for your patience and support as we continue to make these improvements.