S2 - Inability to Access Teem using O365 SSO
Incident Report for Teem
Postmortem

Teem by Eptura detailed Root Cause Analysis | August 20, 2024 

S2 O365 Multiple Users Marked Inactive 

 

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident. 

 

Description: 

Customers using O365 single sign-on (SSO) were unable to access the Teem platform. When trying to log in, they were met with an error message.

 

Type of Event: 

Functionality Issue 

 

Services/Modules impacted: 

Production / Office 365

 

Timeline (Reported MDT):

On the morning of August 20, 2024, at approximately 8:20 AM MDT, our support team began receiving reports of end users who had been inadvertently marked as Inactive in the Teem platform. We promptly informed all Teem customers of the incident via our Status Page.

At 10:24 AM MDT, we marked the Status Page as resolved. However, recognizing the importance of addressing this issue thoroughly, our engineering team continued to collaborate closely with our Support team to find a comprehensive solution. To prevent further impact, the engineering team implemented a nightly script to revert the status of users marked as Inactive.

The investigation continued until November 1, 2024, when our engineering team released an initial HotFix. Although this HotFix did not fully resolve the problem, the team also enhanced our logging capabilities to better track and understand the behavior.

On November 11, 2024, the additional logs provided valuable insights, enabling our engineering team to resolve the issue in our QA environment. The final HotFix was released on November 14, 2024, and customers have confirmed the resolution. 

 

Total Duration of Event: 

83 Days 

 

Root Cause:  

The issue arose from the concurrent processing of multiple user batches. During this process, one thread completed its task earlier than expected and inadvertently deleted all cache keys, including the main Sync_Key and associated batch_keys. This led to subsequent threads receiving an empty batch list, which resulted in the deprovisioning or deactivation of users. 
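
The failure mode described above can be illustrated with a small, simplified model. This is a sketch only, not Teem's production code: the cache layout, the function names (`finish_batch_buggy`, `users_to_deactivate`), and the sample users are all hypothetical, and the interleaving of the two workers is replayed sequentially so the outcome is easy to follow.

```python
# Hypothetical model of the failure. Names and data are assumptions
# made for illustration, not taken from the Teem codebase.

def finish_batch_buggy(cache, batch_key):
    """Buggy worker: on completion it clears every cache key, not just its own."""
    processed = cache.get(batch_key, [])
    cache.clear()  # wipes Sync_Key and all batch_keys
    return processed


def users_to_deactivate(directory_users, synced_users):
    """Anyone in the directory but missing from the synced set gets deactivated."""
    return sorted(set(directory_users) - set(synced_users))


cache = {
    "Sync_Key": ["batch_1", "batch_2"],
    "batch_1": ["alice", "bob"],
    "batch_2": ["carol", "dave"],
}
directory = ["alice", "bob", "carol", "dave"]

# Worker 1 finishes early and clears the whole cache.
seen_by_worker_1 = finish_batch_buggy(cache, "batch_1")

# Worker 2 starts afterwards and receives an empty batch list.
seen_by_worker_2 = cache.get("batch_2", [])

synced = seen_by_worker_1 + seen_by_worker_2
wrongly_deactivated = users_to_deactivate(directory, synced)
print(wrongly_deactivated)  # ['carol', 'dave']
```

In this model, the users in the second batch are still valid O365 users, but because their batch key was deleted before it was read, they look absent from the sync and are marked Inactive.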

 

Remediation: 

To resolve this issue, we have implemented database row-level locking for batch processing. This ensures that batch processing happens sequentially, avoiding conflicts. Key updates include: 

  1. Introduced a dedicated table to track batch process counts and the ID of the last processed batch. 
  2. Applied database row-level locks to manage synchronization safely and efficiently. 
  3. Updated the deprovisioning process to occur only after verifying that all batches are fully processed. 
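
The three updates above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `SyncTracker` is a hypothetical in-memory stand-in for the dedicated tracking table, and a `threading.Lock` stands in for the database row-level lock (in a SQL database, this role is typically played by something like `SELECT ... FOR UPDATE` on the tracking row).

```python
import threading


class SyncTracker:
    """Hypothetical in-memory stand-in for the dedicated tracking table."""

    def __init__(self, total_batches):
        self.total_batches = total_batches
        self.processed_count = 0   # how many batches have been processed
        self.last_batch_id = None  # ID of the last processed batch
        self.synced_users = set()
        # Stand-in for a database row-level lock on the tracking row.
        self._row_lock = threading.Lock()

    def record_batch(self, batch_id, users):
        """Record one batch; return True only for the call that completes the sync."""
        with self._row_lock:
            self.synced_users.update(users)
            self.processed_count += 1
            self.last_batch_id = batch_id
            return self.processed_count == self.total_batches


def run_sync(directory_users, batches):
    """Process batches concurrently; deprovision only after all are recorded."""
    tracker = SyncTracker(total_batches=len(batches))
    deactivated = []

    def worker(batch_id, users):
        if tracker.record_batch(batch_id, users):
            # Every batch is recorded by now, so the synced set is complete.
            missing = set(directory_users) - tracker.synced_users
            deactivated.extend(sorted(missing))

    threads = [
        threading.Thread(target=worker, args=(batch_id, users))
        for batch_id, users in batches.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return deactivated, tracker


directory = ["alice", "bob", "carol", "dave", "erin"]
batches = {1: ["alice", "bob"], 2: ["carol", "dave"]}
deactivated, tracker = run_sync(directory, batches)
print(deactivated)  # ['erin']
```

Whichever worker finishes first, no worker can trigger deprovisioning early: only the call that brings the processed count up to the total proceeds, so only users who are absent from every batch are deactivated.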

 

Preventative Action:

To prevent recurrence, we have made the following improvements:

  • Enhanced concurrency handling to ensure seamless user batch synchronization.
  • Added extensive logging in CloudWatch to monitor and better understand process behavior.
  • Streamlined the user synchronization process to ensure that all O365 users are successfully synced with the Teem directory.

These enhancements will significantly improve the reliability and performance of our system. Thank you for your patience and support as we continue to make these improvements.

Posted Jan 07, 2025 - 12:25 MST

Resolved
We have found this issue to be mitigated. It is now affecting only a small subset of customers, whom we are working diligently to support while we find a resolution. If you are experiencing issues, please don't hesitate to reach out to Support.
Posted Aug 20, 2024 - 10:24 MDT
Update
We are continuing to investigate this issue.
Posted Aug 20, 2024 - 08:30 MDT
Investigating
We are currently investigating an issue with Teem. We will update you when we have more information.
Posted Aug 20, 2024 - 08:25 MDT