Teem by Eptura detailed Root Cause Analysis | 3/4/2025
S1 – Teem Inaccessible
We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
(The Incident is logged in MST)
On March 4th, 2025, the production application experienced a performance issue caused by high server load. The issue lasted for approximately four days. Upon investigation, it was determined that multiple duplicate events were being generated, resulting in an excessive number of database entries and slowing application response times. The situation escalated, ultimately leading to a system outage and the activation of a fire alarm.
Type of Event:
Outage
Services/Modules impacted:
Timeline:
The timeline is posted in MST.
7:34 AM – Received reports that some users were unable to access app.teem.com.
10:19 AM – Engineering began clearing tables to help alleviate the performance load.
11:29 AM – Severity lowered to S2, as the website was up and accessible.
10:17 PM – Table clearing completed; performance restored to full.
3:49 AM – Website confirmed to be working as intended.
9:17 AM – Issue resolved; entered a two-hour monitoring phase.
11:00 AM – Status page cleared.
Total Duration of Event:
27 Hours
Root cause:
Duplicate event generation: The event processing logic did not prevent the creation of redundant events.
Delayed event processing: Recent codebase changes that integrated Graph API calls extended the event processing time.
Push Callback Timing Issue:
When an event is created in our database, a corresponding call is made to Outlook to create the event on their end.
The increased processing time due to the Graph API changes caused push callbacks to arrive before the original event creation process was completed.
This resulted in the original event remaining incomplete and missing updates from Outlook.
A scheduled interval call exists to update missed events, but the frequent push callbacks led to the original event being updated only after multiple duplicates were created.
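The race described above can be sketched in a few lines. This is a simplified illustration, not Teem's actual code: the store, function names, and the `-dup` key convention are all hypothetical. It shows how a push callback that arrives before the original creation flow commits cannot find the pending event, and so writes a duplicate entry instead of updating the original.

```python
# Illustrative sketch of the duplicate-event race (all names hypothetical).
events = {}  # event store keyed by the external (Outlook) event id


def create_event(external_id, payload):
    """Original creation flow: slowed down by the added Graph API calls,
    so it commits to the store later than it used to."""
    # ... Graph API calls happen here, delaying the commit ...
    events[external_id] = {"payload": payload, "complete": True}


def handle_push_callback(external_id, update):
    """Push callback from Outlook; fires as soon as the event exists there."""
    if external_id in events:
        events[external_id]["payload"].update(update)  # normal path
    else:
        # Race: the original flow has not committed yet, so the callback
        # creates a duplicate entry for the same logical event.
        events[f"{external_id}-dup"] = {"payload": update, "complete": False}


# The callback arrives before the slow creation flow finishes:
handle_push_callback("evt-1", {"room": "4A"})
create_event("evt-1", {"title": "Standup"})
print(len(events))  # 2 -> two rows exist for one logical event
```

Each additional early callback would add another duplicate row, which is how the table growth and slowdown compounded.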
Remediation:
To address the issue, we introduced a locking mechanism, implemented as a cache entry, that prevents a push callback from creating an event before the original event-creation flow has completed. As a fallback, if the locking mechanism fails, the original event is updated with the push callback data rather than a duplicate being created. We are also working on sending a unique payload key to Outlook that can be compared within push callbacks, ensuring more accurate and timely event updates.
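A minimal sketch of the locking approach, under stated assumptions: the in-memory set stands in for the shared cache, and the function and variable names are illustrative, not the actual implementation. While the lock is held, callback data is deferred and merged into the original event once creation completes, rather than creating a new row.

```python
# Hypothetical sketch of the cache-based lock (names are illustrative).
creation_locks = set()   # in production this would be a shared cache entry with a TTL
events = {}              # event store keyed by external id
deferred_updates = {}    # callback data held back while the lock is active


def begin_creation(external_id):
    """Take the lock before the (slow) creation flow starts."""
    creation_locks.add(external_id)


def finish_creation(external_id, payload):
    """Commit the event, release the lock, and merge any deferred callback."""
    events[external_id] = {"payload": payload}
    creation_locks.discard(external_id)
    if external_id in deferred_updates:
        events[external_id]["payload"].update(deferred_updates.pop(external_id))


def handle_push_callback(external_id, update):
    if external_id in creation_locks:
        # Lock held: do NOT create an event; stash the update instead.
        deferred_updates[external_id] = update
        return
    if external_id in events:
        events[external_id]["payload"].update(update)


begin_creation("evt-1")
handle_push_callback("evt-1", {"room": "4A"})   # arrives early, gets deferred
finish_creation("evt-1", {"title": "Standup"})
print(events)  # a single event carrying both the original and callback data
```

The key design point is that the callback handler never writes a new row while the lock exists; worst case, the update is applied slightly later by the merge step.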
Preventative Action:
To enhance system reliability, we have improved logging and monitoring for push callback processing, enabling early detection of any anomalies. Additionally, we have implemented a mechanism to suppress the first push callback for an event, ensuring that the original event creation flow completes smoothly and without interference.