Teem by Eptura Detailed Root Cause Analysis | 2/25/2025
S1 – Duplicate Events causing downtime
We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
(The Incident is logged in MST)
On February 22, 2025, the production application encountered a performance issue due to a high server load. This issue lasted for approximately four days. Upon investigation, it was determined that multiple duplicate events were being generated, resulting in an excessive number of database entries and a slowdown in application response times. The situation escalated, ultimately leading to a system outage and the activation of a fire alarm.
Type of Event:
Outage
Services/Modules impacted:
app.teem.com (Teem web application) and Outlook calendar event syncing.
Timeline:
The timeline is posted in MST.
Around 8:58 AM: We received reports of slowed performance and an inability to access app.teem.com.
8:59 AM: Our Engineering team reached out to the AWS team to address the issue.
10:30 AM: We escalated the issue to an S1.
10:45 AM: The site became accessible again, though with slowness.
12:52 PM: We resolved the issue and entered a monitoring phase.
2:02 PM: We cleared the status page.
Total Duration of Event:
Approximately 4 hours (8:58 AM to 12:52 PM MST on February 22, 2025)
Root cause:
Duplicate event generation: The event processing logic did not prevent the creation of redundant events.
Delayed event processing: Recent codebase changes that integrated Graph API calls extended the event processing time.
Push Callback Timing Issue:
When an event is created in our database, a corresponding call is made to Outlook to create the event on their end.
The increased processing time due to the Graph API changes caused push callbacks to arrive before the original event creation process was completed.
This resulted in the original event remaining incomplete and missing updates from Outlook.
A scheduled interval call exists to update missed events, but the frequent push callbacks led to the original event being updated only after multiple duplicates were created (illustrated in the sketch below).
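For illustration only, the sketch below shows this race condition in simplified Python. The names used here (events_db, create_event, handle_push_callback, call_graph_api_create) are hypothetical stand-ins for the real components: because the Graph API call lengthens the creation flow, a push callback can arrive while the original record is still incomplete, find no matching event, and create a duplicate.

```python
# Illustrative sketch only, not Teem's actual code.
import time
import uuid

events_db = {}  # stands in for the production events table


def call_graph_api_create(details):
    # Placeholder for the Microsoft Graph API call that creates the Outlook event.
    time.sleep(3)  # simulates the longer processing time after the Graph API changes
    return "outlook-" + str(uuid.uuid4())


def create_event(details):
    """Original event creation flow."""
    event_id = str(uuid.uuid4())
    # A partial record exists before the Outlook call completes.
    events_db[event_id] = {"details": details, "outlook_id": None, "complete": False}

    # The slower Graph API call leaves a window in which a push callback can arrive.
    outlook_id = call_graph_api_create(details)

    # Only now is the original record completed.
    events_db[event_id].update({"outlook_id": outlook_id, "complete": True})
    return event_id


def handle_push_callback(outlook_id, payload):
    """Push callback from Outlook, which may fire while create_event is still running."""
    for event in events_db.values():
        if event["outlook_id"] == outlook_id:
            event["details"] = payload  # normal path: update the existing event
            return
    # The original event is still incomplete (its outlook_id is None), so no match
    # is found and a duplicate entry is created instead.
    events_db[str(uuid.uuid4())] = {
        "details": payload,
        "outlook_id": outlook_id,
        "complete": True,
    }
```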
Remediation:
To address the issue, we introduced a locking mechanism, implemented as a cache entry that prevents a push callback from creating an event before the original event creation flow has completed (a simplified sketch follows below). We also handled the scenario in which the locking mechanism fails: in that case, the original event is updated with the push callback data rather than a duplicate being created. In addition, we are working on sending a unique payload key to Outlook so it can be compared within push callbacks, ensuring more accurate and timely event updates.
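As a rough illustration of this remediation, the following is a minimal sketch assuming a Redis-style cache; the key names, TTL, and helper functions are hypothetical and do not represent our production implementation.

```python
# Minimal sketch of the cache-based lock described above (assumptions only).
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
LOCK_TTL_SECONDS = 60  # assumed upper bound on the original creation flow


def acquire_creation_lock(event_key: str) -> bool:
    # SET with nx=True only succeeds if no lock exists yet for this event.
    return bool(cache.set(f"event-lock:{event_key}", "locked", nx=True, ex=LOCK_TTL_SECONDS))


def release_creation_lock(event_key: str) -> None:
    cache.delete(f"event-lock:{event_key}")


def create_event(event_key: str, details: dict) -> None:
    acquire_creation_lock(event_key)
    try:
        # ... write the local record and call the Graph API to create the
        # event in Outlook (the slow part) ...
        pass
    finally:
        release_creation_lock(event_key)


def handle_push_callback(event_key: str, payload: dict) -> None:
    if cache.exists(f"event-lock:{event_key}"):
        # The original creation flow is still running: do not create a new
        # event; defer the update so it is applied once the lock clears.
        defer_callback(event_key, payload)
        return
    # Fallback path: if no lock is present (including the case where locking
    # failed), the original event is updated with the callback data rather
    # than a duplicate being created.
    update_original_event(event_key, payload)


def defer_callback(event_key: str, payload: dict) -> None:
    # Placeholder: queue the callback for later processing.
    pass


def update_original_event(event_key: str, payload: dict) -> None:
    # Placeholder: apply the push callback data to the existing event record.
    pass
```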
Preventative Action:
To enhance system reliability, we have improved logging and monitoring for push callback processing, enabling early detection of anomalies. We have also implemented a mechanism that prevents processing of the first push callback for an event, ensuring that the original event creation flow completes smoothly and without hindrance (see the illustrative sketch below).
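As a rough illustration, the sketch below shows one possible shape for this behaviour; the logger name, cache key, TTL, and helper function are assumptions rather than a description of our production code.

```python
# Illustrative sketch only: logging around push-callback processing and skipping
# the first callback per event so the creation flow can finish undisturbed.
import logging
import redis

logger = logging.getLogger("push_callbacks")
cache = redis.Redis(host="localhost", port=6379, db=0)


def handle_push_callback(event_key: str, payload: dict) -> None:
    # Record the first callback for this event and skip processing it; later
    # callbacks (or the scheduled interval call) carry the update forward.
    is_first = cache.set(f"first-callback:{event_key}", "1", nx=True, ex=300)
    if is_first:
        logger.info("Skipping first push callback for event %s", event_key)
        return

    logger.info("Processing push callback for event %s", event_key)
    try:
        apply_callback_update(event_key, payload)
    except Exception:
        # Failures are logged so monitoring can surface anomalies early.
        logger.exception("Push callback processing failed for event %s", event_key)
        raise


def apply_callback_update(event_key: str, payload: dict) -> None:
    # Placeholder: update the stored event with the callback data.
    pass
```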