S1 - App.teem.com degraded performance

Incident Report for Teem

Postmortem

Teem by Eptura detailed Root Cause Analysis | 2/25/2025 

S1 – Duplicate Events causing downtime 

 

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.  

 

Description: 

(The Incident is logged in MST) 

On February 22, 2025, the production application encountered a performance issue due to a high server load. This issue lasted for approximately four days. Upon investigation, it was determined that multiple duplicate events were being generated, resulting in an excessive number of database entries and a slowdown in application response times. The situation escalated, ultimately leading to a system outage and the activation of a fire alarm. 

 

Type of Event: 

Outage 

 

Services/Modules impacted: 

App.Teem.Com/Calendaring 

 

Timeline: The timeline is posted in MST. 

  • Around 8:58 AM: We received notice of some instances of slowness and of users being unable to access app.teem.com.

  • 8:59 AM: Our Engineering team reached out to the AWS team to address this.

  • 10:30 AM: We moved the issue to an S1.

  • 10:45 AM: The site became accessible again, but with slowness.

  • 12:52 PM: We resolved the issue and went into a monitoring phase.

  • 2:02 PM: We cleared the status page.

 

Total Duration of Event: 

4 Hours 

 

Root cause:  

 

  • Duplicate event generation: The event processing logic did not prevent the creation of redundant events. 

  • Delayed event processing: Recent codebase changes that integrated Graph API calls extended the event processing time. 

  • Push callback timing issue:

      • When an event is created in our database, a corresponding call is made to Outlook to create the event on their end.

      • The increased processing time due to the Graph API changes caused push callbacks to arrive before the original event creation process was completed.

      • This resulted in the original event remaining incomplete and missing updates from Outlook.

      • A scheduled interval call exists to update missed events, but the frequent push callbacks led to the original event being updated only after multiple duplicates were created.
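
The ordering problem described above can be illustrated with a small, deterministic sketch. It is not Teem's code: the `Event` and `EventStore` types, the `handle_push_callback` function, and the `outlook-123` identifier are all hypothetical, and the handler is assumed to match events only by the ID that Outlook returns. Under those assumptions, a callback that arrives before the creation flow has stored the Outlook ID finds no match and inserts a second row.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    local_id: int
    title: str
    outlook_id: Optional[str] = None  # stored only when the creation flow completes


@dataclass
class EventStore:
    events: List[Event] = field(default_factory=list)

    def find_by_outlook_id(self, outlook_id: str) -> Optional[Event]:
        return next((e for e in self.events if e.outlook_id == outlook_id), None)

    def add(self, title: str, outlook_id: Optional[str] = None) -> Event:
        event = Event(len(self.events) + 1, title, outlook_id)
        self.events.append(event)
        return event


def handle_push_callback(store: EventStore, outlook_id: str, title: str) -> None:
    """Naive handler (assumed): insert a row when no event has this Outlook ID."""
    if store.find_by_outlook_id(outlook_id) is None:
        store.add(title, outlook_id)


def simulate(callback_arrives_early: bool) -> int:
    """Return the number of rows stored for a single logical event."""
    store = EventStore()
    original = store.add("Standup")  # local row exists, Outlook ID not yet known
    if callback_arrives_early:
        # The slow creation flow is still in flight when the callback lands.
        handle_push_callback(store, "outlook-123", "Standup")
    original.outlook_id = "outlook-123"  # creation flow finally completes
    if not callback_arrives_early:
        handle_push_callback(store, "outlook-123", "Standup")
    return len(store.events)
```

With `callback_arrives_early=False` the handler finds the original row and nothing is added; with `True` the same logical event ends up stored twice.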

 

 

Remediation: 

To address the issue, we introduced a cache-based locking mechanism that prevents a push callback from creating an event before the original event creation flow has completed. As a fallback, if the locking mechanism fails, the original event is updated with the push callback data. We are also working on sending a unique payload key to Outlook and comparing it within push callbacks, to ensure more accurate and timely event updates.
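
A minimal sketch of a cache-based lock with an update fallback is shown below, assuming an in-memory dictionary as a stand-in for the shared cache. The names `CreationLockCache`, `handle_push_callback`, and `correlation_key` are illustrative assumptions, not the production implementation, and how the key is derived is left open.

```python
import time
from typing import Callable, Dict


class CreationLockCache:
    """In-memory stand-in for a shared cache holding short-lived creation locks."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def acquire(self, key: str) -> bool:
        """Take the lock for key; return False if it is already held."""
        if self.is_locked(key):
            return False
        self._expires_at[key] = self._clock() + self._ttl
        return True

    def is_locked(self, key: str) -> bool:
        expiry = self._expires_at.get(key)
        return expiry is not None and expiry > self._clock()

    def release(self, key: str) -> None:
        self._expires_at.pop(key, None)


def handle_push_callback(cache: CreationLockCache, events: Dict[str, dict],
                         correlation_key: str, payload: dict) -> str:
    """Process one push callback without creating a duplicate (hypothetical)."""
    if cache.is_locked(correlation_key):
        # The original creation flow is still running: do not create anything.
        return "deferred"
    existing = events.get(correlation_key)
    if existing is not None:
        # Fallback path: the lock is absent or expired, so update the original.
        existing.update(payload)
        return "updated"
    # No local event matches: treat it as an event that originated in Outlook.
    events[correlation_key] = dict(payload)
    return "created"
```

In this sketch the creation flow would call `acquire` before contacting Outlook and `release` once the event is fully stored. The time-to-live bounds how long a crashed creation flow can block callbacks.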

 

 

Preventative Action:  

 

To enhance system reliability, we have improved logging and monitoring for push callback processing, enabling early detection of anomalies. Additionally, we have implemented a mechanism that suppresses the first push callback for an event, so that the original event creation flow can finish updating the event without interference.
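
One way such monitoring could flag anomalies early is a sliding-window count of push callbacks per event, with a warning logged when the count exceeds a threshold. The sketch below is an assumption about what that might look like; the `CallbackRateMonitor` class, the window length, and the threshold are hypothetical values chosen for illustration.

```python
import logging
from collections import defaultdict, deque
from typing import Deque, Dict

logger = logging.getLogger("push_callbacks")


class CallbackRateMonitor:
    """Counts push callbacks per event key inside a sliding time window."""

    def __init__(self, window_seconds: float = 60.0, threshold: int = 5) -> None:
        self._window = window_seconds
        self._threshold = threshold
        self._seen: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, event_key: str, timestamp: float) -> bool:
        """Record one callback; return True if the key now exceeds the threshold."""
        times = self._seen[event_key]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and times[0] <= timestamp - self._window:
            times.popleft()
        anomalous = len(times) > self._threshold
        if anomalous:
            logger.warning(
                "push callback burst: key=%s count=%d window=%.0fs",
                event_key, len(times), self._window,
            )
        return anomalous
```

A caller would pass the arrival time of each callback to `record` and could route a `True` result to an alerting channel.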

Posted Mar 20, 2025 - 10:33 MDT

Resolved

As we have not seen further service disruptions after the fix was implemented, we have moved to the Resolved Phase.
A Preliminary RCA will be posted in this incident in 2 business days. Please stay subscribed to the page to receive posts automatically.
Posted Feb 25, 2025 - 14:01 MST

Monitoring

A fix has been implemented. We are moving into the Monitoring Phase for the next hour. We will be closing this out at 2 PM MST.
Posted Feb 25, 2025 - 12:51 MST

Update

We are still investigating an issue with app.teem.com and logging. Our Engineering team is working to determine the cause of the disruption. Next update will be posted at 2:30 PM MST.
Posted Feb 25, 2025 - 12:19 MST

Update

We are still investigating an issue with app.teem.com and logging. Our Engineering team is working to determine the cause of the disruption. Next update will be posted at 12:30 PM MST.
Posted Feb 25, 2025 - 10:29 MST

Investigating

We are currently investigating an issue with the loading of app.teem.com. Our Engineering team is working to determine the cause of the disruption. Next update will be posted at 11 AM MST.
Posted Feb 25, 2025 - 08:03 MST
This incident affected: Web Interface.