Teem by Eptura detailed Root Cause Analysis | 3/4/2025
S1 – Teem Inaccessible
We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
(The Incident is logged in MST)
On March 4th, 2025, the production application experienced a performance issue caused by high server load. The issue lasted for approximately four days. Upon investigation, it was determined that multiple duplicate events were being generated, resulting in an excessive number of database entries and slowing application response times. The situation escalated, ultimately leading to a system outage and the activation of a fire alarm.
Type of Event:
Outage
Services/Modules impacted:
Timeline:
The timeline is posted in MST.
7:34 AM – Received reports that some users were unable to access app.teem.com.
10:19 AM – Engineering began clearing tables to help alleviate the performance load.
11:29 AM – Severity lowered to S2, as the website was up and accessible.
10:17 PM – Table clearing completed; performance restored to full.
3:49 AM – Website confirmed to be working as intended.
9:17 AM – Issue resolved; entered a two-hour monitoring phase.
11:00 AM – Status page cleared.
Total Duration of Event:
27 Hours
Root cause:
Duplicate event generation: The event processing logic did not prevent the creation of redundant events.
Delayed event processing: Recent codebase changes that integrated Graph API calls extended the event processing time.
Push Callback Timing Issue:
When an event is created in our database, a corresponding call is made to Outlook to create the event on their end.
The increased processing time due to the Graph API changes caused push callbacks to arrive before the original event creation process was completed.
This resulted in the original event remaining incomplete and missing updates from Outlook.
A scheduled interval call exists to update missed events, but the frequent push callbacks led to the original event being updated only after multiple duplicates were created.
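The race described above can be sketched in a few lines. This is a simplified illustration, not Teem's actual code: the store, function names, and the `-dup` key convention are all hypothetical. It shows how a push callback that arrives before the original creation flow commits cannot find the pending event, and so writes a duplicate entry instead of updating the original.

```python
# Illustrative sketch of the duplicate-event race (all names hypothetical).
events = {}  # event store keyed by the external (Outlook) event id


def create_event(external_id, payload):
    """Original creation flow: slowed down by the added Graph API calls,
    so it commits to the store later than it used to."""
    # ... Graph API calls happen here, delaying the commit ...
    events[external_id] = {"payload": payload, "complete": True}


def handle_push_callback(external_id, update):
    """Push callback from Outlook; fires as soon as the event exists there."""
    if external_id in events:
        events[external_id]["payload"].update(update)  # normal path
    else:
        # Race: the original flow has not committed yet, so the callback
        # creates a duplicate entry for the same logical event.
        events[f"{external_id}-dup"] = {"payload": update, "complete": False}


# The callback arrives before the slow creation flow finishes:
handle_push_callback("evt-1", {"room": "4A"})
create_event("evt-1", {"title": "Standup"})
print(len(events))  # 2 -> two rows exist for one logical event
```

Each additional early callback would add another duplicate row, which is how the table growth and slowdown compounded.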
Remediation:
To address the issue, we introduced a locking mechanism, implemented as a cache entry, that prevents a push callback from creating an event before the original event-creation flow has completed. As a fallback, if the locking mechanism fails, the original event is updated with the push callback data rather than a duplicate being created. We are also working on sending a unique payload key to Outlook that can be compared within push callbacks, ensuring more accurate and timely event updates.
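A minimal sketch of the locking approach, under stated assumptions: the in-memory set stands in for the shared cache, and the function and variable names are illustrative, not the actual implementation. While the lock is held, callback data is deferred and merged into the original event once creation completes, rather than creating a new row.

```python
# Hypothetical sketch of the cache-based lock (names are illustrative).
creation_locks = set()   # in production this would be a shared cache entry with a TTL
events = {}              # event store keyed by external id
deferred_updates = {}    # callback data held back while the lock is active


def begin_creation(external_id):
    """Take the lock before the (slow) creation flow starts."""
    creation_locks.add(external_id)


def finish_creation(external_id, payload):
    """Commit the event, release the lock, and merge any deferred callback."""
    events[external_id] = {"payload": payload}
    creation_locks.discard(external_id)
    if external_id in deferred_updates:
        events[external_id]["payload"].update(deferred_updates.pop(external_id))


def handle_push_callback(external_id, update):
    if external_id in creation_locks:
        # Lock held: do NOT create an event; stash the update instead.
        deferred_updates[external_id] = update
        return
    if external_id in events:
        events[external_id]["payload"].update(update)


begin_creation("evt-1")
handle_push_callback("evt-1", {"room": "4A"})   # arrives early, gets deferred
finish_creation("evt-1", {"title": "Standup"})
print(events)  # a single event carrying both the original and callback data
```

The key design point is that the callback handler never writes a new row while the lock exists; worst case, the update is applied slightly later by the merge step.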
Preventative Action:
To enhance system reliability, we have improved logging and monitoring for push callback processing, enabling early detection of any anomalies. Additionally, we have implemented a mechanism to suppress the first push callback for an event, ensuring that the original event creation flow completes smoothly and without interference.