S1 - TEEM Services unavailable
Incident Report for Teem
Postmortem

Teem by Eptura detailed Root Cause Analysis | 2/12/2024

S1 – Inability to Access Teem

 

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.

 

Description:

On Monday, February 12, 2024, at around 7:02 PM MST, both internal and external customers were unable to access Teem. At approximately 7:06 PM MST, internal team members were alerted that the Teem login page had failed a monitoring check. Internal teams immediately began investigating the issue.

 

Type of Event:

Outage

 

Services/Modules impacted:

All production.

 

Timeline:
The timeline is posted in MST.

7:02 PM: We received an alert, and our internal Engineering team began working on the issue.
7:16 PM: We continued investigating and joined a call with AWS engineering to get the issue resolved.
8:39 PM: Our internal teams were still on the call with AWS engineering.
9:40 PM: The call continued, with our Engineering team and AWS engineering working together to bring the database online.
3:02 AM: Our Engineering team posted a Severity 1 incident on the status page to let our customers know about the issue.
4:20 AM: We were still investigating the issue.
5:59 AM: The investigation continued.
7:11 AM: The issue was resolved; we moved virtual machines to allow hosting for our database. We then notified our customer base and put the status page into a monitoring state.
9:26 AM: The issue was confirmed resolved, and the status page was updated to reflect this.

 

 

Total Duration of Event:

12 Hours
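
The duration can be cross-checked against the timestamps in the timeline. The sketch below is illustrative only: it assumes the 7:02 PM alert on February 12 as the start of the event, and it computes the elapsed time to both the 7:11 AM fix and the 9:26 AM confirmation, since the report does not state which endpoint the total is measured to. The variable and function names are hypothetical.

```python
from datetime import datetime

# Timestamps taken from the timeline in this report (all MST).
alert = datetime(2024, 2, 12, 19, 2)               # 7:02 PM, alert received
fix_implemented = datetime(2024, 2, 13, 7, 11)     # 7:11 AM, status set to Monitoring
confirmed_resolved = datetime(2024, 2, 13, 9, 26)  # 9:26 AM, status set to Resolved


def hours_minutes(start, end):
    """Return the elapsed time between two datetimes as (hours, minutes)."""
    total_minutes = int((end - start).total_seconds() // 60)
    return divmod(total_minutes, 60)


# Assumption: the event starts at the first alert.
to_fix = hours_minutes(alert, fix_implemented)
to_resolved = hours_minutes(alert, confirmed_resolved)

print(f"Alert to fix implemented: {to_fix[0]} h {to_fix[1]} min")
print(f"Alert to confirmed resolved: {to_resolved[0]} h {to_resolved[1]} min")
```

Measured to the 7:11 AM fix, the elapsed time is 12 hours 9 minutes, which is consistent with the 12-hour figure above.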

 

Root Cause:

AWS required an update to the virtual machine that hosts our database. Our Engineering team was not notified of the update because the notification email was sent to an outdated email address. We were alerted to the outage right away by the failsafes we have in place. Investigation and attempts at resolution started immediately.

 

Remediation:

We have moved our database to Amazon RDS, so server updates should no longer cause downtime. We have also updated all email addresses and notification systems.

 

Preventative Action:

With the correct notification email address in place and our database on a hosting service that allows for quick switchover, server updates can be handled without downtime.

Posted Feb 23, 2024 - 09:45 MST

Resolved
As we have not seen further service disruptions after the fix was implemented, we have moved to the Resolved Phase.
An RCA will be posted to this incident in 10 business days. Please stay subscribed to the page to receive the post automatically.
Posted Feb 13, 2024 - 09:26 MST
Monitoring
A fix has been implemented. We are moving into the Monitoring Phase for the next 2 hours. 10:00am CST
Posted Feb 13, 2024 - 07:11 MST
Update
We are currently investigating an issue with TEEM. Our Engineering team is currently investigating to determine the cause of the disruption. The next update will be posted at 9 AM CST.
Posted Feb 13, 2024 - 05:59 MST
Update
We are currently investigating an issue with TEEM. Our Engineering team is currently investigating to determine the cause of the disruption. The next update will be posted at 7 AM CST.
Posted Feb 13, 2024 - 04:20 MST
Investigating
We are currently investigating an issue with Teem. We will update you when we have more information.
Posted Feb 13, 2024 - 03:02 MST
This incident affected: Web Interface, Mobile Data, API, EventBoard, Phone System, LobbyConnect, Authentication (SSO), and Datadog Events.