Teem by Eptura detailed Root Cause Analysis | 2/12/2024
S1 – Inability to Access Teem
We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
On Monday, February 12, 2024, at around 7:02 PM MST, both internal and external customers were unable to access Teem. At approximately 7:06 PM MST, internal team members were alerted that the Teem login page had failed a monitoring check. Internal teams immediately began investigating the issue.
Type of Event:
Outage
Services/Modules impacted:
All production.
Timeline:
The timeline is posted in MST.
7:02 PM: We received an alert. Our internal Engineering team was able to jump onto the issue.
7:16 PM: We continued investigating and joined a call with AWS engineering to get the issue resolved.
8:39 PM: Our internal teams were still on the call with AWS engineering.
9:40 PM: The call with AWS continued, with our Engineering team and the AWS engineering teams working together to get the database online.
3:02 AM: Our Engineering team posted a Severity 1 status page to let our customers know about the issue.
4:20 AM: We were still investigating the issue.
5:59 AM: Investigation continued.
7:11 AM: The issue was resolved after we moved virtual machines to allow hosting for our database. We then notified our customer base and put the status page into a monitoring state.
9:26 AM: The issue was confirmed resolved, and the status page was updated to reflect this.
Total Duration of Event:
12 Hours
Root Cause:
AWS required an update to the virtual machine that hosts our database. Our Engineering team was not notified of the update because the notice was emailed to an outdated address. We were alerted to the outage right away by the failsafes we have in place, and investigation and attempts at resolution started immediately.
Remediation:
We have moved the database to Amazon RDS on AWS, so server updates should no longer cause downtime. We have also updated all email addresses and notification systems.
Preventative Action:
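Keeping the correct notification email address in place ensures our Engineering team receives update notices, and hosting the database on a service that allows quick switchover lets server updates proceed without downtime.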