Degraded performance of API and website

Incident Report for Teem

Postmortem

Several issues, all of which contributed to the performance problem, were addressed during the emergency downtime:

Indexes on several heavily used tables were bloated or corrupted, leaving the primary database instance starved for processing resources. This was corrected, and additional capacity and alerting were added.
Various instance-level and network-level errors indicated potential issues with the hardware hosting the virtual database instance. The system was force-migrated to new hardware, and processes were cleared and restarted to ensure proper function and replication. Additionally, failover processes and triggers have been reviewed and updated to help keep a single-node disruption from affecting the wider system.
Several background and asynchronous tasks were tuned and rebalanced to avoid resource overutilization.

Posted Oct 09, 2019 - 16:45 MDT

Resolved

This incident has been resolved. Database performance has remained at normal levels as a result of the back-end changes.
A postmortem will be posted by Thursday, October 10th.

Posted Oct 03, 2019 - 09:36 MDT

Update

We have continued monitoring, and the back-end changes have kept database performance at normal levels.

We will continue monitoring this evening and will provide another update by 10 AM MT/12 PM ET tomorrow.

Posted Oct 02, 2019 - 13:47 MDT

Update

The backlog of requests has returned to normal levels. We are continuing to monitor and will provide updates as more information becomes available.

Posted Oct 01, 2019 - 14:59 MDT

Update

At 8:30 AM MT this morning, there was a significant increase in sync requests to our systems. The resulting backlog of requests is trending downward, and some calendars are expected to be out of sync while the systems catch up. Next update by 3 PM MT/5 PM ET.

Posted Oct 01, 2019 - 12:26 MDT

Update

A code release was pushed last night to address the small subset of O365/Exchange calendars that were not syncing, as well as to improve calendar syncing overall. We are monitoring the results and will provide an update by 4 PM MT/6 PM ET.

Posted Oct 01, 2019 - 09:38 MDT

Update

We are investigating reports that a subset of O365, Exchange, and Google calendars are not syncing, which appears to be related to this incident. Updates will be provided as they become available.

Posted Sep 30, 2019 - 14:24 MDT

Update

All systems are operational. Teem will leave this incident open and continue to monitor systems closely throughout the weekend to verify all global clients are fully functional.

Posted Sep 27, 2019 - 16:52 MDT

Monitoring

All systems are operational, and Teem will continue to closely monitor the platform throughout the day. During the overnight maintenance window, the team moved the database to different host hardware, then restarted and performed maintenance on the primary database, including updates, reindexing, general cleanup, and reinitialization of dependent services. Teem will be adding capacity to the system throughout the day.

Posted Sep 27, 2019 - 09:49 MDT

Update

We are continuing to investigate this issue, as services have operated with degraded performance throughout the day. While a root cause has not yet been confirmed, we have identified issues relating to our database cluster and will schedule an emergency maintenance window for later this evening to address them.

Posted Sep 26, 2019 - 16:54 MDT

Investigating

Teem is currently investigating a widespread performance issue across the site.

Posted Sep 26, 2019 - 09:54 MDT

This incident affected: Web Interface and API.