Intermittent system performance issues were observed throughout the day, and the problem was escalated during the afternoon. By 23:30 it was determined that the situation was unrecoverable and that the API and the primary database needed to be taken offline to correct the issue. The standby and read replica databases were taken offline and restoration of a new standby database began. Once the restoration was complete, the API and other components were re-enabled for client use. No data was lost in the restoration.
Because of heavy load on the primary database, the standby and read replica databases drifted progressively further out of sync with the primary throughout the day. We anticipated that the replicas would catch up during a period of lower activity, and that we could concurrently resolve the load issues with a software patch. However, the standby database stopped responding and subsequently lost communication with the primary. This tripped the primary's replication timeout, which caused the primary to drop its replication slots. The on-disk replication WAL segments were then removed, requiring a full restore of the standby from the primary database.

In addition to restoring the data, we substantially increased the replication timeout and changed the instance type to provide additional processing capacity. We have also applied an application update that handles high load more efficiently and will place less load on the database going forward.
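For illustration only, the sketch below shows one way the replica lag and slot-retained WAL described above could be monitored on a PostgreSQL-style primary so that drift is caught before slots or WAL are lost. The connection string, thresholds, and the use of the psycopg2 client are assumptions for the example, not a description of our actual tooling.

```python
# Minimal monitoring sketch (assumes PostgreSQL 10+ and psycopg2 installed).
# Connection details and the alert threshold below are illustrative only.
import psycopg2

RETAINED_WAL_ALERT_BYTES = 1 * 1024**3  # example threshold: ~1 GiB of retained WAL


def check_replication(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # How far each standby / read replica has fallen behind the primary.
        cur.execute(
            "SELECT application_name, state, replay_lag FROM pg_stat_replication"
        )
        for name, state, lag in cur.fetchall():
            print(f"standby={name} state={state} replay_lag={lag}")

        # How much WAL each replication slot is forcing the primary to keep on disk.
        cur.execute(
            """
            SELECT slot_name, active,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
            FROM pg_replication_slots
            """
        )
        for slot, active, retained in cur.fetchall():
            status = "ALERT" if (retained or 0) > RETAINED_WAL_ALERT_BYTES else "ok"
            print(f"slot={slot} active={active} retained_wal_bytes={retained} [{status}]")


if __name__ == "__main__":
    # Hypothetical connection string for a monitoring user on the primary.
    check_replication("host=primary.example.internal dbname=postgres user=monitor")
```

A check along these lines, run on a schedule and alerting on inactive slots or growing retained WAL, would surface the kind of replica drift described above before the replication timeout and slot loss occur.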