Intermittent system performance issues were observed throughout the day, and the problem was escalated during the afternoon. By 23:30 it was determined that the situation was unrecoverable and that the API and the primary database needed to be taken offline to correct the issue. The standby and read replica databases were taken offline and restoration of a new standby database began. Once the restoration was complete, the API and other components were re-enabled for client use. No data was lost in the restoration.
Because of heavy load on the primary database, the standby and read replica databases drifted progressively further out of sync with the primary throughout the day. We anticipated that the replicas would catch up during a period of lower activity, and that we could concurrently resolve the load issues with a software patch. However, the standby database stopped responding and subsequently lost communication with the primary. This tripped the primary's replication timeout, which caused the primary to drop its replication slots. The on-disk replication WAL segments were then removed, requiring a full restore of the standby from the primary database.

In addition to restoring the data, we substantially increased the replication timeout and changed the instance type to provide additional processing capacity. We have also applied an application update that handles high load more efficiently and will place less load on the database going forward.
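For illustration only, the sketch below shows one way the replica lag and slot-retained WAL described above could be monitored on a PostgreSQL-style primary so that drift is caught before slots or WAL are lost. The connection string, thresholds, and the use of the psycopg2 client are assumptions for the example, not a description of our actual tooling.

```python
# Minimal monitoring sketch (assumes PostgreSQL 10+ and psycopg2 installed).
# Connection details and the alert threshold below are illustrative only.
import psycopg2

RETAINED_WAL_ALERT_BYTES = 1 * 1024**3  # example threshold: ~1 GiB of retained WAL


def check_replication(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # How far each standby / read replica has fallen behind the primary.
        cur.execute(
            "SELECT application_name, state, replay_lag FROM pg_stat_replication"
        )
        for name, state, lag in cur.fetchall():
            print(f"standby={name} state={state} replay_lag={lag}")

        # How much WAL each replication slot is forcing the primary to keep on disk.
        cur.execute(
            """
            SELECT slot_name, active,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
            FROM pg_replication_slots
            """
        )
        for slot, active, retained in cur.fetchall():
            status = "ALERT" if (retained or 0) > RETAINED_WAL_ALERT_BYTES else "ok"
            print(f"slot={slot} active={active} retained_wal_bytes={retained} [{status}]")


if __name__ == "__main__":
    # Hypothetical connection string for a monitoring user on the primary.
    check_replication("host=primary.example.internal dbname=postgres user=monitor")
```

A check along these lines, run on a schedule and alerting on inactive slots or growing retained WAL, would surface the kind of replica drift described above before the replication timeout and slot loss occur.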