S2 | O365 & Google Calendar - Latency in Events Syncing
Incident Report for Teem
Postmortem

Teem Detailed Root Cause Analysis | Severity 1 & 2 | October 6 – 20, 2022

Description:

On October 6, 2022, at approximately 8:50am MDT, customer support started to receive reports of calendar syncing latency throughout the Teem platform. When customers booked events in Outlook or Google calendars, the event boards would not reflect the events for anywhere from 10 minutes to 2 hours. This issue lasted intermittently until October 20, 2022.

Type of Event:

Service Disruption

 

Services\Modules Impacted:

Calendars

 

Remediation:

Our engineering team has created new monitoring systems and has implemented new hardware to identify these issues quickly going forward.

Timeline:

Thursday, October 6, 2022

8:50am MDT: Customers begin reporting latency in calendar events syncing to the event board, and Support escalates the issue to Engineering.

8:54am MDT: Our Engineering team begins investigating and actively explores calendar issues through 12pm MDT.

1:31pm MDT: We've identified that the issue involves both Microsoft and Google calendars.

3:03pm MDT: Engineering has identified the issue and implemented a fix: adding more resources to the event queues so that calendar requests clear out quickly.

6:05pm MDT: Event queues continue to process their way through.

7:32pm MDT: Support received feedback from most of our customers saying that this issue has been resolved.

7:35pm MDT: The status page was marked as resolved. However, the issue was downgraded to an S3 as support awaits more feedback from customers.

Friday, October 7, 2022

9:27am MDT: Status page is updated with a note advising customers: On 10/06/2022, Teem customers using O365 & Google Calendars experienced disruption with Calendars not Syncing. Our engineering team has implemented a fix. You may experience a delay in resolution due to event jobs that continue to process due to the disruption. We will continue to monitor the resolution till the events have been cleared.

10:37am MDT: Engineering confirms that the queue is still clearing. However, the number of calendar events in the queue has not grown, which is a good sign. They continue to add more resources to help process all requests through the queue.

Wednesday, October 12, 2022

2:52pm MDT: Customers begin to report that Microsoft and Google calendars are not syncing, and a new incident is posted on the status page to make our customers aware.

2:58pm MDT: Engineering acknowledges the issue and begins investigating, exploring notification pushes for calendars.

4:36pm MDT: Engineering has asserted logs and will review why some, but not all, calendars are updating.

7:01pm MDT: The status page is updated as we continue to investigate the issue.

11:00pm MDT: Engineering has a working theory on the issue, and the team continues its efforts to confirm that it is the root cause. Calendar watchers are being registered and an attempt to sync begins.

11:05pm MDT: The status page is updated as we are still investigating the issue.

3:01pm MDT: The status page is updated as we are still investigating the issue.

Thursday, October 13, 2022

7:54am MDT: The status page is updated as we are still investigating the issue.

12:21pm MDT: The engineering team has done initial testing on its working theory and begins to prepare the solution for bulk application.

12:28pm MDT: The status page is updated as we are still investigating the issue.

2:56pm MDT: The engineering team is confident that the issue has been identified and is running updates to verify the solution.

4:32pm MDT: The status page is updated from Investigation to Identified.

6:28pm MDT: Engineering has implemented a fix, and the number of calendar requests begins to trend downward. The Support team begins to verify calendars with customers.

7:29pm MDT: The status page is updated from Identified to Monitoring.

8:40pm MDT: The status page is updated as we continue to monitor, and our support team continues to verify the solution with customers.

Friday, October 14, 2022

6:16am MDT: Customers have verified that the issue has been resolved and the status page has been updated from Monitoring to Resolved.

6:38am MDT: A note is added to the incident on the status page: While this issue may appear to be resolved, support will continue monitoring all customers that have been affected by the Calendar Sync disruption. If you continue experiencing these issues, please reach out to our support team at support@teem.com.

Monday, October 17, 2022

3:41pm MDT: Customers begin to report intermittent calendar syncing issues: some calendars are syncing, but not all. The status page is updated, and all customers are notified that we are investigating this issue. Engineering also acknowledges the issue and begins investigating.

4:07pm MDT: Support works with Engineering to provide some examples of affected calendars from customers as the investigation continues.

7:05pm – 11:13pm MDT: The status page is updated, and customers are notified that new logs are being added to the system to help us better identify the issue. Investigation continues through the night and into the next morning.

Tuesday, October 18, 2022

3:03am – 11:08am MDT: The team continues to investigate the logs that were added to identify the issue, and the status page is updated accordingly.

1:49pm MDT: The engineering team finds that the queue for calendar requests is backed up and is working on distributing requests. The team is now adding more logs and monitoring them to understand where the bottleneck is.

8:37pm MDT: The status page is updated as our engineering team continues to monitor logs that have been implemented. We're also looking at current configurations to address the latency in calendar events.

11:00pm MDT: The engineering team has set up new servers for processing queues and updated queue processing configurations. The team begins to test, monitor, and fine-tune the configuration.

Wednesday, October 19, 2022

12:30am MDT: The status page is updated as our engineering team continues to monitor logs that have been implemented. Queues have been restarted and have processed. The team has force-synced 200 test calendars, and they all processed within 2 minutes.

9:05am MDT: The status page is updated as we are still investigating this issue.

10:18am MDT: Additional monitoring has been put into place and queues begin processing as normal.

1:46pm MDT: Support reaches out to customers to verify calendar issues and continues to monitor customer requests. The status page is also updated with information asking customers to reach out to support if they are still experiencing the issue.

6:00pm MDT: The status page is updated to monitoring. Monitoring continues into the next day.

Thursday, October 20, 2022

6:05pm MDT: The status page is updated from Monitoring to Resolved. After a day of monitoring, we are marking this issue as Resolved. Our team has implemented new logs to help us tell a better story for created events going forward. We've also reconfigured job managers, which have been processing requests smoothly and quickly. We continue to implement new hardware to improve our services. Please reach out to our support team if you have any questions. We appreciate your patience as we work through this.

Root Cause Analysis:

Insufficient processing capacity, logging, and monitoring for the calendar queues.

Preventative Action:

The engineers have reconfigured the queue processing system, and sustained monitoring suggests normal processing. The engineering team implemented a plan to reduce the chances of a recurrence of the queue processing problems. The plan includes provisioning additional infrastructure to add a new layer of high availability. This layer will enable us to continue processing queue data even if some parts of the system fail. It will also allow us to continue processing while provisioning additional server capacity and restarting queue processing subsystems if needed. The team has deployed this additional infrastructure and has begun testing and tuning the configuration.

As part of the efforts to recover the system, it was necessary to gain additional insight into our queue processing. New monitors are now available that give insights into queue processing that we did not have previously. However, we still need more monitoring to confidently say the system is processing every message as expected and to understand our throughput patterns so we can watch and alert for anomalies that may indicate an active or imminent failure.

The engineering team is going to add additional monitoring and build tooling to enable Support to better diagnose message processing through the queues. This tooling will empower Support to address more questions with confidence without having to involve Engineering. The implementation of additional monitoring is in progress now, and tooling for Support will continue into the next sprint cycle for the team.
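For illustration only, the kind of backlog and throughput check described in this section could look roughly like the Python sketch below. It is not Teem's actual monitoring: the sample shape (QueueSample), the function name (find_alerts), and the thresholds are all hypothetical and chosen only to show the idea of alerting on a growing or stalled queue.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueSample:
    """One periodic observation of a calendar queue (hypothetical shape)."""
    minute: int      # minutes since monitoring started
    processed: int   # requests completed since the previous sample
    depth: int       # requests still waiting when the sample was taken


def find_alerts(samples: List[QueueSample],
                max_depth: int = 1000,
                growth_streak: int = 3) -> List[str]:
    """Return alert messages for a growing, oversized, or stalled queue.

    Thresholds are illustrative assumptions, not production values.
    """
    alerts: List[str] = []
    streak = 0          # consecutive samples in which the backlog grew
    prev_depth = None
    for s in samples:
        # Track how many samples in a row the backlog has been growing.
        if prev_depth is not None and s.depth > prev_depth:
            streak += 1
        else:
            streak = 0
        if streak >= growth_streak:
            alerts.append(
                f"minute {s.minute}: backlog grew for {streak} consecutive samples")
        # Absolute backlog size check.
        if s.depth > max_depth:
            alerts.append(
                f"minute {s.minute}: depth {s.depth} exceeds {max_depth}")
        # Work is waiting but nothing was processed: possible stalled workers.
        if s.processed == 0 and s.depth > 0:
            alerts.append(
                f"minute {s.minute}: no requests processed while {s.depth} are waiting")
        prev_depth = s.depth
    return alerts


if __name__ == "__main__":
    history = [
        QueueSample(minute=0, processed=50, depth=100),
        QueueSample(minute=5, processed=50, depth=120),
        QueueSample(minute=10, processed=40, depth=180),
        QueueSample(minute=15, processed=0, depth=260),
        QueueSample(minute=20, processed=90, depth=200),
    ]
    for message in find_alerts(history, max_depth=250, growth_streak=3):
        print(message)
```

A check like this watches the trend as well as the absolute size, so a queue that is still small but growing steadily, or one that has stopped processing entirely, is flagged before customers notice delayed events.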

Posted Oct 24, 2022 - 15:27 MDT

Resolved
After a day of monitoring, we are marking this issue as Resolved. Our team has implemented new logs to help us tell a better story for created events going forward. We've also reconfigured job managers, which have been processing requests smoothly and quickly. We continue to implement new hardware to improve our services. Please reach out to our support team if you have any questions. We appreciate your patience as we work through this.
Posted Oct 20, 2022 - 18:05 MDT
Update
Our team is continuing to monitor the issues at hand before we move into a resolved state. If you're still experiencing any issues please don't hesitate to reach out to our support team.
Posted Oct 19, 2022 - 18:00 MDT
Update
Since our team has implemented additional monitoring resources, we should start to see calendars beginning to sync and process normally. If you are still experiencing this issue please reach out to support (support@teem.com) and provide details around the room(s) that you are trying to book.

- Calendar name
- Is your calendar event syncing?
- If so, how long is it taking to sync?
- Are your rooms syncing for some or all?

Our monitoring continues and our next update will be provided at 6:20pm MDT.
Posted Oct 19, 2022 - 14:19 MDT
Update
Our engineering team continues to monitor resources that have been put into place. Our next update will be provided at 1pm MDT.
Posted Oct 19, 2022 - 09:05 MDT
Monitoring
Our engineering team has identified and implemented more resources, including logging, additional hardware, and some configuration updates. We continue to test and monitor the calendar queues through to resolution. We appreciate your continued patience as we work through this. Our next update will be posted at 8:30am MDT.
Posted Oct 19, 2022 - 04:45 MDT
Update
Our engineering team continues to monitor logs that have been implemented. We're also taking a look at current configurations to address the latency in calendar events. We apologize for the inconvenience and appreciate your continued patience as we work through this. Our next update will be posted at 4:30am MDT.
Posted Oct 19, 2022 - 00:30 MDT
Update
Our engineering team continues to monitor logs that have been implemented. We're also taking a look at current configurations to address the latency in calendar events. We apologize for the inconvenience and appreciate your continued patience as we work through this. Our next update will be posted at 12:30am MDT.
Posted Oct 18, 2022 - 20:37 MDT
Update
Our engineering team continues to investigate calendar queues and monitor possible bottlenecks that are causing this latency. We apologize for the inconvenience and appreciate your continued patience as we work through this. Our next update will be posted at 7:30pm MDT.
Posted Oct 18, 2022 - 15:30 MDT
Update
Our team continues to monitor the latency in calendar events syncing. We appreciate your patience as we work through this. Our next update will be provided at 3:05pm MDT.
Posted Oct 18, 2022 - 11:08 MDT
Update
Our team continues to monitor the latency in calendar events syncing. We appreciate your patience as we work through this. Our next update will be provided at 11:05am MDT.
Posted Oct 18, 2022 - 07:16 MDT
Update
Our team continues to monitor the latency in calendar events syncing. We appreciate your patience as we work through this. Our next update will be provided at 7:05am MDT.
Posted Oct 18, 2022 - 03:03 MDT
Update
Our engineering team continues to investigate and monitor event logs. Our next update will be provided at 3:05am MDT.
Posted Oct 17, 2022 - 23:15 MDT
Update
Our engineering team continues to investigate this issue. We are adding logs to the system to help us better identify the issue. Our next update will be provided at 11:05pm MDT.
Posted Oct 17, 2022 - 19:05 MDT
Investigating
We are currently investigating an issue with Teem Calendars where events are not syncing, and we will provide an update at 7:41pm.
Posted Oct 17, 2022 - 15:41 MDT
This incident affected: Google Apps Calendar and Exchange Sync.