[Datto SaaS Protection] Service Interruption Impacting UK Environments

Incident Report for Datto

Postmortem

Between November 4th and November 9th, a release was rolled out to SaaS Protection. Following that release, SaaS Protection partners experienced a service interruption that caused slow ingest in the UK. This initially affected all partners and end customers but was later localized to the new v3 infrastructure.

The root cause of this service interruption was identified as a fix, included in the release, for a long-standing Microsoft API defect. The fix unexpectedly added a large number of SharePoint and Teams sites, which saturated our Azure peering links. In addition, the influx of new SharePoint and Teams sites created a large backlog of services on the v3 platform.

Our Engineering team took the following steps to remediate the problem:

Reconfigured the network to double the bandwidth available for API calls to Microsoft
Reduced concurrency of SharePoint backups
Increased concurrency of backups for the v3 infrastructure and per service type
Reduced backup frequency from 3x per day to 2x per day
Redirected new accounts to separate infrastructure in the UK
Expanded the database infrastructure
Returned to the normal 3x backups per day

We’re in the process of expanding our data center footprint in the UK and bringing new v3 infrastructure online in Q1. This will allow us to manage the growth of the current v3 infrastructure and balance the load more effectively.

New procedures have been included in our regular infrastructure reviews to identify the signs of increasing stress on the database.

We are further enhancing our phased rollout of releases to protect against an unexpected load on the platform in the future.

A more detailed RCA is available. For access to this RCA please make a request directly to your PSM.

Posted Dec 23, 2021 - 17:30 UTC

Resolved

This incident has been resolved.

Posted Dec 13, 2021 - 15:45 UTC

Update

We are continuing to monitor for any further issues.

Posted Dec 07, 2021 - 14:15 UTC

Monitoring

Last week’s infrastructure maintenance has proven to be highly successful. Ingest rates have accelerated significantly, and the backlog has shrunk to the point where we have fully completed ingest for a majority of new accounts. A few large accounts and services remain, but they should all be fully ingested by the end of the week. Anything still outstanding after this week should be raised with Datto Support for further investigation.

We are moving this incident into a monitoring state following the actions taken last week and the reinstatement of 3x backups/day earlier today.

This week we will hold an internal postmortem on the event. We’ll use its findings to produce a root cause analysis (RCA), which we intend to share next week.

Posted Dec 06, 2021 - 21:44 UTC

Update

The maintenance performed on the UK infrastructure has completed successfully. We have expanded capacity and have already seen an immediate increase in throughput. We’ll closely monitor KPIs and other metrics to ensure proper operation. Our next scheduled update will be at the end of Monday. At that time we can provide further guidance on the backlog of services based on several days of data including the weekend.

Posted Dec 02, 2021 - 23:58 UTC

Update

UK update: Earlier today we sent out a notice for a maintenance window after business hours tomorrow. In that window we will bring the new infrastructure online and redistribute load to increase throughput in the cluster.

We expect this to make an immediate impact on performance and return the cluster to normal operations. However, it will take some time to work through the backlog of services that has accumulated. Assuming all goes well in the maintenance window, we will observe performance through the weekend and then, on Monday, communicate expectations for working through what remains. In general, large SharePoint sites and OneDrive will naturally lag. More information to follow on that.

Posted Dec 01, 2021 - 21:58 UTC

Update

We are continuing to work on a fix for this issue.

Posted Nov 29, 2021 - 23:02 UTC

Update

We continue to make progress knocking down the backlog of services for the affected accounts in the UK. However, a large backlog remains which requires the new infrastructure to fully remediate.

We’re continuing the process of bringing the new infrastructure online and carefully distributing the load. We remain on track for the new infrastructure to be online by the end of the week. At that time, we can more rapidly reduce the backlog of services and return to normal operations (including a return to 3 backups per day). Our next planned update is Wednesday.

Posted Nov 29, 2021 - 23:01 UTC

Update

We continue to see steady but slow improvement for the affected accounts in the UK.

The new infrastructure is on site and going through our standard onboarding process. We remain on track to remediate the current situation by the end of next week.

We’ll provide another update by the end of day on Monday along with some additional info early next week on what to expect as we near full remediation of the current situation.

Posted Nov 26, 2021 - 16:11 UTC

Update

Backup success rates were temporarily disrupted by the unrelated incident communicated earlier today. In general, we continue to see steady but slow improvement in the UK. We are on track with the previously communicated timeline of bringing new infrastructure online to remediate the current situation by the end of next week. We’ll provide more info as we get closer.

We’ll continue to post updates at least every 2 business days. If there’s anything worthy of an interim update we will be sure to provide that.

Posted Nov 24, 2021 - 20:02 UTC

Update

We are continuing to closely monitor backup success rates and ingest rates in the UK. Backup success rates over the last 24 hours are around 99.5%. The changes implemented late last week have had a positive effect on the rate of ingest, and the ingest backlog is being steadily reduced. However, there remains a large backlog to work through.

We have additional infrastructure scheduled to be brought online by the end of next week. We expect to return to normal operations within a few days of that new infrastructure coming online.

Posted Nov 22, 2021 - 22:43 UTC

Update

We are continuing to closely monitor backup success rates and ingest rates in the UK. Last week we deployed a change that addressed a long-standing Microsoft API issue, and this unlocked a large number of SharePoint backups. This temporarily saturated our Azure peering links and caused other backups to lag behind.

We’ve since addressed the networking issue. However, it created a large backlog of new services along with some incremental backups that were missed.

We are iteratively making application changes to increase the throughput of backups. One of the changes we made today is to temporarily reduce backups from 3x/day to 2x/day. This will provide additional backup slots to make progress on the backlog. Again, this is temporary and will revert as soon as we return to normal operations.
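
As a rough illustration of why the reduced schedule frees capacity, the arithmetic can be sketched in Python. The service count below is purely hypothetical, and the sketch assumes every scheduled run occupies exactly one backup slot, which is a simplification rather than a description of how the platform actually schedules work.

```python
def daily_runs(services: int, runs_per_day: int) -> int:
    """Scheduled backup runs per day, assuming one slot per run (a simplification)."""
    return services * runs_per_day


def freed_slots(services: int, before: int = 3, after: int = 2) -> int:
    """Slots per day released when each service drops from `before` to `after` runs."""
    return daily_runs(services, before) - daily_runs(services, after)


if __name__ == "__main__":
    services = 30_000  # hypothetical number of existing services, not a real figure
    freed = freed_slots(services)
    share = freed / daily_runs(services, 3)
    print(f"{freed} slots/day freed ({share:.0%} of the 3x/day schedule)")
```

Under these assumptions, one third of the daily slots become available for backlog work, whatever the actual number of services is.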

Ultimately we need additional infrastructure to permanently resolve the issues. We brought some new infrastructure online today and have more resources already en route to the UK which will come online within the next 2 weeks.

Please know we are treating this with the highest priority.

We’ll continue to post regular updates. If you have any specific questions please open a ticket with Datto Support.

Posted Nov 18, 2021 - 23:11 UTC

Update

Several configuration changes have been made to increase the number of concurrent Exchange backup runs. Exchange backup success rates are improving, although it may be several days before the backlog is eliminated and backup reports reflect the improving state. In particular, new customer ingest will be delayed, since priority is generally given to existing incremental backups. We expect normal operations by the end of the week. If you have a specific concern, please raise a support ticket.

Posted Nov 16, 2021 - 21:35 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Nov 15, 2021 - 13:59 UTC

Investigating

We are aware of a problem where environments in the UK are experiencing decreased backup success rates and slow ingest speeds.

Our Engineering team is investigating this issue.

We do not yet have an ETA for when a fix will be available.

You can monitor the current status of this issue at https://status.datto.com/

Posted Nov 12, 2021 - 20:21 UTC

This incident affected: Datto SaaS Protection (SaaS Protection Backups).