[Datto SaaS Protection] Service Interruption Impacting UK Environments
Incident Report for Datto
Update
The maintenance performed on the UK infrastructure has completed successfully. We have expanded capacity and have already seen an immediate increase in throughput. We’ll closely monitor KPIs and other metrics to ensure proper operation. Our next scheduled update will be at the end of Monday. At that time we can provide further guidance on the backlog of services based on several days of data including the weekend.
Posted Dec 02, 2021 - 23:58 UTC
Update
UK update: Earlier today we sent out a notice of maintenance scheduled for after business hours tomorrow. In that maintenance window we will bring the new infrastructure online and redistribute load to increase throughput in the cluster.

We expect this to make an immediate impact on performance and return the cluster to normal operations. However, it will take some time to work through the accumulated backlog of services. Assuming all goes well in the maintenance window, we will observe performance through the weekend and then on Monday communicate expectations for working through what remains. In general, large SharePoint sites and OneDrive will naturally lag. More information to follow on that.
Posted Dec 01, 2021 - 21:58 UTC
Update
We are continuing to work on a fix for this issue.
Posted Nov 29, 2021 - 23:02 UTC
Update
We continue to make progress knocking down the backlog of services for the affected accounts in the UK. However, a large backlog remains, and fully remediating it requires the new infrastructure.

We’re continuing the process of bringing the new infrastructure online and carefully distributing the load. We remain on track for the new infrastructure to be online by the end of the week. At that time, we can more rapidly reduce the backlog of services and return to normal operations (including a return to 3 backups per day). Our next planned update is Wednesday.
Posted Nov 29, 2021 - 23:01 UTC
Update
We continue to see steady but slow improvement for the affected accounts in the UK.

The new infrastructure is on site and going through our standard onboarding process. We remain on track to remediate the current situation by the end of next week.

We’ll provide another update by the end of day on Monday along with some additional info early next week on what to expect as we near full remediation of the current situation.
Posted Nov 26, 2021 - 16:11 UTC
Update
Backup success rates were temporarily disrupted by the unrelated incident communicated earlier today. In general, we continue to see steady but slow improvement in the UK. We are on track with the previously communicated timeline of bringing new infrastructure online to remediate the current situation by the end of next week. We’ll provide more info as we get closer.

We’ll continue to post updates at least every 2 business days. If anything warrants an interim update, we will provide one.
Posted Nov 24, 2021 - 20:02 UTC
Update
We are continuing to closely monitor backup success rates and ingest rates in the UK. Backup success rates over the last 24 hours are around 99.5%. The changes implemented late last week have had a positive effect on the rate of ingest. The ingest backlog is being steadily reduced. However, there remains a large backlog to work through.

We have additional infrastructure scheduled to be brought online by the end of next week. We expect to return to normal operations within a few days of that new infrastructure coming online.
Posted Nov 22, 2021 - 22:43 UTC
Update
We are continuing to closely monitor backup success rates and ingest rates in the UK. Last week we deployed a change addressing a long-standing Microsoft API issue, and that change unlocked a large number of SharePoint backups. This temporarily saturated our Azure peering links and caused other backups to lag behind.

We’ve since addressed the networking issue. However, the saturation created a large backlog of new services, along with some missed incremental backups.

We are iteratively making application changes to increase the throughput of backups. One of the changes we made today is to temporarily reduce backups from 3x/day to 2x/day. This will provide additional backup slots to make progress on the backlog. Again, this is temporary and will revert as soon as we return to normal operations.

Ultimately we need additional infrastructure to permanently resolve the issues. We brought some new infrastructure online today and have more resources already en route to the UK which will come online within the next 2 weeks.

Please know we are treating this with the highest priority.

We’ll continue to post regular updates. If you have any specific questions please open a ticket with Datto Support.
Posted Nov 18, 2021 - 23:11 UTC
Update
Several configuration changes have been made to increase the number of concurrent Exchange backup runs. Exchange backup success rates are improving, although it may be several days before the backlog is eliminated and backup reports reflect the improvement. In particular, new customer ingest will be delayed, since priority is generally given to existing incremental backups. We expect normal operations by the end of the week. If you have a specific concern, please raise a support ticket.
Posted Nov 16, 2021 - 21:35 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 15, 2021 - 13:59 UTC
Investigating
We are currently aware of a problem where environments in the UK are experiencing decreased backup success rates and slow ingest speeds.

Our Engineering team is currently investigating this issue.

Currently, we do not have an ETA on when a fix will be available.

You can monitor the current status of this issue at https://status.datto.com/
Posted Nov 12, 2021 - 20:21 UTC
This incident affects: Datto SaaS Protection and Backupify (Backupify Backups).