On 14-October-2021 at 21:09 UTC, Datto RMM Partners on the Concord, Zinfandel, Syrah and Merlot platforms experienced a service interruption that delayed audits on new and existing devices.
The root cause of this service interruption was identified as a code change deployed in the 10.0 release that altered the validation of audit data sent by agents.
The change caused an unforeseen increase in the size of delta audits, which in turn resulted in increased processing time on the platform.
Our Engineering team increased platform resources to absorb the additional load while working on a hotfix for the issue. This resolved the symptoms of the issue by 15-October-2021, 04:45 UTC.
The Agent code change causing the issue was reverted; the hotfix was created, tested and released on platforms already running the 10.0 version by 15-October-2021, 13:47 UTC. Agents were not forced to update but were allowed to update organically through their regular procedure, to ensure that day-to-day operations were not disrupted.
The issue was considered fully resolved on 19-October-2021, once the 10.0 release, with the fix already included, had been deployed to the Pinotage platform as well.
To prevent a similar issue from happening in the future, pre-release code review and QA processes have been updated to cover the scenarios that caused this interruption.