RMM - All Platforms - Devices may alert offline when their session is lost in the database
Incident Report for Datto
Resolved
This incident has been resolved.
Posted Apr 06, 2023 - 13:55 UTC
Monitoring
Fixes have been implemented and we are monitoring the results:

Our teams have been working on remediating the issue where managed devices (both Agent-based and Agentless) have been alerting offline as a result of losing their session in the backend.

The Engineering and Technical Support Teams' concerted efforts over the last two months have allowed us to narrow the cause down to database contention during high platform loads.
To address this, we have introduced several changes to both code and infrastructure. Fixes confirmed safe for release in hotfix windows were rolled out as soon as they were ready, while fixes that required downtime were rolled into the major releases 11.6.0 and 11.7.0.

The first major change, made in February, was to the offline alerting logic for Agentless Managed devices (devices managed by a Network Node): offline alerts are no longer triggered for network devices if the Node through which they connect to the platform goes offline. The impact of this has been pronounced.
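As an illustration only, the alerting rule described above can be sketched as follows. This is not the platform's actual code: the `Device` type, its fields, and the `should_raise_offline_alert` function are hypothetical names chosen for the example, and it assumes each agentless device has exactly one Network Node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """Hypothetical device record; field names are assumptions for this sketch."""
    name: str
    online: bool
    # Network Node the device connects through; None for agent-based devices.
    node: Optional["Device"] = None


def should_raise_offline_alert(device: Device) -> bool:
    """Return True if an offline alert should be raised for the device."""
    if device.online:
        return False
    # Agentless device whose Node is itself offline: suppress the alert,
    # since the device's own state cannot be observed through that Node.
    if device.node is not None and not device.node.online:
        return False
    return True


# Example: the Node is offline, so only the Node itself alerts.
node = Device("node-1", online=False)
printer = Device("printer-1", online=False, node=node)
alerts = [d.name for d in (node, printer) if should_raise_offline_alert(d)]
```

In this sketch an offline Node produces a single alert for the Node itself, rather than one alert for every network device behind it.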
Throughout February and March, including the 11.7 release, infrastructure tuning has also taken place to improve the way the backend services handle connection requests from devices. In addition, the procedures that handle device sessions in session storage have been improved to reduce congestion in the data store. Following these changes, we have seen a significant improvement in session-data handling and a dramatic reduction in offline alerts (from the high 1000s at high platform load times to the low 100s during the same time periods). This leads us to believe that the remaining offline alerts raised by the platform are valid.
During this period we have also added logging and monitoring to the data stores and services involved, which will allow us to continue with planned improvements in future releases.
Posted Apr 05, 2023 - 13:30 UTC
Investigating
Following our 11.5.0 release, partners have started experiencing an increasing number of false offline alerts on agent-based and agentless managed devices.

Our engineering team has been working on identifying the triggers, and deploying code and infrastructure changes to mitigate the issue and pave the way to finding its root cause.

These efforts are ongoing. While progress has been made and the rate of false offline alerts has dropped considerably, the root cause is still being investigated and a complete resolution is still outstanding.

In order to improve the traceability of our updates regarding this problem, we have retired the previous Datto Status post, and migrated it to the Datto Community Known Issues & Workarounds Blog.

https://community.datto.com/t5/Known-Issues-Workarounds/All-Platforms-Devices-may-alert-offline-when-their-session-is/ba-p/102689

Please follow this post to keep up to date with our progress.

This incident will also remain active until a resolution is reached, and subscribers will be notified once this post is resolved.
Posted Mar 13, 2023 - 11:42 UTC
This incident affected: Datto RMM (Pinotage (EU1), Merlot (EU2), Zinfandel (US West), Concord (US East), Syrah (APAC), Vidal (US East)).