RMM - Merlot - Intermittent errors returned in the UI
Incident Report for Datto
Postmortem

On the 12th of April at 11:09 UTC, Datto RMM partners on the Merlot (EU2) platform experienced a service issue which caused excessive page load times in the Web Application.

One of the load balancers in the infrastructure became unhealthy, which temporarily caused a backend service's interactions with its downstream services (the cache and the front-end API) to fail. This resulted in excessively long loading times and internal server errors when accessing pages in the Web Application.

At 11:33 UTC the backend service recovered automatically.

To detect these kinds of issues much earlier and in an automated way, more proactive monitoring at the load balancer level will be implemented in the near future.

A subsequent issue occurred on the 15th of April at around 07:30 UTC, causing the same behavior on the Merlot (EU2) platform.

The R&D team identified the cause as a long-running query exhausting the resources available to the backend caching service. As a result, requests from the Web Application took increasingly long to be fulfilled and intermittently timed out, returning an error.

Resources were scaled, which appeared to resolve the issue temporarily, but downstream services eventually had to be restarted and a failover performed on the Alerts database to fully resolve the issue.

The service was confirmed to be fully operational by 13:23 UTC.

To prevent this and similar issues related to the caching service from recurring, our R&D team is upgrading the caching client in the 13.1.0 release. Further load optimization work is being scoped and planned for later this year.

In order to further mitigate risk, internal alerting thresholds have been adjusted to facilitate faster response and remediation.

Posted May 08, 2024 - 09:51 UTC

Resolved
This incident has been resolved.
Posted Apr 15, 2024 - 06:51 UTC
Monitoring
Our teams have investigated an issue on Merlot for Datto RMM where intermittent errors were observed in the UI. A fix has been implemented and we are currently monitoring the results.

We apologize for any inconvenience this may have caused.

Thank you for your patience!
Posted Apr 12, 2024 - 14:02 UTC
This incident affected: Datto RMM (Merlot (EU2)).