On the 12th of April at 11:09 UTC, Datto RMM partners on the Merlot (EU2) platform experienced a service issue which caused excessive page load times in the Web Application.
One of the load balancers in the infrastructure became unhealthy, which temporarily caused a backend service's interactions with its downstream services (the cache and the front-end API) to fail. This resulted in excessively long loading times and internal server errors when accessing pages in the Web Application.
At 11:33 UTC the backend service recovered automatically.
To detect this kind of issue much earlier and automatically, more proactive monitoring at the load balancer level will be implemented in the near future.
A subsequent issue occurred on the 15th of April at around 07:30 UTC, causing the same behavior on the Merlot (EU2) platform.
The R&D team identified the cause as a long-running query that exhausted the resources available to the backend caching service. As a result, requests from the Web Application took increasingly long to complete and intermittently timed out, returning an error.
Resources were scaled, which appeared to resolve the issue temporarily, but downstream services eventually had to be restarted and a failover had to be performed on the Alerts database to resolve it fully.
The service was confirmed to be fully operational by 13:23 UTC.
To prevent this and similar issues related to the caching service from recurring, our R&D team is upgrading the caching client in the 13.1.0 release. Further load optimization work is being scoped and planned for later this year.
To further mitigate risk, internal alerting thresholds have been adjusted to enable faster response and remediation.