RMM - Concord - Delayed Ticket Creation
Incident Report for Datto
Postmortem

On 20-October-2021 at 16:00 UTC, Datto RMM Partners on the Concord platform experienced a service interruption that caused Autotask PSA Tickets for Alerts to be raised late or not at all.

The root cause of this service interruption was identified as a spike in Ticket Creation requests from a few Devices on the platform, amounting to over 100K requests from these Devices.

The Alert to Ticket process works on a First-In-First-Out basis to ensure that Resolution requests never precede Creation requests. To prevent small spikes from individual Devices stalling the Creation of Tickets for other Devices, the process creates a separate queue for each Device. However, the logic only looks at the first 20K messages. In this situation, the few spiking Devices took up that entire window, so the process could not queue up multiple Devices simultaneously.
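The effect of the fixed window can be illustrated with a minimal sketch. This is not the actual Datto RMM implementation: the function name `devices_in_window`, the message shape, and the small window sizes (standing in for the 20K limit) are all assumptions made for illustration.

```python
from collections import OrderedDict, deque

def devices_in_window(messages, window):
    """Group the first `window` messages into one FIFO queue per device.

    `messages` is a list of (device_id, payload) tuples in arrival order.
    Hypothetical helper, not the real Alert to Ticket logic.
    """
    per_device = OrderedDict()
    for device_id, payload in messages[:window]:
        per_device.setdefault(device_id, deque()).append(payload)
    return per_device

# Hypothetical backlog: one spiking device ahead of two quiet devices.
backlog = [("noisy", i) for i in range(10)] + [("quiet-a", 0), ("quiet-b", 0)]

# Window smaller than the spike: only the spiking device gets a queue,
# so extra workers have no other device to work on in parallel.
small = devices_in_window(backlog, window=8)
print(list(small))  # ['noisy']

# Window larger than the spike: the quiet devices get their own queues.
large = devices_in_window(backlog, window=12)
print(list(large))  # ['noisy', 'quiet-a', 'quiet-b']
```

With the smaller window every visible message belongs to the spiking device, which is consistent with why adding processing resources could not help until that device's requests were cleared.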

Attempts to increase the processing speed by allocating more resources to the processing of these requests were unsuccessful in resolving the issue. The process was already working at full capacity on the few spiking Devices and could only start working through the requests from other Devices, and thus use the newly allocated resources, once these were finished. The decision was made to purge the queue of messages in order to restore the service on 20-October-2021 at 18:00 UTC.

We aim to ensure that the actions of a single Device or collection of Devices cannot impact the performance of the platform for other Devices. In this case we fell short of that aim by being too lenient with the rate limits applied to incoming Alerts. Moving forward, we will more strictly rate limit the number of Alerts a Device is allowed to raise in a given time period. A post on the Datto Community is forthcoming to provide the details of this.
The processing time to raise Tickets for Devices with an extensive Alert History is significant due to the way the Alert Summary for the Device is generated. This will be improved both through tweaks to that process and through pruning of the Alert History that we store. The Datto Community post will outline this pruning further.
Better queue tooling is being worked on to allow us to more accurately interrogate and prune messages from a queue en masse.
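
As an illustration of the per-Device rate limiting described above, the following is a minimal sliding-window sketch. The class name `DeviceRateLimiter`, its parameters, and the limit of 3 Alerts per 60 seconds are hypothetical; the actual limits and mechanism are not stated in this report.

```python
from collections import defaultdict, deque

class DeviceRateLimiter:
    """Allow at most `max_alerts` per device within any `period_seconds` span.

    Hypothetical sketch; not the limiter used by the platform.
    """

    def __init__(self, max_alerts, period_seconds):
        self.max_alerts = max_alerts
        self.period = period_seconds
        self._recent = defaultdict(deque)  # device_id -> accepted timestamps

    def allow(self, device_id, now):
        recent = self._recent[device_id]
        # Forget accepted alerts that have aged out of the window.
        while recent and now - recent[0] >= self.period:
            recent.popleft()
        if len(recent) >= self.max_alerts:
            return False  # this device is over its limit; reject the alert
        recent.append(now)
        return True

limiter = DeviceRateLimiter(max_alerts=3, period_seconds=60)
noisy = [limiter.allow("noisy", now=t) for t in (0, 1, 2, 3)]
print(noisy)                            # [True, True, True, False]
print(limiter.allow("quiet", now=3))    # True
print(limiter.allow("noisy", now=60))   # True
```

Because each Device has its own window, a spiking Device is throttled without affecting Alerts from other Devices.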

Posted Oct 26, 2021 - 13:23 UTC

Resolved
This incident has been resolved.

We will publish the Root Cause Analysis for this incident as soon as it becomes available.
Posted Oct 21, 2021 - 10:32 UTC
Update
A fix has been implemented and we are monitoring results. Please be aware that Alerts that were pending ticket creation before 18:05 UTC will remain in a Ticket Pending Creation state. We apologize for any inconvenience.
Posted Oct 20, 2021 - 22:02 UTC
Monitoring
A fix has been implemented and we are monitoring results. Please be aware that Alerts that were pending ticket creation before 18:05 UTC will fail to create a ticket. We apologize for any inconvenience.
Posted Oct 20, 2021 - 18:50 UTC
Update
We are still investigating the issue. We have also noticed that attempts to set up the RMM to PSA integration in the Setup Wizard are failing.
Posted Oct 20, 2021 - 17:22 UTC
Update
We are continuing to investigate the issue.
Posted Oct 20, 2021 - 17:07 UTC
Investigating
Our teams are currently investigating a delay in ticket creation from alerts for Datto RMM on the Concord platform. An update will be posted here within 60 minutes with the status of this investigation.

Thank you for your patience!
Posted Oct 20, 2021 - 16:04 UTC
This incident affected: Datto RMM (Concord (US East)).