Beginning February 16, 2020 at 08:21 UTC, and continuing over the next two days, partners in the DE1 zone experienced increased latency followed by an error page. The first service interruption lasted 27 minutes. The second, on February 17, lasted three minutes, and the last, on February 18, lasted 24 minutes, for a total of approximately 55 minutes over three consecutive days.
The underlying cause of the interruptions was an inefficient method in the Autotask application code used for querying Installed Products. The method returned ALL Products whenever a user queried Products or navigated to a control that queried Products. Historically, this had not been an issue. Shortly before these incidents, however, a partner mistakenly imported over 993,000 products manually into their database. When a user in that partner’s database queried Products, the inefficient method queried ALL 993,000+ Products, causing race conditions on the web server making the call. As that web server became overloaded, the query was passed to the next web server, which was also overwhelmed by it. The query continued passing from one web server to the next until all of the web servers for the zone were overloaded, bringing the zone down.
Datto’s Development Engineers have inspected and replaced nine inefficient calls that formerly tried to return all rows for items such as Products. These calls now return only a portion of the data and will not cause any load increase on web servers.
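The difference between the two query patterns can be illustrated with a minimal sketch. This is not the Autotask code, which is not shown in this report. It assumes a hypothetical `products` table in an in-memory SQLite database, a hypothetical page size of 100, and hypothetical function names, and it uses key-based paging as one common way to return only a portion of the data per call.

```python
import sqlite3

# Hypothetical page size; the limit used in the actual fix is not stated.
PAGE_SIZE = 100


def fetch_all_products(conn):
    """Unbounded pattern: loads every row into memory in a single call."""
    return conn.execute("SELECT id, name FROM products ORDER BY id").fetchall()


def fetch_products_page(conn, after_id=0, page_size=PAGE_SIZE):
    """Bounded pattern: returns at most page_size rows with id > after_id."""
    return conn.execute(
        "SELECT id, name FROM products WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()


def make_demo_db(row_count):
    """Builds an in-memory database with row_count illustrative products."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO products (id, name) VALUES (?, ?)",
        ((i, f"product-{i}") for i in range(1, row_count + 1)),
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_demo_db(1000)
    first_page = fetch_products_page(conn)
    # Pass the last id seen to request the next page.
    second_page = fetch_products_page(conn, after_id=first_page[-1][0])
    print(len(first_page), len(second_page))
```

With the bounded pattern, the work done per request depends on the page size and not on the number of rows in the table, so a database holding 993,000+ Products costs roughly the same per call as a small one.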
We will continue to serve our partners in order to allow them to use Autotask PSA to run their business reliably without undue limitations. We will undertake edge-case testing of large-scale data sets to determine if there are other opportunities to increase efficiency or eliminate causes of service interruption.