PSA - LON Zone (WW4) - Recurring intermittent issue causing long loading times across the application - Under Investigation
Incident Report for Datto
Postmortem

You may have experienced intermittent service interruptions with Autotask PSA from Friday, April 23 through Thursday, April 29. Upon investigation, we traced the problem to two causes, a misbehaving server and a contributing software behavior:

  • Server: A server responsible for storing and retrieving attachments (such as screenshot images attached to tickets) entered a “bad state,” intermittently failing to respond to requests from the web tier. There was never enough strain on server resources to trigger an alert. The behavior resulted in request backlogs during peak periods, causing a domino effect as web servers became overwhelmed and left the available load pool.

  • Software: We designed Autotask PSA to attempt page restoration upon login following a session timeout. Analysis of our code found that, in many cases, the platform was refreshing each restored page up to ten times when only a single refresh is required. As a result, both the web tier and the broader system experienced increased load, especially during initial login at the beginning of the work day.
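
To illustrate the software behavior described above, here is a minimal sketch of restoring pages after login so that each page is refreshed only once. It is not the Autotask PSA code. The names `restore_pages` and `refresh`, and the page identifiers, are hypothetical and chosen only for this example.

```python
from typing import Callable, Iterable, List, Set


def restore_pages(page_ids: Iterable[str], refresh: Callable[[str], None]) -> int:
    """Refresh each restored page once, however many times it is queued.

    Returns the number of distinct pages that were refreshed.
    """
    refreshed: Set[str] = set()
    for page_id in page_ids:
        if page_id in refreshed:
            continue  # already refreshed during this restoration pass
        refresh(page_id)
        refreshed.add(page_id)
    return len(refreshed)


# Hypothetical scenario: two pages, each queued for refresh ten times.
calls: List[str] = []
queued = ["ticket-list", "dashboard"] * 10
count = restore_pages(queued, calls.append)
print(count, calls)
```

In this sketch the twenty queued refreshes collapse to two, one per page, which is the kind of reduction in login-time load the fix aims for.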

After our team identified the root causes, we scheduled a maintenance window, during which we fully rebooted the impacted server. Since the reboot, the problem has not recurred. In addition, we have updated the Autotask PSA code to stop unnecessary page refreshes after sessions time out. We have also increased the level of monitoring on the server and are adding redundancy with automated failover.
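
The monitoring and failover changes can be pictured with a small sketch. Because the server never showed enough resource strain to trigger an alert, the sketch watches whether requests succeed rather than how busy the server is, and switches to a standby once recent failures cross a threshold. This is an assumption-laden illustration, not our production design. `FailoverRouter`, the window size, and the failure ratio are all hypothetical.

```python
from collections import deque


class FailoverRouter:
    """Send requests to a primary backend and switch to a standby when the
    primary's recent failure rate exceeds a threshold (hypothetical sketch)."""

    def __init__(self, primary, standby, window=20, max_failure_ratio=0.5):
        self.primary = primary
        self.standby = standby
        self.max_failure_ratio = max_failure_ratio
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.failed_over = False

    def _primary_unhealthy(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # wait until the window is full before judging
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) > self.max_failure_ratio

    def fetch(self, key):
        if not self.failed_over:
            try:
                result = self.primary(key)
                self.outcomes.append(True)
                return result
            except (TimeoutError, ConnectionError):
                # Record the failed request; resource metrics are not consulted.
                self.outcomes.append(False)
                if self._primary_unhealthy():
                    self.failed_over = True
        # Serve from the standby after a failure or once failed over.
        return self.standby(key)


# Hypothetical demo: a primary that never responds, and a working standby.
primary_calls = []


def flaky_primary(key):
    primary_calls.append(key)
    raise TimeoutError("attachment server did not respond")


def standby(key):
    return f"standby:{key}"


router = FailoverRouter(flaky_primary, standby, window=4, max_failure_ratio=0.5)
results = [router.fetch(f"img-{i}") for i in range(6)]
print(router.failed_over, len(primary_calls))
```

In the demo, the first four requests each try the primary, fail, and are served by the standby. After the window fills with failures, the router stops sending requests to the primary altogether.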

We continue to have our engineering team monitor the LON zone during peak hours of the workday to ensure full resolution of the situation. They are also gathering additional data and can quickly restore service if needed. Due to this situation, we have delayed our 2021.1 release by one week.

We recognize that even a minute of downtime can disrupt your business, and we apologize for this situation. We look forward to continuing to earn your trust in Autotask PSA and deliver the uptime that you have come to expect. Thank you for your support and understanding.

Posted May 05, 2021 - 19:18 UTC

Resolved
This incident has been resolved.
Posted May 05, 2021 - 19:17 UTC
Monitoring
A potential problem has been identified with a number of internal servers experiencing higher-than-average load.

A fix has been implemented for these servers and we are currently monitoring the results.
Posted Apr 30, 2021 - 15:29 UTC
Update
Our Technology/Engineering team is currently investigating reports of a service interruption. We are looking into it as a matter of priority.
Posted Apr 30, 2021 - 12:07 UTC
Investigating
Our teams are currently investigating a recurring, intermittent issue causing a brief service interruption between 8 and 10 AM BST. We continue to work on identifying the root cause and applying a permanent fix.

We apologise for any inconvenience caused and thank you for your patience!
Posted Apr 29, 2021 - 18:31 UTC
This incident affected: Autotask PSA (UK (United Kingdom)).