You may have experienced intermittent service interruptions with Autotask PSA from Friday, April 23 through Thursday, April 29. Upon investigation, we traced the interruptions to two root causes, a server fault and a contributing software behavior:
Server: A server responsible for storing and retrieving attachments (such as screenshot images attached to tickets) entered a “bad state,” intermittently failing to respond to requests from the web tier. There was never enough strain on server resources to trigger an alert. The behavior resulted in request backlogs during peak periods, causing a domino effect as web servers became overwhelmed and left the available load pool.
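To illustrate why resource-based alerts can miss this kind of failure, here is a minimal sketch of a monitor that watches whether requests are answered rather than how busy the server is. It is not the monitoring we run in production; the class name ResponseMonitor, the window size, and the threshold are all hypothetical values chosen for the example.

```python
from collections import deque


class ResponseMonitor:
    """Track recent request outcomes and flag a high failure rate.

    Hypothetical sketch: the window and threshold are illustrative only.
    """

    def __init__(self, window: int = 20, threshold: float = 0.25) -> None:
        # True means the server responded; False means the request failed.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, responded: bool) -> None:
        """Store one request outcome; the oldest outcome drops off when full."""
        self.outcomes.append(responded)

    def should_alert(self) -> bool:
        """Alert when the share of failed requests reaches the threshold."""
        if not self.outcomes:
            return False
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes) >= self.threshold


# Simulate a server that intermittently fails to respond while its CPU and
# memory would look healthy to a resource-based alert.
monitor = ResponseMonitor(window=10, threshold=0.25)
for responded in [True, True, False, True, False, True, False, True, True, True]:
    monitor.record(responded)
alert = monitor.should_alert()
```

In this simulated run, three of the ten most recent requests went unanswered, so the monitor would raise an alert even though no resource limit was reached.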
Software: We designed Autotask PSA to attempt page restoration upon login following a session timeout. Analysis of our code found that, in many cases, the platform was refreshing each restored page up to ten times when only a single refresh is required. As a result, both the web tier and the broader system experienced increased load, especially during initial login at the beginning of the workday.
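The redundant-refresh behavior described above can be sketched as follows. This is an illustration only, not the Autotask PSA code; the function restore_pages, the page identifiers, and the refresh callback are hypothetical names introduced for the example.

```python
from typing import Callable, Iterable, List


def restore_pages(page_ids: Iterable[str], refresh: Callable[[str], None]) -> List[str]:
    """Refresh each restored page exactly once, preserving the original order."""
    refreshed: List[str] = []
    seen = set()
    for page_id in page_ids:
        if page_id in seen:
            continue  # already refreshed during this restoration pass
        seen.add(page_id)
        refresh(page_id)
        refreshed.append(page_id)
    return refreshed


# Simulate a restoration queue in which each page was enqueued ten times.
calls: List[str] = []
queue = [page for page in ("ticket-list", "dashboard") for _ in range(10)]
restored = restore_pages(queue, calls.append)
```

In this simulation the queue holds 20 refresh requests, but only two refreshes are issued, one per page.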
After our team identified the root causes, we scheduled a maintenance window, during which we fully rebooted the impacted server. Since the reboot, the problem has not recurred. In addition, we have updated the Autotask PSA code to stop unnecessary page refreshes after sessions time out. We have also increased the level of monitoring on the server and are adding redundancy with automated failover.
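As a rough illustration of redundancy with automated failover, the sketch below tries a primary backend first and falls back to a replica when the primary does not respond. It is a simplified example under stated assumptions, not our production design; fetch_with_failover, AllBackendsFailed, and the two stand-in backends are hypothetical.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when no backend answered the request."""


def fetch_with_failover(
    backends: Sequence[Callable[[str], bytes]],
    attachment_id: str,
) -> bytes:
    """Try each backend in order and return the first successful response."""
    errors = []
    for backend in backends:
        try:
            return backend(attachment_id)
        except (TimeoutError, ConnectionError) as exc:
            errors.append(exc)  # remember the failure, then try the next backend
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed for {attachment_id}")


def unresponsive_primary(attachment_id: str) -> bytes:
    # Stand-in for a server in a bad state that does not answer in time.
    raise TimeoutError("primary did not respond")


def healthy_replica(attachment_id: str) -> bytes:
    # Stand-in for a redundant server that answers normally.
    return f"contents of {attachment_id}".encode()


data = fetch_with_failover([unresponsive_primary, healthy_replica], "screenshot-1")
```

Because the fallback happens inside the request path, a single unresponsive server no longer causes requests to back up on the web tier in this simplified model.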
Our engineering team continues to monitor the LON zone during peak hours of the workday to confirm that the issue is fully resolved. They are also gathering additional data and can quickly restore service if needed. Because of this incident, we have delayed our 2021.1 release by one week.
We recognize that even a minute of downtime can disrupt your business, and we apologize for this situation. We look forward to continuing to earn your trust in Autotask PSA and deliver the uptime that you have come to expect. Thank you for your support and understanding.