On July 25th at 9:30 AM EDT, emergency maintenance began on node des1-bfyii-2638 after its CPU overheated and failed.
The root cause of these backup issues was that the CPU on the node overheated, possibly because the CPU fan failed or because the portion of the motherboard that controls the CPU fan failed.
Engineering teams took the following steps to remediate the problem:
The following corrective actions have been identified to minimize the likelihood of this issue recurring:
Introduce CPU thermal monitoring that alerts on-call personnel when a CPU begins to exceed a high-temperature threshold.
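
As a rough illustration of what such a threshold check could look like, here is a minimal Python sketch. It is not the monitoring that was or will be deployed. The threshold value, the function names, and the `notify` callback are all hypothetical, and the sketch assumes a Linux host whose `/sys/class/thermal/thermal_zone*/temp` files (reported in millidegrees Celsius) reflect CPU temperature, which is not true on every platform.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Hypothetical threshold: the incident report does not state a value.
HIGH_TEMP_THRESHOLD_C = 85.0


def read_sysfs_temps_c(base: Path = Path("/sys/class/thermal")) -> List[float]:
    """Read Linux thermal zone temperatures, converting millidegrees C to degrees C.

    Assumption: these zones include the CPU sensor on the monitored node.
    """
    temps: List[float] = []
    for zone_file in sorted(base.glob("thermal_zone*/temp")):
        try:
            temps.append(int(zone_file.read_text().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
    return temps


def check_cpu_temperature(
    temps_c: Iterable[float],
    threshold_c: float = HIGH_TEMP_THRESHOLD_C,
    notify: Callable[[str], None] = print,
) -> bool:
    """Call notify once if the hottest reading exceeds threshold_c.

    Returns True when an alert was sent, False otherwise (including when
    there are no readings at all).
    """
    readings = list(temps_c)
    if not readings:
        return False
    hottest = max(readings)
    if hottest > threshold_c:
        # notify stands in for whatever paging integration is actually used.
        notify(
            f"CPU temperature {hottest:.1f} C exceeds threshold {threshold_c:.1f} C"
        )
        return True
    return False
```

A scheduler or monitoring agent could call `check_cpu_temperature(read_sysfs_temps_c())` at a fixed interval, passing a `notify` function that pages the on-call rotation. A production version would probably also treat missing readings as a fault worth alerting on, since a dead sensor or fan controller may report nothing at all.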