HPCC users,
Early this morning, a cooling failure in the KEC datacenter allowed temperatures to climb to unsafe levels. As a result, all DGX-2 nodes shut down automatically, and all jobs running on those nodes were terminated.
Cooling has been restored, temperatures are back to safe levels, and all of the DGX-2 nodes are back online. I know many of you have deadlines coming up, so you may want to check the status of your jobs and resubmit as needed.
Rob Yelle
HPC Manager