CoE HPC News, Apr 25 2023: AC maintenance, new DGX limits

HPCC users,

First, thanks to all of you who participated in the HPC usage survey. This was greatly appreciated. The results help us better understand how the cluster is currently being used and what future trainings might be helpful or of interest.

AC maintenance

Last Saturday night, some compute nodes were shut down because datacenter temperatures rose to unsafe levels. This resulted in the termination of some jobs. Cooling has since been partly restored, but this was a short-term fix, and additional urgent maintenance is needed that requires a temporary shutdown of the chillers. This will take place starting early tomorrow (Wednesday) morning. To mitigate the heat load, the cluster will run at reduced capacity during the AC maintenance. It is likely that no additional jobs will be accepted until the maintenance has completed, which should be tomorrow afternoon at the earliest. I expect the cluster will be restored to full capacity by late Thursday morning.

New GPU limits on DGX systems

Due to the continued increase in the number of users of the DGX systems, demand for GPUs on these systems has lately been at an all-time high! The previous GPU limit of 8 is no longer sustainable without some additional modifications. Effective immediately, the cumulative GPU limit on the DGX systems has been reduced to 4 until further notice. For those of you who have deadlines coming up, please contact me directly and we’ll try to work something out.

Cheers,
Rob Yelle
HPC Manager