HPCC users,
First, thanks to all of you who participated in the HPC usage survey; this was greatly appreciated. The results help us better understand how the cluster is currently being used and which future trainings might be helpful or of interest.
AC maintenance
Last Saturday night, some compute nodes were shut down because datacenter temperatures rose to unsafe levels, which terminated some jobs. Cooling has since been partly restored, but this was a short-term fix, and additional urgent maintenance is needed that requires a temporary shutdown of the chillers. This will take place starting early tomorrow (Wednesday) morning. To mitigate the heat load, the cluster will run at reduced capacity during the AC maintenance. It is likely that no additional jobs will be accepted until the maintenance has completed, which should be tomorrow afternoon at the earliest. I expect the cluster will be restored to full capacity by late Thursday morning.
New GPU limits on DGX systems
Due to the continued increase in the number of DGX users, demand for GPUs on the DGX systems has lately been at an all-time high! The previous GPU limit of 8 is no longer sustainable without some additional modifications. Effective immediately, the cumulative GPU limit on the DGX systems has been reduced to 4 until further notice. For those of you who have deadlines coming up, please contact me directly and we’ll try to work something out.
Cheers,
Rob Yelle
HPC Manager