COE HPC Cluster Newsletter, October 30, 2023
Chiller failures in KEC
As announced yesterday, the cluster underwent an emergency shutdown yesterday morning due to a cooling failure in the KEC datacenters. Facilities was able to get to the heart of the problem: adjustments have been made, cooling has been restored, and temperatures in the datacenters have returned to normal. The cluster will be brought back online gradually over the course of this afternoon, but may not be back to full capacity until tomorrow.
Submit nodes reboot
Submit-a is back online, but submit-b and submit-c will need to be rebooted early this afternoon due to unresolved stale mounts arising from the cluster shutdown.
DGX partition change reminder
As announced last week, the dgx partition is being phased out, so please use “dgx2” instead. The resource limits will remain at 4 GPUs and 32 CPUs on the dgx2 partition for now, but may increase again in the future.
DGX H100 nodes
I want to thank all of you who have brought issues with the new H100 nodes to my attention; it is much appreciated. I’ve been delayed in working on these due to illness, but I hope to resolve most of them this week.
MarkIII training reminder
For those of you who are interested in the MarkIII trainings, session 3, “Intro to Datasets”, is tomorrow morning at 11am. Please see the link below for more information and to register:
https://trending.markiiisys.com/osu-aiseries
If you have any questions or concerns, let me know. For up-to-date status on the cluster, check out the link below:
https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news
Cheers,
Rob Yelle
HPC Manager