COE HPC cluster down

Cluster users,

The datacenter experienced a cooling failure this morning which required an emergency shutdown of the entire cluster. Facilities is working on the issue, and we hope to bring the cluster back online sometime tomorrow.

Rob Yelle
HPC Manager

COE HPC Cluster Newsletter, October 30, 2023

Chiller failures in KEC

As announced yesterday, the cluster underwent an emergency shutdown yesterday morning due to a cooling failure in the KEC datacenters. Facilities got to the heart of the problem: adjustments have been made, cooling has been restored, and temperatures in the datacenters have returned to normal. The cluster will be brought back online gradually over the course of this afternoon, but it may not be back to full capacity until tomorrow.

Submit nodes reboot

Submit-a is back online, but submit-b and submit-c will need to be rebooted early this afternoon due to unresolved stale mounts arising from the cluster shutdown.

DGX partition change reminder

As announced last week, the dgx partition is being phased out, so please use "dgx2" instead. The resource limits on the dgx2 partition will remain at 4 GPUs and 32 CPUs for now, but they may increase in the future. A sample batch script for dgx2 is included at the end of this message.

DGX H100 nodes

I want to thank all of you who have brought issues with the new H100 nodes to my attention; it is much appreciated. I have been delayed in working on these due to illness, but I hope to resolve most of them this week.

MarkIII training reminder

For those of you who are interested in the MarkIII trainings, session 3 is tomorrow morning at 11am on "Intro to Datasets". Please see the link below for more information and to register:

https://trending.markiiisys.com/osu-aiseries

If you have any questions or concerns, let me know. For up-to-date status on the cluster, check the link below:

https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

Cheers,

Rob Yelle
HPC Manager
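For anyone updating job scripts for the partition change, here is a minimal sketch of a Slurm batch script targeting dgx2; the job name, walltime, and workload below are hypothetical placeholders, so adjust them to your own jobs:

#!/bin/bash
#SBATCH --job-name=h100-test        # hypothetical job name
#SBATCH --partition=dgx2            # dgx is being phased out; use dgx2
#SBATCH --gres=gpu:4                # current dgx2 limit: 4 GPUs
#SBATCH --cpus-per-task=32          # current dgx2 limit: 32 CPUs
#SBATCH --time=01:00:00             # example walltime

# Report the GPUs allocated to the job
nvidia-smi

You can confirm the current limits on the partition at any time with "scontrol show partition dgx2".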