CoE HPC News, Mar 10 2023: IB Upgrade, Spring Maintenance Week

HPCC users,

Please check out the latest HPC cluster news below.

InfiniBand Upgrade

We are upgrading the high-speed InfiniBand backbone of our cluster! All new servers added to the cluster will have up to 200 Gb/s bandwidth.

Spring Break Maintenance

The cluster will undergo its regularly scheduled quarterly maintenance during Spring Break, March 27-31. The following activities will be performed:

- Operating system updates
- BIOS and firmware updates as needed
- NVIDIA/CUDA driver updates
- InfiniBand infrastructure upgrade

Due to the InfiniBand upgrade, we anticipate an extended offline period, which will start Monday afternoon the 27th at 1pm and run through Thursday afternoon the 30th. Jobs scheduled to run into this offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you want your Slurm job to start and finish before the offline period begins, you will need to adjust its time limit accordingly.

For the latest cluster news and status updates, check out the link below:

https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

Have a nice weekend!

Robert Yelle
HPC Manager
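The advice above about fitting a Slurm job in before the offline period comes down to a small date calculation. The following is only a minimal Python sketch under stated assumptions: the function name `slurm_time_limit` and the 10-minute safety margin are hypothetical choices for illustration, the start time is taken as cluster-local time, and the job is assumed to start as soon as it is submitted (any time spent waiting in the queue shortens the usable window). The result is formatted as `days-hours:minutes:seconds`, one of the formats accepted by the `--time` option of `sbatch`.

```python
from datetime import datetime, timedelta

# Start of the offline period as announced: Monday, March 27, 2023 at 1pm
# (assumed here to be cluster-local time).
MAINTENANCE_START = datetime(2023, 3, 27, 13, 0)


def slurm_time_limit(now, start=MAINTENANCE_START, margin=timedelta(minutes=10)):
    """Return the longest D-HH:MM:SS limit that ends before `start`.

    `margin` is a hypothetical safety buffer, not a Slurm requirement.
    Returns None when no time is left before the offline period.
    """
    remaining = start - now - margin
    if remaining <= timedelta(0):
        return None
    total_seconds = int(remaining.total_seconds())
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    limit = slurm_time_limit(datetime.now())
    if limit is None:
        print("No time left before the offline period begins.")
    else:
        print(f"Largest usable limit right now: --time={limit}")
```

A job submitted with a limit no larger than the printed value, for example `sbatch --time=<limit> job.sh` (where `job.sh` is a placeholder script name), has a requested run time that ends before the offline period, provided it starts promptly.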

Cluster users,

The InfiniBand upgrade is still in progress. It is mostly complete, but unfortunately we have hit a critical problem that would prevent the cluster from being operational, so we are working closely with our vendor, NVIDIA, to resolve it. The maintenance period has therefore been extended, and we hope to have this resolved by the end of the day tomorrow.

I apologize for any inconvenience this may cause.

Rob Yelle
HPC Manager

From: Yelle, Robert Brian <robert.yelle@oregonstate.edu>
Date: Friday, March 10, 2023 at 1:25 PM
To: cluster-users@engr.orst.edu <cluster-users@engr.orst.edu>
Subject: CoE HPC News, Mar 10 2023: IB Upgrade, Spring Maintenance Week

Cluster users,

The InfiniBand upgrade has been completed and the critical issue has been resolved. The cluster is back online, but at limited capacity. The DGX servers are currently at 50% capacity and should be back to full capacity by Monday, April 3.

Have a good weekend!

Rob Yelle

From: Yelle, Robert Brian <robert.yelle@oregonstate.edu>
Date: Thursday, March 30, 2023 at 2:18 PM
To: cluster-users@engr.orst.edu <cluster-users@engr.orst.edu>
Subject: FW: CoE HPC News, Mar 10 2023: IB Upgrade, Spring Maintenance Week