CoE HPC News, March 14 2022 Edition: Spring maintenance

HPCC users,

Here is the latest news on the CoE HPC cluster:

Spring break maintenance week:

The HPC Cluster will undergo its regularly scheduled quarterly maintenance next week (March 21-25). The following maintenance activities will be performed:

- Operating system updates
- BIOS and firmware updates as needed
- Nvidia driver and CUDA updates
- Slurm configuration changes
- Open OnDemand configuration changes
- Power redistribution for various compute nodes
- Miscellaneous hardware maintenance as needed

The entire cluster will be offline starting Tuesday the 22nd at 8am, and will remain offline until approximately Wednesday the 8th at 3pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”.

If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change it to 2 days:

  scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option), and resubmit; a worked example follows this announcement. I will send out another announcement when the cluster is back online. After the offline period, maintenance will continue on various parts of the cluster throughout the week.

For cluster news and status updates, check out the link below:
https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

Cheers,
Rob Yelle
HPC Manager
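A minimal sketch of that workflow, assuming a pending job with ID 123456 and a batch script named myjob.sh (both are placeholders; substitute your own values from squeue):

  # List your pending jobs with their current time limits
  squeue -u $USER --states=PENDING -o "%.10i %.20j %.12l"

  # Shorten the limit so the job can start and finish before 8am on the 22nd
  scontrol update job 123456 TimeLimit=2-00:00:00

  # Or cancel, then resubmit with a shorter walltime
  scancel 123456
  sbatch --time=2-00:00:00 myjob.sh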

HPCC users,

I apologize for the additional announcement, but I want to clear up a typo that I made in the previous announcement: the cluster will be offline from Tuesday March 22nd at 8am through approximately Wednesday March 23rd at 3pm.

If anyone has any questions or concerns, let me know.

Rob

HPCC users,

The cluster is back online, but at limited capacity. Additional resources will become available throughout the week as maintenance progresses.

Submit node ssh warnings:

Please note that the server host keys have changed on two of the submit nodes (submit-a and submit-b), so new ssh connections may result in a security warning like "Remote Host Identification has changed" or "Host Key verification failed". It is okay to continue the connection. If you are using MacOS or Linux and are having trouble connecting, try the following:

  ssh-keygen -R submit-a.hpc.engr.oregonstate.edu
  ssh-keygen -R submit-b.hpc.engr.oregonstate.edu
  ssh-keygen -R submit.hpc.engr.oregonstate.edu

then try to connect via ssh again.

New Cuda versions:

Cuda versions 11.5 and 11.6 have been installed, and Nvidia drivers have been upgraded on most GPU nodes to support these latest Cuda versions. However, drivers for Cuda 11.5+ have not yet been released for the DGX systems. The default Cuda module has been changed to 11.4 to reflect the maximum Cuda version supported by the DGX systems; a usage example follows this announcement.

For more cluster news and status updates, check out the link below:
https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

Cheers,
Rob Yelle
HPC Manager
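Assuming Cuda toolkits are delivered through environment modules on this cluster (the exact module names below, such as cuda/11.6, are an assumption based on the versions mentioned above), selecting a version would look something like:

  module avail cuda        # list the installed Cuda versions
  module load cuda         # default module, now Cuda 11.4 (supported on the DGX systems)
  module load cuda/11.6    # or explicitly pick a newer toolkit on non-DGX GPU nodes
  nvcc --version           # verify which toolkit is active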