CoE HPC News, Sept 1 2022 Edition: Fall maintenance

HPCC users,

Here is the latest news on the CoE HPC cluster:

Fall maintenance week: The HPC cluster will undergo its regularly scheduled quarterly maintenance September 12-16. The following maintenance activities will be performed:

- Operating system updates
- BIOS and firmware updates as needed
- Slurm scheduler upgrade
- Nvidia driver and CUDA updates
- Miscellaneous hardware maintenance as needed

The entire cluster will be offline starting Tuesday the 13th at 8am, and will remain offline until approximately Wednesday the 14th at 3pm. Jobs scheduled to run into that offline period will remain pending with the message "ReqNodeNotAvail, Reserved for maintenance". If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly; e.g., to change it to 2 days:

scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option), and resubmit. A worked sketch of both options follows this message. I will send out a reminder next week, and another announcement when the cluster is back online.

For cluster news and status updates, check out the link below:
https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

Cheers,
Rob Yelle
HPC Manager
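Concretely, assuming a hypothetical pending job ID 123456 and a placeholder batch script myjob.sh (neither comes from the announcement), the two options look like this:

  # Check why the job is pending; the reason will read
  # "ReqNodeNotAvail, Reserved for maintenance" if it is blocked by the outage
  squeue -j 123456 -o "%.10i %.9P %.8T %r"

  # Option 1: shorten the time limit in place so the job fits before the outage
  scontrol update job 123456 TimeLimit=2-00:00:00

  # Option 2: cancel, then resubmit with a shorter walltime
  scancel 123456
  sbatch --time=2-00:00:00 myjob.sh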

From: Yelle, Robert Brian <robert.yelle@oregonstate.edu>
Date: Friday, September 9, 2022 at 12:28 PM
To: cluster-users@engr.orst.edu
Subject: Re: CoE HPC News, Sept 1 2022 Edition: Fall maintenance

Cluster users,

This is your friendly reminder that next week is cluster maintenance week: the cluster will be offline starting next Tuesday morning through Wednesday afternoon (see the message above for details). Have a good weekend!

Rob Yelle
HPC Manager

HPCC users,

The cluster is back online but with limited availability while maintenance progresses on the rest of the cluster. The HPC portal is still offline. Please be patient while resources and services are being restored.

For those of you who were away this summer, a change has been made to the DGX partitions:

- If you need 4 GPUs or fewer, use the "dgx" partition.
- If you need 4 GPUs or more, use the "dgx2" partition.

Example submission lines follow this message. If you encounter any problems using the cluster, let me know.

Cheers,
Rob Yelle
HPC Manager
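For illustration, submissions under the revised partition scheme might look like the lines below (the GPU counts and the script name myjob.sh are placeholders, not part of the announcement):

  # 4 GPUs or fewer: use the dgx partition
  sbatch --partition=dgx --gres=gpu:2 myjob.sh

  # More GPUs (e.g., 8): use the dgx2 partition
  sbatch --partition=dgx2 --gres=gpu:8 myjob.sh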