CoE HPC News, September 10 2024: Fall maintenance

Cluster users,

The next cluster maintenance is scheduled for the week of September 23. The maintenance activities planned for this time include:

- Operating system updates
- OnDemand HPC portal upgrade
- Slurm update and configuration changes
- Nvidia driver updates
- BIOS and firmware updates as needed
- Miscellaneous hardware maintenance as needed

The entire cluster will be offline starting Monday, September 23 at 8am, and will remain offline until approximately Tuesday the 24th at 4pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the offline period begins, you will need to adjust your time limit accordingly, e.g. to change it to 2 days:

scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option), and resubmit. If you have any questions or concerns, let me know.

For up-to-date status on the cluster, check out the link below:
https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

Rob Yelle
HPC Manager
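As a sketch of the adjustment described above (standard squeue/scontrol usage; the job ID 12345 is a placeholder for your own), you can first see which of your jobs are pending and why, then shrink a time limit:

```shell
# List your pending jobs with their current time limits and pending reasons
# (%i = job id, %l = time limit, %R = reason; -t PD restricts to pending jobs)
squeue -u "$USER" -t PD -o "%.10i %.11l %.20R"

# Shrink the time limit of a pending job so it can be scheduled
# before the maintenance reservation begins (2 days here):
scontrol update job 12345 TimeLimit=2-00:00:00
```

Jobs whose reason column shows "ReqNodeNotAvail, Reserved for maintenance" are the ones blocked by the maintenance window.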

Cluster users,

This is just a reminder that next week is cluster maintenance week, and the cluster will be offline starting next Monday the 23rd at 8am (see original announcement above). If you have jobs that you would like to run before the maintenance window starts, be sure to request a time limit that lets them complete before the maintenance begins; e.g. for a job starting today, request less than 4 days (96 hrs):

srun --time=3-12:00:00 ...

or, to modify a pending job:

scontrol update job {jobid} timelimit=3-12:00:00

Rob

From: Yelle, Robert Brian <robert.yelle@oregonstate.edu>
Date: Tuesday, September 10, 2024 at 2:41 PM
To: cluster-users@engr.orst.edu
Subject: CoE HPC News, September 10 2024: Fall maintenance
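To pick a --time value that fits before the window opens, you can compute the remaining time yourself. A minimal sketch, assuming GNU date; the 15-minute safety margin is an illustrative choice, not part of the announcement:

```shell
# Seconds from now until the maintenance window opens (GNU date assumed)
deadline=$(date -d "2024-09-23 08:00" +%s)
now=$(date +%s)
secs=$(( deadline - now ))

# Leave a small safety margin so the job finishes before 8am
margin=$(( 15 * 60 ))
secs=$(( secs - margin ))

# Convert to Slurm's D-HH:MM:SS form, suitable for --time
printf '%d-%02d:%02d:%02d\n' \
    $(( secs / 86400 )) $(( secs % 86400 / 3600 )) \
    $(( secs % 3600 / 60 )) $(( secs % 60 ))
```

The printed value can be passed directly to srun/sbatch --time; run this close to submission time, since the remaining window shrinks as the deadline approaches.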

Cluster users,

The cluster is back online, but at reduced capacity while maintenance progresses. Maintenance for servers in the following partitions has been rescheduled for next week, after October 2nd:

- athena
- eecs2
- gpu
- gpu-dmv
- sail
- soundbendor

Rob

From: Yelle, Robert Brian <robert.yelle@oregonstate.edu>
Date: Thursday, September 19, 2024 at 11:00 AM
To: cluster-users@engr.orst.edu
Subject: Re: CoE HPC News, September 10 2024: Fall maintenance
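To see whether the rescheduled partitions are back in service, a quick sinfo query works; the partition names come from the list above, and the format string is one reasonable choice:

```shell
# Show availability and node state for the partitions still awaiting maintenance
# (%P = partition, %a = partition availability, %T = node state, %D = node count)
sinfo -p athena,eecs2,gpu,gpu-dmv,sail,soundbendor -o "%.12P %.6a %.10T %.6D"
```

Partitions whose availability column reads "up" with nodes in idle or mix states are accepting jobs again.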