CoE HPC News, March 12, 2025: Email notification, Spring break maintenance

Cluster users,

See below the latest news regarding the CoE HPC cluster.

Email notification

Email notification for Slurm jobs was not functional for a while after the head node migration in December, but it is working now.

New partition limits

This is a reminder that the new partition (queue) limits are posted and kept up to date under “Summary of partition limits” in the Slurm Howto:

https://it.engineering.oregonstate.edu/hpc/slurm-howto

Spring break maintenance

The next cluster maintenance is scheduled for the week of March 24. The maintenance activities planned for this time include:

  - Slurm update, plus queue and configuration changes
  - Operating system image updates
  - Nvidia GPU, InfiniBand, and storage driver updates
  - BIOS and firmware updates as needed
  - Miscellaneous hardware maintenance as needed

The entire cluster will be offline starting Monday, March 24 at 12pm, and will remain offline until approximately Wednesday, March 26 at 4pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you want your Slurm job to start and finish before the offline period begins, you will need to adjust its time limit accordingly. For example, if there are fewer than 3 days until the maintenance begins, you can change the time limit of your pending job to 2 days with:

  scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option), and resubmit.
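
As a rough illustration of how to pick a time limit that fits before the offline period, the sketch below converts the seconds remaining into Slurm's D-HH:MM:SS format. The function names and the one-hour safety margin are assumptions made for this example and are not part of Slurm or of the cluster configuration. A shorter time limit only makes the job eligible to run before the window; whether it actually starts in time still depends on the scheduler and the queue.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not Slurm commands): pick a time limit that
# ends before the maintenance window opens.

# Format a number of seconds as a Slurm time limit, D-HH:MM:SS.
slurm_timelimit() {
    local secs=$1
    printf '%d-%02d:%02d:%02d\n' \
        $((secs / 86400)) \
        $((secs % 86400 / 3600)) \
        $((secs % 3600 / 60)) \
        $((secs % 60))
}

# Print the largest time limit that still ends before the window opens.
# Arguments: now (epoch seconds), window start (epoch seconds),
#            safety margin in seconds (assumed default: 3600).
limit_before_window() {
    local now=$1 start=$2 margin=${3:-3600}
    local left=$((start - now - margin))
    if ((left <= 0)); then
        echo "no time left before the maintenance window" >&2
        return 1
    fi
    slurm_timelimit "$left"
}
```

With GNU date (an assumption about the login environment), and taking the year from the date of this newsletter, the value can be computed and then passed to the scontrol command shown above:

  limit_before_window "$(date +%s)" "$(date -d '2025-03-24 12:00' +%s)"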

Job purge/Queue changes

Any pending jobs remaining during the maintenance period may be purged while new queueing changes are implemented.

If you have any questions or concerns, let me know. For up-to-date status on the cluster, check out the link below:

https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

Rob Yelle
HPC Manager