Cluster users,
Below is the latest news regarding the COE HPC cluster.
Email notification
Email notification for Slurm jobs was not functional for a while after the head node migration in December, but it is working now.
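Note that Slurm only sends mail when a job asks for it. A minimal sketch of the relevant directives in a job script (the address below is a placeholder; substitute your own):
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your_address@oregonstate.edu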
New partition limits
This is a reminder that the new partition (queue) limits are posted and kept up to date under “Summary of partition limits” in the Slurm Howto:
https://it.engineering.oregonstate.edu/hpc/slurm-howto
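You can also query the limits directly on the cluster, e.g. (using the same {placeholder} convention as the scontrol example further down):
sinfo -s
scontrol show partition {partition}
The first command summarizes the available partitions; the second prints a partition's configured limits, such as MaxTime and MaxNodes.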
Spring break maintenance
The next cluster maintenance is scheduled for the week of March 24. The maintenance activities planned for this time include:
Slurm update and queue/configuration changes
Operating system image updates
NVIDIA GPU, InfiniBand, and storage driver updates
BIOS and firmware updates as needed
Miscellaneous hardware maintenance as needed
The entire cluster will be offline starting Monday, March 24 at 12pm, and will remain offline until approximately Wednesday, March 26 at 4pm. Jobs scheduled to run into that offline period will remain pending with the reason “ReqNodeNotAvail, Reserved for maintenance”. If you want your Slurm job to start and finish before the offline period begins, adjust its time limit accordingly. For example, if there are fewer than 3 days until the maintenance begins, change the time limit of your pending job to 2 days with:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
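For example, assuming a job script named myjob.sh (a placeholder for your own script):
scancel {jobid}
sbatch --time=2-00:00:00 myjob.sh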
Job purge/queue changes
Any pending jobs remaining during the maintenance period may be purged while new queueing changes are implemented.
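To see which of your jobs are still pending ahead of the maintenance window, you can run:
squeue -u $USER -t PENDING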
If you have any questions or concerns, let me know. For up-to-date status on the cluster, check out the link below:
https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news
Rob Yelle
HPC Manager