Cluster users,

 

Please see below for the latest news regarding the COE HPC cluster.

 

Email notification

 

Email notification for Slurm jobs was not functional for a while after the head node migration in December, but it is working now.
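
If your jobs did not send the notifications you expected during that period, the standard Slurm mail options in your batch script will work again, e.g. (substitute your own address; the one below is a placeholder):

#SBATCH --mail-user=your_onid@oregonstate.edu
#SBATCH --mail-type=BEGIN,END,FAIL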

 

 

New partition limits

 

This is a reminder that the new partition (queue) limits are posted and kept up to date under “Summary of partition limits” in the Slurm Howto linked below:

 

https://it.engineering.oregonstate.edu/hpc/slurm-howto
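
You can also query the current limits directly on the cluster, e.g. to list each partition and its maximum walltime:

sinfo -o "%P %l"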

 

 

Spring break maintenance

 

The next cluster maintenance is scheduled for the week of March 24. The maintenance activities planned for this time include:

 

Slurm update and queue/configuration changes

Operating system image updates

NVIDIA GPU, InfiniBand, and storage driver updates

BIOS and firmware updates as needed

Miscellaneous hardware maintenance as needed

 

The entire cluster will be offline starting Monday, March 24 at 12pm, and will remain offline until approximately Wednesday the 26th at 4pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you want your Slurm job to start and finish before the offline period begins, you will need to shorten its time limit accordingly. For example, if there are fewer than 3 days left before the maintenance begins, you can change the time limit of your pending job to 2 days with:

 

scontrol update JobId={jobid} TimeLimit=2-00:00:00
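
To check which of your jobs are pending and what their current time limits are, you can use squeue, e.g. (%l is the time limit, %R the reason the job is pending):

squeue -u $USER -t PENDING -o "%i %j %l %R"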

 

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option), and resubmit.
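
For example, assuming your job ID is {jobid} and your batch script is named myjob.sh (substitute your own script name):

scancel {jobid}
sbatch --time=2-00:00:00 myjob.sh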

 

 

Job purge and queue changes

 

Any jobs still pending when the maintenance period begins may be purged while the new queueing changes are implemented.
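
To list any pending jobs you have that could be affected:

squeue -u $USER -t PENDING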

 

  

If you have any questions or concerns, let me know. For up-to-date status on the cluster, check out the link below:

 

https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

 

Rob Yelle

HPC Manager