Cluster users,

 

This is a reminder that summer maintenance is scheduled to start next week (after finals week) and will continue until completed. During this time I plan to finish migrating the remaining cluster nodes to EL8- and EL9-based operating systems. In addition to the OS upgrades, regular maintenance activities will be performed during this time, including:

 

OnDemand HPC portal upgrade

Slurm upgrade and configuration changes

NVIDIA driver updates

BIOS and firmware updates as needed

Miscellaneous hardware maintenance as needed

 

The entire cluster will be offline starting Monday, June 17 at 1pm, and will remain offline until approximately Wednesday, June 19 at 4pm. Jobs whose time limit would run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you want your Slurm job to start and finish before the offline period begins, you will need to reduce its time limit accordingly. For example, to change the time limit to 2 days, run:

 

scontrol update job {jobid} TimeLimit=2-00:00:00

 

Alternatively, you can cancel your pending job, adjust its walltime (using the --time option), and resubmit it.
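
Either way, the time limit is given in Slurm's days-hours:minutes:seconds format. As a rough sketch, the small shell helper below converts a number of seconds into that format, which may be handy when working out how much time is left before the offline period. The name to_slurm_time is made up for this illustration and is not part of Slurm.

```shell
# Convert a number of seconds into Slurm's D-HH:MM:SS time-limit format.
# to_slurm_time is a hypothetical helper name, not a Slurm command.
to_slurm_time() {
    secs=$1
    printf '%d-%02d:%02d:%02d\n' \
        $((secs / 86400)) \
        $((secs % 86400 / 3600)) \
        $((secs % 3600 / 60)) \
        $((secs % 60))
}

# Example: 1 day and 20 hours, expressed in seconds.
to_slurm_time $((1 * 86400 + 20 * 3600))   # prints 1-20:00:00
```

The resulting string can be used as the TimeLimit value in the scontrol command above or passed to --time when resubmitting. Pick a value somewhat shorter than the time actually remaining, since a job also needs to start before its limit begins counting.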

  

If you have any questions or concerns, let me know. For up-to-date status on the cluster, check out the link below:

 

https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

 

Rob Yelle

HPC Manager