HPCC users,

 

See below the latest news on the CoE HPC cluster.

 

Unplanned cluster outage

  

Yesterday afternoon I made a change to the queuing system that did not affect the queuing server itself but did impact all of the client nodes. I make changes to the queuing system frequently; however, this particular change unexpectedly caused the client daemons to crash, which in turn caused several jobs to terminate prematurely with the “NODE FAIL” message. If you are wondering why your job or session was terminated yesterday, this is likely the reason, and I apologize for the inconvenience this outage has caused.

 

Spring Break maintenance / OS upgrades

  

The next cluster maintenance is scheduled for Spring Break, March 25-29. During this week I plan to complete the migration of the remaining cluster nodes to the EL8- and EL9-based operating systems. In addition to the OS upgrades, regular maintenance activities will be performed during this time, including:

 

OnDemand HPC portal upgrade

Slurm update and configuration changes

Nvidia driver updates 

InfiniBand updates

BIOS and firmware updates as needed

Miscellaneous hardware maintenance as needed

 

The entire cluster will be offline after Finals week, starting Monday, March 25 at 12pm, and will remain offline until approximately Wednesday, March 27 at 5pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you would like your Slurm job to start and finish before the offline period begins, you will need to adjust its time limit accordingly; for example, to change the limit to 2 days, run:

 

scontrol update job {jobid} TimeLimit=2-00:00:00
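
After updating, you can verify that the new limit took effect, for example:

scontrol show job {jobid} | grep TimeLimit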

 

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
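
For example (with a hypothetical batch script name my_job.sh as a placeholder):

scancel {jobid}
sbatch --time=2-00:00:00 my_job.sh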

 

Tentatively planned cooling outage

  

A cooling outage impacting the HPC cluster has tentatively been scheduled for Friday, April 5 (time window TBD). The cluster will run at reduced capacity that day until the planned AC maintenance has been completed.

 

Dealing with broken software on EL8/EL9  

  

As the upgrades have progressed, a number of users have found that some software that worked on EL7 no longer works on EL9. In some cases this has been addressed by installing new or compatible packages on the EL9 hosts. It is also currently possible to avoid an EL8/EL9 host by requesting the “el7” feature, either through the HPC portal or by adding the “--constraint=el7” option to your srun or sbatch command (see the examples below). Keep in mind, however, that this is only a short-term workaround, since the entire cluster will eventually be running EL8/EL9; the longer-term solution in these cases is to rebuild the software on an EL9 (or EL8) host. I appreciate everyone’s feedback and patience regarding the upgrade, and I encourage you to continue reporting any issues you encounter.
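
For example, to request an EL7 node for an interactive session or for a batch job (the script name below is just a placeholder):

srun --constraint=el7 --pty bash
sbatch --constraint=el7 my_job.sh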

 

Status of Nvidia DGX H100 systems

  

The dgxh partition will remain in testing phase until the week of April 15.

 

If you have any questions or concerns, let me know. For up-to-date status on the cluster, check out the link below:

 

https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

 

Have a nice weekend,

 

Rob Yelle

HPC Manager