
HPCC users,

Here is the latest news on the CoE HPC cluster:

Winter maintenance:

The HPC cluster will undergo its regularly scheduled quarterly maintenance during the week after Finals (December 13-17). The following maintenance activities will be performed:

  - Operating system updates
  - Slurm update and configuration changes
  - NVIDIA driver updates (including DGX systems!)
  - InfiniBand driver updates
  - BIOS and firmware updates as needed
  - Miscellaneous hardware maintenance as needed

The entire cluster will be offline starting Tuesday the 14th at 8am, and will remain offline until approximately Wednesday the 15th at 1pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you want your Slurm job to start and finish before the maintenance period begins, you will need to adjust its time limit accordingly. For example, to change the time limit to 2 days:

    scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust its walltime (using the --time option), and resubmit.

I will send out a reminder of the upcoming maintenance during Finals week.

Operating system upgrade on DGX servers delayed:

NVIDIA has recently released drivers for CUDA 11.4 on the DGX systems running CentOS 7. These drivers will be installed during the maintenance scheduled for after Finals. Due to NVIDIA’s continued support for EL7, the previously planned upgrade to an EL8-compatible operating system on the DGX systems has been postponed until further notice.

If you have any questions or concerns, let me know.

Rob Yelle
HPC Manager
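
P.S. For anyone who would rather compute a time limit than estimate one, here is a minimal sketch of how the largest limit that still ends before the maintenance window could be worked out. The function name slurm_time_until, the MARGIN_SECONDS variable, and the 10-minute safety margin are illustrative assumptions, not tools or settings provided on the cluster.

```shell
# Sketch only: print the longest Slurm time limit (D-HH:MM:SS) that still
# ends before a maintenance window opens. Names are hypothetical.
#
# Usage: slurm_time_until START_EPOCH [NOW_EPOCH]
#   START_EPOCH  maintenance start, in seconds since the Unix epoch
#   NOW_EPOCH    optional; defaults to the current time
slurm_time_until() (
    start=$1
    now=${2:-$(date +%s)}
    margin=${MARGIN_SECONDS:-600}   # assumed 10-minute safety buffer

    remaining=$(( start - now - margin ))
    if [ "$remaining" -le 0 ]; then
        echo "no time left before maintenance" >&2
        exit 1   # body runs in a subshell, so this only sets the return status
    fi

    # Split the remaining seconds into days, hours, minutes, seconds.
    printf '%d-%02d:%02d:%02d\n' \
        $(( remaining / 86400 )) \
        $(( remaining % 86400 / 3600 )) \
        $(( remaining % 3600 / 60 )) \
        $(( remaining % 60 ))
)

# Example (GNU date assumed for -d; {jobid} is a placeholder):
#   start=$(date -d 'Dec 14 08:00' +%s)
#   limit=$(slurm_time_until "$start") && \
#       scontrol update job {jobid} TimeLimit="$limit"
```

The printed value can also be passed to the --time option when resubmitting a cancelled job.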