CoE HPC News, December 1 2023: Winter maintenance, EL8/EL9 upgrade, DGX H100 status

HPCC users,

Winter maintenance/OS upgrades

The cluster will operate at reduced capacity from December through mid-January while cluster nodes are upgraded to either EL8 or EL9 Linux. I will post a tentative upgrade schedule sometime next week and will coordinate with partition group owners to help minimize the impact of these upgrades.

In addition to the OS upgrades, regular maintenance activities will be performed during this time, e.g.:

- OnDemand HPC portal updates
- Slurm update and configuration changes
- Nvidia driver updates
- InfiniBand updates
- BIOS and firmware updates as needed
- Miscellaneous hardware maintenance as needed

The entire cluster will be offline after Finals week, starting Monday, December 18th at 1pm, and will remain offline until approximately Wednesday the 20th at 3pm. Jobs scheduled to run into that offline period will remain pending with the message "ReqNodeNotAvail, Reserved for maintenance". If you want your Slurm job to start and finish before the offline period begins, you will need to adjust your time limit accordingly; for example, to change it to 2 days:

scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option), and resubmit; see the example at the end of this message.

Status of Nvidia DGX H100 systems

I want to thank everyone who has given me feedback regarding the new DGX systems. A GPU issue came up that required opening a ticket with Nvidia to troubleshoot. The nodes were taken offline, then reinstalled and reconfigured. The GPU issue has now been addressed, so PyTorch and other CUDA applications should work. The automount problem also appears to be resolved, so users can access directories in /nfs/hpc/*. I anticipate that the HPC portal/vncserver will be available for these hosts sometime next week, so stay tuned. However, ssh access is still not available. The dgxh partition is back online and the testing period has resumed while other issues continue to be worked out. Feel free to give them another shot, and please continue to report any further issues to me.

If you have any questions or concerns, let me know. For up-to-date status on the cluster, check out the link below:

https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

Have a nice weekend,

Rob Yelle
HPC Manager
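
P.S. For anyone less familiar with Slurm, here is a quick sketch of the two options described above. The job ID 123456 and the script name my_job.sh are made-up placeholders; substitute your own values.

# Option 1: shorten the time limit of an already-pending job to 2 days
scontrol update job 123456 TimeLimit=2-00:00:00

# Option 2: cancel the pending job, then resubmit with a shorter walltime
scancel 123456
sbatch --time=2-00:00:00 my_job.sh

You can check why a job is still pending (e.g. "ReqNodeNotAvail, Reserved for maintenance") with squeue -j 123456.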