CoE HPC News, January 21, 2025: Upcoming Slurm outage, noVNC errors, new DGX

Cluster users,

Happy New Year! See below the latest news regarding the COE HPC cluster.

Brief Slurm outage

The new head node will be offline for brief maintenance this Thursday morning, January 23, from 8 to 10am. The Slurm scheduler will be down, so no new jobs can be queued during this time, but currently running or pending jobs will not be affected.

NoVNC zlib error

Since the OnDemand HPC Portal upgrade in December, a number of users have received the following error on the portal:

noVNC encountered an error: Uncaught Error: Incomplete zlib block

The error message appears in a large red box, and usually occurs when trying to select text. Unfortunately the box cannot be dismissed, so you will need to re-launch your session to work around it. To avoid this error, make sure you enable compression before launching your interactive app. At present there is no way for me to set a minimum compression level, so this must be done by the user; just don't set it to 0.

New Nvidia DGX H200 server

The College of Engineering recently purchased a new DGX H200 server, which has now been added to the dgxh partition. To request an H200 GPU on the dgxh partition, use the "--constraint=h200" Slurm option, or add "h200" as a feature on the HPC portal.

Coming soon

Some of you may have noticed that the Multi-Instance GPU (MIG) feature of the H100 GPUs has been enabled on some of the dgxh nodes. MIG will allow more efficient use of the H100 GPUs and will shorten waiting times. This feature is currently in a testing phase, and a couple of issues have come up during testing, but it is expected to be rolled out to all dgxh nodes once those issues are resolved.

The next cluster-wide maintenance is scheduled for Spring Break, March 24-28.

Rob Yelle
HPC Manager
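P.S. As a concrete illustration of the "--constraint=h200" option mentioned above, here is a minimal batch-script sketch requesting one H200 GPU on the dgxh partition. The job name, time limit, and GPU count are example placeholders, not cluster requirements; adjust them for your workload.

```shell
#!/bin/bash
#SBATCH --job-name=h200-example   # hypothetical job name
#SBATCH --partition=dgxh          # partition named in the announcement
#SBATCH --constraint=h200         # request a node with the h200 feature
#SBATCH --gres=gpu:1              # one GPU; count is just an example
#SBATCH --time=01:00:00           # example time limit

# Print the name of the GPU the job was allocated
nvidia-smi --query-gpu=name --format=csv,noheader
```

Submit with `sbatch <scriptname>`; for an interactive session, the equivalent would be something like `srun --partition=dgxh --constraint=h200 --gres=gpu:1 --pty bash`.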