HPCC users,
Here is the latest news regarding the COE HPC cluster:
Datacenter cooling issues
Over the weekend, parts of the cluster had to be shut down because of cooling issues caused by the recent heat wave. The cluster is now operating at full capacity. Be advised that during future heat advisories the cluster will operate at reduced capacity on weekends, and on weekdays when necessary.
DGX partition limits
DGX limits have been lowered back to 4 GPUs due to high demand. The solution I had put in place to provide greater flexibility did not work out as planned, so I am working on a different one, which may involve significant changes to the DGX queues and policies. If you have feedback or suggestions on what you would like to see, feel free to email me.
Share partition limits
The share partition is currently capped at 32 CPUs per node.
New GPU partition “ampere”
A new GPU queue called “ampere”, consisting of A40 GPUs, has been created and is currently open to all COE users. This queue is in a trial phase for the summer. At present you can reserve up to 2 GPUs and 16 CPUs in total; these limits are subject to change based on availability and demand. Give it a shot, and if you have any feedback, let me know.
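As a rough illustration of staying within the current “ampere” limits, here is a minimal Python sketch that checks a request and builds a submission command. It assumes the cluster's scheduler is Slurm and that jobs are submitted with `sbatch`; the helper name `ampere_sbatch_args` and the script name `job.sh` are made up for this example, and the limits are simply the ones quoted above, which may change.

```python
# Limits quoted in this announcement; they are subject to change.
AMPERE_MAX_GPUS = 2
AMPERE_MAX_CPUS = 16


def ampere_sbatch_args(gpus, cpus, script="job.sh"):
    """Build an sbatch argument list for the "ampere" partition.

    Assumes a Slurm scheduler and a single-task job, so that
    --cpus-per-task equals the total CPU count of the request.
    "job.sh" is a placeholder for your own batch script.
    """
    if not 1 <= gpus <= AMPERE_MAX_GPUS:
        raise ValueError(f"gpus must be between 1 and {AMPERE_MAX_GPUS}")
    if not 1 <= cpus <= AMPERE_MAX_CPUS:
        raise ValueError(f"cpus must be between 1 and {AMPERE_MAX_CPUS}")
    return [
        "sbatch",
        "--partition=ampere",
        f"--gres=gpu:{gpus}",
        f"--cpus-per-task={cpus}",
        script,
    ]


# Print the command for the largest request currently allowed.
print(" ".join(ampere_sbatch_args(2, 16)))
```

Requests above either limit raise a `ValueError` before anything is submitted, which is usually friendlier than having the job sit in the queue or be rejected.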
More changes…
More changes affecting the cluster are on the way this summer, so stay tuned!
For up-to-date status on the cluster, check out the link below:
Rob Yelle
HPC Manager