
HPCC users,

The cluster will be operating at reduced capacity while cooling problems are being addressed. The DGX nodes are currently running at full capacity and full utilization.

I have received a number of emails lately regarding the status of the DGX queues, puzzling pending messages, and scheduling or related issues. Many of the pending messages on the DGX queues appear to be incorrect and are likely due to a bug in the scheduler. The main reason users' jobs in these queues remain pending is that there are currently no available resources.

To address this and other scheduling issues that have come up, I plan to update the scheduler tomorrow morning around 9am. Currently running jobs should not be impacted by the update.

Also, to reduce the load and give more users access, I plan to temporarily reduce the cumulative per-user limits on the dgx and dgx2 queues to 4 GPUs and 24 CPUs. If you have a single job that exceeds this limit, please contact me.

Rob Yelle
HPC Manager