CoE HPC News, February 14, 2025: Stuck jobs, queue changes, published features

Cluster users,

Happy Valentine's Day! See below for the latest news regarding the COE HPC cluster.

Jobs stuck in queue?

I have been getting a lot of complaints lately about jobs being stuck in the queue. The most likely reason is that the queue is busy (see “Job queuing changes” below), but sometimes a job may be stuck for some other reason, so I encourage users to continue to reach out to me if they feel their job is stuck. Just make sure you leave the job in the queue for me to troubleshoot; this lets me see what is holding the job up and possibly make adjustments to push it through. (A quick way to check the reason Slurm reports for a pending job is sketched at the end of the next section.)

Job queuing changes

Many of you have noticed that several queues have been very busy lately. In fact, the cluster has recently reached an all-time high in overall resource utilization, and the dgxh and dgx2 partitions in particular have been at or near 100% utilization for much of the last two weeks. It is fantastic that the COE cluster has been such a useful resource for so many. The downside is that wait times have been very long in many cases, and I am receiving complaints about them almost every day. In addition, cluster resources (CPUs and GPUs) are often not being used efficiently: they have been reserved but sit idle for whatever reason, and to date there is no reliable mechanism to address this.

To address the high demand and other queuing challenges, I am evaluating a number of major changes to certain partitions. In the past I normally just reduced the resource limits, and for some partitions like dgxh I kept the time limit at 2 days for faster job turnover, but those one-dimensional approaches are becoming too restrictive. Another approach I have started to implement is a relatively new Slurm feature that limits the total resource-time (for example, GPU-days) a user can have allocated to running jobs at any one time. For instance, the dgxh partition formerly had a limit of 2 GPUs with a separate, independent time limit of 2 days. Since dgxh is in such high demand, I have instead set a limit of 2 GPU-days (48 GPU-hours) per user on this partition: 1 GPU may be used for 2 days, or 2 GPUs for only 1 day. The flip side is that it is now possible to schedule more resources for a shorter period of time, e.g. 4 GPUs for 12 hours, or 8 GPUs for 6 hours. I have applied similar limits to the dgx2, ampere, gpu, and share partitions. The updated resource limits are posted under “Summary of partition limits” in the Slurm-howto link below:

https://it.engineering.oregonstate.edu/hpc/slurm-howto

The goal of these new limits is to reduce wait times while keeping resource limits flexible, and they will be adjusted to accommodate needs and demand. Note that the new limits may now cause some jobs to pend with “Max*RunMinsPerUser” messages.
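To see what reason (including one of these limits) is holding a particular job, a quick check along the following lines should work; the squeue format string here is just one reasonable choice, not an official template:

    # List your jobs with their state and the reason Slurm reports for pending jobs
    squeue -u $USER -o "%.10i %.12P %.8T %.25r"

    # Or inspect a single job in detail and look for the Reason= field
    scontrol show job <jobid>

Jobs held by the new limits should show a reason matching the “Max*RunMinsPerUser” pattern mentioned above.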
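For illustration only, here is a minimal sketch of a dgxh batch request that stays within the new 48-GPU-hour cap (2 GPUs for 1 day). The job name and workload are made up, and the GRES/time syntax shown is generic Slurm; see the Slurm-howto page above for the exact options expected on this cluster:

    #!/bin/bash
    #SBATCH --job-name=example-2gpu     # hypothetical job name
    #SBATCH --partition=dgxh            # partition discussed above
    #SBATCH --gres=gpu:2                # 2 GPUs...
    #SBATCH --time=1-00:00:00           # ...for 1 day = 48 GPU-hours, right at the new cap
    # Other combinations under the same cap: 1 GPU for 2 days (--gres=gpu:1 --time=2-00:00:00),
    # 4 GPUs for 12 hours (--gres=gpu:4 --time=12:00:00), or 8 GPUs for 6 hours (--time=06:00:00)

    srun python train.py                # placeholder workload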
If you already have jobs submitted under the older limits, you may leave them in the queue for now and I'll try to push through what I can over the next week. More changes may be implemented and announced in the near future. If you have any questions, feedback or suggestions regarding the queue changes, let me know.

Published list of features, “ib” is now obsolete

The “ib” feature had been used by those running MPI jobs. Now that the entire cluster has InfiniBand, the “ib” feature (the “constraint=ib” directive in Slurm) is no longer required and has been removed, so it can be deleted from existing job scripts. To see an updated list of features, including OS and GPU features, check out the “Summary of partitions” section in the link below:

https://it.engineering.oregonstate.edu/hpc/slurm-howto
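In practice this just means removing a line like the one below from any job script that still carries it (your scripts will differ; this is only a sketch of the directive form):

    #SBATCH --constraint=ib    # no longer needed: every node now has InfiniBand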
Stay tuned for more news and updates. For up-to-date status on the cluster, check out the link below:

https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

Rob Yelle
HPC Manager