CoE HPC News, May 6 2024: DGXH partition, submit-c and a host of issues

Cluster users,

See below the latest news on the CoE HPC cluster.

Update on DGXH partition

The dgxh partition has been in very high demand lately, running almost steadily at or near 100% for the last few weeks, and wait times for this resource have been as long as a week or more. I am continually working on improving this situation so that everyone gets a fair chance to use these resources, and in a way that meets their need. When your allocation is ready, make sure you use it effectively and don’t waste it. This is a limited resource that is primarily for production runs. We have other resources that can be used for testing and debugging. Sessions that are found running idle or unused for a period of time may be terminated, so use this resource wisely.

VSCode issues

A number of our users use VSCode to ssh to the cluster. This has an unfortunate side effect of leaving defunct processes running on the submit nodes which can eventually prevent the user from logging in again. This problem is occurring with increasing regularity, so to address this, VSCode processes will be automatically cleaned up on a periodic basis. We advise that you do not use “srun” through VSCode as it may be terminated with the VSCode process.

Submit-c is online

Submit-c has been upgraded to EL8 and is back online for both ssh and portal access. Also, upgraded versions of VSCode should work on this node. If anyone has any problems with submit-c, let me know.

HPC Portal issues

Occasionally when one logs in through the HPC portal, they are met with a “Bad Request” error. One way to work around this problem is to choose a different submit node, e.g. if you get a “Bad Request” on submit-a, then try submit-b or submit-c. Alternatively, try deleting your browser cache and cookies, then try again.

Some users have recently encountered a problem being locked out of their Desktop sessions on EL9 systems. This issue should now be resolved and users should no longer be locked out, but to prevent the constant lock out I recommend users disable or change their screensaver settings.

Apptainer/Singularity issues

It was recently reported that Apptainer is not working on the EL9 systems. This is still being worked on. In the meantime I recommend using Apptainer on the EL7 systems where possible.

Summer Maintenance

The next cluster maintenance period is scheduled for the week of June 17, after finals week. The cluster will be offline for part of that week, so please plan accordingly.

If you have any questions or concerns, let me know. For up-to-date status on the cluster, check out the link below:

https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

Rob Yelle
HPC Manager