HPCC users,
Here is the latest news on the CoE HPC cluster:
1. Queuing system maintenance this week.
The partitions (queues) will be taken offline every morning this week between 8 and 9 am, starting tomorrow (Tuesday the 25th), so that troubleshooting and maintenance can be performed on the Slurm queuing environment. New or pending jobs will not start during that one-hour window. Jobs already running on the cluster during this time will not be affected. The submit nodes will still be accessible, and data will still be available. The queues will resume by 9 am. Please plan accordingly. I apologize for any inconvenience this may cause. If you have any questions or concerns about this, let me know.
2. New commands/scripts for Slurm.
At the request of our users, some additional, potentially useful commands have been added to the Slurm module, e.g.:
“nodestat {partition}” will show the status and resource usage of each node in that partition, e.g., how many CPUs and GPUs and how much RAM each node has, and how much of each is currently in use.
“showjob {jobid}” will provide information on a currently running or pending job. This can be used to obtain the estimated start time for a pending job, if available.
“tracejob -S YYYY-MM-DD” will provide information on jobs started since the date YYYY-MM-DD.
“sql” gives an alternative, longer job listing format than the default “squeue” command.
“squ” lists only jobs owned by the current user, using the default “squeue” format.
Try them out, and let me know how they might be improved or whether there are other details you would like to see. I will look into it.
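For comparison, stock Slurm commands can provide roughly similar information to some of the commands above. The following is only a sketch: the function names (my_jobs, job_start_estimate, jobs_since, partition_nodes) are made up for illustration, and the scripts in the Slurm module may be implemented differently and show different details.

```shell
# Sketch only: stock Slurm commands wrapped in hypothetical helper functions.
# These names are illustrative assumptions, not the scripts in the Slurm module.

# List only your own jobs in the default squeue format (similar in spirit to "squ").
my_jobs() {
    squeue -u "${USER}"
}

# Show the scheduler's estimated start time for a pending job, if one is available.
job_start_estimate() {
    squeue --start -j "$1"
}

# List job allocations recorded by Slurm accounting from the given date (YYYY-MM-DD) onward.
jobs_since() {
    sacct -X -S "$1"
}

# Node-by-node long listing for one partition.
partition_nodes() {
    sinfo -N -l -p "$1"
}
```

For example, "job_start_estimate 12345" asks the scheduler for the expected start time of job 12345, and "partition_nodes {partition}" prints one line per node in that partition.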
Cheers,
Rob Yelle
HPC Manager