Cluster users,

 

The cluster is back online at near-full capacity. Here is more news regarding the COE HPC cluster:

 

More AC maintenance

 

Another round of AC maintenance is scheduled for next Thursday, July 6, and will affect the KEC server rooms. To mitigate the heat load, the cluster will again operate at reduced capacity starting next Wednesday evening, July 5, until the server room temperature returns to a safe operating level. Based on the last round of AC maintenance, that should be the night of July 6.

 

Early Thursday morning, during the AC maintenance, I plan to relocate two submit servers, submit-b and submit-c. This requires powering them off, which will terminate any srun sessions running on them. Jobs submitted through sbatch on these nodes or through the HPC portal will be unaffected. In addition, the Slurm scheduler will be temporarily offline early Thursday morning. This will prevent new submissions (which will be limited anyway) and queries on job and node status, but running jobs will not be affected. I anticipate that Slurm will be back online by 10am and the two submit nodes by 11am.

 

For up-to-date status on the cluster, check out the link below:

 

https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

 

If you have any questions or concerns, let me know.

 

Have a great weekend!

 

Rob Yelle

HPC Manager