HPCC users,

 

Here is the latest news on the CoE HPC cluster:

 

Summer break maintenance week:

 

The HPC cluster will undergo its regularly scheduled quarterly maintenance next week (June 13-17). The following maintenance activities will be performed:

 

Operating system updates

BIOS and firmware updates as needed

NVIDIA driver and CUDA updates

Slurm configuration changes

HPC Portal updates and configuration changes

Miscellaneous hardware maintenance as needed

 

The entire cluster will be offline starting Tuesday, June 14 at 8am, and will remain offline until approximately Wednesday, June 15 at 3pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you want your Slurm job to start and finish before the maintenance period begins, you will need to adjust its time limit accordingly. For example, to change the time limit to 2 days, run:

 

scontrol update job {jobid} TimeLimit=2-00:00:00

 

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
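If you would rather compute the time limit than work it out by hand, the following is a minimal sketch of a shell helper that converts a number of seconds into the D-HH:MM:SS format used in the scontrol example above. The function name slurm_timelimit is hypothetical and not part of Slurm; you supply the number of seconds yourself, e.g. the time remaining until the maintenance window begins, less a safety margin.

```shell
# Hypothetical helper (not part of Slurm): convert a whole number of
# seconds into the D-HH:MM:SS time format accepted by TimeLimit and --time.
slurm_timelimit() (
    total=$1

    # Reject empty or non-numeric input.
    case $total in
        ''|*[!0-9]*)
            echo "usage: slurm_timelimit SECONDS" >&2
            exit 1
            ;;
    esac

    days=$(( total / 86400 ))
    hours=$(( (total % 86400) / 3600 ))
    mins=$(( (total % 3600) / 60 ))
    secs=$(( total % 60 ))

    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$mins" "$secs"
)

# Example: 2 days expressed in seconds.
slurm_timelimit $(( 2 * 86400 ))   # prints 2-00:00:00
```

The printed value can be used as the TimeLimit value in the scontrol command above, or passed to the --time option when resubmitting. Keep in mind that a pending job will only start if its full time limit fits before the maintenance reservation begins.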

 

I will send out another announcement when the cluster is back online.  After the offline period, maintenance will continue on various parts of the cluster throughout the week.  

 

For cluster news and status updates, check out the link below:

 

https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

 

Cheers,

 

Rob Yelle

HPC Manager