
HPCC users,

The HPC cluster is up and operational, but with limited availability. Additional resources will become available over time as maintenance progresses. Check the HPC status page for more details and status updates: https://it.engineering.oregonstate.edu/hpc-cluster-status-and-news

If you encounter any problems with the cluster, let me know.

Rob

On Mar 16, 2021, at 3:11 PM, Yelle, Robert Brian <robert.yelle@oregonstate.edu> wrote:

HPCC users,

KEC experienced an unscheduled power outage, which resulted in shutting down all DGX2 nodes and killing all jobs running on these systems. Dgx2-1, 2, and 3 are back online now, but I am still working on bringing dgx2-4, 5, and 6 back online.

Also, the offline period of the cluster maintenance has been pushed back: the cluster will now go offline Thursday, March 25 at 7am and will remain offline until Friday, March 26 around 3pm. Another announcement will be sent to confirm that the cluster is back online.

For more news and information on the cluster status, including more details of the maintenance period, visit the link below:
https://it.engineering.oregonstate.edu/hpc-cluster-status-and-news

Rob

On Mar 15, 2021, at 12:08 PM, Yelle, Robert Brian <robert.yelle@oregonstate.edu> wrote:

HPCC users,

Here is the latest news on the CoE HPC cluster:

Spring Break maintenance: The HPC cluster will undergo its regularly scheduled quarterly maintenance after Finals Week, during Spring Break (March 22-26). The following maintenance activities will be performed:

- OS updates
- BIOS and firmware updates
- InfiniBand updates
- Slurm update and configuration changes
- Power redistribution for various compute nodes
- Miscellaneous hardware maintenance as needed

Maintenance activities will begin the Monday after Finals Week. The entire cluster will be offline starting Tuesday the 23rd at 8am and will remain offline until Wednesday the 24th at 4pm. After that, maintenance will continue on parts of the cluster throughout the week.

Jobs found running on the Univa cluster by Tuesday morning will need to be terminated, and Slurm jobs scheduled to run into that offline period will remain pending with the message "ReqNodeNotAvail, Reserved for maintenance". If you wish for your Slurm job to start (and finish) before the maintenance period begins, you will need to adjust your time limit accordingly; e.g., to change it to 2 days, run:

scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option), and resubmit; see the example commands at the end of this message.

If you have any questions or concerns, let me know.

Rob Yelle
HPC Manager
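
Example commands: the following is a minimal sketch of the two options described above. The job ID 123456 and the script name my_job.sh are hypothetical placeholders; substitute your own job ID and submission script, and adjust the 2-day limit to fit your job.

    # Check why a pending job is waiting; the reason column will show
    # "ReqNodeNotAvail, Reserved for maintenance" for jobs blocked by the outage
    squeue -u $USER -o "%.10i %.9P %.20j %.8T %.10l %.20R"

    # Option 1: shorten the time limit of the pending job in place (2 days here)
    scontrol update job 123456 TimeLimit=2-00:00:00

    # Option 2: cancel the pending job and resubmit with a shorter walltime
    scancel 123456
    sbatch --time=2-00:00:00 my_job.sh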