Re: [cluster-users] CoE HPC News, August 30 2021 Edition: Fall maintenance, OS upgrade

HPCC users,

The HPC cluster is back online with limited availability. Parts of the cluster are still offline for maintenance, and additional nodes will be brought back online over the course of the week as their maintenance completes. Stay tuned for more announcements in the upcoming weeks.

Rob Yelle
HPC Manager

From: Yelle, Robert Brian <robert.yelle@oregonstate.edu>
Date: Monday, August 30, 2021 at 3:56 PM
To: cluster-users@engr.orst.edu
Subject: CoE HPC News, August 30 2021 Edition: Fall maintenance, OS upgrade

HPCC users,

Here is the latest news on the CoE HPC cluster:

Fall maintenance: The HPC cluster will undergo its regularly scheduled quarterly maintenance next week (September 7-10). The following maintenance activities will be performed:

- OS updates
- BIOS and firmware updates as needed
- Slurm update and configuration changes
- Power redistribution for various compute nodes
- Miscellaneous hardware maintenance as needed

The entire cluster will be offline starting Tuesday the 7th at 8am, and will remain offline until approximately Wednesday the 8th at 1pm. Jobs scheduled to run into that offline period will remain pending with the reason "ReqNodeNotAvail, Reserved for maintenance". If you wish for your Slurm job to start and finish before the maintenance period begins, adjust its time limit accordingly; for example, to change it to 2 days:

    scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust its walltime (using the --time option), and resubmit. I will send out another announcement when the cluster is back online. After the offline period, maintenance will continue on various parts of the cluster throughout the week.

Operating System upgrade: NVIDIA says it will probably not support CUDA 11.2 or higher on DGX nodes running the RHEL 7 or CentOS 7 operating systems. To ensure that our users can use CUDA 11.2 or higher on these systems, we plan to upgrade all the DGX nodes to CentOS 8 or another RHEL 8-based system. This is a major change; it will take some time to test and deploy the new image(s) and to test applications on them. We will probably start with the DGXS nodes, then roll the new OS image out to the DGX-2 nodes, and eventually to the entire cluster as things stabilize on the new OS. Stay tuned for more details as this develops.

If you have any questions or concerns, let me know.

Rob Yelle
HPC Manager
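For reference, here is a minimal shell sketch of the time-limit adjustment described above. The job ID 123456 and the script name myjob.sh are hypothetical placeholders; squeue, scontrol, scancel, and sbatch are standard Slurm commands, but check your site's documentation for local specifics.

    # Show your pending jobs with their current time limits and pending reasons
    squeue -u $USER -t PENDING -o "%.12i %.12l %.30r"

    # Shorten a pending job's time limit to 2 days so it can finish
    # before the maintenance window opens (job ID is a placeholder)
    scontrol update job 123456 TimeLimit=2-00:00:00

    # Or cancel it and resubmit with a shorter walltime
    scancel 123456
    sbatch --time=2-00:00:00 myjob.sh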