Engineering Mailing Lists

cluster-users

cluster-users@lists.engr.oregonstate.edu

June 2025

  • 1 participant
  • 1 discussion
CoE HPC News, June 3, 2025: Changes to dgxh partition, summer maintenance reminder
by Yelle, Robert Brian 03 Jun '25
Cluster users,

Please see the news below regarding the COE cluster.

Changes to MIG on dgxh partition

Earlier this Spring quarter, MIG was enabled on all GPUs of the dgxh partition to increase availability of the H100 GPUs and to allow requesting certain VRAM configurations. It has come to my attention that enabling the MIG feature disables other important features of the H100. To accommodate those who require these features, the number of H100 GPUs with MIG enabled will be reduced, and the “7g.80gb” and “7g.140gb” MIG devices will be phased out. In the meantime, you may use the Slurm options below to request the following VRAM configurations on the dgxh partition:

- 80 GB or more VRAM: use “--constraint=vram80g”
- 140 GB VRAM: use “--constraint=vram140g”

If you are using the HPC portal, put vram80g or vram140g, respectively, into the features field. More changes regarding MIG are planned but will not be implemented until the maintenance period (see below).

Summer maintenance

This is a reminder that the next cluster-wide maintenance is scheduled for the week of June 16, after Finals week. The maintenance activities planned for this time include:

- DDN (hpc-share) storage system updates
- Slurm update and configuration changes
- Operating system image updates
- NVIDIA GPU, InfiniBand, and storage driver updates
- BIOS and firmware updates as needed
- Miscellaneous hardware maintenance as needed

The entire cluster will be offline starting Monday, June 16 at 1pm, and will remain offline until approximately Wednesday the 18th at 4pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you want your Slurm job to start and finish before the offline period begins, adjust your time limit accordingly; e.g., if there are fewer than 3 days until the maintenance begins, change the time limit of your pending job to 2 days with:

scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option), and resubmit.

HPC scratch share

This is a reminder that the HPC scratch storage (hpc-share) should not be considered permanent storage. This filesystem is not backed up, and it does not currently support snapshots. If files are accidentally deleted or lost, there are no tools available for recovery. Please make sure to save your important data to a permanent storage solution. For more information on other storage options within the College of Engineering, see:

https://it.engineering.oregonstate.edu/data-storage-options

For up-to-date status on the cluster, see:

https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news

Rob Yelle
HPC Manager
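
As a rough sketch of the --constraint options described above, the batch script below requests a GPU with at least 80 GB of VRAM on the dgxh partition. The job name, walltime, the --gres=gpu:1 form of the GPU request, and the final command are placeholders and may need adjusting for your own workflow:

#!/bin/bash
# Minimal example: request a GPU with 80 GB or more of VRAM on dgxh.
# Job name, walltime, GPU request form, and the workload are placeholders.
#SBATCH --job-name=vram80g-example
#SBATCH --partition=dgxh
#SBATCH --constraint=vram80g
#SBATCH --gres=gpu:1
#SBATCH --time=0-04:00:00

# Replace with your actual workload
srun python my_training_script.py

For 140 GB of VRAM, swap the constraint to vram140g as noted above.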
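
As a concrete illustration of the two options for jobs that would otherwise run into the maintenance window, assuming a pending job with ID 12345 and a submission script named myjob.sh (both placeholders):

# Option 1: shorten the pending job in place so it can finish
# before the cluster goes offline on June 16 at 1pm
scontrol update job 12345 TimeLimit=2-00:00:00

# Option 2: cancel the pending job and resubmit with a shorter walltime
scancel 12345
sbatch --time=2-00:00:00 myjob.sh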
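
And a minimal sketch of copying important results off the scratch filesystem to permanent storage; both paths are placeholders and should be replaced with your actual scratch directory and a backed-up destination (see the data storage options link above):

# Copy results from scratch (not backed up) to a permanent, backed-up location.
# Both paths below are placeholders.
rsync -av /path/to/hpc-share/$USER/results/ /path/to/permanent/storage/results/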
