Cluster users,
Please see important message regarding the dgxh partition below.
MIG enabled on dgxh partition
The dgxh partition is usually the busiest partition with the longest wait times. I have been monitoring GPU usage on the dgxh partition over the past couple of months. My findings indicate that over half of the jobs using the H100 GPUs use less than half of the 80 GB of VRAM available, and over a third use less than 20 GB. Therefore, to allow more efficient use of the H100 GPUs, the Multi-Instance GPU (MIG) feature has been enabled on some of the dgxh nodes. This feature allows an H100 to be subdivided into separate instances with varying compute and memory configurations, which makes more GPU instances available to users and should help reduce wait times. At present, the following configurations are supported:
2g.20gb: 20 GB VRAM
3g.40gb: 40 GB VRAM
7g.80gb: 80 GB VRAM
7g.140gb: 140 GB VRAM
By default, when you request a GPU on the dgxh partition, you will be given an instance with the lowest amount of VRAM available. You can request a specific GPU configuration using the option below:
--partition=dgxh --gres=gpu:{configuration}:1
For instance, if you know that your job will require more than 40 GB of VRAM but less than 80 GB, you can request a GPU instance with that much VRAM as follows:
--partition=dgxh --gres=gpu:7g.80gb:1
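For reference, here is a minimal batch script sketch showing where these options go; the job name, time limit, and script body are placeholders, not cluster requirements:

#!/bin/bash
#SBATCH --job-name=mig-example          # placeholder job name
#SBATCH --partition=dgxh                # dgxh partition
#SBATCH --gres=gpu:7g.80gb:1            # one full 80 GB H100 instance
#SBATCH --time=01:00:00                 # placeholder time limit

# your GPU workload goes here, for example:
nvidia-smi -L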
You can also make this request on the HPC portal using the interactive Advanced COEHPC Desktop or Jupyter Server app by entering “7g.80gb:1” in the GPUs field.
The configurations and number of instances will be adjusted according to demand.
Be advised that the output of “nvidia-smi” is different with MIG enabled. If your jobs stop working because MIG is enabled, you might try requesting an instance with at least 80 GB of VRAM. If they still fail with plenty of VRAM, please let me know.
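If you want to confirm which MIG instance your job received, running “nvidia-smi -L” inside the job lists the devices visible to it; with MIG enabled you should see a MIG device line under the GPU. The output below is only illustrative (the model string and UUIDs will differ):

$ nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
  MIG 3g.40gb     Device  0: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)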
Summer maintenance
The next cluster-wide maintenance is scheduled for June 16-20, after Finals week. For up-to-date status on the cluster, check out the link below:
https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news
Rob Yelle
HPC Manager