Status

March 26, 2024:

The cluster is back online but at very limited capacity. Compute resources will be added as the upgrades progress.

The following nodes are offline until further notice:

  submit-c

 

 

 

News

 

March 14, 2024

 

Unplanned cluster outage

Yesterday afternoon I made a change to the queuing system that did not affect the queuing server itself but did impact all of the client nodes. I make changes to the queuing system frequently, but this particular change unexpectedly caused the client daemons to crash, which in turn caused several jobs to terminate prematurely with the “NODE_FAIL” message. If you are wondering why your job or session was terminated yesterday, this is likely the reason, and I apologize for the inconvenience this outage has caused.

 

Spring Break Maintenance/OS upgrades

The next cluster maintenance is scheduled for Spring Break, March 25-29. During this week I plan to complete the migration of the remaining cluster nodes to the EL8- and EL9-based operating systems. In addition to the OS upgrades, regular maintenance activities will be performed during this time, e.g.:

OnDemand HPC portal upgrade

Slurm update and configuration changes

Nvidia driver updates 

Infiniband updates

BIOS and firmware updates as needed

Miscellaneous hardware maintenance as needed

The entire cluster will be offline after Finals week, starting Monday, March 25 at 12pm, and will remain offline until approximately Wednesday the 27th at 5pm.  Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”.  If you wish for your Slurm job to start and finish before the offline period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do: 

 scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
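
For example (the job ID and script name below are just placeholders), the cancel-and-resubmit approach could look like this:

 scancel 123456                          # cancel the pending job
 sbatch --time=2-00:00:00 my_job.sh      # resubmit with a 2-day walltime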

 

Tentative planned cooling outage

A cooling outage impacting the HPC cluster has tentatively been scheduled for Friday, April 5 (time window TBD). The cluster will run at reduced capacity that day until the planned AC maintenance is complete.

 

Dealing with broken software on EL8/EL9  

As the upgrades have progressed, a number of users have found that some software which worked on EL7 no longer works on EL9. In some cases this has been addressed by installing new or compatible packages on the EL9 hosts. It is also currently possible to avoid an EL8/EL9 host by requesting the “el7” feature, either through the HPC portal or by adding the “--constraint=el7” option to your srun or sbatch options. Keep in mind, however, that this is only a short-term workaround: the entire cluster will eventually be running EL8/EL9, so the longer-term solution is to rebuild the software on an EL9 (or EL8) host. I appreciate everyone’s feedback and patience regarding the upgrade, and I encourage you to continue reporting any issues you encounter.
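
As an illustration, an interactive request pinned to an EL7 node might look like the following (add your usual partition and resource options; this is just a sketch):

 srun --constraint=el7 --pty bash        # request any node still running EL7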

 

Status of Nvidia DGX H100 systems

The dgxh partition will remain in testing phase until the week of April 15.

 

February 2, 2024

 

Busy queues

The demand for resources, especially GPUs, has been very high lately due to upcoming deadlines. This is resulting in much longer wait times than usual, so please be patient. If you have been waiting more than a day for your resources, leave your job in the queue and let me know; I may be able to work it in, or at least determine what is holding it up.

 

EL8/EL9 upgrades 

The cluster upgrade to EL8 and EL9 has been progressing slowly as various bugs are worked through. Progress is expected to pick up the week of February 12, as submit nodes and more compute nodes from the “share”, “dgx2” and “dgxs” partitions will be migrated. These partitions will not be migrated all at once; instead, one or more nodes from each partition will be upgraded each week until the migration is complete. At this point it is anticipated that the migration will be completed by the end of March. At present the DGX nodes “dgx2-4” and “dgxs-3” have been upgraded to EL9 with support for Cuda 12.2, and are now available through the dgx2 and dgxs partitions.

Note that if you do not wish to land on an EL8 or EL9 node at this time, you may request the “el7” feature, either through the HPC portal or by adding the “--constraint=el7” option to your srun or sbatch options. If you would prefer to land on an EL9 node, for instance because of updated dependencies and software support or the availability of more recent versions of Cuda, you may request the “el9” feature through the portal or use the “--constraint=el9” option in srun or sbatch.

Please continue to report any issues that you encounter with the cluster as the upgrade progresses.

 

Portal and ssh fixes on Nvidia DGX H100 and other EL9 systems

The HPC portal and ssh are now working properly on the dgxh and other EL9 systems. The dgxh partition is currently available through the Advanced Desktop (Xfce desktop only!) and Jupyter Server apps. Note that with the Advanced Desktop on EL9 systems, you should disable the screen lock, or you risk being locked out of your session and having to start over. To do this, click on “Applications” in the upper left-hand corner, then select “Settings”, then “Xfce screensaver”, then the “Lock Screen” tab, and turn off “Enable Lock Screen”.

 The dgxh partition will remain in testing phase until further notice while other issues are being addressed.

 

VSCode issues

Some users have recently reported a problem using VSCode on the HPC submit nodes. It appears that the latest version of VSCode no longer supports EL7, so you may not be able to run it on the submit nodes. The submit nodes will be upgraded to EL8 later this month through early March. Until then, my recommendation is to use an older version of VSCode if possible, or to use the Flip servers.

 

Spring Break Maintenance

The next cluster maintenance is scheduled for Spring Break, March 25-29.

 

December 1, 2023

Winter maintenance/OS upgrades

The cluster will operate at reduced capacity from December through mid-January while cluster nodes are upgraded to either EL8 or EL9 Linux. I will post a tentative upgrade schedule sometime next week, and will coordinate with partition group owners to help minimize the impact of these upgrades. In addition to the OS upgrades, regular maintenance activities will be performed during this time, e.g.:

 OnDemand HPC portal updates

Slurm update and configuration changes

Nvidia driver updates 

Infiniband updates

BIOS and firmware updates as needed

Miscellaneous hardware maintenance as needed

The entire cluster will be offline after Finals week, starting Monday, December 18th at 1pm, and will remain offline until approximately Wednesday the 20th at 3pm.  Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”.  If you wish for your Slurm job to start and finish before the offline period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do: 

scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.

 

Status of Nvidia DGX H100 systems

I want to thank everyone who has given me feedback regarding the new DGX systems. A GPU issue came up which required that I open a ticket with Nvidia to troubleshoot. The nodes were taken offline, then reinstalled and reconfigured. The GPU issue has now been addressed, so pytorch and other CUDA applications should work. Also, the automount problem appears to be resolved, so users can access directories in /nfs/hpc/*. I anticipate that the HPC portal/vncserver will be available for these hosts sometime next week, so stay tuned. However, ssh access is still not available. The dgxh partition is back online and the testing period has resumed while other issues continue to be worked out. Feel free to give them another shot, and please continue to report any further issues to me.

 

October 24, 2023

New Nvidia DGX H100 systems are online!

Many of you may have received the announcement last week that three new Nvidia DGX H100 systems have been added to the COE HPC cluster. Each DGX H100 system has 112 CPUs, 8 Nvidia H100 GPUs, and 2TB RAM! These systems are now available in a new partition called “dgxh”. This partition is in a testing period for at least this week to resolve existing issues and smooth out any other kinks that crop up. During this testing period, the current resource limits are 2 GPUs and 32 CPUs per user, and the time limit is 24 hours. Be advised that the DGX H100 systems are running RHEL9-based Linux, which is different from the RHEL7-based systems currently used by the rest of the cluster. Also, these systems are not yet available through the HPC portal or through ssh. Give them a try and let me know of any issues you encounter.
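
As an example, an interactive session within the current testing limits might be requested as follows (the GPU, CPU, memory and time values here are only an illustration, not a recommendation):

 srun -p dgxh --gres=gpu:1 -c 16 --mem=64G --time=12:00:00 --pty bash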

 

DGX partition change

We currently have separate “dgx” and “dgx2” partitions for our DGX-2 systems, depending on how many GPUs are needed. For various reasons that have come up over time, it is no longer advantageous to have separate partitions for these systems, so they will again be merged into a single “dgx2” partition. The “dgx” partition will be phased out or possibly re-purposed, so please use “dgx2” instead. The resource limits will remain at 4 GPUs and 32 CPUs on the dgx2 partition for now, but may increase again in the future.

Submit-a offline this week

Just a reminder that Submit-a will be offline until next Monday the 30th for maintenance. Until then, please use submit-b or submit-c. 

 

 

October 9, 2023

HPC training reminders   

Attention new cluster users!

I am offering my “Intro to HPC” workshops every Wednesday at 3pm and Thursday at 4pm from October 11  through November 9. This workshop is designed to help new users become acquainted with and start using the cluster. To book a workshop session, click on the “Intro to HPC training session” link on the site below:

 

https://it.engineering.oregonstate.edu/hpc

 

AI and ML series of trainings from Mark III starting next week!

For anyone who is interested, Mark III is offering a free five-part series of AI and ML trainings starting next Tuesday, October 17, and running through November 14:

Tuesday, October 17 at 11am: Intro to AI and Machine Learning: The basics, tutorial and lab

Tuesday, October 24 at 11am: Intro to Deep Learning: An introduction to Neural Networks

Tuesday, October 31 at 11am: Intro to Datasets

Tuesday, November 7 at 11am: Intro to Computer Vision and Image Analytics

Tuesday, November 14 at 11am: Getting Started with Containers and AI

If you are interested in more details or want to register for any of the Mark III trainings listed above, please check out the “OSU AI Series” on the right-hand side of the HPC web site above. Also note the HPC resources from Nvidia displayed there.

 

Updates on cluster updates

The updates of the cluster compute and GPU nodes should be completed by the end of the day Tuesday, October 10. The submit nodes will be taken offline and updated on the following dates:

Submit-c: Monday, October 16 at 9am

Submit-b: Monday, October 23 at 9am

Submit-a: Monday, October 30 at 9am

Please plan accordingly. Srun sessions still running on these hosts when their updates begin will be terminated. Jobs submitted via sbatch or the HPC portal will not be affected. If you have any questions or concerns, let me know.

  

October 4, 2023

OS and GPU driver updates

The cluster is currently running at reduced capacity while OS updates are being rolled out. In addition, some users have reported that they cannot run their GPU codes on some GPU nodes, which appears to be due to the older drivers on those nodes. New GPU drivers are available, so the GPU nodes are scheduled to be offline next week in a staggered fashion (some Monday the 9th and some Tuesday the 10th) so that the new drivers can be installed. If you need GPU resources, please schedule your jobs so that they can complete before these offline periods; otherwise they will remain pending with the message “Required Node not available” or “ReqNodeNotAvail”. If that happens, you may be able to get in by reducing your time requirement using the “--time” or “-t” option in srun or sbatch.

 

OS upgrade

Many of you are aware that COE Linux servers have been or are being upgraded to Enterprise Linux 9 (EL9). This will also happen on the HPC cluster over the course of this Fall quarter. More details to come.

  

Intro to HPC workshop and other HPC training resources  

I am offering my “Intro to HPC” workshops every Wednesday at 3pm and Thursday at 4pm from October 11  through November 9. This workshop is designed to help new users become acquainted with and start using the cluster. To book a workshop session, click on the “Intro to HPC training session” link on the site below:

https://it.engineering.oregonstate.edu/hpc

In addition to the workshop, other HPC resources and training offerings from Nvidia and Mark III are displayed on the right hand side of the HPC web site. Mark III is offering a series of AI and ML trainings every Tuesday starting October 17 through November 14. I encourage you to check them out and sign up for them if you are interested.

 

July 17, 2023

Due to cooling issues, the cluster will run at reduced capacity during the weekends whenever there is a heat advisory.

 

May 18, 2023

The cluster will undergo its regularly scheduled maintenance during the summer break starting June 19th. The following maintenance activities will be performed:

 

Operating system updates

Slurm upgrade and configuration changes

Nvidia driver updates 

Infiniband driver updates

BIOS and firmware updates as needed

Miscellaneous hardware maintenance as needed

 

The entire cluster will be offline starting Monday the 19th at 1pm, and will remain offline until approximately Wednesday the 21st at 3pm.  Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”.  If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do: 

 

scontrol update job {jobid} TimeLimit=2-00:00:00

 

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.

 

March 6, 2023

The cluster will undergo its regularly scheduled maintenance during Spring Break starting next Monday the 27th. The following maintenance activities will be performed:

 

Operating system updates

Slurm update and configuration changes

Nvidia driver updates 

Infiniband driver updates

BIOS and firmware updates as needed

Miscellaneous hardware maintenance as needed

 

The entire cluster will be offline starting Monday the 27th at 1pm, and will remain offline until approximately Thursday the 30th at 3pm.  Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”.  If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do: 

 

scontrol update job {jobid} TimeLimit=2-00:00:00

 

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.

 

 

November 10, 2022

 

Cooling Failure

 

Early this morning there was a cooling failure in the KEC datacenter which allowed temperatures to climb to unsafe levels, resulting in the automatic shutdown of all DGX-2 nodes and thus the termination of all jobs running on these nodes. Cooling has been restored and temperatures have returned to safe levels, and the DGX-2 nodes are back online.

 

 

October 14, 2022

 

Intro to HPC Workshop

Next week I will be holding an HPC workshop over Zoom covering the basics of using the CoE HPC cluster, at the following dates and times:

  Wednesday, October 19 @3pm

  Thursday, October 20 @4pm

If you are interested in attending this workshop, please let me know which session works best for you.

 

New STAR-CCM+ app

STAR-CCM+ has been added to the list of interactive apps on the HPC portal. If anyone has any problems using it, let me know. You can check out the HPC portal here:
https://ondemand.hpc.engr.oregonstate.edu

 

Updated Jupyter Server app

Previously the Jupyter Server app allowed you to activate your own Python virtual environment or Anaconda installation, but it could not activate an environment created with conda. The Jupyter Server app on the HPC portal has now been improved to support your own conda environments as well. If anybody has any problems using this app, let me know.

New Software installs

 

Here is a list of recent software installs:

  Python 3.10

  GCC 12.2

  Mathematica 13.1

  Matlab 2022b with Parallel Server.  If anybody is interested in using Matlab Parallel Server, please contact me.

  

The software listed above can be accessed through the modules system; type “module avail” to see what is available.
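
For example (the module name below is illustrative; run “module avail” to see the exact names and versions on the cluster):

 module avail              # list all available software modules
 module load gcc/12.2      # load a specific module, e.g. the new GCC
 module list               # show what is currently loaded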

 

 

September 1, 2022

 

Fall maintenance week:

 

The HPC Cluster will undergo its regularly scheduled quarterly maintenance September 12-16.  The following maintenance activities will be performed:

 

Operating system updates

BIOS and firmware updates as needed

Slurm scheduler upgrade

Nvidia driver and CUDA updates

Miscellaneous hardware maintenance as needed

 

The entire cluster will be offline starting Tuesday the 13th at 8am, and will remain offline until approximately Wednesday the 14th at 3pm.  Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”.  If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do: 

 

scontrol update job {jobid} TimeLimit=2-00:00:00

 

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.

 

August 8, 2022

 

Datacenter cooling outage

Most of the HPC cluster underwent an emergency shutdown last night due to a cooling failure in the KEC datacenter as temperatures had reached critical levels.  Unfortunately many jobs or interactive sessions were terminated as a result of the emergency shutdown.  Most of the cooling has been restored, and HPC resources are slowly being brought back online to a level that can be accommodated by the available cooling.

 

New HPC portal apps

New interactive apps have been added to the HPC portal:

            Matlab

            Mathematica

            R Studio

            Ansys Workbench (for approved Ansys users only)

If you use any of these applications, check them out and let me know if you have any trouble using them. You can check out the HPC portal here:

https://ondemand.hpc.engr.oregonstate.edu

 

DGX queue change reminder

This is a reminder that the DGX partitions have been redefined as follows:

    If you need 4 GPUs or less, please use the “dgx” partition.

    If you need 4 GPUs or more, please use the “dgx2” partition.

If your jobs to the dgx/dgx2 partitions are pending with “QOSMinGRES” or “QOSMaxGRES”, or if your jobs are rejected for those reasons, that means you need to change the partition as noted above.

Note that these partitions can no longer be used with a list of partitions. Your default partition can be used to access these partitions.
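
For example (the script name and GPU counts below are placeholders), a smaller job would go to “dgx” and a larger one to “dgx2”:

 sbatch -p dgx  --gres=gpu:2 train.sh     # 4 GPUs or less
 sbatch -p dgx2 --gres=gpu:6 train.sh     # 4 GPUs or more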

 

August 1, 2022:

 

DGX partition changes

Two changes are being introduced to the DGX systems this month to improve job scheduling and flexibility. First, the “dgx” and “dgx2” partitions are being redefined. Starting tomorrow morning, the “dgx” partition will be used for smaller GPU workloads, i.e. 4 GPUs or less, and 12 CPUs or less, whereas the “dgx2” partition will be for larger GPU workloads, i.e. 4 GPUs or more, and 12 CPUs or more. For those of you who normally request fewer than 4 GPUs at a time under the “dgx2” partition, please change to the “dgx” partition. As some of you are aware, the larger the resource request, the longer the wait, and this change was recommended by the vendor as a way to improve the scheduling of larger workloads on the DGX systems. 

New DGX limits

The second change involves the DGX resource limits. The GPU and CPU limits for the DGX systems sometimes change based on overall load and resource availability. Lately we have settled on a limit of 8 GPUs and 32 CPUs in use at a time per user. This limit will temporarily be lifted and replaced by new limits based on cumulative GPU and CPU running times. This means that if GPUs are available, then more GPUs than the normal limit can be used by each user at a time, though for a shorter period (e.g. 16 GPUs for one day). This is intended to improve job flexibility while also maximizing use of resources, allowing users to run more jobs at a time or to run larger single calculations or experiments.

The new limits will be activated starting this week, and may require a lot of adjustment at first to optimize the load on the dgx partitions. These limits will be posted on the HPC status page once activated. If you have any questions about, or experience issues due to, the new limits, let me know.

Jupyter Notebook app

The Jupyter Notebook app on the HPC Portal is being replaced by the new Jupyter Server app, and will no longer appear in the list of interactive apps.  

 

July 12, 2022:

HPC Portal maintenance

The HPC Portal will be offline next Monday morning, the 18th, for maintenance. During this time, the web and portal packages will be updated and expiring SSL certificates will be replaced. Jobs running through the portal at the time should not be interrupted, but they will not be accessible until the portal is back online. The portal is expected to be back online by noon on the 18th.

New Jupyter Server app

An improved Jupyter Server app has been added to the list of Interactive Apps on the HPC Portal. This app gives you the option to use Jupyter Lab, and allows you to specify your own Python or Conda environment to run Jupyter, if desired. This app is in a beta testing phase and will eventually replace the existing Jupyter Notebook app.

Preempt Partition

The “preempt” partition has undergone a lengthy testing period and is being used with increasing frequency, so it bears mentioning here. The preempt partition is available to all users, but it is not the same as the share partition. It is a low-priority partition which can give you access to unused resources, but that access can be cancelled or “preempted” by a higher-priority request or job. The preempt partition can be useful for jobs that are restartable or checkpointed, and for short jobs or requests, i.e. cases where an interruption carries little risk or the loss can be mitigated. It is not recommended to use this partition for long jobs that are not checkpointed.
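
As a sketch (the script name, time and resources are placeholders), a restartable job could be submitted to the preempt partition with requeueing enabled, so that Slurm puts it back in the queue if it is preempted:

 sbatch -p preempt --requeue --time=4:00:00 restartable_job.sh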

New Software installs

 

Here is a list of recent software installs:

 GCC 12.1, 11.3, 9.5

 Mathematica 13.0

 Matlab 2021b-r3 with Parallel Server.  If anybody is interested in using Matlab Parallel Server, please contact me.

 Anaconda 2022.05, which will become the new default version

 Cuda 11.7 has been installed, and Nvidia drivers have recently been upgraded on most GPU nodes to support the latest Cuda version. Drivers supporting Cuda 11.5+ have not yet been released for the DGX systems.

 

The software listed above can be accessed through the modules system; type “module avail” to see what is available.

 

Fall maintenance week:

The fall maintenance week is tentatively scheduled for September 12-16, 2022.

 

HPC Portal:

 

The Open OnDemand HPC Portal is online. You may try it out here: https://ondemand.hpc.engr.oregonstate.edu

 

 

New commands/scripts for Slurm

 

At the request of our users, some additional, potentially useful commands have been added to the Slurm module, e.g.:

 

“nodestat partition” will show the status of and resources in use on each node in that partition, e.g., how many CPUs and GPUs and how much RAM each node has, and how much of each is currently in use.

 

“showjob jobid” will provide information on a currently running or pending job.  This can be used to obtain the estimated start time for a pending job, if available.

 

“tracejob -S YYYY-MM-DD” will provide information on jobs started since the date YYYY-MM-DD.

 

“sql” gives an alternate, longer job listing format than the default “squeue” command.

 

“squ” lists only jobs owned by the current user, using the default “squeue” format.

 

 

To receive cluster announcements and news via email, subscribe to the cluster email list.