Status

March 26, 2024:

The cluster is back online but at very limited capacity. Compute resources will be added as the upgrades progress.

The following nodes are offline until further notice:

  submit-c

 

 

 

News

 

March 14, 2024

 

Unplanned cluster outage

Yesterday afternoon I made a change to the queuing system that did not affect the queuing server itself but did impact all of the client nodes. I make changes to the queuing system frequently, but this particular change unexpectedly caused the client daemons to crash, which in turn caused several jobs to terminate prematurely with the “NODE_FAIL” message. If you are wondering why your job or session was terminated yesterday, this is likely the reason, and I apologize for the inconvenience this outage has caused.

 

Spring Break Maintenance/OS upgrades

The next cluster maintenance is scheduled for Spring Break, March 25-29. During this week I plan to complete the migration of the remaining cluster nodes to the EL8- and EL9-based operating systems. In addition to the OS upgrades, regular maintenance activities will be performed during this time, e.g.:

OnDemand HPC portal upgrade

Slurm update and configuration changes

Nvidia driver updates 

Infiniband updates

BIOS and firmware updates as needed

Miscellaneous hardware maintenance as needed

The entire cluster will be offline after Finals week, starting Monday, March 25 at 12pm, and will remain offline until approximately Wednesday the 27th at 5pm.  Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”.  If you wish for your Slurm job to start and finish before the offline period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do: 

 scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
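
For example (the job ID and script name below are just placeholders), the cancel-and-resubmit approach could look like this:

 scancel 123456                          # cancel the pending job
 sbatch --time=2-00:00:00 my_job.sh      # resubmit with a 2-day walltime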

 

Tentative planned cooling outage

A cooling outage impacting the HPC cluster has tentatively been scheduled for Friday, April 5 (time window TBD). The cluster will run at reduced capacity that day until the planned AC maintenance is complete.

 

Dealing with broken software on EL8/EL9  

As the upgrades have progressed, a number of users have found that some software which worked on EL7 no longer works on EL9. In some cases this has been addressed by installing new or compatible packages on the EL9 hosts. It is also currently possible to avoid an EL8/EL9 host by requesting the “el7” feature, either through the HPC portal or by adding the “--constraint=el7” option to your srun or sbatch options. Keep in mind, however, that this is only a short-term workaround: the entire cluster will eventually be running EL8/EL9, so the longer-term solution is to rebuild the software on an EL9 (or EL8) host. I appreciate everyone’s feedback and patience regarding the upgrade, and I encourage you to continue reporting any issues you encounter.
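
As an illustration, an interactive request pinned to an EL7 node might look like the following (add your usual partition and resource options; this is just a sketch):

 srun --constraint=el7 --pty bash        # request any node still running EL7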

 

Status of Nvidia DGX H100 systems

The dgxh partition will remain in testing phase until the week of April 15.

 

February 2, 2024

 

Busy queues

The demand for resources, especially GPUs, has been very high lately due to upcoming deadlines. This is resulting in much longer wait times than usual, so please be patient. If you have been waiting more than a day for your resources, leave your job in the queue and let me know; I may be able to work it in, or at least determine what is holding it up.

 

EL8/EL9 upgrades 

The cluster upgrade to EL8 and EL9 has been progressing slowly as various bugs are worked through. Progress is expected to pick up the week of February 12, as submit nodes and more compute nodes from the “share”, “dgx2” and “dgxs” partitions will be migrated. These partitions will not be migrated all at once; instead, one or more nodes from each partition will be upgraded each week until the migration is complete. At this point it is anticipated that the migration will be completed by the end of March. At present the DGX nodes “dgx2-4” and “dgxs-3” have been upgraded to EL9 with support for Cuda 12.2, and are now available through the dgx2 and dgxs partitions.

Note that if you do not wish to land on an EL8 or EL9 node at this time, you may request the “el7” feature, either through the HPC portal or by adding the “--constraint=el7” option to your srun or sbatch options. If you would prefer to land on an EL9 node, for instance because of updated dependencies and software support or the availability of more recent versions of Cuda, you may request the “el9” feature through the portal or use the “--constraint=el9” option in srun or sbatch.

Please continue to report any issues that you encounter with the cluster as the upgrade progresses.

 

Portal and ssh fixes on Nvidia DGX H100 and other EL9 systems

The HPC portal and ssh are now working properly on the dgxh and other EL9 systems. The dgxh partition is currently available through the Advanced Desktop (Xfce desktop only!) and Jupyter Server apps. Note that with the Advanced Desktop on EL9 systems, you should disable the screen lock, or you risk being locked out of your session and having to start over. To do this, click on “Applications” in the upper left-hand corner, then select “Settings”, then “Xfce screensaver”, then the “Lock Screen” tab, and turn off “Enable Lock Screen”.

 The dgxh partition will remain in testing phase until further notice while other issues are being addressed.

 

VSCode issues

Some users have recently reported a problem using VSCode on the HPC submit nodes. It appears that the latest version of VSCode no longer supports EL7, so you may not be able to run it on the submit nodes. The submit nodes will be upgraded to EL8 later this month through early March. Until then, my recommendation is to use an older version of VSCode if possible, or to use the Flip servers.

 

Spring Break Maintenance

The next cluster maintenance is scheduled for Spring Break, March 25-29.

 

December 1, 2023

Winter maintenance/OS upgrades

The cluster will operate at reduced capacity from December through mid-January while cluster nodes are upgraded to either EL8 or EL9 Linux. I will post a tentative upgrade schedule sometime next week, and will coordinate with partition group owners to help minimize the impact of these upgrades. In addition to the OS upgrades, regular maintenance activities will be performed during this time, e.g.:

 OnDemand HPC portal updates

Slurm update and configuration changes

Nvidia driver updates 

Infiniband updates

BIOS and firmware updates as needed

Miscellaneous hardware maintenance as needed

The entire cluster will be offline after Finals week, starting Monday, December 18th at 1pm, and will remain offline until approximately Wednesday the 20th at 3pm.  Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”.  If you wish for your Slurm job to start and finish before the offline period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do: 

scontrol update job {jobid} TimeLimit=2-00:00:00

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.

 

Status of Nvidia DGX H100 systems

I want to thank everyone who has given me feedback regarding the new DGX systems. A GPU issue came up which required that I open a ticket with Nvidia to troubleshoot. The nodes were taken offline, then reinstalled and reconfigured. The GPU issue has now been addressed, so pytorch and other CUDA applications should work. Also, the automount problem appears to be resolved, so users can access directories in /nfs/hpc/*. I anticipate that the HPC portal/vncserver will be available for these hosts sometime next week, so stay tuned. However, ssh access is still not available. The dgxh partition is back online and the testing period has resumed while other issues continue to be worked out. Feel free to give them another shot, and please continue to report any further issues to me.

 

October 24, 2023

New Nvidia DGX H100 systems are online!

Many of you may have received the announcement last week that three new Nvidia DGX H100 systems have been added to the COE HPC cluster. Each DGX H100 system has 112 CPUs, 8 Nvidia H100 GPUs, and 2TB RAM! These systems are now available in a new partition called “dgxh”. This partition is in a testing period for at least this week to resolve existing issues and smooth out any other kinks that crop up. During this testing period, the current resource limits are 2 GPUs and 32 CPUs per user, and the time limit is 24 hours. Be advised that the DGX H100 systems are running RHEL9-based Linux, which is different from the RHEL7-based systems currently used by the rest of the cluster. Also, these systems are not yet available through the HPC portal or through ssh. Give them a try and let me know of any issues you encounter.
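
As an example, an interactive session within the current testing limits might be requested as follows (the GPU, CPU, memory and time values here are only an illustration, not a recommendation):

 srun -p dgxh --gres=gpu:1 -c 16 --mem=64G --time=12:00:00 --pty bash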

 

DGX partition change

We currently have separate “dgx” and “dgx2” partitions for our DGX-2 systems, depending on how many GPUs are needed. For various reasons that have come up over time, it is no longer advantageous to have separate partitions for these systems, so they will again be merged into a single “dgx2” partition. The “dgx” partition will be phased out or possibly re-purposed, so please use “dgx2” instead. The resource limits will remain at 4 GPUs and 32 CPUs on the dgx2 partition for now, but may increase again in the future.

Submit-a offline this week

Just a reminder that Submit-a will be offline until next Monday the 30th for maintenance. Until then, please use submit-b or submit-c. 

 

 

October 9, 2023

HPC training reminders   

Attention new cluster users!

I am offering my “Intro to HPC” workshops every Wednesday at 3pm and Thursday at 4pm from October 11  through November 9. This workshop is designed to help new users become acquainted with and start using the cluster. To book a workshop session, click on the “Intro to HPC training session” link on the site below:

 

https://it.engineering.oregonstate.edu/hpc

 

AI and ML series of trainings from Mark III starting next week!

For anyone who is interested, Mark III is offering a free five-part series of AI and ML trainings starting next Tuesday, October 17, and running through November 14:

Tuesday, October 17 at 11am: Intro to AI and Machine Learning: The basics, tutorial and lab

Tuesday, October 24 at 11am: Intro to Deep Learning: An introduction to Neural Networks

Tuesday, October 31 at 11am: Intro to Datasets

Tuesday, November 7 at 11am: Intro to Computer Vision and Image Analytics

Tuesday, November 14 at 11am: Getting Started with Containers and AI

If you are interested in more details or want to register for any of the Mark III trainings listed above, please check out the “OSU AI Series” on the right-hand side of the HPC web site above. Also note the HPC resources from Nvidia displayed there.

 

Updates on cluster updates

The updates of the cluster compute and GPU nodes should be completed by the end of the day Tuesday, October 10. The submit nodes will be taken offline and updated on the following dates:

Submit-c: Monday, October 16 at 9am

Submit-b: Monday, October 23 at 9am

Submit-a: Monday, October 30 at 9am

Please plan accordingly. Srun sessions still running on these hosts when their updates begin will be terminated. Jobs submitted via sbatch or the HPC portal will not be affected. If you have any questions or concerns, let me know.

  

October 4, 2023

OS and GPU driver updates

The cluster is currently running at reduced capacity while OS updates are being rolled out. In addition, some users have reported that they cannot run their GPU codes on some GPU nodes, which appears to be due to the older drivers on those nodes. New GPU drivers are available, so the GPU nodes are scheduled to be offline next week in a staggered fashion (some Monday the 9th and some Tuesday the 10th) so that the new drivers can be installed. If you need GPU resources, please schedule your jobs so that they can complete before these offline periods; otherwise they will remain pending with the message “Required Node not available” or “ReqNodeNotAvail”. If that happens, you may be able to get in by reducing your time requirement using the “--time” or “-t” option in srun or sbatch.

 

OS upgrade

Many of you are aware that COE Linux servers have been or are being upgraded to Enterprise Linux 9 (EL9). This will also happen on the HPC cluster over the course of this Fall quarter. More details to come.

  

Intro to HPC workshop and other HPC training resources  

I am offering my “Intro to HPC” workshops every Wednesday at 3pm and Thursday at 4pm from October 11  through November 9. This workshop is designed to help new users become acquainted with and start using the cluster. To book a workshop session, click on the “Intro to HPC training session” link on the site below:

https://it.engineering.oregonstate.edu/hpc

In addition to the workshop, other HPC resources and training offerings from Nvidia and Mark III are displayed on the right hand side of the HPC web site. Mark III is offering a series of AI and ML trainings every Tuesday starting October 17 through November 14. I encourage you to check them out and sign up for them if you are interested.

 

July 17, 2023

Due to cooling issues, the cluster will run at reduced capacity during the weekends whenever there is a heat advisory.

 

May 18, 2023

The cluster will undergo its regularly scheduled maintenance during the summer break starting June 19th. The following maintenance activities will be performed:

 

Operating system updates

Slurm upgrade and configuration changes

Nvidia driver updates 

Infiniband driver updates

BIOS and firmware updates as needed

Miscellaneous hardware maintenance as needed

 

The entire cluster will be offline starting Monday the 19th at 1pm, and will remain offline until approximately Wednesday the 21st at 3pm.  Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”.  If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do: 

 

scontrol update job {jobid} TimeLimit=2-00:00:00

 

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.

 

March 6, 2023

The cluster will undergo its regularly scheduled maintenance during Spring Break starting next Monday the 27th. The following maintenance activities will be performed:

 

Operating system updates

Slurm update and configuration changes

Nvidia driver updates 

Infiniband driver updates

BIOS and firmware updates as needed

Miscellaneous hardware maintenance as needed

 

The entire cluster will be offline starting Monday the 27th at 1pm, and will remain offline until approximately Thursday the 30th at 3pm.  Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”.  If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do: 

 

scontrol update job {jobid} TimeLimit=2-00:00:00

 

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.

 

 

November 10, 2022

 

Cooling Failure

 

Early this morning there was a cooling failure in the KEC datacenter which allowed temperatures to climb to unsafe levels, resulting in the automatic shutdown of all DGX-2 nodes and thus the termination of all jobs running on these nodes. Cooling has been restored and temperatures have returned to safe levels, and the DGX-2 nodes are back online.

 

 

October 14, 2022

 

Intro to HPC Workshop

Next week I will be holding an HPC workshop over Zoom covering the basics of using the CoE HPC cluster, at the following dates and times:

  Wednesday, October 19 @3pm

  Thursday, October 20 @4pm

If you are interested in attending this workshop, please let me know which session works best for you.

 

New STAR-CCM+ app

STAR-CCM+ has been added to the list of interactive apps on the HPC portal. If anyone has any problems using it, let me know. You can check out the HPC portal here:
https://ondemand.hpc.engr.oregonstate.edu

 

Updated Jupyter Server app

Previously the Jupyter Server app allowed you to activate your own Python virtual environment or Anaconda installation, but it could not activate an environment created with conda. The Jupyter Server app on the HPC portal has now been improved to support your own conda environments as well. If anybody has any problems using this app, let me know.

New Software installs

 

Here is a list of recent software installs:

  Python 3.10

  GCC 12.2

  Mathematica 13.1

  Matlab 2022b with Parallel Server.  If anybody is interested in using Matlab Parallel Server, please contact me.

  

The software listed above can be accessed through the modules system; type “module avail” to see what is available.
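
For example (the module name below is illustrative; run “module avail” to see the exact names and versions on the cluster):

 module avail              # list all available software modules
 module load gcc/12.2      # load a specific module, e.g. the new GCC
 module list               # show what is currently loaded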

 

 

September 1, 2022

 

Fall maintenance week:

 

The HPC Cluster will undergo its regularly scheduled quarterly maintenance September 12-16.  The following maintenance activities will be performed:

 

Operating system updates

BIOS and firmware updates as needed

Slurm scheduler upgrade

Nvidia driver and CUDA updates

Miscellaneous hardware maintenance as needed

 

The entire cluster will be offline starting Tuesday the 13th at 8am, and will remain offline until approximately Wednesday the 14th at 3pm.  Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”.  If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do: 

 

scontrol update job {jobid} TimeLimit=2-00:00:00

 

Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.

 

August 8, 2022

 

Datacenter cooling outage

Most of the HPC cluster underwent an emergency shutdown last night due to a cooling failure in the KEC datacenter as temperatures had reached critical levels.  Unfortunately many jobs or interactive sessions were terminated as a result of the emergency shutdown.  Most of the cooling has been restored, and HPC resources are slowly being brought back online to a level that can be accommodated by the available cooling.

 

New HPC portal apps

New interactive apps have been added to the HPC portal:

            Matlab

            Mathematica

            R Studio

            Ansys Workbench (for approved Ansys users only)

If you use any of these applications, check them out and let me know if you have any trouble using them. You can check out the HPC portal here:

https://ondemand.hpc.engr.oregonstate.edu

 

DGX queue change reminder

This is a reminder that the DGX partitions have been redefined as follows:

    If you need 4 GPUs or less, please use the “dgx” partition.

    If you need 4 GPUs or more, please use the “dgx2” partition.

If your jobs to the dgx/dgx2 partitions are pending with “QOSMinGRES” or “QOSMaxGRES”, or if your jobs are rejected for those reasons, that means you need to change the partition as noted above.

Note that these partitions can no longer be used with a list of partitions. Your default partition can be used to access these partitions.
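
For example (the script name and GPU counts below are placeholders), a smaller job would go to “dgx” and a larger one to “dgx2”:

 sbatch -p dgx  --gres=gpu:2 train.sh     # 4 GPUs or less
 sbatch -p dgx2 --gres=gpu:6 train.sh     # 4 GPUs or more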

 

August 1, 2022:

 

DGX partition changes

Two changes are being introduced to the DGX systems this month to improve job scheduling and flexibility. First, the “dgx” and “dgx2” partitions are being redefined. Starting tomorrow morning, the “dgx” partition will be used for smaller GPU workloads, i.e. 4 GPUs or less, and 12 CPUs or less, whereas the “dgx2” partition will be for larger GPU workloads, i.e. 4 GPUs or more, and 12 CPUs or more. For those of you who normally request fewer than 4 GPUs at a time under the “dgx2” partition, please change to the “dgx” partition. As some of you are aware, the larger the resource request, the longer the wait, and this change was recommended by the vendor as a way to improve the scheduling of larger workloads on the DGX systems. 

New DGX limits

The second change involves the DGX resource limits. The GPU and CPU limits for the DGX systems sometimes change based on overall load and resource availability. Lately we have settled on a limit of 8 GPUs and 32 CPUs in use at a time per user. This limit will temporarily be lifted and replaced by new limits based on cumulative GPU and CPU running times. This means that if GPUs are available, then more GPUs than the normal limit can be used by each user at a time, though for a shorter period (e.g. 16 GPUs for one day). This is intended to improve job flexibility while also maximizing use of resources, allowing users to run more jobs at a time or to run larger single calculations or experiments.

The new limits will be activated starting this week, and may require a lot of adjustment at first to optimize the load on the dgx partitions. These limits will be posted on the HPC status page once activated. If you have any questions about, or experience issues due to, the new limits, let me know.

Jupyter Notebook app

The Jupyter Notebook app on the HPC Portal is being replaced by the new Jupyter Server app, and will no longer appear in the list of interactive apps.  

 

July 12, 2022:

HPC Portal maintenance

The HPC Portal will be offline next Monday morning, the 18th, for maintenance. During this time, the web and portal packages will be updated and expiring SSL certificates will be replaced. Jobs running through the portal at the time should not be interrupted, but they will not be accessible until the portal is back online. The portal is expected to be back online by noon on the 18th.

New Jupyter Server app

An improved Jupyter Server app has been added to the list of Interactive Apps on the HPC Portal. This app gives you the option to use Jupyter Lab, and allows you to specify your own Python or Conda environment to run Jupyter, if desired. This app is in a beta testing phase and will eventually replace the existing Jupyter Notebook app.

Preempt Partition

The “preempt” partition has undergone a lengthy testing period and is being used with increasing frequency, so it bears mentioning here. The preempt partition is available to all users, but it is not the same as the share partition. It is a low-priority partition which can give you access to unused resources, but that access can be cancelled or “preempted” by a higher-priority request or job. The preempt partition can be useful for jobs that are restartable or checkpointed, and for short jobs or requests, i.e. cases where an interruption carries little risk or the loss can be mitigated. It is not recommended to use this partition for long jobs that are not checkpointed.
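
As a sketch (the script name, time and resources are placeholders), a restartable job could be submitted to the preempt partition with requeueing enabled, so that Slurm puts it back in the queue if it is preempted:

 sbatch -p preempt --requeue --time=4:00:00 restartable_job.sh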

New Software installs

 

Here is a list of recent software installs:

 GCC 12.1, 11.3, 9.5

 Mathematica 13.0

 Matlab 2021b-r3 with Parallel Server.  If anybody is interested in using Matlab Parallel Server, please contact me.

 Anaconda 2022.05, which will become the new default version

 Cuda 11.7 has been installed, and Nvidia drivers have recently been upgraded on most GPU nodes to support the latest Cuda version. Drivers supporting Cuda 11.5+ have not yet been released for the DGX systems.

 

The software listed above can be accessed through the modules system; type “module avail” to see what is available.

 

Fall maintenance week:

The fall maintenance week is tentatively scheduled for September 12-16, 2022.

 

HPC Portal:

 

The Open OnDemand HPC Portal is online. You may try it out here: https://ondemand.hpc.engr.oregonstate.edu

 

 

New commands/scripts for Slurm

 

At the request of our users, some additional, potentially useful commands have been added to the Slurm module, e.g.:

 

“nodestat partition” will show the status of and resources in use on each node in that partition, e.g., how many CPUs and GPUs and how much RAM each node has, and how much of each is currently in use.

 

“showjob jobid” will provide information on a currently running or pending job.  This can be used to obtain the estimated start time for a pending job, if available.

 

“tracejob -S YYYY-MM-DD” will provide information on jobs started since the date YYYY-MM-DD.

 

“sql” gives an alternate, longer job listing format than the default “squeue” command.

 

“squ” lists only jobs owned by the current user, using the default “squeue” format.

 

 

To receive cluster announcements and news via email, subscribe to the cluster email list.