Status
November 30, 2023:
The cluster is online. The HPC portal on submit-a appears to be fixed; please report any problems to COEIT support.
News
December 1, 2023
Winter maintenance/OS upgrades
The cluster will operate at reduced capacity from December through mid-January while cluster nodes are upgraded to either EL8 or EL9 Linux. I will post a tentative upgrade schedule sometime next week and will coordinate with partition group owners to help minimize the impact of these upgrades. In addition to the OS upgrades, regular maintenance activities will be performed during this time, e.g.:
OnDemand HPC portal updates
Slurm update and configuration changes
Nvidia driver updates
Infiniband updates
BIOS and firmware updates as needed
Miscellaneous hardware maintenance as needed
The entire cluster will be offline after Finals week, starting Monday, December 18th at 1pm, and will remain offline until approximately Wednesday the 20th at 3pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the offline period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
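For example, to cancel and resubmit with a 1-day limit, something along these lines should work (the 1-day value and the script name my_job.sh are only placeholders for your own job):
# cancel the pending job
scancel {jobid}
# resubmit with a shorter walltime so the job can finish before the offline period
sbatch --time=1-00:00:00 my_job.sh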
Status of Nvidia DGX H100 systems
I want to thank everyone who has given me feedback regarding the new DGX systems. A GPU issue came up that required opening a ticket with Nvidia to troubleshoot. The nodes were taken offline, then reinstalled and reconfigured. The GPU issue has now been addressed, so PyTorch and other CUDA applications should work. The automount problem also appears to be resolved, so users can access directories in /nfs/hpc/*. I anticipate that the HPC portal/vncserver will be available for these hosts sometime next week, so stay tuned. However, ssh access is still not available. The dgxh partition is back online and the testing period has resumed while other issues continue to be worked out. Feel free to give them another shot, and please continue to report any further issues to me.
October 24, 2023
New Nvidia DGX H100 systems are online!
Many of you may have received the announcement last week that three new Nvidia DGX H100 systems have been added to the COE HPC cluster. Each DGX H100 system has 112 CPUs, 8 Nvidia H100 GPUs, and 2TB RAM! These systems are now available in a new partition called “dgxh”. This partition is in a testing period for at least this week to resolve existing issues and smooth out any other kinks that crop up. During this testing period, the current resource limits are 2 GPUs and 32 CPUs per user, and the time limit is 24 hours. Be advised that the DGX H100 systems run RHEL9-based Linux, which is different from the RHEL7-based systems currently used by the rest of the cluster. Also, these systems are not yet available through the HPC portal or through ssh. Give them a try and let me know of any issues you encounter.
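For example, an interactive session on the new partition might be requested along these lines (the GPU/CPU counts and the 12-hour limit are just an illustration within the current limits):
# 2 GPUs, 32 CPUs, 12 hours on the dgxh partition
srun -p dgxh --gres=gpu:2 -c 32 --time=12:00:00 --pty bash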
DGX partition change
We currently have separate “dgx” and “dgx2” partitions for our DGX-2 systems, depending on how many GPUs are needed. For various reasons that have come up over time, it is no longer advantageous to have separate partitions for these systems, so they will again be merged into a single “dgx2” partition. The “dgx” partition will be phased out or possibly re-purposed, so please use “dgx2” instead. The resource limits will remain at 4 GPUs and 32 CPUs on the dgx2 partition for now, but may increase again in the future.
Submit-a offline this week
Just a reminder that submit-a will be offline until next Monday the 30th for maintenance. Until then, please use submit-b or submit-c.
October 9, 2023
HPC training reminders
Attention new cluster users!
I am offering my “Intro to HPC” workshops every Wednesday at 3pm and Thursday at 4pm from October 11 through November 9. This workshop is designed to help new users become acquainted with and start using the cluster. To book a workshop session, click on the “Intro to HPC training session” link on the site below:
https://it.engineering.oregonstate.edu/hpc
AI and ML series of trainings from Mark III starting next week!
For anyone who is interested, Mark III is offering a free five-part series of AI and ML trainings starting next Tuesday, October 17 and running through November 14:
Tuesday, October 17 at 11am: Intro to AI and Machine Learning: The basics, tutorial and lab
Tuesday, October 24 at 11am: Intro to Deep Learning: An introduction to Neural Networks
Tuesday, October 31 at 11am: Intro to Datasets
Tuesday, November 7 at 11am: Intro to Computer Vision and Image Analytics
Tuesday, November 14 at 11am: Getting Started with Containers and AI
If you are interested in more details or want to register for any of the Mark III trainings listed above, please check out the “OSU AI Series” on the right-hand side of the HPC web site above. HPC resources from Nvidia are also displayed there.
Updates on cluster updates
The updates of the cluster compute and GPU nodes should be completed by the end of the day on Tuesday, October 10. The submit nodes are scheduled to be taken offline and updated on the following dates:
Submit-c: Monday, October 16 at 9am
Submit-b: Monday, October 23 at 9am
Submit-a: Monday, October 30 at 9am
Please plan accordingly. Srun jobs still running on these hosts when they are scheduled to be updated will be terminated. Jobs submitted via sbatch or the HPC portal will not be affected. If you have any questions or concerns, let me know.
October 4, 2023
OS and GPU driver updates
The cluster is currently running at reduced capacity while OS updates are being rolled out. In addition, some users have reported that they cannot run their GPU codes on some GPU nodes, which appears to be due to the older drivers on those nodes. New GPU drivers are available, so the GPU nodes are scheduled to be offline next week in a staggered fashion (some Monday the 9th and some Tuesday the 10th) so that the new drivers can be installed. If you need GPU resources, please schedule your jobs so that they can complete before these offline periods; otherwise they will remain pending with the message “Required Node not available” or “ReqNodeNotAvail”. If that happens, you may be able to get in by reducing your time request using the “--time” or “-t” option in srun or sbatch.
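If you submit with a batch script, the shorter time limit can also go in the script header, for example (the partition name, GPU count, and 8-hour limit below are only placeholders; use whichever GPU partition you normally submit to):
#!/bin/bash
#SBATCH -p dgx2              # substitute the GPU partition you normally use
#SBATCH --gres=gpu:1         # one GPU, as an example
#SBATCH --time=08:00:00      # 8 hours; equivalent to -t 08:00:00
# ... your commands here ...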
OS upgrade
Many of you are aware that many COE linux servers have been or are being upgraded to Enterprise Linux 9 (EL9). This will also happen on the HPC cluster over the course of this Fall quarter. More details to come.
Intro to HPC workshop and other HPC training resources
I am offering my “Intro to HPC” workshops every Wednesday at 3pm and Thursday at 4pm from October 11 through November 9. This workshop is designed to help new users become acquainted with and start using the cluster. To book a workshop session, click on the “Intro to HPC training session” link on the site below:
https://it.engineering.oregonstate.edu/hpc
In addition to the workshop, other HPC resources and training offerings from Nvidia and Mark III are displayed on the right-hand side of the HPC web site. Mark III is offering a series of AI and ML trainings every Tuesday from October 17 through November 14. I encourage you to check them out and sign up if you are interested.
July 17, 2023
Due to cooling issues, the cluster will run at reduced capacity during the weekends whenever there is a heat advisory.
May 18, 2023
The cluster will undergo its regularly scheduled maintenance during the break starting June 19th. The following maintenance activities will be performed:
Operating system updates
Slurm upgrade and configuration changes
Nvidia driver updates
Infiniband driver updates
BIOS and firmware updates as needed
Miscellaneous hardware maintenance as needed
The entire cluster will be offline starting Monday the 19th at 1pm, and will remain offline until approximately Wednesday the 21st at 3pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
March 6, 2023
The cluster will undergo its regularly scheduled maintenance during Spring Break starting next Monday the 27th. The following maintenance activities will be performed:
Operating system updates
Slurm update and configuration changes
Nvidia driver updates
Infiniband driver updates
BIOS and firmware updates as needed
Miscellaneous hardware maintenance as needed
The entire cluster will be offline starting Monday the 27th at 1pm, and will remain offline until approximately Thursday the 30th at 3pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
November 10, 2022
Cooling Failure
Early this morning there was a cooling failure in the KEC datacenter which allowed temperatures to climb to unsafe levels, resulting in the automatic shutdown of all DGX-2 nodes and thus the termination of all jobs running on these nodes. Cooling has been restored and temperatures have returned to safe levels, and the DGX-2 nodes are back online.
October 14, 2022
Intro to HPC Workshop
I will be holding an HPC workshop over Zoom next week, covering the basics of using the CoE HPC cluster, at the following dates and times:
Wednesday, October 19 @3pm
Thursday, October 20 @4pm
If you are interested in attending this workshop, please let me know which session works best for you.
New STAR-CCM+ app
STAR-CCM+ has been added to the list of interactive apps on the HPC portal. If anyone has any problems using it, let me know. You can check out the HPC portal here:
https://ondemand.hpc.engr.oregonstate.edu
Updated Jupyter Server app
Previously, the Jupyter Server app allowed you to activate your own Python virtual environment or your own conda installation, but could not activate an environment created with conda. The Jupyter Server app on the HPC portal has now been improved to let you use your own conda environment as well. If anybody has any problems using this app, let me know.
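As an illustration, a conda environment containing Jupyter can be created along these lines before selecting it in the app (the environment name myenv and the package versions are only examples):
# create an environment that includes JupyterLab
conda create -n myenv python=3.10 jupyterlab
# activate it once to confirm it works before using it with the app
conda activate myenv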
New Software installs
Here is a list of recent software installs:
Python 3.10
GCC 12.2
Mathematica 13.1
Matlab 2022b with Parallel Server. If anybody is interested in using Matlab Parallel Server, please contact me.
The software listed above can be accessed through the modules system; type “module avail” to see what is available.
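A typical session looks something like the following (the exact module names and versions may differ slightly from what “module avail” reports on the cluster):
# list everything available
module avail
# load a specific package and confirm it is loaded
module load python/3.10
module list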
September 1, 2022
Fall maintenance week:
The HPC Cluster will undergo its regularly scheduled quarterly maintenance September 12-16. The following maintenance activities will be performed:
Operating system updates
BIOS and firmware updates as needed
Slurm scheduler upgrade
Nvidia driver and CUDA updates
Miscellaneous hardware maintenance as needed
The entire cluster will be offline starting Tuesday the 13th at 8am, and will remain offline until approximately Wednesday the 14th at 3pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
August 8, 2022
Datacenter cooling outage
Most of the HPC cluster underwent an emergency shutdown last night due to a cooling failure in the KEC datacenter as temperatures had reached critical levels. Unfortunately many jobs or interactive sessions were terminated as a result of the emergency shutdown. Most of the cooling has been restored, and HPC resources are slowly being brought back online to a level that can be accommodated by the available cooling.
New HPC portal apps
New interactive apps have been added to the HPC portal:
Matlab
Mathematica
R Studio
Ansys Workbench (for approved Ansys users only)
If you use any of these applications, check them out on the portal and let me know if you have any trouble using them. You can check out the HPC portal here:
https://ondemand.hpc.engr.oregonstate.edu
DGX queue change reminder
This is a reminder that the DGX partitions have been redefined as follows:
If you need 4 GPUs or fewer, please use the “dgx” partition.
If you need 4 GPUs or more, please use the “dgx2” partition.
If your jobs on the dgx/dgx2 partitions are pending with “QOSMinGRES” or “QOSMaxGRES”, or are rejected for those reasons, you need to change the partition as noted above.
Note that these partitions can no longer be used as part of a list of partitions. Your default partition can be used to access these partitions.
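For reference, the partition is chosen with the -p (or --partition) option at submission time, for example (the script name my_job.sh and the GPU counts are only placeholders):
# smaller request: 2 GPUs on the dgx partition
sbatch -p dgx --gres=gpu:2 my_job.sh
# larger request: 8 GPUs on the dgx2 partition
sbatch -p dgx2 --gres=gpu:8 my_job.sh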
August 1, 2022:
DGX partition changes
Two changes are being introduced to the DGX systems this month to improve job scheduling and flexibility. First, the “dgx” and “dgx2” partitions are being redefined. Starting tomorrow morning, the “dgx” partition will be used for smaller GPU workloads, i.e. 4 GPUs or fewer and 12 CPUs or fewer, whereas the “dgx2” partition will be for larger GPU workloads, i.e. 4 GPUs or more and 12 CPUs or more. If you normally request fewer than 4 GPUs at a time under the “dgx2” partition, please change to the “dgx” partition. As some of you are aware, the larger the resource request, the longer the wait; this change was recommended by the vendor as a way to improve the scheduling of larger workloads on the DGX systems.
New DGX limits
The second change involves the DGX resource limits. The GPU and CPU limits for the DGX systems sometimes change based on overall load and resource availability. Lately we have settled on a limit of 8 GPUs and 32 CPUs in use at a time per user. This limit will temporarily be replaced by new limits based on cumulative GPU and CPU running times. This means that if GPUs are available, each user can use more GPUs than the normal limit at a time, though for a shorter period (e.g. 16 GPUs for one day). This is intended to improve job flexibility while maximizing use of resources, allowing users to run more jobs at a time or to run larger single calculations or experiments.
The new limits will be activated starting this week, and may require significant adjustment at first to optimize the load on the dgx partitions. These limits will be posted on the HPC status page once activated. If you have any questions about, or experience issues due to, the new limits, let me know.
Jupyter Notebook app
The Jupyter Notebook app on the HPC Portal is being replaced by the new Jupyter Server app, and will no longer appear in the list of interactive apps.
July 12, 2022:
HPC Portal maintenance
The HPC Portal will be offline next Monday morning, the 18th, for maintenance. During this time, the web and portal packages will be updated and expiring SSL certificates will be replaced. Jobs running through the portal at that time should not be interrupted, but they will not be accessible until the portal is back online. The portal is expected to be back online by noon on the 18th.
New Jupyter Server app
An improved Jupyter Server app has been added to the list of Interactive Apps on the HPC Portal. This app gives you the option to use JupyterLab, and allows you to specify your own Python or conda environment to run Jupyter, if desired. This app is in a beta testing phase and will eventually replace the existing Jupyter Notebook app.
Preempt Partition
The “preempt” partition has undergone a lengthy testing period and is now being used with increasing frequency, so it bears mentioning here. The preempt partition is available to all users, but it is not the same as the share partition. It is a low-priority partition that can give you access to unused resources, but that access can be cancelled or “preempted” by a higher-priority request or job. The preempt partition can be useful for jobs that are restartable or checkpointed, or for short jobs or requests, i.e. where the risk or cost of interruption is low. It is not recommended to use this partition for long jobs that are not checkpointed.
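For example, a job that can restart from a checkpoint might be submitted with requeueing enabled so that Slurm puts it back in the queue if it gets preempted (the script name below is just a placeholder, and whether a preempted job is requeued or cancelled depends on how preemption is configured):
sbatch -p preempt --requeue my_restartable_job.sh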
New Software installs
Here is a list of recent software installs:
GCC 12.1, 11.3, 9.5
Mathematica 13.0
Matlab 2021b-r3 with Parallel Server. If anybody is interested in using Matlab Parallel Server, please contact me.
Anaconda 2022.05, which will become the new default version
CUDA 11.7 has been installed, and Nvidia drivers have recently been upgraded on most GPU nodes to support the latest CUDA version. Drivers for CUDA 11.5+ have not yet been released for the DGX systems.
The software listed above can be accessed through the modules system; type “module avail” to see what is available.
Fall break maintenance:
The fall break maintenance is tentatively scheduled for September 12-16, 2022.
HPC Portal:
The Open OnDemand HPC Portal is online. You may try it out at https://ondemand.hpc.engr.oregonstate.edu.
New commands/scripts for Slurm.
At the request of our users, some additional, potentially useful commands have been added to the Slurm module (a short example session follows this list), e.g.:
“nodestat partition” will show the status and resources of each node in that partition, e.g., how many CPUs and GPUs and how much RAM each node has, and how much of each is currently in use.
“showjob jobid” will provide information on a currently running or pending job. This can be used to obtain the estimated start time for a pending job, if available.
“tracejob -S YYYY-MM-DD” will provide information on jobs started since the date YYYY-MM-DD.
“sql” gives an alternate, longer job listing format than the default “squeue” command.
“squ” lists only jobs owned by the user, using the default “squeue” format.
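For example (the partition name and job ID below are only placeholders):
# show per-node CPU, GPU, and RAM usage on the dgx2 partition
nodestat dgx2
# show details, including estimated start time, for a pending job
showjob 123456
# list jobs started since October 1, 2022
tracejob -S 2022-10-01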
To receive cluster announcements and news via email, subscribe to the cluster email list.