Status
December 9, 2024:
The cluster is online, but cluster nodes are being drained/reserved for the upcoming maintenance starting December 16 (see News below for details).
The following nodes are offline until further notice:
dgx2-6
submit-c
News
December 9, 2024
Maintenance week December 16-20
The next cluster maintenance is scheduled for the week of December 16. The maintenance activities planned for this time include:
Head node upgrade
OnDemand HPC portal upgrade
Slurm upgrade and configuration changes
Operating system image updates
Nvidia GPU and Infiniband driver updates
BIOS and firmware updates as needed
Miscellaneous hardware maintenance as needed
The entire cluster will be offline starting Monday, December 16 at 1pm, and will remain offline until approximately Wednesday the 18th at 4pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the offline period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
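For example, a hypothetical resubmission with a 2-day walltime might look like the following (my_job.sh is just a placeholder for your own batch script):
scancel {jobid}
sbatch --time=2-00:00:00 my_job.sh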
November 12, 2024
The next cluster maintenance is scheduled for the week of December 16th.
September 10, 2024
Maintenance week September 23-27
The next cluster maintenance is scheduled for the week of September 23. The maintenance activities planned for this time include:
Operating system updates
OnDemand HPC portal upgrade
Slurm update and configuration changes
Nvidia driver updates
BIOS and firmware updates as needed
Miscellaneous hardware maintenance as needed
The entire cluster will be offline starting Monday, September 23 at 8am, and will remain offline until approximately Tuesday the 24th at 4pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the offline period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
June 20, 2024
Cluster online
The cluster is back online, but it is currently running at limited capacity while maintenance continues on the rest of the cluster. If you are trying to access a resource and you get either of the messages below:
(partitiondown)
Or
(ReqNodeNotAvail)
This means that the resource you are requesting is not available yet, but keep checking; additional resources will be brought back online as the maintenance progresses.
Submit-a is still undergoing maintenance until next week, so in the meantime please use Submit-b, Submit-c, or just Submit instead. When you access the upgraded nodes, you may be met with a message like the following:
"host key for submit-b.hpc.engr.oregonstate.edu has changed and you have requested strict checking. Host key verification failed."
Or something similar. To address this, please remove your old host keys as follows:
ssh-keygen -R submit-a.hpc.engr.oregonstate.edu
ssh-keygen -R submit-b.hpc.engr.oregonstate.edu
ssh-keygen -R submit-c.hpc.engr.oregonstate.edu
ssh-keygen -R submit.hpc.engr.oregonstate.edu
After that, try connecting again, accept the new host keys, and you should be set.
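For example, reconnecting might look like the following (replace username with your own username):
ssh username@submit-b.hpc.engr.oregonstate.edu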
New HPC storage
All HPC share data has been migrated to our new DDN storage appliance, still located on /nfs/hpc/share, and all upgraded nodes are now using this storage. Everyone should check their HPC share directories to make sure their data is there as expected, and if you think something is missing or doesn’t look right, let me know.
Survey
If you haven’t had a chance yet, please take a few minutes to complete the survey below; your feedback is important.
https://oregonstate.qualtrics.com/jfe/form/SV_290Wnkkv7IFqSW2
June 10, 2024
This is a reminder that Summer maintenance is scheduled to start next week (after Finals week), and will go on until completed. During this time I plan to complete the migration of the remaining cluster nodes to the EL8 and EL9 based operating systems. In addition to the OS upgrades, regular maintenance activities will be performed during this time, e.g.:
OnDemand HPC portal upgrade
Slurm upgrade and configuration changes
Nvidia driver updates
BIOS and firmware updates as needed
Miscellaneous hardware maintenance as needed
The entire cluster will be offline starting Monday, June 17 at 1pm, and will remain offline until approximately Wednesday the 19th at 4pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the offline period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do:
scontrol update job {jobid} TimeLimit=2-00:00:00
May 6, 2024
Update on DGXH partition
The dgxh partition has been in very high demand lately, running almost steadily at or near 100% for the last few weeks, and wait times for this resource have been as long as a week or more. I am continually working on improving this situation so that everyone gets a fair chance to use these resources in a way that meets their needs. When your allocation is ready, make sure you use it effectively and don’t waste it. This is a limited resource that is primarily for production runs. We have other resources that can be used for testing and debugging. Sessions that are found running idle or unused for a period of time may be terminated, so use this resource wisely.
VSCode issues
A number of our users use VSCode to ssh to the cluster. This has an unfortunate side effect of leaving defunct processes running on the submit nodes, which can eventually prevent the user from logging in again. This problem is occurring with increasing regularity, so to address it, VSCode processes will be automatically cleaned up on a periodic basis. We advise that you do not use “srun” through VSCode, as it may be terminated along with the VSCode process.
Submit-c is online
Submit-c has been upgraded to EL8 and is back online for both ssh and portal access. Also, upgraded versions of VSCode should work on this node. If anyone has any problems with submit-c, let me know.
HPC Portal issues
Occasionally, users logging in through the HPC portal are met with a “Bad Request” error. One way to work around this problem is to choose a different submit node, e.g. if you get a “Bad Request” on submit-a, then try submit-b or submit-c. Alternatively, try deleting your browser cache and cookies, then try again.
Some users have recently encountered a problem being locked out of their Desktop sessions on EL9 systems. This issue should now be resolved and users should no longer be locked out, but to prevent repeated lockouts I recommend that users disable or change their screensaver settings.
Apptainer/Singularity issue
It was recently reported that Apptainer is not working on the EL9 systems. This is still being worked on. In the meantime I recommend using Apptainer on the EL7 systems where possible.
Summer Maintenance
The next cluster maintenance period is scheduled for the week of June 17, after finals week. The cluster will be offline for part of that week, so please plan accordingly.
April 9, 2024
Update on DGX-2 nodes
Last week a number of users reported a problem with the Nvidia GPU driver on the DGX-2 systems after the upgrade to EL9, so the newly upgraded DGX-2 systems had to be taken offline. A ticket was opened with Nvidia last week to troubleshoot the issue. Earlier this morning Nvidia spotted a potential problem, and a fix is currently being rolled out to the DGX-2 nodes. Most of the DGX-2 systems should be back online by this evening. For up-to-date status on the cluster, including the DGX nodes, check out the link below:
https://it.engineering.oregonstate.edu/hpc/hpc-cluster-status-and-news
AI Week and GPU Day Reminder
Just a reminder that this is AI Week. Tomorrow (Wednesday April 10) is GPU Day, where Nvidia and Mark III will team up to “bring you an action-packed day of learning about what GPUs are, how they can help your research, and how to optimize them for your workloads”. To register for GPU day and other AI Week events, please check out the link below:
https://dri.oregonstate.edu/ai-week
MarkIII AI Training Series Reminder
Another reminder that Mark III is offering a seven-part series of AI and ML trainings every Tuesday at 11am, starting next week on April 16 and running through May 28. The training topics are listed below:
April 16 - Intro to Machine Learning and AI: The Basics, A Tutorial, and Lab
April 23 - Intro to Deep Learning: An Introduction to Neural Networks
April 30 - Introduction to Datasets
May 7 - Introduction to Large Language Models
May 14 - Getting Started with Containers and the software stack around AI + How to get started working with OSU HPC Services
May 21 - Intro to Omniverse & Digital Twins
May 28 - Intro to Isaac Sim and AI in Robotics
I encourage you to check them out and sign up for them using the link below if you are interested:
https://trending.markiiisys.com/osu-aiseries-2024
April 2, 2024
Welcome to the start of Spring quarter 2024! See below the latest news on the CoE HPC cluster.
Update on EL8/EL9 upgrade
I was not able to complete the upgrade of the rest of the cluster to EL8 and EL9 during the Spring Break maintenance week. The upgrade process has been bumpy, and at this time only nodes from the dgx2, dgxs and ampere partitions have been migrated to EL9. The ampere partition is available now, and the dgx2 and dgxs partitions should be available by tomorrow (Wednesday) morning.
Most of the rest of the cluster, including submit-a and submit-b, is still running EL7. I will continue working on the rest of the cluster over the course of this term, and will try to post on the HPC status page the approximate dates when certain nodes or partitions will be upgraded. I will work with affected research groups on scheduling upgrade windows.
AI Week!
As announced through various sources, next week (April 8-12) is AI Week. Wednesday April 10 is GPU Day, where Nvidia and Mark III will team up to “bring you an action-packed day of learning about what GPUs are, how they can help your research, and how to optimize them for your workloads”. For more information, and to register for GPU day and other AI Week events, please check out the link below:
https://dri.oregonstate.edu/ai-week
Intro to HPC workshop
I am offering my “Intro to HPC” workshops again this quarter, approximately once per week starting next Tuesday the 9th. This workshop is designed to help new users become acquainted with and start using the cluster. Due to a problem with my Bookings page, people were unable to book a meeting during April. This has now been fixed, so if you are interested in attending this workshop this month, please register for a date via the “Intro to HPC training session” link at https://it.engineering.oregonstate.edu/hpc
MarkIII AI training series
Mark III is offering a seven-part series of AI and ML trainings every Tuesday at 11am from April 16 through May 28:
April 16 - Intro to Machine Learning and AI: The Basics, A Tutorial, and Lab
April 23 - Intro to Deep Learning: An Introduction to Neural Networks
April 30 - Introduction to Datasets
May 7 - Introduction to Large Language Models
May 14 - Getting Started with Containers and the software stack around AI + How to get started working with OSU HPC Services
May 21 - Intro to Omniverse & Digital Twins
May 28 - Intro to Isaac Sim and AI in Robotics
I encourage you to check them out and sign up for them using the link below if you are interested:
https://trending.markiiisys.com/osu-aiseries-2024
March 14, 2024
Unplanned cluster outage
Yesterday afternoon I made a change in the queuing system that did not affect the queuing server itself, but it did impact all of the client nodes. I make changes to the queuing system frequently, but this particular change unexpectedly caused the client daemons to crash, which in turn caused several jobs to terminate prematurely with the “NODE FAIL” message. For those of you wondering why your job or session was terminated yesterday, this is likely the reason, and I apologize for the inconvenience this outage has caused.
Spring Break Maintenance/OS upgrades
The next cluster maintenance is scheduled for Spring Break, March 25-29. During this week I plan to complete the migration of the remaining cluster nodes to the EL8 and EL9 based operating systems. In addition to the OS upgrades, regular maintenance activities will be performed during this time, e.g.:
OnDemand HPC portal upgrade
Slurm update and configuration changes
Nvidia driver updates
Infiniband updates
BIOS and firmware updates as needed
Miscellaneous hardware maintenance as needed
The entire cluster will be offline after Finals week, starting Monday, March 25 at 12pm, and will remain offline until approximately Wednesday the 27th at 5pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the offline period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
Tentative planned cooling outage
A cooling outage impacting the HPC cluster has tentatively been scheduled for Friday, April 5 (time window TBD). The cluster will run at reduced capacity that day until the planned AC maintenance is complete.
Dealing with broken software on EL8/EL9
As the upgrades have progressed, a number of users have found that some software which worked on EL7 no longer works on EL9. In some cases, this has been addressed by installing new or compatible packages on the EL9 hosts. It is also currently possible to avoid an EL8/EL9 host by requesting the “el7” feature, either through the HPC portal or by adding the “--constraint=el7” option to your srun or sbatch options. However, keep in mind that this is only a short-term solution, since the entire cluster will eventually be running EL8/EL9; the longer-term solution in these cases is to rebuild the software on an EL9 (or EL8) host. I appreciate everyone’s feedback and patience regarding the upgrade, and I encourage everyone to continue reporting any issues they encounter.
Status of Nvidia DGX H100 systems
The dgxh partition will remain in testing phase until the week of April 15.
February 2, 2024
Busy queues
The demand for resources, especially GPUs, has been very high lately due to upcoming deadlines. This is resulting in much longer wait times than usual, so please be patient. If you have been waiting for over a day for your resources, leave your job in the queue and let me know, as I may be able to work it in or at least determine what is holding it up.
EL8/EL9 upgrades
The cluster upgrade to EL8 and EL9 has been progressing slowly as various bugs are being worked through. Progress is expected to pick up the week of February 12, as submit nodes and more compute nodes from the “share”, “dgx2” and “dgxs” partitions will be migrated. These partitions will not be migrated all at once; instead, each week one or more nodes from each partition will be upgraded until the migration is completed. At this point it is anticipated that the migration will be completed by the end of March. At present the DGX nodes “dgx2-4” and “dgxs-3” have been upgraded to EL9 with support for Cuda 12.2, and are now available through the dgx2 and dgxs partitions.
Note that if you do not wish to land on an EL8 or EL9 node at this time, you may request the “el7” feature, either through the HPC portal or by adding the “--constraint=el7” option to your srun or sbatch options. For those of you who would prefer to land on an EL9 node (for instance, because of updated dependencies and software support, or the availability of more recent versions of Cuda), you may request the “el9” feature through the portal or use the “--constraint=el9” option in srun or sbatch.
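As an illustration (my_job.sh is only a placeholder for your own batch script), pinning a job to a particular OS generation might look like:
sbatch --constraint=el7 my_job.sh
srun --constraint=el9 --pty bash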
Please continue to report any issues that you encounter with the cluster as the upgrade progresses.
Portal and ssh fixes on Nvidia DGX H100 and other EL9 systems
The HPC portal and ssh are now working properly on the dgxh and other EL9 systems. The dgxh partition is currently available on the Advanced Desktop (Xfce desktop only!) and Jupyter Server apps. Note that with the Advanced Desktop on EL9 systems, you should disable the screen lock, or you risk being locked out of your session and having to start over. To do this, click on “Applications” in the upper left hand corner, then select “Settings”, then “Xfce screensaver”, then the “Lock Screen” tab, and turn off “Enable Lock Screen”.
The dgxh partition will remain in testing phase until further notice while other issues are being addressed.
VSCode issues
Some users have recently reported a problem using VSCode on the HPC submit nodes. It appears that the latest version or update of VSCode no longer supports EL7, so you may not be able to run it on the submit nodes. The submit nodes will be upgraded to EL8 later this month through early March. Until then, my recommendation is to use an older version of VSCode if possible, or use the Flip servers until the submit nodes are upgraded.
Spring Break Maintenance
The next cluster maintenance is scheduled for Spring Break, March 25-29.
December 1, 2023
Winter maintenance/OS upgrades
The cluster will operate at reduced capacity from December through mid-January while cluster nodes are upgraded to either EL8 or EL9 Linux. I will post a tentative upgrade schedule sometime next week, and will coordinate with partition group owners to help minimize the impact of these upgrades. In addition to the OS upgrades, regular maintenance activities will be performed during this time, e.g.:
OnDemand HPC portal updates
Slurm update and configuration changes
Nvidia driver updates
Infiniband updates
BIOS and firmware updates as needed
Miscellaneous hardware maintenance as needed
The entire cluster will be offline after Finals week, starting Monday, December 18th at 1pm, and will remain offline until approximately Wednesday the 20th at 3pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the offline period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
Status of Nvidia DGX H100 systems
I want to thank everyone who has given me feedback regarding the new DGX systems. A GPU issue came up which required that I open a ticket with Nvidia to troubleshoot. The nodes were taken offline, then reinstalled and reconfigured. The GPU issue has now been addressed, so PyTorch and other CUDA applications should work. Also, the automount problem appears to be resolved, so users can access directories in /nfs/hpc/*. I anticipate that the HPC portal/vncserver will be available for these hosts sometime next week, so stay tuned. However, ssh access is still not available. The dgxh partition is back online and the testing period has resumed while other issues continue to be worked out. Feel free to give them another shot, and please continue to report any further issues to me.
October 24, 2023
New Nvidia DGX H100 systems are online!
Many of you may have received the announcement last week that three new Nvidia DGX H100 systems have been added to the COE HPC cluster. Each DGX H100 system has 112 CPUs, 8 Nvidia H100 GPUs, and 2TB RAM! These systems are now available in a new partition called “dgxh”. This partition is in a testing period for at least this week to resolve existing issues and smooth out any other kinks that crop up. During this testing period, the current resource limits are 2 GPUs and 32 CPUs per user, and the time limit is 24 hours. Be advised that the DGX H100 systems are running RHEL9-based Linux, which is different from the RHEL7-based systems currently used by the rest of the cluster. Also, these systems are not yet available through the HPC portal or through ssh. Give them a try and let me know of any issues you encounter.
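For example, an interactive request within the current testing limits might look like the following (the exact options will depend on your workflow):
srun -p dgxh --gres=gpu:2 -c 32 --time=24:00:00 --pty bash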
DGX partition change
We currently have separate “dgx” and “dgx2” partitions for our DGX-2 systems, depending on how many GPUs are needed. For various reasons that have come up over time, it is no longer advantageous to have separate partitions for these systems, so they will again be merged into a single “dgx2” partition. The “dgx” partition will be phased out or possibly re-purposed, so please use “dgx2” instead. The resource limits will remain at 4 GPUs and 32 CPUs on the dgx2 partition for now, but may increase again in the future.
Submit-a offline this week
Just a reminder that Submit-a will be offline until next Monday the 30th for maintenance. Until then, please use submit-b or submit-c.
October 9, 2023
HPC training reminders
Attention new cluster users!
I am offering my “Intro to HPC” workshops every Wednesday at 3pm and Thursday at 4pm from October 11 through November 9. This workshop is designed to help new users become acquainted with and start using the cluster. To book a workshop session, click on the “Intro to HPC training session” link on the site below:
https://it.engineering.oregonstate.edu/hpc
AI and ML series of trainings from MarkIII starting next week!
For anyone who is interested, Mark III is offering a free five-part series of AI and ML trainings, starting next Tuesday, October 17 and running through November 14:
Tuesday, October 17 at 11am: Intro to AI and Machine Learning: The basics, tutorial and lab
Tuesday, October 24 at 11am: Intro to Deep Learning: An introduction to Neural Networks
Tuesday, October 31 at 11am: Intro to Datasets
Tuesday, November 7 at 11am: Intro to Computer Vision and Image Analytics
Tuesday, November 14 at 11am: Getting Started with Containers and AI
If you are interested in more details and want to register for any of the MarkIII trainings listed above, please check out the “OSU AI Series” on the right-hand side of the HPC web site above. Also note the HPC resources from Nvidia displayed there as well.
Updates on cluster updates
The updates of the cluster compute and GPU nodes should be completed by the end of the day on Tuesday, October 10. The submit nodes will be taken offline and updated on the following dates:
Submit-c: Monday, October 16 at 9am
Submit-b: Monday, October 23 at 9am
Submit-a: Monday, October 30 at 9am
Please plan accordingly. Srun jobs still running on these hosts when they are scheduled to be updated will be terminated. Jobs submitted via sbatch or the HPC portal will not be affected. If you have any questions or concerns, let me know.
October 4, 2023
OS and GPU driver updates
The cluster is currently running at reduced capacity while OS updates are being rolled out. In addition, some users have reported that they cannot run their GPU codes on some GPU nodes, which appears to be due to the older drivers on these nodes. New GPU drivers are available, so the GPU nodes are scheduled to be offline next week in a staggered fashion (some Monday the 9th and some Tuesday the 10th) so that the new drivers can be installed. If you need GPU resources, please schedule your jobs so that they can complete before these offline periods; otherwise they will remain pending with the message “Required Node not available” or “ReqNodeNotAvail”. If that happens, you may be able to get in by reducing your time requirement using the “--time” or “-t” option in srun or sbatch.
OS upgrade
Many of you are aware that many COE linux servers have been or are being upgraded to Enterprise Linux 9 (EL9). This will also happen on the HPC cluster over the course of this Fall quarter. More details to come.
Intro to HPC workshop and other HPC training resources
I am offering my “Intro to HPC” workshops every Wednesday at 3pm and Thursday at 4pm from October 11 through November 9. This workshop is designed to help new users become acquainted with and start using the cluster. To book a workshop session, click on the “Intro to HPC training session” link on the site below:
https://it.engineering.oregonstate.edu/hpc
In addition to the workshop, other HPC resources and training offerings from Nvidia and Mark III are displayed on the right hand side of the HPC web site. Mark III is offering a series of AI and ML trainings every Tuesday starting October 17 through November 14. I encourage you to check them out and sign up for them if you are interested.
July 17, 2023
Due to cooling issues, the cluster will run at reduced capacity during the weekends whenever there is a heat advisory.
May 18, 2023
The cluster will undergo its regularly scheduled maintenance starting Monday, June 19th. The following maintenance activities will be performed:
Operating system updates
Slurm upgrade and configuration changes
Nvidia driver updates
Infiniband driver updates
BIOS and firmware updates as needed
Miscellaneous hardware maintenance as needed
The entire cluster will be offline starting Monday the 19th at 1pm, and will remain offline until approximately Wednesday the 21st at 3pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
March 6, 2023
The cluster will undergo its regularly scheduled maintenance during Spring Break, starting Monday the 27th. The following maintenance activities will be performed:
Operating system updates
Slurm update and configuration changes
Nvidia driver updates
Infiniband driver updates
BIOS and firmware updates as needed
Miscellaneous hardware maintenance as needed
The entire cluster will be offline starting Monday the 27th at 1pm, and will remain offline until approximately Thursday the 30th at 3pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
November 10, 2022
Cooling Failure
Early this morning there was a cooling failure in the KEC datacenter which allowed temperatures to climb to unsafe levels, resulting in the automatic shutdown of all DGX-2 nodes and thus the termination of all jobs running on these nodes. Cooling has been restored and temperatures have returned to safe levels, and the DGX-2 nodes are back online.
October 14, 2022
Intro to HPC Workshop
I will be holding an HPC workshop over Zoom next week, covering the basics of using the CoE HPC cluster, at the following dates and times:
Wednesday, October 19 @3pm
Thursday, October 20 @4pm
If you are interested in attending this workshop, please let me know which session works best for you.
New STAR-CCM+ app
STAR-CCM+ has been added to the list of interactive apps on the HPC portal. If anyone has any problems using it, let me know. You can check out the HPC portal here:
https://ondemand.hpc.engr.oregonstate.edu
Updated Jupyter Server app
Previously, the Jupyter Server app allowed you to activate your own Python virtual environment or conda installation, but it could not activate an environment created using conda. The Jupyter Server app on the HPC portal has now been improved to allow the use of your own conda environment. If anybody has any problems using this app, let me know.
New Software installs
Here is a list of recent software installs:
Python 3.10
GCC 12.2
Mathematica 13.1
Matlab 2022b with Parallel Server. If anybody is interested in using Matlab Parallel Server, please contact me.
The software listed above can be accessed through the modules system; type “module avail” to see what is available.
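For example (the exact module names and versions may differ from what “module avail” reports on your node):
module avail
module load gcc/12.2
module load python/3.10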
September 1, 2022
Fall maintenance week:
The HPC Cluster will undergo its regularly scheduled quarterly maintenance September 12-16. The following maintenance activities will be performed:
Operating system updates
BIOS and firmware updates as needed
Slurm scheduler upgrade
Nvidia driver and CUDA updates
Miscellaneous hardware maintenance as needed
The entire cluster will be offline starting Tuesday the 13th at 8am, and will remain offline until approximately Wednesday the 14th at 3pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g. to change to 2 days do:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
August 8, 2022
Datacenter cooling outage
Most of the HPC cluster underwent an emergency shutdown last night due to a cooling failure in the KEC datacenter as temperatures had reached critical levels. Unfortunately many jobs or interactive sessions were terminated as a result of the emergency shutdown. Most of the cooling has been restored, and HPC resources are slowly being brought back online to a level that can be accommodated by the available cooling.
New HPC portal apps
New interactive apps have been added to the HPC portal:
Matlab
Mathematica
R Studio
Ansys Workbench (for approved Ansys users only)
If you use any of these applications, check them out and let me know if you have any trouble using them. You can check out the HPC portal here:
https://ondemand.hpc.engr.oregonstate.edu
DGX queue change reminder
This is a reminder that the DGX partitions have been redefined as follows:
If you need 4 GPUs or fewer, please use the “dgx” partition.
If you need 4 GPUs or more, please use the “dgx2” partition.
If your jobs submitted to the dgx/dgx2 partitions are pending with “QOSMinGRES” or “QOSMaxGRES”, or if your jobs are rejected for those reasons, that means you need to change the partition as noted above.
Note that these partitions can no longer be used as part of a list of partitions. Your default partition can be used to access these partitions.
August 1, 2022:
DGX partition changes
Two changes are being introduced to the DGX systems this month to improve job scheduling and flexibility. First, the “dgx” and “dgx2” partitions are being redefined. Starting tomorrow morning, the “dgx” partition will be used for smaller GPU workloads, i.e. 4 GPUs or fewer and 12 CPUs or fewer, whereas the “dgx2” partition will be for larger GPU workloads, i.e. 4 GPUs or more and 12 CPUs or more. For those of you who normally request fewer than 4 GPUs at a time under the “dgx2” partition, please change to the “dgx” partition. As some of you are aware, the larger the resource request, the longer the wait, and this change was recommended by the vendor as a way to improve the scheduling of larger workloads on the DGX systems.
New DGX limits
The second change involves the DGX resource limits. The GPU and CPU limits for the DGX systems sometimes change based on overall load and resource availability. Lately we have settled on a limit of 8 GPUs and 32 CPUs in use at a time per user. This limit will temporarily be lifted and replaced with new limits based on cumulative GPU and CPU running times. This means that if GPUs are available, then more GPUs than the normal limit can be used by each user at a time, though for a shorter period (e.g. 16 GPUs for one day). This is intended to improve job flexibility while also maximizing use of resources, allowing users to run more jobs at a time or to run larger single calculations or experiments.
The new limits will be activated starting this week, and may require a lot of adjustment at first to optimize the load on the dgx partitions. These limits will be posted on the HPC status page once activated. If you have any questions about, or experience issues due to the new limits, let me know.
Jupyter Notebook app
The Jupyter Notebook app on the HPC Portal is being replaced by the new Jupyter Server app, and will no longer appear in the list of interactive apps.
July 12, 2022:
HPC Portal maintenance
The HPC Portal will be offline next Monday morning, the 18th, for maintenance. During this time, the Web and Portal packages will be updated and expiring SSL certificates will be replaced. Jobs running through the portal at the time should not be interrupted, but they will not be accessible until the portal is back online. The portal is expected to be back online by noon on the 18th.
New Jupyter Server app
An improved Jupyter Server app has been added to the list of Interactive Apps on the HPC Portal. This app gives you the option to use Jupyter Lab, and allows you to specify your own Python or Conda environment to run Jupyter, if desired. This app is in a beta testing phase and will eventually replace the existing Jupyter Notebook app.
Preempt Partition
The “preempt” partition has completed a lengthy testing period and is being used with increasing frequency, so it bears mentioning. The preempt partition is available to all users, but it is not the same as the share partition. It is a low-priority partition that can give you access to unused resources, but that access can be cancelled, or “preempted”, by a higher-priority request or job. The preempt partition can be useful for jobs that are restartable or checkpointed, or for short jobs or requests, i.e. where there is low risk or mitigated loss from interruption. It is not recommended to use this partition for long jobs that are not checkpointed.
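As a rough sketch (restart_job.sh is a placeholder for a script that can resume from its own checkpoints, and automatic requeueing depends on how preemption is configured on the cluster), a preemptable submission might look like:
sbatch --partition=preempt --requeue --time=12:00:00 restart_job.sh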
New Software installs
Here is a list of recent software installs:
GCC 12.1, 11.3, 9.5
Mathematica 13.0
Matlab 2021b-r3 with Parallel Server. If anybody is interested in using Matlab Parallel Server, please contact me.
Anaconda 2022.05, which will become the new default version
Cuda 11.7 has been installed, and Nvidia drivers have recently been upgraded on most GPU nodes to support the latest Cuda version. Drivers for Cuda 11.5+ have not yet been released for the DGX systems.
The software listed above can be accessed through the modules system; type “module avail” to see what is available.
Fall maintenance:
The fall maintenance is tentatively scheduled for September 12-16, 2022.
HPC Portal:
The Open OnDemand HPC Portal is online. You may try it out here: https://ondemand.hpc.engr.oregonstate.edu
New commands/scripts for Slurm.
At the request of our users, some additional, potentially useful commands have been added to the Slurm module, e.g.:
“nodestat partition” will show the status and resources used on each node on that partition, e.g., how many CPUs, GPUs and RAM are on each node, and how much are currently in use.
“showjob jobid” will provide information on a currently running or pending job. This can be used to obtain the estimated start time for a pending job, if available.
“tracejob -S YYYY-MM-DD” will provide information on jobs started since the date YYYY-MM-DD.
“sql” gives an alternate, longer job listing format than the default “squeue” command.
“squ” lists only jobs owned by the current user, using the default “squeue” format.
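For example (the partition name and job ID below are only placeholders):
nodestat share
showjob 123456
tracejob -S 2022-07-01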
To receive cluster announcements and news via email, subscribe to the cluster email list.