The next maintenance period is scheduled for March 21-25. The following maintenance activities will be performed:
- OS updates
- BIOS and firmware updates as needed
- Slurm configuration changes
- Power redistribution for various compute nodes
- Miscellaneous hardware maintenance as needed
The entire cluster will be offline starting Tuesday the 22nd at 8am, and will remain offline until approximately Wednesday the 23rd at 3pm. Jobs scheduled to run into that offline period will remain pending with the message “ReqNodeNotAvail, Reserved for maintenance”. If you wish for your Slurm job to start and finish before the maintenance period begins, you will need to adjust your time limit accordingly, e.g., to change the limit to 2 days:
scontrol update job {jobid} TimeLimit=2-00:00:00
Alternatively, you can cancel your pending job, adjust your walltime (using the --time option) and resubmit.
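The two approaches above can be sketched as follows. The job ID and script name are hypothetical placeholders; substitute your own values, and run this on a login node where the Slurm commands are available (the snippet guards for that so it is safe to paste elsewhere):

```shell
JOBID=123456              # hypothetical job ID; find yours with squeue
NEW_LIMIT=2-00:00:00      # Slurm duration format: days-hours:minutes:seconds

# Option 1: shrink the time limit of the pending job in place
if command -v scontrol >/dev/null 2>&1; then
    scontrol update job "$JOBID" TimeLimit="$NEW_LIMIT"
else
    echo "scontrol not found; run this on the cluster login node"
fi

# Option 2: cancel, then resubmit with a shorter walltime
# scancel "$JOBID"
# sbatch --time=2-00:00:00 my_job_script.sh   # hypothetical script name
```

Note that a job's time limit can only be reduced by the owner; increasing it typically requires administrator privileges.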
HPC Portal:
The Open OnDemand HPC Portal is online and available for use.
New Slurm commands/scripts:
At the request of our users, some additional, potentially useful commands have been added to the Slurm module, e.g.:
- “nodestat {partition}” will show the status and resources of each node in that partition, e.g., how many CPUs, GPUs and how much RAM are on each node, and how much of each is currently in use.
- “showjob {jobid}” will provide information on a currently running or pending job. This can be used to obtain the estimated start time of a pending job, if available.
- “tracejob -S YYYY-MM-DD” will provide information on jobs started since the date YYYY-MM-DD.
- “sql” gives an alternate, longer job listing format than the default “squeue” command.
- “squ” lists only jobs owned by the current user, using the default “squeue” format.
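For reference, stock Slurm commands provide similar information to the wrapper scripts above. This is a sketch under the assumption that the wrappers are site-local conveniences; the job ID is a hypothetical placeholder, and the snippet is guarded so it is safe to run where Slurm is not installed:

```shell
if command -v squeue >/dev/null 2>&1; then
    squeue --start -j 123456   # estimated start time of a pending job (hypothetical job ID)
    squeue -l                  # long listing format, similar in spirit to "sql"
    squeue -u "$USER"          # only your own jobs, similar to "squ"
    sacct -S 2024-03-01        # accounting info for jobs started since a date, like "tracejob -S"
    sinfo -N -l                # per-node state and resources, similar to "nodestat"
else
    echo "Slurm commands not available on this machine"
fi
```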
To receive cluster announcements and news via email, subscribe to the cluster email list.