Slurm Workload Manager is a batch scheduling software used for requesting resources and running jobs on the CoE HPC cluster. Once your HPC account has been enabled, you will be assigned to a Slurm account corresponding to your department or class. Assignment to a research group account is required for accessing private resources, and assignment to a class account is required for accessing resources reserved for class use.
The Slurm software module should be loaded automatically when you log into a submit node, which you can confirm by typing:
module list
If Slurm is not listed, you may need to log in to TEACH and click "Reset Unix Config Files". In the short term, you can load it manually by typing:
module load slurm
Once the Slurm module is loaded, you can begin using Slurm to request resources and submit jobs. See the sections below for examples using Slurm.
If you want to request resources for an interactive session or single command, you can use the Slurm command "srun":
srun {srun options} {command or shell}
See examples of srun below:
To start an interactive (pseudo-terminal bash shell) session on the cluster:
srun --pty bash
The tcsh shell, or other shells, can be used instead of bash with srun (and sbatch) if preferred.
To request an interactive tcsh session from the default (share) queue for more than the default 12 hours, e.g. 3 days:
srun --time=3-00:00:00 --pty tcsh
To request an interactive bash session with 64GB RAM from the share queue:
srun -p share --mem=64G --pty bash
To request an interactive bash session to run an application using 12 OpenMP threads from the share queue:
srun -p share -c 12 --pty bash
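Inside that session, a common approach (not specific to this cluster) is to set the OpenMP thread count to match the allocation using the SLURM_CPUS_PER_TASK environment variable, which Slurm sets when -c is given:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK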
To request an interactive bash session with a GPU from the dgx queue using the mime account:
srun -A mime -p dgx --gres=gpu:1 --pty bash
To request an interactive session with a GPU from either the dgx2, gpu, or share queues using the eecs account:
srun -A eecs -p dgx2,gpu,share --gres=gpu:1 --pty bash
To request an interactive session with two GPUs on a specific node in the dgx2 queue using the eecs account:
srun -A eecs -p dgx2 --nodelist=dgx2-1 --gres=gpu:2 --pty bash
To request an interactive tcsh session with X11 (e.g. for running a GUI) from the class queue using the cs475-575 account:
srun -A cs475-575 -p class --x11 --pty tcsh
To request an interactive session to run an MPI application with 8 processes over an arbitrary number of nodes from the share queue:
srun -p share -n 8 --pty bash
To request an interactive session to run an MPI application on 2 nodes with infiniband and 8 processes per infiniband node from either the cbee or share queues using the cbee account:
srun -A cbee -p cbee,share -N 2 --ntasks-per-node=8 --constraint=ib --pty bash
To request an interactive session on a compute node that supports the "avx2" instruction set, from the preempt queue:
srun -p preempt --constraint=avx2 --pty bash
To request an interactive session with one GPU on a compute node with an RTX6000 GPU, from the preempt queue:
srun -p preempt --gres=gpu:1 --constraint=rtx6000 --pty bash
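srun can also run a single command directly, without starting an interactive shell. For example, to print the hostname of the allocated node in the share queue, something like this works:
srun -p share hostname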
For more information or options on srun, check out the manual page by typing "man srun".
Submitting batch jobs is useful when you need to run several jobs or long-running jobs. Batch jobs are not interrupted by lost network or VPN connections. The Slurm command "sbatch" is used to submit batch jobs.
To submit a batch job to the cluster:
sbatch myBatchFile.sh
where myBatchFile.sh is a script containing the commands you wish to run as well as a list of SBATCH directives specifying the resources or parameters that you need for your job. See the sample myBatchFile.sh batch script below.
#!/bin/bash
#SBATCH -J helloWorld # name of job
#SBATCH -A mySponsoredAccount # name of my sponsored account, e.g. class or research group, NOT ONID!
#SBATCH -p share # name of partition or queue
#SBATCH -o helloWorld.out # name of output file for this job
#SBATCH -e helloWorld.err # name of error file for this job
# load any software environment module required for app (e.g. matlab, gcc, cuda)
module load software/version
# run my job (e.g. matlab, python)
mySoftwareExecutable
There are additional sample batch scripts located in /apps/samples, so feel free to copy them to your directory and modify them according to your needs.
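For example, assuming you want to browse the samples and copy one into your home directory (here {sampleScript} is a placeholder for the actual script name), you could type something like:
ls /apps/samples
cp /apps/samples/{sampleScript} ~/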
Here are some other useful SBATCH directives that can be included in a batch script to specify additional resources or different parameters from the defaults:
#SBATCH --time=2-12:30:00 # time limit on job: 2 days, 12 hours, 30 minutes (default 12 hours)
#SBATCH -N 2 # number of nodes (default 1)
#SBATCH --nodelist=node1,node2... # required node, or comma-separated list of required nodes
#SBATCH -n 3 # number of MPI tasks (default 1)
#SBATCH --ntasks-per-node=3 # number of MPI tasks per node (default 1), use for multiple nodes (-N>1)
#SBATCH -c 4 # number of cores/threads per task (default 1)
#SBATCH --gres=gpu:1 # number of GPUs to request (default 0)
#SBATCH --mem=10G # request 10 gigabytes memory (per node, default depends on node)
#SBATCH --constraint=ib # request node with infiniband
#SBATCH --constraint=avx512 # request node with AVX512 instruction set
#SBATCH --constraint=a40 # request node with A40 GPU
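As an illustration, here is a sketch of a batch script that combines several of the directives above for a hypothetical MPI application running 8 tasks on each of 2 nodes; the module name (openmpi) and executable (myMPIExecutable) are placeholders, so substitute the ones your application actually requires:
#!/bin/bash
#SBATCH -J mpiExample # name of job
#SBATCH -A mySponsoredAccount # sponsored account (class or research group)
#SBATCH -p share # name of partition or queue
#SBATCH --time=1-00:00:00 # time limit: 1 day
#SBATCH -N 2 # number of nodes
#SBATCH --ntasks-per-node=8 # MPI tasks per node
#SBATCH --mem=10G # memory per node
#SBATCH -o mpiExample.out # name of output file for this job
#SBATCH -e mpiExample.err # name of error file for this job
# load an MPI module (placeholder name; use the module your application needs)
module load openmpi
# launch the MPI application (placeholder executable); srun can also be used as the launcher
mpirun ./myMPIExecutable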
For more information or options on sbatch, check out the manual page by typing "man sbatch".
Here are the currently available node features that can be requested using the Slurm "--constraint" option:
#SBATCH --constraint=ib # request node with infiniband
#SBATCH --constraint=eth # request node with ethernet
#SBATCH --constraint=avx # request node with AVX instruction set
#SBATCH --constraint=avx2 # request node with AVX2 instruction set
#SBATCH --constraint=avx512 # request node with AVX512 instruction set
#SBATCH --constraint=haswell # request node with Intel Haswell CPU
#SBATCH --constraint=broadwell # request node with Intel Broadwell CPU
#SBATCH --constraint=skylake # request node with Intel Skylake CPU
#SBATCH --constraint=t4 # request node with T4 GPU
#SBATCH --constraint=k40m # request node with K40m GPU
#SBATCH --constraint=m60 # request node with M60 GPU
#SBATCH --constraint=rtx6000 # request node with RTX6000 GPU
#SBATCH --constraint=rtx8000 # request node with RTX8000 GPU
#SBATCH --constraint=v100 # request node with V100 GPU
#SBATCH --constraint=a40 # request node with A40 GPU
Note that when requesting a node with a specific GPU using the constraint option, you must still use the "--gres" option to request the GPU itself.
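For example, to request a single A40 GPU in a batch script, you would include both of these directives:
#SBATCH --gres=gpu:1 # request one GPU
#SBATCH --constraint=a40 # require a node with an A40 GPU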
To cancel a job on the cluster:
scancel {jobid}
To cancel all of your jobs on the cluster:
scancel -u {ONID}
For more information or options on scancel, type "man scancel".
To view a list of queues and their status:
sinfo
For more information or options on sinfo, type "man sinfo".
To view the status and available resources (CPU, GPU, RAM) of all nodes in a partition:
nodestat {partition}
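For example, to view the nodes in the share partition:
nodestat share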
To view the list and status of all jobs in all queues:
squeue
-or-
sq
To view the list and status of all jobs in one partition or queue:
squeue -p {partition}
To view the list and status of your jobs:
squeue -u {ONID}
-or-
squ
For more information or options on squeue, type "man squeue".
To view the status of jobs in the queue with an alternate, longer listing format:
sql
To see more detailed information on a specific pending or running job:
showjob {jobid}
Here is a list of partitions, their descriptions, and the accounts that have access to them:
Partition | Description/Owner | Accounts
share | Shared resources, open to all users | ALL |
class | Resources reserved for class use | class accounts |
dgxh | DGX-H systems | ALL |
dgxh-ceoas | DGX-H systems for CEOAS | ceoas |
dgx2 | DGX-2 systems | ALL |
dgxs | DGX workstations for short-term use | n/a |
gpu | Shared GPU servers for EECS | eecs |
gpu-dmv | Resources for research group | dmv |
ampere | Resources for research group | ALL |
athena | Resources for research group | hlab |
bee | Shared resources for BEE | fc-lab, sg-ecohydro |
bee1 | Resources for research group | fc-lab |
cbee | Resources for research group | la-grp |
cp-grp | Resources for research group | cp-grp |
ecohydro | Resources for research group | sg-ecohydro |
eecs | Shared resources for EECS | eecs |
eecs2 | Resources for research group | virl-grp |
eecs3 | Resources for research group | rah-grp |
mime1 | Resources for research group | kt-lab |
mime2 | Resources for research group | jt-grp |
mime3 | Resources for research group | mime3_grp |
mime4 | Resources for research group | nrg |
mime5 | Resources for research group | simon-grp |
mime7 | Resources for research group | ba-grp |
nacse | Resources for NACSE | nacse |
nerhp | Shared resources for NSE | nse |
nerhp2 | Resources for research group | ig-lab |
nse3 | Resources for research group | cp-grp,tp-grp |
sail | Resources for research group | sail |
soundbendor | Shared resources for research groups | soundbendor |
tp-grp | Resources for research group | tp-grp |
preempt | Low priority queue with access to all resources, but subject to preemption by higher priority jobs | ALL |
Here is a list of partitions and their time and resource limits:
Partition | Time Limit | CPU cores | GPUs | RAM
share | 7 days | 500 | none | none |
class | varies | varies | varies | varies |
dgxh | 2 days | 64 | 2 | 500g |
dgx2 | 7 days | 32 | 8 | 500g |
dgxs | 24 hours | 20 | 4 | n/a |
gpu | 7 days | 48 | 8 | 750g |
ampere | 7 days | 64 | 2 | 250g |
eecs | 7 days | 360 | n/a | none |
preempt | 7 days | none | none | none |
The maximum number of jobs one user can have submitted (queued plus running) across all partitions at once is 1000.
The maximum number of jobs one user can have running across all partitions at once is 400.
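If you want to check how many jobs you currently have counting toward these limits, one option is to count your squeue entries while suppressing the header line:
squeue -u {ONID} -h | wc -l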
For more information on using Slurm, visit the website below:
https://slurm.schedmd.com/