A. General FAQs
Easy! Make sure you have an Engineering account; you can create one here by clicking "Create a new account (Enable your Engineering resources)". Then activate your HPC account by clicking the "High Performance Computing" button on the right-hand side under "Account Tools". Your account should be ready shortly. If not, or if you have trouble with your account, try opening a ticket through the COEIT support portal. Please provide your ONID, department, and advisor, and you will be assigned to the appropriate Slurm account.
You can request assistance on HPC related issues by opening a ticket through the COEIT support portal.
Please put "HPC" as part of the subject line to alert COEIT that this issue involves the HPC cluster, this will facilitate that the ticket is properly routed and will improve response time.
In order to access hosts on the CoE HPC cluster, you must ssh to one of the submit nodes (submit-a, b, or c) and then use Slurm to request resources from those nodes. Once you have reserved resources in Slurm, you can ssh to the node you were allocated. For more information on using Slurm, check out this link.
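For example, a minimal interactive workflow might look like this (the partition and time are just an illustration; see the Slurm FAQs below for details):
ssh {onid}@submit-a.hpc.engr.oregonstate.edu
srun -p share --time=2:00:00 --pty bash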
Names starting with dgx* (e.g. dgx2-2) and cn* (e.g. cn-gpu5) are shortened internal cluster names that can be used within the cluster (e.g. from the submit nodes) but are currently not resolvable outside the cluster. The corresponding external names begin with "compute-" for historical reasons. For internal names starting with "dgx", just prepend "compute-", e.g. for dgx2-2 use "compute-dgx2-2". For names beginning with "cn", replace "cn" with "compute", e.g. for cn-gpu5 use "compute-gpu5". Be advised that ssh access is not allowed until resources have been reserved using Slurm.
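For example, once you have reserved resources on dgx2-2 or cn-gpu5 through Slurm, ssh connections from outside the cluster would use the external names:
ssh {onid}@compute-dgx2-2.hpc.engr.oregonstate.edu
ssh {onid}@compute-gpu5.hpc.engr.oregonstate.edu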
The most likely cause is that your tmux sessions are exceeding memory or CPU utilization limits on the submit nodes. Monitor your tmux sessions closely using the "ps" command. If you have a lot of srun commands running through a single tmux session, try spreading the load out among the three submit nodes.
If your tmux session is terminated, then any srun sessions running inside it will also be terminated. If you have a well-established workflow, you might try putting your commands into a script and using sbatch instead of srun. Jobs submitted in batch mode will not be terminated even if your network or VPN connection fails or if your tmux session is terminated. If your workflow is not suitable for batch mode, you might try the HPC portal instead (VPN or campus network connection required).
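As a rough sketch, a workflow that currently runs interactively through srun could be wrapped in a batch script along these lines (the resource requests, script contents, and file names are placeholders):
#!/bin/bash
#SBATCH -p share
#SBATCH --time=1-00:00:00
#SBATCH -c 4
#SBATCH --mem=8G
# the commands you would normally type in your srun/tmux session go here
./my_workflow.sh
then submit it from a submit node with:
sbatch my_job.sh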
B. Slurm FAQs
The HPC cluster is a heavily shared compute resource, and Slurm is a cluster resource and workload manager. When you request an allocation in Slurm (or any cluster manager), Slurm searches all compute resources (partitions) available to you so you do not have to find them yourself, allocates the resources you requested (e.g. CPU, GPU, RAM) if they are immediately available (or schedules them if not), and gives you dedicated access to those resources. This protects your calculations or jobs from memory contention as well as CPU/GPU sharing with other users, which leads to much greater job stability and performance. This resource protection would not be possible if ssh access were granted to anybody at any time. Without a cluster manager, the best case on a busy cluster is that users' jobs would slow to a crawl due to CPU and GPU load sharing; worse, jobs could crash or cause other jobs to crash due to memory contention.
There is likely a problem with your shell environment. There are a couple of potential solutions:
1) Log into the TEACH web site, then click on the link titled "Reset Unix Config Files", then open another login session to a submit node. This should correct the problem.
2) Alternatively, if you have invested a lot of time and effort into your shell environment and do not want to reset it, edit your shell configuration file (~/.bashrc if you use bash, or ~/.cshrc if you use csh or tcsh) using your favorite Linux editor (e.g. nano, vim, emacs). When editing these files, look for lines that set your executable path, then insert the existing path variable ($PATH for bash, $path for csh/tcsh) into that line, e.g.:
if you use bash:
export PATH=$PATH:/usr/local/bin:/usr/local/apps/bin
if you use csh/tcsh:
set path=( $path /usr/local/bin /usr/local/apps/bin)
After adding $PATH or $path to your shell configuration file, open a new session to a submit node and try running "srun" or "sinfo" again.
If neither #1 nor #2 corrects the issue, let us know by opening a ticket through the COEIT support portal.
There are two possibilities:
1) You have not yet been added to a Slurm account. Please open a ticket through the COEIT support portal with this error message, and provide your advisor's name and your department, and we'll add you to the appropriate account.
2) You are trying to access a restricted partition using the wrong Slurm account. If you have access to that partition, try providing the correct account required for that partition using the "-A" option in srun or sbatch.
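For example, to request an interactive session on a restricted partition with the required account (the account and partition names are placeholders):
srun -A {account} -p {partition} --pty bash
or add this directive in your sbatch script:
#SBATCH -A {account}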
Everybody has access to the "share", "dgx*" and "preempt" partitions.
The share partition can be used for CPU, GPU, and large-memory jobs, for up to 7 days.
The dgxs partition is mostly for short GPU jobs (<24 hrs), or troubleshooting GPU jobs.
The preempt partition is a low-priority queue that usually consists of all HPC resources, and is a way to take advantage of unused resources outside the share partition. However, jobs submitted to the preempt partition may be cancelled or "preempted" by higher-priority jobs, so use this queue at your own risk. The preempt partition may be useful for short jobs (e.g. a few hours), or jobs that are checkpointed or restartable. Long jobs (>24 hrs) that are not restartable or checkpointed should not use the preempt partition.
Access to other partitions requires being added to a research group, department, or class account.
The limits for each partition may vary depending on the demand for that partition.
For more information on partition access and current limits, read the section "Summary of accounts, partitions and limits" located in the Slurm howto.
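You can also get a quick view of partitions and their current time limits directly from Slurm, e.g. (sinfo's "%P %l" output format shows the partition name and its time limit):
sinfo -o "%P %l"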
There may be several reasons for your job failing or being killed. Please view the error and output files from your batch script or your application for clues to why it failed. Some common reasons are listed below:
1) Out of time. You did not request enough time for your job. For example, you may get a message like this from your srun command or sbatch output:
"slurmstepd: error: *** Step=123456 ON hostname CANCELLED AT ... DUE TO TIME LIMIT ***"
It means that your default or requested time was not enough for your job or application to complete. To resolve, simply request additional time using the "--time" option in Slurm, e.g. to request 3.5 days using srun:
srun --time=3-12:00:00 --pty bash
or add this directive in your sbatch script:
#SBATCH --time=3-12:00:00
Note that the maximum time limit on most partitions is 7 days (see output of "sinfo" for timelimit on each partition).
2) Out-of-memory (OOM). Your job did not have enough memory allocated. For example, you may get a message like this from your srun command or sbatch output:
"slurmstepd: error: Detected 1 oom-kill event(s) in StepID=123456.batch. Some of your processes may have been killed by the cgroup out-of-memory handler"
It means that your default (or requested) allocated memory was not enough for your job or application. To resolve, simply request additional memory using the "--mem" option in Slurm, e.g. to request 10GB of memory using srun:
srun --mem=10g --pty bash
or add this directive in your sbatch script:
#SBATCH --mem=10G
If you are not sure how much to request, try the "tracejob" command to view a record of your job, which will show the amount of memory requested (e.g. mem=1700M) and the job State (e.g. OUT_OF_MEMORY); an sacct alternative is also sketched after this list:
tracejob -j {jobid}
3) Cancelled due to preemption. There is nothing wrong with your job; it was running on the low-priority preempt queue and was preempted by a higher-priority job. For example, you may get a message like this from your srun command:
"srun: Force terminated job 123456. srun: Job step aborted:... slurmstepd: error: Step 123456 ... CANCELLED AT YYYY-MM-DDTHH:mm:ss DUE TO PREEMPTION"
To avoid this result, do not use the preempt partition.
4) Unknown. If it is not clear why your job failed, please submit a ticket to the COEIT support portal and provide any relevant output from your batch job or srun session.
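As an alternative to the tracejob command mentioned under reason #2, Slurm's accounting records can usually also be queried with sacct (assuming job accounting is enabled), e.g. to view the requested memory, peak memory use, and final state of a job:
sacct -j {jobid} -o JobID,ReqMem,MaxRSS,State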
To reserve a specific host in Slurm, use the "--nodelist" option in srun or sbatch, e.g. for dgx2-X:
srun -A {account} -p dgx2 --nodelist=dgx2-X --pty bash
or add this directive in your sbatch script:
#SBATCH --nodelist=dgx2-X
a) 'ReqNodeNotAvail, Reserved for maintenance':
The reason is that a maintenance reservation window has been scheduled in Slurm, and your job would run into this window. It will remain pending until the maintenance period is over. If your job can complete before the maintenance period begins, you can change the walltime of your pending job as follows:
scontrol update job {jobid} TimeLimit=D-HH:MM:SS
Note that you can only decrease your walltime; you cannot increase it.
Check this link for details on any scheduled maintenance.
b) 'Resources':
The job is waiting for resources to become available.
c) 'Priority':
The job is queued behind a higher priority job.
d) 'QOSMax*PerUser':
The maximum resource limit (CPU, GPU, or RAM) has been reached for the user.
e) 'QOSGrp*':
The maximum resource limit (CPU, GPU, or RAM) has been reached for the group account.
f) 'QOSGrp*RunMins':
The maximum active running limit of resources (CPU, GPU, or RAM) has been reached for the group account.
For reasons b through f, please be patient and your job will eventually start after other jobs complete.
Most likely this was the result of submitting a job which exceeds certain limits (e.g. GPU or CPU limits) of the partition you are submitting to.
Try this option in squeue:
squeue -j {jobid} --Format=starttime
or look for the StartTime field in the output of this command:
showjob {jobid}
If you feel your job is stuck in the queue, please leave your job in the queue, and open a ticket in the COEIT support portal and provide your job number. It is important to leave your job in the queue to facilitate troubleshooting.
First reserve a desired number of tasks (-n) or tasks per node (--ntasks-per-node) over a desired number of nodes (-N), e.g.:
srun -p share -N 2 --ntasks-per-node=4 --pty bash
Look for MPI modules:
module avail mpi
Load an MPI module, e.g. for OpenMPI:
module load openmpi/3.1
Compile the MPI code using the OpenMPI compiler wrapper:
mpicc simple.c
Run the OpenMPI executable using the number of tasks you requested from Slurm:
mpirun -mca btl self,tcp -np 8 ./a.out
To use MPICH instead of OpenMPI, load an MPICH module:
module load mpich/3.3
Compile the MPI code using the MPICH compiler wrapper:
mpicc simple.c
Determine which hosts were reserved in Slurm:
echo $SLURM_NODELIST
Run the MPICH executable using the number of tasks you requested from Slurm and the nodes assigned to this job, which are listed in $SLURM_NODELIST, e.g. if the value of $SLURM_NODELIST was "cn-7-[4,5]" you would put:
mpirun -hosts=cn-7-4,cn-7-5 -np 8 ./a.out
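Note that $SLURM_NODELIST may be in Slurm's compressed form (e.g. cn-7-[4,5]); one way to expand it into individual host names for the -hosts option is:
scontrol show hostnames "$SLURM_NODELIST"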
A sample script for batch submissions might look like this:
#!/bin/bash
#SBATCH -p share
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
# load an MPI module
module load openmpi/3.1
# compile MPI code
mpicc simple.c
# run MPI code
mpirun -mca btl self,tcp -np 8 ./a.out
OpenMPI is currently recommended for batch submissions requiring more than one node.
Instructions for using IntelMPI will be added soon.
C. Software FAQs
Much commonly used software is available through your default executable path. To confirm, try running the application. If you get "command not found", it might be available through the modules system.
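For example, to search the modules system for a package and then load it (the names here are placeholders):
module avail {software}
module load {software}/{version}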
If you are unable to find your software through the modules system, you can request the software by opening a ticket through the COEIT support portal.
Please put "HPC" as part of the subject line to alert COEIT that this software request is for the HPC cluster.
There are a number of python versions available via the modules system. Different python versions can be accessed by loading different python or conda modules. To see what is available, do:
module avail python conda
Due to the diversity of needs among our python users, it is very difficult to install specific packages in a way that meets everyone's needs, so we recommend that python users manage their own python environments. First, check one of our minimal conda modules, which come with numerous python packages pre-installed, and see if that has what you need:
module load conda/{version}
pip list
conda list
If the conda module does not contain the python package you need, try setting up your own python virtual environment.
You can use the conda module as the basis for your python virtual environment, just be sure to load the conda module before activating your environment.
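For example, in a new login or srun session the order would be (the conda version and environment name are placeholders):
module load conda/{version}
source myPythonEnv/bin/activate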
Alternatively, you can install a conda provider like Miniforge into your own directory and customize python that way. Python environments can take significant space, so we recommend you install them into your HPC share (/nfs/hpc/share/{username}).
First, see what python versions are available through the modules system:
module avail python
Next, load the version of python that you want to use for your virtual environment:
module load python/{version}
which python
python -V
Now create your python environment. We recommend that you install your python environment into your hpc-share directory, or /nfs/hpc/share/{username}. To create a python environment with the name "myPythonEnv", do:
cd ~/hpc-share
python -m venv myPythonEnv
Now activate your new python environment, e.g.:
source myPythonEnv/bin/activate # for bash users
- or -
source myPythonEnv/bin/activate.csh # for tcsh users
Now you can install and manage your own python packages within your python environment, e.g.:
which pip
pip install jupyter jupyterlab numpy scipy matplotlib
We recommend you download Miniforge from the conda-forge community into your hpc-share directory:
cd ~/hpc-share
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
Install it, e.g.:
sh ./Miniforge3-Linux-x86_64.sh -p /nfs/hpc/share/{username}/miniforge
Activate it, e.g.:
source miniforge/bin/activate # for bash users
- or -
source miniforge/bin/activate.csh # for tcsh users
Now you can install and manage your own python packages with conda, e.g.:
conda install numpy
- or create a conda environment first with the numpy package -
conda create -n myCondaEnv numpy
conda activate myCondaEnv
If you haven't already, create a link to your hpc share, then create a directory there to save your R packages:
ln -s /nfs/hpc/share/{username} ~/hpc-share
mkdir -pv ~/hpc-share/R/{R version, e.g. 4.1.3}
Copy this .Rprofile into your home directory:
cp -v /usr/local/apps/R/.Rprofile ~/
or copy and paste these contents into your .Rprofile:
# set libPaths to HPC share/R/{version number}
.libPaths(new='~/hpc-share/R/4.1.3/')
# set a CRAN mirror
local({r <- getOption("repos")
r["CRAN"] <- "https://ftp.osuosl.org/pub/cran/"
options(repos=r)})
The next time you load an R module (e.g. R/4.1.3) and run R, you should be able to save R libraries to your directory in ~/hpc-share/R/4.1.3.
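For example, a quick way to verify the setup (the package name is just an illustration):
module load R/4.1.3
R
then within R:
.libPaths()                  # should include ~/hpc-share/R/4.1.3
install.packages("ggplot2")  # installs into your hpc-share R library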
First, you might consider using Jupyter Notebook through the HPC portal over https. Just log in with your ONID credentials, and select "Jupyter Notebook" from the "Interactive Apps" menu. Then put in "conda" for a module to load (and also add "cuda" if you plan to reserve a GPU), then fill in the blanks with whatever time and resources you need. This will launch the Jupyter Notebook app in your browser with the resources you have reserved.
If, however, you still need to run Jupyter over ssh rather than https, then follow the steps below:
1) from a submit node, request resources (e.g. 4 cpu cores and gpu if needed) using srun, e.g.:
srun -c 4 --gres=gpu:1 --pty bash
2) from your active srun session, load a conda module which has jupyter-notebook installed:
module load conda
Alternatively, you may activate your own conda instance or python virtual environment containing jupyter-notebook.
3) from your active srun session, launch Jupyter notebook without browser but using an open port like 8080:
jupyter notebook --no-browser --port=8080
4) from your laptop or desktop, set up an ssh tunnel, e.g.:
ssh -N -L 8080:localhost:8080 {onid}@compute-{hostname}.hpc.engr.oregonstate.edu
This part can be tricky. The hostname given in the "srun" session is an internal cluster name, so if accessing from outside the cluster that name must start with "compute-" (see FAQ #A.3 for details). So if your srun session is on dgx2-1 then you would use "compute-dgx2-1", but if your session is on cn-gpu5 then you would use "compute-gpu5".
5) after providing the credentials for your ssh tunnel, open a browser on your laptop or desktop to:
http://localhost:8080 (if prompted, use the token printed by the jupyter notebook command in step 3)
You should now have access to your Jupyter Notebook on the HPC cluster.
Yes. Matlab can make use of all the cores that you request on a single node, but if you need multiple workers over multiple nodes, you can try Matlab Parallel Server using the following steps:
1) Launch Matlab 2021b on your Windows (or Mac or Linux) computer
2) Go to the "Add-Ons" icon, select "Get Add-Ons"
3) Within the Add-On Explorer, search for "Slurm". Select the "Parallel Computing Toolbox plugin for MATLAB Parallel Server with Slurm", and install that Add-On. You will be asked to authenticate to your Matlab account.
4) After installation has completed, you will be asked to configure this Add-On. Proceed with the following options:
Choose "non shared" profile.
ClusterHost = any HPC submit host, e.g. "submit-b.hpc.engr.oregonstate.edu" (other choices can be submit-a or submit-c)
RemoteJobStorageLocation = "/nfs/hpc/share/{onid}" or "/nfs/hpc/share/{onid}/matlab"
Username = {onid}
5) After that, you can go to "Parallel" and select "Create and Manage Clusters" to further edit your profile, e.g. you can set NumWorkers and NumThreads.
6) Click "Done" to save changes, then validate your new profile. The validation may fail on the last step due to a name resolution error, but the Matlab job should still run.
At present, running Matlab Parallel Server in this way has some limitations. For more options, it may be better to run Matlab Parallel Server via command line directly on the cluster.
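For single-node parallelism, a simpler command-line alternative is to run Matlab inside a regular Slurm batch job with a local parallel pool. A rough sketch (the module name and script name are assumptions; check "module avail matlab" for what is actually installed):
#!/bin/bash
#SBATCH -p share
#SBATCH -c 8
#SBATCH --time=4:00:00
# module name is an assumption; adjust to match "module avail matlab"
module load matlab
# -batch runs the statement non-interactively (R2019a or later);
# parpool('local', 8) starts 8 workers on the allocated cores
matlab -batch "parpool('local', 8); myParallelScript"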
We currently support Apptainer for running containers. If you haven't already, create a link to your hpc share, then create a directory there to save your containers:
ln -s /nfs/hpc/share/{username} ~/hpc-share
mkdir -pv ~/hpc-share/apptainer
cd ~/hpc-share/apptainer
Pull a container image using apptainer, e.g. lolcow image from Docker:
apptainer pull docker://ghcr.io/apptainer/lolcow
Alternatively, if you have an apptainer definition file you may build the container image, e.g. lolcow:
apptainer build --fakeroot lolcow.sif lolcow.def
If you encounter an error on a submit node when trying to install a container, try reserving resources using Slurm.
To run your container, e.g. lolcow:
apptainer run lolcow.sif
- or to run a specific command -
apptainer exec lolcow.sif cat /etc/os-release
- or to open a shell within the container -
apptainer shell lolcow.sif
You may also run apptainer images in batch; see the examples in /apps/samples.
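A minimal batch sketch (not one of the official samples), assuming the lolcow image built above:
#!/bin/bash
#SBATCH -p share
#SBATCH --time=0:10:00
cd ~/hpc-share/apptainer
apptainer exec lolcow.sif cat /etc/os-release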
For more information on using Apptainer, see the docs.
D. Data/Storage FAQs
You need an application capable of secure file transfer, such as MobaXterm, WinSCP, FileZilla, or Cyberduck. Alternatively, you can use the HPC portal for smaller file transfers. If you are using Windows and MobaXterm for your ssh sessions, then you can open an sftp session to one of the submit nodes.
If you are using a Mac or Linux, an alternative command line option is to open a terminal and use the sftp command, or scp to one of the submit nodes, e.g.:
sftp onid@submit-b.hpc.engr.oregonstate.edu
-or-
scp myLocalFile onid@submit-c.hpc.engr.oregonstate.edu:
Everyone has a fixed 15 GB quota that will not be increased. Exceeding the quota can cause jobs on the cluster to fail, and other things to not work properly.
All researchers should have an HPC scratch share located in /nfs/hpc/share/{username}, with a 1 TB quota. We encourage you to run your jobs from there and store your data there. However, this should not be considered permanent storage and is subject to being purged. The current purge policy is 90 days. Users will be notified in advance if their files are scheduled to be purged.
The HPC share is a limited resource and should be considered short term storage. If you run out, you should copy your important data to long term storage and delete unnecessary files from your HPC share.
A list of storage options available within the College of Engineering can be found here. Other solutions include Box or other cloud storage providers.
The HPC share is only visible to the HPC cluster via the InfiniBand network, so it cannot be mounted the same way as other shares like guille. However, the HPC share is easy to access via MobaXterm or another file transfer application with an sftp connection to one of the submit nodes. Alternatively, you can install the filesystem client SSHFS to mount the HPC share from a submit node over ssh.
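For example, with SSHFS installed on your local Linux or Mac machine, a mount might look like this (the local mount point is just an illustration):
mkdir -p ~/hpc-mount
sshfs {onid}@submit-a.hpc.engr.oregonstate.edu:/nfs/hpc/share/{onid} ~/hpc-mount
To unmount later, use "fusermount -u ~/hpc-mount" on Linux or "umount ~/hpc-mount" on a Mac.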