A. General FAQs

1. "How can I get access to the HPC cluster?"

Easy!  Make sure you have an Engineering account - you can create one here by clicking on "Create a new account (Enable your Engineering resources)".  Then activate your HPC account by clicking the "High Performance Computing" button on the right-hand side under "Account Tools".  Your account should be ready shortly.  If not, or if you have trouble with your account, try opening a ticket through the COEIT support portal.  Please provide your ONID, department, and advisor, and you will be assigned to the appropriate Slurm account.

 

2. "How can I get assistance on an HPC cluster issue?"

You can request assistance on HPC related issues by opening a ticket through the COEIT support portal.

Please put "HPC" in the subject line to alert COEIT that this issue involves the HPC cluster; this helps ensure the ticket is routed properly and improves response time.

 

3. "I try to ssh to a DGX or HPC node and I get the following message 'Access denied by pam_slurm_adopt...', why?"

In order to access hosts on the CoE HPC cluster, you must ssh to one of the submit nodes (submit-a, submit-b, or submit-c) and then use Slurm to request resources on the compute nodes.  Once Slurm has allocated resources on a node to you, you can ssh to that node.  For more information on using Slurm, check out this link.
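
For example, a typical session might look like this (the partition, resource request, and node name below are placeholders; substitute whatever fits your work):

ssh {onid}@submit-a.hpc.engr.oregonstate.edu

srun -p share -c 2 --pty bash

Once the srun session is running on a compute node (e.g. cn-gpu5), you can ssh to that node from a submit node if needed:

ssh cn-gpu5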

 

4. "I try to ssh to dgx* or cn* and I get 'ssh: Could not resolve hostname...', what is wrong?"

The names starting with dgx* (e.g. dgx2-2) and cn* (e.g. cn-gpu5) are shortened internal cluster names that can be used within the cluster (e.g. from the submit nodes) but are currently not resolvable outside the cluster.  The corresponding external names begin with "compute-", for historical reasons.  For internal names starting with "dgx", just add "compute-" to the name, e.g. for dgx2-2 you should use "compute-dgx2-2".  For names beginning with "cn", replace "cn" with "compute", e.g. for cn-gpu5 you should use "compute-gpu5".  Be advised that ssh access is not allowed until resources have already been reserved using Slurm.
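
For example, assuming the external hostname suffix used elsewhere in this FAQ, and assuming you have already reserved the node through Slurm, the external names would be used like this:

ssh {onid}@compute-dgx2-2.hpc.engr.oregonstate.edu

ssh {onid}@compute-gpu5.hpc.engr.oregonstate.edu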

 

5. "Why are my tmux and srun sessions being terminated?"

The most likely cause is that your tmux sessions are exceeding memory or CPU utilization limits on the submit nodes.  Monitor your tmux sessions closely using the "ps" command.  If you have a lot of srun commands running through a single tmux session, you might try spreading the load out among the three submit nodes.

If your tmux session is terminated, then any srun sessions within it will also be terminated.  If you have a well-established workflow, you might try putting your commands into a script and using sbatch instead of srun.  Jobs submitted in batch mode will not be terminated even if your network or VPN connection fails or your tmux session is terminated.  If your workflow is not suited to batch mode, you might try the HPC portal instead (VPN or campus network connection required).
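
For example, a minimal sbatch script might look like this (the partition, resources, and command are placeholders for whatever you normally run inside tmux/srun):

#!/bin/bash
#SBATCH -p share
#SBATCH -c 2
#SBATCH --time=1-00:00:00
#SBATCH -o myjob.out
#SBATCH -e myjob.err
# replace with the commands you normally run interactively
./my_long_running_command

Submit it from a submit node with:

sbatch myscript.sh

The job will keep running even if your network connection drops or your tmux session is terminated.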

 

 

B. Slurm FAQs

1. "What is Slurm, why do I need to use it, why can't I just ssh to a DGX or GPU server or other compute node?"

The HPC cluster is a heavily shared compute resource, and Slurm is a cluster resource and workload manager.  When you request an allocation, Slurm searches all compute resources (partitions) that are available to you so you do not have to find them yourself, then allocates the resources you requested (e.g. CPU, GPU, RAM) if they are immediately available (or schedules your job if they are not), and gives you dedicated access to those resources.  This protects your calculations or jobs from memory contention and from CPU/GPU sharing with other users, which leads to much greater job stability and performance.  This resource protection would not be possible if ssh access were granted to anybody at any time.  Without a cluster manager, the best case on a busy cluster is that users' jobs would slow to a crawl due to CPU and GPU load sharing; worse, they could crash or cause other jobs to crash due to memory contention.

 

2. "When I type 'srun' or 'sinfo', I get 'command not found', did I do something wrong?"

There is likely a problem with your shell environment.  There are a couple of potential solutions:

1) Log into the TEACH web site, then click on the link titled "Reset Unix Config Files", then open another login session to a submit node.  This should correct the problem.

2) Alternatively, if you have invested a lot of time and effort into your shell environment and do not want to reset it, edit your shell configuration file (~/.bashrc if you use bash, or ~/.cshrc if you use csh or tcsh) using your favorite linux editor (e.g. nano, vim, emacs).  When editing these files, look for the lines that set your executable path and make sure the existing path variable ($PATH for bash, $path for csh/tcsh) is included in that line, e.g.:

if you use bash:

export PATH=$PATH:/usr/local/bin:/usr/local/apps/bin

if you use csh/tcsh:

set path=( $path /usr/local/bin /usr/local/apps/bin)

After adding $PATH or $path to your shell configuration file, open a new session to a submit node and try running "srun" or "sinfo" again.

If neither #1 nor #2 corrects the issue, let us know by opening a ticket through the COEIT support portal.

 

3. "When I run 'srun', I get the message 'srun: error: Unable to allocate resources: Invalid account or account/partition combination specified'. What am I doing wrong?"

There are two possibilities:

1) You have not yet been added to a Slurm account.  Please open a ticket through the COEIT support portal with this error message, and provide your advisor's name and your department, and we'll add you to the appropriate account.

2) You are trying to access a restricted partition using the wrong Slurm account.  If you do have access to that partition, provide the correct account for that partition using the "-A" option in srun or sbatch.
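
For example, with placeholders for the account and partition names:

srun -A {account} -p {partition} --pty bash

or in an sbatch script:

#SBATCH -A {account}
#SBATCH -p {partition}

If you are not sure which Slurm accounts you belong to, sacctmgr should list them, e.g.:

sacctmgr show assoc user={onid} format=account,partition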

 

4. "What partitions do I have access to, and what are their limits or policies?"

Everybody has access to the "share", "dgxs" and "preempt" partitions.

The share partition can be used for CPU, GPU, and large-memory jobs, for up to 7 days.  

The dgxs partition is mostly for short GPU jobs (<24 hrs), or troubleshooting GPU jobs.  

The preempt partition is a low-priority queue that usually consists of all HPC resources and is a way to take advantage of unused resources outside the share partition.  However, jobs submitted to the preempt partition may be cancelled or "preempted" by higher-priority jobs, so use this queue at your own risk.  The preempt partition may be useful for short jobs (e.g. a few hours), or for jobs that are checkpointed or restartable.  Long jobs (>24 hrs) that are not restartable or checkpointed should not use the preempt partition.

Access to other partitions requires being added to a research group, department, or class account.  

The limits for each partition may vary depending on demand for that partition. 

For more information on partition access and current limits, read the section "Summary of accounts, partitions and limits" located in the Slurm howto.
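
For a quick command-line view of the partitions visible to you, sinfo can also help; for example, this prints each partition's name, time limit, and node count (the format string is just one possible choice):

sinfo -o "%P %l %D"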

 

5. "Why is my job failing or getting killed or cancelled?"

There may be several reasons for your job failing or being killed. Please view the error and output files from your batch script or your application for clues to why it failed. Some common reasons are listed below:

1) Out of time.  Your job did not request enough time to complete.  For example, you may get a message like this from your srun command or sbatch output:

"slurmstepd: error: *** Step=123456 ON hostname CANCELLED AT ... DUE TO TIME LIMIT ***"

It means that your default or requested time was not enough for your job or application to complete.  To resolve, simply request additional time using the "--time" option in Slurm, e.g. to request 3.5 days using srun:

srun --time=3-12:00:00 --pty bash

or add this directive in your sbatch script:

#SBATCH --time=3-12:00:00

Note that the maximum time limit on most partitions is 7 days (see output of "sinfo" for timelimit on each partition).

 

2) Out-of-memory (OOM).  Your job did not have enough memory allocated.  For example, you may get a message like this from your srun command or sbatch output:

"slurmstepd: error: Detected 1 oom-kill event(s) in StepID=123456.batch. Some of your processes may have been killed by the cgroup out-of-memory handler"

It means that your default (or requested) allocated memory was not enough for your job or application.  To resolve, simply request additional memory using the "--mem" option in Slurm, e.g. to request 10GB of memory using srun:

srun --mem=10g --pty bash

or add this directive in your sbatch script:

#SBATCH --mem=10G

If you are not sure how much to request, try the "tracejob" command to view a record of your job which will show the amount of memory requested (e.g. mem=1700M) and State (e.g. OUT_OF_MEMORY).

tracejob -j {jobid}

 

3) Cancelled due to preemption.  There is nothing wrong with your job; it was running on the low-priority preempt queue and was preempted by a higher-priority job.  For example, you may get a message like this from your srun command:

"srun: Force terminated job 123456. srun: Job step aborted:... slurmstepd: error: Step 123456 ... CANCELLED AT YYYY-MM-DDTHH:mm:ss DUE TO PREEMPTION"

To avoid this result, do not use the preempt partition.

 

4) Unknown.  If it is not clear why your job failed, please submit a ticket to the COEIT support portal and provide any relevant output from your batch job or srun session.

 

6. "All of my data is on the scratch directory on dgx2-X, how can I select that host to run my jobs on?"

To reserve a specific host in Slurm, use the "--nodelist" option in srun or sbatch, e.g. for dgx2-X:

srun -A {account} -p dgx2 --nodelist=dgx2-X --pty bash

or add this directive in your sbatch script:

#SBATCH --nodelist=dgx2-X

 

7. "Why is my Slurm job pending with the message ...?"

  a) 'ReqNodeNotAvail, Reserved for maintenance':

The reason is that a maintenance reservation window has been scheduled in Slurm, and your job as scheduled would run into this window.  It will remain pending until the maintenance period is over.  If your job can complete before the maintenance period begins, you can reduce the walltime of your pending job as follows:

scontrol update job {jobid} TimeLimit=D-HH:MM:SS

Note that you can only decrease your walltime; you cannot increase it.  

Check this link for details on any scheduled maintenance.

  b) 'Resources':

The job is waiting for resources to become available. 

  c) 'Priority':

The job is queued behind a higher priority job.

  d) 'QOSMax*PerUser':

The maximum resource limit (CPU, GPU, or RAM) has been reached for the user.

  e) 'QOSGrp*':

The maximum resource limit (CPU, GPU, or RAM) has been reached for the group account.

  f) 'QOSGrp*RunMins':

The maximum active running limit of resources (CPU, GPU, or RAM) has been reached for the group account.

For reasons b through f, please be patient and your job will eventually start after other jobs complete.
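
To see which reason currently applies to your pending job, you can check the reason field in squeue, e.g.:

squeue -j {jobid} --Format=jobid,state,reason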

 

8. "Why is my Slurm job fail with the message 'Unable to allocate resources: Job violates accounting/QOS policy...'?"

This is most likely the result of submitting a job that exceeds certain limits (e.g. GPU or CPU limits) of the partition you are submitting to. 
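
To compare your request against the partition's configured limits, one option is to inspect the partition definition, which shows fields such as MaxTime and the attached QOS:

scontrol show partition {partition}

Then reduce your request (time, CPUs, GPUs, or memory) so it fits within those limits, or submit to a partition/account that allows it.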

 

9. "How can I find out when my job will start?"

Try this option in squeue:

squeue -j {jobid} --Format=starttime

or look for the StartTime field in the output of this command:

showjob {jobid} 

If you feel your job is stuck in the queue, please leave it there and open a ticket in the COEIT support portal with your job number.  It is important to leave your job in the queue to facilitate troubleshooting. 

 

10. "How do I run an MPI job on the cluster?"

First reserve a desired number of tasks (-n) or tasks per node (--ntasks-per-node) over a desired number of nodes (-N), e.g.: 

srun -p share -N 2 --ntasks-per-node=4 --pty bash 

Look for MPI modules:

module avail mpi

Load an MPI module, e.g. for OpenMPI:

module load openmpi/3.1

Compile the MPI code using the OpenMPI compiler wrapper:

mpicc simple.c

Run the OpenMPI executable using the number of tasks you requested from Slurm:

mpirun -mca btl self,tcp -np 8 ./a.out

To use MPICH instead of OpenMPI, load an MPICH module:

module load mpich/3.3

Compile the MPI code using the MPICH compiler wrapper:

mpicc simple.c

Determine which hosts were reserved in Slurm:

echo $SLURM_NODELIST

Run the MPICH executable using the number of tasks you requested from Slurm and the nodes assigned to this job, which are listed in $SLURM_NODELIST.  For example, if the value of $SLURM_NODELIST were "cn-7-[4,5]" you would run:

mpirun -hosts=cn-7-4,cn-7-5 -np 8 ./a.out

A sample script for batch submissions might look like this:

#!/bin/bash
#SBATCH -p share
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
# load an MPI module
module load openmpi/3.1
# compile MPI code
mpicc simple.c
# run MPI code
mpirun -mca btl self,tcp -np 8 ./a.out

OpenMPI is currently recommended for batch submissions requiring more than one node.  

Instructions for using IntelMPI will be added soon.

 

 

C. Software FAQs

1. "I need to run certain software on the HPC cluster, is it installed?  If not, how can I get it installed?"

Many commonly used software packages are available through your default executable path.  To confirm, try running the application.  If you get "command not found", the software might be available through the modules system.
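
For example, to search the modules system for a package and then load it (the names below are placeholders):

module avail {software}

module load {software}/{version}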

If you are unable to find your software through the modules system, you can request the software by opening a ticket through the COEIT support portal.

Please put "HPC" as part of the subject line to alert COEIT that this software request is for the HPC cluster.

 

2. "Can you install a specific python version or python package for me?"

There are a number of python versions available via the modules system.  Different python versions can be accessed by loading different python or anaconda modules.  To see what is available, do: 

module avail python anaconda

Due to the diversity of needs among our python users, it is difficult to install specific packages in a way that meets everyone's needs, so we recommend that python users manage their own python environments.  First check one of our minimal anaconda modules, which have numerous python packages pre-installed, and see if it has what you need:

module load anaconda/{version}
pip list
conda list

If the anaconda module does not contain the python package you need, try setting up your own python virtual environment.

You can use the anaconda module as the basis for your python virtual environment; just be sure to load the anaconda module before activating your environment.
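
A minimal sketch of that workflow, assuming you keep the environment in your HPC share (the environment name and package are placeholders):

module load anaconda/{version}
python -m venv /nfs/hpc/share/{onid}/venvs/myproject
source /nfs/hpc/share/{onid}/venvs/myproject/bin/activate
pip install {package}

In later sessions, load the same anaconda module again before activating the environment.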

Alternatively, you can install anaconda or miniconda into your own directory and customize python that way.  Python environments can take significant space; therefore, we recommend you install them into your HPC share (/nfs/hpc/share/{onid}).

 

3. "How can I install my own R packages for use on the cluster?"

If you haven't already, create a link to your hpc share, then create a directory there to save your R packages:

ln -s /nfs/hpc/share/{myONID} ~/hpc-share
mkdir -pv ~/hpc-share/R/{R version, e.g. 4.1.3}

Copy this .Rprofile into your home directory:

cp -v /usr/local/apps/R/.Rprofile ~/

or copy and paste these contents into your .Rprofile:

# set libPaths to HPC share/R/{version number}
.libPaths(new='~/hpc-share/R/4.1.3/')
# set a CRAN mirror
local({r <- getOption("repos")
      r["CRAN"] <- "https://ftp.osuosl.org/pub/cran/"
      options(repos=r)})

The next time you load an R module (e.g. R/4.1.3) and run R, you should be able to save R libraries to your directory in ~/hpc-share/R/4.1.3.

 

4. "How can I run Jupyter Notebook over an ssh tunnel on the HPC cluster?"

First, you might consider using Jupyter Notebook through the HPC portal over https.  Just log in with your ONID credentials and select "Jupyter Notebook" from the "Interactive Apps" menu.  Then enter "anaconda" as a module to load (and also add "cuda" if you plan to reserve a GPU), and fill in the blanks with whatever time and resources you need.  This will launch the Jupyter Notebook app in your browser with the resources you have reserved.

If, however, you still need to run Jupyter over ssh rather than https, then follow the steps below:

1) from a submit node, request resources (e.g. 4 cpu cores and gpu if needed) using srun, e.g.:

srun -c 4 --gres=gpu:1 --pty bash

2) from your active srun session, load an anaconda module which will have jupyter-notebook installed:

module load anaconda

Alternatively, you may activate your own conda instance or python virtual environment containing jupyter-notebook.

3) from your active srun session, launch Jupyter Notebook without a browser, using an open port such as 8080:

jupyter notebook --no-browser --port=8080

4) from your laptop or desktop, set up an ssh tunnel, e.g.:

ssh -N -L 8080:localhost:8080 {onid}@compute-{hostname}.hpc.engr.oregonstate.edu

This part can be tricky.  The hostname given in the "srun" session is an internal cluster name, so when connecting from outside the cluster the name must start with "compute-" (see FAQ #A.4 for details).  So if your srun session is on dgx2-1 then you would use "compute-dgx2-1", but if your session is on cn-gpu5 then you would use "compute-gpu5". 

5) after providing the credentials for your ssh tunnel, open a browser on your laptop or desktop to:

http://localhost:8080

You should now have access to your Jupyter Notebook on the HPC cluster.

 

5. "Can I run Matlab in parallel on the cluster?"

Yes.  Matlab can make use of all the cores that you request on a single node (a minimal single-node example is sketched at the end of this answer).  If you need multiple workers over multiple nodes, you can try Matlab Parallel Server using the following steps:

1) Launch Matlab 2021b on your Windows (or Mac or Linux) computer

2) Go to the "Add-Ons" icon, select "Get Add-Ons"

3) Within the Add-On Explorer, search for "Slurm". Select the "Parallel Computing Toolbox plugin for MATLAB Parallel Server with Slurm", and install that Add-On. You will be asked to authenticate to your Matlab account.

4) After installation has completed, you will be asked to configure this Add-On.  Proceed with the following options:

  Choose "non shared" profile.

  ClusterHost = any HPC submit host, e.g. "submit-b.hpc.engr.oregonstate.edu" (other choices can be submit-a or submit-c)

  RemoteJobStorageLocation = "/nfs/hpc/share/{onid}" or "/nfs/hpc/share/{onid}/matlab"

  Username = {onid}

5) After that, you can go to "Parallel" and select "Create and Manage Clusters" to further edit your profile, e.g. you can set NumWorkers and NumThreads.  

6) Click "Done" to save changes, then validate your new profile. The validation may fail on the last step due to a name resolution error, but the Matlab job should still run.

At present, running Matlab Parallel Server in this way has some limitations.  For more options, it may be better to run Matlab Parallel Server via command line directly on the cluster.
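
For the simpler single-node case mentioned at the start of this answer, a minimal sketch is to reserve the cores through Slurm and start Matlab inside that allocation (the module name is an assumption; check "module avail matlab" to see what is installed):

srun -p share -c 16 --time=1-00:00:00 --pty bash
module load matlab
matlab -nodisplay

Within Matlab, parpool can then make use of the cores you reserved.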

 

 

D. Data/Storage FAQs

1. "How do I transfer my data to and from the HPC cluster?"

You need an application capable of secure file transfer, such as MobaXterm, WinSCP, FileZilla, or Cyberduck.  Alternatively, you can use the HPC portal for file transfers.  If you are using Windows and MobaXterm for your ssh sessions, then you can open an sftp session to one of the submit nodes. 

If you are using a Mac or Linux, an alternative command line option is to open a terminal and use the sftp command, or scp to one of the submit nodes, e.g.:

sftp onid@submit-b.hpc.engr.oregonstate.edu
-or-
scp myLocalFile onid@submit-c.hpc.engr.oregonstate.edu:

 

2. "I have run out of storage space on my home directory, how can I store data generated by my jobs?"

All researchers should have an HPC scratch share located in /nfs/hpc/share/{onid}, with a 1 TB quota. We encourage you to run your jobs from there and store your data there.  However, this should not be considered permanent storage and is subject to being purged.  The current purge policy is 90 days.  Users will be notified in advance if their files are scheduled to be purged.
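
For example, one convenient pattern is to link your HPC share into your home directory (as in FAQ #C.3) and submit jobs from there (the script name is a placeholder):

ln -s /nfs/hpc/share/{onid} ~/hpc-share
cd ~/hpc-share
sbatch myjob.sh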

 

3. "How do I mount the HPC share "/nfs/hpc/share/{onid}" on my local computer?"

The HPC share is only visible to the HPC cluster via the InfiniBand network, so it cannot be mounted the same way as other shares like guille.  However, the HPC share is easy to access via MobaXterm or another file transfer application with an sftp connection to one of the submit nodes.  Alternatively, you can install the filesystem client SSHFS to mount the HPC share from a submit node over ssh.
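
If you go the SSHFS route, a minimal sketch from a Linux or Mac client might look like this (the local mount point name is just an example):

mkdir -pv ~/hpc-share-mount
sshfs {onid}@submit-a.hpc.engr.oregonstate.edu:/nfs/hpc/share/{onid} ~/hpc-share-mount

and to unmount when you are done:

fusermount -u ~/hpc-share-mount     (Linux)
umount ~/hpc-share-mount            (Mac)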

 

4. "I have exceeded my quota on my HPC share directory and need more storage space, can you help?"

The HPC share is a limited resource and should be considered short-term storage.  If you run out, you should copy your important data to long-term storage and delete unnecessary files from your HPC share.  For long-term storage we recommend Box or another cloud storage provider.
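
To see what is taking up the space before you clean up or move data, you can check your HPC share with du, e.g.:

du -sh /nfs/hpc/share/{onid}/* | sort -h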