A. General FAQs

1. "How can I get access to the HPC cluster?"

Easy!  Make sure you have an Engineering account - you can create one here by clicking on "Create a new account (Enable your Engineering resources)".  Then activate your HPC account by clicking the "High Performance Computing" button on the right-hand side under "Account Tools".  Your account should be ready shortly.  If not, or if you have trouble with your account, try opening a ticket through the COEIT support portal.  Please provide your ONID, department, and advisor, and you will be assigned to the appropriate Slurm account.

 

2. "How can I get assistance on an HPC cluster issue?"

You can request assistance on HPC related issues by opening a ticket through the COEIT support portal.

Please put "HPC" in the subject line to alert COEIT that this issue involves the HPC cluster; this helps ensure the ticket is routed properly and improves response time.

 

3. "I try to ssh to a DGX or HPC node and I get the following message 'Access denied by pam_slurm_adopt...', why?"

In order to access hosts on the CoE HPC cluster, you must ssh to one of the submit nodes (submit-a, submit-b, or submit-c) and then use Slurm to request resources on the compute nodes.  Once Slurm has allocated resources on a node to you, you can ssh to that node.  For more information on using Slurm, check out this link.
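
For example, a typical session might look like this (the partition, resource request, and node name below are placeholders; substitute whatever fits your work):

ssh {onid}@submit-a.hpc.engr.oregonstate.edu

srun -p share -c 2 --pty bash

Once the srun session is running on a compute node (e.g. cn-gpu5), you can ssh to that node from a submit node if needed:

ssh cn-gpu5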

 

4. "I try to ssh to dgx* or cn* and I get 'ssh: Could not resolve hostname...', what is wrong?"

The names starting with dgx* (e.g. dgx2-2) and cn* (e.g. cn-gpu5) are shortened internal cluster names that can be used within the cluster (e.g. from the submit nodes) but are currently not resolvable outside the cluster.  The corresponding external names begin with "compute-", for historical reasons.  For internal names starting with "dgx", just add "compute-" to the name, e.g. for dgx2-2 you should use "compute-dgx2-2".  For names beginning with "cn", replace "cn" with "compute", e.g. for cn-gpu5 you should use "compute-gpu5".  Be advised that ssh access is not allowed until resources have already been reserved using Slurm.
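
For example, assuming the external hostname suffix used elsewhere in this FAQ, and assuming you have already reserved the node through Slurm, the external names would be used like this:

ssh {onid}@compute-dgx2-2.hpc.engr.oregonstate.edu

ssh {onid}@compute-gpu5.hpc.engr.oregonstate.edu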

 

5. "Why are my tmux and srun sessions being terminated?"

The most likely cause is that your tmux sessions are exceeding memory or CPU utilization limits on the submit nodes.  Monitor your tmux sessions closely using the "ps" command.  If you have a lot of srun commands running through a single tmux session, you might try spreading the load out among the three submit nodes.

If your tmux session is terminated, then any srun sessions within it will also be terminated.  If you have a well-established workflow, you might try putting your commands into a script and using sbatch instead of srun.  Jobs submitted in batch mode will not be terminated even if your network or VPN connection fails or your tmux session is terminated.  If your workflow is not suited to batch mode, you might try the HPC portal instead (VPN or campus network connection required).
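
For example, a minimal sbatch script might look like this (the partition, resources, and command are placeholders for whatever you normally run inside tmux/srun):

#!/bin/bash
#SBATCH -p share
#SBATCH -c 2
#SBATCH --time=1-00:00:00
#SBATCH -o myjob.out
#SBATCH -e myjob.err
# replace with the commands you normally run interactively
./my_long_running_command

Submit it from a submit node with:

sbatch myscript.sh

The job will keep running even if your network connection drops or your tmux session is terminated.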

 

 

B. Slurm FAQs

1. "What is Slurm, why do I need to use it, why can't I just ssh to a DGX or GPU server or other compute node?"

The HPC cluster is a heavily shared compute resource, and Slurm is a cluster resource and workload manager.  When you request an allocation, Slurm searches all compute resources (partitions) that are available to you so you do not have to find them yourself, then allocates the resources you requested (e.g. CPU, GPU, RAM) if they are immediately available (or schedules your job if they are not), and gives you dedicated access to those resources.  This protects your calculations or jobs from memory contention and from CPU/GPU sharing with other users, which leads to much greater job stability and performance.  This resource protection would not be possible if ssh access were granted to anybody at any time.  Without a cluster manager, the best case on a busy cluster is that users' jobs would slow to a crawl due to CPU and GPU load sharing; worse, they could crash or cause other jobs to crash due to memory contention.

 

2. "When I type 'srun' or 'sinfo', I get 'command not found', did I do something wrong?"

There is likely a problem with your shell environment.  There are a couple of potential solutions:

1) Log into the TEACH web site, then click on the link titled "Reset Unix Config Files", then open another login session to a submit node.  This should correct the problem.

2) Alternatively, if you have invested a lot of time and effort into your shell environment and do not want to reset it, edit your shell configuration file (~/.bashrc if you use bash, or ~/.cshrc if you use csh or tcsh) using your favorite linux editor (e.g. nano, vim, emacs).  When editing these files, look for the lines that set your executable path and make sure the existing path variable ($PATH for bash, $path for csh/tcsh) is included in that line, e.g.:

if you use bash:

export PATH=$PATH:/usr/local/bin:/usr/local/apps/bin

if you use csh/tcsh:

set path=( $path /usr/local/bin /usr/local/apps/bin)

After adding $PATH or $path to your shell configuration file, open a new session to a submit node and try running "srun" or "sinfo" again.

If neither #1 nor #2 corrects the issue, let us know by opening a ticket through the COEIT support portal.

 

3. "When I run 'srun', I get the message 'srun: error: Unable to allocate resources: Invalid account or account/partition combination specified'. What am I doing wrong?"

There are two possibilities:

1) You have not yet been added to a Slurm account.  Please open a ticket through the COEIT support portal with this error message, and provide your advisor's name and your department, and we'll add you to the appropriate account.

2) You are trying to access a restricted partition using the wrong Slurm account.  If you do have access to that partition, provide the correct account for that partition using the "-A" option in srun or sbatch.
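
For example, with placeholders for the account and partition names:

srun -A {account} -p {partition} --pty bash

or in an sbatch script:

#SBATCH -A {account}
#SBATCH -p {partition}

If you are not sure which Slurm accounts you belong to, sacctmgr should list them, e.g.:

sacctmgr show assoc user={onid} format=account,partition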

 

4. "What partitions do I have access to, and what are their limits or policies?"

Everybody has access to the "share", "dgxs" and "preempt" partitions.

The share partition can be used for CPU, GPU, and large-memory jobs, for up to 7 days.  

The dgxs partition is mostly for short GPU jobs (<24 hrs), or troubleshooting GPU jobs.  

The preempt partition is a low-priority queue that usually consists of all HPC resources and is a way to take advantage of unused resources outside the share partition.  However, jobs submitted to the preempt partition may be cancelled or "preempted" by higher-priority jobs, so use this queue at your own risk.  The preempt partition may be useful for short jobs (e.g. a few hours), or for jobs that are checkpointed or restartable.  Long jobs (>24 hrs) that are not restartable or checkpointed should not use the preempt partition.

Access to other partitions requires being added to a research group, department, or class account.  

The limits for each partition may vary depending on demand for that partition. 

For more information on partition access and current limits, read the section "Summary of accounts, partitions and limits" located in the Slurm howto.
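
For a quick command-line view of the partitions visible to you, sinfo can also help; for example, this prints each partition's name, time limit, and node count (the format string is just one possible choice):

sinfo -o "%P %l %D"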

 

5. "Why is my job failing or getting killed or cancelled?"

There may be several reasons for your job failing or being killed. Please view the error and output files from your batch script or your application for clues to why it failed. Some common reasons are listed below:

1) Out of time.  Your job did not request enough time to complete.  For example, you may get a message like this from your srun command or sbatch output:

"slurmstepd: error: *** Step=123456 ON hostname CANCELLED AT ... DUE TO TIME LIMIT ***"

It means that your default or requested time was not enough for your job or application to complete.  To resolve, simply request additional time using the "--time" option in Slurm, e.g. to request 3.5 days using srun:

srun --time=3-12:00:00 --pty bash

or add this directive in your sbatch script:

#SBATCH --time=3-12:00:00

Note that the maximum time limit on most partitions is 7 days (see output of "sinfo" for timelimit on each partition).

 

2) Out-of-memory (OOM).  Your job did not have enough memory allocated.  For example, you may get a message like this from your srun command or sbatch output:

"slurmstepd: error: Detected 1 oom-kill event(s) in StepID=123456.batch. Some of your processes may have been killed by the cgroup out-of-memory handler"

It means that your default (or requested) allocated memory was not enough for your job or application.  To resolve, simply request additional memory using the "--mem" option in Slurm, e.g. to request 10GB of memory using srun:

srun --mem=10g --pty bash

or add this directive in your sbatch script:

#SBATCH --mem=10G

If you are not sure how much to request, try the "tracejob" command to view a record of your job which will show the amount of memory requested (e.g. mem=1700M) and State (e.g. OUT_OF_MEMORY).

tracejob -j {jobid}

 

3) Cancelled due to preemption.  There is nothing wrong with your job; it was running on the low-priority preempt queue and was preempted by a higher-priority job.  For example, you may get a message like this from your srun command:

"srun: Force terminated job 123456. srun: Job step aborted:... slurmstepd: error: Step 123456 ... CANCELLED AT YYYY-MM-DDTHH:mm:ss DUE TO PREEMPTION"

To avoid this result, do not use the preempt partition.

 

4) Unknown.  If it is not clear why your job failed, please submit a ticket to the COEIT support portal and provide any relevant output from your batch job or srun session.

 

6. "All of my data is on the scratch directory on dgx2-X, how can I select that host to run my jobs on?"

To reserve a specific host in Slurm, use the "--nodelist" option in srun or sbatch, e.g. for dgx2-X:

srun -A {account} -p dgx2 --nodelist=dgx2-X --pty bash

or add this directive in your sbatch script:

#SBATCH --nodelist=dgx2-X

 

7. "Why is my Slurm job pending with the message ...?"

  a) 'ReqNodeNotAvail, Reserved for maintenance':

The reason is that a maintenance reservation window has been scheduled in Slurm, and your job as scheduled would run into this window.  It will remain pending until the maintenance period is over.  If your job can complete before the maintenance period begins, you can reduce the walltime of your pending job as follows:

scontrol update job {jobid} TimeLimit=D-HH:MM:SS

Note that you can only decrease your walltime; you cannot increase it.  

Check this link for details on any scheduled maintenance.

  b) 'Resources':

The job is waiting for resources to become available. 

  c) 'Priority':

The job is queued behind a higher priority job.

  d) 'QOSMax*PerUser':

The maximum resource limit (CPU, GPU, or RAM) has been reached for the user.

  e) 'QOSGrp*':

The maximum resource limit (CPU, GPU, or RAM) has been reached for the group account.

  f) 'QOSGrp*RunMins':

The maximum active running limit of resources (CPU, GPU, or RAM) has been reached for the group account.

For reasons b through f, please be patient and your job will eventually start after other jobs complete.
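
To see which reason currently applies to your pending job, you can check the reason field in squeue, e.g.:

squeue -j {jobid} --Format=jobid,state,reason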

 

8. "Why is my Slurm job fail with the message 'Unable to allocate resources: Job violates accounting/QOS policy...'?"

This is most likely the result of submitting a job that exceeds certain limits (e.g. GPU or CPU limits) of the partition you are submitting to. 
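
To compare your request against the partition's configured limits, one option is to inspect the partition definition, which shows fields such as MaxTime and the attached QOS:

scontrol show partition {partition}

Then reduce your request (time, CPUs, GPUs, or memory) so it fits within those limits, or submit to a partition/account that allows it.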

 

9. "How can I find out when my job will start?"

Try this option in squeue:

squeue -j {jobid} --Format=starttime

or look for the StartTime field in the output of this command:

showjob {jobid} 

If you feel your job is stuck in the queue, please leave it there and open a ticket in the COEIT support portal with your job number.  It is important to leave your job in the queue to facilitate troubleshooting. 

 

10. "How do I run an MPI job on the cluster?"

First reserve a desired number of tasks (-n) or tasks per node (--ntasks-per-node) over a desired number of nodes (-N), e.g.: 

srun -p share -N 2 --ntasks-per-node=4 --pty bash 

Look for MPI modules:

module avail mpi

Load an MPI module, e.g. for OpenMPI:

module load openmpi/3.1

Compile the MPI code using the OpenMPI compiler wrapper:

mpicc simple.c

Run the OpenMPI executable using the number of tasks you requested from Slurm:

mpirun -mca btl self,tcp -np 8 ./a.out

To use MPICH instead of OpenMPI, load an MPICH module:

module load mpich/3.3

Compile the MPI code using the MPICH compiler wrapper:

mpicc simple.c

Determine which hosts were reserved in Slurm:

echo $SLURM_NODELIST

Run the MPICH executable using the number of tasks you requested from Slurm and the nodes assigned to this job, which are listed in $SLURM_NODELIST.  For example, if the value of $SLURM_NODELIST were "cn-7-[4,5]" you would run:

mpirun -hosts=cn-7-4,cn-7-5 -np 8 ./a.out

A sample script for batch submissions might look like this:

#!/bin/bash
#SBATCH -p share
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
# load an MPI module
module load openmpi/3.1
# compile MPI code
mpicc simple.c
# run MPI code
mpirun -mca btl self,tcp -np 8 ./a.out

OpenMPI is currently recommended for batch submissions requiring more than one node.  

Instructions for using IntelMPI will be added soon.

 

 

C. Software FAQs

1. "I need to run certain software on the HPC cluster, is it installed?  If not, how can I get it installed?"

Many commonly used software packages are available through your default executable path.  To confirm, try running the application.  If you get "command not found", the software might be available through the modules system.
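
For example, to search the modules system for a package and then load it (the names below are placeholders):

module avail {software}

module load {software}/{version}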

If you are unable to find your software through the modules system, you can request the software by opening a ticket through the COEIT support portal.

Please put "HPC" as part of the subject line to alert COEIT that this software request is for the HPC cluster.

 

2. "Can you install a specific python version or python package for me?"

There are a number of python versions available via the modules system.  Different python versions can be accessed by loading different python or anaconda modules.  To see what is available, do: 

module avail python anaconda

Due to the diversity of needs among our python users, it is difficult to install specific packages in a way that meets everyone's needs, so we recommend that python users manage their own python environments.  First check one of our minimal anaconda modules, which have numerous python packages pre-installed, and see if it has what you need:

module load anaconda/{version}
pip list
conda list

If the anaconda module does not contain the python package you need, try setting up your own python virtual environment.

You can use the anaconda module as the basis for your python virtual environment; just be sure to load the anaconda module before activating your environment.
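
A minimal sketch of that workflow, assuming you keep the environment in your HPC share (the environment name and package are placeholders):

module load anaconda/{version}
python -m venv /nfs/hpc/share/{onid}/venvs/myproject
source /nfs/hpc/share/{onid}/venvs/myproject/bin/activate
pip install {package}

In later sessions, load the same anaconda module again before activating the environment.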

Alternatively, you can install anaconda or miniconda into your own directory and customize python that way.  Python environments can take significant space; therefore, we recommend you install them into your HPC share (/nfs/hpc/share/{onid}).

 

3. "How can I install my own R packages for use on the cluster?"

If you haven't already, create a link to your hpc share, then create a directory there to save your R packages:

ln -s /nfs/hpc/share/{myONID} ~/hpc-share
mkdir -pv ~/hpc-share/R/{R version, e.g. 4.1.3}

Copy this .Rprofile into your home directory:

cp -v /usr/local/apps/R/.Rprofile ~/

or copy and paste these contents into your .Rprofile:

# set libPaths to HPC share/R/{version number}
.libPaths(new='~/hpc-share/R/4.1.3/')
# set a CRAN mirror
local({r <- getOption("repos")
      r["CRAN"] <- "https://ftp.osuosl.org/pub/cran/"
      options(repos=r)})

The next time you load an R module (e.g. R/4.1.3) and run R, you should be able to save R libraries to your directory in ~/hpc-share/R/4.1.3.

 

4. "How can I run Jupyter Notebook over an ssh tunnel on the HPC cluster?"

First, you might consider using Jupyter Notebook through the HPC portal over https.  Just log in with your ONID credentials and select "Jupyter Notebook" from the "Interactive Apps" menu.  Then enter "anaconda" as a module to load (and also add "cuda" if you plan to reserve a GPU), and fill in the blanks with whatever time and resources you need.  This will launch the Jupyter Notebook app in your browser with the resources you have reserved.

If, however, you still need to run Jupyter over ssh rather than https, then follow the steps below:

1) from a submit node, request resources (e.g. 4 cpu cores and gpu if needed) using srun, e.g.:

srun -c 4 --gres=gpu:1 --pty bash

2) from your active srun session, load an anaconda module which will have jupyter-notebook installed:

module load anaconda

Alternatively, you may activate your own conda instance or python virtual environment containing jupyter-notebook.

3) from your active srun session, launch Jupyter Notebook without a browser, using an open port such as 8080:

jupyter notebook --no-browser --port=8080

4) from your laptop or desktop, set up an ssh tunnel, e.g.:

ssh -N -L 8080:localhost:8080 {onid}@compute-{hostname}.hpc.engr.oregonstate.edu

This part can be tricky.  The hostname given in the "srun" session is an internal cluster name, so when connecting from outside the cluster the name must start with "compute-" (see FAQ #A.4 for details).  So if your srun session is on dgx2-1 then you would use "compute-dgx2-1", but if your session is on cn-gpu5 then you would use "compute-gpu5". 

5) after providing the credentials for your ssh tunnel, open a browser on your laptop or desktop to:

http://localhost:8080

You should now have access to your Jupyter Notebook on the HPC cluster.

 

5. "Can I run Matlab in parallel on the cluster?"

Yes.  Matlab can make use of all the cores that you request on a single node (a minimal single-node example is sketched at the end of this answer).  If you need multiple workers over multiple nodes, you can try Matlab Parallel Server using the following steps:

1) Launch Matlab 2021b on your Windows (or Mac or Linux) computer

2) Go to the "Add-Ons" icon, select "Get Add-Ons"

3) Within the Add-On Explorer, search for "Slurm". Select the "Parallel Computing Toolbox plugin for MATLAB Parallel Server with Slurm", and install that Add-On. You will be asked to authenticate to your Matlab account.

4) After installation has completed, you will be asked to configure this Add-On.  Proceed with the following options:

  Choose "non shared" profile.

  ClusterHost = any HPC submit host, e.g. "submit-b.hpc.engr.oregonstate.edu" (other choices can be submit-a or submit-c)

  RemoteJobStorageLocation = "/nfs/hpc/share/{onid}" or "/nfs/hpc/share/{onid}/matlab"

  Username = {onid}

5) After that, you can go to "Parallel" and select "Create and Manage Clusters" to further edit your profile, e.g. you can set NumWorkers and NumThreads.  

6) Click "Done" to save changes, then validate your new profile. The validation may fail on the last step due to a name resolution error, but the Matlab job should still run.

At present, running Matlab Parallel Server in this way has some limitations.  For more options, it may be better to run Matlab Parallel Server via command line directly on the cluster.
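
For the simpler single-node case mentioned at the start of this answer, a minimal sketch is to reserve the cores through Slurm and start Matlab inside that allocation (the module name is an assumption; check "module avail matlab" to see what is installed):

srun -p share -c 16 --time=1-00:00:00 --pty bash
module load matlab
matlab -nodisplay

Within Matlab, parpool can then make use of the cores you reserved.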

 

 

D. Data/Storage FAQs

1. "How do I transfer my data to and from the HPC cluster?"

You need an application capable of secure file transfer, such as MobaXterm, WinSCP, FileZilla, or Cyberduck.  Alternatively, you can use the HPC portal for file transfers.  If you are using Windows and MobaXterm for your ssh sessions, then you can open an sftp session to one of the submit nodes. 

If you are using a Mac or Linux, an alternative command line option is to open a terminal and use the sftp command, or scp to one of the submit nodes, e.g.:

sftp onid@submit-b.hpc.engr.oregonstate.edu
-or-
scp myLocalFile onid@submit-c.hpc.engr.oregonstate.edu:

 

2. "I have run out of storage space on my home directory, how can I store data generated by my jobs?"

All researchers should have an HPC scratch share located in /nfs/hpc/share/{onid}, with a 1 TB quota. We encourage you to run your jobs from there and store your data there.  However, this should not be considered permanent storage and is subject to being purged.  The current purge policy is 90 days.  Users will be notified in advance if their files are scheduled to be purged.
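
For example, one convenient pattern is to link your HPC share into your home directory (as in FAQ #C.3) and submit jobs from there (the script name is a placeholder):

ln -s /nfs/hpc/share/{onid} ~/hpc-share
cd ~/hpc-share
sbatch myjob.sh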

 

3. "How do I mount the HPC share "/nfs/hpc/share/{onid}" on my local computer?"

The HPC share is only visible to the HPC cluster via the InfiniBand network, so it cannot be mounted the same way as other shares like guille.  However, the HPC share is easy to access via MobaXterm or another file transfer application with an sftp connection to one of the submit nodes.  Alternatively, you can install the filesystem client SSHFS to mount the HPC share from a submit node over ssh.
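
If you go the SSHFS route, a minimal sketch from a Linux or Mac client might look like this (the local mount point name is just an example):

mkdir -pv ~/hpc-share-mount
sshfs {onid}@submit-a.hpc.engr.oregonstate.edu:/nfs/hpc/share/{onid} ~/hpc-share-mount

and to unmount when you are done:

fusermount -u ~/hpc-share-mount     (Linux)
umount ~/hpc-share-mount            (Mac)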

 

4. "I have exceeded my quota on my HPC share directory and need more storage space, can you help?"

The HPC share is a limited resource and should be considered short-term storage.  If you run out, you should copy your important data to long-term storage and delete unnecessary files from your HPC share.  For long-term storage we recommend Box or another cloud storage provider.
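
To see what is taking up the space before you clean up or move data, you can check your HPC share with du, e.g.:

du -sh /nfs/hpc/share/{onid}/* | sort -h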