The College of Engineering operates a high-performance computing (HPC) cluster supported by the engineering community. Our goal is to provide a resource that is available to all engineering faculty and students. The college supports the cluster by providing rack space, disk space, system administration and configuration, and standard software. It also provides CPU and GPU systems for general research and class use. Faculty who purchase systems to add to the cluster get the benefit of reserved resources along with access to the rest of the cluster, and COE IT maintains these resources so faculty are freed from system administration duties.
Beginning Fall 2023, non-COE faculty and students can request COE HPC accounts! Students not enrolled in engineering courses must be sponsored by a faculty member in their college or department.
All new HPC users must attend an Intro to HPC training session with the COE HPC Manager, Robert Yelle. Non-COE users will receive a sign-up link in the confirmation email sent after submitting the HPC access request form.
As a new HPC user, you may wish to subscribe to the Cluster mailing list to receive news and status updates regarding the cluster. You may also check for news or the status of the cluster here.
There are two ways to get on the campus network:
Connect to the OSU VPN. This is the recommended method. Once you are on the OSU VPN, you can connect directly to the CoE HPC cluster using your SSH client, or to the HPC portal using your web browser.
Connect to one of the CoE gateway servers. With this method, you first connect to a CoE gateway host (access.engr.oregonstate.edu) via SSH. If you are using a Mac or a Linux computer, you can launch a terminal window and use the ssh command, e.g.:
ssh myONID@access.engr.oregonstate.edu
where myONID is your ONID (OSU Network ID). If you are using Windows, you need to run an SSH client such as MobaXterm or PuTTY, then open an SSH session to access.engr.oregonstate.edu.
If you are on campus, or are connected to the OSU VPN or to a COE gateway host as described in Step 3, then you may connect directly to one of the cluster login or submit nodes via SSH, or to the HPC portal, using your ONID credentials. If you are using a Mac or a Linux host such as one of the flip servers, then from a terminal window or shell prompt you can SSH directly to one of the three submit hosts (submit-a, submit-b, submit-c) as follows:
ssh myONID@submit.hpc.engr.oregonstate.edu
If you are connecting from a Windows computer, you need to run an SSH client such as MobaXterm or PuTTY and open an SSH session to one of the submit nodes (e.g. submit.hpc.engr.oregonstate.edu). If you regularly connect through a gateway server, see the SSH configuration sketch below.
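If you use the gateway method rather than the VPN, an OpenSSH client configuration entry can chain the gateway hop and the submit host for you. The following is a minimal sketch (Mac, Linux, or Windows OpenSSH); the host aliases coe-gateway and hpc-submit are arbitrary names chosen for this example, and myONID should be replaced with your own ONID:

# ~/.ssh/config -- example entries only; aliases are illustrative
Host coe-gateway
    HostName access.engr.oregonstate.edu
    User myONID

Host hpc-submit
    HostName submit.hpc.engr.oregonstate.edu
    User myONID
    ProxyJump coe-gateway

With this in place, a single command such as ssh hpc-submit connects through the gateway to a submit host.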
To access the HPC Portal, launch a web browser and enter this URL:
https://submit.hpc.engr.oregonstate.edu
Note that the submit nodes are not for running long or large jobs or calculations; they serve as a gateway to the rest of the cluster. From a submit node, you may request compute resources (e.g. CPUs, RAM, and GPUs) from available compute nodes, either via an interactive session or by submitting one or more batch jobs. See the Slurm section below (Step 5) for how to reserve resources and run jobs on the cluster.
Once you are connected to a submit host, you can reserve and use cluster resources. Be advised that direct SSH access to a cluster compute node is not required, and is not permitted unless you have been granted access to that node through Slurm. Slurm Workload Manager is the batch-queue system used to gain access to and run jobs on the COE HPC cluster, including the Nvidia DGX systems and other compute nodes. To use Slurm, you must be assigned to a Slurm account corresponding to your department, class, or research group, which is done by enabling your HPC account in TEACH (see Step 2).
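Once your HPC account is enabled, you can check from a submit host which Slurm accounts and partitions you have access to. These are standard Slurm commands; the account and partition names you see will depend on your department, class, or research group:

# list the Slurm associations for your user
sacctmgr show associations user=$USER format=Cluster,Account,User,Partition

# summarize the available partitions
sinfo -s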
For quick, interactive shell access to a compute node (e.g. using bash), do:
srun --pty bash
Note: if you get the error "srun: command not found", see FAQ #B.2 for how to resolve it.
If you want interactive access to a GPU and prefer the tcsh shell over bash, do:
srun --gres=gpu:1 --pty tcsh
To confirm that you have access to one or more GPUs:
nvidia-smi
Many more options are available, including reserving multiple cores, multiple compute nodes, more memory, and additional time, as well as requesting other pools of resources. In addition, jobs may be submitted in batch mode instead of interactive mode. Check out the Slurm HOWTO for more examples and information on using Slurm on the COE cluster.
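As an illustration of batch mode, the sketch below shows a minimal Slurm batch script. The job name, resource amounts, and output file are placeholders to adjust for your own work; the GPU line is optional and the defaults for your Slurm account may differ:

#!/bin/bash
#SBATCH --job-name=myjob          # example job name
#SBATCH --time=1:00:00            # wall-clock limit (1 hour)
#SBATCH --cpus-per-task=4         # CPU cores
#SBATCH --mem=8G                  # memory
#SBATCH --gres=gpu:1              # optional: request one GPU
#SBATCH --output=myjob_%j.out     # output file (%j expands to the job ID)

# commands to run on the compute node
hostname
nvidia-smi

Save this as, for example, myjob.sh, submit it with sbatch myjob.sh, and check its status with squeue -u $USER.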
If you have completed the above steps, you should now have a functioning COE HPC environment. However, if you would like to get the most out of the cluster, here are some helpful hints to optimize your experience.