Introduction: The Kelley Engineering Center server room opened in 2005 with 30 full-height computer racks, providing ample space for all of the college's computational research needs. To encourage usage, the room was made available for all suitable purposes, and because space was abundant, the policies governing it were minimal. That space is now in high demand, and the room must be actively managed to provide the most efficient and effective utilization of its limited footprint and utilities.
Overview: The Kelley Engineering Center (KEC) server room is managed to provide access to high-quality computing resources. A ‘Condo’ model provides the basic framework for managing these computational resources: researchers contribute nodes to a shared cluster, increasing the overall computing capacity available to everyone. The model prioritizes access for resource contributors while allowing all participants to use idle resources.
General Policies:
- Hardware (Node) Specification: College of Engineering (COE) IT will collaborate with researchers to identify and pursue the most appropriate resources and hardware, new or existing, that will meet the project requirements.
- Hardware Purchasing: Once hardware has been identified that aligns with the researcher's needs and meets COE IT standards, the researcher will provide an index or other funding source to COE IT for the purchase.
- Warranty: All new hardware purchases will be made with at least a 5-year next business day hardware warranty.
- Server Class Hardware: All nodes purchased will include out-of-band management capabilities, redundant power supplies, an InfiniBand network card, and a rack mounting system. All nodes must be enterprise (server) class.
- Lifecycle: Nodes will be managed on a maximum 6-year lifecycle. At the end of the 6th year, researchers must choose to either replace the nodes or have them retired. If nodes are part of a project or contract that ends in under 6 years, those nodes will be retired after completion of the associated project or contract. Retired nodes will be removed from the server room.
Condo Participants (Contributors):
- Priority Access: Based on the number and specifications of the nodes a contributing researcher or research group provides, that researcher or group will be granted priority access to an equivalent amount of computation in the cluster.
- Open Queues: Researchers or research groups that contribute nodes to the cluster will be granted prioritized access to the open queues.
- Idle Nodes: Nodes that are idle will be accessible for use by all users. The ‘Condo’ model encourages this shared usage, granting access to resources that would otherwise go unused.
- Job Scheduling: Contributing researchers or research groups will be able to customize job scheduling parameters for their queues and will have first prioritization to those queues. However, submitted jobs will not be able to preempt currently running jobs, regardless of priority.
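As an illustration of how priority without preemption can be expressed in Slurm, the sketch below defines two overlapping partitions over the same contributed nodes: one restricted to the contributing group's account at a higher priority tier, and one open to all users for idle cycles. The partition names, node names, account names, and time limits are hypothetical; the actual configuration is set and maintained by COE IT.

```
# Hypothetical slurm.conf excerpt (names and limits are examples only).
# Preemption is disabled cluster-wide, so higher-priority pending jobs
# wait for running jobs to finish rather than cancelling them.
PreemptType=preempt/none

# Contributor partition: restricted to the contributing group's Slurm account,
# with a higher priority tier so its jobs are scheduled first on these nodes.
PartitionName=lab_smith Nodes=cn[01-04] AllowAccounts=lab_smith PriorityTier=10 MaxTime=7-00:00:00

# Open partition: the same nodes are available to all users at a lower priority
# tier, so they can be used whenever the contributor leaves them idle.
PartitionName=open Nodes=cn[01-04] AllowAccounts=ALL PriorityTier=1 MaxTime=1-00:00:00 Default=YES
```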
All Users:
- Access: All members of Oregon State University (OSU) can obtain access to the cluster.
- Resource Allocation: Access to the open queues for non-contributing users is handled in a fair-share manner, ensuring all users have an opportunity to run jobs on the cluster.
- Home Directories: Each user is provided with a cluster-accessible home directory that can be used for software configuration files, data storage, and software installation. This home directory is only available to the specified user and is not accessible by other users.
- Storage: All users can request access to high-performance storage space, which is hosted on a DDN appliance running a Lustre-based file system. Researchers or research groups can also request that project space be set up; project space is hosted on a Dell Isilon storage device (NAS) and has a default quota of 1 TB.
- Network: The cluster systems and high-performance storage are connected via an InfiniBand HDR network.
- Server Management: All OS installation, OS updates, BIOS updates, hardware maintenance, hardware installation, and other server management needs will be handled by COE IT staff only.
- Job Scheduling Environment: Slurm, a workload manager, is used for job scheduling on the cluster; an example job script is shown after this list.
- Software: COE IT supports and maintains a basic set of software that is available on the cluster. Users can request software be added to the central repository. Users can also install software in their home directory or project space for use on the cluster.
- Job Scheduling Parameters: Slurm default job queue parameters are set and maintained by COE IT. Requests for changes and exceptions will be handled on a case-by-case basis.
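As a usage illustration, a non-contributing user might submit work to the open queue with a batch script like the one below. The partition name, module name, resource requests, and paths are hypothetical examples; actual queue names, default limits, and available software modules are determined by COE IT.

```bash
#!/bin/bash
#SBATCH --job-name=example_job       # job name shown in the queue
#SBATCH --partition=open             # hypothetical open (fair-share) queue
#SBATCH --nodes=1                    # run on a single node
#SBATCH --ntasks=1                   # one task
#SBATCH --cpus-per-task=4            # four CPU cores for the task
#SBATCH --mem=16G                    # memory requested for the job
#SBATCH --time=04:00:00              # wall-clock limit, subject to queue defaults
#SBATCH --output=%x_%j.out           # write output to <job-name>_<job-id>.out

# Load software from the centrally maintained repository (module name is an example).
module load python/3.11

# Software installed in the user's home directory or project space can also be used,
# e.g. a virtual environment created beforehand with: python -m venv ~/envs/myenv
source ~/envs/myenv/bin/activate

# Run the workload; input and output paths are placeholders.
srun python analyze.py --input ~/data/input.csv --output ~/data/results.csv
```

Jobs submitted this way compete fair-share with other open-queue users; contributors use the same script format with their own partition name to receive priority on their nodes.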
These policies are intended to ensure efficient and fair use of resources, balancing open shared access with priority for contributors.