Apollo GPU Nodes

Access to GPU resources is restricted and requires a cluster account. To request an account, please contact your school's help desk.

Hardware

The Apollo cluster contains two A100 nodes and two H200 nodes with the following configurations:

A100

A100 Apollo nodes have 8 NVIDIA A100 40 GB GPUs, 2 x 64-core AMD EPYC 7742 processors, 1024 GB RAM, and 15 TB of local scratch space, and run Springdale Linux 8.

H200

H200 Apollo nodes have 8 NVIDIA H200 141 GB GPUs with NVLink, 2 x 32-core Intel Xeon-G 6430 processors, 2048 GB RAM, and 3.5 TB of local scratch space, and run Red Hat Enterprise Linux 8.

Configuration

All nodes mount the same /home and /data filesystems as the other computers in SNS. Scratch space locations are named to distinguish local storage from network storage: /scratch/lustre is the mount point for the parallel file system, and /scratch/local/ is for node-local storage.
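
For example, a job script might stage its input from the network filesystems onto node-local scratch before running; the file name and exact destination path below are purely illustrative:
  # stage input from network storage onto node-local scratch (illustrative paths)
  cp /data/$USER/input.dat /scratch/local/
  # run the job against the local copy, then copy results back to /data or /home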

Scheduler

Job queuing is provided by SLURM; the following hosts have been configured as SLURM submit hosts for the Apollo nodes:

  • apollo-login1.sns.ias.edu

Access to the Apollo nodes is restricted and requires a cluster account.

Submitting / Connecting to Apollo Nodes

You can submit jobs to the Apollo nodes from apollo-login1.sns.ias.edu by requesting a GPU resource. A job submit script will automatically assign you to the appropriate queue. At this time we are enforcing a maximum of four GPUs per job.
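
For non-interactive work, the same resource flags can be placed in a batch script and submitted with sbatch; the sketch below is illustrative only (time limit and command chosen as an example):
  #!/bin/bash
  #SBATCH --time=0:10:00
  #SBATCH --gpus=1
  # simply report the allocated GPU
  nvidia-smi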

A100 Nodes

A100 GPU resources can be requested by using --gpus=1, --gres=gpu:1, or --gpus-per-node=1:
  srun --time=1:00 --gpus=1 nvidia-smi
  srun --time=1:00 --gres=gpu:1 nvidia-smi
  srun --time=1:00 --gpus-per-node=1 nvidia-smi
 

H200 Nodes

H200 GPU resources are requested in the same way as on the A100 nodes, with the addition of the partition flag --partition=h200:
  srun --time=1:00 --gres=gpu:1 --partition=h200 nvidia-smi
  srun --time=1:00 --gpus-per-node=1 --partition=h200 nvidia-smi

You can ssh to an Apollo node once you have an active job or allocation on that node:
  apollo-login1$> salloc --time=5:00 --gpus=1
    salloc: Granted job allocation 134
    salloc: Waiting for resource configuration
    salloc: Nodes apollo01 are ready for job
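
With the allocation active, you can then ssh to the node reported by salloc (apollo01 in the example above) and work on it directly:
  apollo-login1$> ssh apollo01
  apollo01$> nvidia-smi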

QoS Limits

  QOS         Time Limit   GPUs per user   Jobs per user   Nodes
  gpu-short   24 hours     4               2               A100
  gpu-long    72 hours     4               2               A100
  gpu-h200    24 hours     8               1               H200
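
The limits above can also be queried from the scheduler itself, assuming sacctmgr is available to users on the login host:
  apollo-login1$> sacctmgr show qos format=Name,MaxWall,MaxTRESPU,MaxJobsPU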

 

Checking GPU Usage

You can check GPU usage with the nvidia-smi command. Note that nvidia-smi must be run on a GPU node: you can either ssh interactively to any node on which you have an active job, or attach to an already allocated job with srun --jobid=<JOBID> nvidia-smi.
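
For example, using the job ID from the salloc session shown earlier (replace 134 with your own job ID, which squeue will list):
  apollo-login1$> squeue -u $USER
  apollo-login1$> srun --jobid=134 nvidia-smi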

[Screenshot: nvidia-smi output for the Apollo GPU nodes]