Apollo GPU Nodes
Hardware
Each Apollo node has 8 NVIDIA A100 40GB GPUs, 2 x 64-core AMD EPYC 7742 processors, 1024 GB of RAM, and 15 TB of local scratch space, and runs Springdale Linux 8.
Configuration
All nodes mount the same /home and /data filesystems as the other computers in SNS. Scratch space mount points have been adjusted to make it clear whether storage is node-local or on the network: /scratch/lustre is the mount point for the parallel (Lustre) file system, and /scratch/local/ is for storage local to each node.
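As a sketch of how the two scratch areas are typically used inside a job (the paths under $USER and the file names are illustrative, not a site requirement):

# Create a per-job working directory on the node-local scratch disk
mkdir -p /scratch/local/$USER/$SLURM_JOB_ID
# Keep large shared datasets on the parallel file system; copy what the job needs locally
cp /scratch/lustre/$USER/input.dat /scratch/local/$USER/$SLURM_JOB_ID/
# ... run the job against the local copy ...
# Remove the local copy when the job finishes
rm -rf /scratch/local/$USER/$SLURM_JOB_ID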
Scheduler
Job queuing is provided by SLURM; the following host has been configured as a SLURM submit host for the Apollo nodes:
- apollo-login1.sns.ias.edu
Access to the Apollo nodes is restricted and requires a cluster account.
Submitting / Connecting to Apollo Nodes
You can submit jobs to the Apollo nodes from apollo-login1.sns.ias.edu by requesting a GPU resource. A job submit script automatically assigns your job to the appropriate queue. At this time we enforce a maximum of four GPUs per job.
GPU resources can be requested with --gpus=1, --gres=gpu:1, or --gpus-per-node=1:
srun --time=1:00 --gpus=1 nvidia-smi
srun --time=1:00 --gres=gpu:1 nvidia-smi
srun --time=1:00 --gpus-per-node=1 nvidia-smi
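The same GPU request can be placed in a batch script. The sketch below is illustrative (the job name, time limit, and output file are placeholders) and stays within the four-GPU-per-job limit:

#!/bin/bash
#SBATCH --job-name=gpu-test        # illustrative job name
#SBATCH --time=10:00               # 10-minute time limit
#SBATCH --gpus=1                   # request one GPU (maximum of four per job)
#SBATCH --output=gpu-test-%j.out   # %j expands to the job ID

# Report the GPU(s) assigned to this job
nvidia-smi

Submit it from apollo-login1.sns.ias.edu with sbatch gpu-test.sh.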
You can ssh to an Apollo node once you have an active job or allocation on that node:
apollo-login1$> salloc --time=5:00 --gpus=1
salloc: Granted job allocation 134
salloc: Waiting for resource configuration
salloc: Nodes apollo01 are ready for job
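Continuing the example above, once apollo01 is allocated you can connect to it and confirm the GPU assignment (the node name will match whatever salloc reports for your job):

apollo-login1$> ssh apollo01
apollo01$> nvidia-smi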
Checking GPU Usage
You can check GPU usage with the nvidia-smi command. Be aware that nvidia-smi must run on a GPU node: either ssh to a node where you have an active job, or attach to an already allocated job with srun --jobid=<JOBID> nvidia-smi. You can ssh interactively to any node where you have an active job.
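For example, to check the GPUs of a running job from the login node (replace 134 with your own job ID, as reported by squeue):

# List your running jobs and their IDs
squeue -u $USER
# Run nvidia-smi inside the existing allocation
srun --jobid=134 nvidia-smi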