GPU Programming
Some programs can take advantage of the unique hardware architecture in a graphics processing unit (GPU). GPUs can be used for specialized scientific computing work, including 3D modelling and machine learning. CARC’s Discovery cluster offers a few different models of GPUs for use with your jobs. In addition, Condo Cluster Program users participating in the traditional purchase model have the option to include GPUs in their dedicated resources.
0.0.1 Requesting GPU resources
On Discovery, most GPU nodes are available on the gpu partition. Some GPU nodes are also available on the main and debug partitions. Enter the nodeinfo command for more information.
To request a GPU on the gpu partition in a batch job, first add the following line to your Slurm job script:
#SBATCH --partition=gpu
Or similarly, use the main or debug partition where other GPUs may be available.
Remember to add one of the following options to your Slurm job script to request the type and number of GPUs you would like to use:
#SBATCH --gpus-per-task=<number>
or
#SBATCH --gpus-per-task=<gpu_type>:<number>
where <number> is the number of GPUs requested per task and <gpu_type> is a GPU model.
Please note that requesting more than 1 GPU does not necessarily mean that your job will use more than 1 GPU. Your program may need to be modified in order to make use of more than 1 GPU.
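One way to check how many GPUs your job can actually see is to inspect the CUDA_VISIBLE_DEVICES environment variable, which Slurm typically sets for jobs that request GPUs. The following is a minimal sketch, not a CARC-provided tool: the helper name count_visible_gpus is hypothetical, and it assumes the variable holds a comma-separated list of device indices.

```shell
# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES.
# The body runs in a subshell, so the IFS and globbing changes stay local.
count_visible_gpus() (
    devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        echo 0
        exit 0
    fi
    IFS=','
    set -f              # disable filename globbing while splitting
    set -- $devices     # split the comma-separated list into arguments
    echo "$#"
)

echo "GPUs visible to this job: $(count_visible_gpus)"
```

Seeing more than one GPU here only confirms the allocation; your program still has to be written to distribute work across them.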
If using more than 1 GPU and 1 task, you can also use the options --gpus-per-node and --gpus-per-socket if desired. You may also want to use the --gpu-bind option to bind tasks to specific GPUs in order to improve performance; for example, --gpu-bind=single:1 to bind each task to a single, unique GPU.
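For instance, these options might be combined as follows in a job script that runs two tasks, each bound to its own GPU. This is a sketch only: the task count and the a40 GPU type are illustrative choices, and whether binding improves performance depends on your program.

```shell
# Illustrative job script fragment (assumed values, adjust to your needs)
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gpus-per-task=a40:1
#SBATCH --gpu-bind=single:1
```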
For Discovery nodes, use the chart below to determine which GPU type to specify:
GPU type | GPU model | Partitions | Max number of GPUs per node |
---|---|---|---|
a100 | NVIDIA Tesla A100 | gpu | 2 |
a40 | NVIDIA Tesla A40 | gpu | 2 |
v100 | NVIDIA Tesla V100 | gpu | 2 |
p100 | NVIDIA Tesla P100 | gpu, debug | 2 |
k40 | NVIDIA Tesla K40 | main, debug | 2 |
Also note that some A100 GPUs have 40 GB of GPU memory and some have 80 GB of GPU memory. To request a specific A100 model, add one of the following options:
#SBATCH --constraint=a100-40gb
or
#SBATCH --constraint=a100-80gb
On Endeavour, there may be different GPU types or more than 2 GPUs per node, depending on what the condo group has selected.
For interactive jobs, use similar options with the salloc command:
salloc --partition=gpu --ntasks=1 --gpus-per-task=<gpu_type>:<number>
To see a list of currently available GPUs, enter noderes -f -g.
The maximum number of GPUs that can be used at one time per user, in one job or across multiple jobs, is 36.
There are a few commands you can use for more detailed node and GPU information. The lscpu command provides information about CPUs. The nvidia-smi command and its various options provide information about GPUs. After running module load nvhpc, you can also use the nvaccelinfo command to view information about GPUs. In addition, after running module load gcc/11.3.0 hwloc, you can use the lstopo command to view a node’s topology.
0.0.1.1 System Unit (SU) charges
Each job will subtract from your project’s allocated System Units (SUs) depending on the types of resources you request:
Resource reserved for 1 minute | SUs charged |
---|---|
1 CPU | 1 |
4 GB memory | 1 |
1 A100 or A40 GPU | 8 |
1 V100 or P100 GPU | 4 |
1 K40 GPU | 2 |
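As a worked example of the table above, the sketch below estimates the SU charge for a job. It assumes that the per-minute charges for CPUs, memory, and GPUs simply add up and that memory is requested in multiples of 4 GB. Neither the function name estimate_sus nor the rounding behavior comes from CARC, so treat the result as an estimate.

```shell
# Hypothetical helper: estimate_sus CPUS MEM_GB GPUS GPU_RATE MINUTES
# GPU_RATE is the per-GPU SU rate from the table (8, 4, or 2).
# Assumption: charges for each resource type are summed per minute.
estimate_sus() {
    cpus=$1 mem_gb=$2 gpus=$3 gpu_rate=$4 minutes=$5
    echo $(( (cpus + mem_gb / 4 + gpus * gpu_rate) * minutes ))
}

# 4 CPUs, 16 GB of memory, and 1 A40 GPU for 60 minutes:
# (4 + 16/4 + 1*8) * 60 = 960
estimate_sus 4 16 1 8 60   # prints 960
```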
0.0.2 Loading GPU-related modules
GPU-enabled software often requires the CUDA Toolkit or the cuDNN library. These are available as modules and can be found by running:
module spider cuda
module spider cudnn
There are multiple versions available. To load the modules, for example, run:
module purge
module load gcc/13.3.0
module load cuda/12.6.3
In addition, the NVIDIA HPC SDK with associated compilers, libraries, and related tools is available as a module:
module purge
module load nvhpc/24.5
If you require a different version of one of these modules that is not currently installed on CARC systems, please submit a help ticket and we will install it for you.
0.0.3 Compiling CUDA programs
After a cuda or nvhpc module is loaded, you can use the nvcc command to compile a CUDA C/C++ program:
nvcc program.cu -o program
Enter nvcc --help for more information on the available compiler options.
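One commonly used nvcc option is -arch, which selects the GPU architecture to compile for. The sketch below maps the Discovery GPU types to the compute capability of each model. The helper gpu_arch_flag is hypothetical and only prints a suggested command rather than running the compiler. Note that CUDA 12 and later no longer support the Kepler architecture (sm_35) used by the K40, so an older cuda module may be needed for that GPU.

```shell
# Hypothetical helper: print the nvcc -arch flag for a Discovery GPU type.
gpu_arch_flag() {
    case "$1" in
        a100) echo "-arch=sm_80" ;;   # Ampere, compute capability 8.0
        a40)  echo "-arch=sm_86" ;;   # Ampere, compute capability 8.6
        v100) echo "-arch=sm_70" ;;   # Volta, compute capability 7.0
        p100) echo "-arch=sm_60" ;;   # Pascal, compute capability 6.0
        k40)  echo "-arch=sm_35" ;;   # Kepler, compute capability 3.5
        *)    echo "unknown GPU type: $1" >&2; return 1 ;;
    esac
}

# Show the compile command you might run for an A40 GPU:
echo "nvcc $(gpu_arch_flag a40) program.cu -o program"
```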
For the nvhpc module, in addition to nvcc, there are NVIDIA’s HPC compilers nvc, nvc++, and nvfortran. For example, to compile a CUDA Fortran program:
nvfortran program.cuf -o program
One advantage of these HPC compilers is that they provide GPU-acceleration of standard C++ and Fortran programs that are not explicitly written for GPUs.
0.0.4 Example Slurm job script
The following is an example Slurm job script for GPU jobs:
#!/bin/bash
#SBATCH --account=<project_id>
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=a40:1
#SBATCH --mem=16G
#SBATCH --time=1:00:00
module purge
module load nvhpc/24.5
./program
Each line is described below:
Command or Slurm argument | Meaning |
---|---|
#!/bin/bash | Use Bash to execute this script |
#SBATCH | Syntax that allows Slurm to read your requests (ignored by Bash) |
--account=<project_id> | Charge compute resources used to <project_id>; enter myaccount to view your available project IDs |
--partition=gpu | Submit job to the gpu partition |
--nodes=1 | Use 1 compute node |
--ntasks=1 | Run 1 task (e.g., running a CUDA program) |
--cpus-per-task=4 | Reserve 4 CPUs for your exclusive use |
--gpus-per-task=a40:1 | Reserve 1 A40 GPU for your exclusive use |
--mem=16G | Reserve 16 GB of memory for your exclusive use |
--time=1:00:00 | Reserve resources described for 1 hour |
module purge | Clear environment modules |
module load nvhpc/24.5 | Load the nvhpc compilers and libraries environment module |
./program | Run program |
Make sure to adjust the resources requested based on your needs, but keep in mind that requesting fewer resources should lead to less queue time for your job.
0.0.5 Additional resources
If you have questions or need help, please submit a help ticket and we will assist you.