Discovery Resource Overview
CARC’s general-use HPC cluster, Discovery, has over 20,000 cores across 500 compute nodes available for researchers to use.
Discovery is a shared resource, so limits are in place on the size and duration of jobs to ensure that everyone has a chance to run them. For details on these limits, see Running Jobs.
Slurm partitions
Discovery provides several general-use Slurm partitions, available to all researchers, each intended for a different purpose and containing different types of compute nodes. Each partition has a separate job queue. The table below describes the intended purpose of each partition:
Partition | Purpose |
---|---|
main | Serial and small-to-medium parallel jobs |
epyc-64 | Serial and medium-to-large parallel jobs |
gpu | Jobs requiring GPUs |
oneweek | Long-running jobs (up to 7 days) |
largemem | Jobs requiring larger amounts of memory (up to 1.5 TB) |
debug | Short-running jobs for debugging purposes |
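To run in a particular partition, specify it in your job script with the `--partition` option. The following is a minimal sketch of a job script header: the resource values are illustrative only, the executable name is hypothetical, and the actual limits (and any required account or other site-specific options) are described in Running Jobs.

```bash
#!/bin/bash
#SBATCH --partition=main        # one of: main, epyc-64, gpu, oneweek, largemem, debug
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4       # illustrative values; see Running Jobs for limits
#SBATCH --mem=16G
#SBATCH --time=01:00:00         # maximum walltime depends on the partition
##SBATCH --account=<project_id> # uncomment and fill in if an account is required

module purge

srun ./my_program               # hypothetical executable
```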
Node specifications
Each partition has a different mix of compute nodes. The table below describes the available node types and the partitions in which they are located. Each node typically has two sockets, with one multi-core processor per socket and an equal number of cores per processor. The number of nodes per partition varies and may change over time. In the table below, the CPUs/node column refers to logical CPUs, where 1 logical CPU = 1 core = 1 thread.
CPU model | Microarchitecture | CPU frequency | CPUs/node | Memory/node | GPU model | GPUs/node | Partitions |
---|---|---|---|---|---|---|---|
epyc-9534 | zen4 | 2.45 GHz | 64 | 748 GB | L40S | 3 | gpu |
epyc-9354 | zen4 | 3.25 GHz | 64 | 1498 GB | — | — | largemem |
epyc-7513 | zen3 | 2.60 GHz | 64 | 248 GB | — | — | main, epyc-64, largemem |
epyc-7513 | zen3 | 2.60 GHz | 64 | 248 GB | A100 | 2 | gpu |
epyc-7313 | zen3 | 3.00 GHz | 64 | 248 GB | A40 | 2 | gpu, debug |
epyc-7542 | zen2 | 2.90 GHz | 64 | 248 GB | — | — | main |
epyc-7282 | zen2 | 2.80 GHz | 32 | 248 GB | A40 | 2 | gpu |
xeon-6130 | skylake_avx512 | 2.10 GHz | 32 | 184 GB | V100 | 2 | gpu |
xeon-4116 | skylake_avx512 | 2.10 GHz | 24 | 185 GB | — | — | main, oneweek, debug |
xeon-4116 | skylake_avx512 | 2.10 GHz | 24 | 89 GB | — | — | main, oneweek |
xeon-2640v4 | broadwell | 2.40 GHz | 20 | 123 GB | P100 | 2 | gpu, debug |
xeon-2640v4 | broadwell | 2.40 GHz | 20 | 60 GB | — | — | oneweek |
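As a sketch of how these node sizes translate into resource requests, a job that needs an entire 64-core node in the epyc-64 partition could ask for all cores and all memory on one node (whether the node is granted exclusively also depends on site policy):

```bash
#SBATCH --partition=epyc-64
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64   # all 64 cores on one node
#SBATCH --mem=0                # Slurm convention: request all available memory on the node
```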
Use the `noderes -c` command to see a list of nodes and their configured resources. To see this information by partition, add the partition filter option. For example, to only see nodes in the gpu partition, use the command `noderes -c -p gpu`. For help information, use the `noderes -h` command.
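Standard Slurm commands report similar information. For example, `sinfo` with format specifiers (a sketch; choose whichever columns you need):

```bash
# Nodes in the gpu partition with CPU count, memory (MB), and generic resources (GPUs)
sinfo -p gpu -N -o "%N %c %m %G"

# All partitions with their time limits, node counts, CPUs per node, and memory per node
sinfo -o "%P %l %D %c %m"
```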
There are a few commands you can use for more detailed node information. The `lscpu` command provides detailed CPU information. For nodes with GPUs, the `nvidia-smi` command and its various options provide information about the GPUs. Alternatively, after `module load nvhpc`, use the `nvaccelinfo` command to view information about GPUs. After `module load gcc/13.3.0 hwloc`, use the `lstopo` command to view a node’s topology.
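These commands describe the node they run on, so they are most useful from within a job on a compute node. Below is a sketch of an interactive session on a GPU node; the `salloc` options are illustrative, you may also need an account option, and depending on site configuration you may need `srun` to run commands on the allocated node.

```bash
# Request a short interactive allocation on a GPU node (options are illustrative)
salloc --partition=gpu --gres=gpu:1 --cpus-per-task=4 --time=00:30:00

lscpu                          # CPU details
nvidia-smi                     # GPU details

module load nvhpc
nvaccelinfo                    # alternative GPU report from the NVIDIA HPC SDK

module load gcc/13.3.0 hwloc
lstopo                         # node topology (sockets, cores, caches, GPUs)
```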
GPU specifications
The following table summarizes the GPU specifications:
GPU Model | Partitions | Architecture | Memory | Memory Bandwidth | Base Clock Speed | CUDA Cores | Tensor Cores | Single Precision Performance (FP32) | Double Precision Performance (FP64) |
---|---|---|---|---|---|---|---|---|---|
L40S | gpu | ada lovelace | 48 GB | 864 GB/s | 1110 MHz | 18176 | 568 | 91.6 TFLOPS | 1.4 TFLOPS |
A100 | gpu | ampere | 80 GB | 1.9 TB/s | 1065 MHz | 6912 | 432 | 19.5 TFLOPS | 9.7 TFLOPS |
A100 | gpu | ampere | 40 GB | 1.6 TB/s | 765 MHz | 6912 | 432 | 19.5 TFLOPS | 9.7 TFLOPS |
A40 | gpu, debug | ampere | 48 GB | 696 GB/s | 1305 MHz | 10752 | 336 | 37.4 TFLOPS | 584.6 GFLOPS |
V100 | gpu | volta | 32 GB | 900 GB/s | 1230 MHz | 5120 | 640 | 14 TFLOPS | 7 TFLOPS |
P100 | gpu, debug | pascal | 16 GB | 732 GB/s | 1189 MHz | 3584 | n/a | 9.3 TFLOPS | 4.7 TFLOPS |
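When a specific GPU model matters (for example, FP64-heavy work suits the A100 or V100, while FP32-heavy work suits the L40S or A40), the model can typically be requested by its GRES type name. The type names below are assumptions based on the models in the table; verify the configured names with `noderes -c -p gpu` or `scontrol show node <nodename>`.

```bash
# Request one A100 GPU in the gpu partition (GRES type name assumed; verify before use)
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:1

# Or request any available GPU model and let the scheduler choose
##SBATCH --gres=gpu:1
```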