Supported Compilers
CARC provides multiple C/C++/Fortran compilers, each with its own benefits.
A compiler toolchain has three main parts:
- Front end
  - Language specific
  - Checks source code for syntax errors
  - Converts source code into a format that can be interpreted by the next stage
- Optimizer
  - Normally language agnostic
  - Attempts to speed up code
- Back end
  - Hardware specific
  - Creates binary/executable
  - Takes advantage of unique hardware features
0.0.1 GNU Compiler Collection (GCC)
GCC is an open-source set of tools for compiling source code. The majority of CARC’s software stack is built with GCC because it is compatible with most packages and hardware.
Load the latest version available:
$ module load gcc
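As a rough illustration of the three stages described earlier, the following sketch drives GCC one step at a time. It assumes gcc is already on your PATH (for example after loading the module); the file name hello.c and the scratch directory are placeholders invented for this example. The mapping of flags to stages is approximate, since GCC normally runs every stage in a single command.

```shell
#!/bin/sh
# Assumption: gcc is on PATH, e.g. after `module load gcc`.
# hello.c and the scratch directory are placeholders for this sketch.
workdir=$(mktemp -d)

cat > "$workdir/hello.c" <<'EOF'
#include <stdio.h>

int main(void)
{
    printf("hello from gcc\n");
    return 0;
}
EOF

# Front end only: parse the source and report syntax errors, emit nothing.
gcc -fsyntax-only "$workdir/hello.c"

# Front end, optimizer (-O2), and code generation, stopping before the
# assembler so the hardware-specific assembly can be inspected.
gcc -O2 -S "$workdir/hello.c" -o "$workdir/hello.s"

# Full pipeline: compile, assemble, and link into an executable.
gcc -O2 "$workdir/hello.c" -o "$workdir/hello"

# Run the result.
"$workdir/hello"
```

Comparing hello.s built with -O0 and with -O2 is a simple way to see what the optimizer changes.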
0.0.2 LLVM
LLVM is an open-source “collection of modular and reusable compiler and toolchain technologies”. Because it is modular, LLVM attempts to make it possible for users to modify their work at various stages. For example, you could create a front end for your own programming language but still use LLVM’s existing optimizer and back end.
The Intel, AMD, and NVIDIA compiler suites are based on LLVM and provide back-end optimizations for the hardware architectures they support.
Load the latest version available:
$ module load llvm
0.0.3 Intel
Intel oneAPI provides compiler tools, an MPI library, and performance optimization tools. These can provide enhanced performance on Intel hardware.
Load the latest version available:
$ module load intel-oneapi
0.0.4 NVIDIA High Performance Computing Software Development Kit (NVIDIA HPC SDK)
CARC offers GPUs from NVIDIA to facilitate diverse HPC workloads, including the P100, V100, A100, and A40 GPU models. To complement the GPU hardware, we offer NVIDIA programming tools essential to maximizing productivity and optimizing GPU acceleration. The latest programming tools are all included in the NVIDIA HPC SDK, available via module load nvhpc. They include both the former NVIDIA CUDA compilers and PGI compilers, as well as state-of-the-art NVIDIA GPU libraries, debuggers, and profilers.
Load the latest version available:
$ module load nvhpc
Once you have loaded the nvhpc module, the NVIDIA CUDA Compiler (nvcc) becomes available:
$ nvcc --version
The NVIDIA HPC SDK provides several compilers, libraries, and tools:
- compilers: nvfortran/nvc/nvc++
- nvcc
- NCCL
- NVSHMEM
- cuBLAS
- cuFFT
- cuFFTMp
- cuRAND
- cuSOLVER
- cuSOLVERMp
- cuSPARSE
- cuTENSOR
- Nsight Compute
- Nsight Systems
- OpenMPI
- HPC-X
- UCX
- OpenBLAS
- ScaLAPACK
- Thrust
- CUB
- libcu++
0.0.4.1 Compute Capability
Determine the compute capability of the available NVIDIA GPUs to compile and execute CUDA code efficiently.
Compute capability is represented by a version number (sometimes called its “SM version”) and identifies the GPU hardware’s supported features. It is used by applications at runtime to determine which hardware features (such as tensor cores and L2 cache) and instructions (such as Bfloat16-precision floating-point operations) are available on the GPU device.
In CUDA, GPU architectures are named sm_xy, where x denotes the GPU generation number and y the version within that generation. For example, compute capability 8.6 corresponds to sm_86.
The compute capability version of a particular GPU should not be confused with the CUDA version (e.g. CUDA 10.2, CUDA 11.0, CUDA 11.8).
The compute capability of an NVIDIA GPU compute node can be checked with nvidia-smi.
The following commands start an interactive session on an A40 GPU compute node and query its compute capability with nvidia-smi:
$ salloc --partition gpu --gres=gpu:a40:1
$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
8.6
The following table lists the compute capability of the four different GPU device types available on CARC HPC systems:
GPU Model | Compute Capability | Architecture |
---|---|---|
Tesla P100 | 6.0 | Pascal |
NVIDIA V100 | 7.0 | Volta |
NVIDIA A100 | 8.0 | Ampere |
NVIDIA A40 | 8.6 | Ampere |
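Because nvcc expects architecture names rather than dotted version numbers, it can be convenient to convert the value reported by nvidia-smi. The helpers below are a minimal sketch in plain POSIX shell; the function names cap_to_sm and cap_to_compute are made up for this example and are not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CUDA): turn a compute capability such as
# "8.6" into the names nvcc uses, by dropping the dot.
cap_to_sm() {
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

cap_to_compute() {
    printf 'compute_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Values taken from the table of CARC GPU models.
for cap in 6.0 7.0 8.0 8.6; do
    echo "$cap -> $(cap_to_sm "$cap") / $(cap_to_compute "$cap")"
done
```

On a GPU node, the input could come from nvidia-smi --query-gpu=compute_cap --format=csv,noheader, which prints the value without the header line.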
0.0.4.2 CUDA Compiler Options
- -arch: Specifies the virtual compute architecture that the PTX code should be generated against. The valid format is: -arch=compute_XY
- -code: Specifies the actual sm architecture the SASS code should be generated against and included in the binary. The valid format is: -code=sm_XY
- -code: Can also specify which PTX code should be included in the binary for forward compatibility. The valid format is: -code=compute_XY
- -gencode: Combines both -arch and -code. The valid format is: -gencode=arch=compute_XY,code=sm_XY
To compile CUDA code so that it runs on all four types of GPUs available on CARC HPC, use the following CUDA compiler flags: -code, -arch, -gencode.
Compile-time Compatibility:
- -arch=compute_Xa is compatible with -code=sm_Xb when a≤b
- -arch=compute_X* is incompatible with -code=sm_Y*
Runtime Compatibility:
- binaries built with -code=sm_XY will only run on the X.Y architecture
- binaries built with -code=compute_Xa will run on Xb architecture with JIT when b≥a
- binaries built with -code=compute_ab will run on cd architecture with JIT when c,d≥a,b
Compile CUDA code so that it runs on all four types of GPU architecture available on CARC HPC with --generate-code (the long form of -gencode):
$ nvcc cuda_code.cu \
--generate-code arch=compute_60,code=sm_60 \
--generate-code arch=compute_70,code=sm_70 \
--generate-code arch=compute_80,code=sm_80 \
--generate-code arch=compute_86,code=sm_86
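The same flag list can be generated rather than typed by hand. The sketch below only prints an nvcc command line and does not run the compiler; the function name gencode_flags is invented for this example. As an optional extra, it appends PTX for the last listed architecture (code=compute_XY), which, following the runtime compatibility rules above, allows newer GPUs to JIT-compile the code.

```shell
#!/bin/sh
# Hypothetical helper: build one -gencode entry per architecture number
# (compute capability without the dot), plus a PTX entry for the last one.
# Assumes the architectures are passed in ascending order.
gencode_flags() {
    flags=""
    last=""
    for a in "$@"; do
        flags="$flags -gencode arch=compute_$a,code=sm_$a"
        last=$a
    done
    # PTX for the newest listed architecture, for forward compatibility via JIT.
    if [ -n "$last" ]; then
        flags="$flags -gencode arch=compute_$last,code=compute_$last"
    fi
    # Drop the leading space before printing.
    printf '%s\n' "${flags# }"
}

# The four GPU types available on CARC systems: P100, V100, A100, A40.
echo "nvcc cuda_code.cu $(gencode_flags 60 70 80 86)"
```

Keeping the architecture list in one place makes it easier to update the build when GPU models are added or retired.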
0.0.4.3 CUDA example code
The following commands will initiate an interactive session on a P100 GPU compute node, download (wget) and compile (nvcc) the devicequery.cu CUDA code, and run the generated executable.
$ salloc -p debug --gres=gpu:p100:1
$ module purge
$ module load nvhpc
$ wget https://raw.githubusercontent.com/welcheb/CUDA_examples/master/devicequery.cu
$ nvcc -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 devicequery.cu -o devicequery.x
$ srun -n 1 ./devicequery.x
The devicequery.cu CUDA code example can help new users get familiar with the GPU resources available on the HPC cluster. The output for a P100 GPU node should look similar to the following:
CUDA Device Query...
There are 1 CUDA devices.
CUDA Device #0
Major revision number: 6
Minor revision number: 0
Name: Tesla P100-PCIE-16GB
Total global memory: 4186898432
Total shared memory per block: 49152
Total registers per block: 65536
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 2147483647
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1328500
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 56
Kernel execution timeout: No
0.0.5 Additional resources
If you have questions about using compilers, please submit a help ticket and we will assist you.