Supported Compilers

Last updated July 06, 2023

CARC provides multiple C/C++/Fortran compilers, each with its own strengths.

A compiler tool set has three main parts (illustrated with GCC after the list):

  • Front end
    • Language specific
    • Checks source code for syntax errors
    • Converts source code into format that can be interpreted by next stage
  • Optimizer
    • Normally language agnostic
    • Attempts to speed up code
  • Back end
    • Hardware specific
    • Creates binary/executable
    • Takes advantage of unique hardware features
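
You can see these stages separately with GCC. A minimal sketch, assuming a C source file named hello.c (the stage boundaries are approximate, since parsing and optimization both happen during compilation):

$ gcc -E hello.c -o hello.i        # front end: preprocess the source
$ gcc -S -O2 hello.i -o hello.s    # front end + optimizer: compile to optimized assembly
$ gcc -c hello.s -o hello.o        # back end: assemble into machine code
$ gcc hello.o -o hello             # link into an executable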

0.0.1 GNU Compiler Collection (GCC)

GCC is an open source set of tools for compiling source code. A majority of CARC’s software stack is built with GCC because it is compatible with most packages and hardware.

Versions available: 8.3.0, 9.2.0, 11.3.0, 12.3.0
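
A minimal usage sketch, assuming the module is named gcc/11.3.0 (run module avail gcc to see the exact names on CARC systems):

$ module purge
$ module load gcc/11.3.0
$ gcc -O2 program.c -o program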

0.0.2 LLVM

LLVM is an open source “collection of modular and reusable compiler and toolchain technologies”. Because it is modular, LLVM lets users replace individual stages of the tool set: for example, you could create a front end for your own programming language while still using LLVM’s existing optimizer and back end.

The Intel, AMD, and NVIDIA compiler suites are based on LLVM, each providing back-end optimizations for the hardware architecture it supports.

Versions available: 14.0.2
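
Because the stages are separable, clang (LLVM’s C/C++ front end) can emit LLVM’s intermediate representation directly. A minimal sketch, assuming an llvm module provides clang and a source file hello.c:

$ module load llvm
$ clang -S -emit-llvm hello.c -o hello.ll    # front end: C source to LLVM IR
$ clang -O2 hello.ll -o hello                # optimizer and back end: IR to executable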

0.0.3 Intel

Intel provides compiler tools, an MPI library, and performance optimization tools, which can deliver enhanced performance on Intel hardware.

Versions available: 18.0.4, 19.0.4, oneAPI 2021.3
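
A minimal usage sketch, assuming the module is named intel/19.0.4 and the classic icc driver (the oneAPI releases use icx instead):

$ module purge
$ module load intel/19.0.4
$ icc -O2 -xHost program.c -o program    # -xHost optimizes for the host CPU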

0.0.4 AOCC

AMD provides the AMD Optimizing C/C++ and Fortran Compiler (AOCC), which provides enhanced performance on AMD hardware.

Versions available: 3.1.0
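
AOCC is itself LLVM-based, so its drivers are clang, clang++, and flang. A minimal sketch, assuming the module is named aocc/3.1.0 and Zen 2 CPUs (adjust -march to match the node):

$ module purge
$ module load aocc/3.1.0
$ clang -O3 -march=znver2 program.c -o program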

0.0.5 NVIDIA High Performance Computing Software Development Kit (NVIDIA HPC SDK)

CARC offers NVIDIA GPUs to support diverse HPC workloads, including the P100, V100, A100, and A40 models. To complement the GPU hardware, we offer NVIDIA programming tools that are essential for maximizing productivity and optimizing GPU acceleration. The latest programming tools are all included in the NVIDIA HPC SDK, available via module load nvhpc. The SDK includes both the NVIDIA CUDA compilers and the former PGI compilers, as well as state-of-the-art NVIDIA GPU libraries, debuggers, and profilers.

Load the NVIDIA HPC SDK on CARC HPC systems:

$ module purge
$ module load nvhpc

Print the path of the SDK installation location:

$ echo $NVHPC_ROOT
/spack/compilers/nvhpc/22.11/Linux_x86_64/22.11

Once you have loaded the nvhpc module, the NVIDIA CUDA Compiler (nvcc) becomes available:

$ nvcc --version

The NVIDIA HPC SDK provides several applications (a usage sketch follows the list):

  • compilers: nvfortran/nvc/nvc++
  • nvcc
  • NCCL
  • NVSHMEM
  • cuBLAS
  • cuFFT
  • cuFFTMp
  • cuRAND
  • cuSOLVER
  • cuSOLVERMp
  • cuSPARSE
  • cuTENSOR
  • Nsight Compute
  • Nsight Systems
  • Open MPI
  • HPC-X
  • UCX
  • OpenBLAS
  • ScaLAPACK
  • Thrust 1.9.7
  • CUB
  • libcu++
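
The SDK compilers can link these libraries directly with the -cudalib option. A minimal sketch, assuming a CUDA C++ source file saxpy.cpp that calls cuBLAS:

$ nvc++ -cuda -cudalib=cublas saxpy.cpp -o saxpy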

0.0.5.1 CUDA Version

The NVIDIA HPC SDK supports three versions of CUDA: 10.2, 11.0, and 11.8. The default version used by the nvhpc/22.11 module is 11.0. To use a different compatible CUDA version, set the following environment variables in your working environment (e.g., a Slurm job script), substituting <version> with the desired CUDA version (10.2 or 11.8):

export NVCUDADIR=/spack/compilers/nvhpc/22.11/Linux_x86_64/22.11/cuda/<version>
export PATH=/spack/compilers/nvhpc/22.11/Linux_x86_64/22.11/cuda/<version>/bin:$PATH
export LD_LIBRARY_PATH=/spack/compilers/nvhpc/22.11/Linux_x86_64/22.11/cuda/<version>/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/spack/compilers/nvhpc/22.11/Linux_x86_64/22.11/cuda/<version>/extras/CUPTI/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/spack/compilers/nvhpc/22.11/Linux_x86_64/22.11/cuda/<version>/lib64:$LIBRARY_PATH
export LIBRARY_PATH=/spack/compilers/nvhpc/22.11/Linux_x86_64/22.11/cuda/<version>/extras/CUPTI/lib64:$LIBRARY_PATH
export CPATH=/spack/compilers/nvhpc/22.11/Linux_x86_64/22.11/cuda/<version>/include:$CPATH
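
After exporting these variables, confirm that the selected toolkit is the one found first on your PATH; nvcc should report the release you chose:

$ which nvcc
$ nvcc --version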

0.0.5.2 Compute Capability

To compile and execute CUDA code efficiently, determine the compute capability of the available NVIDIA GPUs.

Compute capability is represented by a version number (sometimes called the “SM version”) that identifies the features supported by the GPU hardware. Applications use it at runtime to determine which hardware features (such as Tensor Cores and L2 cache) and instructions (such as Bfloat16 floating-point operations) are available on the GPU device.

In CUDA, GPU targets are named sm_xy, where x denotes the GPU generation (major version) and y the minor version.

The compute capability version of a particular GPU should not be confused with the CUDA version (e.g. CUDA 10.2, CUDA 11.0, CUDA 11.8).

The compute capability of an NVIDIA GPU compute node can be checked with nvidia-smi.

The following commands start an interactive session on an A40 GPU compute node and query its compute capability with nvidia-smi:

$ salloc --partition gpu --gres=gpu:a40:1
$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
8.6

The following table lists the compute capability of the four different GPU device types available on CARC HPC systems:

GPU Model     Compute Capability   Architecture
Tesla P100    6.0                  Pascal
NVIDIA V100   7.0                  Volta
NVIDIA A100   8.0                  NVIDIA Ampere
NVIDIA A40    8.6                  NVIDIA Ampere
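
You can also query the compute capability from CUDA code using the runtime API. A minimal sketch (save as compute_cap.cu and compile with nvcc compute_cap.cu -o compute_cap):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);                 // number of visible GPU devices
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);      // fills in name, major, minor, etc.
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}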

0.0.5.3 CUDA Compiler Options

  • -arch : Specifies the virtual compute architecture that the PTX code should be generated against. The valid format is: -arch=compute_XY
  • -code: Specifies the actual sm architecture the SASS code should be generated against and included in the binary. The valid format is: -code=sm_XY
  • -code: Can also specify which PTX code should be included in the binary for forward compatibility. The valid format is: -code=compute_XY
  • -gencode: combines both -arch and -code. The valid format is: -gencode=arch=compute_XY,code=sm_XY

To compile CUDA code so that it runs on all four GPU types available on CARC HPC systems, use these flags to target each architecture explicitly; a complete example using --generate-code is shown below.

Compile-time Compatibility:

  • -arch=compute_Xa is compatible with -code=sm_Xb when a≤b
  • -arch=compute_X* is incompatible with -code=sm_Y* when X≠Y (that is, across different GPU generations)

Runtime Compatibility:

  • binaries built with -code=sm_XY will only run on the X.Y architecture
  • binaries built with -code=compute_Xa will run on an X.b architecture with JIT compilation when b≥a
  • binaries built with -code=compute_ab will run on a c.d architecture with JIT compilation when c.d≥a.b

For example, a binary built with -code=sm_60 runs only on the P100 (6.0), while a binary built with -code=compute_60 embeds PTX that can also be JIT-compiled at runtime on the V100, A100, and A40.

To compile CUDA code so that it runs on all four GPU architectures available on CARC HPC systems, use --generate-code:

nvcc cuda_code.cu \
--generate-code arch=compute_60,code=sm_60 \
--generate-code arch=compute_70,code=sm_70 \
--generate-code arch=compute_80,code=sm_80 \
--generate-code arch=compute_86,code=sm_86

0.0.5.4 CUDA Example Code

The following commands will initiate an interactive session on a P100 GPU compute node, download (wget) and compile (nvcc) the devicequery.cu CUDA code, and run the generated executable.

$ salloc -p debug --gres=gpu:p100:1
$ module purge
$ module load nvhpc
$ wget https://raw.githubusercontent.com/welcheb/CUDA_examples/master/devicequery.cu
$ nvcc -gencode arch=compute_60,code=sm_60 \
    -gencode arch=compute_70,code=sm_70 \
    -gencode arch=compute_80,code=sm_80 \
    -gencode arch=compute_86,code=sm_86 \
    devicequery.cu -o devicequery.x
$ srun -n 1 ./devicequery.x

The devicequery.cu example can help new users get familiar with the GPU resources available on the HPC cluster. The output for a P100 GPU node should look similar to the following:

CUDA Device Query...
There are 1 CUDA devices.
CUDA Device #0
Major revision number:         6
Minor revision number:         0
Name:                          Tesla P100-PCIE-16GB
Total global memory:           4186898432
Total shared memory per block: 49152
Total registers per block:     65536
Warp size:                     32
Maximum memory pitch:          2147483647
Maximum threads per block:     1024
Maximum dimension 0 of block:  1024
Maximum dimension 1 of block:  1024
Maximum dimension 2 of block:  64
Maximum dimension 0 of grid:   2147483647
Maximum dimension 1 of grid:   65535
Maximum dimension 2 of grid:   65535
Clock rate:                    1328500
Total constant memory:         65536
Texture alignment:             512
Concurrent copy and execution: Yes
Number of multiprocessors:     56
Kernel execution timeout:      No

Versions available: 22.11

0.0.6 Additional resources

If you have questions about using compilers, please submit a help ticket and we will assist you.