Using MPI

The Message Passing Interface (MPI) is a message-passing standard used in parallel programming, typically for multi-node, distributed-memory use cases in which each process runs on its own processor with its own memory. The MPI standard has numerous implementations; CARC provides support for four of them:

  • OpenMPI
  • MPICH
  • MVAPICH2
  • Intel MPI

Loading MPI implementations

Begin by logging in. You can find instructions for this in the Getting Started with Discovery or Getting Started with Endeavour user guides.

Each MPI implementation has advantages and disadvantages.

MPI Implementation | Advantages | Disadvantages | Modules
OpenMPI | Flexible usage | Not ABI-compatible with other MPI implementations | module load gcc/11.3.0 openmpi/4.1.4
MPICH | Reference implementation, flexible usage | Not optimized for a specific platform | module load gcc/11.3.0 mpich/4.0.2
MVAPICH2 | Optimized for InfiniBand networks | Less flexible than other MPI implementations | module load gcc/11.3.0 mvapich2/2.3.7
Intel MPI | Optimized for Intel processors | Optimized only for Intel processors; may not perform as well on non-Intel (e.g., AMD) nodes | module load intel-oneapi/2021.3

Load the appropriate MPI implementation for your program. Other versions of the MPI implementations may be available. To see all available versions, search with module spider <MPI implementation>.
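For example, to see every available OpenMPI version, and then the dependency modules needed to load a specific version:

# List all available versions of an MPI implementation (OpenMPI shown here)
module spider openmpi

# Show how to load a specific version, including required dependency modules
module spider openmpi/4.1.4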

MPICH, MVAPICH2, and Intel MPI can be freely interchanged because of their common Application Binary Interface (ABI). This means, for example, that you can compile a program with MPICH but run it using the Intel MPI libraries, thus taking advantage of the functionality of Intel MPI.
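As a rough sketch of this interchange (the source file name mpi_program.c is a placeholder, and how the MPI library is resolved at run time depends on how the program was linked), you could compile against MPICH and then load the Intel MPI module in the job script:

# Compile time: build against MPICH
module purge
module load gcc/11.3.0 mpich/4.0.2
mpicc -O2 -o mpi_program mpi_program.c

# Run time (in the job script): use the ABI-compatible Intel MPI runtime instead
module purge
module load intel-oneapi/2021.3
srun --mpi=pmi2 -n $SLURM_NTASKS ./mpi_program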

These four MPI implementations are stable and thread-safe and, prior to the advent of the Unified Communication X (UCX) framework, they exhibited similar performance. The UCX framework is a collaboration between industry, laboratories, and academia formed to create an open-source, production-grade communication framework for data-centric and high-performance applications. UCX is performance-oriented: it keeps overhead in the communication path low, enabling near native-level performance, while providing a cross-platform, unified API that supports various network Host Channel Adapters and processor technologies.

Note: CARC clusters use InfiniBand networks.

MPI-UCX libraries are available only under the GCC programming environment because the Intel compilers cannot build the UCX framework. Both the OpenMPI and MPICH modules have been built with UCX support under the gcc compiler module trees. We have found the OpenMPI-UCX and MPICH-UCX libraries to exhibit the best performance out of the four different MPI libraries available on CARC clusters.
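To check whether a loaded OpenMPI module was built with UCX support, one option is to query ompi_info for UCX components; if UCX support was compiled in, entries such as the pml ucx component should appear in the output:

module load gcc/11.3.0 openmpi/4.1.4

# List OpenMPI's compiled-in components and filter for UCX
ompi_info | grep -i ucx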

Compiling MPI programs

Below is a list of MPI-specific compiler commands with their equivalent standard command versions:

Language | MPI Command (GCC, Intel) | Standard Command (GCC, Intel)
C | mpicc, mpiicc | gcc, icc
C++ | mpicxx, mpiicpc | g++, icpc
Fortran 77/90 | mpifort/mpif90/mpif77, mpiifort | gfortran, ifort

Note: Intel MPI supplies separate compiler commands (wrappers) for the Intel compilers, in the form of mpiicc, mpiicpc, and mpiifort. Using mpicc, mpicxx, and mpif90 with Intel will call the GCC compilers.

When you compile an MPI program, record the module and version of MPI used and load the same module in your Slurm job script.
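For example, to compile a C program under the GCC environment with OpenMPI (the source file name mpi_program.c is a placeholder):

module purge
module load gcc/11.3.0
module load openmpi/4.1.4

# mpicc wraps gcc and adds the MPI include and library flags
mpicc -O3 -o mpi_program mpi_program.c

# Under the Intel environment, the Intel-specific wrapper would be used instead:
#   module load intel-oneapi/2021.3
#   mpiicc -O3 -o mpi_program mpi_program.c

The same gcc/11.3.0 and openmpi/4.1.4 modules would then be loaded in the job script, as in the examples below.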

Running single-threaded MPI programs

On CARC clusters, we recommend running MPI programs with Slurm's srun command, which launches the parallel tasks. For help with srun, please consult the manual page by entering man srun or view the available options by entering srun --help. You can also use the mpiexec or mpirun commands to run MPI programs, which may provide more options depending on the MPI implementation used.

To run single-threaded MPI programs, use a command like the following within a job script:

srun --mpi=pmix_v2 -n $SLURM_NTASKS ./mpi_program

The important parameter to include is the number of MPI processes (-n). The Slurm-provided environment variable SLURM_NTASKS corresponds to the Slurm task count (i.e., the number of MPI ranks) requested with the #SBATCH --ntasks option in the job script (or the product of #SBATCH --nodes and #SBATCH --ntasks-per-node).
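For example, with the resource requests used in the job script below:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
# Slurm sets SLURM_NTASKS to 2 x 64 = 128, so srun launches 128 MPI ranks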

The default MPI process management interface for srun is pmix_v2, so srun ... is equivalent to srun --mpi=pmix_v2 .... The pmix_v2 interface is only used for the OpenMPI implementation, however. For other MPI implementations, use pmi2 instead: srun --mpi=pmi2 ....
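To see which MPI plugin types your srun installation supports, you can list them:

# List the MPI plugin types supported by this Slurm installation
srun --mpi=list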

An example job script for a single-threaded program:

#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=epyc-64
#SBATCH --constraint=epyc-7542
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=0
#SBATCH --time=24:00:00

module purge
module load gcc/11.3.0
module load openmpi/4.1.4

ulimit -s unlimited

srun --mpi=pmix_v2 -n $SLURM_NTASKS ./mpi_program

The --constraint option specifies a node feature (CPU model) to use. The --nodes option specifies how many nodes to use, and the --ntasks-per-node option specifies the number of tasks (MPI ranks) to run per node. The --cpus-per-task option specifies the number of CPUs (threads) to use per task. There is 1 thread per CPU, so only 1 CPU per task is needed for a single-threaded MPI job. The --mem=0 option requests all available memory per node. Alternatively, you could use the --mem-per-cpu option.

For best performance, MPI jobs should typically use entire nodes, ideally of the same type (CPU model, network interface, etc.). This can be achieved by using the --constraint option to specify a CPU model, which ensures a homogeneous hardware environment, keeps communication between tasks consistent, and reduces latency. The other job options should reflect the specific node characteristics (e.g., matching the number of tasks per node to the total number of CPUs per node) so that nodes are fully utilized. Use the nodeinfo command to see node types by partition (the CPU model column) and other node details (number of CPUs, amount of memory, etc.).

When running on the main partition using OpenMPI or MPICH modules, which use the UCX framework for network communication, you may need to add export UCX_TLS=sm,tcp,self to the job script to restrict the network transports that can be used.
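In that case, place the export line in the job script after the module loads and before the srun command, for example:

module purge
module load gcc/11.3.0
module load openmpi/4.1.4

# Restrict UCX to shared-memory, TCP, and loopback transports on the main partition
export UCX_TLS=sm,tcp,self

srun --mpi=pmix_v2 -n $SLURM_NTASKS ./mpi_program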

Running multi-threaded MPI programs

Multi-threaded MPI programs run multi-threaded tasks; the typical case is a hybrid MPI/OpenMP program. If using OpenMP for threading, set the OMP_NUM_THREADS environment variable, which specifies the number of threads to parallelize over. Its value should match the --cpus-per-task option requested in the job script. You can use the Slurm-provided environment variable SLURM_CPUS_PER_TASK to set OMP_NUM_THREADS and then use srun:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --mpi=pmix_v2 -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program

An example job script for a multi-threaded program:

#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=epyc-64
#SBATCH --constraint=epyc-7542
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=32
#SBATCH --mem=0
#SBATCH --time=24:00:00

module purge
module load gcc/11.3.0
module load openmpi/4.1.4

ulimit -s unlimited

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --mpi=pmix_v2 -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program

The --constraint option specifies a node feature (CPU model) to use. The --nodes option specifies how many nodes to use, and the --ntasks-per-node option specifies the number of tasks (MPI ranks) to run per node. The --cpus-per-task option specifies the number of CPUs (threads) to use per task. The --mem=0 option requests all available memory per node. Alternatively, you could use the --mem-per-cpu option.

For best performance, MPI jobs should typically use entire nodes, ideally of the same type (CPU model, network interface, etc.). This can be achieved by using the --constraint option to specify a CPU model, which ensures a homogeneous hardware environment, keeps communication between tasks consistent, and reduces latency. The other job options should reflect the specific node characteristics (e.g., matching the number of tasks per node and CPUs per task to the total number of CPUs per node, and choosing CPUs per task so that tasks stay within NUMA domains) so that nodes are fully utilized. Use the nodeinfo command to see node types by partition (the CPU model column) and other node details (number of CPUs, amount of memory, etc.).

When running on the main partition using OpenMPI or MPICH modules, which use the UCX framework for network communication, you may need to add export UCX_TLS=sm,tcp,self to the job script to restrict the network transports that can be used.

Process and thread affinity

On CARC clusters, compute nodes have a Non-Uniform Memory Access (NUMA) shared-memory architecture, so the performance of MPI programs can be improved by pinning MPI tasks and/or OpenMP threads to CPUs (cores) or NUMA domains (equivalent to processor sockets). The pinning prevents the MPI tasks and OpenMP threads from migrating to CPUs that have a more distant path to data in memory. This is known as process and thread affinity.

All nodes on CARC clusters are configured with 2 NUMA domains per node: each node has 2 processor sockets, each socket holds 1 multi-core processor that forms 1 NUMA domain, and cores and memory are split equally between the two processors. Each core can be considered a logical CPU. For example, an epyc-7542 node has 2 NUMA domains with 32 CPUs (cores) and 128 GB of memory per NUMA domain.

To view more detailed information about a node's topology, enter module load gcc/11.3.0 hwloc and run the lstopo command on that node.
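For example, from an interactive session on a compute node (the salloc options below are only placeholders; see the Getting Started guides for starting interactive sessions on your cluster):

# Request an interactive session on a compute node (options are illustrative)
salloc --partition=epyc-64 --nodes=1 --ntasks=1 --time=00:10:00

# On the compute node, load hwloc and print the node's topology
module load gcc/11.3.0 hwloc
lstopo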

Slurm's srun command will automatically bind MPI tasks to CPUs (cores). This is typically the best strategy for compute-bound applications. Memory-bound applications may also want to use the #SBATCH --ntasks-per-socket or #SBATCH --distribution options in the job script to spread tasks across NUMA domains. To see all binding options available with srun, enter srun --cpu-bind=help. If using mpiexec or mpirun, consult the documentation of the MPI implementation that you use for more information about process and thread affinity and default settings. For hybrid MPI/OpenMP programs, also consult the OpenMP documentation for more information about thread affinity. You may need to experiment and compare profiling results from different affinity strategies to find the optimal strategy for your specific programs.
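As a hedged sketch of these options for an epyc-7542 node (64 CPUs split across 2 NUMA domains), a memory-bound job might spread its tasks evenly across the two sockets and explicitly bind each task to a core:

#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --ntasks-per-socket=32

# Bind each MPI task to a core; srun --cpu-bind=help lists other binding choices
srun --mpi=pmix_v2 --cpu-bind=cores -n $SLURM_NTASKS ./mpi_program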

Running OpenMPI programs

OpenMPI is an open source implementation developed by a consortium of academic, research, and industry partners. It supports both InfiniBand and UCX. OpenMPI-UCX will automatically select the optimal network interface on compute nodes.

For single-threaded MPI programs, use the following setup:

module purge
module load gcc/11.3.0
module load openmpi/4.1.4

srun --mpi=pmix_v2 -n $SLURM_NTASKS ./mpi_program

For multi-threaded MPI programs, use the following setup:

module purge
module load gcc/11.3.0
module load openmpi/4.1.4

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --mpi=pmix_v2 -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program

Running MPICH programs

MPICH is an open source reference implementation developed at Argonne National Laboratory. Newer versions support both InfiniBand and UCX. MPICH-UCX will automatically select the optimal network interface on compute nodes.

For single-threaded MPI programs, use the following setup:

module purge
module load gcc/11.3.0
module load mpich/4.0.2

srun --mpi=pmi2 -n $SLURM_NTASKS ./mpi_program

For multi-threaded MPI programs, use the following setup:

module purge
module load gcc/11.3.0
module load mpich/4.0.2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --mpi=pmi2 -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program

Running MVAPICH2 programs

MVAPICH2, developed by the Network-Based Computing Laboratory at the Ohio State University, is an open source implementation based on MPICH and optimized for InfiniBand networks. MVAPICH2 will automatically select the optimal network interface on compute nodes.

For single-threaded MPI programs, use the following setup:

module purge
module load gcc/11.3.0
module load mvapich2/2.3.7

export MV2_USE_RDMA_CM=0
export MV2_HOMOGENEOUS_CLUSTER=1

srun --mpi=pmi2 -n $SLURM_NTASKS ./mpi_program

For multi-threaded MPI programs, use the following setup:

module purge
module load gcc/11.3.0
module load mvapich2/2.3.7

export MV2_USE_RDMA_CM=0
export MV2_HOMOGENEOUS_CLUSTER=1
export MV2_ENABLE_AFFINITY=0
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --mpi=pmi2 -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program

Running Intel MPI programs

Intel MPI is an implementation based on MPICH that is optimized for Intel processors and integrates with other Intel tools (e.g., compilers and performance tools such as VTune). Intel MPI will automatically select the optimal network interface on compute nodes.
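To verify which provider Intel MPI selects at run time, you can raise its debug level (I_MPI_DEBUG is an Intel MPI environment variable) before launching:

module purge
module load intel-oneapi/2021.3

# Print Intel MPI startup details, including the selected fabric/provider
export I_MPI_DEBUG=5

srun --mpi=pmi2 -n $SLURM_NTASKS ./mpi_program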

For single-threaded MPI programs, use the following setup:

module purge
module load intel-oneapi/2021.3

srun --mpi=pmi2 -n $SLURM_NTASKS ./mpi_program

For multi-threaded MPI programs, use the following setup:

module purge
module load intel-oneapi/2021.3

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --mpi=pmi2 -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program

Additional resources

If you have questions about or need help with MPI, please submit a help ticket and we will assist you.

  • Slurm srun guide
  • Slurm MPI guide
  • Slurm multicore support

Tutorials:

  • LLNL parallel computing tutorial
  • LLNL MPI tutorial
