OpenMP

Last updated July 20, 2023

OpenMP supports C, C++, and Fortran. Within the OpenMP programming model, all spawned threads share memory and data. The OpenMP runtime functions are declared in the header file omp.h, which a program includes with #include <omp.h>. An OpenMP program alternates between sequential sections, executed by a single thread, and parallel sections, executed by multiple threads. In general, an OpenMP program starts with a sequential section that sets up the environment, initializes variables, and performs other preliminary tasks. The parallel sections of the program cause additional threads to fork; these are commonly referred to as worker threads.

A section of code that is to be executed in parallel is marked by a special directive (#pragma omp). When execution reaches a parallel section, this directive causes worker threads to be spawned. Each thread is assigned an ID, which can be queried with the runtime library function omp_get_thread_num(). By convention, the ID of the master thread is zero. Each thread independently executes the same copy of the code contained in the OpenMP parallel section. When a thread completes its tasks, it joins the master thread; once all threads have completed their assigned work, the master thread continues executing the code that follows the parallel section.

OpenMP allows the programmer to describe the parallel code with high-level constructs, which is a simple yet powerful approach for attaining code performance speed-ups.

OpenMP provides directives that allow the programmer to:

  • specify the parallel region
  • specify how to parallelize loops
  • specify whether the variables in the parallel section are private or shared
  • specify how/if the threads are synchronized
  • specify how the work is divided among threads

At runtime, the number of threads is controlled with the OMP_NUM_THREADS environment variable and, when running under Slurm, the --cpus-per-task option, as shown in the examples below.
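The thread count can also be queried and set from inside a program. A minimal sketch using the standard runtime functions omp_set_num_threads(), omp_get_max_threads(), and omp_get_num_threads() (the 4-thread value is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
  //Request 4 threads for subsequent parallel regions; this
  //overrides the OMP_NUM_THREADS environment variable
  omp_set_num_threads(4);

  //Upper bound on the team size for the next parallel region
  printf("Max threads: %d\n", omp_get_max_threads());

  #pragma omp parallel
  {
    //omp_get_num_threads() reports the actual team size inside
    //the region; the single construct limits the print to one thread
    #pragma omp single
    printf("Team size: %d\n", omp_get_num_threads());
  }
  return 0;
}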

OpenMP parallel Construct: Spawning Worker Threads

The basic OpenMP directive is:

#pragma omp parallel 
{
  //code in this block is executed by every thread in the team
}

When the master thread reaches this line, it forks additional threads to carry out the work enclosed in the block following the #pragma omp construct. The block is executed by all threads in parallel. The original thread is denoted the master thread, with thread ID 0.

An example C program (hello_world.c) to print “Hello world” using multiple threads:

#include <stdio.h>
#include <omp.h>

int main(void)
{
  #pragma omp parallel
  {
    printf("Hello World\n");
  }
  return 0;
}

Use the -fopenmp flag to compile using GCC:

$ module purge
$ module load usc
$ gcc -fopenmp hello_world.c -o hello_world.x

Use the -qopenmp flag to compile using Intel oneAPI:

$ module purge
$ module load intel-oneapi
$ icx -qopenmp hello_world.c -o hello_world.x

Run the ‘hello_world.x’ executable on a debug node using four cores (i.e. four threads):

$ salloc --partition=debug --ntasks=1 --cpus-per-task=4
$ export OMP_NUM_THREADS=4
$ srun ./hello_world.x

Hello World
Hello World
Hello World
Hello World
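The same run can also be submitted as a batch job. A minimal sketch of an equivalent Slurm script, assuming the same partition and modules as in the interactive example above:

#!/bin/bash
#SBATCH --partition=debug
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

module purge
module load usc

# Match the thread count to the cores allocated by Slurm
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./hello_world.x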

Private vs Shared Variables

Within a parallel section, variables can be declared as private or shared:

  • private: The variable is private to each thread, which means that each thread has its own local copy. A private variable is not initialized, and its value is not maintained for use outside of the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
  • shared: The variable is shared among the spawned threads, which means that it is visible to and accessible by all threads simultaneously. By default, all variables in the work-sharing region are shared except the loop iteration counter. Shared variables must be used with care because concurrent updates to them can cause race conditions (see the synchronization example below).

The variable type (private or shared) is specified in a clause following the #pragma omp construct, as shown in the following example code.

An example C program (hello_world_id.c) to have each parallel thread print “Hello World” along with its identifier:

#include <stdio.h>
#include <omp.h>

int main (int argc, char *argv[]) {

  int th_id;

  //th_id is declared above. It is specified as private, so each
  //thread will have its own copy of th_id
  #pragma omp parallel private(th_id)
  {
    th_id = omp_get_thread_num();
    printf("Hello World from thread %d\n", th_id);
  }
  return 0;
}

Use the -fopenmp flag to compile using GCC:

$ module purge
$ module load usc
$ gcc -fopenmp hello_world_id.c -o hello_world_id.x

Use the -qopenmp flag to compile using Intel oneAPI:

$ module purge
$ module load intel-oneapi
$ icx -qopenmp hello_world_id.c -o hello_world_id.x

Run the ‘hello_world_id.x’ executable on a debug node using four cores (i.e. four threads):

$ salloc --partition=debug --ntasks=1 --cpus-per-task=4
$ export OMP_NUM_THREADS=4
$ srun ./hello_world_id.x

Hello World from thread 0
Hello World from thread 1
Hello World from thread 3
Hello World from thread 2
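
Note that the order of the output lines can differ from run to run: the threads execute concurrently and are scheduled independently, so no particular ordering is guaranteed.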

Synchronization Mechanisms

OpenMP provides programmers with several methods to synchronize parallel threads. Here are some of the most commonly used synchronization constructs:

  • critical: The enclosed code block will be executed by one thread at a time, as opposed to simultaneous execution by multiple threads. It is often used to protect shared data from race conditions.

  • atomic: The memory update (e.g. read-write) in the next instruction will be performed atomically. It does not make the entire statement atomic; only the memory update is atomic. A compiler might use special hardware instructions for better performance than when using critical.

  • ordered: The structured block is executed in the order in which iterations would be executed in a sequential loop.

  • barrier: Each thread waits until all of the other threads of a spawned team have reached this point. A work-sharing construct (e.g. #pragma omp for) imposes an implicit barrier synchronization at the end, as does the parallel region itself.

  • nowait: A clause specifying that threads completing their assigned work can proceed without waiting for all threads in the team to finish. In the absence of this clause, threads encounter a barrier synchronization at the end of the work-sharing construct.
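
For example, concurrent increments of a shared counter form a race condition; protecting the update with critical (or, for a simple memory update like this one, atomic) makes the result deterministic. A minimal sketch (the counter and its use are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {

  int count = 0;  //shared by default

  #pragma omp parallel
  {
    //Without the critical section, threads could read and
    //write count at the same time and updates would be lost
    #pragma omp critical
    count++;

    //Equivalent protection for this single memory update:
    //  #pragma omp atomic
    //  count++;
  }

  printf("count = %d\n", count);  //equals the number of threads
  return 0;
}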

The following example C code demonstrates barrier synchronization: every thread waits at the barrier until all threads have printed “Hello World from thread th_id”, and then the master thread (th_id = 0) prints the total number of threads (nthreads).

Barrier example (hello_world_barrier.c):

#include <stdio.h>
#include <omp.h>

int main (int argc, char *argv[]) {

  int th_id, nthreads;
  
  #pragma omp parallel private(th_id)
  {
    th_id = omp_get_thread_num();
    printf("Hello World from thread %d\n", th_id);

    #pragma omp barrier

    //Only the master thread reports the team size, after all
    //threads have passed the barrier
    if ( th_id == 0 )
    {
      nthreads = omp_get_num_threads();
      printf("There are %d threads\n", nthreads);
    }
  }
  return 0;
}

Use the -fopenmp flag to compile using GCC:

$ module purge
$ module load usc
$ gcc -fopenmp hello_world_barrier.c -o hello_world_barrier.x

Use the -qopenmp flag to compile using Intel oneAPI:

$ module purge
$ module load intel-oneapi
$ icx -qopenmp hello_world_barrier.c -o hello_world_barrier.x

Run the ‘hello_world_barrier.x’ executable on a debug node using four cores (i.e. four threads):

$ salloc --partition=debug --ntasks=1 --cpus-per-task=4
$ export OMP_NUM_THREADS=4
$ srun ./hello_world_barrier.x

Hello World from thread 0
Hello World from thread 1
Hello World from thread 3
Hello World from thread 2
There are 4 threads
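
Without the barrier, the master thread could print the thread count before the other threads have printed their greetings.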

Loop Parallelization

Parallelizing loops with OpenMP is straightforward. The programmer simply marks the loop to be parallelized and, optionally, adds clauses that control how the work is divided. The OpenMP runtime then handles the parallel execution.

The directive is called a work-sharing construct, and must be placed inside a parallel section:

//specify that the following for loop is to be parallelized
#pragma omp for

The #pragma omp for directive distributes the loop among the threads. It must be used inside a parallel block:

#pragma omp parallel 
{
  //the following for loop is divided among the threads
  #pragma omp for
  for (i = 0; i < N; i++) {
    //loop body
  }
}
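
The two directives are often combined into the single shorthand #pragma omp parallel for, and a schedule clause can control how iterations are divided among the threads. A sketch of the loop from the example below written in this combined form (schedule(static) is shown only as an illustration):

#pragma omp parallel for schedule(static)
for (i = 0; i < N; i++) {
  c[i] = a[i] + b[i];  //iterations are split across the team
}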

The following example C code demonstrates the use of the #pragma omp for construct to compute the element-wise sum of two arrays in parallel.

Loop parallelization example (sum_arrays_omp.c):

#include <stdio.h>
#include <omp.h>

#define N 1000000

/* Arrays are static so they are not placed on the stack,
   which is typically too small for ~12 MB of data */
static float a[N], b[N], c[N];

int main(void) {
  int i;

  /* Initialize arrays a and b */
  for (i = 0; i < N; i++) {
    a[i] = i * 2.0;
    b[i] = i * 3.0;
  }

  /* Compute values of array c = a+b in parallel */
  #pragma omp parallel shared(a, b, c) private(i)
  {
    #pragma omp for
    for (i = 0; i < N; i++) {
      c[i] = a[i] + b[i];
    }
  }

  /* Print one element to check the result: c[10] = 20 + 30 */
  printf("%f\n", c[10]);
  return 0;
}

Use the -fopenmp flag to compile using GCC:

$ module purge
$ module load usc
$ gcc -fopenmp sum_arrays_omp.c -o sum_arrays_omp.x

Use the -qopenmp flag to compile using Intel oneAPI:

$ module purge
$ module load intel-oneapi
$ icx -qopenmp sum_arrays_omp.c -o sum_arrays_omp.x

Run the ‘sum_arrays_omp.x’ executable on a debug node using four cores (i.e. four threads):

$ salloc --partition=debug --ntasks=1 --cpus-per-task=4
$ export OMP_NUM_THREADS=4
$ srun ./sum_arrays_omp.x

50.000000