Function-Level Parallelism in OpenMP
Instructor: Dr. Fahed Jubair
Computer Engineering Department
University of Jordan
Function-Level Parallelism
• Programs with function-level (a.k.a. task-level) parallelism focus on distributing parallel functions (i.e., tasks) across threads
• OpenMP provides two directives that are commonly used for writing task-parallel programs:
1. The SECTIONS directive (we will focus on this directive)
2. The TASK directive (beyond the scope of this class)
Quick Review
The SECTIONS Directive
• A SECTIONS directive specifies that the enclosed section(s) of code are to be divided among the threads in the team, such that each section is executed exactly once, by one thread
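As a refresher, a minimal sketch of how the directive is typically used (task_a and task_b are hypothetical functions standing in for independent tasks):

#include <stdio.h>
#include <omp.h>

void task_a (void) { printf("task A on thread %d\n", omp_get_thread_num()) ; }
void task_b (void) { printf("task B on thread %d\n", omp_get_thread_num()) ; }

int main (void)
{
  #pragma omp parallel sections num_threads(2)
  {
    #pragma omp section
    task_a() ; // executed exactly once, by one thread
    #pragma omp section
    task_b() ; // executed exactly once, possibly by another thread
  }
  return 0 ;
}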
Challenges
• There are two primary challenges that programmers need to tackle when converting a sequential algorithm into a task-parallel algorithm:
§ How to maximize the number of tasks that can run in
parallel?
§ How to assign these parallel tasks to threads such that
parallelism overhead is minimized?
Algorithms With Function-Level Parallelism
We will study two sorting algorithms that take
advantage of function-level parallelism:
• Merge Sort
• Quick Sort
Merge Sort
• Split the unsorted list into n sublists, each containing one element
• Repeatedly merge sublists to produce new sorted sublists until
there is only 1 sublist remaining, which will be the sorted list
// A is the input unsorted list and B is the output sorted list
// "left" and "right" are temporary sublists (their declaration/allocation is omitted on the slide)
void merge_sort (int A[ ], int begin, int end, int B[ ])
{
  if ( begin == end ) { // base case: A[begin : end] holds only one integer
    B[begin] = A[begin] ; // a single element is already sorted
    return ;
  }
  int middle = (begin + end) / 2 ;
  merge_sort ( A, begin, middle, left ) ; // sort left half [begin : middle]
  merge_sort ( A, middle + 1, end, right ) ; // sort right half [middle + 1 : end]
  merge ( left, right, B ) ; // merge the sorted left and right sublists into B
}
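The merge step itself is not shown on the slides. A possible sketch, assuming a signature that also receives the two sublist lengths (the slide's three-argument call would need the lengths to be known some other way):

// Merge the sorted lists left[0 : nl-1] and right[0 : nr-1] into B (assumed signature)
void merge (int left[ ], int nl, int right[ ], int nr, int B[ ])
{
  int i = 0, j = 0, k = 0 ;
  while ( i < nl && j < nr ) // repeatedly copy the smaller head element
    B[k++] = ( left[i] <= right[j] ) ? left[i++] : right[j++] ;
  while ( i < nl ) B[k++] = left[i++] ; // copy any remaining left elements
  while ( j < nr ) B[k++] = right[j++] ; // copy any remaining right elements
}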
Merge Sort Example
Split Phase: the unsorted list 99 6 86 15 58 35 86 4 0 is repeatedly split in half until every sublist contains a single element.
Merge Sort Example
Merge Phase: sublists are merged pairwise into sorted sublists (6 99 | 15 86 | 35 58 | 0 4 86, then 6 15 86 99 | 0 4 35 58 86) until one sorted list remains: 0 4 6 15 35 58 86 86 99
Parallelism In Merge Sort
• Upon careful examination of the merge sort code, we can observe that its two recursive calls can run in parallel
void merge_sort (int A[ ], int begin, int end, int B[ ])
{
  if ( begin == end ) {
    B[begin] = A[begin] ;
    return ;
  }
  int middle = (begin + end) / 2 ;
  #pragma omp parallel sections num_threads(2)
  { /* start of parallel region */
    #pragma omp section
    merge_sort ( A, begin, middle, left ) ; // these two sections
    #pragma omp section
    merge_sort ( A, middle + 1, end, right ) ; // run in parallel
  } /* end of parallel region */
  merge ( left, right, B ) ;
}
Hold On!
• The recursive call of merge_sort in
the previous slide will cause
parallel regions to be created within
each other!
• Does OpenMP even allow this? If
so, then how does the parallel
execution occur?
Nested Parallelism
• Some OpenMP implementations support nested parallelism: parallel regions can be nested inside each other
• A thread within a team of threads may fork (spawn) a child team of threads
• The encountering thread creates a team composed of itself and some additional (possibly zero) number of threads; the encountering thread becomes the master thread of the new team
Image Source: John Mellor-Crummey's Lecture Notes
More about Nested Parallelism
• It is possible to enable/disable nested parallelism in OpenMP using the environment variable OMP_NESTED or the library routine omp_set_nested( ) (see the snippet after this list)
• If nested parallelism is enabled, when a thread on a team
within a parallel region encounters a new parallel construct,
an additional team of threads is forked off of it, and it
becomes their master
• If nested parallelism is disabled, when a thread on a team within a parallel region encounters a new parallel construct, that thread executes the nested region by itself, as a team of one (i.e., the nested region runs serially)
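A minimal sketch of the library-routine form (the environment-variable form appears on the next slide; note that omp_set_nested( ) is deprecated in OpenMP 5.0 in favor of omp_set_max_active_levels( )):

#include <stdio.h>
#include <omp.h>

int main (void)
{
  omp_set_nested(1) ; // enable nested parallelism (same effect as OMP_NESTED=TRUE)
  printf("nested parallelism enabled? %d\n", omp_get_nested()) ; // query the setting
  return 0 ;
}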
Example on Nested Parallel Regions
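The original slide shows the example code as an image. A minimal program in the same spirit (assumed to be the contents of the nested.c file compiled on the next slide):

#include <stdio.h>
#include <omp.h>

int main (void)
{
  #pragma omp parallel num_threads(2) // outer parallel region: a team of 2 threads
  {
    int outer = omp_get_thread_num() ;
    #pragma omp parallel num_threads(2) // nested region: each outer thread forks 2 more
    {
      printf("outer thread %d, inner thread %d\n", outer, omp_get_thread_num()) ;
    }
  }
  return 0 ;
}

With nesting enabled, the program prints four lines (2 outer × 2 inner threads); with nesting disabled, each outer thread executes the inner region alone, so only two lines appear.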
Example on Nested Parallel Regions (cont.)
$ gcc -fopenmp nested.c -o a.out
$ export OMP_NESTED=TRUE
$ ./a.out
Remarks on Nested Parallelism
• Nesting parallel regions provides an immediate way to
allow more threads to participate in the computation
• However, nesting parallel regions can easily create too
many threads and oversubscribe the system
• In addition, creating nested parallel regions adds overhead
• To avoid the above problems, programmers usually try to
impose some discipline on the execution using appropriate
OpenMP environment variables
§ For example, the OMP_MAX_ACTIVE_LEVELS environment variable controls the maximum allowed number of active nested parallel regions (see the snippet below)
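For instance, a sketch of how it might be set from the shell (the value 2 is an arbitrary choice):

$ export OMP_MAX_ACTIVE_LEVELS=2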
Quick Sort
Sorts an array of integers as follows:
• Pick an element, called a pivot, from the array
• Reorder the array so that all elements with values less than
the pivot come before the pivot, while all elements with
values greater than the pivot come after it (equal values can
go either way)
• Recursively apply the above steps to the sub-array of
elements with smaller values and separately to the sub-
array of elements with greater values
Source: Wikipedia
Quick Sort (cont.)
int partition (int A[ ], int begin, int end) ; // forward declaration

void quick_sort (int A[ ], int begin, int end) {
  if ( begin < end ) {
    int p = partition (A, begin, end) ; // p is the pivot's final index
    quick_sort ( A, begin, p - 1 ) ; // sort the elements before the pivot
    quick_sort ( A, p + 1, end ) ; // sort the elements after the pivot
  }
}

// helper: swap the contents of two integers (C passes by value, so pointers are needed)
void swap (int *a, int *b) { int tmp = *a ; *a = *b ; *b = tmp ; }

int partition (int A[ ], int begin, int end) {
  int pivot = A[end] ; // let us pick the last element as our pivot
  int i = begin ;
  for ( int j = begin; j < end; j++ ) {
    if ( A[ j ] <= pivot ) {
      swap ( &A[ j ] , &A[ i ] ) ; // move elements no larger than the pivot to the front
      i++ ;
    }
  }
  swap ( &A[ end ] , &A[ i ] ) ; // place the pivot in its final position
  return i ;
}
Quick Sort Example
Initial array: 99 6 86 15 58 0 86 4 35 (pivot = 35, the last element)
After partitioning: 6 15 0 4 | 35 | 86 86 99 58
The two sub-arrays on either side of the pivot are then recursively sorted (6 15 0 4 → 0 4 6 15 and 86 86 99 58 → 58 86 86 99), producing the fully sorted array.
Parallelism In Quick Sort
• Upon careful examination of the quick sort code, we can observe that its two recursive calls can run in parallel
void quick_sort (int A[ ], int begin, int end) {
  if ( begin < end ) {
    int p = partition (A, begin, end) ;
    #pragma omp parallel sections num_threads(2)
    {
      #pragma omp section
      quick_sort ( A, begin, p - 1 ) ; // these two sections
      #pragma omp section
      quick_sort ( A, p + 1, end ) ; // run in parallel
    }
  }
}
Exercise
• Write a C function that returns the number of occurrences of an integer x inside A, an array of randomly generated integers in the range [0, 99]. The algorithm should work as follows:
1. Split the array A into two halves: left and right
2. Count the occurrences of x in the left sub-array
3. Count the occurrences of x in the right sub-array
4. The total count is the sum of the left and right counts
5. Apply the above steps recursively
• Now, parallelize your code using OpenMP (one possible solution sketch follows)
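One possible solution sketch (of many); the function name count_occurrences and the two-way split are illustrative, not the only valid design:

// Count the occurrences of x in A[begin : end] by recursive halving
int count_occurrences (int A[ ], int begin, int end, int x)
{
  if ( begin == end ) // base case: a single element
    return ( A[begin] == x ) ? 1 : 0 ;
  int middle = (begin + end) / 2 ;
  int left_count = 0, right_count = 0 ;
  #pragma omp parallel sections num_threads(2)
  { // as with merge sort, the recursion nests parallel regions
    #pragma omp section
    left_count = count_occurrences ( A, begin, middle, x ) ; // left half
    #pragma omp section
    right_count = count_occurrences ( A, middle + 1, end, x ) ; // right half
  }
  return left_count + right_count ; // total = left count + right count
}

Each section writes a different shared variable, so there is no data race, and the implicit barrier at the end of the sections region guarantees both counts are ready before they are summed.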