Privatization is a technique used in shared-memory programming to enable parallelism by removing dependencies that occur across different threads in a parallel program. Dependencies between threads arise when two or more threads read or write a variable at the same time. Privatization gives each thread a private copy, so each thread can read and write its copy independently and, therefore, simultaneously.
Each parallel algorithm specifies whether a variable is shared or private. Many errors in implementation can arise if the variable is declared to be shared but the algorithm requires it to be private, or vice versa.
Traditionally, parallelizing compilers could apply privatization only to scalar variables. To exploit the parallelism that occurs across iterations of a loop (loop-level parallelism), compilers also needed to privatize array variables.[1] Most of today's parallelizing compilers can perform array privatization, along with further features and functions to enhance the performance of the parallel program in general. An example is the Polaris parallelizing compiler.
A shared-memory multiprocessor is a "computer system composed of multiple independent processors that execute different instruction streams".[2] The shared memory programming model is the most widely used for parallel processor designs.[3] This programming model starts by identifying possibilities for parallelism within a piece of code and then mapping these parallel tasks into threads.
The next step is to determine the scope of variables used in a parallel program, which is one of the key steps and main concerns within this model.
Right after this step, the use of variables within tasks needs to be analyzed to determine whether each variable should be shared by all threads or private to each thread. This step is unique to shared-memory programming. (An alternative is message passing, in which all variables are private.)
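In practice, this scoping decision is written directly into the code. A minimal sketch of how it might look in C with OpenMP (the function and variable names are illustrative, not from the source):

#include <omp.h>

void scale(double a[], int n, double factor) {
    double t;
    #pragma omp parallel for shared(a) private(t)
    for (int i = 0; i < n; i++) {
        t = a[i] * factor;  // t is private: each thread gets its own copy
        a[i] = t;           // a is shared: threads write disjoint elements
    }
}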
According to their behavior, the variables are then categorized as:
- Read-only: variables that are only read by all tasks.
- Read/Write Non-conflicting: variables that are read and written by a single task, or arrays whose accesses by different tasks touch disjoint elements.
- Read/Write Conflicting: variables that are written by one task and read by another.
As their definition suggests, Read/Write Conflicting variables introduce dependencies between different execution threads and hence prevent the automatic parallelization of the program. The two major techniques used to remove these dependencies are privatization and reduction. In reduction, each thread is given a copy of the Read/Write Conflicting variable on which it computes a partial result, which is then combined with the other threads' copies to produce a global result. Another technique, similar to privatization, is expansion, in which a scalar variable is expanded into an array so that each thread accesses a different array element.[4] If the variable to be expanded is an array, expansion adds another dimension to the array.[5]
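As an illustration of expansion, consider the following hedged C sketch (the names x_exp, a, b, and c are placeholders, not from the source). The scalar x, which every iteration would otherwise overwrite, is expanded into one array element per iteration, so the iterations no longer conflict and could be distributed across threads:

#include <stdlib.h>

void squares(const double a[], const double b[], double c[], int n) {
    /* Before expansion, each iteration read and wrote one shared scalar:
       x = a[i] + b[i]; c[i] = x * x;  -- a Read/Write Conflicting variable. */
    double *x_exp = malloc(n * sizeof *x_exp);  // one element per iteration
    for (int i = 0; i < n; i++) {
        x_exp[i] = a[i] + b[i];      // each iteration writes its own element
        c[i] = x_exp[i] * x_exp[i];  // and reads back only that element
    }
    free(x_exp);
}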
Dependencies (potential conflicts between different threads during execution) prevent parallelization, and these conflicts appear when a program has Read/Write Conflicting variables. One technique for removing these conflicts is privatization. The basic principle is to make a private copy of a variable for each thread, rather than have the threads share one instance. This changes the category of the variable from Read/Write Conflicting to Read/Write Non-conflicting.
The local (private) instances of the Read/Write Conflicting variables are created at compile time by allocating a separate memory area for each thread's copy, at distinct memory locations. The architecture of shared-memory multiprocessors makes this straightforward, since all threads share one address space.
There are two situations in which a variable can be described as privatizable:
- The variable is written by each task before that task reads it, so its value never needs to flow between tasks.
- The variable's value is known ahead of time wherever it is read, so each task can initialize its own copy with the correct value.
Because Read/Write Conflicting variables are the only category that prevents parallelization, there is no need to explicitly declare Read-only and Read/Write Non-conflicting variables as private. Doing so does not affect the correctness of the program, but it may use more memory for unnecessary copies.
When a Read/Write Conflicting variable cannot be privatized, the problem can sometimes be solved by changing the scope of parallelism to explore a different parallel region; after reanalyzing the code, some Read/Write Conflicting variables may turn out to be Read/Write Non-conflicting. If a variable still causes conflicts, the last resort is to declare it as shared, protect its accesses with some form of mutual exclusion, and provide synchronization if accesses to the variable need to happen in a specified order to ensure correctness.
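A hedged C/OpenMP sketch of this last resort (the counter and function names are illustrative): the conflicting variable stays shared, and each access is guarded by a critical section:

#include <omp.h>

int shared_counter = 0;  // Read/Write Conflicting variable, kept shared

void count_matches(const int a[], int n, int key) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (a[i] == key) {
            #pragma omp critical  // mutual exclusion: correct, but serializes access
            shared_counter++;
        }
    }
}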
When applied to scalar variables, the additional space and overhead introduced by making the extra private copies per thread is relatively small, because scalars are small. However, applying privatization on arrays, matrices or other compound types is much more complex.
When dealing with arrays, the compiler tries to analyze the behavior of each array element separately and checks the order in which it is read and written. If every element is written before it is read in the same iteration, the array can be privatized. To establish this, the compiler needs to analyze the array further and combine its accesses into sections. Moreover, the compiler needs extra machinery to manipulate array accesses; for example, some array subscript expressions contain symbolic terms, so to privatize such an array the compiler needs advanced symbolic manipulation functions.
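A minimal C/OpenMP sketch of a privatizable array (the names tmp, a, and b are illustrative, not from the source): every element of tmp is written before it is read within the same outer-loop iteration, so the compiler can give each thread its own copy:

#define M 8

void smooth(int n, double a[][M], double b[][M]) {
    double tmp[M];
    #pragma omp parallel for private(tmp)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < M; j++)
            tmp[j] = a[i][j] * 0.5;      // every element written first ...
        for (int j = 0; j < M; j++)
            b[i][j] = tmp[j] + a[i][j];  // ... then read, in the same iteration
    }
}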
A variable can be privatized if each task will write to it before reading from it. In this case, it does not matter whether other threads are doing so. In the code below, the variable x is used to help swap three different pairs of variables. Because it is always written to before being read, it can be privatized.
//Sequential Code:
x = a;
a = b;
b = x;

x = c;
c = d;
d = x;

x = e;
e = f;
f = x;
This code cannot be made parallel without privatizing x. With x privatized, it can run on three different threads, each with its own private x:
//Parallel Code:
// Thread 1:
x[1] = a;
a = b;
b = x[1];

// Thread 2:
x[2] = c;
c = d;
d = x[2];

// Thread 3:
x[3] = e;
e = f;
f = x[3];
Privatization is also possible when a variable's value is known before it is used, even if it is written to by a different task. The code below demonstrates this. The variable x is written to in the middle of each task, but that value can be computed at compile time. By making x private and defining it at the beginning of each task, the code can be run in parallel:
//Sequential Code:
x = 1;
y = x * 3;
x = 4;
z = y/x;

a = x * 9;
x = 3;
b = a/x;

c = x * 1;
x = 11;
d = c/x;
To make the sequential code above parallel, a few lines of code must be added so that x can be privatized:
//Parallel Code
// Thread 0:
x[0] = 1;
y = x[0] * 3;
x[0] = 4;
z = y/x[0];

// Thread 1:
x[1] = 4;
a = x[1] * 9;
x[1] = 3;
b = a/x[1];

// Thread 2:
x[2] = 3;
c = x[2] * 1;
x[2] = 11;
d = c/x[2];
Because of the extra code, this short example may not see much of a speedup. In longer, real-world code, however, this technique can greatly improve performance.
Privatization fails when a variable is written in one task and read in another, and its value is not known ahead of time. An example is summing the elements of an array: the sum is a shared variable that is read and written in each iteration of the loop. In sequential code this works fine, but if the iterations each ran in a different thread, the wrong sum would be calculated. In this case privatization does not work; sum cannot be made private because it relies on its value from the previous iteration.
//Sequential Code:
sum = 0;
for (i = 0; i < 100; i++)
    sum += a[i];
This problem can instead be solved by splitting the loop into partial sums, a form of reduction. Because it does not matter in which order the elements are added, the loop can be split into an arbitrary number of parts:
// Thread 0:
sum[0] = 0;
for (i[0] = 0; i[0] < 100; i[0] += 3)
    sum[0] += a[i[0]];

// Thread 1:
sum[1] = 0;
for (i[1] = 1; i[1] < 100; i[1] += 3)
    sum[1] += a[i[1]];

// Thread 2:
sum[2] = 0;
for (i[2] = 2; i[2] < 100; i[2] += 3)
    sum[2] += a[i[2]];

// "Master" thread:
wait_for_all(thread[0], thread[1], thread[2]);
sum = sum[0] + sum[1] + sum[2];
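In OpenMP, the same idea can be expressed with the reduction clause, which gives each thread a private partial sum and combines the partial results at the end. A minimal sketch (the function name is illustrative):

double array_sum(const double a[], int n) {
    double sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];  // each thread accumulates into its own private copy
    return sum;
}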
Given the sequential code:
do i = 10, N - 1
   x = (b(i) + c(i))/2
   b(i) = a(i + 1) + x
enddo
In each iteration of the loop, x is written to and then read from. Because x is a single scalar variable shared by all iterations, the loop cannot be executed in parallel as written: different threads would overwrite x, and b(i) would not always be assigned the correct value.
Equivalent parallelized code using privatization is:
!$omp parallel do shared(a, b) private(x)
do i = 10, N - 1
   x = (b(i) + c(i))/2
   b(i) = a(i + 1) + x
enddo
Because x is declared as private, each thread gets its own copy and the dependence is removed.[6]
Normally, when a variable is Read/Write Conflicting, the default solution is to declare it as shared and protect access to it with mutual exclusion, providing synchronization when needed. Because mutual exclusion slows things down, this technique is avoided as much as possible.
Thus, the compiler or programmer first checks whether the variable can be reduced. If it cannot, the next check is for privatization. Privatization trades space for time, so mutual exclusion can be a better option when memory is limited.
Compared to reduction, privatization requires only one step: analyzing the code to identify the privatizable variables. Reduction, in contrast, requires two steps: identifying the reduction variable, and then parallelizing the reduction operator.[7] Comparing the two techniques shows what type of overhead each adds to the parallel program: reduction increases the computation overhead, while privatization increases the memory consumed by the program.
Compared to expansion, privatization has less memory overhead: the memory needed for privatization is proportional to the number of processors, while for expansion it is proportional to the number of loop iterations. Since the number of iterations is typically much higher than the number of processors, the memory required by expansion is much larger than that required by privatization. For example, privatizing an 8-byte scalar across 8 threads costs 64 bytes, while expanding it across a million iterations costs about 8 MB.
The scope of parallelism can also be changed to explore a different parallel region. Doing so may greatly change the behavior of the variables, so reanalyzing the code at the new scope often turns Read/Write Conflicting variables into Non-conflicting ones.