Collective operations are building blocks for interaction patterns that are often used in SPMD algorithms in the parallel programming context. Hence, there is an interest in efficient realizations of these operations.
A realization of the collective operations is provided by the Message Passing Interface[1] (MPI).
In all asymptotic runtime functions, we denote the latency \alpha, the communication cost per word \beta, the number of processing units p and the input size per node n. To address individual processing units we use p_i \in \{p_0, p_1, ..., p_{p-1}\}.
If we do not have an equal distribution, i.e. node p_i has a message of size n_i, we get an upper bound for the runtime by setting n = \max(n_0, n_1, ..., n_{p-1}).
A distributed memory model is assumed. The concepts are similar for the shared memory model. However, shared memory systems can provide hardware support for some operations like broadcast for example, which allows convenient concurrent read.[2] Thus, new algorithmic possibilities can become available.
See main article: Broadcast (parallel pattern).
The broadcast pattern[3] is used to distribute data from one processing unit to all processing units, which is often needed in SPMD parallel programs to dispense input or global values. Broadcast can be interpreted as an inverse version of the reduce pattern. Initially only the root r with id 0 stores the message m. During broadcast, m is sent to the remaining processing units, so that eventually m is available to all processing units.
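In MPI this pattern is provided by MPI_Bcast. A minimal usage sketch, assuming rank 0 acts as the root r and the message m is an array of 100 integers (buffer size and values are illustrative):

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m[100];                       /* the message m */
    if (rank == 0) {                  /* only the root holds valid data initially */
        for (int i = 0; i < 100; i++) m[i] = i;
    }
    /* after the call, every rank holds the same 100 integers */
    MPI_Bcast(m, 100, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```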
Since an implementation by means of a sequential for-loop with p-1 iterations becomes a bottleneck, divide-and-conquer approaches are common. One possibility is to utilize a binomial tree structure with the requirement that p has to be a power of two. When a processing unit is responsible for sending m to processing units i..j, it sends m to processing unit \lceil (i+j)/2 \rceil and delegates responsibility for the processing units \lceil (i+j)/2 \rceil .. j to it, while its own responsibility is cut down to i .. \lceil (i+j)/2 \rceil - 1.
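A minimal sketch of this divide-and-conquer scheme with blocking point-to-point MPI calls, assuming rank 0 is the root and the message is a byte buffer; the function tree_bcast and its parameters are illustrative and not part of MPI:

```c
#include <mpi.h>
#include <string.h>

/* Unit i is responsible for the range [i..j] and holds m. It sends m to the
 * middle unit mid = ceil((i+j)/2), delegates [mid..j] to it and keeps [i..mid-1]. */
static void tree_bcast(void *m, int count, int rank, int i, int j, MPI_Comm comm) {
    while (i < j) {
        int mid = (i + j + 1) / 2;                 /* = ceil((i+j)/2) */
        if (rank == i) {
            MPI_Send(m, count, MPI_BYTE, mid, 0, comm);
            j = mid - 1;                           /* responsibility shrinks to [i..mid-1] */
        } else if (rank == mid) {
            MPI_Recv(m, count, MPI_BYTE, i, 0, comm, MPI_STATUS_IGNORE);
            i = mid;                               /* now responsible for [mid..j] */
        } else if (rank < mid) {
            j = mid - 1;                           /* this rank lies in the left subrange */
        } else {
            i = mid;                               /* this rank lies in the right subrange */
        }
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    char m[32] = {0};
    if (rank == 0) strcpy(m, "message from the root");
    tree_bcast(m, sizeof m, rank, 0, p - 1, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```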
Binomial trees have a problem with long messages m: a receiving unit can only propagate the message to other units after it has received the whole message, and in the meantime the communication network is not utilized. Therefore pipelining on binary trees is used, where m is split into an array of k packets of size \lceil n/k \rceil. The packets are then broadcast one after another, so that data is distributed quickly in the communication network.
Pipelined broadcast on a balanced binary tree is possible in \mathcal{O}(\alpha \log p + \beta n), whereas the non-pipelined case takes \mathcal{O}((\alpha + \beta n) \log p).
See main article: Reduce (parallel pattern).
The reduce pattern[4] is used to collect data or partial results from different processing units and to combine them into a global result by a chosen operator. Given p processing units, message m_i is on processing unit p_i initially. All m_i are aggregated by ⊗ and the result is eventually stored on p_0. The reduction operator ⊗ must be at least associative, whereas some algorithms require a commutative operator with a neutral element. Common operators are sum, min and max.
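In MPI this pattern is provided by MPI_Reduce. A minimal usage sketch with sum as the operator ⊗ and rank 0 in the role of p_0; the local values are illustrative:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m_i = rank + 1;      /* local message m_i (illustrative value) */
    int result = 0;
    /* combine all m_i with MPI_SUM; only rank 0 receives the global result */
    MPI_Reduce(&m_i, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %d\n", result);

    MPI_Finalize();
    return 0;
}
```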
Implementation considerations are similar to broadcast. For pipelining on binary trees, the message must be representable as a vector of smaller objects for component-wise reduction.
Pipelined reduce on a balanced binary tree is possible in \mathcal{O}(\alpha \log p + \beta n).
The all-reduce pattern[5] (also called allreduce) is used if the result of a reduce operation must be distributed to all processing units. Given p processing units, message m_i is on processing unit p_i initially. All m_i are aggregated by an operator ⊗ and the result is eventually stored on all p_i. Analogously to the reduce operation, the operator ⊗ must be at least associative.
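In MPI this pattern is provided by MPI_Allreduce. A minimal usage sketch with max as the operator ⊗; the local values are illustrative:

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m_i = rank;          /* local message m_i (illustrative value) */
    int result;
    /* every rank obtains the same aggregated result, here the global maximum */
    MPI_Allreduce(&m_i, &result, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```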
All-reduce can be interpreted as a reduce operation with a subsequent broadcast. For long messages a corresponding implementation is suitable, whereas for short messages, the latency can be reduced by using a hypercube topology, if p is a power of two.
All-reduce is possible in \mathcal{O}(\alpha \log p + \beta n), since reduce and broadcast are each possible in \mathcal{O}(\alpha \log p + \beta n) with pipelining on balanced binary trees.
See main article: Prefix sum.
The prefix-sum or scan operation[7] is used to collect data or partial results from different processing units and to compute intermediate results by an operator, which are stored on those processing units. It can be seen as a generalization of the reduce operation. Given p processing units, message m_i is on processing unit p_i. The operator ⊗ must be at least associative, whereas some algorithms also require a commutative operator and a neutral element. Common operators are sum, min and max. Eventually processing unit p_i stores the inclusive prefix sum ⊗_{i' \le i} m_{i'}. In the case of the so-called exclusive prefix sum, processing unit p_i stores ⊗_{i' < i} m_{i'}.
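MPI provides the inclusive variant as MPI_Scan and the exclusive variant as MPI_Exscan. A minimal usage sketch with sum as the operator; the local values are illustrative:

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m_i = rank + 1;          /* local message m_i (illustrative value) */
    int incl, excl = 0;
    /* inclusive prefix sum: rank i obtains m_0 + ... + m_i */
    MPI_Scan(&m_i, &incl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    /* exclusive prefix sum: rank i obtains m_0 + ... + m_{i-1}; undefined on rank 0 */
    MPI_Exscan(&m_i, &excl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```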
For short messages, this can be achieved with a hypercube topology if p is a power of two. For long messages, the hypercube topology is not suitable, since all processing units are active in every step and therefore pipelining can't be used. A binary tree topology is better suited for arbitrary p and long messages.
Prefix-sum on a binary tree can be implemented with an upward and downward phase. In the upward phase reduction is performed, while the downward phase is similar to broadcast, where the prefix sums are computed by sending different data to the left and right children. With this approach pipelining is possible, because the operations are equal to reduction and broadcast.
Pipelined prefix sum on a binary tree is possible in \mathcal{O}(\alpha \log p + \beta n).
See main article: Barrier (computer science).
The barrier[8] as a collective operation is a generalization of the concept of a barrier that can be used in distributed computing. When a processing unit calls barrier, it waits until all other processing units have called barrier as well. Barrier is thus used to achieve global synchronization in distributed computing.
One way to implement barrier is to call all-reduce with an empty/dummy operand. We know the runtime of all-reduce is \mathcal{O}(\alpha \log p + \beta n). Using a dummy operand reduces the size n to a constant factor, which leads to a runtime of \mathcal{O}(\alpha \log p).
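In MPI, global synchronization is provided by MPI_Barrier; a minimal usage sketch:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("rank %d: before the barrier\n", rank);
    /* no rank continues until every rank has entered the barrier */
    MPI_Barrier(MPI_COMM_WORLD);
    printf("rank %d: after the barrier\n", rank);

    MPI_Finalize();
    return 0;
}
```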
The gather communication pattern[9] is used to store data from all processing units on a single processing unit. Given p processing units with message m_i on processing unit p_i, we want to store the concatenated message m_1 ⋅ m_2 ⋅ ... ⋅ m_p on a fixed processing unit p_j. Gather can be thought of as a reduce operation that uses the concatenation operator, which works because concatenation is associative. By using the same binomial tree reduction algorithm we get a runtime of \mathcal{O}(\alpha \log p + \beta p n). We see that the asymptotic runtime is similar to the asymptotic runtime of reduce \mathcal{O}(\alpha \log p + \beta n), but with the addition of a factor p to the term \beta n. This additional factor is due to the message size increasing in each step as messages get concatenated. Compare this to reduce, where the message size stays constant for operators like min.
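In MPI this pattern is provided by MPI_Gather. A minimal usage sketch, assuming rank 0 plays the role of p_j and each local message m_i is a single integer (values are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int m_i = rank * rank;                 /* local message m_i (illustrative value) */
    int *all = NULL;
    if (rank == 0)                         /* only the target p_j needs a receive buffer */
        all = malloc(p * sizeof(int));
    /* rank 0 ends up with the concatenation m_0 ⋅ m_1 ⋅ ... ⋅ m_{p-1} */
    MPI_Gather(&m_i, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    free(all);
    MPI_Finalize();
    return 0;
}
```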
The all-gather communication pattern[9] is used to collect data from all processing units and to store the collected data on all processing units. Given p processing units p_i with message m_i initially stored on p_i, we want to store the concatenated message m_1 ⋅ m_2 ⋅ ... ⋅ m_p on every processing unit p_j.
It can be thought of in multiple ways. The first is as an all-reduce operation with concatenation as the operator, in the same way that gather can be represented by reduce. The second is as a gather operation followed by a broadcast of the new message of size pn. With this we see that all-gather is possible in \mathcal{O}(\alpha \log p + \beta p n).
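In MPI this pattern is provided by MPI_Allgather. A minimal usage sketch with one integer per processing unit (values are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int m_i = 2 * rank;                    /* local message m_i (illustrative value) */
    int *all = malloc(p * sizeof(int));    /* every rank receives the full concatenation */
    MPI_Allgather(&m_i, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    free(all);
    MPI_Finalize();
    return 0;
}
```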
The scatter communication pattern[10] is used to distribute data from one processing unit to all the processing units. It differs from broadcast in that it does not send the same message to all processing units. Instead it splits the message and delivers one part of it to each processing unit.
Given p processing units p_i, a fixed processing unit p_j holds the message m = m_1 ⋅ m_2 ⋅ ... ⋅ m_p, and we want to transport the part m_i onto processing unit p_i. Since scatter can be seen as the reverse of gather, the same implementation considerations apply and the runtime is \mathcal{O}(\alpha \log p + \beta p n).
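In MPI this pattern is provided by MPI_Scatter. A minimal usage sketch, assuming rank 0 plays the role of p_j and each part m_i is a single integer (values are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *m = NULL;
    if (rank == 0) {                       /* the root holds m = m_1 ⋅ m_2 ⋅ ... ⋅ m_p */
        m = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) m[i] = i;
    }
    int m_i;                               /* each rank receives its own part m_i */
    MPI_Scatter(m, 1, MPI_INT, &m_i, 1, MPI_INT, 0, MPI_COMM_WORLD);

    free(m);
    MPI_Finalize();
    return 0;
}
```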
See main article: All-to-all (parallel pattern).
All-to-all[11] is the most general communication pattern. For 0 \le i, j < p, message m_{i,j} is the message that is initially stored on node i and has to be delivered to node j. We can express all communication primitives that do not use operators through all-to-all. For example, broadcast of a message m from node p_k is emulated by setting m_{i,j} = m for i = k and setting m_{l,j} empty for l \ne k.
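In MPI this pattern is provided by MPI_Alltoall. A minimal usage sketch where each message m_{i,j} is a single integer (payload is illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *send = malloc(p * sizeof(int));   /* send[j] is the message m_{i,j} for node j */
    int *recv = malloc(p * sizeof(int));   /* recv[i] receives the message from node i */
    for (int j = 0; j < p; j++) send[j] = rank * p + j;   /* illustrative payload */
    MPI_Alltoall(send, 1, MPI_INT, recv, 1, MPI_INT, MPI_COMM_WORLD);

    free(send); free(recv);
    MPI_Finalize();
    return 0;
}
```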
Assuming we have a fully connected network, the best possible runtime for all-to-all is in \mathcal{O}(p(\alpha + \beta n)). This is achieved through p rounds of direct message exchange: for p a power of two, in communication round k, processing unit p_i exchanges messages with processing unit p_j where j = i \oplus k.
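A minimal sketch of this direct-exchange scheme, assuming p is a power of two, one integer per pairwise message, and MPI_Sendrecv for the exchange with partner i \oplus k in round k (buffer layout and payload are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);       /* assumed to be a power of two */

    int *send = malloc(p * sizeof(int));     /* send[j] = message for node j */
    int *recv = malloc(p * sizeof(int));     /* recv[j] = message from node j */
    for (int j = 0; j < p; j++) send[j] = rank * p + j;   /* illustrative payload */
    recv[rank] = send[rank];                 /* round k = 0: keep the local part */

    /* p-1 rounds of direct exchange: in round k, rank i talks to partner i XOR k */
    for (int k = 1; k < p; k++) {
        int partner = rank ^ k;
        MPI_Sendrecv(&send[partner], 1, MPI_INT, partner, 0,
                     &recv[partner], 1, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(send); free(recv);
    MPI_Finalize();
    return 0;
}
```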
If the message size is small and latency dominates the communication, a hypercube algorithm can be used to distribute the messages in time \mathcal{O}(\log p (\alpha + \beta p n)).
This table[12] gives an overview of the best known asymptotic runtimes, assuming we have free choice of network topology.
Example topologies for optimal runtime are the binary tree, the binomial tree and the hypercube.
In practice, we have to adjust to the available physical topologies, e.g. dragonfly, fat tree or grid network, among others.
More information under Network topology.
For each operation, the optimal algorithm can depend on the input size n.
The complexities stated in the table depend on the latency \alpha and the communication cost per word \beta in addition to the number of processing units p and the input message size per node n.
Name | # senders | # receivers | # messages | Computations? | Complexity
Broadcast | 1 | p | 1 | no | \mathcal{O}(\alpha \log p + \beta n)
Reduce | p | 1 | p | yes | \mathcal{O}(\alpha \log p + \beta n)
All-reduce | p | p | p | yes | \mathcal{O}(\alpha \log p + \beta n)
Prefix sum | p | p | p | yes | \mathcal{O}(\alpha \log p + \beta n)
Barrier | p | p | 0 | no | \mathcal{O}(\alpha \log p)
Gather | p | 1 | p | no | \mathcal{O}(\alpha \log p + \beta p n)
All-Gather | p | p | p | no | \mathcal{O}(\alpha \log p + \beta p n)
Scatter | 1 | p | p | no | \mathcal{O}(\alpha \log p + \beta p n)
All-To-All | p | p | p^2 | no | \mathcal{O}(\log p (\alpha + \beta p n)) or \mathcal{O}(p(\alpha + \beta n))
Sanders, Peter; Mehlhorn, Kurt; Dietzfelbinger, Martin; Dementiev, Roman (2019). Sequential and Parallel Algorithms and Data Structures - The Basic Toolbox. Springer Nature Switzerland AG. ISBN 978-3-030-25208-3.