Collective operations are building blocks for interaction patterns that are often used in SPMD algorithms in the parallel programming context. Hence, there is an interest in efficient realizations of these operations.
A realization of the collective operations is provided by the Message Passing Interface[1] (MPI).
In all asymptotic runtime functions, we denote the latency \alpha, the communication cost per word \beta, the number of processing units p and the input size per node n. To address individual processing units we use p_i \in \{p_0, p_1, ..., p_{p-1}\}.
If we do not have an equal distribution, i.e. node p_i has a message of size n_i, we get an upper bound for the runtime by setting n = \max(n_0, n_1, ..., n_{p-1}).
A distributed memory model is assumed. The concepts are similar for the shared memory model. However, shared memory systems can provide hardware support for some operations like broadcast for example, which allows convenient concurrent read.[2] Thus, new algorithmic possibilities can become available.
See main article: Broadcast (parallel pattern).
The broadcast pattern[3] is used to distribute data from one processing unit to all processing units, which is often needed in SPMD parallel programs to dispense input or global values. Broadcast can be interpreted as an inverse version of the reduce pattern. Initially only the root r with id 0 stores the message m. During broadcast, m is sent to the remaining processing units, so that eventually m is available to all processing units.
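In MPI this pattern is provided by MPI_Bcast. A minimal usage sketch, assuming rank 0 acts as the root r and the message m is an array of 100 integers (buffer size and values are illustrative):

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m[100];                       /* the message m */
    if (rank == 0) {                  /* only the root holds valid data initially */
        for (int i = 0; i < 100; i++) m[i] = i;
    }
    /* after the call, every rank holds the same 100 integers */
    MPI_Bcast(m, 100, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```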
Since an implementation by means of a sequential for-loop with p-1 iterations becomes a bottleneck, divide-and-conquer approaches are common. One possibility is to utilize a binomial tree structure with the requirement that p has to be a power of two. When a processing unit is responsible for sending m to processing units i..j, it sends m to processing unit \lceil (i+j)/2 \rceil and delegates responsibility for the processing units \lceil (i+j)/2 \rceil .. j to it, while its own responsibility is cut down to i .. \lceil (i+j)/2 \rceil - 1.
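A minimal sketch of this divide-and-conquer scheme with blocking point-to-point MPI calls, assuming rank 0 is the root and the message is a byte buffer; the function tree_bcast and its parameters are illustrative and not part of MPI:

```c
#include <mpi.h>
#include <string.h>

/* Unit i is responsible for the range [i..j] and holds m. It sends m to the
 * middle unit mid = ceil((i+j)/2), delegates [mid..j] to it and keeps [i..mid-1]. */
static void tree_bcast(void *m, int count, int rank, int i, int j, MPI_Comm comm) {
    while (i < j) {
        int mid = (i + j + 1) / 2;                 /* = ceil((i+j)/2) */
        if (rank == i) {
            MPI_Send(m, count, MPI_BYTE, mid, 0, comm);
            j = mid - 1;                           /* responsibility shrinks to [i..mid-1] */
        } else if (rank == mid) {
            MPI_Recv(m, count, MPI_BYTE, i, 0, comm, MPI_STATUS_IGNORE);
            i = mid;                               /* now responsible for [mid..j] */
        } else if (rank < mid) {
            j = mid - 1;                           /* this rank lies in the left subrange */
        } else {
            i = mid;                               /* this rank lies in the right subrange */
        }
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    char m[32] = {0};
    if (rank == 0) strcpy(m, "message from the root");
    tree_bcast(m, sizeof m, rank, 0, p - 1, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```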
Binomial trees have a problem with long messages m: a receiving unit can only propagate the message to other units after it has received the whole message, and in the meantime the communication network is not utilized. Therefore pipelining on binary trees is used, where m is split into an array of k packets of size \lceil n/k \rceil. The packets are then broadcast one after another, so that data is distributed quickly in the communication network.
Pipelined broadcast on a balanced binary tree is possible in \mathcal{O}(\alpha \log p + \beta n), whereas the non-pipelined case takes \mathcal{O}((\alpha + \beta n) \log p).
See main article: Reduce (parallel pattern).
The reduce pattern[4] is used to collect data or partial results from different processing units and to combine them into a global result by a chosen operator. Given p processing units, message m_i is on processing unit p_i initially. All m_i are aggregated by ⊗ and the result is eventually stored on p_0. The reduction operator ⊗ must be at least associative, whereas some algorithms require a commutative operator with a neutral element. Common operators are sum, min and max.
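In MPI this pattern is provided by MPI_Reduce. A minimal usage sketch with sum as the operator ⊗ and rank 0 in the role of p_0; the local values are illustrative:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m_i = rank + 1;      /* local message m_i (illustrative value) */
    int result = 0;
    /* combine all m_i with MPI_SUM; only rank 0 receives the global result */
    MPI_Reduce(&m_i, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %d\n", result);

    MPI_Finalize();
    return 0;
}
```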
Implementation considerations are similar to broadcast. For pipelining on binary trees, the message must be representable as a vector of smaller objects for component-wise reduction.
Pipelined reduce on a balanced binary tree is possible in \mathcal{O}(\alpha \log p + \beta n).
The all-reduce pattern[5] (also called allreduce) is used if the result of a reduce operation must be distributed to all processing units. Given p processing units, message m_i is on processing unit p_i initially. All m_i are aggregated by an operator ⊗ and the result is eventually stored on all p_i. Analogously to the reduce operation, the operator ⊗ must be at least associative.
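In MPI this pattern is provided by MPI_Allreduce. A minimal usage sketch with max as the operator ⊗; the local values are illustrative:

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m_i = rank;          /* local message m_i (illustrative value) */
    int result;
    /* every rank obtains the same aggregated result, here the global maximum */
    MPI_Allreduce(&m_i, &result, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```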
All-reduce can be interpreted as a reduce operation with a subsequent broadcast. For long messages a corresponding implementation is suitable, whereas for short messages, the latency can be reduced by using a hypercube topology, if p is a power of two.
All-reduce is possible in \mathcal{O}(\alpha \log p + \beta n), since reduce and broadcast are each possible in \mathcal{O}(\alpha \log p + \beta n) with pipelining on balanced binary trees.
See main article: Prefix sum.
The prefix-sum or scan operation[7] is used to collect data or partial results from different processing units and to compute intermediate results by an operator, which are stored on those processing units. It can be seen as a generalization of the reduce operation. Given p processing units, message m_i is on processing unit p_i. The operator ⊗ must be at least associative, whereas some algorithms also require a commutative operator and a neutral element. Common operators are sum, min and max. Eventually processing unit p_i stores the inclusive prefix sum ⊗_{i' \le i} m_{i'}. In the case of the so-called exclusive prefix sum, processing unit p_i stores ⊗_{i' < i} m_{i'}.
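MPI provides the inclusive variant as MPI_Scan and the exclusive variant as MPI_Exscan. A minimal usage sketch with sum as the operator; the local values are illustrative:

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m_i = rank + 1;          /* local message m_i (illustrative value) */
    int incl, excl = 0;
    /* inclusive prefix sum: rank i obtains m_0 + ... + m_i */
    MPI_Scan(&m_i, &incl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    /* exclusive prefix sum: rank i obtains m_0 + ... + m_{i-1}; undefined on rank 0 */
    MPI_Exscan(&m_i, &excl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```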
For short messages, this can be achieved with a hypercube topology if p is a power of two. For long messages, the hypercube topology is not suitable, since all processing units are active in every step and therefore pipelining can't be used. A binary tree topology is better suited for arbitrary p and long messages.
Prefix-sum on a binary tree can be implemented with an upward and downward phase. In the upward phase reduction is performed, while the downward phase is similar to broadcast, where the prefix sums are computed by sending different data to the left and right children. With this approach pipelining is possible, because the operations are equal to reduction and broadcast.
Pipelined prefix sum on a binary tree is possible in \mathcal{O}(\alpha \log p + \beta n).
See main article: Barrier (computer science).
The barrier[8] as a collective operation is a generalization of the concept of a barrier that can be used in distributed computing. When a processing unit calls barrier, it waits until all other processing units have called barrier as well. Barrier is thus used to achieve global synchronization in distributed computing.
One way to implement barrier is to call all-reduce with an empty/dummy operand. We know the runtime of all-reduce is \mathcal{O}(\alpha \log p + \beta n). Using a dummy operand reduces the size n to a constant factor, which leads to a runtime of \mathcal{O}(\alpha \log p).
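In MPI, global synchronization is provided by MPI_Barrier; a minimal usage sketch:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("rank %d: before the barrier\n", rank);
    /* no rank continues until every rank has entered the barrier */
    MPI_Barrier(MPI_COMM_WORLD);
    printf("rank %d: after the barrier\n", rank);

    MPI_Finalize();
    return 0;
}
```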
The gather communication pattern[9] is used to store data from all processing units on a single processing unit. Given p processing units with message m_i on processing unit p_i, we want to store the concatenated message m_1 ⋅ m_2 ⋅ ... ⋅ m_p on a fixed processing unit p_j. Gather can be thought of as a reduce operation that uses the concatenation operator, which works because concatenation is associative. By using the same binomial tree reduction algorithm we get a runtime of \mathcal{O}(\alpha \log p + \beta p n). We see that the asymptotic runtime is similar to the asymptotic runtime of reduce \mathcal{O}(\alpha \log p + \beta n), but with the addition of a factor p to the term \beta n. This additional factor is due to the message size increasing in each step as messages get concatenated. Compare this to reduce, where the message size stays constant for operators like min.
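In MPI this pattern is provided by MPI_Gather. A minimal usage sketch, assuming rank 0 plays the role of p_j and each local message m_i is a single integer (values are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int m_i = rank * rank;                 /* local message m_i (illustrative value) */
    int *all = NULL;
    if (rank == 0)                         /* only the target p_j needs a receive buffer */
        all = malloc(p * sizeof(int));
    /* rank 0 ends up with the concatenation m_0 ⋅ m_1 ⋅ ... ⋅ m_{p-1} */
    MPI_Gather(&m_i, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    free(all);
    MPI_Finalize();
    return 0;
}
```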
The all-gather communication pattern[9] is used to collect data from all processing units and to store the collected data on all processing units. Given p processing units p_i with message m_i initially stored on p_i, we want to store the concatenated message m_1 ⋅ m_2 ⋅ ... ⋅ m_p on every processing unit p_j.
It can be thought of in multiple ways. The first is as an all-reduce operation with concatenation as the operator, in the same way that gather can be represented by reduce. The second is as a gather operation followed by a broadcast of the new message of size pn. With this we see that all-gather is possible in \mathcal{O}(\alpha \log p + \beta p n).
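In MPI this pattern is provided by MPI_Allgather. A minimal usage sketch with one integer per processing unit (values are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int m_i = 2 * rank;                    /* local message m_i (illustrative value) */
    int *all = malloc(p * sizeof(int));    /* every rank receives the full concatenation */
    MPI_Allgather(&m_i, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    free(all);
    MPI_Finalize();
    return 0;
}
```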
The scatter communication pattern[10] is used to distribute data from one processing unit to all the processing units. It differs from broadcast in that it does not send the same message to all processing units. Instead it splits the message and delivers one part of it to each processing unit.
Given p processing units p_i, a fixed processing unit p_j holds the message m = m_1 ⋅ m_2 ⋅ ... ⋅ m_p, and we want to transport the part m_i onto processing unit p_i. Since scatter can be seen as the reverse of gather, the same implementation considerations apply and the runtime is \mathcal{O}(\alpha \log p + \beta p n).
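In MPI this pattern is provided by MPI_Scatter. A minimal usage sketch, assuming rank 0 plays the role of p_j and each part m_i is a single integer (values are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *m = NULL;
    if (rank == 0) {                       /* the root holds m = m_1 ⋅ m_2 ⋅ ... ⋅ m_p */
        m = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) m[i] = i;
    }
    int m_i;                               /* each rank receives its own part m_i */
    MPI_Scatter(m, 1, MPI_INT, &m_i, 1, MPI_INT, 0, MPI_COMM_WORLD);

    free(m);
    MPI_Finalize();
    return 0;
}
```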
See main article: All-to-all (parallel pattern).
All-to-all[11] is the most general communication pattern. For 0 \le i, j < p, message m_{i,j} is the message that is initially stored on node i and has to be delivered to node j. We can express all communication primitives that do not use operators through all-to-all. For example, broadcast of a message m from node p_k is emulated by setting m_{i,j} = m for i = k and setting m_{l,j} empty for l \ne k.
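In MPI this pattern is provided by MPI_Alltoall. A minimal usage sketch where each message m_{i,j} is a single integer (payload is illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *send = malloc(p * sizeof(int));   /* send[j] is the message m_{i,j} for node j */
    int *recv = malloc(p * sizeof(int));   /* recv[i] receives the message from node i */
    for (int j = 0; j < p; j++) send[j] = rank * p + j;   /* illustrative payload */
    MPI_Alltoall(send, 1, MPI_INT, recv, 1, MPI_INT, MPI_COMM_WORLD);

    free(send); free(recv);
    MPI_Finalize();
    return 0;
}
```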
Assuming we have a fully connected network, the best possible runtime for all-to-all is in \mathcal{O}(p(\alpha + \beta n)). This is achieved through p rounds of direct message exchange: for p a power of two, in communication round k, processing unit p_i exchanges messages with processing unit p_j where j = i \oplus k.
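A minimal sketch of this direct-exchange scheme, assuming p is a power of two, one integer per pairwise message, and MPI_Sendrecv for the exchange with partner i \oplus k in round k (buffer layout and payload are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);       /* assumed to be a power of two */

    int *send = malloc(p * sizeof(int));     /* send[j] = message for node j */
    int *recv = malloc(p * sizeof(int));     /* recv[j] = message from node j */
    for (int j = 0; j < p; j++) send[j] = rank * p + j;   /* illustrative payload */
    recv[rank] = send[rank];                 /* round k = 0: keep the local part */

    /* p-1 rounds of direct exchange: in round k, rank i talks to partner i XOR k */
    for (int k = 1; k < p; k++) {
        int partner = rank ^ k;
        MPI_Sendrecv(&send[partner], 1, MPI_INT, partner, 0,
                     &recv[partner], 1, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(send); free(recv);
    MPI_Finalize();
    return 0;
}
```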
If the message size is small and latency dominates the communication, a hypercube algorithm can be used to distribute the messages in time \mathcal{O}(\log p (\alpha + \beta p n)).
This table[12] gives an overview of the best known asymptotic runtimes, assuming we have free choice of network topology.
Example topologies for optimal runtime are the binary tree, the binomial tree and the hypercube.
In practice, we have to adjust to the available physical topologies, e.g. dragonfly, fat tree or grid network, among others.
More information under Network topology.
For each operation, the optimal algorithm can depend on the input size n.
The complexities stated in the table depend on the latency \alpha and the communication cost per word \beta in addition to the number of processing units p and the input message size per node n.
Name | # senders | # receivers | # messages | Computations? | Complexity
Broadcast | 1 | p | 1 | no | \mathcal{O}(\alpha \log p + \beta n)
Reduce | p | 1 | p | yes | \mathcal{O}(\alpha \log p + \beta n)
All-reduce | p | p | p | yes | \mathcal{O}(\alpha \log p + \beta n)
Prefix sum | p | p | p | yes | \mathcal{O}(\alpha \log p + \beta n)
Barrier | p | p | 0 | no | \mathcal{O}(\alpha \log p)
Gather | p | 1 | p | no | \mathcal{O}(\alpha \log p + \beta p n)
All-Gather | p | p | p | no | \mathcal{O}(\alpha \log p + \beta p n)
Scatter | 1 | p | p | no | \mathcal{O}(\alpha \log p + \beta p n)
All-To-All | p | p | p^2 | no | \mathcal{O}(\log p (\alpha + \beta p n)) or \mathcal{O}(p(\alpha + \beta n))
Sanders, Peter; Mehlhorn, Kurt; Dietzfelbinger, Martin; Dementiev, Roman (2019). Sequential and Parallel Algorithms and Data Structures - The Basic Toolbox. Springer Nature Switzerland AG. ISBN 978-3-030-25208-3.