In parallel computing, all-to-all (also known as index operation or total exchange) is a collective operation, where each processor sends an individual message to every other processor.
Initially, each processor holds p messages of size m each, and the goal is to exchange the i-th message of processor j with the j-th message of processor i.
The number of communication rounds and the overall communication volume are measures to evaluate the quality of an all-to-all algorithm. We consider a single-ported full-duplex machine throughout this article. On such a machine, an all-to-all algorithm requires at least
\lceillog2n\rceil
\left\lceilm(n-1)\right\rceil
Depending on the network topology (fully connected, hypercube, ring), different all-to-all algorithms are required.
We consider a single-ported machine. The way the data is routed through the network depends on its underlying topology. We take a look at all-to-all algorithms for common network topologies.
A hypercube is a network topology, where two processors share a link, if the hamming distance of their indices is one. The idea of an all-to-all algorithm is to combine messages belonging to the same subcube, and then distribute them.
An all-to-all algorithm in a ring topology is very intuitive. Initially a processor sends a message of size m(p-1) to one of its neighbors. Communication is performed in the same direction on all processors. When a processor receives a message, it extracts the part that belongs to it and forwards the remainder of the message to the next neighbor. After (p-1) communication rounds, every message is distributed to its destination.
The time taken by this algorithm is
(t | ||||
|
twmp)(p-1)
ts
tw
For a mesh we look at a
\sqrt{p} x \sqrt{p}
\sqrt{p}
\sqrt{p}
The overall time of communication for this algorithm is
(2ts+twmp)(\sqrt{p}-1)
Again, we consider a single-ported machine. A trivial algorithm, is to send (p-1) asynchronous messages into the network for each processor. The performance of this algorithm is poor, which is due to congestion arising because of the bisection width of the network.[3] More sophisticated algorithms combine messages to reduce the number of send operations and try to control congestion.
For large messages, the cost of a startup is small compared to the cost of transmitting the payload. It is faster to send messages directly to their destination. In the following algorithm an all-to-all algorithm is performed using (p-1) one-to-one routings.
// p odd: // pe index
j\in\{0,...,p-1\}
(i-j)\modp
// p even: // pe index
j\in\{0,...,p-1\}
p | |
2 |
i\mod(p-1)
(i-j)mod(p-1)
The algorithm has a different behavior, whether p is odd or even. In case p is odd, one processor is idle in each iteration. For an even p, this idle processor communicates with the processor with index p-1. The total time taken is
(p-1)(ts+twm)
p(ts+twm)
Instead of pairing processor j with processor
(i-j)\bmodp