Communication-avoiding algorithms minimize the movement of data within a memory hierarchy in order to improve running time and energy consumption. They minimize the total of two costs (in terms of time and energy): arithmetic and communication. Communication, in this context, refers to moving data, either between levels of memory or between multiple processors over a network. It is much more expensive than arithmetic.
A common computational model in analyzing communication-avoiding algorithms is the two-level memory model:
[1] Corollary 6.2:
More general results for other numerical linear algebra operations can be found in.[2] The following proof is from.[3]
Consider the following running-time model:
⇒ Total running time = γ·(no. of FLOPs) + β·(no. of words)
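The model above can be written as a one-line function. This is a minimal sketch; the function name and the numeric machine parameters are assumptions chosen for illustration, not measurements of any real system.

```python
def total_running_time(flops, words, gamma, beta):
    """Total time = gamma * (no. of FLOPs) + beta * (no. of words moved)."""
    return gamma * flops + beta * words


# Hypothetical machine parameters (illustrative assumptions only).
GAMMA = 1e-9  # seconds per FLOP
BETA = 1e-7   # seconds per word moved

# With equal counts of FLOPs and words, the beta term dominates the total.
compute_part = GAMMA * 1_000_000
communication_part = BETA * 1_000_000
print(compute_part, communication_part,
      total_running_time(1_000_000, 1_000_000, GAMMA, BETA))
```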
Because β >> γ, whether measured in time or in energy, communication cost dominates computation cost. Technological trends indicate that the relative cost of communication is increasing on a variety of platforms, from cloud computing to supercomputers to mobile devices. The report also predicts that the gap between DRAM access time and FLOPs will increase 100× over the coming decade to balance power usage between processors and DRAM.
Energy consumption increases by orders of magnitude as we go higher in the memory hierarchy.
United States president Barack Obama cited communication-avoiding algorithms in the FY 2012 Department of Energy budget request to Congress:
Communication-avoiding algorithms are designed with the following objectives:
The following simple example demonstrates how these are achieved.
Let A, B and C be square matrices of order n × n. The following naive algorithm implements C = C + A * B:
for i = 1 to n
    for j = 1 to n
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
Arithmetic cost (time-complexity): n²(2n − 1) for sufficiently large n, or O(n³).
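Rewriting this algorithm with the communication cost labelled at each step:
for i = 1 to n
    {read row i of A into fast memory}                - n² reads
    for j = 1 to n
        {read C(i,j) into fast memory}                - n² reads
        {read column j of B into fast memory}         - n³ reads
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
        {write C(i,j) back to slow memory}            - n² writes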
Fast memory may be defined as the local processor memory (CPU cache) of size M and slow memory may be defined as the DRAM.
Communication cost (reads/writes): n³ + 3n², or O(n³).
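The count n³ + 3n² can be checked by instrumenting the naive loop. The sketch below assumes that one word moves per matrix element and that fast memory holds one row of A, one column of B and one element of C at a time; the function name and the counters are hypothetical, introduced only for this illustration.

```python
def naive_matmul_with_traffic(A, B, C):
    """Compute C = C + A * B in place; return (reads, writes) in words."""
    n = len(A)
    reads = writes = 0
    for i in range(n):
        row_a = A[i][:]                      # read row i of A: n words
        reads += n
        for j in range(n):
            c_ij = C[i][j]                   # read C(i,j): 1 word
            reads += 1
            col_b = [B[k][j] for k in range(n)]  # read column j of B: n words
            reads += n
            for k in range(n):
                c_ij += row_a[k] * col_b[k]
            C[i][j] = c_ij                   # write C(i,j) back: 1 word
            writes += 1
    return reads, writes


n = 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j + 1 for j in range(n)] for i in range(n)]
C = [[0] * n for _ in range(n)]
reads, writes = naive_matmul_with_traffic(A, B, C)
print(reads + writes, n**3 + 3 * n**2)  # the two numbers agree
```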
Since total running time = γ·O(n³) + β·O(n³) and β >> γ, the communication cost is dominant. The blocked (tiled) matrix multiplication algorithm reduces this dominant term:
Consider A, B and C to be n/b-by-n/b matrices of b-by-b sub-blocks where b is called the block size; assume three b-by-b blocks fit in fast memory.
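for i = 1 to n/b
    for j = 1 to n/b
        {read block C(i,j) into fast memory}          - b² × (n/b)² = n² reads
        for k = 1 to n/b
            {read block A(i,k) into fast memory}      - b² × (n/b)³ = n³/b reads
            {read block B(k,j) into fast memory}      - b² × (n/b)³ = n³/b reads
            C(i,j) = C(i,j) + A(i,k) * B(k,j)         - {matrix multiply on b-by-b blocks}
        {write block C(i,j) back to slow memory}      - b² × (n/b)² = n² writes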
Communication cost: 2n³/b + 2n² reads/writes << 2n³ arithmetic cost
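The blocked loop and its traffic count can be sketched in the same way. This is an illustrative implementation under stated assumptions: b divides n, one word moves per matrix element, and three b-by-b blocks fit in fast memory. The function name and the word counter are hypothetical.

```python
def blocked_matmul_with_traffic(A, B, C, b):
    """Compute C = C + A * B in place by b-by-b blocks; return words moved."""
    n = len(A)
    assert n % b == 0, "this sketch assumes the block size divides n"
    nb = n // b
    words = 0

    def load(X, bi, bj):
        # Copy block (bi, bj) of X into "fast memory": b*b words.
        return [[X[bi * b + r][bj * b + c] for c in range(b)] for r in range(b)]

    for bi in range(nb):
        for bj in range(nb):
            c_blk = load(C, bi, bj)
            words += b * b                  # read block C(i,j)
            for bk in range(nb):
                a_blk = load(A, bi, bk)
                b_blk = load(B, bk, bj)
                words += 2 * b * b          # read blocks A(i,k) and B(k,j)
                for r in range(b):
                    for c in range(b):
                        s = c_blk[r][c]
                        for t in range(b):
                            s += a_blk[r][t] * b_blk[t][c]
                        c_blk[r][c] = s
            for r in range(b):              # write block C(i,j) back
                for c in range(b):
                    C[bi * b + r][bj * b + c] = c_blk[r][c]
            words += b * b
    return words


n, b = 8, 4
A = [[(i + 2 * j) % 5 for j in range(n)] for i in range(n)]
B = [[(3 * i + j) % 7 for j in range(n)] for i in range(n)]
C = [[0] * n for _ in range(n)]
print(blocked_matmul_with_traffic(A, B, C, b), 2 * n**3 // b + 2 * n**2)
```

Larger blocks reduce the n³/b term, which is why the block size is pushed as far as fast memory allows.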
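Making b as large as possible subject to
3b² ≤ M
gives b = √(M/3). Substituting into 2n³/b + 2n², the communication cost becomes
2√3·n³/√M + 2n², which attains the communication lower bound Ω(no. of FLOPs / √M) up to a constant factor.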
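As a small worked example, the block size and the resulting traffic can be computed directly from the fast-memory size. The helper names and the values of M and n below are assumptions chosen for illustration.

```python
import math


def largest_block_size(M):
    """Largest integer b such that three b-by-b blocks fit: 3*b*b <= M."""
    return math.isqrt(M // 3)


def blocked_words(n, b):
    """Words moved by the blocked algorithm: 2n^3/b + 2n^2."""
    return 2 * n**3 / b + 2 * n**2


def naive_words(n):
    """Words moved by the naive algorithm: n^3 + 3n^2."""
    return n**3 + 3 * n**2


M = 3 * 64 * 64   # hypothetical fast-memory size in words
n = 1024          # hypothetical matrix order
b = largest_block_size(M)   # 64 for this M
print(b, naive_words(n) / blocked_words(n, b))
```

The printed ratio shows how much less data the blocked algorithm moves than the naive one for these assumed sizes.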
Most of the approaches investigated in the past to address this problem rely on scheduling or tuning techniques that aim at overlapping communication with computation. However, this approach can lead to an improvement of at most a factor of two. Ghosting is a different technique for reducing communication, in which a processor redundantly stores and computes data from neighboring processors for use in future computations. Cache-oblivious algorithms represent a different approach, introduced in 1999 for fast Fourier transforms and then extended to graph algorithms, dynamic programming, etc. They have also been applied to several operations in linear algebra, such as dense LU and QR factorizations. The design of architecture-specific algorithms is another approach that can be used for reducing communication in parallel algorithms, and there are many examples in the literature of algorithms that are adapted to a given communication topology.
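Ghosting can be illustrated with a one-dimensional three-point averaging stencil. The sketch below is a sequential simulation, not a parallel program: each hypothetical "processor" copies k ghost cells from each neighbor once, then advances k time steps without further exchange, recomputing some of its neighbors' cells redundantly. The stencil, the fixed end values and all function names are assumptions made for this example.

```python
def stencil_step(u):
    """One 3-point averaging step; the two end values are held fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def run_global(u, steps):
    """Reference: advance the whole array, as a single processor would."""
    for _ in range(steps):
        u = stencil_step(u)
    return u


def run_with_ghosts(u, parts, k):
    """Advance k steps; each part fetches k ghost cells per side only once."""
    n = len(u)
    size = n // parts
    result = []
    for p in range(parts):
        lo = p * size
        hi = n if p == parts - 1 else lo + size
        g_lo, g_hi = max(0, lo - k), min(n, hi + k)
        local = u[g_lo:g_hi]                 # the single "exchange"
        for _ in range(k):
            local = stencil_step(local)      # redundant work on ghost cells
        # Ghost cells go stale one cell per step; the owned cells stay exact.
        result.extend(local[lo - g_lo:hi - g_lo])
    return result


u0 = [float(i * i % 7) for i in range(16)]
print(run_with_ghosts(u0, parts=2, k=3) == run_global(u0, 3))
```

The trade-off is extra memory and arithmetic on the ghost cells in exchange for one exchange per k steps instead of one per step.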