In computer science, Cannon's algorithm is a distributed algorithm for matrix multiplication for two-dimensional meshes first described in 1969 by Lynn Elliot Cannon.[1] [2]
It is especially suitable for computers laid out in an N × N mesh.[3] While Cannon's algorithm works well in homogeneous 2D grids, extending it to heterogeneous 2D grids has been shown to be difficult.[4]
The main advantage of the algorithm is that its storage requirements remain constant and are independent of the number of processors.[2]
The Scalable Universal Matrix Multiplication Algorithm (SUMMA)[5] is a more practical algorithm that requires less workspace and overcomes the need for a square 2D grid. It is used by the ScaLAPACK, PLAPACK, and Elemental libraries.
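When multiplying two n×n matrices A and B, we need n×n processing nodes p arranged in a 2D grid. Each processing element PE(i, j) executes the following steps (here N = n):

    // PE(i, j)
    k := (i + j) mod N;
    a := a[i][k];
    b := b[k][j];
    c[i][j] := 0;
    for (l := 0; l < N; l++) {
        c[i][j] := c[i][j] + a * b;
        concurrently send a to PE(i, (j + N - 1) mod N) and b to PE((i + N - 1) mod N, j),
        and receive the new a from PE(i, (j + 1) mod N) and the new b from PE((i + 1) mod N, j);
    }

We need to select k in every iteration for every processing element (PE) so that no two processors access the same data when computing $a_{ik} b_{kj}$.
Therefore processors in the same row or column must begin the summation with different indices. If, for example, PE(0,0) calculates $a_{00} b_{00}$ in the first step, then PE(0,1) starts with $a_{01} b_{11}$. The choice k := (i + j) mod n for PE(i, j) satisfies this constraint in the first step.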
In the first step we distribute the input matrices between the processors based on the previous rule.
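The initial distribution can be illustrated with a small sketch. It is a single-process illustration rather than a distributed implementation: the function names `start_indices` and `initial_distribution` are hypothetical, and the grid of processors is simply modelled as a nested list, assuming the rule k := (i + j) mod n.

```python
def start_indices(n):
    """Start index k = (i + j) mod n for every PE(i, j) of an n x n grid."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]


def initial_distribution(A, B):
    """Return the a- and b-values that each PE(i, j) holds before the first step.

    PE(i, j) receives a[i][k] and b[k][j] with k = (i + j) mod n.
    """
    n = len(A)
    a_grid = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b_grid = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    return a_grid, b_grid


# Illustrative 3 x 3 input, chosen arbitrarily.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[10, 11, 12], [13, 14, 15], [16, 17, 18]]
k = start_indices(3)
a_grid, b_grid = initial_distribution(A, B)
```

In this sketch every row and every column of `k` is a permutation of 0..n-1, which is the property that lets processors in the same row or column start with different indices.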
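In the next iterations we choose a new k' := (k + 1) mod n for every processor. This way every processor continues to access different values of the matrices. The needed data is then always held by a neighbouring processor. For the next step, PE(i,j) needs the $a$ from PE(i, (j + 1) mod n) and the $b$ from PE((i + 1) mod n, j). This means that $a$ has to be passed cyclically to the left and $b$ cyclically upwards. The results of the multiplications are summed up as usual. After n steps, each processor has computed every product $a_{ik} b_{kj}$ exactly once, and their sum is the desired $c_{ij}$.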
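The complete iteration can be sketched as a sequential simulation. This is a minimal sketch, not a parallel implementation: the name `cannon_multiply` is hypothetical, the processor grid is modelled as nested lists, and the cyclic shifts are performed by re-indexing instead of by message passing.

```python
def cannon_multiply(A, B):
    """Simulate Cannon's algorithm for two n x n matrices on an n x n grid."""
    n = len(A)
    # Initial distribution: PE(i, j) starts with k = (i + j) mod n.
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]

    for _ in range(n):
        # Every PE multiplies its current pair and accumulates the result.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # a moves cyclically left: PE(i, j) receives from PE(i, (j + 1) mod n).
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        # b moves cyclically up: PE(i, j) receives from PE((i + 1) mod n, j).
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


def naive_multiply(A, B):
    """Reference triple-loop product, used only to check the simulation."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


print(cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

After each shift, PE(i, j) holds exactly the pair belonging to k' = (k + 1) mod n, so the simulation follows the same index rule as the description above.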
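After the initial distribution, each processor only has to store the data for the next step: the intermediate result of the previous sum, one $a_{ik}$, and one $b_{kj}$. Each of the three matrices is therefore stored in memory only once, evenly distributed across the processors.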
In practice we have far fewer processors than matrix elements. We can replace the matrix elements with submatrices, so that every processor handles more values. The scalar multiplication and addition then become sequential matrix multiplication and addition. With p processors arranged in an N × N grid, where $N = \sqrt{p}$, the width and height of the submatrices are $n/N = n/\sqrt{p}$.
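The generalisation to submatrices can be sketched in the same sequential style. The sketch assumes that p is a perfect square, that the processors form a grid of side length √p (called `N` in the code, matching the role N plays in the runtime formula), and that this side length divides n, so each block has side n/√p. The names `split_blocks`, `block_multiply_add`, and `cannon_blocked` are hypothetical.

```python
import math


def split_blocks(M, N):
    """Cut an n x n matrix into an N x N grid of (n/N) x (n/N) blocks."""
    s = len(M) // N
    return [[[row[j * s:(j + 1) * s] for row in M[i * s:(i + 1) * s]]
             for j in range(N)] for i in range(N)]


def block_multiply_add(C, X, Y):
    """Sequential block update C += X * Y (replaces the scalar multiply-add)."""
    s = len(X)
    for r in range(s):
        for q in range(s):
            C[r][q] += sum(X[r][t] * Y[t][q] for t in range(s))


def cannon_blocked(A, B, p):
    """Simulate Cannon's algorithm with p processors holding submatrices."""
    n = len(A)
    N = math.isqrt(p)  # side length of the processor grid
    if N * N != p or n % N != 0:
        raise ValueError("p must be a perfect square whose root divides n")
    s = n // N  # side length of one submatrix
    Ab, Bb = split_blocks(A, N), split_blocks(B, N)
    # Same index rule as before, applied to blocks instead of elements.
    a = [[Ab[i][(i + j) % N] for j in range(N)] for i in range(N)]
    b = [[Bb[(i + j) % N][j] for j in range(N)] for i in range(N)]
    c = [[[[0] * s for _ in range(s)] for _ in range(N)] for _ in range(N)]
    for _ in range(N):
        for i in range(N):
            for j in range(N):
                block_multiply_add(c[i][j], a[i][j], b[i][j])
        a = [[a[i][(j + 1) % N] for j in range(N)] for i in range(N)]  # shift left
        b = [[b[(i + 1) % N][j] for j in range(N)] for i in range(N)]  # shift up
    # Reassemble the N x N grid of result blocks into one n x n matrix.
    return [[c[i][j][r][q] for j in range(N) for q in range(s)]
            for i in range(N) for r in range(s)]
```

With p = 1 the sketch degenerates to one sequential multiplication, and with p = n² it reduces to the element-wise scheme described earlier.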
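The runtime of the algorithm is
$$T(n,p) = T_{coll}(n/N, p) + N \cdot T_{seq}(n/N) + 2(N-1)\left(T_{start} + T_{byte} \cdot (n/N)^2\right)$$
where $T_{coll}$ is the time for the initial distribution of the matrices in the first step, $T_{seq}$ is the time for the sequential computation of the intermediate results, and $T_{start}$ and $T_{byte}$ are the times needed to establish a connection and to transmit one byte, respectively.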
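The runtime formula can be evaluated with a small cost-model sketch. Everything numeric here is an assumption made for illustration: the function names `cannon_runtime` and `startup_share` are hypothetical, the machine parameters are invented, and the cubic cost of a sequential block product is only one plausible choice for T_seq.

```python
def cannon_runtime(n, N, t_coll, t_seq, t_start, t_byte):
    """Evaluate T(n, p) for an N x N processor grid (p = N * N)."""
    p = N * N
    block = n / N                      # side length of one submatrix
    distribution = t_coll(block, p)    # T_coll(n/N, p)
    computation = N * t_seq(block)     # N sequential block products
    communication = 2 * (N - 1) * (t_start + t_byte * block ** 2)
    return distribution + computation + communication


def startup_share(n, N, t_start, t_byte):
    """Fraction of the cost of one message that is spent on connection setup."""
    return t_start / (t_start + t_byte * (n / N) ** 2)


def example_t_coll(block, p):
    """Hypothetical: ignore the cost of the initial distribution."""
    return 0.0


def example_t_seq(block):
    """Hypothetical: cubic cost of a sequential block product."""
    return 2e-9 * block ** 3


# Hypothetical machine parameters, chosen only for illustration.
for N in (2, 4, 8, 16):
    total = cannon_runtime(1024, N, example_t_coll, example_t_seq, 1e-5, 1e-9)
    print(N, total, startup_share(1024, N, 1e-5, 1e-9))
```

Under these assumed parameters the share of each message spent on connection setup grows as N increases, because the number of messages rises while each message shrinks.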
A disadvantage of the algorithm is that it requires many connection setups, each carrying only a small message: the start-up cost $T_{start}$ is paid 2(N-1) times. It would be better to transmit more data in each message.