Parallel external memory explained
In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine.[1] It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.
__TOC__
Model
Definition
The PEM model[2] is a combination of the EM model and the PRAM model. The PEM model is a computation model which consists of
processors and a two-level
memory hierarchy. This memory hierarchy consists of a large
external memory (main memory) of size
and
small
internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor. A processor can't access another’s cache. The caches have a size
which is partitioned in blocks of size
. The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the cache in blocks of size
.
I/O complexity
The complexity measure of the PEM model is the I/O complexity,[3] which determines the number of parallel blocks transfers between the main memory and the cache. During a parallel block transfer each processor can transfer a block. So if
processors load parallelly a data block of size
form the main memory into their caches, it is considered as an I/O complexity of
not
. A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.
Read/write conflicts
In the PEM model, there is no direct communication network between the P processors. The processors have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently read/write conflicts[4] occur. Like in the PRAM model, three different variations of this problem are considered:
- Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
- Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
- Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.
The following two algorithms[5] solve the CREW and EREW problem if
processors write to the same block simultaneously.A first approach is to serialize the write operations. Only one processor after the other writes to the block. This results in a total of
parallel block transfers. A second approach needs
parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a
binary tree fashion and gradually combine the data into a single block. In the first round
processors combine their blocks into
blocks. Then
processors combine the
blocks into
. This procedure is continued until all the data is combined in one block.
Comparison to other models
Examples
Multiway partitioning
Let
be a vector of d-1 pivots sorted in increasing order. Let be an unordered set of N elements. A d-way partition
[6] of is a set
, where
and
for
.
is called the i-th bucket. The number of elements in
is greater than
and smaller than
. In the following algorithm
[7] the input is partitioned into N/P-sized contiguous segments
in main memory. The processor i primarily works on the segment
. The multiway partitioning algorithm (
PEM_DIST_SORT
[8]) uses a PEM
prefix sum algorithm
[9] to calculate the prefix sum with the optimal
I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm. // Compute parallelly a d-way partition on the data segments
for each processor i
in parallel do Read the vector of pivots into the cache. Partition
into d buckets and let vector
be the number of items in each bucket.
end for Run PEM prefix sum on the set of vectors
simultaneously. // Use the prefix sum vector to compute the final partition
for each processor i
in parallel do Write elements
into memory locations offset appropriately by
and
.
end for Using the prefix sums stored in
the last processor P calculates the vector of bucket sizes and returns it.
If the vector of
pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with
+\left\lceil
\right\rceil>log(P)+dlog(B)\right)
I/O complexity. The content of the final buckets have to be located in contiguous memory.
Selection
The selection problem is about finding the k-th smallest item in an unordered list of size .The following code makes use of PRAMSORT
which is a PRAM optimal sorting algorithm which runs in
, and
SELECT
, which is a cache optimal single-processor selection algorithm.
if
then
return
end if //Find median of each
for each processor
in parallel do
end for // Sort medians
tt{PRAMSORT}(\lbracem1,...,m2\rbrace,P)
// Partition around median of medians
t=tt{PEMPARTITION}(A,mP/2,P)
if
then return tt{PEMSELECT}(A[1:t],P,k)
else return tt{PEMSELECT}(A[t+1:N],P,k-t)
end ifUnder the assumption that the input is stored in contiguous memory, PEMSELECT
has an I/O complexity of:
Distribution sort
Distribution sort partitions an input list of size into disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.
If
the task is delegated to a cache-optimal single-processor sorting algorithm.
Otherwise the following algorithm is used: // Sample
elements from
for each processor
in parallel do if
then
Load
in -sized pages and sort pages individually
else
Load and sort
as single page
end if Pick every
'th element from each sorted memory page into contiguous vector
of samples
end for in parallel do Combine vectors
into a single contiguous vector
Make
copies of
:
}
end do // Find
pivots
for
to
in parallel do l{M}[j]=tt{PEMSELECT}(l{R}i,\tfrac{P}{\sqrt{d}},\tfrac{j ⋅ 4N}{d})
end for Pack pivots in contiguous array
// Partition around pivots into buckets
l{B}=tt{PEMMULTIPARTITION}(A[1:N],l{M},\sqrt{d},P)
// Recursively sort buckets
for
to
in parallel do recursively call
on bucket of size
using
O\left(\left\lceil\tfrac{l{B}[j]}{N/P}\right\rceil\right)
processors responsible for elements in bucket
end forThe I/O complexity of PEMDISTSORT
is:
O\left(\left\lceil
\right\rceil\left(logdP+logM/B
\right)+f(N,P,d) ⋅ logdP\right)
where
} \log \frac + \left \lceil \frac \log P + \sqrt \log B \right \rceil \right)
If the number of processors is chosen that
f(N,P,d)=O\left(\left\lceil\tfrac{N}{PB}\right\rceil\right)
and
the I/O complexity is then:
Other PEM algorithms
Where
is the time it takes to sort items with processors in the PEM model.
See also
Notes and References
- Book: Arge. Lars. Goodrich. Michael T.. Nelson. Michael. Sitchinava. Nodari. Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures . Fundamental parallel algorithms for private-cache chip multiprocessors . 2008. 197–206. New York, New York, USA. ACM Press. 10.1145/1378533.1378573. 9781595939739. 11067041 .
- Book: Arge. Lars. Goodrich. Michael T.. Nelson. Michael. Sitchinava. Nodari. Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures . Fundamental parallel algorithms for private-cache chip multiprocessors . 2008. 197–206. New York, New York, USA. ACM Press. 10.1145/1378533.1378573. 9781595939739. 11067041 .
- Book: Arge. Lars. Goodrich. Michael T.. Nelson. Michael. Sitchinava. Nodari. Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures . Fundamental parallel algorithms for private-cache chip multiprocessors . 2008. 197–206. New York, New York, USA. ACM Press. 10.1145/1378533.1378573. 9781595939739. 11067041 .
- Book: Arge. Lars. Goodrich. Michael T.. Nelson. Michael. Sitchinava. Nodari. Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures . Fundamental parallel algorithms for private-cache chip multiprocessors . 2008. 197–206. New York, New York, USA. ACM Press. 10.1145/1378533.1378573. 9781595939739. 11067041 .
- Book: Arge. Lars. Goodrich. Michael T.. Nelson. Michael. Sitchinava. Nodari. Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures . Fundamental parallel algorithms for private-cache chip multiprocessors . 2008. 197–206. New York, New York, USA. ACM Press. 10.1145/1378533.1378573. 9781595939739. 11067041 .
- Book: Arge. Lars. Goodrich. Michael T.. Nelson. Michael. Sitchinava. Nodari. Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures . Fundamental parallel algorithms for private-cache chip multiprocessors . 2008. 197–206. New York, New York, USA. ACM Press. 10.1145/1378533.1378573. 9781595939739. 11067041 .
- Book: Arge. Lars. Goodrich. Michael T.. Nelson. Michael. Sitchinava. Nodari. Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures . Fundamental parallel algorithms for private-cache chip multiprocessors . 2008. 197–206. New York, New York, USA. ACM Press. 10.1145/1378533.1378573. 9781595939739. 11067041 .
- Book: Arge. Lars. Goodrich. Michael T.. Nelson. Michael. Sitchinava. Nodari. Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures . Fundamental parallel algorithms for private-cache chip multiprocessors . 2008. 197–206. New York, New York, USA. ACM Press. 10.1145/1378533.1378573. 9781595939739. 11067041 .
- Book: Arge. Lars. Goodrich. Michael T.. Nelson. Michael. Sitchinava. Nodari. Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures . Fundamental parallel algorithms for private-cache chip multiprocessors . 2008. 197–206. New York, New York, USA. ACM Press. 10.1145/1378533.1378573. 9781595939739. 11067041 .
- Book: Arge. Lars. Goodrich. Michael T.. Sitchinava. Nodari. 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS) . Parallel external memory graph algorithms . 2010. 1–11. IEEE. 10.1109/ipdps.2010.5470440. 9781424464425. 587572 .