Samplesort is a divide-and-conquer sorting algorithm often used in parallel processing systems.[1] Conventional divide-and-conquer sorting algorithms partition the array into sub-intervals or buckets; the buckets are then sorted individually and concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can degrade significantly. Samplesort addresses this issue by selecting a sample of size s from the n-element input sequence, and determining the ranges of the buckets by sorting the sample and choosing p − 1 elements from the result. These elements (called splitters) then divide the array into approximately equal-sized buckets.[2] Samplesort was described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W. D. Frazer and A. C. McKellar.
Samplesort is a generalization of quicksort. Where quicksort partitions its input into two parts at each step, based on a single value called the pivot, samplesort instead takes a larger sample from its input and divides its data into buckets accordingly. Like quicksort, it then recursively sorts the buckets.
To devise a samplesort implementation, one needs to decide on the number of buckets p. When this is done, the actual algorithm operates in three phases:

1. Sample the input, sort the sample, and select p − 1 splitters from it.
2. Scan the input and place each element into the bucket determined by the splitters.
3. Sort each bucket, recursively or with another sorting algorithm.

The full sorted output is the concatenation of the buckets.
A common strategy is to set p equal to the number of processors available. The data is then distributed among the processors, which sort their buckets using some other, sequential, sorting algorithm.
The following listing shows the above-mentioned three-step algorithm as pseudocode and shows how the algorithm works in principle.[3] In the following, A is the unsorted data, k is the oversampling factor, discussed later, and p is the number of buckets.
function sampleSort(A[1..n], k, p)
    // if average bucket size is below a threshold switch to e.g. quicksort
    if n / k < threshold then smallSort(A)
    /* Step 1 */
    select S = [S1, ..., S(p−1)k] randomly from A  // select samples
    sort S  // sort sample
    [s0, s1, ..., sp−1, sp] <- [−∞, Sk, S2k, ..., S(p−1)k, ∞]  // select splitters
    /* Step 2 */
    for each a in A
        find j such that sj−1 < a <= sj
        place a in bucket bj
    /* Step 3 and concatenation */
    return concatenate(sampleSort(b1), ..., sampleSort(bp))
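The three steps above can be sketched as runnable Python. This is a minimal sequential sketch, not the original formulation: the threshold value, the fallback to the built-in sort, and the guard against degenerate splits (all keys equal) are assumptions of this sketch.

```python
import bisect
import random

def sample_sort(a, k=4, p=4, threshold=32):
    """Sequential samplesort sketch: sample, pick splitters, bucket, recurse."""
    if len(a) <= threshold:                  # small input: fall back to built-in sort
        return sorted(a)
    # Step 1: draw (p-1)*k samples, sort them, take every k-th as a splitter
    s = sorted(random.choices(a, k=(p - 1) * k))
    splitters = s[k - 1::k][:p - 1]
    # Step 2: place each element into the bucket found by binary search
    buckets = [[] for _ in range(p)]
    for x in a:
        buckets[bisect.bisect_left(splitters, x)].append(x)
    # Step 3: sort the buckets recursively and concatenate
    out = []
    for b in buckets:
        if len(b) == len(a):                 # degenerate split (e.g. all keys equal)
            return sorted(a)
        out.extend(sample_sort(b, k, p, threshold))
    return out
```

With k·(p − 1) samples, the splitters at positions k, 2k, ... approximate the quantiles of the input, which keeps the buckets roughly balanced.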
The pseudocode differs from the original Frazer and McKellar algorithm.[4] In the pseudocode, samplesort is called recursively; Frazer and McKellar called samplesort just once and used quicksort in all following iterations.
The complexity, given in Big O notation, for a parallelized implementation with p processors:

- Find the splitters: O(n/p + log(p)).
- Send to buckets: O(p) for reading all splitters, O(log(p)) per element to find its bucket by binary search over the splitters, hence O((n/p) log(p)) for all of a processor's elements, and O(n/p) for moving the elements into the buckets.
- Sort buckets: O(c(n/p)), where c is the complexity of the underlying sequential sorting method; commonly c(n) = n log(n).
The number of comparisons performed by this algorithm approaches the information-theoretic optimum log2(n!) for large input sequences.
The data may be sampled through different methods. Some methods include:
The oversampling ratio k determines how many times more data elements to pull as samples before determining the splitters. The goal is to get a good representation of the distribution of the data. If the data values are widely distributed, in that there are not many duplicate values, then a small oversampling ratio is sufficient. In other cases where there are many duplicates in the distribution, a larger oversampling ratio will be necessary. In the ideal case, after step 2, each bucket contains n/p elements. After pulling k times more samples than needed, the samples are sorted, and the elements at positions k, 2k, 3k, ..., (p−1)k are chosen as splitters; together with −∞ and ∞ as the outer boundaries, they define p buckets.
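Splitter selection with oversampling can be sketched as follows (a hypothetical helper; −∞ and ∞ are represented by `math.inf`):

```python
import math
import random

def choose_splitters(data, p, k):
    """Pick p-1 splitters from a k-times oversampled, sorted sample."""
    sample = sorted(random.sample(data, k * (p - 1)))
    # every k-th sample element becomes a splitter; the infinities bound
    # the leftmost and rightmost buckets
    return [-math.inf] + sample[k - 1::k][:p - 1] + [math.inf]
```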
With the resulting sample size, the expected bucket size and especially the probability of a bucket exceeding a certain size can be estimated. The following will show that for an oversampling factor of S ∈ Θ(log(n)/ε²), the probability that no bucket has more than (1+ε)·n/p elements is larger than 1 − 1/n.
To show this, let ⟨e1, ..., en⟩ be the input as a sorted sequence. For a processor to receive more than (1+ε)·n/p elements, there has to exist a subsequence of the input of length (1+ε)·n/p from which at most S samples are picked. These cases constitute the failure probability Pfail.
Let Xi be a random variable that is 1 if the element ei is chosen as a sample and 0 otherwise. The expected value of Xi equals the sampling probability, and summing over the elements of such a subsequence gives the expected number of samples it contains; this will be used to estimate Pfail. Using the Chernoff bound, it can then be shown that Pfail is at most 1/n, which yields the bound stated above.
In case of many identical keys, the algorithm goes through many recursion levels where sequences are sorted, because the whole sequence consists of identical keys. This can be counteracted by introducing equality buckets. Elements equal to a pivot are sorted into their respective equality bucket, which can be implemented with only one additional conditional branch. Equality buckets are not further sorted. This works since keys occurring more than n/k times are likely to become pivots.
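A classification step with equality buckets can be sketched as follows (hypothetical helper names; elements equal to a splitter land in that splitter's equality bucket and are never sorted again):

```python
import bisect

def classify_with_equality(a, splitters):
    """Split `a` into regular buckets plus one equality bucket per splitter."""
    regular = [[] for _ in range(len(splitters) + 1)]
    equal = [[] for _ in splitters]
    for x in a:
        j = bisect.bisect_left(splitters, x)
        if j < len(splitters) and splitters[j] == x:
            equal[j].append(x)       # equal to a pivot: one extra branch, no re-sorting
        else:
            regular[j].append(x)
    return regular, equal
```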
Samplesort is often used in parallel systems, including distributed systems such as bulk synchronous parallel machines.[5] [6] [7] Due to its variable number of splitters (in contrast to the single pivot in quicksort), samplesort is well suited and intuitive for parallelization and scaling. Furthermore, samplesort is also more cache-efficient than implementations of, e.g., quicksort.
Parallelization is implemented by splitting the sorting for each processor or node, where the number of buckets is equal to the number of processors p. This way, each processor ends up with approximately n/p elements to sort sequentially.
On distributed systems, the splitters are chosen by taking k elements on every processor, sorting the resulting kp elements with a distributed sorting algorithm, taking every k-th element, and broadcasting the result to all processors. This costs Tsort(kp, p) for sorting the kp elements on p processors, as well as Tallgather(p, p) for broadcasting the chosen splitters to the p processors.
With the resulting splitters, each processor places its own input data into local buckets. This takes O((n/p) log(p)) with binary search. Thereafter, the local buckets are redistributed to the processors: processor i receives the local buckets bi of all other processors, which takes Tall-to-all(N, p), where N is the size of the biggest bucket. Lastly, each processor sorts its local bucket, which takes Tlocalsort(N).
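The distributed scheme can be simulated in plain Python, with each processor's stripe represented by a list (the sampling, all-to-all exchange, and local sort are modeled directly; no actual communication library is used):

```python
import bisect
import random

def distributed_sample_sort(stripes, k=8):
    """Simulate p processors: k samples each, a sort of the kp samples,
    splitter broadcast, all-to-all bucket exchange, local sort."""
    p = len(stripes)
    samples = sorted(s for stripe in stripes for s in random.sample(stripe, k))
    splitters = samples[k - 1::k][:p - 1]        # p-1 splitters from kp samples
    # each processor partitions its stripe into p local buckets
    local = [[[] for _ in range(p)] for _ in range(p)]
    for i, stripe in enumerate(stripes):
        for x in stripe:
            local[i][bisect.bisect_left(splitters, x)].append(x)
    # all-to-all: processor j receives bucket j from every processor, sorts it
    return [sorted(x for i in range(p) for x in local[i][j]) for j in range(p)]
```

Concatenating the per-processor results yields the fully sorted sequence.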
Experiments performed in the early 1990s on Connection Machine supercomputers showed samplesort to be particularly good at sorting large datasets on these machines, because it incurs little interprocessor communication overhead.[8] On latter-day GPUs, the algorithm may be less effective than its alternatives.[9]
As described above, the samplesort algorithm splits the elements according to the selected splitters. An efficient implementation strategy is proposed in the paper "Super Scalar Sample Sort".[3] The implementation proposed in the paper uses two arrays of size n: the original array containing the input data, and a temporary one. In each recursion step, the data is copied to the other array in a partitioned fashion. If the data is in the temporary array in the last recursion step, it is copied back to the original array.
In a comparison-based sorting algorithm, the comparison operation is the most performance-critical part. In samplesort, this corresponds to determining the bucket for each element, which needs log(k) comparisons per element when searching the k splitters.
Super Scalar Sample Sort uses a balanced search tree which is implicitly stored in an array t. The root is stored at position 1, the left successor of ti is stored at t2i, and the right successor is stored at t2i+1. Given the search tree t, the algorithm calculates the bucket number j of element ai as follows (assuming ai > tj evaluates to 1 if it is true and to 0 otherwise):
j := 1
repeat log2(p) times
    j := 2j + (a > tj)
j := j − p + 1
Since the number of buckets is known at compile time, this loop can be unrolled by the compiler. The comparison operation is implemented with predicated instructions. Thus, there occur no branch mispredictions, which would slow down the comparison operation significantly.
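The implicit search tree and the branch-free descent can be sketched in Python (a sketch returning 0-based bucket indices; the tree layout with the root at index 1 matches the pseudocode above, and k is assumed to be a power of two):

```python
def build_tree(splitters):
    """Store k-1 sorted splitters as an implicit tree t[1..k-1]:
    children of t[i] sit at t[2i] and t[2i+1]."""
    k = len(splitters) + 1                   # number of buckets (power of two here)
    t = [None] * k
    def fill(node, lo, hi):                  # median splitter goes to `node`
        if node < k:
            mid = (lo + hi) // 2
            t[node] = splitters[mid]
            fill(2 * node, lo, mid - 1)
            fill(2 * node + 1, mid + 1, hi)
    fill(1, 0, len(splitters) - 1)
    return t

def find_bucket(t, k, a):
    """log2(k) rounds of j = 2j + (a > t[j]); no data-dependent branches."""
    j = 1
    for _ in range(k.bit_length() - 1):      # log2(k) iterations
        j = 2 * j + (a > t[j])
    return j - k                             # 0-based bucket index
```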
For an efficient partitioning of the elements, the algorithm needs to know the sizes of the buckets in advance, so that each element can be placed into the other array at the right position. A naive algorithm could first count the number of elements in each bucket and then insert the elements accordingly. Doing so, one has to determine the bucket for each element twice: once for counting the number of elements in its bucket, and once for inserting it.
To avoid this doubling of comparisons, Super Scalar Sample Sort uses an additional array o (called the oracle) which assigns each element's index to a bucket. First, the algorithm determines the contents of o by classifying each element once. Then, the bucket sizes are computed and the elements are placed at the right position with the help of o. The array o costs n · log(k) bits of additional memory.
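The oracle-based two-pass partitioning can be sketched as follows (classification via binary search stands in for the search-tree descent; names are illustrative):

```python
import bisect

def oracle_partition(a, splitters):
    """Classify once into oracle o, then scatter using prefix-summed offsets."""
    k = len(splitters) + 1
    o = [bisect.bisect_left(splitters, x) for x in a]   # one classification per element
    sizes = [0] * k
    for b in o:
        sizes[b] += 1
    starts = [0] * k                 # prefix sums give each bucket's start offset
    for j in range(1, k):
        starts[j] = starts[j - 1] + sizes[j - 1]
    out = [None] * len(a)
    for x, b in zip(a, o):           # second pass reads o instead of re-comparing
        out[starts[b]] = x
        starts[b] += 1
    return out, sizes
```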
A key disadvantage of the efficient Samplesort implementation shown above is that it is not in-place and requires a second temporary array of the same size as the input sequence during sorting. Efficient implementations of e.g. quicksort are in-place and thus more space efficient. However, Samplesort can be implemented in-place as well.[10]
The in-place algorithm is separated into four phases: sampling, classification, block permutation, and cleanup.
One obvious disadvantage of this algorithm is that it reads and writes every element twice, once in the classification phase and once in the block permutation phase. Nevertheless, the algorithm performs up to three times faster than other state-of-the-art in-place competitors and up to 1.5 times faster than other state-of-the-art sequential competitors. As sampling has already been discussed above, the three later phases are detailed in the following.
In a first step, the input array is split up into p stripes of blocks of equal size, one for each processor, and each processor classifies the elements of its stripe into k local buffers, one per bucket.
Firstly, a prefix sum operation is performed that calculates the boundaries of the buckets. However, since only full blocks are moved in this phase, the boundaries are rounded up to a multiple of the block size and a single overflow buffer is allocated. Before starting the block permutation, some empty blocks might have to be moved to the end of their bucket. Thereafter, a write pointer wi is set to the beginning of the subarray of each bucket bi, and a read pointer ri is set to the last non-empty block in the subarray of bucket bi.
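The boundary computation of this phase can be illustrated with a short sketch (the function name and the offset representation are assumptions; each boundary is the prefix sum of the bucket sizes rounded up to a block multiple):

```python
def block_aligned_boundaries(sizes, block):
    """Prefix-sum bucket sizes, rounding each boundary up to a block multiple."""
    bounds = [0]
    total = 0
    for s in sizes:
        total += s
        bounds.append(-(-total // block) * block)   # ceiling division, then rescale
    return bounds
```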
To limit work contention, each processor is assigned a different primary bucket bprim and two swap buffers that can each hold one block. In each step, if both swap buffers are empty, the processor decrements the read pointer rprim of its primary bucket and reads the block at rprim into one of its swap buffers. After determining the destination bucket bdest of the block by classifying its first element, it increases the write pointer wdest, reads the block at wdest into the other swap buffer, and writes the first block into its destination. If wdest > rdest, the position at wdest no longer holds an unread block, so the swap buffer can simply be written to it.
If all blocks in the subarray of the primary bucket of a processor are in the correct bucket, the next bucket is chosen as the primary bucket. If a processor chose all buckets as primary bucket once, the processor is finished.
Since only whole blocks were moved in the block permutation phase, some elements might still be incorrectly placed around the bucket boundaries. Since there has to be enough space in the array for each element, those incorrectly placed elements can be moved to empty spaces from left to right, considering the overflow buffer last.
Frazer and McKellar's samplesort and derivatives:
Adapted for use on parallel computers: