Range mode query explained

In data structures, the range mode query problem asks to build a data structure on some input data to efficiently answer queries asking for the mode of any consecutive subset of the input.

Problem statement

Given an array A[1:n] = [a_1, a_2, ..., a_n], we wish to answer queries of the form mode(A, i:j), where 1 \leq i \leq j \leq n. The mode mode(S) of any array S = [s_1, s_2, ..., s_k] is an element s_i such that the frequency of s_i is greater than or equal to the frequency of s_j for all j \in \{1, ..., k\}. For example, if S = [1,2,4,2,3,4,2], then mode(S) = 2 because it occurs three times, while all other values occur fewer times. In this problem, the queries ask for the mode of subarrays of the form A[i:j] = [a_i, a_{i+1}, ..., a_j].
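As a baseline for the structures discussed later, the query can be answered naively by counting every element of the subarray, which takes time linear in the length of the range. The sketch below is only an illustration; the function name `naive_range_mode` is an assumption made here, and it follows the 1-indexed, inclusive convention of the problem statement.

```python
from collections import Counter

def naive_range_mode(A, i, j):
    """Return (mode, frequency) of A[i:j], 1-indexed and inclusive."""
    if not 1 <= i <= j <= len(A):
        raise ValueError("need 1 <= i <= j <= n")
    counts = Counter(A[i - 1:j])
    # most_common(1) returns one (value, count) pair with the highest count
    return counts.most_common(1)[0]

S = [1, 2, 4, 2, 3, 4, 2]
print(naive_range_mode(S, 1, 7))  # (2, 3): the value 2 occurs three times
print(naive_range_mode(S, 3, 6))  # (4, 2): the subarray is [4, 2, 3, 4]
```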

Theorem 1

Let A and B be any multisets. If c is a mode of A \cup B and c \notin A, then c is a mode of B.

Proof

Let c \notin A be a mode of C = A \cup B and let f_c be its frequency in C. Since c \notin A, every occurrence of c in C lies in B, so the frequency of c in B is also f_c. Suppose that c is not a mode of B. Then there exists an element b with frequency f_b in B that is a mode of B, and f_b > f_c. The frequency of b in C is at least f_b > f_c, so c cannot be a mode of C, which is a contradiction.

Results

Space	Query time	Restrictions	Source
O(n)	O(\sqrt{n})		^[1]
O(n)	O(\sqrt{n/w})	w is the word size	
O(n^2 \log\log n / \log n)	O(1)		
O(n^{2-2\epsilon} / \log n)	O(n^\epsilon)	0 \leq \epsilon \leq 1/2	
O(n^{2-2\epsilon})	O(n^\epsilon \log n)	0 \leq \epsilon \leq 1/2	^[2]

Lower bound

Any data structure using S cells of w bits each needs \Omega\left(\frac{\log n}{\log(Sw/n)}\right) time to answer a range mode query.^[3]
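To see what the bound means in a common setting, it can be instantiated for a linear-space structure. The choice S = O(n) and w = \Theta(\log n) below is an assumption made for illustration (the usual word-RAM setting), not something fixed by the statement above.

```latex
S = O(n),\quad w = \Theta(\log n)
\;\Longrightarrow\;
\log\frac{Sw}{n} = \log\bigl(O(\log n)\bigr) = O(\log\log n)
\;\Longrightarrow\;
\Omega\!\left(\frac{\log n}{\log(Sw/n)}\right)
  = \Omega\!\left(\frac{\log n}{\log\log n}\right).
```

Under these assumptions, no linear-space structure can answer range mode queries in constant time.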

This contrasts with other range query problems, such as the range minimum query, which have solutions offering constant query time and linear space. This is due to the hardness of the mode problem: even if we know the mode of A[i:j] and the mode of A[j+1:k], there is no simple way of computing the mode of A[i:k]. Any element of A[i:j] or A[j+1:k] could be the mode. For example, if mode(A[i:j]) = a and its frequency is f_a, and mode(A[j+1:k]) = b and its frequency is also f_a, there could be an element c, with a \neq c \neq b, that has frequency f_a - 1 in A[i:j] and frequency f_a - 1 in A[j+1:k]. Its frequency in A[i:k] is then 2f_a - 2, which can be greater than the frequencies of a and b (for instance when f_a \geq 3 and a and b each occur in only one of the two halves). This makes c a better candidate for mode(A[i:k]) than a or b.
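A small concrete instance of this situation, with f_a = 3, can be checked directly. The lists `left` and `right` are made-up data standing in for A[i:j] and A[j+1:k]; neither half has c as its mode, yet c is the mode of the concatenation.

```python
from collections import Counter

def mode_and_freq(values):
    """Return one (value, count) pair with the highest count."""
    return Counter(values).most_common(1)[0]

left = ["a", "a", "a", "c", "c"]    # stands for A[i:j]: mode a, frequency 3
right = ["b", "b", "b", "c", "c"]   # stands for A[j+1:k]: mode b, frequency 3

print(mode_and_freq(left))          # ('a', 3)
print(mode_and_freq(right))         # ('b', 3)
print(mode_and_freq(left + right))  # ('c', 4): c beats both partial modes
```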

Linear space data structure with square root query time

This method by Chan et al. uses O(n + s^2) space and O(n/s) query time. By setting s = \sqrt{n}, we get O(n) and O(\sqrt{n}) bounds for space and query time, respectively.

Preprocessing

Let A[1:n] be an array, and D[1:\Delta] be an array that contains the distinct values of A, where \Delta is the number of distinct elements. We define B[1:n] to be an array such that, for each i, B[i] contains the rank (position) of A[i] in D. Arrays B, D can be created by a linear scan of A.

Arrays Q_1, Q_2, ..., Q_\Delta are also created, such that, for each a \in \{1, ..., \Delta\}, Q_a = \{b | B[b] = a\}. We then create an array B'[1:n], such that, for all b \in \{1, ..., n\}, B'[b] contains the rank of b in Q_{B[b]}. Again, a linear scan of B suffices to create arrays Q_1, Q_2, ..., Q_\Delta and B'.

It is now possible to answer queries of the form "is the frequency of B[i] in B[i:j] at least q" in constant time, by checking whether Q_{B[i]}[B'[i]+q-1] \leq j.
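The arrays described above and the constant-time frequency test can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it uses 0-indexed positions, a Python dictionary (hashing) to assign ranks in a single pass, and the names `preprocess`, `at_least_q` and `Bp` (for B') are chosen here.

```python
def preprocess(A):
    """Build D, B, Q and B' (called Bp) for a 0-indexed list A."""
    rank = {}            # value -> rank in D (assumption: hashable values)
    D, B = [], []
    for value in A:
        if value not in rank:
            rank[value] = len(D)
            D.append(value)
        B.append(rank[value])
    Q = [[] for _ in D]  # Q[a]: positions b with B[b] == a, in increasing order
    Bp = []
    for b, a in enumerate(B):
        Bp.append(len(Q[a]))  # rank of position b inside Q[a]
        Q[a].append(b)
    return D, B, Q, Bp

def at_least_q(B, Q, Bp, i, j, q):
    """True if B[i] occurs at least q times in B[i..j] (inclusive, 0-indexed)."""
    occurrences = Q[B[i]]
    k = Bp[i] + q - 1    # index in Q of the q-th occurrence counted from i
    return k < len(occurrences) and occurrences[k] <= j

D, B, Q, Bp = preprocess([1, 2, 4, 2, 3, 4, 2])
print(B)                              # [0, 1, 2, 1, 3, 2, 1]
print(at_least_q(B, Q, Bp, 1, 6, 3))  # True: the value 2 occurs 3 times in positions 1..6
print(at_least_q(B, Q, Bp, 1, 5, 3))  # False: only 2 occurrences in positions 1..5
```

The bounds check on `k` replaces the implicit convention that an out-of-range entry of Q counts as larger than j.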

The array B is split into s blocks b_0, b_1, ..., b_{s-1}, each of size t = \lceil n/s \rceil. Thus, a block b_i spans over B[i \cdot t+1 : (i+1)t]. The mode and the frequency of each block or set of consecutive blocks will be pre-computed in two tables S and S'. S[b_i, b_j] is the mode of b_i \cup b_{i+1} \cup ... \cup b_j, or equivalently, the mode of B[b_i t+1 : (b_j+1)t], and S' stores the corresponding frequency. These two tables can be stored in O(s^2) space, and can be populated in O(s \cdot n) time by scanning B s times, computing a row of S, S' each time with the following algorithm (arrays are 0-indexed here, and atLeastQInstances is the constant-time frequency test described above):

    algorithm computeS_Sprime is
        input: Array B = [0:n - 1],
               Array D = [0:Delta - 1],
               Integer s
        output: Tables S and Sprime
        let t ← ⌈n / s⌉
        let S ← Table(0:s - 1, 0:s - 1)
        let Sprime ← Table(0:s - 1, 0:s - 1)
        let firstOccurence ← Array(0:Delta - 1)
        for all i in {0, ..., Delta - 1} do
            firstOccurence[i] ← -1
        end for
        for i ← 0:s - 1 do
            let j ← i × t
            let c ← 0
            let fc ← 0
            let noBlock ← i
            let block_start ← j
            let block_end ← min{(i + 1) × t - 1, n - 1}
            while j < n do
                if firstOccurence[B[j]] = -1 then
                    firstOccurence[B[j]] ← j
                end if
                if atLeastQInstances(firstOccurence[B[j]], block_end, fc + 1) then
                    c ← B[j]
                    fc ← fc + 1
                end if
                if j = block_end then
                    S[i, noBlock] ← c
                    Sprime[i, noBlock] ← fc
                    noBlock ← noBlock + 1
                    block_end ← min{block_end + t, n - 1}
                end if
                j ← j + 1
            end while
            for all j in {0, ..., Delta - 1} do
                firstOccurence[j] ← -1
            end for
        end for
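The table construction can also be sketched in Python. Note that this sketch swaps in a plain counting array for the firstOccurence test used in the algorithm above; it still scans B once per starting block, so the O(s \cdot n) time bound is kept (assuming \Delta \leq n). The function name `build_block_tables` and the parameter `num_distinct` are assumptions made for illustration, blocks and positions are 0-indexed, and n \geq 1 is assumed.

```python
import math

def build_block_tables(B, num_distinct, s):
    """Return (t, S, Sp): S[bi][bj] is a mode of blocks bi..bj of B, Sp its frequency.

    Block k covers B[k*t : (k+1)*t]. Entries with bj < bi stay None / 0.
    """
    n = len(B)
    t = math.ceil(n / s)                 # block size
    num_blocks = math.ceil(n / t)        # may be smaller than s
    S = [[None] * num_blocks for _ in range(num_blocks)]
    Sp = [[0] * num_blocks for _ in range(num_blocks)]
    for bi in range(num_blocks):
        count = [0] * num_distinct       # fresh counters for this row
        c, fc = None, 0                  # current candidate mode and frequency
        for j in range(bi * t, n):
            count[B[j]] += 1
            if count[B[j]] > fc:
                c, fc = B[j], count[B[j]]
            if (j + 1) % t == 0 or j == n - 1:   # position j closes a block
                bj = j // t
                S[bi][bj], Sp[bi][bj] = c, fc
    return t, S, Sp

B = [0, 1, 2, 1, 3, 2, 1]                # ranks of [1, 2, 4, 2, 3, 4, 2]
t, S, Sp = build_block_tables(B, num_distinct=4, s=3)
print(t, S[0][2], Sp[0][2])              # 3 1 3: rank 1 occurs 3 times in blocks 0..2
```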

Query

We will define the query algorithm over array B. This can be translated to an answer over A, since for any a, i, j, B[a] is a mode for B[i:j] if and only if A[a] is a mode for A[i:j]. We can convert an answer for B to an answer for A in constant time by looking in D at the corresponding index.

Given a query mode(B, i, j), the query is split in three parts: the prefix, the span and the suffix. Let b_i = \lceil (i-1)/t \rceil and b_j = \lfloor j/t \rfloor - 1. These denote the indices of the first and last block that are completely contained in B[i:j]. The range of these blocks is called the span. The prefix is then B[i : \min\{b_i t, j\}] (the set of indices before the span), and the suffix is B[\max\{(b_j+1)t+1, i\} : j] (the set of indices after the span). The prefix, suffix or span can be empty; the span is empty if b_j < b_i.

For the span, the mode c is already stored in S[b_i, b_j]. Let f_c be the frequency of the mode, which is stored in S'[b_i, b_j]. If the span is empty, let f_c = 0. Recall that, by Theorem 1, the mode of B[i:j] is either an element of the prefix, an element of the suffix, or the mode of the span. A linear scan is performed over each element in the prefix and in the suffix to check if its frequency is greater than that of the current candidate c, in which case c and f_c are updated to the new value. At the end of the scan, c contains the mode of B[i:j] and f_c its frequency.

Scanning procedure

The procedure is similar for both the prefix and the suffix (for the suffix the same steps are applied in mirror image), so it is described once.

Let x be the index of the current element. There are three cases:

1. If Q_{B[x]}[B'[x]-1] \geq i, then B[x] was present in B[i:x-1] and its frequency has already been counted. Pass to the next element.
2. Otherwise, check if the frequency of B[x] in B[i:j] is at least f_c (this can be done in constant time since it is the equivalent of checking it for B[x:j]). If it is not, then pass to the next element.
3. If it is, then compute the actual frequency f_x of B[x] in B[i:j] by a linear scan (starting at index B'[x]+f_c-1) or a binary search in Q_{B[x]}. Set c := B[x] and f_c := f_x.

This linear scan (excluding the frequency computations) is bounded by the block size t, since neither the prefix nor the suffix can be larger than t. A further analysis of the linear scans done for frequency computations shows that it is also bounded by the block size. Thus, the query time is O(t) = O(n/s).
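Putting preprocessing and query together, the whole structure might be sketched as below. This is a sketch under several assumptions rather than the authors' implementation: positions and blocks are 0-indexed, the block tables are filled with a counting array, the candidate is replaced only when an element is strictly more frequent (a test for at least f_c + 1 occurrences), the suffix is scanned in mirror image (last occurrences, counting leftwards), and an empty span is handled by scanning the whole range as a prefix. The class name `RangeMode` and its attribute names are chosen here.

```python
import math

class RangeMode:
    """Sketch of the O(n + s^2)-space, O(n/s)-query structure (0-indexed, inclusive)."""

    def __init__(self, A, s=None):
        n = len(A)                      # assumes n >= 1
        s = s or max(1, math.isqrt(n))  # s ~ sqrt(n) gives linear space
        self.t = t = math.ceil(n / s)   # block size
        rank, self.D, self.B = {}, [], []
        for v in A:                     # D: distinct values, B: ranks in D
            if v not in rank:
                rank[v] = len(self.D)
                self.D.append(v)
            self.B.append(rank[v])
        self.Q = [[] for _ in self.D]   # Q[a]: sorted positions holding rank a
        self.Bp = []                    # Bp[b]: rank of position b in Q[B[b]]
        for pos, a in enumerate(self.B):
            self.Bp.append(len(self.Q[a]))
            self.Q[a].append(pos)
        nb = math.ceil(n / t)
        self.S = [[None] * nb for _ in range(nb)]   # mode of blocks bi..bj
        self.Sp = [[0] * nb for _ in range(nb)]     # its frequency
        for bi in range(nb):
            count, c, fc = [0] * len(self.D), None, 0
            for j in range(bi * t, n):
                count[self.B[j]] += 1
                if count[self.B[j]] > fc:
                    c, fc = self.B[j], count[self.B[j]]
                if (j + 1) % t == 0 or j == n - 1:  # position j closes a block
                    self.S[bi][j // t], self.Sp[bi][j // t] = c, fc

    def query(self, i, j):
        """Return (a mode of A[i..j], its frequency)."""
        t, B, Q, Bp = self.t, self.B, self.Q, self.Bp
        bi = -(-i // t)          # first block that starts at or after i
        bj = (j + 1) // t - 1    # last full block that ends at or before j
        if bi <= bj:
            c, fc = self.S[bi][bj], self.Sp[bi][bj]
            prefix, suffix = range(i, bi * t), range((bj + 1) * t, j + 1)
        else:                    # empty span: treat the whole range as prefix
            c, fc = None, 0
            prefix, suffix = range(i, j + 1), range(0)
        for x in prefix:
            occ, k = Q[B[x]], Bp[x]
            if k > 0 and occ[k - 1] >= i:
                continue         # B[x] was already handled at an earlier index
            k += fc              # index of the (fc + 1)-th occurrence from x
            if k < len(occ) and occ[k] <= j:
                while k + 1 < len(occ) and occ[k + 1] <= j:
                    k += 1       # extend to the last occurrence inside [i, j]
                c, fc = B[x], k - Bp[x] + 1
        for x in suffix:         # mirrored: use last occurrences, count leftwards
            occ, k = Q[B[x]], Bp[x]
            if k + 1 < len(occ) and occ[k + 1] <= j:
                continue         # a later index in the suffix will handle B[x]
            k -= fc
            if k >= 0 and occ[k] >= i:
                while k > 0 and occ[k - 1] >= i:
                    k -= 1       # extend to the first occurrence inside [i, j]
                c, fc = B[x], Bp[x] - k + 1
        return self.D[c], fc

rm = RangeMode([1, 2, 4, 2, 3, 4, 2, 1, 1, 4, 4, 3], s=3)
print(rm.query(2, 10))   # (4, 4): the value 4 occurs four times in positions 2..10
```

When several values tie for the highest frequency, the sketch returns one of them; only the frequency is uniquely determined.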

Subquadratic space data structure with constant query time

This method uses O(n^2 \log\log n / \log n) space for a constant time query. We can observe that, if a constant query time is desired, this is a better solution than the one proposed by Chan et al., as the latter gives a space of O(n^2) for constant query time if s = n.

Preprocessing

Let A[1:n] be an array. The preprocessing is done in three steps:

1. Split the array A into s blocks b_1, b_2, ..., b_s, where the size of each block is t = \lceil n/s \rceil. Build a table S of size s \times s where S[i,j] is the mode of b_i \cup b_{i+1} \cup ... \cup b_j. The total space for this step is O(s^2).

2. For any query mode(A, i, j), let b_i' be the block that contains i and b_j' be the block that contains j. Let the span be the set of blocks completely contained in A[i:j]. The mode c of the span can be retrieved from S. By Theorem 1, the mode can be either an element of the prefix (indices of A[i:j] before the start of the span), an element of the suffix (indices of A[i:j] after the end of the span), or c. The size of the prefix plus the size of the suffix is bounded by 2t, thus the position of the mode is stored as an integer ranging from 0 to 2t, where [0:2t-1] indicates a position in the prefix/suffix and 2t indicates that the mode is the mode of the span. There are at most t^2 possible queries involving blocks b_i' and b_j', so these values are stored in a table of size t^2. Furthermore, there are (2t+1)^{t^2} such tables, so the total space required for this step is O(t^2 (2t+1)^{t^2}). To access those tables, a pointer is added in addition to the mode in the table S for each pair of blocks.

3. To handle queries mode(A, i, j) where i and j are in the same block, all such solutions are precomputed. There are O(st^2) of them, and they are stored in a three dimensional table T of this size.

The total space used by this data structure is O(s^2 + t^2 (2t+1)^{t^2} + st^2), which reduces to O(n^2 \log\log n / \log n) if we take t = \sqrt{\log n / \log\log n}.
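The reduction can be checked term by term. The derivation below is a sketch that assumes logarithms to base 2, t \geq 1, and s = \lceil n/t \rceil; it shows that the first term dominates.

```latex
\begin{align*}
t^2 &= \frac{\log n}{\log\log n}, \qquad s = \left\lceil \frac{n}{t} \right\rceil \\
s^2 &= O\!\left(\frac{n^2}{t^2}\right)
     = O\!\left(\frac{n^2 \log\log n}{\log n}\right) \\
\log\!\left((2t+1)^{t^2}\right) &= t^2 \log(2t+1)
     \le t^2\left(\log 3 + \tfrac{1}{2}\log\log n\right)
     = \left(\tfrac{1}{2} + o(1)\right)\log n \\
t^2\,(2t+1)^{t^2} &\le n^{1/2 + o(1)}, \qquad
s\,t^2 = O(n\,t) = O\!\left(n\sqrt{\log n}\right)
\end{align*}
```

Both the second and the third term are o(n^2 \log\log n / \log n), so the total is O(n^2 \log\log n / \log n).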

Query

Given a query mode(A, i, j), check if it is completely contained inside a block, in which case the answer is stored in table T. If the query covers exactly one or more whole blocks, then the answer is found in table S. Otherwise, use the pointer stored in table S at position S[b_i', b_j'], where b_i', b_j' are the indices of the blocks that contain respectively i and j, to find the table U_{b_i', b_j'} that contains the positions of the mode for these blocks, and use the position to find the mode in A. This can be done in constant time.

Notes and References

1. Chan, Timothy M.; Durocher, Stephane; Larsen, Kasper Green; Morrison, Jason; Wilkinson, Bryan T. (2013). "Linear-Space Data Structures for Range Mode Query in Arrays". Theory of Computing Systems. Springer. pp. 1–23.
2. Krizanc, Danny; Morin, Pat; Smid, Michiel H. M. (2003). "Range Mode and Range Median Queries on Lists and Trees". ISAAC. pp. 517–526. arXiv:cs/0307034. Bibcode:2003cs........7034K.
3. Greve, M.; Jørgensen, A.; Larsen, K.; Truelsen, J. (2010). "Cell probe lower bounds and approximations for range mode". Automata, Languages and Programming. pp. 605–616.