Chomsky–Schützenberger enumeration theorem explained

In formal language theory, the Chomsky - Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

Statement

In order to state the theorem, a few notions from algebra and formal language theory are needed.

Let

N

denote the set of nonnegative integers. A power series over

N

is an infinite series of the form

f=f(x)=

infty
\sum
k=0

akxk=a0+a1x1+a2x2+a3x3+

with coefficients

ak

in

N

. The multiplication of two formal power series

f

and

g

is defined in the expected way as the convolution of the sequences

an

and

bn

:

f(x)g(x)=

infty
\sum
k=0
k
\left(\sum
i=0

aibk-i\right)xk.

In particular, we write

f2=f(x)f(x)

,

f3=f(x)f(x)f(x)

, and so on. In analogy to algebraic numbers, a power series

f(x)

is called algebraic over

Q(x)

, if there exists a finite set of polynomials

p0(x),p1(x),p2(x),\ldots,pn(x)

each with rational coefficients such that

p0(x)+p1(x)f+p2(x)f2++pn(x)fn=0.

A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree or, equivalently, only one leftmost derivation.Having established the necessary notions, the theorem is stated as follows.

Chomsky - Schützenberger theorem. If

L

is a context-free language admitting an unambiguous context-free grammar, and

ak:=|L\cap\Sigmak|

is the number of words of length

k

in

L

, then
infty
G(x)=\sum
k=0

akxk

is a power series over

N

that is algebraic over

Q(x)

.

Proofs of this theorem are given by, and by .

Usage

Asymptotic estimates

The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar, as n grows large. The following example is given by : the unambiguous context-free grammar G over the alphabet has start symbol S and the following rules

SM | U

M → 0M1M | ε

U → 0S | 0M1U.

To obtain an algebraic representation of the power series associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by x, each occurrence of ε by the integer '1', each occurrence of '→' by '=', and each occurrence of '|' by '+', respectively. The operation of concatenation at the right-hand-side of each rule corresponds to the multiplication operation in the equations thus obtained. This yields the following system of equations:

S = M + U

M = M²x² + 1

U = Sx + MUx²

In this system of equations, S, M, and U are functions of x, so one could also write,, and . The equation system can be resolved after S, resulting in a single algebraic equation:

.

This quadratic equation has two solutions for S, one of which is the algebraic power series . By applying methods from complex analysis to this equation, the number

an

of words of length n generated by G can be estimated, as n grows large. In this case, one obtains

an\inO(2+\epsilon)n

but

an\notinO(2-\epsilon)n

for each

\epsilon>0

.[1]

The following example is from :\left\

Notes and References

  1. See for a detailed exposition.