Pumping lemma for regular languages explained

In the theory of formal languages, the pumping lemma for regular languages is a lemma that describes an essential property of all regular languages. Informally, it says that all sufficiently long strings in a regular language may be pumped—that is, have a middle section of the string repeated an arbitrary number of times—to produce a new string that is also part of the language. The pumping lemma is useful for proving that a specific language is not a regular language, by showing that that the language does not have the property.

Specifically, the pumping lemma says that for any regular language

, there exists a constant

such that any string

with length at least

can be split into three substrings

and

(

w=xyz

, with

being non-empty), such that the strings

xz,xyz,xyyz,xyyyz,...

are also in

. The process of repeating

zero or more times is known as "pumping". Moreover, the pumping lemma guarantees that the length of

will be at most

, thus giving a "small" substring

that has the desired property.

Languages with a finite number of strings vacuously satisfy the pumping lemma by having

equal to the maximum string length in

plus one. By doing so, zero strings in

have length greater than

The pumping lemma was first proven by Michael Rabin and Dana Scott in 1959,^[1] and rediscovered shortly after by Yehoshua Bar-Hillel, Micha A. Perles, and Eli Shamir in 1961, as a simplification of their pumping lemma for context-free languages.^[2]

Formal statement

Let

be a regular language. Then there exists an integer

p\geq1

depending only on

such that every string

of length at least

(

is called the "pumping length")^[3] can be written as

w=xyz

(i.e.,

can be divided into three substrings), satisfying the following conditions:

|y|\geq1

|xy|\leqp

(\foralln\geq0)(xy^nz\inL)

is the substring that can be pumped (removed or repeated any number of times, and the resulting string is always in

). (1) means the loop

to be pumped must be of length at least one, that is, not an empty string; (2) means the loop must occur within the first

characters.

|x|

must be smaller than

(conclusion of (1) and (2)), but apart from that, there is no restriction on

and

In simple words, for any regular language

, any sufficiently long string

(in

) can be split into 3 parts.i.e.

w=xyz

, such that all the strings

xy^nz

for

n\geq0

are also in

Below is a formal expression of the pumping lemma.

\begin{array}{l} \forallL\subseteq\Sigma^*,regular(L)\implies\\ \existsp\geq1,\forallw\inL,|w|\geqp\implies\\ \existsx,y,z\in\Sigma^*,(w=xyz)\land(|y|\geq1)\land(|xy|\leqp)\land(\foralln\geq0,xy^nz\inL) \end{array}

Use of the lemma to prove non-regularity

The pumping lemma is often used to prove that a particular language is non-regular: a proof by contradiction may consist of exhibiting a string (of the required length) in the language that lacks the property outlined in the pumping lemma.

Example: The language

L=\{aⁿbⁿ:n\geq0\}

over the alphabet
\Sigma=\{a,b\}

can be shown to be non-regular as follows:
Let

w,x,y,z

, and
n

be as used in the formal statement for the pumping lemma above.
Assume that some constant

p

exists as required by the lemma.
Let

w

in
L

be given by
w=a^pb^p

, which is a string longer than
p

.
By the pumping lemma, there must exist a decomposition

w=xyz

with
|xy|\leqp

and
|y|\geq1

such that
xy^iz

in
L

for every
i\geq0

.
Since

|xy|\leqp

, the string
y

only consists of instances of
a

.
Because

|y|\geq1

, it contains at least one instance of the letter
a

.
Pumping

y

to give
xy^2z

gives a word with more instances of the letter
a

than the letter
b

, since some instances of
a

but none of
b

were added.
Therefore,

xy^2z

is not in
L

which contradicts the pumping lemma.
Therefore,

L

cannot be regular.

The proof that the language of balanced (i.e., properly nested) parentheses is not regular follows the same idea. Given

, there is a string of balanced parentheses that begins with more than

left parentheses, so that

will consist entirely of left parentheses. By repeating

, a string can be produced that does not contain the same number of left and right parentheses, and so they cannot be balanced.

Proof of the pumping lemma

For every regular language there is a finite state automaton (FSA) that accepts the language. The number of states in such an FSA are counted and that count is used as the pumping length

. For a string of length at least

, let

q₀

be the start state and let

q_1,...,q_p

be the sequence of the next

states visited as the string is emitted. Because the FSA has only

states, within this sequence of

p+1

visited states there must be at least one state that is repeated. Write

q_s

for such a state. The transitions that take the machine from the first encounter of state

q_s

to the second encounter of state

q_s

match some string. This string is called

in the lemma, and since the machine will match a string without the

portion, or with the string

repeated any number of times, the conditions of the lemma are satisfied.

For example, the following image shows an FSA.

The FSA accepts the string: abcd. Since this string has a length at least as large as the number of states, which is four (so the total number of states that the machine passes through to scan abcd would be 5), the pigeonhole principle indicates that there must be at least one repeated state among the start state and the next four visited states. In this example, only

q₁

is a repeated state. Since the substring bc takes the machine through transitions that start at state

q₁

and end at state

q₁

, that portion could be repeated and the FSA would still accept, giving the string . Alternatively, the bc portion could be removed and the FSA would still accept giving the string ad. In terms of the pumping lemma, the string abcd is broken into an

portion a, a

portion bc and a

portion d.

As a side remark, the problem of checking whether a given string can be accepted by a given nondeterministic finite automaton without visiting any state repeatedly, is NP hard.

General version of pumping lemma for regular languages

If a language

is regular, then there exists a number

p\geq1

(the pumping length) such that every string

uwv

with

|w|\gep

can be written in the form

uwv=uxyzv

with strings

and

such that

|xy|\lep

|y|\ge1

and

uxy^izv

is in

for every integer

i\geq0

.^[4] From this, the above standard version follows a special case, with both

and

being the empty string.

Since the general version imposes stricter requirements on the language, it can be used to prove the non-regularity of many more languages.

Invalidity of the lemma converse

While the pumping lemma states that all regular languages satisfy the conditions described above, the converse of this statement is not true: a language that satisfies these conditions may still be non-regular. In other words, both the original and the general version of the pumping lemma give a necessary but not sufficient condition for a language to be regular.

For example, consider the following language:

\begin{matrix}L&=&\{uvwxy:u,y\in\{0,1,2,3\}^*;v,w,x\in\{0,1,2,3\}\land(v=w\lorv=x\lorx=w)\}\ &&\cup \{w:w\in\{0,1,2,3\}^*\landprecisely\tfrac17ofthecharactersinware3's\}\end{matrix}

.In other words,

contains all strings over the alphabet

\{0,1,2,3\}

with a substring of length 3 including a duplicate character, as well as all strings over this alphabet where precisely 1/7 of the string's characters are 3's. This language is not regular but can still be "pumped" with

p=5

. Suppose some string s has length at least 5. Then, since the alphabet has only four characters, at least two of the first five characters in the string must be duplicates. They are separated by at most three characters.

If the duplicate characters are separated by 0 characters, or 1, pump one of the other two characters in the string, which will not affect the substring containing the duplicates.
If the duplicate characters are separated by 2 or 3 characters, pump 2 of the characters separating them. Pumping either down or up results in the creation of a substring of size 3 that contains 2 duplicate characters.
The second condition of

ensures that

is not regular: Consider the string

(013)^3m(012)ⁱ

. This string is in

exactly when

i=4m

and thus

is not regular by the Myhill–Nerode theorem.

The Myhill–Nerode theorem provides a test that exactly characterizes regular languages. The typical method for proving that a language is regular is to construct either a finite state machine or a regular expression for the language.

Notes

Rabin . Michael . Michael O. Rabin. Scott . Dana . Dana Scott . Apr 1959 . Finite Automata and Their Decision Problems . IBM Journal of Research and Development . 3 . 2 . 114–125 . 10.1147/rd.32.0114 . unfit . https://web.archive.org/web/20101214122150/http://www.cse.chalmers.se/~coquand/AUTOMATA/rs.pdf . December 14, 2010 . Here: Lemma 8, p.119
Book: John E. Hopcroft . Rajeev Motwani . Jeffrey D. Ullman . Introduction to Automata Theory, Languages, and Computation . 2003 . Addison Wesley. Here: Sect.4.6, p.166
Book: Berstel . Jean . Lauve . Aaron . Reutenauer . Christophe . Saliola . Franco V. . Combinatorics on words. Christoffel words and repetitions in words . CRM Monograph Series . 27 . Providence, RI . . 2009 . 978-0-8218-4480-9 . 1161.68043 . 86.
Book: Savitch . Walter . 1982 . Abstract Machines and Grammars . 978-0-316-77161-0 . 49 . Little, Brown .

References

Book: Lawson, Mark V. . Finite automata . Chapman and Hall/CRC . 2004 . 978-1-58488-255-8 . 1086.68074 .
Book: Sipser, Michael . Michael Sipser

. Michael Sipser . 1997 . Introduction to the Theory of Computation . PWS Publishing . 978-0-534-94728-6 . 1169.68300 . 1.4: Nonregular Languages . 77–83 . Introduction to the Theory of Computation .

Book: John E. . Hopcroft . John Hopcroft . Jeffrey D. . Ullman . Jeffrey Ullman . Introduction to Automata Theory, Languages, and Computation . Addison-Wesley Publishing . Reading, Massachusetts . 1979 . 978-0-201-02988-8 . 0426.68001 . Introduction to Automata Theory, Languages, and Computation . (See chapter 3.)
Book: Bakhadyr Khoussainov. Anil Nerode. Anil Nerode. Automata Theory and its Applications. 6 December 2012. Springer Science & Business Media. 978-1-4612-0171-7.

Pumping lemma for regular languages explained

Formal statement

Use of the lemma to prove non-regularity

Proof of the pumping lemma

General version of pumping lemma for regular languages

Invalidity of the lemma converse

See also

Notes

References