In the formal language theory of computer science, left recursion is a special case of recursion where a string is recognized as part of a language by the fact that it decomposes into a string from that same language (on the left) and a suffix (on the right). For instance, 1+2+3 can be recognized as a sum because it can be broken into 1+2, also a sum, and +3, a suitable suffix.
In terms of context-free grammar, a nonterminal is left-recursive if the leftmost symbol in one of its productions is itself (in the case of direct left recursion) or can be made itself by some sequence of substitutions (in the case of indirect left recursion).
A grammar is left-recursive if and only if there exists a nonterminal symbol A that can derive a sentential form with itself as the leftmost symbol. Symbolically,

A ⇒⁺ Aα,

where ⇒⁺ indicates the operation of making one or more substitutions, and α is any sequence of terminal and nonterminal symbols.
Direct left recursion occurs when the definition can be satisfied with only one substitution. It requires a rule of the form

A → Aα

where α is a sequence of nonterminals and terminals. For example, the rule

Expression → Expression + Term

is directly left-recursive.
Indirect left recursion occurs when the definition of left recursion is satisfied via several substitutions. It entails a set of rules following the pattern

A₀ → β₀ A₁ α₀
A₁ → β₁ A₂ α₁
…
Aₙ → βₙ A₀ αₙ

where β₀, β₁, …, βₙ are sequences that can each yield the empty string, while α₀, α₁, …, αₙ may be any sequences of terminal and nonterminal symbols at all. The derivation

A₀ ⇒ β₀ A₁ α₀ ⇒⁺ A₁ α₀ ⇒ β₁ A₂ α₁ α₀ ⇒⁺ … ⇒⁺ A₀ αₙ … α₁ α₀

then gives A₀ as leftmost in its final sentential form.
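To make the definition concrete, left recursion can be detected mechanically. The sketch below is illustrative rather than canonical: the grammar encoding is an assumption of this example, and it assumes for simplicity that no nonterminal derives the empty string, in which case a nonterminal is left-recursive exactly when it lies on a cycle of the "can appear leftmost" relation.

```python
# Detect left-recursive nonterminals in a context-free grammar.
# Assumption: no nonterminal derives the empty string, so left recursion
# reduces to a cycle in the "leftmost symbol" relation.
# Grammar encoding (an assumption of this sketch):
# {nonterminal: [list of productions, each a list of symbols]}.

def left_recursive(grammar):
    # Edge A -> B whenever some production of A starts with nonterminal B.
    edges = {a: {p[0] for p in prods if p and p[0] in grammar}
             for a, prods in grammar.items()}

    def reaches(start, target, seen):
        # Depth-first search along leftmost edges.
        for nxt in edges[start]:
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                if reaches(nxt, target, seen):
                    return True
        return False

    return {a for a in grammar if reaches(a, a, set())}

grammar = {
    "Expr": [["Expr", "+", "Term"], ["Term"]],  # direct left recursion
    "A": [["B", "x"]],                          # indirect: A -> B -> A
    "B": [["A", "y"], ["z"]],
    "Term": [["int"]],                          # not left-recursive
}
print(sorted(left_recursive(grammar)))  # prints ['A', 'B', 'Expr']
```

Both the direct case (Expr) and the indirect cycle (A, B) are reported; Term is not.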
Left recursion is commonly used as an idiom for making operations left-associative: that an expression a+b-c-d+e is evaluated as (((a+b)-c)-d)+e. In this case, that evaluation order could be achieved as a matter of syntax via the three grammatical rules

Expression → Term
Expression → Expression + Term
Expression → Expression - Term

These only allow parsing the Expression a+b-c-d+e as consisting of the Expression a+b-c-d and the Term e, where a+b-c-d in turn consists of the Expression a+b-c and the Term d, while a+b-c consists of the Expression a+b and the Term c, etc.

Left recursion often poses problems for parsers, either because it leads them into infinite recursion (as in the case of most top-down parsers) or because they expect rules in a normal form that forbids it (as in the case of many bottom-up parsers). Therefore, a grammar is often preprocessed to eliminate the left recursion.
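The infinite-recursion failure mode is easy to see in code. Below is a minimal sketch of a naive recursive-descent parser for the directly left-recursive rule Expression → Expression + Term (the function name and token encoding are illustrative): to parse an Expression it first tries to parse an Expression at the same input position, without consuming anything, so it never terminates.

```python
# Why naive top-down parsing loops on left recursion: the first thing
# parse_expression does is call itself at the same position.

def parse_expression(tokens, pos):
    # Rule Expression -> Expression "+" Term: recurse immediately,
    # with no input consumed, so the recursion never bottoms out.
    left = parse_expression(tokens, pos)
    ...  # the "+" and the Term are never reached

try:
    parse_expression(["1", "+", "2"], 0)
except RecursionError:
    print("naive descent on a left-recursive rule never terminates")
```

In practice the Python interpreter's recursion limit converts the non-termination into a RecursionError; a parser without such a guard would simply overflow its stack.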
The general algorithm to remove direct left recursion follows. Several improvements to this method have been made.[2] For a left-recursive nonterminal A, discard the rules of the form A → A and consider those that remain:

A → Aα₁ | … | Aαₙ | β₁ | … | βₘ

where each α is a nonempty sequence of nonterminals and terminals, and each β is a sequence of nonterminals and terminals that does not start with A. Replace these with two sets of productions, one set for A:

A → β₁A′ | … | βₘA′

and another set for the fresh nonterminal A′ (often called the "tail" or the "rest"):

A′ → α₁A′ | … | αₙA′ | ε
As an example, consider the rule set
Expression → Expression + Expression | Integer | String

Applying the transformation yields

Expression → Integer Expression′ | String Expression′
Expression′ → + Expression Expression′ | ε
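This transformation can be sketched in a few lines of Python. The grammar encoding, with productions as lists of symbols and [] standing for ε, is an assumption of this sketch:

```python
# Direct-left-recursion removal: split the rules of A into
# left-recursive ones (A alpha) and the rest (beta), then build
#   A  -> beta A'          for each non-recursive rule, and
#   A' -> alpha A' | ε     for each recursive one.

def remove_direct_left_recursion(nonterminal, productions):
    alphas = [p[1:] for p in productions if p and p[0] == nonterminal]
    betas = [p for p in productions if not p or p[0] != nonterminal]
    if not alphas:
        return {nonterminal: productions}   # no direct left recursion
    tail = nonterminal + "'"                # the fresh "tail" nonterminal
    return {
        nonterminal: [b + [tail] for b in betas],
        tail: [a + [tail] for a in alphas] + [[]],   # [] is ε
    }

rules = remove_direct_left_recursion(
    "Expression",
    [["Expression", "+", "Expression"], ["Integer"], ["String"]],
)
# Expression  -> Integer Expression' | String Expression'
# Expression' -> + Expression Expression' | ε
```

Running it on the example rule set reproduces the transformed grammar above.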
The above process can be extended to eliminate all left recursion, by first converting indirect left recursion to direct left recursion on the highest numbered nonterminal in a cycle.
Inputs: a grammar, given as a set of nonterminals A₁, …, Aₙ and their productions
Output: a modified grammar generating the same language but without left recursion

For each nonterminal Aᵢ:
  Repeat until an iteration leaves the grammar unchanged:
    For each rule Aᵢ → αᵢ, with αᵢ a sequence of terminals and nonterminals:
      If αᵢ begins with a nonterminal Aⱼ and j < i:
        Let βᵢ be αᵢ without its leading Aⱼ.
        Remove the rule Aᵢ → αᵢ.
        For each rule Aⱼ → αⱼ, add the rule Aᵢ → αⱼβᵢ.
  Remove direct left recursion for Aᵢ as described above.

The inner step amounts to expanding the leading nonterminal Aⱼ in each rule Aᵢ → Aⱼβ with j < i. Because the rules for Aⱼ have already been made free of left recursion by the time Aᵢ is processed, this expansion cannot reintroduce a rule whose leading nonterminal has a smaller index.
The algorithm may be viewed as establishing a topological ordering on nonterminals: afterwards there can only be a rule Aᵢ → Aⱼβ if j > i.
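The full procedure can be sketched as follows. The grammar encoding (productions as lists of symbols, [] for ε) and the helper for direct removal are assumptions of this sketch, not a canonical implementation:

```python
# Full left-recursion elimination: for each nonterminal A_i in a fixed
# order, expand any leading A_j with j < i, then remove direct left
# recursion for A_i.

def _remove_direct(a, prods):
    # Direct-left-recursion removal, as in the earlier transformation.
    alphas = [p[1:] for p in prods if p and p[0] == a]
    betas = [p for p in prods if not p or p[0] != a]
    if not alphas:
        return {a: prods}
    tail = a + "'"
    return {a: [b + [tail] for b in betas],
            tail: [x + [tail] for x in alphas] + [[]]}   # [] is ε

def remove_left_recursion(grammar, order):
    g = {a: [list(p) for p in ps] for a, ps in grammar.items()}
    for i, ai in enumerate(order):
        changed = True
        while changed:                       # repeat until unchanged
            changed = False
            new_prods = []
            for p in g[ai]:
                if p and p[0] in order[:i]:  # leading A_j with j < i
                    aj, beta = p[0], p[1:]
                    new_prods += [q + beta for q in g[aj]]  # expand A_j
                    changed = True
                else:
                    new_prods.append(p)
            g[ai] = new_prods
        g.update(_remove_direct(ai, g[ai]))  # then clear direct recursion
    return g

# Indirectly left-recursive pair: A -> B x, B -> A z.
g = remove_left_recursion(
    {"A": [["B", "x"], ["y"]], "B": [["A", "z"], ["w"]]},
    order=["A", "B"],
)
```

On this example, B's rule B → A z is first expanded to B → B x z | y z, whose direct left recursion is then removed via a fresh tail nonterminal B′.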
Although the above transformations preserve the language generated by a grammar, they may change the parse trees that witness strings' recognition. With suitable bookkeeping, tree rewriting can recover the originals, but if this step is omitted, the differences may change the semantics of a parse.
Associativity is particularly vulnerable; left-associative operators typically appear in right-associative-like arrangements under the new grammar. For example, starting with this grammar:
Expression → Expression - Term | Term
Term → Term * Factor | Factor
Factor → (Expression) | Integer

left recursion removal turns it into:

Expression → Term Expression′
Expression′ → - Term Expression′ | ε
Term → Factor Term′
Term′ → * Factor Term′ | ε
Factor → (Expression) | Integer
Parsing the string "1 - 2 - 3" with the first grammar in an LALR parser (which can handle left-recursive grammars) would have resulted in a parse tree that groups the terms on the left, giving the correct semantics (1 - 2) - 3.

Parsing with the second grammar instead gives a tree in which terms to the right appear deeper, much as a right-recursive grammar would arrange them for 1 - (2 - 3). Properly interpreted, this tree signifies 1 + (-2 + (-3)), also correct, but less faithful to the input and much harder to implement for some operators.
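When evaluation rather than tree building is the goal, the lost associativity can be recovered during parsing by threading an accumulator through the tail rule: each "- Term" is folded onto the value built so far. A minimal sketch under the transformed grammar (function names are hypothetical, and Term is reduced to a bare integer for brevity):

```python
# Evaluate "1 - 2 - 3" left-associatively using the right-recursive
# grammar Expression -> Term Expression'; Expression' -> - Term Expression' | ε.
# The trick: Expression' receives the value parsed so far as an accumulator.

def parse_expression(tokens):
    value, rest = parse_term(tokens)
    return parse_expression_tail(value, rest)

def parse_expression_tail(acc, tokens):
    # Expression' -> - Term Expression' | ε
    if tokens and tokens[0] == "-":
        term, rest = parse_term(tokens[1:])
        return parse_expression_tail(acc - term, rest)  # fold to the left
    return acc, tokens                                  # ε case

def parse_term(tokens):
    return int(tokens[0]), tokens[1:]   # Term -> Integer, for brevity

value, _ = parse_expression(["1", "-", "2", "-", "3"])
print(value)  # (1 - 2) - 3 = -4
```

The recursion structure still mirrors the right-recursive grammar, but the accumulator makes the computed value left-associative.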
A formal grammar that contains left recursion cannot be parsed by an LL(k) parser or other naive recursive descent parser unless it is converted to a weakly equivalent right-recursive form. In contrast, left recursion is preferred for LALR parsers because it results in lower stack usage than right recursion. However, more sophisticated top-down parsers can implement general context-free grammars by use of curtailment. In 2006, Frost and Hafiz described an algorithm which accommodates ambiguous grammars with direct left-recursive production rules.[3] That algorithm was extended to a complete parsing algorithm that accommodates indirect as well as direct left recursion in polynomial time, and generates compact polynomial-size representations of the potentially exponential number of parse trees for highly ambiguous grammars, by Frost, Hafiz and Callaghan in 2007.[4] The authors then implemented the algorithm as a set of parser combinators written in the Haskell programming language.[5]