Class: | Parsing grammars that are PEG |
Data: | String |
Time: | O(n) O(n2) |
Best-Time: |
|
Average-Time: | O(n) |
Space: | O(n) |
The Packrat parser is a type of parser that shares similarities with the recursive descent parser in its construction. However, it differs because it takes parsing expression grammars (PEGs) as input rather than LL grammars.[1]
In 1970, Alexander Birman laid the groundwork for packrat parsing by introducing the "TMG recognition scheme" (TS), and "generalized TS" (gTS). TS was based upon Robert M. McClure's TMG compiler-compiler, and gTS was based upon Dewey Val Schorre's META compiler-compiler.Birman's work was later refined by Aho and Ullman; and renamed as Top-Down Parsing Language (TDPL), and Generalized TDPL (GTDPL), respectively. These algorithms were the first of their kind to employ deterministic top-down parsing with backtracking.[2] [3]
Bryan Ford developed PEGs as an expansion of GTDPL and TS. Unlike CFGs, PEGs are unambiguous and can match well with machine-oriented languages. PEGs, similar to GTDPL and TS, can also express all LL(k) and LR(k). Bryan also introduced Packrat as a parser that uses memoization techniques on top of a simple PEG parser. This was done because PEGs have an unlimited lookahead capability resulting in a parser with exponential time performance in the worst case.
Packrat keeps track of the intermediate results for all mutually recursive parsing functions. Each parsing function is only called once at a specific input position. In some instances of packrat implementation, if there is insufficient memory, certain parsing functions may need to be called multiple times at the same input position, causing the parser to take longer than linear time.[4]
The packrat parser takes in input the same syntax as PEGs: a simple PEG is composed of terminal and nonterminal symbols, possibly interleaved with operators that compose one or several derivation rules.
\{S,E,F,D\}
\{a,b,z,e,g\}
\{\alpha,\beta,\gamma,\omega,\tau\}
Sequence \alpha\beta | Success: If \alpha \beta Failure: If \alpha \beta Consumed: \alpha \beta | |
Ordered choice \alpha/\beta/\gamma | Success: If any of \{\alpha,\beta,\gamma\} Failure: All of \{\alpha,\beta,\gamma\} Consumed: The atomic expression that has generated a successso if multiple succeed the first one is always returned | |
And predicate \&\alpha | Success: If \alpha Failure: If \alpha Consumed: No input is consumed | |
Not predicate !\alpha | Success: If \alpha Failure: If \alpha Consumed: No input is consumed | |
One or more \alpha+ | Success: Try to recognize \alpha Failure: If \alpha Consumed: The maximum number that \alpha | |
Zero or more \alpha* | Success: Try to recognize \alpha Failure: Cannot fail Consumed: The maximum number that \alpha | |
Zero or one \alpha? | Success: Try to recognize \alpha Failure: Cannot fail Consumed: \alpha | |
Terminal range[<math>a-b</math>] | Success: Recognize any terminal c [a-b] [bf{'}hbf{'}-bf{'}zbf{'}] c Failure: If no terminal inside of [a-b] Consumed: c | |
Any character . | Success: Recognize any character in the inputFailure: If no character in the input Consumed: any character in the input |
A derivation rule is composed by a nonterminal symbol and an expression
S → \alpha
A special expression
\alphas
\alphas
An input string is considered accepted by the parser if the
\alphas
x
An extreme case of this rule is that the grammar
S → x*
This can be avoided by rewriting the grammar as
S → x*!.
\begin{cases} S → A/B/D\\ A → bf{'a'} S bf{'a'}\\ B → bf{'b'} S bf{'b'}\\ D → (bf{'0'}-bf{'9'})? \end{cases}
This grammar recognizes a palindrome over the alphabet
\{a,b\}
Example strings accepted by the grammar include:
bf{'aa'}
bf{'aba3aba'}
Left recursion happens when a grammar production refers to itself as its left-most element, either directly or indirectly. Since Packrat is a recursive descent parser, it cannot handle left recursion directly.[5] During the early stages of development, it was found that a production that is left-recursive can be transformed into a right-recursive production.[6] This modification significantly simplifies the task of a Packrat parser. Nonetheless, if there is an indirect left recursion involved, the process of rewriting can be quite complex and challenging. If the time complexity requirements are loosened from linear to superlinear, it is possible to modify the memoization table of a Packrat parser to permit left recursion, without altering the input grammar.
The iterative combinator
\alpha+
\alpha*
S → \alpha+ | S → \alphaS/\alpha | |
S → \alpha* | S → \alphaS/\epsilon |
Memoization is an optimization technique in computing that aims to speed up programs by storing the results of expensive function calls. This technique essentially works by caching the results so that when the same inputs occur again, the cached result is simply returned, thus avoiding the time-consuming process of re-computing.[7] When using packrat parsing and memoization, it's noteworthy that the parsing function for each nonterminal is solely based on the input string. It does not depend on any information gathered during the parsing process. Essentially, memoization table entries do not affect or rely on the parser's specific state at any given time.[8]
Packrat parsing stores results in a matrix or similar data structure that allows for quick look-ups and insertions. When a production is encountered, the matrix is checked to see if it has already occurred. If it has, the result is retrieved from the matrix. If not, the production is evaluated, the result is inserted into the matrix, and then returned.[9] When evaluating the entire
m*n
\Theta(mn)
m
n
In a naïve implementation, the entire table can be derived from the input string starting from the end of the string.
The Packrat parser can be improved to update only the necessary cells in the matrix through a depth-first visit of each subexpression tree. Consequently, using a matrix with dimensions of
m*n
Another operator called cut has been introduced to Packrat to reduce its average space complexity even further. This operator utilizes the formal structures of many programming languages to eliminate impossible derivations. For instance, control statements parsing in a standard programming language is mutually exclusive from the first recognized token, e.g.,
\{if,do,while,switch\}
Sketch of an implementation of a Packrat algorithm in a Lua-like pseudocode.
INPUT(n) -- return the character at position n
RULE(R : Rule, P : Position) entry = GET_MEMO(R,P) -- return the number of elements previously matched in rule R at position P if entry
EVAL(R : Rule, P : Position) start = P; for choice in R.choices -- Return a list of choice acc=0; for symbol in choice then -- Return each element of a rule, terminal and nonterminal if symbol.is_terminal then if INPUT(start+acc)
fail then break; end acc = acc + res; end if symbol
Given the following context, a free grammar that recognizes simple arithmetic expressions composed of single digits interleaved by sum, multiplication, and parenthesis.
\begin{cases} S → A\\ A → M bf{'+'} A / M\\ M → P bf{'*'} M / P\\ P → bf{'('} A bf{')'} / D\\ D → (bf{'0'}-bf{'9'}) \end{cases}
Denoted with
\dashv
|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
D(1) = 1; P(1) = 1; | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
D(4) = 1; P(4) = 1; | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Update M(4) = 1 as M was recognized | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
D(6) = 1; P(6) = 1; | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Update M(6) = 1 as M was recognized | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Update A(4) = 3 as A was recognized Update P(3)=5 as P was recognized | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Update M(1)=7 as M was recognized | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Update A(1)=7 as A was recognized Update S(1)=7 as S was recognized |
Name | Parsing algorithm | Grammar, code | Development platform | License | ||
---|---|---|---|---|---|---|
AustenX | Packrat (modified) | , BSD | ||||
Aurochs | Packrat | , GNU GPL | ||||
Canopy | Packrat | , GNU GPL | ||||
CL-peg | Packrat | , MIT | ||||
Drat! | Packrat | , GNU GPL | ||||
Frisby | Packrat | , BSD | ||||
Packrat | , BSD | |||||
IronMeta | Packrat | , BSD | ||||
PEGParser | Packrat (supporting left-recursion and grammar ambiguity) | Identical | , BSD | |||
Narwhal | Packrat | , BSD | ||||
neotoma | Packrat | , MIT | ||||
Packrat (modified, partial memoization) | , MIT | |||||
PackCC | Packrat (modified, left-recursion support) | , MIT | ||||
Packrat | Packrat | , MIT | ||||
Packrat | , BSD | |||||
Parsnip | Packrat | , GNU GPL | ||||
PEG.js | Packrat (partial memoization) | , MIT | ||||
Peggy[11] | Packrat (partial memoization) | , MIT | ||||
Pegasus | Recursive descent, Packrat (selectively) | , MIT | ||||
PetitParser | Packrat | , MIT | ||||
PyPy rlib | Packrat | , MIT | ||||
Rats! | Packrat | , GNU LGPL | ||||
go-packrat | Packrat | Go | Identical | All | GPLv3 |