Bottom-up parser generation follows the same form as that for top-down generation:
The metalanguage for a bottom-up parser is not as restrictive as that for a top-down parser. Left-recursion is not a problem because the tree is built from the leaves up.
Strictly speaking, right-recursion, where the left-hand nonterminal is repeated at the end of the right-hand side, is not a problem. However, as we will see, the bottom-up parsing method described here pushes terminals and nonterminals on a stack until an appropriate right-hand side is found, and right-recursion can be somewhat inefficient in that many symbols must be pushed before a right-hand side is found.
Example 1 BNF for bottom-up parsing
Program --> Statements
Statements --> Statement Statements | Statement
Statement --> AsstStmt
AsstStmt --> Identifier := Expression
Expression --> Expression + Term | Expression - Term | Term
Term --> Term * Factor | Term / Factor | Factor
Factor --> ( Expression ) | Identifier | Literal
The BNF shown in Example 1 states that a program consists of a sequence of statements, each of which is an assignment statement with a right-hand side consisting of an arithmetic expression. This BNF is input to the parser generator to produce tables which are then accessed by the driver as the input is read.
Example 2 Input to parser generator
Program --> Statements
Statements --> Statement Statements | Statement
Statement --> AsstStmt ;
AsstStmt --> Identifier := Expression
Expression --> Expression + Term | Expression - Term | Term
Term --> Term * Factor | Term / Factor | Factor
Factor --> ( Expression ) | Identifier | Literal
Using a context-free grammar, one can describe the structure of a program; recognizing that structure is the function of the generated parser. The parser generator converts the BNF into tables. The form of the tables depends upon whether the generated parser is a top-down parser or a bottom-up parser. Top-down parsers are easy to generate; bottom-up parsers are more difficult to generate.
At compile-time, the driver reads input tokens, consults the table, and builds the parse from the bottom up.
We will describe the driver first, as usual. The method described here is a shift-reduce parsing method; that is, we parse by shifting input onto the stack until we have enough to recognize an appropriate right-hand side of a production. The sequence on the stack which is ready to be reduced is called the handle.
The handle is a right-hand side of a production, taking into account the rules of the grammar. For example, an expression would not use a + b as the handle if the string were a + b * c. We will see that our method finds the correct handle.
The shift-reduce method to be described here is called LR-parsing. There are a number of variants (hence the use of the term LR-family), but they all use the same driver. They differ only in the generated table. The L in LR indicates that the string is parsed from left to right; the R indicates that the reverse of a rightmost derivation is produced.
Given a grammar, we want to develop a deterministic bottom-up method for parsing legal strings described by the grammar. As in top-down parsing, we do this with a table and a driver which operates on the table.
The driver reads the input and consults the table. The table has four different kinds of entries called actions:
Shift:
Shift is indicated by the "S#" entries in the table where # is a new state. When we come to this entry in the table, we shift the current input symbol followed by the indicated new state onto the stack.
Reduce:
Reduce is indicated by "R#" where # is the number of a production. The top of the stack contains the right-hand side of a production, the handle. Reduce by the indicated production, consult the GOTO part of the table to see the next state, and push the left-hand side of the production onto the stack followed by the new state.
Accept:
Accept is indicated by the "Accept" entry in the table. When we come to this entry in the table, we accept the input string. Parsing is complete.
Error:
The blank entries in the table indicate a syntax error. No action is defined.
Using these actions, the driver algorithm is:
Algorithm
Initialize Stack to state 0
Append $ to end of input
While Action <> Accept And Action <> Error Do
    Let Stack = s0 x1 s1 ... xm sm and remaining Input = ai ai+1 ... $
    {s's are state numbers; x's are sequences of terminals and nonterminals}
    Case Table[sm, ai] is
        S#:     Action := Shift
        R#:     Action := Reduce
        Accept: Action := Accept
        Blank:  Action := Error
    EndCase
    Perform Action
EndWhile
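The driver above can be sketched concretely. The following Python sketch (our own encoding, not from the text) uses the standard SLR(1) ACTION and GOTO tables for the expression grammar of Example 3, with productions numbered 1 through 6 as in the worked parse; the function and variable names are ours.

```python
# Productions, numbered as in Example 3: each entry gives the
# left-hand side and the length of the right-hand side.
PRODS = {
    1: ("E", 3),  # E -> E + T
    2: ("E", 1),  # E -> T
    3: ("T", 3),  # T -> T * F
    4: ("T", 1),  # T -> F
    5: ("F", 3),  # F -> ( E )
    6: ("F", 1),  # F -> id
}

# ACTION part of the table: (state, terminal) -> Shift/Reduce/Accept.
# Missing keys are the blank (error) entries.
ACTION = {
    (0, "id"): "S5", (0, "("): "S4",
    (1, "+"): "S6", (1, "$"): "Accept",
    (2, "+"): "R2", (2, "*"): "S7", (2, ")"): "R2", (2, "$"): "R2",
    (3, "+"): "R4", (3, "*"): "R4", (3, ")"): "R4", (3, "$"): "R4",
    (4, "id"): "S5", (4, "("): "S4",
    (5, "+"): "R6", (5, "*"): "R6", (5, ")"): "R6", (5, "$"): "R6",
    (6, "id"): "S5", (6, "("): "S4",
    (7, "id"): "S5", (7, "("): "S4",
    (8, "+"): "S6", (8, ")"): "S11",
    (9, "+"): "R1", (9, "*"): "S7", (9, ")"): "R1", (9, "$"): "R1",
    (10, "+"): "R3", (10, "*"): "R3", (10, ")"): "R3", (10, "$"): "R3",
    (11, "+"): "R5", (11, "*"): "R5", (11, ")"): "R5", (11, "$"): "R5",
}

# GOTO part of the table: (state, nonterminal) -> next state.
GOTO = {
    (0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
    (4, "E"): 8, (4, "T"): 2, (4, "F"): 3,
    (6, "T"): 9, (6, "F"): 3,
    (7, "F"): 10,
}

def parse(tokens):
    """Run the driver; return the sequence of reduce actions taken."""
    stack = [0]                    # symbols and states interleaved, state on top
    tokens = tokens + ["$"]        # append $ to the end of the input
    reductions, i = [], 0
    while True:
        action = ACTION.get((stack[-1], tokens[i]))
        if action is None:                       # blank entry: syntax error
            raise SyntaxError(f"unexpected {tokens[i]!r} in state {stack[-1]}")
        if action == "Accept":
            return reductions
        if action.startswith("S"):               # shift symbol, then new state
            stack += [tokens[i], int(action[1:])]
            i += 1
        else:                                    # reduce by production number
            prod = int(action[1:])
            lhs, rhs_len = PRODS[prod]
            del stack[-2 * rhs_len:]             # pop the handle (symbol/state pairs)
            stack += [lhs, GOTO[(stack[-1], lhs)]]
            reductions.append(prod)

print(parse(["id", "*", "(", "id", "+", "id", ")"]))
```

Run on a * ( b + c ), with each identifier read as id, this yields the reduce sequence R6, R4, R6, R4, R2, R6, R4, R1, R5, R3, R2 of the worked parse in Example 3.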
Example 3 LR parsing
Consider the following grammar, a subset of the assignment statement grammar:
and consider the table to be built by magic for the moment:
We will use the grammar and table to parse the input string, a * ( b + c ), and to understand the meaning of the entries in the table:
Step (1)
Parsing begins with state 0 on the stack and the input terminated by "$":
Stack            Input              Action
0                a * ( b + c ) $
Consulting the table, across from state 0 and under input Id, is the action S5 which means to Shift (push) the input onto the stack and go to state 5.
Step (2)
Stack            Input              Action
(1) 0            a * ( b + c ) $    S5
(2) 0 id 5       * ( b + c ) $
The next step in the parse consults Table [5, *]. The entry across from state 5 and under input *, is the action R6 which means the right-hand side of production 6 is the handle to be reduced. We remove everything on the stack that includes the handle. Here, this is id 5. The stack now contains only 0. Since the left-hand side of production 6 will be pushed on the stack, consult the GOTO part of the table across from state 0 (the exposed top state) and under F (the left-hand side of the production). The entry there is 3. Thus, we push F 3 onto the stack.
Step (3)
Stack            Input              Action
(2) 0 id 5       * ( b + c ) $      R6
(3) 0 F 3        * ( b + c ) $
Now the top of the stack is state 3 and the current input is *. Consulting the entry at Table[3, *], the action indicated is R4: reduce using production 4. Thus, the right-hand side of production 4 is the handle on the stack. The algorithm says to pop the stack up to and including the F. That exposes state 0. Across from 0 and under the left-hand side of production 4 (the T) is state 2. We push the T onto the stack followed by state 2.
Continuing,
Stack                              Input              Action
(3)  0 F 3                         * ( b + c ) $      R4
(4)  0 T 2                         * ( b + c ) $      S7
(5)  0 T 2 * 7                     ( b + c ) $        S4
(6)  0 T 2 * 7 ( 4                 b + c ) $          S5
(7)  0 T 2 * 7 ( 4 id 5            + c ) $            R6
(8)  0 T 2 * 7 ( 4 F 3             + c ) $            R4
(9)  0 T 2 * 7 ( 4 T 2             + c ) $            R2
(10) 0 T 2 * 7 ( 4 E 8             + c ) $            S6
(11) 0 T 2 * 7 ( 4 E 8 + 6         c ) $              S5
(12) 0 T 2 * 7 ( 4 E 8 + 6 id 5    ) $                R6
(13) 0 T 2 * 7 ( 4 E 8 + 6 F 3     ) $                R4
(14) 0 T 2 * 7 ( 4 E 8 + 6 T 9     ) $                R1
(15) 0 T 2 * 7 ( 4 E 8             ) $                S11
(16) 0 T 2 * 7 ( 4 E 8 ) 11        $                  R5
(17) 0 T 2 * 7 F 10                $                  R3
(18) 0 T 2                         $                  R2
(19) 0 E 1                         $                  Accept
Step (19)
The parse is in state 1 looking at "$". The table indicates that this is the accept state. Parsing has thus completed successfully. By following the reduce actions in reverse, starting with R2, the last reduce action, and continuing back to R6, the first reduce action, a parse tree can be created. Exercise 1 asks the reader to draw this parse tree.
At any stage of the parse, we will have the following configuration:
Stack                    Input
s0 x1 s1 ... xm sm       ai ai+1 ... $
where the s's are states, the x's are sequences of terminals or nonterminals, and the a's are input symbols. This is somewhat like a finite-state machine where the state on top (the right here) of the stack contains the "accumulation of information" about the parse until this point. We just have to look at the top of the stack and the symbol coming in to know what to do.
We can construct such a finite-state machine from the productions in the grammar where each state is a set of Items.
We create the table using the grammar for expressions above. The reader is asked in Exercise 2 to extend this to recognize a sequence of assignment statements.
States
The LR table is created by considering different "states" of a parse. Each state consists of a description of similar parse states. These similar parse states are denoted by marked productions called items.
An item is a production with a position marker, e.g.,
E --> E · + T
which indicates the state of the parse where we have seen a string derivable from E and are looking for a string derivable from + T.
Items are grouped and each group becomes a state which represents a condition of the parse. We will state the algorithm and then show how it can be applied to the grammar of Example 3.
Algorithm
Constructing States via Item groups
(0) Create a new nonterminal S' and a new production S' --> S where S is the Start symbol.
(1) IF S is the Start symbol, put S' --> · S into a Start State called State 0.
(2) Closure: IF A --> x · X y is in the state, THEN add X --> · z to the state for every production X --> z in the grammar.
(3) Look for an item of the form A --> x · z, where z is a single terminal or nonterminal, and build a new state from A --> x z ·. (All items with · z in the original state contribute to the new state, each with its dot moved over z.)
(4) Continue until no new states can be created. (A state is new if it is not identical to an old state.)
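Steps (0) through (4) can be sketched directly. In the following Python sketch (our own encoding, not from the text), an item is a (production number, dot position) pair, and production 0 is the augmented S' --> S rule of step (0):

```python
# The expression grammar of Example 3, augmented with E' -> E.
GRAMMAR = [
    ("E'", ("E",)),           # 0: the augmented production of step (0)
    ("E",  ("E", "+", "T")),  # 1
    ("E",  ("T",)),           # 2
    ("T",  ("T", "*", "F")),  # 3
    ("T",  ("F",)),           # 4
    ("F",  ("(", "E", ")")),  # 5
    ("F",  ("Id",)),          # 6
]
NONTERMS = {"E'", "E", "T", "F"}

def closure(items):
    """Step (2): while some A -> x . X y has a nonterminal X after the
    dot, add X -> . z for every production X -> z."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (p, dot) in list(items):
            rhs = GRAMMAR[p][1]
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for q, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == rhs[dot] and (q, 0) not in items:
                        items.add((q, 0))
                        changed = True
    return frozenset(items)

def successor(state, z):
    """Step (3): build a new state from all items with . z by moving
    the dot over z."""
    return closure({(p, dot + 1) for (p, dot) in state
                    if dot < len(GRAMMAR[p][1]) and GRAMMAR[p][1][dot] == z})

def all_states():
    """Steps (1) and (4): start from State 0 and continue until no new
    (not identical to an old) state can be created."""
    start = closure({(0, 0)})          # State 0 holds E' -> . E
    states, work = [start], [start]
    while work:
        state = work.pop()
        for z in {GRAMMAR[p][1][d] for (p, d) in state
                  if d < len(GRAMMAR[p][1])}:
            new = successor(state, z)
            if new not in states:
                states.append(new)
                work.append(new)
    return states

states = all_states()
print(len(states), len(states[0]))
```

For this grammar the construction yields twelve states, matching States 0 through 11 built by hand in Example 4, and State 0 contains seven items.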
Example 4 Constructing items and states for the expression grammar
Step 0

Create E' --> E

Step 1
State 0
E' --> · E
Step 2
E' --> · E fits the model A --> x · X y, with x empty and X = E. E --> T and E --> E + T are productions whose left-hand side is E; thus E --> · T and E --> · E + T are added to State 0.
State 0
E' --> · E
E --> · E + T
E --> · T
Reapplying Step 2 to E --> · T adds

T --> · T * F
T --> · F

and reapplying Step 2 to T --> · F adds

F --> · ( E )
F --> · Id
State 0 is thus:
State 0
E' --> · E
E --> · E + T
E --> · T
T --> · T * F
T --> · F
F --> · ( E )
F --> · Id
If the dot is interpreted as separating the part of the string that has been parsed from the part yet to be parsed, State 0 indicates the state where we "expect" to see an E (an expression). Expecting to see an E is the same as expecting to see an E + T (the sum of two things), a T (a term), or a T * F (the product of two things), since all of these are possibilities for an E (an expression).
Using Step 3, there are two items in State 0 with an E after the dot. E' --> · E fits the model A --> x · z, with x empty and z = E. Thus, we build a new state, putting E' --> E · into it. Since E --> · E + T also has an E after the ·, we add E --> E · + T. Step 2 doesn't apply, and we are finished with State 1.
State 1
E' --> E ·
E --> E · + T
Interpreting the dot, ·, as above, the first item here indicates that the entire expression has been parsed. When we create the table, this item will be used to create the "Accept" entry. (In fact, looking at the table above, it can be seen that "Accept" is an entry for State 1.) Similarly, the item E --> E · + T indicates the state of a parse where an expression has been seen and we expect a "+ T". The string might be "Id + Id" where the first Id has been read, for example, or "Id * Id + Id * Id" where the first Id * Id has been read.
Continuing, the following states are created.
State 2             State 3             State 4             State 5
E --> T ·           T --> F ·           F --> ( · E )       F --> Id ·
T --> T · * F                           E --> · E + T
                                        E --> · T
                                        T --> · T * F
                                        T --> · F
                                        F --> · ( E )
                                        F --> · Id

State 6             State 7             State 8
E --> E + · T       T --> T * · F       F --> ( E · )
T --> · T * F       F --> · ( E )       E --> E · + T
T --> · F           F --> · Id
F --> · ( E )
F --> · Id

State 9             State 10            State 11
E --> E + T ·       T --> T * F ·       F --> ( E ) ·
T --> T · * F
These are called LR(0) items because no lookahead was considered when creating them.
We could use these to build an LR(0) parsing table, but for this example there would be multiply defined entries, since the grammar is not LR(0) (see Exercise 10). These can also be considered SLR items, and we will use them to build an SLR(1) table, using one symbol of lookahead.
Algorithm
Construction of an SLR(1) Parsing Table
(1) IF A --> x · a y is in state m, where a is a terminal, AND A --> x a · y is in state n, THEN enter Sn at Table[m, a].
(2) IF A --> x · is in state n, THEN enter Ri at Table[n, a] for every terminal a in FOLLOW(A), WHERE i is the number of the production i: A --> x.
(3) IF S' --> S · is in State n, THEN enter "Accept" at Table[n, $].
(4) IF A --> x · B y is in State m, where B is a nonterminal, AND A --> x B · y is in State n, THEN enter n at Table[m, B].
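Rule (2) relies on the FOLLOW sets of Chapter 4. As a hedged sketch (our own encoding, not from the text), the following Python computes FOLLOW for the expression grammar; it assumes there are no epsilon productions, which holds for this grammar:

```python
# The expression grammar of Example 3, augmented with E' -> E.
GRAMMAR = [
    ("E'", ("E",)),
    ("E",  ("E", "+", "T")),
    ("E",  ("T",)),
    ("T",  ("T", "*", "F")),
    ("T",  ("F",)),
    ("F",  ("(", "E", ")")),
    ("F",  ("Id",)),
]
NONTERMS = {"E'", "E", "T", "F"}

def first(sym, seen=None):
    """FIRST of a single symbol; with no epsilon productions we only
    chase leftmost right-hand-side symbols."""
    if sym not in NONTERMS:
        return {sym}
    seen = seen if seen is not None else set()
    if sym in seen:                # guard against left-recursion
        return set()
    seen.add(sym)
    out = set()
    for lhs, rhs in GRAMMAR:
        if lhs == sym:
            out |= first(rhs[0], seen)
    return out

def follow_sets():
    follow = {a: set() for a in NONTERMS}
    follow["E'"].add("$")          # the end marker follows the start symbol
    changed = True
    while changed:                 # iterate to a fixed point
        changed = False
        for lhs, rhs in GRAMMAR:
            for i, x in enumerate(rhs):
                if x not in NONTERMS:
                    continue
                # what can begin the rest of the production, or FOLLOW
                # of the left-hand side when x ends the production
                new = first(rhs[i + 1]) if i + 1 < len(rhs) else follow[lhs]
                if not new <= follow[x]:
                    follow[x] |= new
                    changed = True
    return follow

print(sorted(follow_sets()["T"]))
```

This reproduces the set used in Step 2 of Example 5: the terminals that can follow T are +, *, ), and $.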
Example 5 Creating an SLR(1) table for the expression grammar
Following are some of the steps for creating the SLR(1) table used in Example 3. One example is shown for each of the steps in the algorithm.
Step 1
E --> E · + T is in State 1
E --> E + · T is in State 6
so Table[1, +] = S6
Step 2
In State 3 we have T --> F ·
The terminals that can follow T in any sentential form are +, *, ), and $ (see Chapter 4)
So Table[3, +] = Table[3, *] = Table[3, )] = Table[3, $] = R4, where 4 is the number of production T --> F.
Step 3
E' is the Start symbol here and E' --> E · is in State 1, so Table[1, $] = "Accept"

Step 4
E --> · E + T is in State 0, while E --> E · + T is in State 1, so Table[0, E] = 1
If the grammar is not SLR(1), then there may be more than one entry in the table. If both a "shift" action and a "reduce" action occur in the same entry, and the parsing process consults that entry, then a shift-reduce conflict is said to occur (see Exercise 7). Briefly, a shift-reduce conflict occurs when the parser cannot decide whether to continue shifting or to reduce (using a different production rule).
Similarly, a reduce-reduce conflict occurs when the parser has to choose between more than one equally acceptable production.
One way to resolve such conflicts is to attempt to rewrite the grammar. Another method is to analyze the situation and decide, if possible, which action is the correct one. If neither of these steps solves the problem, then it is possible that the underlying language construct cannot be described by an SLR(1) grammar; a different method will have to be used.
In Section 5.2.2, LR(0) states were created: no lookahead was used to create them. We did, however, consider the next input symbol (one symbol of lookahead) when creating the table (see the SLR table construction algorithm). If no lookahead were used to create the table, then the parser would be called an LR(0) parser. Unfortunately, LR(0) parsers don't recognize the constructs one finds in typical programming languages. If we consider the next possible input symbol when creating the items in each state, as well as when creating the table, we have an LR(1) parser.
LR(1) tables for typical programming languages are massive. SLR(1) parsers recognize many, but not all, of the constructs in typical programming languages.
There is another type of parser which recognizes almost as many constructs as an LR(1) parser. This is called an LALR(1) parser and is constructed by first constructing the LR(1) items and states and then merging many of them. Whenever two states are the same except for the lookahead symbols, they are merged. The LA stands for the LookAhead set that is added to each item.
It is important to note that the same driver is used to parse. It is the table generation that is different.
An LR(1) item is an LR(0) item plus a set of lookahead symbols. For example,

E --> E · + T, { $, + }
indicates that we have seen an E, and are expecting a "+T", which may then be followed by the end of string (indicated by $) or by a "+" (as in a+a+a).
The algorithm is the same as for creating LR(0) items except for the closure step, which must be modified to include the lookahead symbols:
Closure:
IF A --> x · X y, L is in the state, where L is the set of lookaheads, THEN for every production X --> z, add X --> · z, FIRST(y l) to the state for each l in L.
We build the first two states here and leave the remaining (21) to the reader:
State 0: E' --> · E, { $ } indicates that the string is followed by $.
Applying the closure rule to this gives us initially E --> · E + T, { $ } as the next item, since FIRST($) = {$}. Now the closure operation must be applied to this item; FIRST(+ T $) = {+}, so the item becomes E --> · E + T, { +, $ }. The entire States 0 and 1 are:
State 0                         State 1
E' --> · E, { $ }               E' --> E ·, { $ }
E --> · E + T, { +, $ }         E --> E · + T, { +, $ }
E --> · T, { +, $ }
T --> · T * F, { $, +, * }
T --> · F, { $, +, * }
F --> · Id, { $, +, * }
F --> · ( E ), { $, +, * }
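The modified closure step can be sketched as follows. In this hedged Python sketch (our own encoding, not from the text), an LR(1) item is a (production, dot position, lookahead) triple:

```python
# The expression grammar of Example 3, augmented with E' -> E.
GRAMMAR = [
    ("E'", ("E",)),
    ("E",  ("E", "+", "T")),
    ("E",  ("T",)),
    ("T",  ("T", "*", "F")),
    ("T",  ("F",)),
    ("F",  ("(", "E", ")")),
    ("F",  ("Id",)),
]
NONTERMS = {"E'", "E", "T", "F"}

def first_of_string(seq):
    """FIRST of a symbol string; with no epsilon productions, only the
    leftmost symbol matters, chased through leftmost rhs symbols."""
    out, seen, work = set(), set(), [seq[0]]
    while work:
        s = work.pop()
        if s not in NONTERMS:
            out.add(s)
        elif s not in seen:
            seen.add(s)
            work += [rhs[0] for lhs, rhs in GRAMMAR if lhs == s]
    return out

def closure(items):
    """Closing A -> x . X y, {L} adds X -> . z with lookahead
    FIRST(y l) for each l in L, for every production X -> z."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (p, dot, la) in list(items):
            rhs = GRAMMAR[p][1]
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for b in first_of_string(rhs[dot + 1:] + (la,)):
                    for q, (lhs, _) in enumerate(GRAMMAR):
                        if lhs == rhs[dot] and (q, 0, b) not in items:
                            items.add((q, 0, b))
                            changed = True
    return items

state0 = closure({(0, 0, "$")})    # E' -> . E, {$}
print(sorted({la for (p, d, la) in state0 if p == 1}))
```

Closing E' --> · E, { $ } reproduces the lookaheads of State 0 above: { +, $ } on the E items and { $, +, * } on the T and F items.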
It is often the case that two states in an LR(1) collection have exactly the same items except for the lookaheads. We can reduce the size of the ultimate table by merging such states. There are ten pairs of states that can be merged in the items for Section 5.2.5 (see Exercise 11). Two of them and their merged state are:
State i                         State j                         State i-j
E --> E + · T, { +, $ }         E --> E + · T, { ), + }         E --> E + · T, { ), +, $ }
T --> · T * F, { $, +, * }      T --> · T * F, { ), +, * }      T --> · T * F, { ), +, *, $ }
T --> · F, { $, +, * }          T --> · F, { ), +, * }          T --> · F, { ), +, *, $ }
F --> · Id, { $, +, * }         F --> · Id, { ), +, * }         F --> · Id, { ), +, *, $ }
F --> · ( E ), { $, +, * }      F --> · ( E ), { ), +, * }      F --> · ( E ), { ), +, *, $ }
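The merge itself is mechanical: states with the same "core" (the same items, ignoring lookaheads) are combined by uniting their lookahead sets. A minimal Python sketch (our own encoding; only the kernel item of each state is shown, for brevity):

```python
# An LR(1) item is a (production, dot position, lookahead) triple;
# the core of a state ignores the lookahead component.

def core(state):
    return frozenset((p, dot) for (p, dot, _) in state)

def merge_lalr(states):
    """Merge every group of states whose items are the same except for
    the lookahead symbols, uniting their lookahead sets."""
    merged = {}
    for state in states:
        merged.setdefault(core(state), set()).update(state)
    return list(merged.values())

# Kernel items of the two states shown above; production 1 is
# E -> E + T, with the dot after the +.
state_i = {(1, 2, "+"), (1, 2, "$")}   # E -> E + . T, { +, $ }
state_j = {(1, 2, ")"), (1, 2, "+")}   # E -> E + . T, { ), + }

print(merge_lalr([state_i, state_j]))
```

Merging the two example states yields a single state whose lookahead set is { ), +, $ }, as in the State i-j column above.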
LALR(1) parsers parse fewer languages than do LR(1) parsers.
As in LL(1) parsing, the driver discovers that something is wrong when a token which cannot be used in a reduction is pushed onto the stack. Error repair consists of adding or deleting tokens to restore the parse to a state where parsing may continue. Since this may involve skipping input, skipping to the next "fiducial" symbol (symbols that begin or end a construct, such as begin, end, semicolon, etc.) is often done.
It is possible to detect the error earlier than when it is pushed onto the stack. Error recovery algorithms can be more clever than those which replace symbols on the stack or in the input.
The literature describes many syntactic error handling algorithms. See the Related Reading section at the end of this chapter, especially Fischer and LeBlanc (1988) and Hammond and Rayward-Smith (1984).
The tables for both top-down and bottom-up parsing may be quite large for typical programming languages.
Representation as a two-dimensional array, which would allow fast access, is impractical, space-wise, because the tables are sparse. Sparse array techniques or other efficient data structures are necessary.
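One common sparse representation is a dictionary keyed by state whose value maps only the defined symbols to actions; the blank (error) entries take no space at all. A sketch, using a fragment of the SLR(1) entries derived in Example 5 (the representation chosen here is one option among several):

```python
# Sparse ACTION table: state -> {symbol -> entry}. Blanks are simply
# absent, so a missing key plays the role of the error entry.
ACTION = {
    0: {"Id": "S5", "(": "S4"},
    1: {"+": "S6", "$": "Accept"},
    3: {"+": "R4", "*": "R4", ")": "R4", "$": "R4"},
}

def lookup(state, symbol):
    """Return the table entry, or None for a blank (syntax error)."""
    return ACTION.get(state, {}).get(symbol)

print(lookup(0, "Id"), lookup(0, "+"))
```

Lookup is still fast (two hash probes), but only the nonblank entries are stored.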
This section describes YACC, Yet Another Compiler-Compiler, written by Steve Johnson at Bell Telephone Laboratories in 1975 for the UNIX operating system. Like LEX, versions have been ported to other machines and other operating systems, and, like the discussions of LEX, the discussion here is not intended as a user manual for YACC.
YACC produces a parser which is a C program. This C program can be compiled and linked with other compiler modules, including the scanner generated by LEX or any other scanner. Versions which have been ported to other operating systems can produce programs in other languages such as Pascal or Ada.
YACC accepts a BNF grammar as input. Each production may have an action associated with it.
The YACC metalanguage has a form similar to that for LEX:
Definitions <-- C declaration and token definitions>
%%
Rules <-- BNF plus associated actions>
%%
User-Written Procedures <-- In C>
The Definitions section can contain the typical things found at the top of a C program: constant definitions, structure declarations, include statements, and user variable definitions. Tokens may also be defined here.
The Rules section contains the BNF. The left-hand side is separated from the right-hand side by a colon, ":". Actions may appear interspersed in the right-hand side; this means they will be executed when all the tokens or nonterminals to the action's left have been shifted. If the action occurs at the end of a production, it will be invoked when a reduction (of the right-hand side to the left-hand side of the production) is made.
Thus, a YACC input file looks like the following:
%{
#include ...
#define ...
%}
%token ...
...
%%
Nonterminal : Right-hand side    {semantic action 1}
            | Right-hand side    {semantic action 2}
...
%%
C functions
Example 6 shows a YACC input file for the language consisting of sequences of assignment statements.
EXAMPLE 6 Sequences of assignment statements in YACC
%{
#include <stdio.h>
%}
%start Program
%token Id
%token Lit
%%
Program     : Statements                {printf("Program \n");};
Statements  : Statement Statements      {printf("Statements \n");}
            | Statement                 {printf("Statements \n");};
Statement   : AsstStmt                  {printf("Statement \n");};
AsstStmt    : Id ":=" Expression        {printf("AsstStmt \n");};
Expression  : Expression "+" Term       {printf("Expression \n");}
            | Expression "-" Term       {printf("Expression \n");}
            | Term                      {printf("Expression \n");};
Term        : Term "*" Factor           {printf("Term \n");}
            | Term "/" Factor           {printf("Term \n");}
            | Factor                    {printf("Term \n");};
Factor      : "(" Expression ")"        {printf("Factor \n");}
            | Id                        {printf("Factor \n");}
            | Lit                       {printf("Factor \n");};
%%
#include "lex.yy.c"
main()
{
    yyparse();
}
In Example 6, the lexical analyzer is the one output by LEX, or users may write their own ( and call it "lex.yy.c"). Alternatively, a function called "yylex()", consisting of code to find tokens, could be written in either the definition section or the User-Written Procedures section.
YACC contains facilities for specifying precedence and associativity of operators and for specifying errors. An error production in YACC is of the form:
B : error { Action }

where error is a reserved token that matches an erroneous construct.
We will look at this example again in Chapter 6 when we discuss semantic analysis.
The C code generated by YACC can be altered before compiling and executing. The generated parser "rules the show" in that it calls the function yylex() when it needs a token. Figure 3 shows a similar picture to that of LEX in Figure 3 of Chapter 3.
The metalanguage input is in a file called translate.y; the output of YACC is in a file called y.tab.c.
The input to the executable a.out program is again the source program because the a.out file contains the included scanner from LEX or the user-written scanner, yylex().
YACC produces an LALR(1) parser. The generated parser produces no output when presented with a correct source program unless the user-written actions contain output statements.
This chapter discusses parser generators, a much-researched and developed area of computer science.
The space occupied by the generated parse tables is considerable, containing thousands of entries. LL(1) tables are smaller than LALR(1) tables, by a ratio of about two to one. LR(1) tables are too large to be practical.
Timewise, both LL(1) and LR-family parsers are linear for the average case (in the number of tokens processed).
It is easier to write a grammar for LR-family parsers than for LL(1) parsers since LL requires that there be no left-recursion or common prefixes.
Most language designers produce an LALR(1) grammar to describe their language.
The LR-family grammars can also handle a wider range of language constructs; in fact the language constructs generated by LL(1) grammars are a proper subset of the LR(1) constructs.
For the LR-family the language constructs recognized are:
LR(0) << SLR(1) < LALR(1) < LR(1)
LL(1) is almost a subset of LALR(1)
where << means much smaller and < means smaller.
The drivers for both LL(1) and LR-family parsers are easy to write. Table generation is easier for LL(1) than it is for LR-family parser generators.
Error handling is similar for both LL(1) and LR-family parsers, with LL(1) being somewhat simpler. Error handing in parser generators is still developing, and the Related Reading section contains many references to past and recent work in this area.
Overall, for parser generation the choice is between LALR(1) and LL(1), with the decision often being made based upon the nature of a grammar. If a grammar already exists and it is LL(1), then that is probably the method of choice. If the grammar is still to be written or the prewritten grammar is not LL(1), then the method of choice is probably LALR(1).