Bottom-Up Parsing


5.0 Introduction

Bottom-up parser generation follows the same form as that for top-down generation: the metalanguage (BNF) is input to a parser generator, which builds the tables consulted by the parser's driver.


5.1 Metalanguage for Bottom-Up Parser Generation

The metalanguage for a bottom-up parser is not as restrictive as that for a top-down parser. Left-recursion is not a problem because the tree is built from the leaves up.

Strictly speaking, right-recursion, where the left-hand nonterminal is repeated at the end of the right-hand side, is not a problem. However, as we will see, the bottom-up parsing method described here pushes terminals and nonterminals on a stack until an appropriate right-hand side is found, and right-recursion can be somewhat inefficient in that many symbols must be pushed before a right-hand side is found.

Example 1 BNF for bottom-up parsing

        
Program         -->     Statements
Statements      -->     Statement Statements | Statement
Statement       -->     AsstStmt
AsstStmt        -->     Identifier := Expression
Expression      -->     Expression + Term | Expression - Term | Term
Term            -->     Term * Factor | Term / Factor | Factor
Factor          -->     ( Expression ) | Identifier | Literal

Example 2 Input to parser generator

        Program         -->     Statements
        Statements      -->     Statement Statements | Statement
        Statement       -->     AsstStmt ;
        AsstStmt        -->     Identifier := Expression
        Expression      -->     Expression + Term | Expression - Term | Term
        Term            -->     Term * Factor | Term / Factor | Factor
        Factor          -->     ( Expression ) | Identifier | Literal


Using a context-free grammar, one can describe the structure of a program, that is, the function of the generated parser. The parser generator converts the BNF into tables. The form of the tables depends upon whether the generated parser is a top-down parser or a bottom-up parser. Top-down parsers are easy to generate; bottom-up parsers are more difficult to generate.

At compile time, the driver reads input tokens, consults the table, and creates a parse from the bottom up.

We will describe the driver first, as usual. The method described here is a shift-reduce parsing method; that is, we parse by shifting input symbols onto a stack until we have enough to recognize an appropriate right-hand side of a production. The sequence on the stack which is ready to be reduced is called the handle.

The handle is a right-hand side of a production, chosen taking into account the rules of the grammar. For example, a parser would not use a + b as the handle if the string were a + b * c, since b must first participate in the reduction of b * c to a term. We will see that our method finds the correct handle.

5.2 LR-Family Parsing

The shift-reduce method to be described here is called LR-parsing. There are a number of variants (hence the use of the term LR-family), but they all use the same driver. They differ only in the generated table. The L in LR indicates that the string is parsed from left to right; the R indicates that the reverse of a rightmost derivation is produced.

Given a grammar, we want to develop a deterministic bottom-up method for parsing legal strings described by the grammar. As in top-down parsing, we do this with a table and a driver which operates on the table.

5.2.1 LR-Family: Parser Driver

The driver reads the input and consults the table. The table has four different kinds of entries called actions:

Shift: push the current input symbol (and a new state) onto the stack and advance the input.

Reduce: a handle is on top of the stack; replace the right-hand side of the indicated production with its left-hand side nonterminal, and push the state the table specifies.

Accept: the entire input has been reduced to the start symbol; announce a successful parse.

Error: no action is possible; announce a syntax error.

Using these actions, the driver algorithm is:
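As an illustration, here is a minimal sketch of such a driver in Python (the encoding of table entries is our own). The ACTION and GOTO tables are the standard SLR(1) tables for the expression grammar E --> E + T | T, T --> T * F | F, F --> ( E ) | Id; the state numbers match the states constructed in Section 5.2.2. Only states are kept on the stack; the grammar symbols are implicit.

```python
# Sketch of an LR driver.  ACTION and GOTO are the standard SLR(1)
# tables for the grammar:
#   1: E -> E + T   2: E -> T       3: T -> T * F
#   4: T -> F       5: F -> ( E )   6: F -> Id
# ("s", n) means Shift and go to state n; ("r", p) means Reduce by
# production p; "acc" means Accept; a missing entry means Error.

PRODS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
         4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}   # (left side, rhs length)

ACTION = {
    0:  {"Id": ("s", 5), "(": ("s", 4)},
    1:  {"+": ("s", 6), "$": "acc"},
    2:  {"+": ("r", 2), "*": ("s", 7), ")": ("r", 2), "$": ("r", 2)},
    3:  {"+": ("r", 4), "*": ("r", 4), ")": ("r", 4), "$": ("r", 4)},
    4:  {"Id": ("s", 5), "(": ("s", 4)},
    5:  {"+": ("r", 6), "*": ("r", 6), ")": ("r", 6), "$": ("r", 6)},
    6:  {"Id": ("s", 5), "(": ("s", 4)},
    7:  {"Id": ("s", 5), "(": ("s", 4)},
    8:  {"+": ("s", 6), ")": ("s", 11)},
    9:  {"+": ("r", 1), "*": ("s", 7), ")": ("r", 1), "$": ("r", 1)},
    10: {"+": ("r", 3), "*": ("r", 3), ")": ("r", 3), "$": ("r", 3)},
    11: {"+": ("r", 5), "*": ("r", 5), ")": ("r", 5), "$": ("r", 5)},
}

GOTO = {0: {"E": 1, "T": 2, "F": 3}, 4: {"E": 8, "T": 2, "F": 3},
        6: {"T": 9, "F": 3}, 7: {"F": 10}}

def lr_parse(tokens):
    """Return True iff tokens (a list ending in '$') is a legal expression."""
    stack = [0]                          # stack of states; s0 on the bottom
    i = 0
    while True:
        act = ACTION[stack[-1]].get(tokens[i])
        if act is None:                  # Error entry
            return False
        if act == "acc":                 # Accept entry
            return True
        if act[0] == "s":                # Shift: push the new state, advance
            stack.append(act[1])
            i += 1
        else:                            # Reduce: pop the handle, push GOTO
            lhs, length = PRODS[act[1]]
            del stack[len(stack) - length:]
            stack.append(GOTO[stack[-1]][lhs])
```

For the input Id + Id * Id $, the driver shifts Id, reduces it to F, then T, then E, shifts +, and so on, finding each handle in turn.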

Example 3 LR parsing

5.2.2 LR-Family: SLR(1) Table Generation

At any stage of the parse, we will have the following configuration:

        
Stack                     Input
s0 x1 s1 ... xm sm        ai ai+1 ... $

where the s's are states, the x's are grammar symbols (terminals or nonterminals), and the a's are input symbols. This is somewhat like a finite-state machine in which the state on top of the stack (at the right here) contains the "accumulation of information" about the parse up to this point. We need only look at the state on top of the stack and the incoming symbol to know what to do.

We can construct such a finite-state machine from the productions in the grammar where each state is a set of Items.

We create the table using the grammar for expressions above. The reader is asked in Exercise 2 to extend this to recognize a sequence of assignment statements.

States

The LR table is created by considering different "states" of a parse. Each state consists of a description of similar parse states. These similar parse states are denoted by marked productions called items.

An item is a production with a position marker, e.g.,

        E → E · + T

which indicates the state of the parse where we have seen a string derivable from E and are looking for a string derivable from + T.

Items are grouped and each group becomes a state which represents a condition of the parse. We will state the algorithm and then show how it can be applied to the grammar of Example 3.

Example 4 Constructing items and states for the expression grammar

If the dot is interpreted as separating the part of the string that has been parsed from the part yet to be parsed, State 0 indicates the state where we "expect" to see an E (an expression). Expecting to see an E is the same as expecting to see an E + T (the sum of two things), a T (a term), or a T * F (the product of two things), since all of these are possibilities for an E (an expression).

Using Step 3, there are two items in State 0 with an E after the dot. E' → · E fits the model A → x · E z with x and z empty. Thus, we build a new state, State 1, putting E' → E · into it. Since E → · E + T also has E after the dot, we add E → E · + T. Step 2 doesn't apply, and we are finished with State 1.

Interpreting the dot, ·, as above, the first item here indicates that the entire expression has been parsed. When we create the table, this item will be used to create the "Accept" entry. (In fact, looking at the table, it can be seen that "Accept" is an entry for State 1.) Similarly, the item E → E · + T indicates the state of a parse where an expression has been seen and we expect a "+ T". The string might be "Id + Id" where the first Id has been read, for example, or "Id * Id + Id * Id" where the first Id * Id has been read.

Continuing, the following states are created.


State 2:    E → T ·
            T → T · * F

State 3:    T → F ·

State 4:    F → ( · E )
            E → · E + T
            E → · T
            T → · T * F
            T → · F
            F → · ( E )
            F → · Id

State 5:    F → Id ·

State 6:    E → E + · T
            T → · T * F
            T → · F
            F → · ( E )
            F → · Id

State 7:    T → T * · F
            F → · ( E )
            F → · Id

State 8:    F → ( E · )
            E → E · + T

State 9:    E → E + T ·
            T → T · * F

State 10:   T → T * F ·

State 11:   F → ( E ) ·

These are called LR(0) items because no lookahead was considered when creating them.

We could use these to build an LR(0) parsing table, but for this example there would be multiply defined entries, since the grammar is not LR(0) (see Exercise 10). These can also be considered SLR items, and we will use them to build an SLR(1) table, using one symbol of lookahead.
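The construction of these items and states can be sketched directly. In the Python sketch below (the encoding is our own), an LR(0) item is a pair (production number, dot position); the closure and goto operations then reproduce the states listed above.

```python
# Sketch of LR(0) closure and goto for the expression grammar.
# An item is a pair (p, d): production index p, dot position d.
GRAMMAR = [
    ("E'", ("E",)),                                    # 0: the added start rule
    ("E", ("E", "+", "T")), ("E", ("T",)),             # 1, 2
    ("T", ("T", "*", "F")), ("T", ("F",)),             # 3, 4
    ("F", ("(", "E", ")")), ("F", ("Id",)),            # 5, 6
]
NONTERMS = {"E'", "E", "T", "F"}

def closure(items):
    """Add an item B -> . gamma for every B appearing just after a dot."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for p, d in list(items):
            rhs = GRAMMAR[p][1]
            if d < len(rhs) and rhs[d] in NONTERMS:
                for q, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == rhs[d] and (q, 0) not in items:
                        items.add((q, 0))
                        changed = True
    return frozenset(items)

def goto(items, x):
    """Move the dot over symbol x in every item where it applies."""
    return closure({(p, d + 1) for p, d in items
                    if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == x})

state0 = closure({(0, 0)})     # the seven items of State 0
state1 = goto(state0, "E")     # {E' -> E . , E -> E . + T}
```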



Example 5 Creating an SLR(1) table for the expression grammar

5.2.3 Shift-Reduce Conflicts

If the grammar is not SLR(1), then there may be more than one entry in the table. If both a "shift" action and a "reduce" action occur in the same entry, and the parsing process consults that entry, then a shift-reduce conflict is said to occur. Briefly, a shift-reduce conflict arises when the parser cannot decide whether to continue shifting or to reduce.

Similarly, a reduce-reduce conflict occurs when the parser has to choose between two or more equally acceptable productions for a reduction.

One way to resolve such conflicts is to attempt to rewrite the grammar. Another method is to analyze the situation and decide, if possible, which action is the correct one. If neither of these steps solves the problem, then it is possible that the underlying language construct cannot be described using an SLR(1) grammar; a different method will have to be used.

5.2.4 LR-Family Members

In Section 5.2.2, LR(0) states were created: no lookahead was used to create them. We did, however, consider the next input symbol (one symbol of lookahead) when creating the table (see the SLR table construction algorithm). If no lookahead were used to create the table, the parser would be called an LR(0) parser. Unfortunately, LR(0) parsers don't recognize the constructs one finds in typical programming languages. If we were to consider the next possible symbol for each of the items in a state, as well as when creating the table, we would have an LR(1) parser.

LR(1) tables for typical programming languages are massive. SLR(1) parsers recognize many, but not all, of the constructs in typical programming languages.

There is another type of parser which recognizes almost as many constructs as an LR(1) parser. This is called an LALR(1) parser and is constructed by first building the LR(1) items and states and then merging many of them: whenever two states are the same except for their lookahead symbols, they are merged. The LA stands for the lookahead token that is added to each item.

It is important to note that the same driver is used to parse. It is the table generation that is different.

5.2.5 LR-Family: LR(1) Table Generation

An LR(1) item is an LR(0) item plus a set of lookahead characters.

For example, the item

        E → E · + T, { $, + }

indicates that we have seen an E and are expecting a "+ T", which may then be followed by the end of the string (indicated by $) or by a "+" (as in a + a + a).

The algorithm is the same as that for creating LR(0) items, except for the closure step, which now must be modified to include the lookahead character:

Applying the closure rule to the initial item E' → · E, { $ } gives E → · E + T, { $ } as the next item, since FIRST($) = {$}. Now the closure operation must be applied to this item; FIRST(+ T $) = {+}, so the next item is E → · E + T, { +, $ }. The entire States 0 and 1 are:




State 0:    E' → · E, { $ }
            E → · E + T, { +, $ }
            E → · T, { +, $ }
            T → · T * F, { $, +, * }
            T → · F, { $, +, * }
            F → · ( E ), { $, +, * }
            F → · Id, { $, +, * }

State 1:    E' → E ·, { $ }
            E → E · + T, { +, $ }
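The FIRST sets used in the modified closure step can be computed by iterating to a fixed point. A sketch for the expression grammar (our own encoding; since this grammar has no epsilon-productions, the FIRST of a string is simply the FIRST of its first symbol):

```python
# Sketch: computing FIRST sets for the (epsilon-free) expression grammar
# by repeated passes until nothing changes.
PRODUCTIONS = [("E", ["E", "+", "T"]), ("E", ["T"]),
               ("T", ["T", "*", "F"]), ("T", ["F"]),
               ("F", ["(", "E", ")"]), ("F", ["Id"])]
NONTERMINALS = {"E", "T", "F"}

def compute_first():
    FIRST = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRODUCTIONS:
            sym = rhs[0]      # no epsilon rules, so only the first symbol matters
            new = FIRST[sym] if sym in NONTERMINALS else {sym}
            if not new <= FIRST[lhs]:
                FIRST[lhs] |= new
                changed = True
    return FIRST
```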

5.2.6 LR-Family: LALR(1) Table Generation

It is often the case that two LR(1) states have exactly the same items except for the lookaheads. We can reduce the size of the ultimate table by merging such states. There are ten pairs of states that can be merged in the LR(1) items for this grammar. Two of them and their merged state are:

State i:                        State j:                        State i-j:
E → E + · T, { +, $ }           E → E + · T, { ), + }           E → E + · T, { ), +, $ }
T → · T * F, { $, +, * }        T → · T * F, { ), +, * }        T → · T * F, { ), +, *, $ }
T → · F, { $, +, * }            T → · F, { ), +, * }            T → · F, { ), +, *, $ }
F → · Id, { $, +, * }           F → · Id, { ), +, * }           F → · Id, { ), +, *, $ }
F → · ( E ), { $, +, * }        F → · ( E ), { ), +, * }        F → · ( E ), { ), +, *, $ }
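The merging step itself is mechanical: group LR(1) states whose items agree except for lookaheads, and union the lookahead sets item by item. A sketch, applied to States i and j above (items are written as strings for readability; the encoding is our own):

```python
# Sketch of LALR(1) state merging: LR(1) states with the same items
# ("core") but different lookaheads are combined by taking the union
# of the lookahead sets, item by item.

def merge_states(states):
    """states: a list of dicts, each mapping an item to its lookahead set."""
    merged = {}
    for state in states:
        core = frozenset(state)          # the items, ignoring lookaheads
        if core not in merged:
            merged[core] = {item: set(las) for item, las in state.items()}
        else:
            for item, las in state.items():
                merged[core][item] |= las   # union the lookahead sets
    return list(merged.values())

# States i and j from the text: same items, different lookaheads.
state_i = {"E -> E + . T": {"+", "$"},
           "T -> . T * F": {"$", "+", "*"},
           "T -> . F":     {"$", "+", "*"},
           "F -> . Id":    {"$", "+", "*"},
           "F -> . ( E )": {"$", "+", "*"}}
state_j = {"E -> E + . T": {")", "+"},
           "T -> . T * F": {")", "+", "*"},
           "T -> . F":     {")", "+", "*"},
           "F -> . Id":    {")", "+", "*"},
           "F -> . ( E )": {")", "+", "*"}}

merged = merge_states([state_i, state_j])   # one state remains
```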

LALR(1) parsers parse fewer languages than do LR(1) parsers.

5.3 Error Handling in LR-Family Parsing

As in LL(1) parsing, the driver discovers that something is wrong when a token which cannot be used in a reduction is pushed onto the stack. Error repair consists of adding or deleting tokens to restore the parse to a state where parsing may continue. Since this may involve skipping input, skipping to the next "fiducial" symbol (symbols that begin or end a construct, such as begin, end, semicolon, etc.) is often done.

5.3.1 Better Error Handling

It is possible to detect the error earlier than when it is pushed onto the stack. Error recovery algorithms can be more clever than those which replace symbols on the stack or in the input.

The literature describes many syntactic error handling algorithms. See the Related Reading section at the end of this chapter, especially Fischer and LeBlanc (1988) and Hammond and Rayward-Smith (1984).

5.3.2 Generator Errors

One particularly insidious error occurs when a syntax error is made in the BNF which is input to the parser generator. The resulting message, although accurate, is not likely to inspire confidence in the end-user of the compiler of which the generated parser is a part.

5.4 Table Representation and Compaction

The tables for both top-down and bottom-up parsing may be quite large for typical programming languages.

Representation as a two-dimensional array, which would allow fast access, is impractical, spacewise, because the tables are sparse. Sparse array techniques or other efficient data structures are necessary.
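For example, each row of the ACTION table can be stored as a small dictionary holding only its nonblank entries, and identical rows can be shared. A sketch, using a few rows of the SLR(1) expression-grammar table (the entry encoding is our own; a missing entry represents the Error action):

```python
# Sketch: a sparse ACTION table stored as one dict per state, with
# identical rows shared.  States 0, 4, 6, and 7 of the expression
# grammar all shift on Id and "(", so their row is stored only once.
shift_row = {"Id": ("shift", 5), "(": ("shift", 4)}
ACTION = {0: shift_row, 4: shift_row, 6: shift_row, 7: shift_row,
          1: {"+": ("shift", 6), "$": "accept"}}

def action(state, token):
    """Look up an entry; a missing entry is the Error action."""
    return ACTION.get(state, {}).get(token, "error")
```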

5.5 YACC, an LALR(1) Bottom-Up Parser Generator

This section describes YACC, Yet Another Compiler-Compiler, written by Steve Johnson at Bell Telephone Laboratories in 1975 for the UNIX operating system. Like LEX, versions have been ported to other machines and other operating systems, and, as with the discussion of LEX, the discussion here is not intended as a user manual for YACC.

YACC produces a parser which is a C program. This C program can be compiled and linked with other compiler modules, including the scanner generated by LEX or any other scanner. Versions which have been ported to other operating systems can produce programs in other languages such as Pascal or Ada.

YACC accepts a BNF grammar as input. Each production may have an action associated with it.

5.5.1 YACC Metalanguage

The YACC metalanguage has a form similar to that for LEX:

The Definitions section can contain the typical things found at the top of a C program: constant definitions, structure declarations, include statements, and user variable definitions. Tokens may also be defined here.

The Rules section contains the BNF. The left-hand side is separated from the right-hand side by a colon, ":". Actions may appear interspersed in the right-hand side; such an action is executed when all the tokens and nonterminals to its left have been shifted. If the action occurs at the end of a production, it is invoked when a reduction (of the right-hand side to the left-hand side of the production) is made.

Thus, a YACC input file looks like the following:


        %{
        #include <file1>
        #include <file2>
        #define ...
        %}
        %token ...
        ...
        %%
        Nonterminal: Right-hand side {semantic action 1}
                   | Right-hand side {semantic action 2}
        ...
        %%
        C functions

Example 6 shows a YACC input file for the language consisting of sequences of assignment statements.

Example 6 Sequences of assignment statements in YACC


        %{
        #include <stdio.h>
        %}
        %start Program
        %token Id
        %token Lit
        %%

Program:        Statements              {printf("Program \n");};
Statements:     Statement Statements    {printf("Statements \n");};
                | Statement             {printf("Statements \n");};
Statement:      AsstStmt                {printf("Statement \n");};
AsstStmt:       Id ":=" Expression      {printf("AsstStmt \n");};
Expression:     Expression "+" Term     {printf("Expression \n");};
                | Expression "-" Term   {printf("Expression \n");};
                | Term                  {printf("Expression \n");};
Term:           Term "*" Factor         {printf("Term \n");};
                | Term "/" Factor       {printf("Term \n");};
                | Factor                {printf("Term \n");};
Factor:         "(" Expression ")"      {printf("Factor \n");};
                | Id                    {printf("Factor \n");};
                | Lit                   {printf("Factor \n");};

%%

#include "lex.yy.c"

main()
{
  yyparse();
}

In Example 6, the lexical analyzer is the one output by LEX, or users may write their own (and call it "lex.yy.c"). Alternatively, a function called "yylex()", consisting of code to find tokens, could be written in either the Definitions section or the User-Written Procedures section.

YACC contains facilities for specifying precedence and associativity of operators and for specifying errors. An error production in YACC is of the form:

B : error { Action }

where error stands for an erroneous construct.
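As an example of the precedence facilities, if the expression grammar were written ambiguously (e.g., Expression : Expression '+' Expression), the resulting shift-reduce conflicts could be resolved by declarations in the Definitions section. A sketch in standard YACC notation (the choice of operators is illustrative):

```yacc
%left '+' '-'     /* lower precedence; left-associative  */
%left '*' '/'     /* higher precedence; left-associative */
```

Operators declared on the same line have equal precedence; lines declared later bind more tightly, so * and / take precedence over + and -.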

We will look at this example again in Chapter 6 when we discuss semantic analysis.


5.6 Summary

This chapter discusses parser generators, a much-researched and developed area of computer science.

The space occupied by the generated parse tables is considerable, containing thousands of entries. LL(1) tables are smaller than LALR(1) tables by a ratio of about two to one. LR(1) tables are too large to be practical.

Timewise, both LL(1) and LR-family parsers are linear for the average case (in the number of tokens processed).

It is easier to write a grammar for LR-family parsers than for LL(1) parsers since LL requires that there be no left-recursion or common prefixes.

Most language designers produce an LALR(1) grammar to describe their language.

The LR-family grammars can also handle a wider range of language constructs; in fact, the language constructs generated by LL(1) grammars are a proper subset of the LR(1) constructs.

For the LR-family, the language constructs recognized are:

        LR(0) << SLR(1) < LALR(1) < LR(1)

where << means much smaller and < means smaller.

The drivers for both LL(1) and LR-family parsers are easy to write. Table generation is easier for LL(1) than it is for LR-family parser generators.

Error handling is similar for both LL(1) and LR-family parsers, with LL(1) being somewhat simpler. Error handling in parser generators is still developing, and the Related Reading section contains many references to past and recent work in this area.

Overall, for parser generation the choice is between LALR(1) and LL(1), with the decision often being made based upon the nature of a grammar. If a grammar already exists and it is LL(1), then that is probably the method of choice. If the grammar is still to be written or the prewritten grammar is not LL(1), then the method of choice is probably LALR(1).