mod3pic1.gif (8121 bytes)

3.0 Introduction

A scanner generator, also called a lexical analyzer generator, follows the form shown in Module 1:

In Figure 1, the scanner is shown in two parts - the table or program stub which is generated and the driver which is written, not generated.

3.1 Metalanguage

The metalanguage for a scanner generator is a set of regular expressions describing the tokens in the language. A program stub or finite state table is generated from the regular expressions. At compile-time, the driver reads input characters from the source program, consults the table based upon the input and the current state, and moves to a new state based upon the entry, perhaps prforming an action such as entering an identifier into a name table.

We will illustrate these concepts with a sample language consisting of assignment statements whose right-hand sides are arithmetic expressions.

Example 1

We see that Identifiers are defined to consist of a sequence of one or more letters and digits beginning with a letter and that a literal consists of a sequence of digits. The allowable operators are those for addition, subtraction, multiplication, division and assignment.

These regular expressions are input to the generator to produce state tables either in table form or as a case statement program stub.

These regular expressions are input to the generator to produce state tables either in table form or as a case statement program stub.

Example 2

In Section 3.2 and in Section 3.3, we discuss the two programs, the driver program and the lexical analyzer generator.

3.2 The Driver

In this section, we will presume the table has been created by magic. Rows in the table represent states, columns are input characters, and the entries in the table are states.

The lexical analyzer (the driver) considers the input source program as a long sequence of characters with two pointers: a current and a lookahead pointer:

When lexical analysis begins to find the next token, current and lookahead both point to the same character:

In the algorithm, four actions are performed on these pointers:

Algorithm

Driver for Lexical Analysis

WHILE there is more input

InputChar := GetChar

State := Table[0, InputChar]

WHILE State != Blank

InputChar := GetChar

State := Table[State, InputChar]

ENDWHILE

Retract

Accept

Return token = (Class, Value)

ENDWHILE

  1. GetChar: Moves the lookahead pointer ahead one character and returns that character.

  2. Retract: Moves the lookahead pointer back one character.

    Before Retract

    After Retract

  3. Accept: Moves the current pointer ahead to the lookahead pointer:

    After Accept

  4. Return: Returns a token consisting of a class and a value, as well as performs any actions associated with that state, e.g., installing an identifier into the name table.

The driver program scans the input program and comsults the entry at Table[State, InputChar] for the new state. The entry at Table[State, InputChar] consists of a new state and perhaps an action to be performed before moving to the new state:

Algorithm

In the algorithm retract is necessary when a token is found because the algorithm will have scanned one character too far.

Example 3

The magically created table

When the driver algorithm begins, the current and lookahead pointers are initialized to the character preceding the input. (See Exercise 2 )