Languages are described using grammars. In particular, the syntax of a programming
language is described using context-free grammars and written down using Backus-Naur
form (BNF).
The metalanguage for a scanner generator is a set of regular expressions
describing the tokens in the language. A program stub or finite state table is
generated from the regular expressions. At compile time, the driver reads input
characters from the source program, consults the table based upon the input and
the current state, and moves to a new state based upon the entry, perhaps performing
an action such as entering an identifier into a name table.
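The table-driven loop described above can be sketched as follows. This is a minimal illustration, not the output of any particular scanner generator: the state names, the character classes, and the helpers char_class and scan are all assumptions made for the example, and the table covers only identifiers and integers.

```python
# Hypothetical table-driven scanner for identifiers and integers only.
def char_class(ch):
    """Map an input character to the column used to index the table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# TABLE[state][character class] -> next state; a missing entry means "no move".
TABLE = {
    "start": {"letter": "ident", "digit": "int"},
    "ident": {"letter": "ident", "digit": "ident"},
    "int": {"digit": "int"},
}
# States in which a complete token has been recognized.
ACCEPTING = {"ident": "IDENT", "int": "INT"}

def scan(text):
    """Driver: read characters, consult the table, move to the next state."""
    tokens, names = [], set()
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        state, start = "start", i
        while i < len(text):
            nxt = TABLE[state].get(char_class(text[i]))
            if nxt is None:
                break
            state = nxt
            i += 1
        if state not in ACCEPTING:
            raise ValueError(f"unexpected character {text[i]!r} at position {i}")
        lexeme = text[start:i]
        kind = ACCEPTING[state]
        if kind == "IDENT":
            names.add(lexeme)  # action: enter the identifier into the name table
        tokens.append((kind, lexeme))
    return tokens, names

tokens, names = scan("x1 42 y")
```

In a generated scanner the table would be produced from the regular expressions rather than written by hand; only the driver loop stays the same.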
The structure of English is given in terms of subjects, verbs, etc.
The structure of a computer program is given in terms of procedures, statements,
expressions, etc. For example, an arithmetic expression consisting of just addition
and multiplication can be described using the following rules:
Expression -- > Expression + Term | Term
Term -- > Term * Factor | Factor
Factor -- > (Expression) | name | integer
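To see how these rules generate a string, the sketch below stores the grammar as a table and carries out a leftmost derivation of name + name * integer. The dictionary layout and the helper leftmost_derivation are assumptions made for illustration; each number in the choice list picks which alternative to apply to the leftmost nonterminal.

```python
# The expression grammar as a table: nonterminal -> list of alternatives.
GRAMMAR = {
    "Expression": [["Expression", "+", "Term"], ["Term"]],
    "Term": [["Term", "*", "Factor"], ["Factor"]],
    "Factor": [["(", "Expression", ")"], ["name"], ["integer"]],
}

def leftmost_derivation(start, choices):
    """Apply the chosen alternative to the leftmost nonterminal at each step."""
    form = [start]
    steps = [" ".join(form)]
    for alt in choices:
        # Find the leftmost symbol that is a nonterminal.
        idx = next(i for i, sym in enumerate(form) if sym in GRAMMAR)
        # Replace it with the right-hand side of the chosen alternative.
        form[idx:idx + 1] = GRAMMAR[form[idx]][alt]
        steps.append(" ".join(form))
    return steps

steps = leftmost_derivation("Expression", [0, 1, 1, 1, 0, 1, 1, 2])
for step in steps:
    print(step)
```

Each printed line is a sentential form; the last one contains only terminals and is therefore a sentence of the language.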
Unlike natural languages such as English, all the legal strings in a programming
language can be specified using a context-free grammar. However, grammars for
programming languages may also generate strings that are not legal programs. For example,
a context-free grammar cannot tell whether a variable, say A, declared
to be of type boolean is used in an arithmetic expression such as A + 1.
A sentence is ambiguous if it has more than one distinct leftmost derivation.
If a sentence is ambiguous, then the parse tree is not unique; we can create more
than one parse tree for the same sentence. A grammar is ambiguous if it can generate
even one ambiguous sentence.
Consider the following grammar for IF-THEN-ELSE statements:
S -- > IF b THEN S ELSE S
| IF b THEN S | a
where b represents a boolean condition and a represents some other statements.
Then IF b THEN IF b THEN a ELSE a has two parse trees: one associates the ELSE
with the first IF, the other associates it with the second IF.
The second parse, with the ELSE associated with the closest IF, is considered to be correct.
We can rewrite this grammar to be unambiguous:
S1 -- > IF b THEN S1 | IF b THEN S2 ELSE S1 | a
S2 -- > IF b THEN S2 ELSE S2 | a
Then IF b THEN IF b THEN a ELSE a has only one parse.
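The parse counts for the two IF-THEN-ELSE grammars can be checked mechanically. The sketch below is one possible way to do so, assuming a grammar with no empty productions and no left recursion (true of both grammars here); the function count_parses and the dictionary encoding of the rules are illustrative names, not part of any standard tool.

```python
from functools import lru_cache

def count_parses(grammar, start, tokens):
    """Count the distinct parse trees of tokens rooted at start."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(sym, i, j):
        # Number of ways sym derives exactly tokens[i:j].
        if sym not in grammar:  # terminal: must match a single token
            return 1 if j == i + 1 and tokens[i] == sym else 0
        return sum(count_seq(prod, i, j) for prod in grammar[sym])

    @lru_cache(maxsize=None)
    def count_seq(symbols, i, j):
        # Number of ways the symbol sequence derives exactly tokens[i:j].
        if not symbols:
            return 1 if i == j else 0
        first, rest = symbols[0], symbols[1:]
        total = 0
        # Every symbol derives at least one token, so leave room for the rest.
        for k in range(i + 1, j - len(rest) + 1):
            left = count_symbol(first, i, k)
            if left:
                total += left * count_seq(rest, k, j)
        return total

    return count_symbol(start, 0, len(tokens))

AMBIGUOUS = {
    "S": (("IF", "b", "THEN", "S", "ELSE", "S"),
          ("IF", "b", "THEN", "S"),
          ("a",)),
}
UNAMBIGUOUS = {
    "S1": (("IF", "b", "THEN", "S1"),
           ("IF", "b", "THEN", "S2", "ELSE", "S1"),
           ("a",)),
    "S2": (("IF", "b", "THEN", "S2", "ELSE", "S2"),
           ("a",)),
}

sentence = "IF b THEN IF b THEN a ELSE a".split()
print(count_parses(AMBIGUOUS, "S", sentence))     # 2
print(count_parses(UNAMBIGUOUS, "S1", sentence))  # 1
```

Counting trees for one sentence does not prove that a grammar is unambiguous in general; it only confirms the behavior on this particular sentence.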
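Left-recursive grammars are a problem for some (top-down) parsers, since a procedure
for a left-recursive rule would call itself without consuming any input. We can remove
the left recursion from the expression grammar by using iteration.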
Braces, { }, are often used to represent zero or more occurrences of their contents,
while brackets, [ ], enclose optional items. Thus, using extended BNF, we can write
the Expression grammar:
Expression -- > Term {+ Term}
Term -- > Factor {* Factor}
Factor -- > (Expression) | a
The first rule derives the sentential forms Term, Term + Term, Term + Term + Term, etc.
Of course, since Factor can derive an Expression in Factor -- > (Expression), this grammar is still
recursive, but it is not left recursive.
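In a top-down parser, each brace pair in the extended BNF grammar becomes a loop, so no procedure calls itself before consuming input. The following recognizer is a minimal sketch of that idea for the three rules above; the function names and the choice to treat every non-blank character as one token are assumptions made for the example.

```python
def recognize(text):
    """Return True if text is a sentence of the Expression grammar."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, found {peek()!r}")
        pos += 1

    def expression():
        # Expression -- > Term {+ Term}
        term()
        while peek() == "+":
            expect("+")
            term()

    def term():
        # Term -- > Factor {* Factor}
        factor()
        while peek() == "*":
            expect("*")
            factor()

    def factor():
        # Factor -- > (Expression) | a
        if peek() == "(":
            expect("(")
            expression()  # recursion, but only after consuming "("
            expect(")")
        else:
            expect("a")

    try:
        expression()
    except SyntaxError:
        return False
    return pos == len(tokens)  # all input must be consumed

print(recognize("a + a * (a + a)"))  # True
print(recognize("a + * a"))          # False
```

The only recursive call is in factor, and it happens after a left parenthesis has been consumed, which is why the procedures always terminate.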
The following is a grammar for a simple calculator. The grammar supports the plus (+),
minus (-), multiply (*), and divide (/) operations.
Expression -- > Term { ADD_OP Term }
Term -- > Factor { MULT_OP Factor }
Factor -- > ( Expression ) | Number
Number -- > Digit { Digit }
Digit -- > 0|1|2|3|4|5|6|7|8|9
ADD_OP -- > +|-
MULT_OP -- > *|/
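One way to turn this grammar into a working calculator is a recursive-descent evaluator with one function per rule. The sketch below makes several assumptions that the grammar itself does not settle: operators associate to the left, division is Python's true division (so it yields a float), and blanks are discarded before parsing, which means digits separated only by blanks are read as one number. The name evaluate and its helpers are illustrative.

```python
def evaluate(text):
    """Evaluate an arithmetic expression written in the calculator grammar."""
    chars = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return chars[pos] if pos < len(chars) else None

    def expression():
        # Expression -- > Term { ADD_OP Term }
        nonlocal pos
        value = term()
        while peek() in ("+", "-"):
            op = chars[pos]
            pos += 1
            right = term()
            value = value + right if op == "+" else value - right
        return value

    def term():
        # Term -- > Factor { MULT_OP Factor }
        nonlocal pos
        value = factor()
        while peek() in ("*", "/"):
            op = chars[pos]
            pos += 1
            right = factor()
            value = value * right if op == "*" else value / right
        return value

    def factor():
        # Factor -- > ( Expression ) | Number
        nonlocal pos
        if peek() == "(":
            pos += 1
            value = expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return value
        return number()

    def number():
        # Number -- > Digit { Digit }
        nonlocal pos
        start = pos
        while peek() is not None and peek() in "0123456789":
            pos += 1
        if start == pos:
            raise SyntaxError(f"expected a digit, found {peek()!r}")
        return int("".join(chars[start:pos]))

    value = expression()
    if pos != len(chars):
        raise SyntaxError(f"unexpected {peek()!r}")
    return value

print(evaluate("2 + 3 * 4"))    # 14
print(evaluate("(2 + 3) * 4"))  # 20
print(evaluate("8 - 3 - 2"))    # 3
```

Because Term is parsed inside Expression, multiplication and division bind more tightly than addition and subtraction, which matches the layering of the grammar.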