Languages are described using grammars. In particular, the syntax of a programming language is described using context-free grammars and written down using Backus-Naur form (BNF).

The metalanguage for a scanner generator is a set of regular expressions describing the tokens in the language. A program stub or finite state table is generated from the regular expressions. At compile-time, the driver reads input characters from the source program, consults the table based upon the input and the current state, and moves to a new state based upon the entry, perhaps performing an action such as entering an identifier into a name table.
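
The driver loop described above can be sketched in a few lines of Python. This is only an illustration: the two token kinds, the state names, and the hand-written TABLE are assumptions made for the example, whereas a real scanner generator would build the table from the regular expressions.

```python
# Hypothetical table-driven scanner for two token kinds: identifiers and integers.
# The states, character classes, and table entries are illustrative assumptions.

def char_class(ch):
    """Map an input character to the column used to index the table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# TABLE[state][character class] gives the next state; a missing entry means "no move".
TABLE = {
    "start": {"letter": "ident", "digit": "int"},
    "ident": {"letter": "ident", "digit": "ident"},
    "int": {"digit": "int"},
}
ACCEPTING = {"ident": "IDENT", "int": "INT"}

def scan(source):
    """Return (tokens, name_table) for the source string."""
    tokens, name_table = [], set()
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        state, start = "start", pos
        # Driver: consult the table with the current state and input character.
        while pos < len(source):
            nxt = TABLE[state].get(char_class(source[pos]))
            if nxt is None:
                break
            state = nxt
            pos += 1
        if state not in ACCEPTING:
            raise ValueError(f"unexpected character {source[start]!r} at {start}")
        lexeme = source[start:pos]
        kind = ACCEPTING[state]
        if kind == "IDENT":
            name_table.add(lexeme)  # action: enter the identifier into the name table
        tokens.append((kind, lexeme))
    return tokens, name_table
```

Calling scan("x1 42 y") yields the three tokens x1, 42, and y, and records x1 and y in the name table.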


The structure of English is given in terms of subjects, verbs, etc. The structure of a computer program is given in terms of procedures, statements, expressions, etc. For example, an arithmetic expression consisting of just addition and multiplication can be described using the following rules:

Expression -- > Expression + Term | Term
Term -- > Term * Factor | Factor
Factor -- > (Expression) | name | integer

Unlike natural languages like English, all the legal strings in a programming language can be specified using a context-free grammar. However, grammars for programming languages may generate incorrect strings as well, because some rules of the language cannot be expressed in a context-free grammar. For example, a context-free grammar cannot be used to tell whether a variable, say A, declared to be of type boolean is used in an arithmetic expression such as A + 1.
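
Checks of this kind are usually left to a later pass that consults the declarations. The sketch below is a minimal illustration under stated assumptions: SYMBOLS is a hypothetical symbol table, and operand_type and check_addition are names invented for this example.

```python
# Hypothetical symbol table: the declared type of each name.
SYMBOLS = {"A": "boolean", "N": "integer"}

def operand_type(token):
    """Type of a single operand: an integer literal or a declared name."""
    if token.isdigit():
        return "integer"
    return SYMBOLS[token]

def check_addition(left, right):
    """Return an error message, or None if both operands are integers."""
    for operand in (left, right):
        found = operand_type(operand)
        if found != "integer":
            return f"{operand} has type {found}, expected integer"
    return None
```

Both A + 1 and N + 1 have the form Expression + Term, so the grammar accepts them; only the lookup in the symbol table distinguishes them.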


A sentence is ambiguous if it has more than one distinct leftmost derivation. If a sentence is ambiguous, then the parse tree is not unique; we can create more than one parse tree for the same sentence. A grammar is ambiguous if it can generate even one ambiguous sentence.
Consider the following grammar for IF-THEN-ELSE statements:

S -- > IF b THEN S ELSE S
|IF b THEN S
|a
where b represents a boolean condition and a represents some other statement. Then IF b THEN IF b THEN a ELSE a has two parse trees: the first attaches ELSE to the outer IF, and the second attaches it to the inner IF.
The second parse, with ELSE associated with the closest IF, is considered to be correct. We can rewrite this grammar to be unambiguous:
S1 -- > IF b THEN S1 | IF b THEN S2 ELSE S1 | a
S2 -- > IF b THEN S2 ELSE S2 | a
Then IF b THEN IF b THEN a ELSE a has only one parse.
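
The parse counts for the two grammars can be checked mechanically. The following sketch counts parse trees by trying every way of splitting the sentence among the symbols of each rule. The function name count_parses and the dictionary encoding of the grammars are assumptions made for this example, and the approach relies on every rule starting with a terminal, which holds for both grammars above.

```python
from functools import lru_cache

# Each nonterminal maps to its alternatives; any symbol that is not a key is a terminal.
AMBIGUOUS = {
    "S": [("IF", "b", "THEN", "S", "ELSE", "S"),
          ("IF", "b", "THEN", "S"),
          ("a",)],
}
UNAMBIGUOUS = {
    "S1": [("IF", "b", "THEN", "S1"),
           ("IF", "b", "THEN", "S2", "ELSE", "S1"),
           ("a",)],
    "S2": [("IF", "b", "THEN", "S2", "ELSE", "S2"),
           ("a",)],
}

def count_parses(grammar, start, sentence):
    """Count the distinct parse trees of a space-separated sentence."""
    tokens = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def seq(symbols, i, j):
        """Number of ways the symbol sequence derives tokens[i:j]."""
        if not symbols:
            return 1 if i == j else 0
        head, rest = symbols[0], symbols[1:]
        if head not in grammar:  # terminal: must match exactly one token
            if i < j and tokens[i] == head:
                return seq(rest, i + 1, j)
            return 0
        total = 0
        # Nonterminal: try every end position k for the part it derives.
        for k in range(i + 1, j + 1):
            ways = sum(seq(prod, i, k) for prod in grammar[head])
            if ways:
                total += ways * seq(rest, k, j)
        return total

    return seq((start,), 0, len(tokens))

SENTENCE = "IF b THEN IF b THEN a ELSE a"
print(count_parses(AMBIGUOUS, "S", SENTENCE))     # 2
print(count_parses(UNAMBIGUOUS, "S1", SENTENCE))  # 1
```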

Left recursive grammars are a problem for some (top-down) parsers, because expanding a rule such as Expression -- > Expression + Term leads the parser to expand Expression again without consuming any input. We can remove the left recursion from the expression grammar using iteration. Braces, { }, are often used to represent 0 or more occurrences of their contents, while brackets, [ ], enclose optional items. Thus, using extended BNF, we can write the Expression grammar:
Expression -- > Term {+ Term}
Term -- > Factor {* Factor}
Factor -- > (Expression) | a
The first rule derives the sentential forms Term, Term + Term, Term + Term + Term, etc.
Of course, since Factor can derive an Expression in Factor -- > (Expression), this grammar is still recursive, but it is not left recursive.
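
One reason the iterative form suits top-down parsing is that each pair of braces becomes a loop. The recognizer below is a minimal sketch of that idea for the three rules above, with one function per rule. The function names and the one-character-per-token input format are assumptions made for this example.

```python
def recognize(text):
    """Return True if text is a sentence of the EBNF Expression grammar."""
    tokens = [ch for ch in text if not ch.isspace()]  # every token is one character
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at token {pos}")
        pos += 1

    def expression():   # Expression -- > Term {+ Term}
        term()
        while peek() == "+":   # the braces become a loop
            expect("+")
            term()

    def term():         # Term -- > Factor {* Factor}
        factor()
        while peek() == "*":
            expect("*")
            factor()

    def factor():       # Factor -- > (Expression) | a
        if peek() == "(":
            expect("(")
            expression()       # recursion, but only after consuming "("
            expect(")")
        else:
            expect("a")

    try:
        expression()
        return peek() is None  # all input must be consumed
    except SyntaxError:
        return False
```

For instance, recognize("a + a * (a + a)") succeeds, while recognize("a +") and recognize("(a") fail.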


The following is a grammar for a simple calculator. The grammar supports the plus (+), minus (-), multiply (*), and divide (/) operations.

Expression -- > Term { ADD_OP Term }
Term -- > Factor { MULT_OP Factor}
Factor -- > ( Expression )| Number
Number -- > Digit { Digit }
Digit -- > 0|1|2|3|4|5|6|7|8|9
ADD_OP -- > +|-
MULT_OP -- > *|/
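
A grammar in this form translates almost directly into an evaluator. The sketch below is one possible reading, not a definitive implementation: the function name calculate is invented for the example, the tokenizer is a simple regular expression, and division is assumed to be Python's true division (the grammar itself does not say whether / is integer or real division).

```python
import re

def calculate(text):
    """Evaluate an arithmetic expression written in the calculator grammar."""
    # A Number is a run of digits; any other non-space character is its own token.
    tokens = re.findall(r"[0-9]+|\S", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expression():   # Expression -- > Term { ADD_OP Term }
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():         # Term -- > Factor { MULT_OP Factor }
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value /= factor()  # assumption: true division
        return value

    def factor():       # Factor -- > ( Expression ) | Number
        tok = advance()
        if tok == "(":
            value = expression()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and re.fullmatch(r"[0-9]+", tok):
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    result = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result

print(calculate("2 + 3 * 4"))    # 14
print(calculate("(2 + 3) * 4"))  # 20
print(calculate("10 - 4 - 3"))   # 3
```

Because each loop folds its operands from left to right, operators at the same level associate to the left, and the Term rule nested inside the Expression rule gives * and / higher precedence than + and -.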