2.0 Language Design Issues

Various issues have been identified for designing a programming language. We will describe four aspects:

Mathematical Aspect

The mathematical aspect dictates that the language be elegant, but simple to describe mathematically. Lisp-like languages are an example of languages for which this aspect is especially noticeable.

Other ways to describe this aspect are to say that the language should have an inherent simplicity and be easy to learn, expressive and orthogonal. Expressiveness is the power of a language to solve problems. Orthogonality requires combinations of legal constructs to be themselves legal constructs. For example, if a user type of stack is created, and the language allows arrays, an array of stacks would also be allowed. This feature is somewhat related to the object-oriented concept polymorphism and to the concept of first-order languages.

Included in the mathematical aspect is the ability of the language's syntax and semantics to be described accurately and precisely.

Cognitive and Social Aspect

This is the programmer's aspect. C is an example of a language with high programmer visibility. It should be easy to read and write programs in the language. Most people would agree that it is easy to write programs in BASIC, but reading them is more difficult. With programming maturity, C is also easy to write, but hard to read.

Implementation Aspect

Niklaus Wirth, the designer of Pascal (and other languages since then), states that a "programming language is as good as its compiler". The first compilers for PL/1 were so buggy that the language's use was affected. When Ada was designed, the design team, remembering the problems with PL/1, designed validation requirements that a compiler would have to pass to be considered a legal Ada compiler.

The time to construct the compiler and its size and speed as well as its "user-friendliness" are all factors to be considered. The original FORTRAN compiler took 18 person-years to develop! We can now design and implement a compiler much faster because tools that generate various phases, as well as algorithms and data structures for the various compiler tasks, have been developed.

2.1 Grammars

Grammars describe languages. Natural languages, such as English, are often described by a grammar which groups words into syntactic categories such as subjects, predicates, prepositional phrases etc., and then into subcategories such as nouns, verbs, and prepositions, etc.

Stated more mathematically, a grammar is a formal device for specifying a potentially infinite language in a finite way, since it is impossible to list all the possible strings in a language whether it is English or C. At the same time, a grammar imposes a structure on the sentences in the language. That is, a grammar, G, defines a language, L(G), by defining a way to derive all legal strings in the language. We will look at this for a (very) small subset of English.

2.1.1 Context-free Grammars for Natural Languages

Noam Chomsky, in 1957, used the following notation, called productions, to define the syntax of English. The terms used here, sentence, noun phrase, etc., plus the following rules describe a very small subset of English sentences. The articles a and the have been categorized as adjectives for simplicity.

 
         <sentence>      -- >   <noun phrase> <verb phrase>
         <noun phrase>   -- >   <adjective> <noun phrase> 
                             | <adjective> <singular noun>
         <verb phrase>   -- >   <singular verb> <adverb>
         <adjective>     -- >   a | the | little
         <singular noun> -- >   boy
         <singular verb> -- >   ran
         <adverb>        -- >   quickly

Here, the arrow, -- >, might be read as "is defined as" and the vertical bar, "|", as "or". Thus, a noun phrase is defined as an adjective followed by another noun phrase or as an adjective followed by a singular noun. This definition of noun phrase is recursive because noun phrase occurs on both sides of the production. Recursion is what lets a finite set of productions describe arbitrarily long strings, and therefore infinitely many of them.

This grammar is said to be context-free because only one syntactic category, e.g., <verb phrase>, occurs on the left of the arrow. If there were more than one syntactic category, this would describe a context and be called context-sensitive.

A grammar is an example of a metalanguage: a language used to describe another language. Here, the metalanguage is the context-free grammar used to describe a part of the English language.

Example 2 shows a diagram called a parse tree or structure tree for the sentence the little boy ran quickly.

The sentence: "quickly, the little boy ran" is also a syntactically correct sentence, but it cannot be derived from the above grammar. In fact, it is impossible to describe all the correct English sentences using a context-free grammar.

On the other hand, it is possible, using the grammar above, to derive the syntactically correct, but semantically incorrect string, "little the boy ran quickly." Context-free grammars cannot describe semantics.
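The three claims above can be checked mechanically. The following is a minimal sketch, not part of the original notes: it encodes the productions as a Python dictionary and lists every sentence of at most five words that the grammar derives. The names GRAMMAR, expand and expand_seq are illustrative, and the sketch assumes a grammar with no empty right-hand sides.

```python
# Productions of the small English grammar; any word that is not a key is a terminal.
GRAMMAR = {
    "sentence": [["noun phrase", "verb phrase"]],
    "noun phrase": [["adjective", "noun phrase"], ["adjective", "singular noun"]],
    "verb phrase": [["singular verb", "adverb"]],
    "adjective": [["a"], ["the"], ["little"]],
    "singular noun": [["boy"]],
    "singular verb": [["ran"]],
    "adverb": [["quickly"]],
}


def expand(symbol, max_words):
    """Return all word tuples of length <= max_words derivable from symbol."""
    if max_words < 1:
        return set()
    if symbol not in GRAMMAR:          # terminal: derives only itself
        return {(symbol,)}
    result = set()
    for alternative in GRAMMAR[symbol]:
        result |= expand_seq(tuple(alternative), max_words)
    return result


def expand_seq(symbols, max_words):
    """Return all word tuples derivable from a sequence of symbols."""
    if not symbols:
        return {()}
    head, rest = symbols[0], symbols[1:]
    result = set()
    # Every remaining symbol needs at least one word, so reserve len(rest) words.
    for first in expand(head, max_words - len(rest)):
        for remainder in expand_seq(rest, max_words - len(first)):
            result.add(first + remainder)
    return result


sentences = {" ".join(words) for words in expand("sentence", 5)}
print("the little boy ran quickly" in sentences)   # derivable
print("little the boy ran quickly" in sentences)   # derivable, although it reads badly
print("quickly the little boy ran" in sentences)   # not derivable from this grammar
```

The word limit is only there to keep the enumeration finite; the recursive noun phrase rule would otherwise produce ever longer strings of adjectives.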

2.1.2 Context-free Grammars for Programming Languages

The structure of English is given in terms of subjects, verbs, etc. The structure of a computer program is given in terms of procedures, statements, expressions, etc. For example, an arithmetic expression consisting of just addition and multiplication can be described using the following rules:

        <expression> ::= <expression> + <term> | <term>
        <term>       ::= <term> * <factor> | <factor>
        <factor>     ::= (  <expression> ) 
                       | <name> | <integer>
        <name>       ::= <letter> | <name> <letter> | <name> <digit>
        <integer>    ::= <digit> | <integer> <digit>
        <letter>     ::= A | B | ... |Z 
        <digit>      ::= 0 | 1 | 2 | ... | 9

Here, we have used "::=" for "is defined as" rather than the arrow, -- >, used before. The metalanguage BNF (Backus-Naur Form) is a way of specifying context-free languages, and BNF was originally defined using ::= rather than -- >. As long as we understand what is meant and what the capabilities of this grammatical notation are, the notation doesn't matter. We will often omit the angle brackets, < >, when writing BNF.

Unlike natural languages like English, all the legal strings in a programming language can be specified using a context-free grammar. However, grammars for programming languages specify semantically incorrect strings as well. For example, a context-free grammar cannot be used to tell whether a variable, say A, declared to be of type boolean is used in an arithmetic expression A + 1.

A parse tree shows the structure of a string using the grammar.

[Figure: Parse Tree for A * B]

Derivations

The parse tree shows the structure, but it does not tell us in exactly what order the productions were applied. The following example shows the expression grammar and one derivation of a + a * a.

(E stands for expression, T for term and F for factor. The number above each arrow is the number of the production applied.)

A left-most derivation replaces the left-most nonterminal at each step of the derivation, while a right-most derivation replaces the right-most nonterminal at each step.
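To make the difference concrete, here is a small sketch that replays a numbered sequence of productions, always rewriting either the left-most or the right-most nonterminal. The production numbering (1 to 6) is an assumption made for this illustration, and the helper name derive is hypothetical.

```python
# Expression grammar with an assumed numbering of the productions.
PRODUCTIONS = {
    1: ("E", ["E", "+", "T"]),
    2: ("E", ["T"]),
    3: ("T", ["T", "*", "F"]),
    4: ("T", ["F"]),
    5: ("F", ["(", "E", ")"]),
    6: ("F", ["a"]),
}
NONTERMINALS = {"E", "T", "F"}


def derive(numbers, leftmost=True):
    """Apply the numbered productions in order and return every sentential form."""
    form = ["E"]
    steps = ["E"]
    for number in numbers:
        lhs, rhs = PRODUCTIONS[number]
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"production {number} does not apply to {form[i]}")
        form[i:i + 1] = rhs            # replace the chosen nonterminal
        steps.append(" ".join(form))
    return steps


left = derive([1, 2, 4, 6, 3, 4, 6, 6], leftmost=True)
right = derive([1, 3, 6, 4, 6, 2, 4, 6], leftmost=False)
print(" => ".join(left))
print(" => ".join(right))
```

Both sequences use the same eight productions and end in a + a * a; only the order in which the productions are applied differs, which is why both correspond to the same parse tree.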

Parsing reverses the derivation process: given an input string, the parser has to "discover" a derivation (if any).

Terminals and Nonterminals

Terminals

Identifiers

Nonterminals

Productions

Productions can be thought of as a set of replacement rules (also called rewriting rules ). Each rule can be written:

A -- > α

where A is a nonterminal and α is a string of terminals and nonterminals. In the Expression grammar, E -- > E + T is a production.

Start Symbol

The start symbol, also called a goal symbol, is a special nonterminal designated as the one from which all strings are derived. In the Expression grammar, E is the designated start symbol.

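Sentential Form and Sentence

A sentential form is any string of terminals and nonterminals that can be derived from the start symbol. Every intermediate string in a derivation of a + a * a is a sentential form, as is a + a * a itself.

A sentence is a sentential form that consists only of terminals, such as a + a * a.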

        Algorithm 
          Derive String 
        
        String := Start Symbol
        REPEAT
          Choose any nonterminal in String.
          Find a production with this nonterminal on the left-hand side.
          Replace the nonterminal with one of the options on the right-hand
            side of the production.
        UNTIL String contains only terminals.
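A minimal Python sketch of this algorithm for the Expression grammar follows. The random choices and the step budget are assumptions added for illustration: purely random choices need not terminate on a recursive grammar, so after max_random_steps the sketch switches to the option with the fewest nonterminals, which is enough to finish for this particular grammar.

```python
import random

# Expression grammar; symbols that are not keys are terminals.
GRAMMAR = {
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["(", "E", ")"], ["a"]],
}


def derive_string(start="E", max_random_steps=20, rng=None):
    """Derive one sentence by repeatedly replacing a nonterminal."""
    rng = rng or random.Random()
    string = [start]
    steps = 0
    # REPEAT ... UNTIL String contains only terminals.
    while any(sym in GRAMMAR for sym in string):
        # Choose any nonterminal in String.
        positions = [i for i, sym in enumerate(string) if sym in GRAMMAR]
        i = rng.choice(positions)
        # Find the productions with this nonterminal on the left-hand side.
        options = GRAMMAR[string[i]]
        if steps < max_random_steps:
            rhs = rng.choice(options)
        else:
            # Budget spent: pick the option with the fewest nonterminals.
            rhs = min(options, key=lambda alt: sum(s in GRAMMAR for s in alt))
        # Replace the nonterminal with the chosen right-hand side.
        string[i:i + 1] = rhs
        steps += 1
    return " ".join(string)


print(derive_string(rng=random.Random(0)))
```

Each call yields some sentence of the language; which one depends on the choices made, which is exactly the freedom the algorithm leaves open.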

Extended BNF

Recursive procedures in programming can be rewritten using iteration (and a stack). Similarly, we can rewrite recursive productions using iteration. Braces, { }, are often used to represent 0 or more occurrences of their contents, while brackets, [ ], enclose optional items. Thus, using extended BNF, we can write the Expression grammar:

E -- > T {+ T}

T -- > F {* F}

F -- > (E) | a

The first rule derives the sentential forms T, T + T, T + T + T, etc.

Of course, since F can derive an E in F -- > (E), this grammar is still (indirectly) recursive.
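One reason extended BNF is convenient is that each rule maps directly onto a procedure: braces become a loop, and the remaining recursion through (E) becomes a recursive call. The sketch below is one possible recognizer, assuming the first rule is E -- > T {+ T} (as the sentential forms T, T + T, ... imply) and single-character tokens; the function names are illustrative.

```python
def accepts(text):
    """Return True if text can be derived from E in the extended BNF grammar."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(token):
        nonlocal pos
        if peek() != token:
            raise SyntaxError(f"expected {token!r} at position {pos}")
        pos += 1

    def parse_e():              # E -- > T {+ T}
        parse_t()
        while peek() == "+":    # the braces become a loop
            expect("+")
            parse_t()

    def parse_t():              # T -- > F {* F}
        parse_f()
        while peek() == "*":
            expect("*")
            parse_f()

    def parse_f():              # F -- > (E) | a
        if peek() == "(":
            expect("(")
            parse_e()           # the indirect recursion remains
            expect(")")
        else:
            expect("a")

    try:
        parse_e()
        return pos == len(tokens)   # all input must be consumed
    except SyntaxError:
        return False


print(accepts("a + a * a"))
print(accepts("(a + a) * a"))
print(accepts("a +"))
```

The original left-recursive rule E -- > E + T could not be coded this way, since a procedure for E that begins by calling itself would never return.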

2.2 Ambiguity

If an English sentence has more than one meaning, it is said to be ambiguous. Often such sentences can be parsed in more than one way. The sentence,

Time flies like an arrow

can be interpreted with time as a noun, flies as a verb and like an arrow as an adverbial phrase. This interpretation is a comment on the fast passage of time. However, if time is interpreted as an adjective, flies as a noun, like as a verb, and arrow as a direct object noun, the sentence becomes a comment on the love life of some species called a time fly. There are other interpretations of this sentence. (See Exercise 4.)

Similarly, meaning is assigned to programming language constructs based on their syntax. We prefer, therefore, that programming language grammars describe programs unambiguously.

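A sentence is ambiguous if it has more than one distinct left-most derivation (two derivations that apply the same productions in a different order do not count as distinct). If a sentence is ambiguous, then the parse tree is not unique; we can create more than one parse tree for the same sentence.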

A grammar is ambiguous if it can generate even one ambiguous sentence.

Example 6

Ambiguous Grammar for Expressions

Consider the following version of the Expression grammar:

E -- > E + E

E -- > E * E

E -- > (E)

E -- > a

and input a + a + a. We can find the following left-most derivations:
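The left-most derivations can also be found by exhaustive search. The sketch below assumes the grammar includes the production E -- > E + E, which the input a + a + a requires, and prunes any sentential form that is longer than the input or whose leading terminals disagree with it. The function name leftmost_derivations is illustrative.

```python
# Ambiguous expression grammar (E -- > E + E is assumed, see the lead-in).
GRAMMAR = {"E": [["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"], ["a"]]}


def leftmost_derivations(start, target):
    """Return every left-most derivation of target as a list of sentential forms."""
    results = []

    def expand(form, history):
        # Index of the left-most nonterminal, or None if the form is a sentence.
        idx = next((i for i, sym in enumerate(form) if sym in GRAMMAR), None)
        if idx is None:
            if form == target:
                results.append(history)
            return
        # No production shrinks a form, and the terminal prefix is final.
        if len(form) > len(target) or form[:idx] != target[:idx]:
            return
        for rhs in GRAMMAR[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            expand(new_form, history + [new_form])

    expand([start], [[start]])
    return results


derivations = leftmost_derivations("E", "a + a + a".split())
for derivation in derivations:
    print(" => ".join(" ".join(form) for form in derivation))
```

The search finds two left-most derivations: one expands the first E of E + E into E + E again, grouping the input as (a + a) + a, while the other replaces it with a immediately, grouping the input as a + (a + a).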

Example 7

Consider the following grammar for IF-THEN-ELSE statements:


        S -- > IF b THEN S ELSE S
             | IF b THEN S
             | a

where b represents a boolean condition and a represents some other statements. Then

  IF b THEN IF b THEN a ELSE a

has two parse trees.

The parse that associates ELSE with the closest IF is considered the correct one.

We can rewrite this grammar to be unambiguous:


 
 
 S1 -- > IF b THEN S1 | IF b THEN S2 ELSE S1 | a
 S2 -- > IF b THEN S2 ELSE S2 | a

Then,

  IF b THEN IF b THEN a ELSE a

has only one parse.
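Both claims can be checked by counting parse trees. The sketch below is an illustration rather than a practical parser: for each symbol and each stretch of the input it counts the ways the symbol can derive that stretch. It assumes a grammar with no empty right-hand sides and no left recursion, which holds for both IF grammars; the name count_parses is hypothetical.

```python
from functools import lru_cache

AMBIGUOUS = {
    "S": [["IF", "b", "THEN", "S", "ELSE", "S"], ["IF", "b", "THEN", "S"], ["a"]],
}
UNAMBIGUOUS = {
    "S1": [["IF", "b", "THEN", "S1"], ["IF", "b", "THEN", "S2", "ELSE", "S1"], ["a"]],
    "S2": [["IF", "b", "THEN", "S2", "ELSE", "S2"], ["a"]],
}


def count_parses(grammar, start, tokens):
    """Count the parse trees of tokens with root start."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        # Number of ways symbol derives tokens[i:j].
        if symbol not in grammar:      # terminal matches exactly one token
            return 1 if j - i == 1 and tokens[i] == symbol else 0
        return sum(count_seq(tuple(rhs), i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def count_seq(symbols, i, j):
        # Number of ways the symbol sequence derives tokens[i:j].
        if not symbols:
            return 1 if i == j else 0
        head, rest = symbols[0], symbols[1:]
        total = 0
        # head takes tokens[i:k]; each symbol in rest needs at least one token.
        for k in range(i + 1, j - len(rest) + 1):
            ways = count_symbol(head, i, k)
            if ways:
                total += ways * count_seq(rest, k, j)
        return total

    return count_symbol(start, 0, len(tokens))


statement = "IF b THEN IF b THEN a ELSE a".split()
print(count_parses(AMBIGUOUS, "S", statement))
print(count_parses(UNAMBIGUOUS, "S1", statement))
```

In the rewritten grammar, S2 derives only statements in which every IF has an ELSE, so the statement placed between THEN and ELSE can never capture an ELSE that belongs to an outer IF.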

2.3 Summary

Languages are described using grammars. In particular, the syntax of a programming language is described using context-free grammars and written down using Backus-Naur form (BNF).

A string may be derived from its grammar. Parsing is the process of "discovering" a derivation, given an input string and a grammar.