RAGS - Survey of Grammar Models

This page is Copyright John N. Shutt 1996. Here's what you're allowed to do with it.
Last modified: 06-Nov-07.

Survey of Grammar Models

Chomsky grammars

A Chomsky grammar consists of a vocabulary of symbols and a set of rewrite rules. The vocabulary is partitioned into terminal symbols and nonterminal symbols, and one of the nonterminals is designated as the start symbol. A rewrite rule has the form a b, where a and b are strings of symbols, and a contains at least one nonterminal.

Whenever the left side of a rule occurs in a symbol string, the string may be rewritten by replacing that occurrence with the right side of the rule. The notation a b signifies that string a can be transformed into string b by repeatedly rewriting it via the rules of the grammar. (Call the grammar G. It is assumed here, for simplicity of notation, that only one grammar is under consideration.) b is said to be derived from a under G, and binary relation is called the derivation relation for G. The set of all terminal strings that can be derived from the start symbol of G is called the language generated by G.

Several interesting classes of Chomsky grammars are formed by placing restrictions on the form of rules. For purposes of this thesis, the most important such class is that of context-free grammars, or CFGs. A CFG is a Chomsky grammar such that the left side of every rule has length one. Thus, each context-free rewrite action expands a nonterminal symbol into a string of symbols. This uncomplicated expansion of nonterminals is easy to grasp intuitively, and lends itself readily to depiction via a parse tree. Because of these conceptual advantages, CFGs are very commonly used in the description of programming languages, even though the class of context-free languages (i.e., languages generated by CFGs) excludes some important programming language features, such as lexical scope and static typing.

There is a remarkably close analogy between Chomsky grammars and Turing machines, if the latter are considered, not as a computational model per se, but as a grammar model. Developing this analogy provides broad insights into the concept of grammatical derivation, and serves as good preparation for some of the more elaborate grammar models discussed later in the survey.

A Turing machine consists of a finite-state controller, and an infinite storage tape accessed through a read/write head. The controller determines what machine actions are possible from a given machine configuration (combination of state, tape contents, and head position), much as the rule set of a Chomsky grammar determines what rewrites are possible from a given symbol string. The overall behavior of the machine is described by a computation relation that transforms machine configurations by multiple actions, just as a Chomsky derivation relation transforms strings by multiple rewrites. There is a designated internal halt state from which no further action is possible, just as a Chomsky grammar cannot further rewrite a terminal string.

The analogy extends to subclasses of Turing machines as well. To each of the commonly used subclasses of Chomsky grammars, there is a standard automaton model of equal power; and although it is not usually done, each of these automaton models can be formulated as a subclass of Turing machines, by restricting the kinds of actions that can be permitted by the controller — just as the corresponding classes of grammars restrict the kinds of rewrites that can be permitted by the rule set.

Attribute grammars

In order to reconcile the desire to use CFGs with the need to rigorously define context-dependent programming language features, various extensions of the CFG model have been developed. Most of these extensions retain a CFG kernel, and augment it with a distinct facility that handles context-dependence. The most popular such extension is Knuth's attribute grammars.

As originally envisioned by Knuth, an attribute grammar consists of a CFG and an attribute system. Terminal strings are processed in two stages. First, the CFG is used to parse the terminal string. Then, the nodes in the parse tree are decorated with attribute name/value pairs. The attribute system associates with each CFG rule a group of formulae called semantic rules, which specify the dependencies between the attributes of a parent node and the attributes of its children when the parent's nonterminal is expanded via that CFG rule.

Although more complicated than CFG derivation, this arrangement is still moderately intuitive, because context-dependent information is distributed only by propagation through the links of the context-free parse tree, which can be visualized.

A distinction is made between inherited attributes, whose values propagate down the parse tree, and synthesized attributes, whose values propagate up the parse tree. Knuth identified this distinction as key to his development of the model. However, a number of similar grammar models have been developed in which the distinction is downplayed or eliminated altogether. A feature common to such models is that they integrate attribute evaluation into the derivation relation, rather than treating it as a separate process as Knuth did.

One such model is extended attribute grammars. Here, each nonterminal occurrence is accompanied by its attribute values, listed in some predetermined order, with arrows up or down ( or ) indicating which attributes are synthesized and which inherited. The set of rules of a CFG are replaced by a set of rule forms, in which polynomial expressions fill the attribute positions. The rewrite rules available for derivation are constructed by making every possible substitution of values for variables in the rule forms.

The distinction between inherited and synthesized attributes in an extended attribute grammar has no technical significance. There is no longer any definite direction of computation; derivations that use the wrong attribute values will simply reach an impasse without deriving a terminal string. This trend is carried further by definite clause grammars, which are a closely related grammar model based on principles from logic programming. In a definite clause grammar, context-dependent language features are described by the explicit imposition of logical constraints on a CFG; thus, definite clause grammars reject the inheritance/synthesis distinction even in concept.

A recent addition to the family of attribute grammar variant models is higher-order attribute grammars. This model introduces special nonterminal attributes, which appear as nonterminals in the context-free rules, but are not expanded during parsing; instead, they are evaluated as attributes whose values are new parse-tree branches. These branches are then attached to the existing parse tree, and the attribute evaluation process is expanded to include the attributes of the nodes in the new branches. This evaluation of attributes created by evaluation of attributes may be viewed as a form of recursion.

Other nonadaptable grammars

W-grammars, sometimes called two-level grammars, were developed by Aad van Wijngaarden a few years before Knuth's work. The first level of a W-grammar is a context-free meta-grammar, whose "nonterminals" are called meta-variables, and whose "terminals" are symbols of the second level. The second level is defined by a set of rule schemata, which are context-free rules that may contain embedded meta-variables. The rewrite rules available for derivation are constructed by making every possible substitution of symbols for meta-variables in the rule schemata.

The meta-variables and rule schemata of a W-grammar are analogous to the variables and rule forms of an extended attribute grammar. However, W-grammars are more complicated in concept because they entail three fundamentally distinct kinds of derivation: meta-variables to symbols, rule schemata to rules, and nonterminals to terminals.

There are also many other grammar models that are less commonly used for programming language definition. One large family of grammar models increase the power of CFGs by placing restrictions on the order in which the context-free rules can be applied during derivation; a drawback of this approach is that the context-dependent factors are invisible in the parse tree. Indexed grammars are another less commonly used model, in which the application of otherwise context-free rules can be regulated by means of index lists, which are attached to the nonterminals much like inherited attributes in an extended attribute grammar. The use of index lists is reminiscent of, though less versatile than, the language attribute of Christiansen grammars (see below).

Adaptable grammars

One way of understanding declarations in a program is to view them as adding new grammatical constructs to the language. Based on this observation, some researchers have been led to the notion of an adaptable grammar model, i.e., a model in which the rule set of a grammar can vary during parsing. More precisely, in this survey a grammar model is considered adaptable iff it provides for the explicit manipulation of rule sets.

There are two major classes of adaptable grammar models, depending on whether rule sets are "manipulated" by imperative or declarative means. Most adaptable grammar models use the imperative approach. Three such imperative models are surveyed in this thesis. However, imperative grammar adaptation has an inherent weakness. The only purpose served by the rule set at any given moment is to determine the outcome of the next decision to be made by the parser. Therefore, the grammar designer must know exactly what decisions will be made by the parser, and in what order; otherwise, there is no way of knowing what adaptations to make. But this means that the model is dependent on the parsing algorithm, which greatly increases the conceptual complexity of the model.

Although several apparently declarative adaptable models were encountered while researching the survey, the only one with a fully developed formalism is that first proposed by Henning Christiansen in the mid-1980s, called here Christiansen grammars. As presented in his earlier works, Christiansen's model is a modified form of extended attribute grammars. In his later works, he uses an equivalent formulation based on definite clause grammars. The earlier form is used here, because it more nearly resembles the notation developed for recursive adaptable grammars in Part II of this thesis, and because the inheritance/synthesis distinction provides useful insights into both models.

Structurally, a Christiansen grammar is just an extended attribute grammar such that the first attribute of every nonterminal is inherited and has as its value a Christiansen grammar; this special attribute is called the language attribute. Whenever a nonterminal is expanded during derivation, the rewrite rule is drawn from the language attribute of the nonterminal. The only significance of the initial Christiansen grammar is that it becomes the language attribute of the root node of the parse tree.

Observe that each language attribute value, and therefore each grammar adaptation, in a Christiansen grammar is localized to a particular branch of the parse tree. A fringe benefit of this approach is that block-structured scope is supported gracefully, which is generally not true in imperative adaptable grammars.

The decision to make the language attribute inherited rather than synthesized is not at all arbitrary. In any parse tree, the language attribute of a nonterminal says nothing — at least, nothing remotely finite — about the ancestors or siblings of the node; on the contrary, it is entirely concerned in a clearly delineated way with constraining the relation between the node and its children.