Introduction

The information stored in the document type definition, from now on called the DTD, must be transferred to the document parser. This is necessary because the DTD and the document are parsed separately. Information about the structure of the document, the element rules (<!element ...>), is translated into LLgen-rules. Information about the attributes, entities and so on is stored in several files, and those files are included in the document parser.

LLgen

LLgen is a parser-generator. Its input is a grammar, which consists of tokens (terminal symbols) and production-rules (non-terminals). A typical LLgen grammar is:

%token b,c;
%start a_pars, a;

a :	b c d ;
d :	b c ;
In the example `b' and `c' are declared as tokens and `a' and `d' are production-rules. The grammar-rule `a : b c d ;' denotes that first the symbol `b' must be encountered in the input stream, followed by the symbol `c', followed by the symbols denoted by the right-hand side of grammar-rule `d'. The start symbol of the grammar is `a'; the name of the parser that recognizes the input is ``a_pars''. The input recognized by ``a_pars'' is `bcbc'.

When a terminal or non-terminal on the right-hand side is followed by a question mark (`?'), it is marked as optional.

%token b,c;
%start a_pars, a;

a :	b c? d ;
d :	b c ;
The input recognized now consists of the strings `bcbc' and `bbc'. A terminal or non-terminal on the right-hand side can also be followed by a star (`*') or a plus (`+'). The star means repeat zero or more times and the plus means repeat one or more times. LLgen uses square brackets (`[', `]') to group constructs together. Alternations in LLgen are denoted by a `|'. For example, `a | b' means `a' or `b'. For more information on LLgen, see [1].

Element Declarations

Translating the DTD into LLgen-rules is relatively straightforward. For every starttag (and endtag) of an element a terminal symbol is introduced that must precede (follow) that element. When the starttag can be omitted, a question mark follows the terminal symbol denoting the starttag. However, the omission of the starttag is not simply solved by just adding a question mark after the starttag, because the starttag can only be omitted when the element is contextually required (see Standard, page 8, def. 4.61). This implies that not every optional starttag can be omitted. For example:

<!element a O - (b?, c)? >
<!element b O - (d) >
<!element c O - (d) >
<!element d O - CDATA >
The starttag of `b' cannot be omitted because the element is not contextually required. The starttag of `c' can only be omitted when element `b' has occurred, since then the optional sequence-group is no longer optional and `c' becomes a contextually required element. But if `b' has not occurred, the starttag of `c' is required, since `c' is still an optional element. Because the omission of the starttag of `c' is determined by the occurrence of `b', the problem cannot be solved statically during the DTD parsing, but must be solved dynamically in the document parser. The generated code for the above example DTD is:
a : [starttag_b b]? [%if (starttag_b has occurred) [starttag_c? | starttag_c] c] ;
b : starttag_d d endtag_b ;
c : starttag_d d endtag_c ;
d : cdata endtag_d ;
The `%if' denotes an LLgen conflict resolver. The condition associated with the conflict resolver is only evaluated when a conflict arises, i.e. when the parser does not know which alternative to choose because both alternatives begin with the same token. In this case both alternatives begin with starttag_c, so the first alternative is chosen when element `b' has occurred. Note that according to the DTD the starttag of `d' may be omitted, but in fact it may not be omitted because the content is declared content (see Standard page 22, 7.3.1.1 and page 40, 11.2.3).
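The dynamic check behind the `%if' can be pictured as a tiny piece of state in the document parser. The following C sketch is illustrative only (the names `seen_b', `parsed_element_b' and `starttag_c_omissible' are invented for this example, not taken from the generated code): the parser records that `b' has occurred, and the condition of the conflict resolver simply inspects that flag.

```c
static int seen_b = 0;                /* set when element `b' is parsed */

void parsed_element_b(void) { seen_b = 1; }

/* Condition evaluated by the conflict resolver: the starttag of `c'
 * may be omitted only when `b' has occurred, because only then is `c'
 * contextually required. */
int starttag_c_omissible(void) { return seen_b; }
```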

The endtag may be omitted in most cases, but the omission can only be determined during parsing. So all the rules are generated with a required endtag. If the endtag may be omitted and the user does omit it, the parser itself adds the missing endtag during error recovery and resumes parsing. If the endtag is omitted where that is not allowed, the parser does not insert the missing endtag but instead issues an error message (see also the chapter on Endtag Omission).

The sequence-group in a content model is translated directly into LLgen-code. Between two content tokens, a separator is generated. This separator handles the parsing of newlines, spaces, tabs, comments, shortrefs etc. between two consecutive tokens. If the content model is mixed content, i.e. the content model contains the keyword PCDATA or ANY, spaces, tabs and newlines are part of PCDATA and not of the separator (see Standard section 7.6, page 24).

<!element a		O O	(b, c) >
<!element b		O O	(#PCDATA, d) >
<!element (c,d)		O O	CDATA >

Generated LLgen-code:

document :	separator2 [starttag_a a] separator2 ;
a :	separator2 [starttag_b b] separator2 [starttag_c c] separator2 [endtag_a] ;
b :	separator1 [pcdata] separator1 [starttag_d d] separator1 [endtag_b] ;
c :	[cdata] [endtag_c] ;
d :	[cdata] [endtag_d] ;
The non-terminal `separator1' parses comments, shortrefs, entity-references, marked-sections, usemaps, character references and processing instructions. This is `other-content' according to the Standard (page 24). `separator2' parses the same constructs as `separator1' but also spaces, tabs and newlines.

The or-group in the content model is translated directly into LLgen-code; only the order in which the alternatives are generated may differ. If there is an empty alternative, i.e. an alternative that can be skipped without being filled in, that one is generated first. For example:

<!element a		- - (b|c|d?) >
<!element (b,c,d)	- - CDATA >

Generated code:

document : [starttag_a a] ;
a :	[%if (conflict, choose first alternative) [starttag_d d]? | [starttag_b b]  | [starttag_c c] ] ;
b :	cdata endtag_b ;
c :	cdata endtag_c ;
d :	cdata endtag_d ;
In this case no conflict occurs, but when several of the alternatives are empty, the parser does not know which empty alternative to choose. The conflict resolver forces the parser to choose the first one and the conflict is resolved.

To parse an and-group some extra rules are introduced. This is necessary because this construct is not known in LLgen: it must be simulated. The and-group `a & b' means `a' followed by `b', or `b' followed by `a'. A possible LLgen-rule might be:

and-group :	[ a b ] | [ b a ] ;


Such a rule does not create an ambiguity for two members. However, when the number of members becomes three or more, this approach is ambiguous:
and-group :	[ a b c ] | [ a c b ] | [ b a c ] | [ b c a ] | [ c a b ] | [ c b a ] ;
On encountering starttag_a, the parser does not know whether it must choose the clause [ a b c ] or [ a c b ]. This can only be determined when the parser looks ahead. But look-ahead is not permitted in LLgen so we have to choose another solution. Instead of one rule, four rules are generated that avoid the ambiguity.
and-group :	[ a and2_3 ] | [ b and1_3 ] | [ c and1_2 ] ;
and1_2 :	[ a b ] | [ b a ] ;
and1_3 :	[ a c ] | [ c a ] ;
and2_3 :	[ b c ] | [ c b ] ;
When the parser encounters starttag_a, `a' is parsed first and next the rule for `and2_3' is parsed. Here the parser chooses [ b c ] on encountering starttag_b; otherwise the other alternative is chosen. Every alternative except the last one in the or-group is preceded by a conflict resolver, so that only the alternative whose starttag is really encountered is chosen. This is needed because several of the alternatives may produce empty.
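The language denoted by the and-group `a & b & c' is exactly the set of permutations of its members. The generated rules above recognize this statically; the C sketch below is merely a dynamic cross-check of the same language, not what the generator emits (the single-character tokens are invented for illustration).

```c
#include <string.h>

/* tokens 'a', 'b', 'c' stand for the starttags of the three members */
int matches_and_group(const char *s)
{
    int seen[3] = { 0, 0, 0 };
    if (strlen(s) != 3)
        return 0;
    for (int i = 0; i < 3; i++) {
        int k = s[i] - 'a';
        if (k < 0 || k > 2 || seen[k])
            return 0;            /* unknown token or repeated member */
        seen[k] = 1;
    }
    return 1;                    /* every member occurred exactly once */
}
```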

Inclusions

The production-rules for the inclusion elements are part of the generated grammar. However, a call to invoke the production-rule is not generated in the rule itself where the inclusion may occur, because this would produce great overhead. Let `Q' be a generic identifier or group in a content model and `x' the occurrence indicator, i.e. `x' is REP, OPT or PLUS, or empty if there is none. Let ``R1'' through ``Rn'' be the applicable inclusions; then a token

Qx
is treated as though it were: (R1 | ... | Rn)*, (Q, (R1 | ... | Rn)*)x

This approach is rather cumbersome, so we chose another method. In LLgen it is possible to write one's own error-recovery routine, i.e. on encountering an error LLgen calls the user's error-routine to handle the error. If the error-routine does not take any action, the error recovery of LLgen takes over. We use this feature for inclusions. When an inclusion is encountered, i.e. the starttag of the inclusion does not fit in the production-rule being parsed, our error-routine is automatically called by LLgen. The error-routine checks whether the starttag is the starttag of an applicable inclusion, and if so the inclusion is parsed.

To parse the inclusion, we have to indicate in the grammar that the rule for the inclusion is a possible start element of the grammar. Note that the parser for the inclusion must be linked with the corresponding starttag; this is done via the generated file ``incl.i''. The file contains an array of structures with two fields. For example:

<!element a		- -	(b, c, d)	+(b) >
<!element (b,c,d)	- -	CDATA	>

The generated structure on ``incl.i'':

Inclfunc incl_func[] = {
	{ starttag_b, f_B },
	{ 0, 0 }
};
The first field contains the starttag with which the inclusion must begin; the second field holds a pointer to a function which invokes the parser for the inclusion. To make it possible to refer to such a function, an external reference is generated on the file ``incl.ext''. The last structure entry with zeroes is added to indicate the end of the structure information.

After parsing the inclusion the LLgen parser expects the end-of-file token. However, we know at this moment that an end-of-file token is expected, so we put it on the input stream and the inclusion is terminated. The document parser can resume with the rest of the input in the state it was in before the inclusion occurred. There is no restriction on the number of times the inclusion may appear. The file ``incl.i'' is included in ``incl.c''.
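The lookup performed by the error-routine can be sketched as a linear scan of the generated table. The C fragment below is a simplified illustration: the type and table mirror ``incl.i'' above, but the token value and the stand-in sub-parser `f_B' are invented for this example.

```c
#include <stddef.h>

#define STARTTAG_B 1              /* illustrative token code */

typedef struct {
    int  starttag;                /* starttag the inclusion begins with   */
    void (*parse)(void);          /* invokes the parser for the inclusion */
} Inclfunc;

static int b_parsed = 0;
static void f_B(void) { b_parsed = 1; }   /* stand-in sub-parser */

static Inclfunc incl_func[] = {
    { STARTTAG_B, f_B },
    { 0, 0 }                      /* zero entry terminates the table */
};

/* Returns 1 when the wrong token was an applicable inclusion and its
 * sub-parser has been run; 0 when ordinary error recovery must act. */
int try_inclusion(int wrong_token)
{
    for (Inclfunc *p = incl_func; p->starttag != 0; p++)
        if (p->starttag == wrong_token) {
            p->parse();
            return 1;
        }
    return 0;
}
```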

Endtag Omission

The endtag for the document element may be omitted when the permission is yes (`O'). If the permission is yes (`O') for an element, the endtag can be omitted when the element is followed either

(1) by the endtag of another open element, or
(2) by an element or SGML character that is not allowed in its content.

The above rules make it necessary to transfer information about the content of an element from the DTD parser to the document parser. The information is stored on the file ``taglist.i''. The information about the content consists of the starttags of the content tokens and the keywords CDATA, PCDATA or RCDATA. When the endtags are marked optional in the grammar, a lot of ambiguities arise. To prevent this, all the endtags are required. This causes an error if an endtag is omitted, and our error-routine (see Inclusions) is called. The error-routine checks if the required token is an endtag. If the wrong token (the one causing the error) is an endtag of an open element, rule 1 is applied and the endtag of the most recently opened element is pushed on the input stream. Otherwise, if the wrong token is not a starttag of an element that may appear in the content of the most recently opened element and the wrong token is not an SGML character that may appear in the content, rule 2 applies and the endtag is also pushed on the input stream. In all other cases an error really has occurred and a message is printed.
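The decision taken by the error-routine can be sketched with a taglist-style table. The C fragment below is illustrative only (the token codes and the simplified field layout are assumptions, modelled on ``taglist.i''): rule 1 fires when the wrong token is the endtag of another open element, rule 2 when the wrong token does not belong to the content of the most recently opened element.

```c
#include <stddef.h>

/* illustrative token codes */
#define ENDTAG_B    10
#define STARTTAG_C  20
#define STARTTAG_D  21

typedef struct {
    int endtag;        /* endtag code of the element                  */
    int omissible;     /* endtag omission permitted in the DTD ('O')  */
    int content[8];    /* tokens allowed in its content, 0-terminated */
} Ttaglist;

static Ttaglist taglist[] = {
    { ENDTAG_B, 1, { STARTTAG_C, STARTTAG_D, 0 } },
    { 0, 0, { 0 } }
};

static int in_content(const Ttaglist *t, int tok)
{
    for (int i = 0; t->content[i] != 0; i++)
        if (t->content[i] == tok)
            return 1;
    return 0;
}

/* May the endtag `open_endtag' of the most recently opened element be
 * inserted before `wrong_token'?  `wrong_is_open_endtag' is 1 when the
 * wrong token is the endtag of another open element (rule 1). */
int may_insert_endtag(int open_endtag, int wrong_token, int wrong_is_open_endtag)
{
    for (const Ttaglist *t = taglist; t->endtag != 0; t++)
        if (t->endtag == open_endtag) {
            if (!t->omissible)
                return 0;                       /* omission not permitted */
            if (wrong_is_open_endtag)
                return 1;                       /* rule 1 */
            return !in_content(t, wrong_token); /* rule 2 */
        }
    return 0;
}
```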

To get an idea of the generated file, consider the following example.

<!element a	O O	(b, d, c, #PCDATA) >
<!element b	O O	(c, d) >
<!element c	O O	CDATA >
<!element d	O O	RCDATA >

The generated file ``taglist.i'' :

ttaglist taglist[] = {
	endtag_a, true, { starttag_b, starttag_d, starttag_c, pcdata, 0 },
	endtag_b, true, { starttag_c, starttag_d, 0 },
	endtag_c, true, { cdata, 0 },
	endtag_d, true, { rcdata, 0 },
	0, 0, 0
};
Take for instance the following document.
<a>
<b>
<c>
This is element c
<d>
This is element d
<d>
This is element d in element a
<c>
This is element c in element a
This is PCDATA, impossible to determine whether this is
element c or pcdata
This is an incorrect document because the endtag of the element `b' may not be omitted according to rule 2. The elements `c' and `d' occur in the content of `b', so they cannot be used to determine the end of element `b'. Also the endtag of the second element `c' cannot be omitted, because there is no way to determine where element `c' ends and the PCDATA starts. The following document is a correct document.
<a>
<b>
<c>
This is element c
<d>
This is element d
</b>
<d>
This is element d in element a
<c>
This is element c in element a
</c>
This is PCDATA, it is now possible to determine whether
this is element c or pcdata
The file ``taglist.i'' is included in ``taglist.c''.

Attribute Declarations

The attribute definition list declaration is needed in the document parser to check whether the attribute values conform to the declared type and whether the starttag specifies attributes with declared names. For this purpose a list with all the attribute definitions is generated on the file ``att_par.i''. The file ``att_par.ext'' contains the extern declarations. Every attribute definition list is stored in a single structure. The structure has several fields corresponding to the attribute definition.

struct attdef_struct attr<nr>[] = {
{ name, declared value, name-token-group, default type, default value, current value },
...
{ 0, 0, 0, 0, 0, 0 }
};

The name attr<nr> denotes attr1, attr2, etc.

The name-token-group corresponds to the values for a notation attribute or to the name tokens of the declared value. The name-token-group is first declared and initialized. Take for example the following attribute definition list:

<!attlist element-one
a		CDATA		#IMPLIED
b		(c|d|e)		"c"	-- no declared value but a name-token group --
c		NAMES		"tinker tailor soldier"
>

The generated list on ``att\(rupar.i'' :

String range1[] = { "c", "d", "e", 0 };
struct attdef_struct attr1[] = {
	{ "a", CDATA,		0,		IMPLIED,		 0,	0 },
	{ "b", ENUMERATE,	range1,	UNDEFINED,	"c",	0 },
	{ "c", NAMES,		0,		UNDEFINED,	"tinker tailor soldier", 0},
	{ 0, 0, 0, 0, 0, 0 }
};

The generated external definition:

extern struct attdef_struct attr1[];
The file ``att_par.i'' is included in ``att_par.c''. The file with the extern definitions is included in another generated file with the name ``tags.i''.
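A possible shape of the attribute check, sketched in C against the list above. The enumeration constants and the routine `check_attribute' are invented for illustration; only the field layout follows the generated ``att_par.i''.

```c
#include <string.h>
#include <stddef.h>

#define ATT_CDATA      1          /* illustrative type codes */
#define ATT_ENUMERATE  2

struct attdef_struct {
    const char *name;             /* attribute name                 */
    int declared_value;           /* ATT_CDATA, ATT_ENUMERATE, ...  */
    const char **range;           /* name-token group, 0-terminated */
    int default_type;
    const char *default_value;
    const char *current_value;
};

static const char *range1[] = { "c", "d", "e", 0 };
static struct attdef_struct attr1[] = {
    { "a", ATT_CDATA,     0,      0, 0,   0 },
    { "b", ATT_ENUMERATE, range1, 0, "c", 0 },
    { 0, 0, 0, 0, 0, 0 }
};

/* Returns 1 when `value' is valid for attribute `name' of this list. */
int check_attribute(struct attdef_struct *list, const char *name, const char *value)
{
    for (struct attdef_struct *a = list; a->name != 0; a++) {
        if (strcmp(a->name, name) != 0)
            continue;
        if (a->declared_value != ATT_ENUMERATE)
            return 1;                         /* CDATA: any value */
        for (int i = 0; a->range[i] != 0; i++)
            if (strcmp(a->range[i], value) == 0)
                return 1;
        return 0;                             /* not in name-token group */
    }
    return 0;                                 /* undeclared attribute */
}
```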

Entity Declaration

Entity declarations can occur in the DTD but also in the document itself. The entities declared in the DTD must be known to the document parser, so they are stored on the file ``entity.i''. Not only the general entities must be stored but also the parameter entities, because those parameter entities can be used in the entity declarations in the document. The external identifiers which are used in the external entities are stored on this file too. Take for example the following input:

<!-- parameter entities -->
<!entity %	extern	PUBLIC "/usr/sylvia/test.i" >
<!entity %	ent		"This is a test" >

<!-- general entities -->
<!entity 	what		STARTTAG	"one" >
<!entity	who		ENDTAG		"one" >
<!entity	extern	PUBLIC "/usr/sylvia/test.i" >

The generated file looks like:

Extern_id	extern0 = { <type> , "/usr/sylvia/test.i", "" };
Entity	dtd_gen_entities[] = {
	{ "what", 24, "one", 0 },
	{ "who", 25, "one", 0 },
	{ "extern", 605, 0, &extern0 },
	{ 0, 0, 0, 0 }
};

Extern_id	extern1 = { <type> , "/usr/sylvia/test.i", "" };
Entity	dtd_par_entities[] = {
	{ "extern", 605, 0, &extern1 },
	{ "ent", 604, "This is a test", 0 },
	{ 0, 0, 0, 0 }
};

The <type> in the external identifier indicates whether it is a SUBDOC or NDATA entity. Note that SUBDOC entities are not allowed in basic SGML. The number 24 is an internal code indicating that the entity is a starttag entity, i.e. that the entity-text is expanded to `<one>'. Number 25 indicates that the entity is an endtag entity, i.e. the entity-text is `</one>'. Number 604 indicates that the entity-text must be expanded as it is, and number 605 indicates that the entity is an external entity. The associated external identifier is found in the fourth field of the Entity-structure. The file ``entity.i'' is included in ``entity.c''.
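Entity lookup at reference time then amounts to a linear search of the generated table. The C sketch below is illustrative; the structure layout follows ``entity.i'' above, while `find_entity' and the simplified field types are assumptions.

```c
#include <string.h>
#include <stddef.h>

typedef struct {
    const char *name;      /* entity name                           */
    int code;              /* internal code: 24/25/604/605 as above */
    const char *text;      /* entity text, when internal            */
    void *external;        /* Extern_id pointer, when external      */
} Entity;

static Entity dtd_gen_entities[] = {
    { "what", 24, "one", 0 },
    { "who",  25, "one", 0 },
    { 0, 0, 0, 0 }
};

/* linear search of the generated table; returns 0 when undeclared */
const Entity *find_entity(const Entity *tab, const char *name)
{
    for (const Entity *e = tab; e->name != 0; e++)
        if (strcmp(e->name, name) == 0)
            return e;
    return 0;
}
```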

Shortreference Map Declarations

The shortreference declarations must also be incorporated in the document parser. The declarations are stored on the file ``map.i''. Because the parser only handles basic SGML, the number of shortreferences is fixed: there are only 32 different shortreferences (see Standard: fig. 4, page 33). This makes it possible to generate an array of length 32 containing strings that denote the entity names. The empty map is always generated, even if it is not used in the DTD.

<!shortref	mapje	"'"	entity_1
				"("	entity_2
				")"	entity_3
>

Generated file ``map.i'':

Map	MAPEMPTY = { "EMPTY",
	{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
	0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 },
0 };
Map	MAPMAPJE = { "MAPJE",
	{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, entity_1, entity_2, entity_3,
	0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 },
0 };
P_Map	maps[] = { &MAPEMPTY, &MAPMAPJE, ILL_MAP };
The first field denotes the map name; the second field denotes the array of all the shortreferences, holding for each used delimiter the name of the entity to be expanded on encountering the shortreference. The third field is used to speed up the document parser. It will contain the set of the first characters of the used delimiters. In this case the set for `mapje' will consist of { "'", "(", ")" }. This information is used in the lexical analyzer; it is determined in the document parser. The file ``map.i'' is included in ``shortref.c''.
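Computing that first-character set can be sketched as follows in C. The delimiter table, the map representation and `has_first_char' are invented for illustration (only four of the 32 delimiter slots are shown); the real parser's representation of the set may differ.

```c
/* four of the 32 basic-SGML shortreference delimiters, indexed as in
 * the Map array; invented subset for illustration */
static const char *delims[] = { "'", "(", ")", "\t" };

/* slots of ``mapje'': the first three delimiters are mapped */
static const char *mapje[] = { "entity_1", "entity_2", "entity_3", 0 };

/* Is `c' the first character of some delimiter that is mapped? */
int has_first_char(const char **map, int n, char c)
{
    for (int i = 0; i < n; i++)
        if (map[i] != 0 && delims[i][0] == c)
            return 1;
    return 0;
}
```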

Capacity Points

The DTD is checked against overflow of the capacity points. For instance, for every entity NAMELEN points (with value 8 in basic SGML) are counted for `entcap'. The number of characters in the entity text is counted in `entchcap'. The resulting value must be less than the maximum value specified (35000). However, in the document the user may also specify entities and maps. The points for these entities must also be counted and added to the resulting points of the DTD. To transfer the values from the DTD parser to the document parser, the relevant subtotals (`entcap', `entchcap', `mapcap' and `totalcap') are stored on the file ``cappoint.i''. This file is included in the document parser and the values from the document entities and maps are added to the generated ones. The file looks like:

int totalcap = <integer>;
int entcap = <integer>;
int entchcap = <integer>;
int mapcap = <integer>;
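The bookkeeping on the document side might look like the following C sketch. The starting subtotals and the routine `add_document_entity' are invented for illustration; NAMELEN (8) and the limit (35000) are the basic-SGML values mentioned above.

```c
#define NAMELEN        8          /* points per entity in basic SGML */
#define TOTALCAP_LIMIT 35000      /* maximum capacity */

/* stand-ins for the subtotals generated on ``cappoint.i'' */
static int entcap   = 160;
static int entchcap = 400;
static int totalcap = 560;

/* Account for one entity declared in the document itself; returns the
 * new total, or -1 on capacity overflow. */
int add_document_entity(int text_length)
{
    entcap   += NAMELEN;          /* NAMELEN points per entity      */
    entchcap += text_length;      /* one point per entity-text char */
    totalcap += NAMELEN + text_length;
    return totalcap <= TOTALCAP_LIMIT ? totalcap : -1;
}
```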

Parser Information


The generated parser needs information on several subjects obtained during the generation process. These subjects are described below.

The document parser must give understandable error messages and not cryptic sentences like ``error in element 650, 234''. It is easy to give a number, because internally the system uses numbers instead of element names. The starttag and the endtag of an element are also represented internally by numbers. So a mapping must be defined from the starttag and endtag codes onto the element name.

Associated with the most recently opened element are inclusions, exclusions and shortreference maps. When an element is opened, the inclusions associated with the element become applicable (the same holds for the exclusions). The shortreference map associated with the element, if there is any, must then be used and the old one stored. When the endtag of the element is parsed, the inclusions and exclusions belonging to the element are no longer active and, if the element has an associated map, the previous map must be restored.

The attributes belonging to an element must also be known because in order to make the document complete, the attributes must be printed with their values.

All this information is stored on the file ``tags.i''. For example:

<!element a	- - (b, c) +(fig) >
<!element b	- - CDATA >
<!element c	- - (b) -(fig) >
<!element fig	- - EMPTY >
<!attlist a	attr1		CDATA  #REQUIRED >
<!shortref	map1		"'"	"ent" >
<!usemap	map1	a >
<!entity	ent	"This is an entity" >

The generated file ``tags.i'' contains:

int doc_start = starttag_a;

int incl1[] = { starttag_fig, 0 };
int excl1[] = { starttag_fig, 0 };

static Parsertable parserinfo = {
	{ "A", starttag_a, endtag_a, attr1, incl1, 0,     &MAPMAP1, 0, 0 },
	{ "B", starttag_b, endtag_b, 0,     0,     0,     0,        0, 0 },
	{ "C", starttag_c, endtag_c, 0,     0,     excl1, 0,        0, 0 },
	{ "FIG", starttag_fig, endtag_fig, 0, 0, 0, 0, 0, 0 },
	{ ILL_STRING, ILL_SYMBOL, ILL_SYMBOL, ILL_ATTR, ILL_EXCEP, ILL_EXCEP, ILL_MAP, 0, 0 }
};
The fourth field of `parserinfo' refers to the attribute definition list generated on the file ``att_par.ext''. The next two fields contain pointers to arrays filled with the starttags of the inclusion and exclusion elements, respectively. The seventh field refers to the associated shortreference map. The last two fields are filled in in the document parser and are used for the replacement values: the eighth field contains the replacement value for the starttag, the ninth field that for the endtag.
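The save-and-restore of the shortreference map can be sketched with a small stack, pushed on every starttag and popped on the matching endtag. The C fragment below is illustrative; the names, the string representation of maps and the fixed stack depth are assumptions.

```c
#include <string.h>

static const char *current_map = "EMPTY";   /* map active at top level */
static const char *saved[64];               /* stack of previous maps  */
static int depth = 0;

/* assoc_map is the element's associated map, or 0 when it has none */
void open_element(const char *assoc_map)
{
    saved[depth++] = current_map;           /* remember the active map  */
    if (assoc_map != 0)
        current_map = assoc_map;            /* element's map takes over */
}

void close_element(void)
{
    current_map = saved[--depth];           /* restore on the endtag */
}
```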

Complete Document

So far we have discussed the generated files. These are generated after the DTD is parsed. During parsing of the document, the document is made complete. The complete document differs from the input document in the following respects:

- all omitted starttags are added
- all omitted endtags are added
- all entity references are expanded. When the entity is an external entity, the Standard does not define the action to be taken. In the Amsterdam SGML Parser, the SYSTEM and PUBLIC identifier must denote a file. First the SYSTEM identifier is tried and if the named file does not exist the PUBLIC identifier is used. If neither file can be opened, the entity is ignored and an error message is given.
- all shortref delimiters are expanded to the associated entity
- all usemap declarations in the document are parsed and not echoed in the complete document
- all entity declarations at the beginning of the document are parsed and not echoed in the complete document
- all spaces, tabs, newlines and comments between two consecutive tags are removed. For example:
<a>    -- this is a comment --
<b>

Resolves to:
<a><b>
- all marked sections are executed, i.e. those with the keyword IGNORE are ignored and those with the keywords INCLUDE and TEMP are included. The marked section text is parsed.
- the attributes with their values are printed in the starttag. The values of implied attributes are not given. The value is surrounded by LIT (`"').
- the character references &#TAB, &#RE, &#RS and &#SPACE are expanded.
- all processing instructions are echoed but truncated to PILEN (240) characters.

The parsing continues until more than MAX_ERRORS (60) errors are found or the document is parsed completely. The document parser tries to do error recovery after encountering an error, but cannot always succeed. All the errors are written on the standard error output. The complete document is written on the standard output.