Generating LLgen code for parsing the document

Bool node_gi_startopt(node)
- check whether the starttag of node is optional
P_Node node;
int count_nr_ands(node)
- count number of elements of and-group node
P_Node node;
int number_of_names(elem)
- return the number of names in element elem
P_Element elem;

The function node_gi_startopt() returns TRUE if the starttag of the generic identifier node is optional; otherwise it returns FALSE.
The procedure count_nr_ands() counts the number of content tokens in an and-group. Every element structure may contain more than one element name; number_of_names() returns that number.
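Counting the content tokens of a group comes down to walking its children. The following is a minimal sketch, assuming a tree in which a group points to its first child and every child points to its next sibling; the Node type and count_children() are hypothetical and do not reflect the actual P_Node layout.

```c
#include <stddef.h>

/* Hypothetical node layout: not the generator's real P_Node structure. */
typedef struct Node {
    struct Node *child;    /* first content token of a group, or NULL */
    struct Node *sibling;  /* next content token in the same group, or NULL */
} Node;

/* Count the direct content tokens of group by following the sibling chain. */
int count_children(const Node *group)
{
    int count = 0;
    const Node *n;

    for (n = group->child; n != NULL; n = n->sibling)
        count++;
    return count;
}
```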

void set_flags(node, flag, context)
- set the flagnumbers for controlling starttag omission
P_Node node; int *flag; Bool *context;

In several cases it can only be determined during parsing, and not beforehand, whether a starttag may be omitted. This is the case, for instance, when a sequence-group is optional, its first content token is optional and the remaining content tokens are not optional. Whether the starttag of the second content token may be omitted then depends on the first token. If the first content token occurs, the second content token becomes required and its starttag may be omitted; otherwise the starttag may not be omitted.
.sp 1
.TS
l l l
l l l
l l l
c c c.
correct document.	correct document.	invalid document
starttag ~c~ omitted	~b~ does not occur,	~b~ does not occur
because ~b~ occurs.	starttag ~c~ must occur	starttag ~c~ omitted
_
.TE
[Example documents for the three cases, using the elements a, b and c: markup lost in extraction.]

In the LLgen code a boolean variable flag is generated. This flag gives the status of those elements whose starttag omission depends on the context. When the flag equals TRUE, the starttag may be omitted. The flag indicates that an element in a group has occurred, i.e. the group is no longer optional.
set_flags() calculates which content tokens are contextually required; the flag is used only when node is a sequence-group or a generic identifier. In an or-group or an and-group, starttags may not be omitted (Standard section 7.3.11), so the flag is not needed. After a contextually required content token, the flag always equals TRUE: a content token in an optional group has occurred, so all the other starttags in the group may now be omitted.
When a content token is optional and belongs to an optional sequence-group, a new flag variable is needed. All the content tokens in the sequence-group depend on the status of this new flag, which indicates whether or not the starttag may be omitted.
The parameter flag contains the number to be used for the next flag, so that all the flags used for a particular element are unique. context equals TRUE when the remaining siblings of node do not depend on the flag status.
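The role of the flag can be sketched in C. The following is a minimal illustration, assuming a content model of the form (b?, c)? as in the example above; GroupState, group_init(), token_occurred() and may_omit_starttag() are hypothetical names and are not part of the generator or of the code it emits.

```c
#include <stdbool.h>

/* Hypothetical model of one generated flag for an optional sequence-group. */
typedef struct {
    bool flag;  /* TRUE once a content token of the group has occurred */
} GroupState;

/* At the start of the group nothing has occurred yet. */
void group_init(GroupState *g)
{
    g->flag = false;
}

/* Record that a content token of the group (for instance b) occurred;
 * from now on the group is no longer optional. */
void token_occurred(GroupState *g)
{
    g->flag = true;
}

/* A contextually required token (for instance c) may drop its starttag
 * only when the flag is TRUE. */
bool may_omit_starttag(const GroupState *g)
{
    return g->flag;
}
```

With this model, a document in which b does not occur leaves the flag FALSE, so the starttag of c has to be present; once b has occurred, the starttag of c may be left out.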
Between two elements in the same content model, comments, newlines, spaces etc. are allowed, so a separator is generated between two elements. The kind of separator depends on the content model (Standard section 7.6): if the content model contains the keyword PCDATA, then spaces, newlines etc. are part of PCDATA and not of the separator. seperator is filled with information about the content model so that the right kind of separator is generated.

void code_gi(node)
- generate LLgen code for generic identifier node
P_Node node;

Generating code for a generic identifier means generating code for the starttag and for the generic identifier itself. The starttag can be omitted when it is marked optional. If the GI is contextually required, code is generated to test, depending on the status of a flag, whether or not the starttag may be omitted. If the element is optional, the starttag is always generated.

void code_any()
- generate LLgen code for the content model ANY
void code_key(key)
- generate LLgen code for SGML-keyword key
int key;

If the content model is ANY, all defined elements may occur at this place. code_any() generates an OR-group with all elements as members. The code for PCDATA is also added as a member of the OR-structure (Standard section 11.2.4).
The function code_key() generates code for CDATA, PCDATA, RCDATA and ANY. No code is generated for the keyword EMPTY, because the content of the element is empty.

void code_endbracket(nd, in_and)
- generate LLgen code for end of a group
P_Node nd; Bool in_and;
void code_startbracket(nd, null, in_and)
- generate LLgen code for start of a group
P_Node nd; Bool null; Bool in_and;
Bool change_or(node)
- change alternatives in or-group
P_Node node;

A generic identifier (GI) is grouped together, i.e. there are brackets around the group consisting of the starttag, the element and the endtag, so one can see that these belong together. The same goes for a sequence-, or- and and-group.
The function code_endbracket() generates the closing bracket for such a group. When the group is optional or repeated, the code for this option is also generated.
The function code_startbracket() generates the opening bracket for a group. For a repeated group, not only the opening bracket is generated but also a conflict resolver, whether it is needed or not. A conflict resolver is also generated when in_and is TRUE (the group is part of an and-group) and the group is marked optional. When an or-group has two empty alternatives, the parser does not know which one to choose, so a conflict resolver is generated to forestall the error. If the conflict resolver is not necessary, it is discarded by LLgen. The only problem is that the conflict resolver must occur with the first empty alternative, so it cannot simply be generated with the first alternative. change_or() therefore puts the first empty alternative of the or-group in front of the rest. An empty alternative then always appears as the first alternative, and the conflict resolver can be generated as the first part of the or-group.
Generated code:
a : [%prefer [[STARTTAG_B] B]?
    | [[STARTTAG_A] A]
    | [[STARTTAG_C] C]?
    ];
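The reordering performed by change_or() can be sketched as follows. This is a minimal illustration on an array of alternatives, not the generator's own tree manipulation; the Alternative type, its can_be_empty field and move_first_empty_to_front() are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one alternative of an or-group. */
typedef struct {
    const char *name;   /* element name, for illustration only */
    bool can_be_empty;  /* TRUE when the alternative may match nothing */
} Alternative;

/* Move the first alternative that can be empty to position 0 and keep the
 * relative order of all other alternatives. Returns true when such an
 * alternative was found, false when the array is left unchanged. */
bool move_first_empty_to_front(Alternative *alts, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (alts[i].can_be_empty) {
            Alternative empty = alts[i];

            /* Shift the preceding alternatives one place to the right. */
            for (; i > 0; i--)
                alts[i] = alts[i - 1];
            alts[0] = empty;
            return true;
        }
    }
    return false;
}
```

Applied to the alternatives A, B?, C? this gives the order B?, A, C?, which is the order of the generated code shown above.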

void code_content(node, in_and)
- generate LLgen code for a part of the content model
P_Node node; Bool in_and;
void empty_content(elem)
- generate LLgen code for an empty content model
P_Element elem;

The function code_content() generates code for node. All children of node, if any, are printed, together with the brackets that group the children. Between two consecutive content tokens of a sequence-group a separator is printed (see set_flags()). in_and denotes whether node is part of an and-group.
empty_content() generates code for declared content EMPTY. This code contains only a C action to generate the complete starttag; no LLgen code is generated.

void code_var(nr_var)
- generate LLgen code for the necessary variables
int nr_var;
void code_header()
- generate LLgen code for variables
void code_start()
- generate LLgen code for the used terminals

These functions generate code for the static part of the document parser, not for the parser itself. The function code_var() generates code for the variables needed for a specific element. The name of each variable is ``flag'' concatenated with a number. The variable gives the status with respect to the omission of starttags of contextually required content tokens. When the flag equals TRUE, the starttag may be omitted. The number of variables needed is given by nr_var. The variables are numbered flag1 ... flagN, where N is the value of nr_var.
The function code_start() generates all LLgen tokens needed to parse the document according to the document type definition. For every element a starttag and an endtag token-symbol are generated. The header part, which specifies the files that must be included and the function that must be called upon an error, is generated by code_header().

void generate_code(elem)
- generate LLgen code for one element
P_Element elem;
void generate(filename)
- generate LLgen code for the complete parser
String filename;

Each element in the DTD corresponds to a non-terminal in the generated parser. generate_code() generates code for the one non-terminal indicated by elem, and generates the actions needed to make the document complete.
The function generate() generates code for the complete document parser: the declaration of the terminals, the rules for every non-terminal and the actions associated with the non-terminals. This LLgen code is written to the file whose name is the second parameter with which ``generator'' is called, followed by the suffix `1'. If the number of generated code rules is greater than MAX_GRAMMAR_RULES, this file is closed and the rest is written to the file with the name of the second parameter followed by the suffix `2', etc.
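The splitting of the output over numbered files can be sketched as follows. This is only an illustration under stated assumptions: the limit is passed in as max_rules because the value of MAX_GRAMMAR_RULES is not given here, each file is assumed to hold exactly max_rules rules before the next one is started, and file_suffix() and output_name() are hypothetical helpers.

```c
#include <stdio.h>

/* Suffix (1, 2, ...) of the file that receives rule number rule_nr,
 * counting rules from 1 and assuming max_rules rules per file. */
int file_suffix(int rule_nr, int max_rules)
{
    return (rule_nr - 1) / max_rules + 1;
}

/* Build the output file name: the base name followed by the suffix.
 * Returns the length of the name, or -1 if buf is too small. */
int output_name(char *buf, size_t size, const char *base,
                int rule_nr, int max_rules)
{
    int len = snprintf(buf, size, "%s%d", base,
                       file_suffix(rule_nr, max_rules));

    if (len < 0 || (size_t)len >= size)
        return -1;
    return len;
}
```

With a limit of 100 rules per file and the hypothetical base name "parser", rule 250 ends up in the file parser3.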

void subset(start, nr, answer)
- take nr elements out of maximum
int start; int nr; P_Set answer;
void find_subsets(max)
- find all sets with n out of max
int max;
.EQ
delim $$
.EN

The function subset() finds all the $left ( N over nr right )$ different sets of nr elements out of N. Every set is built up in answer and, when the set is complete, it is copied into the internal variable subset_group. N is a constant with the value of max; it is set in find_subsets().
To find all sets with nr elements out of N, find_subsets() is called with the maximum value max for N. When the function is done, all sets are stored in the internal variable subset_group. find_subsets() is used for the generic code for and-groups.
.EQ
delim off
.EN
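The enumeration of all sets with nr elements out of N can be sketched recursively. The following is a minimal illustration and not the code of gen_code.c: the Set type, the limits MAX_N and MAX_SETS, the counter nr_subsets and the extra nr parameter of find_subsets() are assumptions made for the example.

```c
#define MAX_N    16   /* assumed upper bound for N */
#define MAX_SETS 256  /* assumed capacity of subset_group */

/* Hypothetical set representation: member numbers in increasing order. */
typedef struct {
    int members[MAX_N];
    int size;
} Set;

static Set subset_group[MAX_SETS];  /* all completed sets */
static int nr_subsets;              /* number of sets stored so far */
static int N;                       /* set by find_subsets() */

/* Extend answer with nr more members chosen from start..N. A completed
 * set is copied into subset_group; sets beyond MAX_SETS are dropped. */
static void subset(int start, int nr, Set *answer)
{
    int i;

    if (nr == 0) {
        if (nr_subsets < MAX_SETS)
            subset_group[nr_subsets++] = *answer;
        return;
    }
    /* Stop early enough to leave room for the remaining nr - 1 members. */
    for (i = start; i <= N - nr + 1; i++) {
        answer->members[answer->size++] = i;
        subset(i + 1, nr - 1, answer);
        answer->size--;
    }
}

/* Find all sets with nr members out of 1..max. Returns the number of
 * sets stored, or -1 when the arguments are out of range. */
int find_subsets(int max, int nr)
{
    Set answer = { {0}, 0 };

    if (max < 0 || max > MAX_N || nr < 0)
        return -1;
    N = max;
    nr_subsets = 0;
    subset(1, nr, &answer);
    return nr_subsets;
}
```

For max equal to 3 and nr equal to 2 this stores the three sets {1,2}, {1,3} and {2,3}, in that order.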

void code_and(node)
- generate name for an and-group
P_Node node;
void make_and(set, count, nr_ands)
- generate extra rules
P_Set set; int count; int nr_ands;
void code_and_content(nr_var)
- generate LLgen code for an and-group
int nr_var;

If the content model contains an and-group, a unique name is generated in place of the and-group and the node containing the and-group is stored for later processing, because the content model currently being processed must be completed first. code_and() takes care of storing the node and generating the unique name.
After the content model has been processed, the stored and-groups are processed. This is done by generating extra rules for LLgen (make_and()) and generating the code for each content token in an and-group (code_and_content()).
.sp 1
Generated code:
.sp 1
.TS
l l l l.
a	:	and1_1_2_3	;
and1_1_2_3	:	[%if (f_and1_1(token)) [and1_1 and1_2_3] | [%if (f_and1_2(token)) [and1_2 and1_1_3] | [and1_3 and1_1_2]]]	;
and1_1_2	:	[%if (f_and1_1(token)) [ and1_1 and1_2] | [and1_2 and1_1]]	;
and1_1_3	:	[%if (f_and1_1(token)) [ and1_1 and1_3] | [and1_3 and1_1]]	;
and1_2_3	:	[%if (f_and1_2(token)) [ and1_2 and1_3] | [and1_3 and1_2]]	;
and1_1	:	[ST_b] ....	;
and1_2	:	[ST_c] ,,,,	;
and1_3	:	[ST_d] ;;;;	;
.TE

The extra name is constructed as follows: it always starts with ``and'', followed by the number of the and-group. In this case this is the first and-group, so the number is 1. After the group number, the numbers of the content tokens used in this rule are listed. So and1_1_2 means the first and-group, in which only the first and second content tokens (b, c) appear, in any order. The %if is a conflict resolver, which is necessary when one of the content tokens resolves to empty. The f_and... is a procedure which returns TRUE if the token is the start of and... and FALSE otherwise.
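The naming scheme can be sketched in C. The following is a minimal illustration of the construction described above; and_rule_name() is a hypothetical helper and is not a function of gen_code.c.

```c
#include <stdio.h>

/* Build the name of an extra rule: "and", the number of the and-group,
 * and then "_" plus the number of every content token used in the rule.
 * Returns the length of the name, or -1 if buf is too small. */
int and_rule_name(char *buf, size_t size, int group_nr,
                  const int *tokens, int count)
{
    int i;
    int used = snprintf(buf, size, "and%d", group_nr);

    if (used < 0 || (size_t)used >= size)
        return -1;
    for (i = 0; i < count; i++) {
        size_t left = size - (size_t)used;
        int n = snprintf(buf + used, left, "_%d", tokens[i]);

        if (n < 0 || (size_t)n >= left)
            return -1;
        used += n;
    }
    return used;
}
```

Called with group number 1 and the token numbers 1 and 2, the function produces the name and1_1_2.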
gen_code.c gen_code.h