Technical documentation

Introduction

An SGML System Conforming to International Standard ISO 8879 -- Standard Generalized Markup Language
This is the introduction to the technical documentation of the Amsterdam SGML Parser. The Amsterdam SGML Parser accepts basic SGML as defined on page 53 of the Standard. Additional services are also supported (see page 58 of the Standard).
The validation services are:

GENERAL	YES
MODEL	YES
EXCLUDE	NO
CAPACITY	YES
NONSGML	YES
SGML	YES
FORMAL	NO

We will use the abbreviation DTD for document type definition. The following picture gives the structure of the Amsterdam SGML Parser.

[Figure: the DTD is the input of the generator, which produces LLgen code and C code; this code leads to the doc_parser, which takes a document as input and produces the complete document.]

The system consists of two different programs, the DTD-parser called generator and the document parser called doc_parser.

The generator takes a DTD as input.
It checks whether the DTD is correct.
If so, it generates several files containing LLgen code and several files containing C code. The second parameter of the generator determines the filenames of the generated LLgen code: each filename is constructed from the second parameter by appending a number and the suffix `.g'. For instance, if the second parameter is `document', the generated filenames are `document1.g', `document2.g', etc.
The LLgen rules in the LLgen file correspond to the element declarations in the DTD.
The C files contain information from the DTD that is needed in the document parser, e.g. attribute definitions, defined entities, etc.
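
The naming scheme for the generated LLgen files can be sketched in C as follows. This only illustrates the scheme described above and is not taken from the generator sources; the function name `make_llgen_name' and the way the buffer is handled are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the generator sources:
 * build "<base><number>.g" in buf.
 * Returns 0 on success, -1 if the name does not fit in buf. */
int make_llgen_name(char *buf, size_t size, const char *base, int number)
{
    int n;

    assert(buf != NULL && base != NULL);
    assert(strlen(base) > 0 && number > 0);
    n = snprintf(buf, size, "%s%d.g", base, number);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With the base `document' and the numbers 1 and 2 this yields `document1.g' and `document2.g', as in the example above.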

These LLgen and C files are, together with some other LLgen and C files, processed by LLgen and compiled by the C compiler. This produces an executable program doc_parser, which is the document parser for documents written according to the input DTD.
The sources consist of files with three different types of suffixes:

suffix	file type
c	C files
h	C header files
i	include files
gen	generic C files
gh	generic C header files
g	LLgen files

The C files and the C header files contain plain C code. The include files contain generated structures that are included in the program. The generic C and generic C header files implement generic datatypes. This is plain C code that is parameterized by using the C preprocessor.
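
How plain C code can be parameterized with the C preprocessor is sketched below. This is not the code from the `gen' and `gh' files, which is not shown in this document; the macro `DECLARE_STACK', the type names and the fixed capacity are assumptions chosen for illustration.

```c
#include <assert.h>

/* Hypothetical generic stack: the macro expands to a type and its
 * functions for one element type.  NAME, TYPE and MAX are the parameters. */
#define DECLARE_STACK(NAME, TYPE, MAX)                                    \
    typedef struct { TYPE items[MAX]; int top; } NAME;                    \
    void NAME##_init(NAME *s) { s->top = 0; }                             \
    int NAME##_push(NAME *s, TYPE x)                                      \
    {                                                                     \
        if (s->top == (MAX)) return -1;   /* stack is full */             \
        s->items[s->top++] = x;                                           \
        return 0;                                                         \
    }                                                                     \
    TYPE NAME##_pop(NAME *s)                                              \
    {                                                                     \
        assert(s->top > 0);               /* caller must not pop empty */ \
        return s->items[--s->top];                                        \
    }

/* One instantiation per element type. */
DECLARE_STACK(IntStack, int, 16)
DECLARE_STACK(CharStack, char, 16)
```

Each use of the macro produces a separate, fully typed set of functions, so the same source text serves several element types without any loss of type checking.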
The LLgen files correspond closely to the rules in the Standard. The grammar had to be LL(1), so some changes were needed. These changes include a different placement of ps, ts and ds, and differences in the parsing of keywords. See also the general description of the lexical analyser.
The sources for the Amsterdam SGML Parser are divided into three groups:
Sources used only in the generator.

ambigu.c	element.c	gen_tagl.c	notation.c
att_gen.c	empty.c	gen_tags.c	omitstrt.c
capacity.c	gen_code.c	generate.c	sgmlproc.c
context.c	gen_incl.c	node.c
attrib.g	dtd.g	elem.g	sgml.g

Sources used only in the doc_parser.

att_par.c	incl.c	rep_pars.c	taglist.c
doc_pars.c	myerror.c	replace.c	tags.c
elem_stk.c	out.c	startend.c
document*.g	rules.g

Sources used both in the generator and the doc_parser.

att_chk.c	group.c	modes.c	str_in.c
charclas.c	in.c	mode_stk.c	symtable.c
conc_syn.c	keywords.c	report.c	token_in.c
entity.c	Lpars.c	set.c	tools.c
file_in.c	lexical.c	shortref.c
comment.g	ent.g	marked.g	tokens.g
doc.g	extern.g	shortnot.g

The files that are used by both the generator and the doc_parser contain pieces of code that are used by only one of the parsers. This code is placed between `#ifdef' and `#endif' directives of the C preprocessor. Code used only by the generator is placed between `#ifdef GENERATOR' and `#endif'. Code used only by the doc_parser is placed between `#ifdef DOC_PARSER' and `#endif'. This means that the preprocessor symbol GENERATOR must be defined when compiling generator, and DOC_PARSER when compiling doc_parser.
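
A minimal sketch of such a shared file is given below. The routine `program_name' is hypothetical and does not occur in the sources. The fallback definition at the top exists only to let the fragment compile on its own; in a real build the symbol would be supplied from outside, for example with a compiler option such as `-DGENERATOR' (how the makefile of the distribution does this is not shown here).

```c
#include <assert.h>
#include <stddef.h>

/* For this sketch only: fall back to GENERATOR when neither symbol is
 * defined, so that the fragment compiles on its own. */
#if !defined(GENERATOR) && !defined(DOC_PARSER)
#define GENERATOR
#endif

/* Hypothetical routine in a file shared by both programs. */
const char *program_name(void)
{
    const char *name = NULL;

#ifdef GENERATOR
    name = "generator";      /* compiled only into the generator */
#endif
#ifdef DOC_PARSER
    name = "doc_parser";     /* compiled only into the doc_parser */
#endif
    assert(name != NULL);    /* one of the two symbols must be defined */
    return name;
}
```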
All the following files must be compiled anew when the DTD is changed, to yield a new document parser. This list also includes the C-files corresponding to the LLgen files.

att_par.c	incl.c	myerror.c	tags.c
conc_syn.c	keywords.c	shortref.c
doc_pars.c	lexical.c	startend.c
entity.c	modes.c	taglist.c
comment.c	document*.c	extern.c	rules.c
doc.c	ent.c	marked.c	shortnot.c

In the file `types.h' a preprocessor symbol DEBUG can be defined. If this symbol is defined, debugging output can be obtained by specifying one or more of the following flags when calling generator or doc_parser. Each flag turns on the debugging of one module.

a	debug ambigu.c, empty.c and omitstrt.c (only in generator)
c	debug rep_pars.c (only in doc_parser)
d	put debug information on stderr, instead of on file ``debug_info1''
e	debug entity.c, doc.g and extern.g
g	debug dtd.g (only in generator)
i	debug in.c
k	debug marked.g
l	debug lexical.c
m	debug myerror.c (only in doc_parser)
p	print all elements on a file (only in generator)
s	debug shortref.c and shortnot.g
t	debug att_chk.c (also rules.g in the doc_parser)

Defining DEBUG also activates the assertions in the program text that check internal consistency. It is advisable to define DEBUG during installation and the test phase.
The file `types.h' contains most of the type definitions. For example:

typedef struct node_struct  *P_Node;

This typedef defines a pointer to an opaque structure. The `struct' definition and the actual fields are known only inside the module that implements the structure (here: ``node.c''). For all other modules, the only way to handle a variable of type `P_Node' is to use the functions exported by the implementation module (here: through ``node.h''). This is how information hiding is accomplished. Information hiding makes the program easy to adapt and maintain, because most changes are local to one module.
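
The opaque-pointer technique can be sketched in a single file as follows. Only the typedef is taken from `types.h'; the functions `node_create', `node_value' and `node_delete' and the field `value' are hypothetical, since the real interface of ``node.h'' is not shown in this document.

```c
#include <assert.h>
#include <stdlib.h>

/* --- what other modules see (the role of types.h and node.h) --- */
typedef struct node_struct *P_Node;

P_Node node_create(int value);   /* hypothetical exported functions */
int    node_value(P_Node node);
void   node_delete(P_Node node);

/* --- what only the implementation module sees (the role of node.c) --- */
struct node_struct {
    int value;                   /* hypothetical field */
};

P_Node node_create(int value)
{
    P_Node node = malloc(sizeof *node);

    if (node != NULL)
        node->value = value;
    return node;                 /* NULL if no memory is available */
}

int node_value(P_Node node)
{
    assert(node != NULL);
    return node->value;
}

void node_delete(P_Node node)
{
    free(node);
}
```

A module that sees only the first part can store and pass `P_Node' values, but any attempt to access a field directly is rejected by the compiler, so the representation can be changed inside the implementation module alone.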
The document parser has two additional flags:

-r	The replacement file contains replacements for the starttags and endtags. Instead of the complete document, a document with, for instance, Troff code is generated.
-z "string of text"	The string of text is used in error messages: it is printed instead of the name of the file in which the error occurred. This is very helpful in batch systems where the filename is the same for all the documents.

Installing the Amsterdam SGML Parser.

The Amsterdam SGML Parser distribution consists of two directories. The directory Parser/Src contains the parser itself. The other directory, LLgen, contains the program LLgen. This program is used by the parser. LLgen is an LL(1) recursive descent parser generator. It is described in a separate document in the appendix.

Installing LLgen

In a Unix environment, LLgen should be built according to the instructions in the file `READ_ME' in the directory LLgen. In a non-Unix environment, this might not work. The file `machdep.c' contains most of the machine-dependent code of LLgen. The uses of the Unix calls `link' and `unlink' will probably have to be rewritten. If this is not possible, they can be removed, together with the code in `main.c' in which they are used. The purpose of this code is to stop LLgen from generating a new C file when the file has not changed since the previous call to LLgen. This helps the make program work faster.
Note, however, that no guarantee is given for LLgen to work under non-Unix systems.
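
The idea behind the code around `link' and `unlink', leaving an unchanged C file alone so that make does not recompile it, can also be expressed with standard C library calls only. The sketch below is not LLgen's implementation: it swaps in `remove' and `rename' for the Unix calls, and the function names are assumptions.

```c
#include <assert.h>
#include <stdio.h>

/* Return 1 if both files exist and have identical contents, else 0. */
static int same_contents(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb");
    FILE *fb = fopen(b, "rb");
    int ca, cb;
    int same = (fa != NULL && fb != NULL);

    while (same) {
        ca = getc(fa);
        cb = getc(fb);
        if (ca != cb)
            same = 0;
        else if (ca == EOF)
            break;               /* both files ended together */
    }
    if (fa != NULL) fclose(fa);
    if (fb != NULL) fclose(fb);
    return same;
}

/* Hypothetical helper: install the freshly generated file `fresh' as
 * `target' only if the contents differ, so that an unchanged target keeps
 * its old modification time.
 * Returns 1 if target was replaced, 0 if it was left alone, -1 on error. */
int install_if_changed(const char *fresh, const char *target)
{
    assert(fresh != NULL && target != NULL);
    if (same_contents(fresh, target))
        return remove(fresh) == 0 ? 0 : -1;
    remove(target);              /* fails harmlessly if target is absent */
    return rename(fresh, target) == 0 ? 1 : -1;
}
```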

Installing the Amsterdam SGML Parser

The source for the Amsterdam SGML Parser is found in the directory Parser/Src. This directory has two sub-directories, GEN and DOC. In GEN the C code from LLgen and the object code for the DTD parser are created. In DOC the C code from LLgen, the C code for the document parser and the object code for the document parser are created. On a Unix system the DTD parser can be generated by the command:

	make generator

from the Parser/Src directory. This creates an executable program `generator' in the directory GEN. When the generator has been built successfully, the document parser can be created by the commands:

	GEN/generator dtd_file document
	make doc_parser

NOTE: before the generator is executed, all the old document*.g files and the DOC/document*.o and DOC/document*.c files must be removed. If `dtd_file' contains a correct document type definition, the first command generates the LLgen files `document*.g' in the Parser/Src directory. Otherwise errors are displayed on the standard error output and the document*.g files are not generated. If no errors are displayed, the second command can be given. The second command generates C files, corresponding to the LLgen files, in the directory DOC; then all files are compiled and the executable program `doc_parser' is created in the directory DOC. `doc_parser' is a parser that accepts documents that are written according to the DTD in `dtd_file'. This program can be moved to any place in the system.
The program `doc_parser' takes as argument a file containing a document and delivers on the standard output the `complete document', including all start- and end-tags, expanded entities, etc. All error and warning messages are written to the standard error output. A typical call is:

	doc_parser file.doc >complete_file.doc 2>error_output

If the `file.doc' parameter is a minus `-', then the document will be taken from standard input.
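
The convention of reading from standard input when the argument is a minus sign can be sketched as below. The function `open_document' is hypothetical; it illustrates the convention and is not the code used in doc_parser.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open the document named on the command line,
 * treating a single minus sign as standard input. */
FILE *open_document(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "-") == 0)
        return stdin;
    return fopen(name, "r");     /* NULL if the file cannot be opened */
}
```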

Installing the Amsterdam SGML Parser on non-unix systems

On non-Unix systems the Amsterdam SGML Parser should be fairly easy to install. The C code conforms to the C book [Ritchie]. The main difficulty may be the absence of the make program. In that case the installer has to interpret the makefile and write her own installation script. Care should also be taken that the names of generated files are valid for the operating system in use. Note that various C identifiers are equal in the first 16 or more characters. This means that the program cannot easily be made to run if the C compiler or loader cannot handle this.
At this moment the Amsterdam SGML Parser and LLgen are installed on SUN Unix system 4.2 and VAX/VMS.
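
Whether a given compiler or loader is affected by the identifier issue mentioned above can be judged with a small check such as the following. The function name is an assumption, and the number of significant characters has to be taken from the documentation of the compiler or loader in question.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if two distinct identifiers would be treated as the same name
 * by a compiler or loader that only looks at the first `significant'
 * characters, else 0. */
int identifiers_clash(const char *a, const char *b, size_t significant)
{
    assert(a != NULL && b != NULL && significant > 0);
    return strcmp(a, b) != 0 && strncmp(a, b, significant) == 0;
}
```

Two names that agree in their first 16 characters and differ only later clash when `significant' is 16, but not when it is large enough to reach the first differing character.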