Introduction
An SGML System Conforming to
International Standard ISO 8879 --
Standard Generalized Markup Language
This is the introduction to the technical documentation of the Amsterdam SGML Parser.
The Amsterdam SGML Parser accepts basic SGML as defined in the Standard on page 53.
Additional services are also supported; see page 58 of the Standard.
validation services:
GENERAL | YES
MODEL | YES
EXCLUDE | NO
CAPACITY | YES
NONSGML | YES
SGML | YES
FORMAL | NO
We will use the abbreviation DTD for document type definition.
The following picture gives the structure of the Amsterdam SGML Parser.
         DTD ---> generator
                      |
                      v
             LLgen code, C code
                      |
                      v
    document ---> doc_parser
                      |
                      v
              complete document
The system consists of two different programs: the DTD-parser, called generator,
and the document parser, called doc_parser.
- The generator takes a DTD as input.
- It checks whether the DTD is correct.
- If so, it generates several files containing LLgen code and several files
containing C code. The second parameter of the generator gives
the base name for the generated LLgen files. Each filename is constructed
from the second parameter by appending a number and the suffix `.g'.
For instance, if the second parameter is `document', the generated
filenames are `document1.g', `document2.g', etc.
- The LLgen rules in the LLgen file correspond to the element
declarations in the DTD.
- The C files contain information from the DTD that is
needed in the document parser, e.g. attribute definitions, defined entities, etc.
These LLgen and C files are, together with some other LLgen and
C files, processed by LLgen and compiled by the C compiler.
This produces an executable program doc_parser, which
is the document parser for documents written according to the input DTD.
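The naming scheme for the generated LLgen files can be sketched in C as follows. This is only an illustration, not the generator's actual code: the function name `make_llgen_name' and the buffer handling are assumptions. Only the pattern (second parameter, a number, the suffix `.g') comes from the description above.

```c
#include <stdio.h>

/* Hypothetical helper, not taken from the parser sources.
 * Builds "<base><number>.g" in buf; base "document" and number 1
 * give "document1.g". Returns 0 on success, -1 if buf is too small. */
int make_llgen_name(char *buf, size_t size, const char *base, int number)
{
    int n = snprintf(buf, size, "%s%d.g", base, number);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Calling the helper with the numbers 1, 2, ... and the base `document' yields the names listed in the example above.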
The sources consist of files with six different
suffixes:
suffix | file type
---|---
c | C files
h | C header files
i | include files
gen | generic C files
gh | generic C header files
g | LLgen files
The C files and the C header files contain plain C code. The
include files contain generated structures that are included
in the program.
The generic C and generic C header files implement generic datatypes.
This is plain C code that is parameterized by using the C preprocessor.
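As an illustration of a datatype parameterized through the C preprocessor, the sketch below defines a fixed-size stack whose element type is a macro argument. The macro `DEFINE_STACK' and all names it generates are hypothetical. The actual `.gen' and `.gh' files may use a different mechanism, for instance a type macro that is defined before the file is included.

```c
#include <stddef.h>

/* Hypothetical generic stack: NAME is the type name to create,
 * TYPE the element type, SIZE the fixed capacity. */
#define DEFINE_STACK(NAME, TYPE, SIZE)                          \
    typedef struct { TYPE item[SIZE]; size_t top; } NAME;       \
    static void NAME##_init(NAME *s) { s->top = 0; }            \
    static int NAME##_push(NAME *s, TYPE x)                     \
    {                                                           \
        if (s->top == (SIZE)) return -1;   /* stack is full */  \
        s->item[s->top++] = x;                                  \
        return 0;                                               \
    }                                                           \
    static int NAME##_pop(NAME *s, TYPE *out)                   \
    {                                                           \
        if (s->top == 0) return -1;        /* stack is empty */ \
        *out = s->item[--s->top];                               \
        return 0;                                               \
    }

/* One instantiation: a stack of at most 16 ints. */
DEFINE_STACK(IntStack, int, 16)
```

The same macro can be instantiated again with another element type, so the stack code is written only once.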
The LLgen files correspond closely to the rules in the Standard.
The grammar had to be made LL(1), so some changes were needed.
These changes include a different placement of
ps, ts and ds, and differences in the parsing of keywords.
See also the general description of the lexical analyser.
The sources for the Amsterdam SGML Parser are divided into three groups:
Sources used only in the generator.
ambigu.c | element.c | gen_tagl.c | notation.c
att_gen.c | empty.c | gen_tags.c | omitstrt.c
capacity.c | gen_code.c | generate.c | sgmlproc.c
context.c | gen_incl.c | node.c |
attrib.g | dtd.g | elem.g | sgml.g
Sources used only in the doc_parser.
att_par.c | incl.c | rep_pars.c | taglist.c
doc_pars.c | myerror.c | replace.c | tags.c
elem_stk.c | out.c | startend.c |
document*.g | rules.g | |
Sources used both in the generator and the doc_parser.
att_chk.c | group.c | modes.c | str_in.c
charclas.c | in.c | mode_stk.c | symtable.c
conc_syn.c | keywords.c | report.c | token_in.c
entity.c | Lpars.c | set.c | tools.c
file_in.c | lexical.c | shortref.c |
comment.g | ent.g | marked.g | tokens.g
doc.g | extern.g | shortnot.g |
The files that are used by both the generator and the doc_parser contain
pieces of code that are used by only one of the two programs.
This code is placed between `#ifdef' and `#endif' directives of the
C preprocessor.
Code used only by the generator is placed between `#ifdef GENERATOR' and `#endif'.
Code used only by the doc_parser is placed between `#ifdef DOC_PARSER' and `#endif'.
This means that the preprocessor symbol GENERATOR must be defined when
compiling generator, and DOC_PARSER when compiling doc_parser.
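A shared source file laid out in this way might look like the following sketch. The function names are hypothetical and do not occur in the parser sources; only the symbols GENERATOR and DOC_PARSER are taken from the text above.

```c
/* Sketch of a file shared by both programs. It is compiled with
 * GENERATOR defined for the generator, or with DOC_PARSER defined
 * for doc_parser (with many compilers: -DGENERATOR or -DDOC_PARSER). */

/* Code needed by both programs: length of a name. */
int shared_name_length(const char *s)
{
    int n = 0;
    while (*s++ != '\0')
        n++;
    return n;
}

#ifdef GENERATOR
/* Only compiled into the generator. */
const char *program_name(void) { return "generator"; }
#endif

#ifdef DOC_PARSER
/* Only compiled into doc_parser. */
const char *program_name(void) { return "doc_parser"; }
#endif
```

If neither symbol is defined, only the shared part is compiled, which is why the makefile has to define the right symbol for each program.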
All the following files must be compiled anew when the DTD is
changed, to yield a new document parser. This list also
includes the C-files corresponding to the LLgen files.
att_par.c | incl.c | myerror.c | tags.c
conc_syn.c | keywords.c | shortref.c |
doc_pars.c | lexical.c | startend.c |
entity.c | modes.c | taglist.c |
comment.c | document*.c | extern.c | rules.c
doc.c | ent.c | marked.c | shortnot.c
In the file `types.h' a preprocessor symbol DEBUG can be defined.
If this symbol is defined, debugging output can be obtained by
specifying one or more flags when calling generator or doc_parser.
Each flag turns on debugging in one module or group of modules.
a debug ambigu.c, empty.c and omitstrt.c (only in generator)
c debug rep_pars.c (only in doc_parser)
d put debug information on stderr, instead of on file ``debug_info1''
e debug entity.c, doc.g and extern.g
g debug dtd.g (only in generator)
i debug in.c
k debug marked.g
l debug lexical.c
m debug myerror.c (only in doc_parser)
p print all elements on a file (only in generator)
s debug shortref.c and shortnot.g
t debug att_chk.c (also rules.g in the doc_parser)
When DEBUG is defined, the assertions in the program text that check
internal consistency are also activated.
It is advisable to define DEBUG during installation and the test phase.
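The way a module could consult its debug flag can be sketched as follows. This is an assumption-laden illustration: the names `set_debug_flags', `debug_on' and `lexical_trace' are hypothetical, and the sketch always writes to stderr, whereas the parser writes to a file unless the flag `d' is given.

```c
#include <stdio.h>
#include <string.h>

/* In the parser, DEBUG is defined in `types.h'; it is defined here
 * only to keep the sketch self-contained. */
#define DEBUG

/* Hypothetical: the debug flag letters given on the command line. */
static const char *debug_flags = "";

void set_debug_flags(const char *flags) { debug_flags = flags; }

/* Returns 1 if debugging for the module with this flag letter is on. */
int debug_on(char flag)
{
#ifdef DEBUG
    return flag != '\0' && strchr(debug_flags, flag) != NULL;
#else
    (void)flag;
    return 0;   /* without DEBUG there is no debugging output at all */
#endif
}

/* Example use inside a module such as lexical.c (flag `l'). */
void lexical_trace(const char *token)
{
    if (debug_on('l'))
        fprintf(stderr, "lexical: token `%s'\n", token);
}
```

When DEBUG is not defined, `debug_on' always returns 0, so the tracing calls cost almost nothing in the installed program.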
The file `types.h' contains most of the type definitions. For
example:
typedef struct node_struct *P_Node;
This typedef defines a pointer to an opaque structure.
The `struct' definition and the actual fields are known only inside
the module that implements the structure (here: ``node.c'').
For all other modules, the only way to handle a variable of type
`P_Node' is to use the functions exported by the implementation module
(here: through ``node.h'').
This is how information hiding is accomplished.
Information hiding makes the program easy to adapt and maintain,
because most changes are local to one module.
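The opaque-pointer technique can be sketched in a single file as follows. The comments mark which part would live in which file. The function names and the field `value' are hypothetical; only the typedef and the file names come from the text above.

```c
#include <stdlib.h>

/* --- types.h: the only thing other modules see --- */
typedef struct node_struct *P_Node;

/* --- node.h: exported functions (hypothetical names) --- */
P_Node node_create(int value);
int    node_value(P_Node n);
void   node_delete(P_Node n);

/* --- node.c: the only place where the fields are known --- */
struct node_struct {
    int value;          /* hypothetical field */
};

P_Node node_create(int value)
{
    P_Node n = malloc(sizeof *n);
    if (n != NULL)
        n->value = value;
    return n;
}

int node_value(P_Node n) { return n->value; }

void node_delete(P_Node n) { free(n); }
```

A module that includes only the first two parts can pass `P_Node' values around and call the exported functions, but it cannot access `n->value' directly, so the fields can change without touching other modules.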
The document parser has two additional flags:
-r, followed by the name of a replacement file. The replacement file contains
replacements for the start-tags and end-tags. Instead of the complete
document, a document with, for instance, Troff code is generated.
-z "string of text". The string of text is used in
error messages. Instead of the name of the file in which the error
occurred, the string of text is printed. This is very helpful in batch systems
where the filename is the same for all the documents.
Installing the Amsterdam SGML Parser.
The Amsterdam SGML Parser distribution consists of two directories.
The directory Parser/Src contains the parser itself.
The other directory, LLgen, contains the program LLgen.
This program is used by the parser.
LLgen is an LL(1) recursive descent parser generator.
It is described in a separate document in the appendix.
Installing LLgen
In a Unix environment, LLgen should be made according to
the instructions in the file `READ_ME' in the directory LLgen.
In a non-Unix environment, this might not work.
The file `machdep.c' contains most of the machine-dependent code of
LLgen.
The uses of the Unix calls `link' and `unlink' will probably have to be
rewritten.
If this is not possible, they can be removed, together with
the code in `main.c' in which they are used.
The purpose of this code is to stop LLgen from generating a new
C-file when the file has not changed since the previous call to
LLgen. This helps the make program work faster.
Note, however, that no guarantee is given for LLgen to work under
non-Unix systems.
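Where `link' and `unlink' are not available, the same effect (leaving an unchanged C-file untouched) could be obtained with standard C library calls only. The sketch below is an assumption about how such a replacement might look. It is not the code in LLgen, and the function names are hypothetical.

```c
#include <stdio.h>

/* Returns 1 if both files exist and have identical contents. */
static int same_contents(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb");
    FILE *fb = fopen(b, "rb");
    int same = (fa != NULL && fb != NULL);
    int ca, cb;

    while (same) {
        ca = getc(fa);
        cb = getc(fb);
        if (ca != cb)
            same = 0;           /* difference found */
        else if (ca == EOF)
            break;              /* both files ended together */
    }
    if (fa != NULL) fclose(fa);
    if (fb != NULL) fclose(fb);
    return same;
}

/* Installs `fresh' as `target' unless target already has the same
 * contents. Returns 1 if target was replaced, 0 if it was left
 * untouched (so its modification time stays old), -1 on error. */
int install_if_changed(const char *fresh, const char *target)
{
    if (same_contents(fresh, target)) {
        remove(fresh);
        return 0;
    }
    remove(target);             /* fails harmlessly if target is absent */
    return rename(fresh, target) == 0 ? 1 : -1;
}
```

The generated code would first be written to a temporary file, which is then passed as `fresh'. Because an unchanged target keeps its old modification time, make does not recompile it.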
Installing the Amsterdam SGML Parser
The source for the Amsterdam SGML Parser is to be found in the directory Parser/Src.
This directory has two sub-directories GEN and DOC.
In GEN the C-code from LLgen and the object code for the DTD-parser
are created.
In DOC the C-code from LLgen, the C-code for the document parser
and the object code for the document parser are created.
On a Unix system the DTD-parser can be generated by the command:
make generator
from the Parser/Src directory.
This creates an executable program `generator' in the directory GEN.
When the generator is successfully generated,
the document parser can be created by the commands:
GEN/generator dtd_file document
make doc_parser
NOTE: before the generator is executed all the old document*.g files
and the DOC/document*.o and DOC/document*.c files must be removed.
If the `dtd_file' contains a correct document type definition, the first command generates
the LLgen files `document*.g' in the Parser/Src directory.
Otherwise errors are displayed on standard error output
and the `document*.g' files are not generated.
If no errors are displayed, then the second command can be given.
The second command generates C-files, corresponding to the LLgen-files,
in the directory DOC; then all files are compiled, and in the directory DOC
the executable program `doc_parser' is created.
`doc_parser' is a parser that accepts documents that are written
according to the DTD in `dtd_file'.
This program can be moved to any place in the system.
The program `doc_parser' takes as its argument a file containing a document and
delivers on the standard output the `complete document', including all
start- and end-tags, expanded entities, etc.
All error- and warning-messages are written on the standard error output.
A typical call is:
doc_parser file.doc >complete_file.doc 2>error_output
If the `file.doc' parameter is a minus sign `-', the document is taken from
standard input.
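The convention of `-' standing for standard input is commonly implemented along the following lines. This is a sketch only: the helper name `open_document' is hypothetical and is not taken from the parser sources.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns stdin for "-", otherwise opens the
 * named file for reading. Returns NULL if the file cannot be opened. */
FILE *open_document(const char *name)
{
    if (strcmp(name, "-") == 0)
        return stdin;
    return fopen(name, "r");
}
```

The caller reads from the returned stream in the same way in both cases, and should not close it when it is stdin.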
Installing the Amsterdam SGML Parser on non-unix systems
On non-unix systems the Amsterdam SGML Parser should be fairly easy to install.
The C code conforms to the C book [Ritchie].
The main difficulty can be the absence of the make-program.
This means the installer has to interpret the makefile and
write her own installation script.
Care should also be taken to check whether the names of the generated files
are valid on the operating system in use.
Note that various C-identifiers are equal in the first 16 or more characters.
This means that the program cannot be built easily if the C compiler or loader
does not distinguish such identifiers.
At this moment the Amsterdam SGML Parser and LLgen are installed on SUN Unix
system 4.2
and VAX/VMS.