Parsing SGML
- What does it mean to parse an SGML document?
- Why should I parse my SGML documents?
- How can I parse my SGML documents?
Basically, it means verifying that a document's
markup is correct according to the
document type definition (DTD).
Parsing an SGML document (e.g. an HTML document, or a document
marked up using the somewhat well-known
DocBook DTD) involves two phases:
- generating a parser according to the DTD, then
- running the generated parser on your marked-up document.
The DTD contains everything a parser generator needs to know about which
markup tags are legal, how
they can be nested, whether they can be omitted, etc. Once a parser has
been generated, it can be used repeatedly on any document that uses the
DTD it was generated from.
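The two phases can be sketched in a few lines of Python. This is only a toy, not how SP or any real parser generator works: the "DTD" here is a made-up dictionary that lists which children each element may contain, and the names generate_parser, toy_dtd, good, and bad are invented for the illustration. A real DTD also covers ordering, occurrence indicators, tag omission, attributes, and entities.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical, heavily simplified "DTD": element name -> allowed children.
ToyDTD = Dict[str, Set[str]]

# A document is modeled as nested (element_name, children) tuples.
Node = Tuple[str, list]


def generate_parser(dtd: ToyDTD, root: str) -> Callable[[Node], List[str]]:
    """Phase 1: build a checker specialized to one DTD."""

    def check(node: Node, errors: List[str]) -> None:
        name, children = node
        if name not in dtd:
            errors.append(f"undeclared element: {name}")
            return
        for child in children:
            if child[0] not in dtd[name]:
                errors.append(f"{child[0]} not allowed inside {name}")
            check(child, errors)

    def parse(document: Node) -> List[str]:
        """Phase 2: run the generated checker on one document."""
        errors: List[str] = []
        if document[0] != root:
            errors.append(f"root must be {root}, found {document[0]}")
        check(document, errors)
        return errors

    return parse


toy_dtd: ToyDTD = {
    "doc": {"title", "para"},
    "title": set(),
    "para": {"emph"},
    "emph": set(),
}

# Generated once, then reused for every document of this type.
parse = generate_parser(toy_dtd, root="doc")

good = ("doc", [("title", []), ("para", [("emph", [])])])
bad = ("doc", [("para", [("title", [])])])

print(parse(good))
print(parse(bad))
```

The point of the split is that generate_parser runs once per DTD, while the function it returns can be applied to as many documents as you like.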
As part of enforcing the document syntax specified by the DTD, an SGML
parser may insert omitted tags, format the document, capitalize the markup,
etc.
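To make the idea of inserting omitted tags and capitalizing the markup concrete, here is a minimal Python sketch. The IMPLIED_END table and the normalize function are made up for this example. A real parser derives its omission rules from the tag minimization declarations in the DTD rather than from a hand-written table.

```python
from typing import Dict, List, Set

# Hypothetical table, not derived from a real DTD: an open element (key) is
# implicitly closed when one of the listed start tags appears.
IMPLIED_END: Dict[str, Set[str]] = {"li": {"li"}, "p": {"p", "li"}}


def normalize(tokens: List[str]) -> List[str]:
    """Insert omitted end tags and upper-case tag names.

    Tokens are start tags ("<li>"), end tags ("</li>"), or plain text.
    """
    out: List[str] = []
    open_elements: List[str] = []

    def close_top() -> None:
        out.append(f"</{open_elements.pop().upper()}>")

    for tok in tokens:
        if tok.startswith("</") and tok.endswith(">"):
            name = tok[2:-1].lower()
            if name not in open_elements:
                continue  # stray end tag: a real parser would report an error
            # Close anything still open inside the element being ended.
            # (A real parser would first check that those end tags are omissible.)
            while open_elements[-1] != name:
                close_top()
            close_top()
        elif tok.startswith("<") and tok.endswith(">"):
            name = tok[1:-1].lower()
            # Close open elements whose end tag this start tag implies.
            while open_elements and name in IMPLIED_END.get(open_elements[-1], set()):
                close_top()
            open_elements.append(name)
            out.append(f"<{name.upper()}>")
        else:
            out.append(tok)

    # End of document: close whatever is still open.
    while open_elements:
        close_top()
    return out


print(normalize(["<ul>", "<li>", "one", "<li>", "two", "</ul>"]))
```

In this run the second `<li>` closes the first one, and `</ul>` closes the last list item before closing the list itself.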
The parser may or may not have capabilities to perform post-processing
on the document to generate some output form.
Many SGML parsing packages hide the parser generation step (notably
James Clark's SP system). This is not
incorrect; it just obscures the process a bit.
On a slightly more technical level, an SGML parser:
- "checks each new character to see if it is part of a general delimiter
string that identifies the start or end of a piece of markup,
- "checks whether or not the character is a short reference delimiter
that needs to be expanded,
- "checks if the character is a separator character that should be
ignored,
- "checks if the character is a valid part of the markup tag,
- "identifies the various markup tags, identifying any entities that need
to be expanded or recalled from external sources, and
- "checks if identified markup tags are valid according to the declared
model."
(Taken from Bryan, SGML.)
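A few of the checks in that list can be sketched as a small scanner in Python. This is a rough illustration under several assumptions: it uses the delimiter strings of SGML's reference concrete syntax (`<`, `</`, `>`, `&`, `;`), it ignores short references, separators, and attributes entirely (so a tag with attributes is reported as malformed), and "valid according to the declared model" is reduced to "the element name is declared". The scan function and its parameters are invented for the example.

```python
import re
from typing import Dict, Iterator, Set, Tuple

# General delimiters from SGML's reference concrete syntax.
STAGO, ETAGO, TAGC, ERO, REFC = "<", "</", ">", "&", ";"

NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")

Token = Tuple[str, str]  # (kind, value)


def scan(text: str, entities: Dict[str, str], declared: Set[str]) -> Iterator[Token]:
    """Yield ("start" | "end" | "text" | "error", value) tokens."""
    i, n = 0, len(text)
    while i < n:
        if text.startswith(STAGO, i):
            # ETAGO ("</") also begins with STAGO ("<"), so test it here.
            is_end = text.startswith(ETAGO, i)
            j = i + (len(ETAGO) if is_end else len(STAGO))
            m = NAME.match(text, j)
            if m is None or not text.startswith(TAGC, m.end()):
                yield ("error", f"malformed tag at offset {i}")
                i += 1
                continue
            name = m.group().lower()
            if name in declared:
                yield ("end" if is_end else "start", name)
            else:
                yield ("error", f"undeclared element {name}")
            i = m.end() + len(TAGC)
        elif text.startswith(ERO, i):
            # Entity reference: expand it if the entity is known.
            m = NAME.match(text, i + len(ERO))
            if m and text.startswith(REFC, m.end()) and m.group() in entities:
                yield ("text", entities[m.group()])
                i = m.end() + len(REFC)
            else:
                yield ("error", f"bad entity reference at offset {i}")
                i += 1
        else:
            # Ordinary data: run up to the next possible delimiter.
            j = i
            while j < n and text[j] not in (STAGO, ERO):
                j += 1
            yield ("text", text[i:j])
            i = j


for token in scan("<p>AT&amp;T</p><x>", {"amp": "&"}, {"p"}):
    print(token)
```

Here `<x>` is flagged because `x` is not in the set of declared elements, and `&amp;` is replaced by its entity text.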
Providing solid reasons for parsing SGML documents is not as easy
as the SGML zealot would like. It seems rather silly to
parse a document just to get the omitted tags filled in.
The real benefit is verification: when you parse an SGML document, you are checking it against its DTD.
SGML allows you to easily transmit documents and DTDs across the globe.
Wouldn't it seem like a good idea to be sure that the documents you get
from someone in Timbuktu are correctly written? Likewise, other people
don't want to receive documents that have been incorrectly marked up.
Some will claim that you need only run your document through a 'renderer'
to be sure that it's written correctly (e.g. load an HTML document into
Netscape or another browser). There are three flaws with this:
- Netscape (or any browser, for that matter) does not adhere completely
to the HTML 2.0 DTD (which is the current standard),
- for documents written for another DTD, there may not be
such a quick way (like a Web browser) to convert the document into an output form, and
- the output form is only one way of viewing a
structured
document. Markup does not necessarily have to do with
output formatting.
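The first flaw is easy to demonstrate with any forgiving HTML processor. The sketch below uses Python's standard html.parser module as a stand-in for a browser, which is an assumption made purely for illustration. The EventLog class and the sample markup are invented for the example. The input leaves `<i>` unclosed and uses a tag that no DTD declares, yet nothing complains.

```python
from html.parser import HTMLParser
from typing import List


class EventLog(HTMLParser):
    """Records tag events the way a forgiving renderer would see them."""

    def __init__(self) -> None:
        super().__init__()
        self.events: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.events.append(f"start {tag}")

    def handle_endtag(self, tag):
        self.events.append(f"end {tag}")


# <i> is never closed, and <nosuchtag> is a made-up element.
log = EventLog()
log.feed("<b><i>text</b><nosuchtag>hi</nosuchtag>")
log.close()
print(log.events)
```

No exception is raised and no error is reported, so "it loaded fine" tells you very little about whether the markup conforms to a DTD.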
Parsing is not a step to skip. It is something that should be done
on each and every SGML document.
There is a web service that allows you to
verify HTML documents.
(It uses James Clark's SP package.)
Locally, there are two ways to do this.
- On raven and
penguin the SP
package is installed. Take a look
at the examples and the (impossible)
documentation.
- On owl and sequoia the
Amsterdam Parser
(with equally impossible documentation) is
installed. This is the parser that is used in
Lab 7.
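For the SP option, the parser executable is nsgmls, and its -s flag suppresses the normal output so that only error messages are printed. The following Python wrapper is a minimal sketch, assuming nsgmls is on your PATH and that your document's DOCTYPE declaration lets SP find the DTD. The function names and the file name in the usage note are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def build_command(document: str, executable: str = "nsgmls") -> List[str]:
    """Command line for a validate-only run.

    -s suppresses nsgmls's normal output; error messages are still printed.
    """
    return [executable, "-s", document]


def validate(document: str, executable: str = "nsgmls") -> Optional[List[str]]:
    """Return the parser's error lines, or None if the parser is not installed."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        build_command(document, executable),
        capture_output=True,
        text=True,
    )
    # nsgmls writes its error messages to standard error.
    return result.stderr.splitlines()
```

Calling `validate("mydoc.sgml")` should then give you an empty list when the parser has nothing to report, and one entry per message otherwise.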
jes@cs.wpi.edu