Parsing SGML



What does it mean to parse SGML?

In short, it means verifying that a document's markup is correct according to its document type definition (DTD).


Parsing an SGML document (e.g. an HTML document, or a document marked up using the fairly well-known DocBook DTD) involves two phases:

  1. generating a parser according to the DTD, then
  2. running the generated parser on your marked-up document.

The DTD contains everything a parser generator needs to know about which markup tags are legal, how they can be nested, whether they can be omitted, etc. Once a parser has been generated, it can be used repeatedly on any document that uses the DTD it was generated from.
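The two phases can be sketched in a few lines of Python. This is only an illustration under heavy assumptions: the "DTD" here is a made-up table (`TOY_DTD`) that maps each element to the children it may contain, whereas a real DTD uses full content models, attributes, and entity declarations. The names `generate_parser` and `parse_toy` are hypothetical and do not come from any SGML package.

```python
import re

# Hypothetical, heavily simplified "DTD": each element maps to the set of
# child elements it may contain. Real DTDs are far richer than this.
TOY_DTD = {
    "doc": {"title", "para"},
    "title": set(),
    "para": {"em"},
    "em": set(),
}

# Matches bare start and end tags such as <para> or </para> (no attributes).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def generate_parser(dtd, root):
    """Phase 1: build a checking function specialised to one DTD."""

    def parse(text):
        """Phase 2: check one document and return a list of error strings."""
        errors = []
        stack = []  # elements that are currently open
        for match in TAG.finditer(text):
            closing = match.group(1) == "/"
            name = match.group(2).lower()
            if name not in dtd:
                errors.append(f"unknown element <{name}>")
            elif closing:
                if stack and stack[-1] == name:
                    stack.pop()
                else:
                    errors.append(f"unexpected </{name}>")
            else:
                if not stack:
                    if name != root:
                        errors.append(f"<{name}> cannot be the root element")
                elif name not in dtd[stack[-1]]:
                    errors.append(f"<{name}> not allowed inside <{stack[-1]}>")
                stack.append(name)
        errors.extend(f"<{name}> never closed" for name in reversed(stack))
        return errors

    return parse


# The generated parser can be reused on any document written for the same DTD.
parse_toy = generate_parser(TOY_DTD, root="doc")
good = "<doc><title>Hi</title><para>Some <em>text</em>.</para></doc>"
bad = "<doc><title><para>oops</para></title></doc>"
print(parse_toy(good))
print(parse_toy(bad))
```

The point of the closure is that the work of reading the DTD happens once, in `generate_parser`, while `parse_toy` can then be called on as many documents as you like.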

As part of enforcing the document syntax specified by the DTD, an SGML parser may insert omitted tags, format the document, capitalize the markup, etc.
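As a rough illustration of what "inserting omitted tags" and "capitalizing the markup" might look like, here is a small Python sketch. It assumes a hypothetical rule table (`IMPLIED_END`) saying which start tags imply the end of a still-open element; a real parser derives this from the tag-omission and content-model declarations in the DTD. It handles only bare tags without attributes.

```python
import re

# Matches bare start and end tags such as <li> or </ul> (no attributes).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")

# Hypothetical rule table: an open element (key) is implicitly ended when
# one of the listed start tags appears. A real parser gets this from the DTD.
IMPLIED_END = {"p": {"p"}, "li": {"li"}}


def normalize(text):
    """Return text with omitted end tags inserted and tag names upper-cased."""
    out = []
    stack = []  # elements that are currently open
    pos = 0
    for match in TAG.finditer(text):
        out.append(text[pos:match.start()])  # copy character data unchanged
        pos = match.end()
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        if closing:
            # Close any open elements whose end tags were omitted.
            while stack and stack[-1] != name and stack[-1] in IMPLIED_END:
                out.append(f"</{stack.pop().upper()}>")
            if stack and stack[-1] == name:
                stack.pop()
            out.append(f"</{name.upper()}>")
        else:
            # A new start tag may imply the end of the element before it.
            if stack and name in IMPLIED_END.get(stack[-1], ()):
                out.append(f"</{stack.pop().upper()}>")
            stack.append(name)
            out.append(f"<{name.upper()}>")
    out.append(text[pos:])
    # At end of input, close whatever omissible elements are still open.
    while stack and stack[-1] in IMPLIED_END:
        out.append(f"</{stack.pop().upper()}>")
    return "".join(out)


print(normalize("<ul><li>one<li>two</ul>"))
```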

The parser may or may not have capabilities to perform post-processing on the document to generate some output form.

Many SGML parsing packages (notably James Clark's SP system) hide the parser generation step. This is not incorrect; it just obscures the process a bit.

On a slightly more technical level, an SGML parser:

  1. "checks each new character to see if it is part of a general delimiter string that identifies the start or end of a piece of markup,
  2. "checks whether or not the character is a short reference delimiter that needs to be expanded,
  3. "checks if the character is a separator character that should be ignored,
  4. "checks if the character is a valid part of the markup tag,
  5. "identifies the various markup tags, identifying any entities that need to be expanded or recalled from external sources, and
  6. "checks if identified markup tags are valid according to the declared model."

(Taken from Bryan, SGML.)
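A few of these steps (recognizing delimiter strings, skipping separators inside a tag, checking that a tag name is well formed, and expanding entities) can be sketched as a small scanner in Python. The delimiter strings are those of the SGML reference concrete syntax; everything else is an assumption made for illustration: the `ENTITIES` table is hypothetical, tags carry no attributes, and short references and checking against the declared model are left out entirely.

```python
import re

# General delimiter strings from the SGML reference concrete syntax.
ETAGO, STAGO, TAGC = "</", "<", ">"  # end-tag open, start-tag open, tag close
ERO, REFC = "&", ";"                 # entity reference open, reference close

# Simplified name pattern: a letter followed by name characters.
NAME = r"[A-Za-z][A-Za-z0-9.-]*"

# One alternative per kind of markup; \s* skips separators before TAGC.
TOKEN = re.compile(
    rf"{re.escape(ETAGO)}(?P<end>{NAME})\s*{re.escape(TAGC)}"
    rf"|{re.escape(STAGO)}(?P<start>{NAME})\s*{re.escape(TAGC)}"
    rf"|{re.escape(ERO)}(?P<entity>{NAME}){re.escape(REFC)}"
)

# Hypothetical entity table; a real parser reads entity declarations
# from the DTD and may fetch external entities.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">"}


def scan(text):
    """Yield (kind, value) pairs: 'start', 'end', or 'data'."""
    pos = 0
    for match in TOKEN.finditer(text):
        if match.start() > pos:
            yield ("data", text[pos:match.start()])
        kind = match.lastgroup  # name of the alternative that matched
        value = match.group(kind)
        if kind == "entity":
            # Expand known entities; leave unknown references untouched.
            yield ("data", ENTITIES.get(value, match.group(0)))
        else:
            yield (kind, value.lower())
        pos = match.end()
    if pos < len(text):
        yield ("data", text[pos:])


events = list(scan("<p>a &amp; b</p>"))
print(events)
```

A scanner like this only identifies the markup. Deciding whether the identified tags are valid (step 6) needs the content models from the DTD.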

Why should I parse my SGML documents?

Providing solid reasons for parsing SGML documents is not as easy as the SGML zealot would like. It seems rather silly to parse a document just to get the omitted tags filled in.

When you parse an SGML document, you are verifying it. SGML allows you to easily transmit documents and DTDs across the globe. Wouldn't it seem like a good idea to be sure that the documents you get from someone in Timbuktu are correctly written? Likewise, other people don't want to receive documents that have been incorrectly marked up.

Some will claim that you need only run your document through a 'renderer' (e.g. load an HTML document into Netscape or another browser) to be sure that it's written correctly. There are three flaws with this:

  1. Netscape (or any browser, for that matter) does not adhere completely to the HTML 2.0 DTD (which is the current standard),
  2. for documents written for another DTD, there may be no tool as quick as a Web browser for converting the document into an output form, and
  3. the output form is only one way of viewing a structured document; markup does not necessarily have anything to do with output formatting.

Parsing is not a step to be skipped lightly. It should be done on every SGML document.

There is a web service that allows you to verify HTML documents. (It uses James Clark's SP package.)


How can I parse my SGML documents?

Locally, there are two ways to do this.



jes@cs.wpi.edu