Lexical analyser

Token recognition

The lexical analyser takes input from the input stream and converts it into tokens, which are used by the parser. The lexical analyser is not aware of different input streams; these are handled by the input manager (see ``in.c'').

Every token has an associated value, which gives more detail about the token read. For example, the token TOK_DIGIT has as its value the actual digit read (`0', `1', `2', ... `9'). For some tokens there is no meaningful associated value; in that case the value equals 0. For example, the token TOK_AND does not need a value, because the token itself is already completely determined.
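The pairing of a token with its value can be pictured with a small C sketch. This is an illustration only: the struct layout, the numeric token codes and the function make_token() are assumptions made for the example, not the definitions from `tokens.g'. It also assumes the reference concrete syntax, where the AND delimiter is "&", and ignores the recognition context a real analyser would have to consider.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes; the real ones are defined in `tokens.g'. */
enum { TOK_AND = 1, TOK_DIGIT = 2 };

/* A token paired with its associated value (layout is an assumption). */
struct token {
    int type;   /* which token was recognized */
    int value;  /* extra detail, or 0 when the type says it all */
};

/* Build the token for one input character. Only two cases are handled.
   Returns 1 if a token was stored in *out, 0 otherwise. */
int make_token(int ch, struct token *out)
{
    assert(out != NULL);
    if (isdigit((unsigned char)ch)) {
        out->type = TOK_DIGIT;
        out->value = ch;      /* the digit itself: '0' .. '9' */
        return 1;
    }
    if (ch == '&') {
        out->type = TOK_AND;
        out->value = 0;       /* no meaningful value */
        return 1;
    }
    return 0;                 /* character not covered by this sketch */
}
```

Calling make_token('7', &t) stores TOK_DIGIT with the value '7', while make_token('&', &t) stores TOK_AND with the value 0.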

All tokens are defined in the file `tokens.g'. Most tokens are equivalent to the delimiters and the character classes of the Standard (see figures 1, 2 and 3, pages 29-31). These are:

TOK_AND	TOK_COM	TOK_CRO	TOK_DTGC
TOK_DTGO	TOK_DSC	TOK_DSO	TOK_EE
TOK_ERO	TOK_ETAGO	TOK_GRPC	TOK_GRPO
TOK_LIT	TOK_LITA	TOK_MDC	TOK_MDO
TOK_MINUS	TOK_MSC	TOK_NET	TOK_NONSGML
TOK_OPT	TOK_OR	TOK_PERO	TOK_PIC
TOK_PIO	TOK_PLUS	TOK_RE	TOK_REFC
TOK_REP	TOK_RNI	TOK_RS	TOK_SEQ
TOK_SHORTREF	TOK_SPACE	TOK_STAGO	TOK_TAGC
TOK_VI	TOK_DIGIT	TOK_DATACHAR	TOK_DELMCHAR
TOK_FUNCHAR	TOK_LETTER	TOK_MSICHAR	TOK_MSOCHAR
TOK_MSSCHAR	TOK_NMCHAR	TOK_NMSTRT	TOK_SEPCHAR
TOK_SPECIAL

The character class NMCHAR denotes both UCNMCHAR and LCNMCHAR because they always occur together. For the same reason NMSTRT denotes both UCNMSTRT and LCNMSTRT, and LETTER denotes both UC_LETTER and LC_LETTER. However, these tokens are not sufficient; a few other tokens are needed.

The MDO delimiter ("<!") is not recognized on its own. The keyword following the MDO delimiter is always recognized with it and together they form a single token. In this way the tokens

MDO_ATTLIST	MDO_DOCTYPE	MDO_ELEMENT
MDO_ENTITY	MDO_LINK	MDO_LINKTYPE
MDO_NOTATION	MDO_SHORTREF	MDO_SGML
MDO_USELINK	MDO_USEMAP

are defined.

The other delimiters that may follow the MDO delimiter, COM ("--"), MDC (">") and DSO ("["), are also joined with the MDO to form a single token. In this way the tokens MDO_COM, MDO_MDC and TOK_MDO_DSO are defined.
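A minimal sketch of how the text following "<!" might be mapped to one of these combined tokens is given here. The function mdo_token(), the table layout and the numeric token codes are hypothetical; only the token names come from this document. The sketch assumes the reference concrete syntax, assumes that keywords are matched without regard to case, and assumes that the undefined token TOK_NOD is the result when nothing matches.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical numeric codes; the real tokens are defined in `tokens.g'. */
enum {
    TOK_NOD = 0,
    MDO_DOCTYPE, MDO_ELEMENT, MDO_ENTITY,
    MDO_COM, MDO_MDC, TOK_MDO_DSO
};

/* Only a few of the keywords are listed, to keep the sketch short. */
static const struct { const char *word; int token; } mdo_keywords[] = {
    { "DOCTYPE", MDO_DOCTYPE },
    { "ELEMENT", MDO_ELEMENT },
    { "ENTITY",  MDO_ENTITY  },
};

/* Compare the first n characters of s with an upper-case keyword,
   ignoring the case of s. */
static int same_keyword(const char *s, const char *word, size_t n)
{
    size_t i;

    if (strlen(word) != n)
        return 0;
    for (i = 0; i < n; i++)
        if (toupper((unsigned char)s[i]) != word[i])
            return 0;
    return 1;
}

/* Classify the text that directly follows "<!". */
int mdo_token(const char *after_mdo)
{
    size_t len = 0, i;

    assert(after_mdo != NULL);
    if (strncmp(after_mdo, "--", 2) == 0) return MDO_COM;
    if (after_mdo[0] == '>') return MDO_MDC;
    if (after_mdo[0] == '[') return TOK_MDO_DSO;

    /* Measure the run of letters that could be a keyword. */
    while (isalpha((unsigned char)after_mdo[len]))
        len++;
    for (i = 0; i < sizeof mdo_keywords / sizeof mdo_keywords[0]; i++)
        if (same_keyword(after_mdo, mdo_keywords[i].word, len))
            return mdo_keywords[i].token;
    return TOK_NOD;   /* assumption: no combined token applies */
}
```

With this sketch, the text after the "<!" of "<!element DOC ..." yields MDO_ELEMENT, and the text after the "<!" of "<!-- ..." yields MDO_COM.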

The MSC ("]]") and MDC delimiters are also joined and form the token TOK_MSC_MDC.

To work around an error in the Standard, the token TOK_PERODEF is defined. It corresponds to the PERO ("%") delimiter when it occurs in a parameter entity declaration. See the description of the function pero() in the lexical analyser.

Inside the SGML declaration all keywords are recognized by the lexical analyser and returned as tokens. This defines the tokens:

SGML_APPINFO	SGML_BASESET	SGML_CAPACITY	SGML_CHARSET
SGML_CONCUR	SGML_CONTROLS	SGML_DATATAG	SGML_DELIM
SGML_DESCSET	SGML_DOCUMENT	SGML_ENTITY	SGML_EXPLICIT
SGML_FEATURES	SGML_FORMAL	SGML_FUNCTION	SGML_GENERAL
SGML_IMPLICIT	SGML_INSTANCE	SGML_LCNMCHAR	SGML_LCNMSTRT
SGML_LINK	SGML_MINIMIZE	SGML_NAMECASE	SGML_NAMES
SGML_NAMING	SGML_NO	SGML_NONE	SGML_OMITTAG
SGML_OTHER	SGML_PUBLIC	SGML_QUANTITY	SGML_RANK
SGML_RE	SGML_RS	SGML_SCOPE	SGML_SHORTREF
SGML_SHORTTAG	SGML_SHUNCHAR	SGML_SGMLREF	SGML_SIMPLE
SGML_SPACE	SGML_SUBDOC	SGML_SWITCHES	SGML_SYNTAX
SGML_UCNMCHAR	SGML_UCNMSTRT	SGML_UNUSED	SGML_YES

There is a special token TOK_NOD, which denotes the undefined token. It is returned by various functions when no appropriate token can be found.
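As an illustration of a function that falls back to TOK_NOD, the following sketch looks up an SGML declaration keyword in a sorted table with the standard bsearch() function. The function name sgml_keyword_token(), the table and the numeric codes are assumptions made for the example; the lookup method actually used by the lexical analyser is not described here. The sketch also assumes that the name has already been folded to upper case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical numeric codes; the real tokens are defined in `tokens.g'. */
enum { TOK_NOD = 0, SGML_APPINFO, SGML_BASESET, SGML_CAPACITY, SGML_CHARSET };

struct keyword { const char *word; int token; };

/* A few keywords only. The table must stay sorted by word,
   because bsearch() is used below. */
static const struct keyword sgml_keywords[] = {
    { "APPINFO",  SGML_APPINFO  },
    { "BASESET",  SGML_BASESET  },
    { "CAPACITY", SGML_CAPACITY },
    { "CHARSET",  SGML_CHARSET  },
};

/* bsearch() comparison: key is a string, entry is a table element. */
static int compare_keyword(const void *key, const void *entry)
{
    return strcmp((const char *)key, ((const struct keyword *)entry)->word);
}

/* Return the token for an upper-case keyword, or TOK_NOD if unknown. */
int sgml_keyword_token(const char *name)
{
    const struct keyword *hit;

    assert(name != NULL);
    hit = bsearch(name, sgml_keywords,
                  sizeof sgml_keywords / sizeof sgml_keywords[0],
                  sizeof sgml_keywords[0], compare_keyword);
    return hit ? hit->token : TOK_NOD;
}
```

A caller can then test the result against TOK_NOD to decide whether the name was a keyword at all.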

The token TOK_CONREF is used when an element occurs with a filled in CONREF attribute. When this occurs, the content of the element is empty (see Standard Annex B, page 86). TOK_CONREF is put into the input stream by the parser to avoid recognition of the content of the element.

Starttag and endtag recognition

Starttags are special to the lexical analyser. Whenever the lexical analyser recognizes a STAGO ("<") delimiter, it scans the input until the end of the starttag is found. The end can be marked by a TAGC (">"), NET ("/"), STAGO or ETAGO ("</"). The lexical analyser returns the complete starttag as one token. The lexical analyser reads the endtag in the same way. The names of the tokens are constructed from the names of the generic identifiers of the elements. If, for example, the DTD is:

<!doctype DOC [
<!element DOC    - - (A, B)>
<!element A      - - (#PCDATA)>
<!element B      - - (#PCDATA)>
]>

then the tokens for the starttags of DOC, A and B are respectively ST_DOC, ST_A, ST_B. The tokens for the endtags are respectively END_DOC, END_A, and END_B. The attributes that belong to a starttag are read and stored. See ``att_par.c'' for a description of the attribute storage.
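The two steps described in this section, finding the end of a starttag and deriving the token name from the generic identifier, can be sketched as follows. Both functions are hypothetical and simplified. The sketch assumes the reference concrete syntax, assumes that generic identifiers are folded to upper case, skips over attribute value literals so that a delimiter inside quotes does not end the tag, and assumes that a closing TAGC or NET belongs to the tag while a following STAGO or ETAGO does not.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Build "ST_<GI>" or "END_<GI>" in buf, with the GI folded to upper case.
   Returns 0 on success, -1 if buf is too small. */
int tag_token_name(const char *gi, int is_end, char *buf, size_t size)
{
    const char *prefix = is_end ? "END_" : "ST_";
    size_t plen = strlen(prefix), glen = strlen(gi), i;

    assert(buf != NULL);
    if (plen + glen + 1 > size)
        return -1;
    memcpy(buf, prefix, plen);
    for (i = 0; i < glen; i++)
        buf[plen + i] = (char)toupper((unsigned char)gi[i]);
    buf[plen + glen] = '\0';
    return 0;
}

/* Return the number of characters that make up the starttag at the
   beginning of s. The string must begin with STAGO ("<"). */
size_t starttag_length(const char *s)
{
    size_t i;
    char quote = 0;        /* the open literal delimiter, if any */

    assert(s != NULL && s[0] == '<');
    for (i = 1; s[i] != '\0'; i++) {
        if (quote) {
            if (s[i] == quote)
                quote = 0; /* end of the attribute value literal */
        } else if (s[i] == '"' || s[i] == '\'') {
            quote = s[i];  /* LIT or LITA opens a literal */
        } else if (s[i] == '>' || s[i] == '/') {
            return i + 1;  /* TAGC or NET closes the tag */
        } else if (s[i] == '<') {
            return i;      /* STAGO or ETAGO: the next tag starts here */
        }
    }
    return i;              /* input ended before the tag was closed */
}
```

For the DTD above, tag_token_name("DOC", 0, buf, sizeof buf) produces "ST_DOC" and tag_token_name("A", 1, buf, sizeof buf) produces "END_A".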