Token recognition



The lexical analyser takes input from the input stream and converts it into tokens for the parser. The lexical analyser is not aware of different input streams; these are handled by the input manager (see ``in.c'').

Every token has an associated value, which gives more detail about the token read. For example, the token TOK_DIGIT has as its value the actual digit read (`0', `1', `2', ... `9'). For some tokens there is no meaningful associated value; in that case the value equals 0. For example, the token TOK_AND does not need a value, because the token itself is already completely determined.
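
The pairing of a token with its value can be pictured as a small record. The following is only a sketch: the numeric token codes, the struct layout and the helper names digit_token() and plain_token() are assumptions made for illustration and are not taken from `tokens.g' or from the lexical analyser itself.

```c
#include <assert.h>

/* Hypothetical token codes; the real ones are defined in tokens.g. */
enum { TOK_AND = 1, TOK_DIGIT = 2 };

/* A token together with its associated value (illustrative layout). */
struct token {
    int code;   /* which token was recognized */
    int value;  /* extra detail, or 0 when the code alone says it all */
};

/* Build a TOK_DIGIT token whose value is the digit character read. */
struct token digit_token(char c)
{
    struct token t;

    assert(c >= '0' && c <= '9');
    t.code = TOK_DIGIT;
    t.value = c;
    return t;
}

/* Build a token that carries no meaningful value, such as TOK_AND. */
struct token plain_token(int code)
{
    struct token t;

    t.code = code;
    t.value = 0;
    return t;
}
```

With these helpers, digit_token('7') yields a token with code TOK_DIGIT and value '7', while plain_token(TOK_AND) yields a token whose value is 0.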

All tokens are defined in the file `tokens.g'. Most tokens correspond to the delimiters and the character classes of the Standard (see figures 1, 2 and 3, pages 29-31). These are:

TOK_AND       TOK_COM       TOK_CRO       TOK_DTGC
TOK_DTGO      TOK_DSC       TOK_DSO       TOK_EE
TOK_ERO       TOK_ETAGO     TOK_GRPC      TOK_GRPO
TOK_LIT       TOK_LITA      TOK_MDC       TOK_MDO
TOK_MINUS     TOK_MSC       TOK_NET       TOK_NONSGML
TOK_OPT       TOK_OR        TOK_PERO      TOK_PIC
TOK_PIO       TOK_PLUS      TOK_RE        TOK_REFC
TOK_REP       TOK_RNI       TOK_RS        TOK_SEQ
TOK_SHORTREF  TOK_SPACE     TOK_STAGO     TOK_TAGC
TOK_VI        TOK_DIGIT     TOK_DATACHAR  TOK_DELMCHAR
TOK_FUNCHAR   TOK_LETTER    TOK_MSICHAR   TOK_MSOCHAR
TOK_MSSCHAR   TOK_NMCHAR    TOK_NMSTRT    TOK_SEPCHAR
TOK_SPECIAL

The character class NMCHAR denotes both UCNMCHAR and LCNMCHAR because they always occur together. For the same reason NMSTRT denotes both UCNMSTRT and LCNMSTRT, and LETTER denotes both UC_LETTER and LC_LETTER. However, these tokens are not sufficient; a few other tokens are needed.

The MDO delimiter ("<!") is not recognized on its own. The keyword following the MDO delimiter is always recognized with it and together they form a single token. In this way the tokens

MDO_ATTLIST   MDO_DOCTYPE   MDO_ELEMENT
MDO_ENTITY    MDO_LINK      MDO_LINKTYPE
MDO_NOTATION  MDO_SHORTREF  MDO_SGML
MDO_USELINK   MDO_USEMAP

are defined.

The other delimiters that may follow the MDO delimiter, namely COM ("--"), MDC (">") and DSO ("["), are also joined with the MDO to form a single token. In this way the tokens MDO_COM, MDO_MDC, TOK_MDO_DSO are defined.
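
How the MDO delimiter and whatever follows it could be folded into one token is sketched below. This is not the code of the lexical analyser: the function name mdo_token(), the numeric token codes, the reduced keyword table and the case folding of keywords are all assumptions chosen for illustration. TOK_NOD is used here as the "no token" result.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical numeric codes; the real values are defined in tokens.g. */
enum {
    TOK_NOD = 0,
    MDO_ELEMENT, MDO_ENTITY, MDO_LINK, MDO_LINKTYPE,
    MDO_COM, MDO_MDC, TOK_MDO_DSO
};

/* A reduced keyword table; the full set has more entries. */
static const struct { const char *keyword; int token; } mdo_keywords[] = {
    { "ELEMENT",  MDO_ELEMENT  },
    { "ENTITY",   MDO_ENTITY   },
    { "LINK",     MDO_LINK     },
    { "LINKTYPE", MDO_LINKTYPE },
};

/* Classify the text at s, which must begin with the MDO delimiter "<!". */
int mdo_token(const char *s)
{
    char name[16];
    size_t n = 0, i;

    assert(s != NULL && s[0] == '<' && s[1] == '!');
    s += 2;

    /* Delimiters that may directly follow the MDO. */
    if (s[0] == '-' && s[1] == '-') return MDO_COM;
    if (s[0] == '>') return MDO_MDC;
    if (s[0] == '[') return TOK_MDO_DSO;

    /* Collect the whole keyword, folded to upper case (an assumption),
     * so that "LINK" and "LINKTYPE" are told apart by exact match. */
    while (n < sizeof name - 1 && isalpha((unsigned char)s[n])) {
        name[n] = (char)toupper((unsigned char)s[n]);
        n++;
    }
    name[n] = '\0';

    for (i = 0; i < sizeof mdo_keywords / sizeof mdo_keywords[0]; i++)
        if (strcmp(name, mdo_keywords[i].keyword) == 0)
            return mdo_keywords[i].token;

    return TOK_NOD;  /* no keyword or delimiter recognized */
}
```

Under these assumptions, mdo_token("<!element DOC") gives MDO_ELEMENT and mdo_token("<!-- note -->") gives MDO_COM. Reading the complete keyword before the table lookup is what keeps a keyword that is a prefix of another from matching too early.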

The MSC ("]]") and MDC delimiters are also joined and form the token TOK_MSC_MDC.

To work around an error in the Standard, the token TOK_PERODEF is defined, which corresponds to the PERO ("%") delimiter when it occurs in a parameter entity declaration. See the description of the function pero() in the lexical analyser.

Inside the SGML declaration all keywords are recognized by the lexical analyser and returned as tokens. This defines the tokens:

SGML_APPINFO   SGML_BASESET   SGML_CAPACITY  SGML_CHARSET
SGML_CONCUR    SGML_CONTROLS  SGML_DATATAG   SGML_DELIM
SGML_DESCSET   SGML_DOCUMENT  SGML_ENTITY    SGML_EXPLICIT
SGML_FEATURES  SGML_FORMAL    SGML_FUNCTION  SGML_GENERAL
SGML_IMPLICIT  SGML_INSTANCE  SGML_LCNMCHAR  SGML_LCNMSTRT
SGML_LINK      SGML_MINIMIZE  SGML_NAMECASE  SGML_NAMES
SGML_NAMING    SGML_NO        SGML_NONE      SGML_OMITTAG
SGML_OTHER     SGML_PUBLIC    SGML_QUANTITY  SGML_RANK
SGML_RE        SGML_RS        SGML_SCOPE     SGML_SHORTREF
SGML_SHORTTAG  SGML_SHUNCHAR  SGML_SGMLREF   SGML_SIMPLE
SGML_SPACE     SGML_SUBDOC    SGML_SWITCHES  SGML_SYNTAX
SGML_UCNMCHAR  SGML_UCNMSTRT  SGML_UNUSED    SGML_YES

There is a special token TOK_NOD, which denotes the undefined token. It is returned by various functions when no appropriate token can be found.

The token TOK_CONREF is used when an element occurs with a filled in CONREF attribute. When this occurs, the content of the element is empty (see Standard Annex B, page 86). TOK_CONREF is put into the input stream by the parser to avoid recognition of the content of the element.

Starttag and endtag recognition

Starttags are special to the lexical analyser. Whenever the lexical analyser recognizes a STAGO ("<") delimiter, it scans the input until the end of the starttag is found. The end can be marked by a TAGC (">"), NET ("/"), STAGO or ETAGO ("</"). The lexical analyser returns the complete starttag as one token. The lexical analyser reads the endtag in the same way. The names of the tokens are constructed from the names of the generic identifiers of the elements. If, for example, the DTD is:

<!doctype DOC [
<!element DOC    - - (A, B)>
<!element A      - - (#PCDATA)>
<!element B      - - (#PCDATA)>
]>

then the tokens for the starttags of DOC, A and B are respectively ST_DOC, ST_A, ST_B. The tokens for the endtags are respectively END_DOC, END_A, and END_B. The attributes that belong to a starttag are read and stored. See ``att_par.c'' for a description of the attribute storage.
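
The naming scheme for tag tokens can be sketched as a prefix glued to the generic identifier. The helper tag_token_name() below is hypothetical and only illustrates the scheme; folding the name to upper case is an assumption, and the way the parser actually generates and stores these names may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Write prefix followed by gi into buf and fold the result to upper case.
 * Returns 0 on success, or -1 if the name does not fit in size bytes. */
int tag_token_name(char *buf, size_t size, const char *prefix, const char *gi)
{
    size_t i;
    int n;

    assert(buf != NULL && prefix != NULL && gi != NULL);

    n = snprintf(buf, size, "%s%s", prefix, gi);
    if (n < 0 || (size_t)n >= size)
        return -1;  /* output error or truncated name */

    for (i = 0; buf[i] != '\0'; i++)
        buf[i] = (char)toupper((unsigned char)buf[i]);
    return 0;
}
```

Called with the prefix "ST_" and the generic identifier "DOC", the helper fills the buffer with ST_DOC; with the prefix "END_" and "A" it produces END_A.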