Document revision date: 19 July 1999
The Table-Driven Finite-State Parser routine is a general-purpose, table-driven parser implemented as a finite-state automaton, with extensions that make it suitable for a wide range of applications. It parses a string and returns a message indicating whether or not the input string is valid.
Note
No support for arguments passed by 64-bit address reference or the use of 64-bit descriptors is planned for LIB$TPARSE. On Alpha systems, LIB$TABLE_PARSE supports arguments passed by 64-bit address reference and the use of 64-bit descriptors.
LIB$T[ABLE_]PARSE is called with the address of an argument block, the address of a state table, and the address of a keyword table. The input string is specified as part of the argument block.
The LIB$ facility supports the following two versions of the Table-Driven Finite-State Parser:
- LIB$TPARSE --- Available on VAX systems. LIB$TPARSE is available on Alpha systems in translated form. In this form, it is applicable to translated VAX images only.
- LIB$TABLE_PARSE --- Available on VAX and Alpha systems.
LIB$TPARSE and LIB$TABLE_PARSE differ mainly in the way they pass arguments to action routines.
The term LIB$T[ABLE_]PARSE is used here to describe concepts that apply to both LIB$TPARSE and LIB$TABLE_PARSE.
LIB$TPARSE/LIB$TABLE_PARSE argument-block ,state-table ,key-table
OpenVMS usage: cond_value type: longword (unsigned) access: write only mechanism: by value
argument-block
OpenVMS usage: unspecified type: unspecified access: modify mechanism: by reference
LIB$T[ABLE_]PARSE argument block. The argument-block argument contains the address of this argument block.
The LIB$T[ABLE_]PARSE argument block contains information about the state of the parse operation. It is a means of communication between LIB$T[ABLE_]PARSE and the user's program. It is passed as an argument to all action routines.
You must declare and initialize the argument block. Section 1.4 describes the argument block in detail. Section 2.2 illustrates the coding for an argument block declaration and discusses its initialization.
LIB$T[ABLE_]PARSE supports the following argument blocks:
- A 32-bit argument block that accommodates longword addresses, values, and input tokens on both VAX and Alpha systems.
On Alpha systems, this argument block also accommodates a numeric token whose binary representation is less than or equal to 2**64.
- A 64-bit argument block that accommodates quadword addresses, values, and input tokens on Alpha systems.
state-table
OpenVMS usage: unspecified type: unspecified access: read only mechanism: by reference
Starting state in the state table. The state-table argument is the address of this starting state. Usually, the name appearing as the first argument of the $INIT_STATE macro is used.
You must define the state table for your parser. LIB$T[ABLE_]PARSE provides macros in the MACRO and BLISS languages for this purpose. Section 1.3 describes these macros.
key-table
OpenVMS usage: unspecified type: unspecified access: read only mechanism: by reference
Keyword table. The key-table argument is the address of this keyword table. This name must be the same as the name that appears as the second argument of the $INIT_STATE macro.
You need only assign a name to the keyword table. The LIB$T[ABLE_]PARSE macros allocate and define the table. See Section 4 for more information about the keyword table.
The following sections explain in detail how LIB$T[ABLE_]PARSE works and how to call it from both the MACRO assembly language and high-level languages:
- How LIB$T[ABLE_]PARSE Works --- Describes the data structures used by LIB$T[ABLE_]PARSE and how LIB$T[ABLE_]PARSE operates on them.
- Coding and Using a Simple State Table --- Explains how to construct and use a simple state table.
- Using Advanced LIB$T[ABLE_]PARSE Features --- Explains how to use subexpressions, abbreviations, action routines, and other advanced features.
- Data Representation --- Includes information for the low-level-language programmer, such as the binary representation of state table data.
1 How LIB$T[ABLE_]PARSE Works
LIB$T[ABLE_]PARSE analyzes an input string according to a set of states
and transitions presented in a state table you define. It determines
whether the input string is valid according to the rules you define for
the input language.
There are three parts to any parsing operation: the alphabet, the state table, and the argument block.
1.1 Overview
Before discussing the alphabet, the state table, and the argument block
in detail, this section provides an overview of how these three parts
work together.
1.1.1 Evaluating the Input String
LIB$T[ABLE_]PARSE evaluates the input string from left to right as it
transitions from state to state. For a particular transition in a
particular state, it evaluates the beginning of the unprocessed part of
the input string against the symbol type you specify for the transition
to determine whether there is a match.
LIB$T[ABLE_]PARSE compares each character of the remaining input string, from left to right, against the transition's symbol type until it encounters a character in the input string that does not match. It takes the substring that matches the symbol type and stores a pointer to it in the argument block as the current token. In this way, any character in the input string that does not belong to the symbol type's constituent character set effectively becomes a separator.
If LIB$T[ABLE_]PARSE finds a match, it executes the transition.
If the input string does not match, LIB$T[ABLE_]PARSE attempts to match the next transition. It performs the comparison using the transitions in the order in which you define them for the state.
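The matching behavior described above can be modeled in a few lines of Python. This is only a sketch of the idea, not the LIB$T[ABLE_]PARSE implementation: the names `SYMBOL_TYPES`, `match_token`, and `first_matching_transition` are hypothetical, and the two character sets are simplified stand-ins for TPA$_STRING and TPA$_DECIMAL (ASCII only, no range check on numbers).

```python
import string

# Hypothetical stand-ins for two symbol types. Each maps to the set of
# characters that may appear in a matching token (ASCII only).
SYMBOL_TYPES = {
    "STRING": set(string.ascii_letters + string.digits),  # modeled on TPA$_STRING
    "DECIMAL": set(string.digits),                        # modeled on TPA$_DECIMAL
}

def match_token(remaining, symbol_type):
    """Return the longest leading substring made only of the type's characters.

    Scanning stops at the first character outside the set, so that
    character acts as a separator. An empty result means "no match".
    """
    allowed = SYMBOL_TYPES[symbol_type]
    end = 0
    while end < len(remaining) and remaining[end] in allowed:
        end += 1
    return remaining[:end]

def first_matching_transition(remaining, transitions):
    """Try the transitions in definition order; return (type, token) or None."""
    for symbol_type in transitions:
        token = match_token(remaining, symbol_type)
        if token:
            return symbol_type, token
    return None
```

Because transitions are tried in order, listing the more general type first hides the more specific one. In this model, `first_matching_transition("42+x", ["STRING", "DECIMAL"])` selects the STRING transition, since digits also belong to the alphanumeric set.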
1.1.2 Executing a Transition
When LIB$T[ABLE_]PARSE finds a match with a transition, it performs the
following steps:
1.1.3 Exiting LIB$T[ABLE_]PARSE
LIB$T[ABLE_]PARSE continues to match and execute transitions from state
to state until one of the following occurs:
LIB$T[ABLE_]PARSE generates no signals and establishes no condition handler; action routines can signal through LIB$T[ABLE_]PARSE back to the calling program.
When LIB$T[ABLE_]PARSE cannot successfully parse the entire string, it defines the current token, as follows, and stores it in the argument block before returning:
1.2 Alphabet of LIB$T[ABLE_]PARSE
The LIB$T[ABLE_]PARSE alphabet consists of a set of symbol types
defined in Table lib-9. This alphabet includes strings made up of
elements of the ASCII character set. It provides all the basic building
blocks needed for constructing a grammar using the ASCII character set.
The alphabet also includes symbol types that represent the more complex
constructions found in programming and command language grammar.
Use the symbol types that make up the LIB$T[ABLE_]PARSE alphabet to define a vocabulary and grammar for your language. For each transition you define, you specify one of the alphabet symbol types. LIB$T[ABLE_]PARSE compares the characters at the beginning of the remaining input string with the symbol type of each possible transition. If LIB$T[ABLE_]PARSE finds a match, it enters the state specified by that transition.
Symbol Type | Characters Matched |
---|---|
'x' | The particular ASCII character. In a state table, it is expressed by enclosing the character in single quotation marks. The character can be any member of the 8-bit ASCII code set. LIB$T[ABLE_]PARSE does not treat uppercase and lowercase alphabetic characters as equivalent, nor codes that differ in bit 7. |
TPA$_ANY | Any single character. |
TPA$_ALPHA | Any alphabetic character, which includes the DEC multinational character set. |
TPA$_DIGIT | Any numeric character, that is, 0 through 9. |
TPA$_STRING | Any string of one or more alphanumeric characters, that is, uppercase or lowercase A through Z, and the numeric characters 0 through 9. The string can be any length. It is bounded on the right by the first nonalphanumeric character or by the end of the string. |
TPA$_SYMBOL | Any string of one or more characters of the standard OpenVMS symbol constituent set, that is, uppercase and lowercase A through Z and all DEC multinational characters, in addition to the dollar sign ($) and the underscore (_). The string is bounded on the right by some character not in the symbol constituent set (usually a blank) or by the end of the string. |
'keyword' | The string of characters enclosed in single quotation marks. A keyword can consist of one or more characters of the OpenVMS symbol constituent set, that is, uppercase and lowercase A through Z, the numeric characters 0 through 9, the dollar sign ($), and the underscore (_). Uppercase and lowercase alphabetics are treated as different characters. A state table can contain up to 220 keywords. The keyword is bounded on the right by a character not in the symbol constituent set or by the end of the string. Keywords that are one character in length are expressed in the form 'x*' to distinguish them from the single-character symbol ('x'). They must be differentiated because they are not the same in operation. For example, in the input string AB+C, the single character 'A' would match the first character of this string, whereas the keyword 'A*' would not, because B in the string is in the symbol constituent set. |
TPA$_BLANK | Any string of one or more blanks and/or tabs. |
TPA$_OCTAL | Any octal number (that is, any string of one or more numeric characters 0 through 7) whose magnitude is less than 2**32 for a 32-bit argument block or less than 2**64 for a 64-bit argument block. |
TPA$_DECIMAL | Any decimal number (that is, any string of one or more numeric characters 0 through 9) whose magnitude is less than 2**32 for a 32-bit argument block or less than 2**64 for a 64-bit argument block. |
TPA$_HEX | Any hexadecimal number (that is, any string of one or more numeric characters 0 through 9, A through F) whose magnitude is less than 2**32 for a 32-bit argument block or less than 2**64 for a 64-bit argument block. |
(Alpha specific) TPA$_OCTAL_64 | Any octal number (that is, any string of one or more numeric characters 0 through 7) whose magnitude is less than 2**64. |
(Alpha specific) TPA$_DECIMAL_64 | Any decimal number (that is, any string of one or more numeric characters 0 through 9) whose magnitude is less than 2**64. |
(Alpha specific) TPA$_HEX_64 | Any hexadecimal number (that is, any string of one or more numeric characters 0 through 9, A through F) whose magnitude is less than 2**64. |
TPA$_FILESPEC | Any string that constitutes a valid OpenVMS file specification. The string is bounded on the right by the first character that either is not a file specification constituent character or would cause the string to violate the syntax rules of a file specification. |
TPA$_NODE | Matches a full node specification including the double colon (::). |
TPA$_NODE_ACS | Matches a primary node specification including the access control string, if any, but not the double colon (::). |
TPA$_NODE_PRIMARY | Matches a primary node specification excluding both the access control string, if any, and the double colon (::). |
TPA$_UIC | Any string that constitutes a valid OpenVMS numerical UIC specification, bounded by square brackets or angle brackets. The binary value of the UIC, converted in octal radix, is placed in the argument block. The wildcard character (*) is permitted in the group and/or member fields; its presence results in that field being set to its largest possible value in the binary representation. |
TPA$_IDENT | Any string that constitutes a valid OpenVMS identifier. Identifiers may be given as numerical UICs according to the rules for TPA$_UIC, or as alphabetic identifier names that appear in the system's rights database. The binary value of the identifier, converted in either octal or hexadecimal radix or by lookup in the system rights database, is placed in the argument block. Identifiers can be entered in any of the following forms: [n,m] <n,m> You can use a wildcard (*) in place of any occurrence of number or name in an identifier form. |
TPA$_LAMBDA | The empty string (always matches). As it executes the transition, LIB$T[ABLE_]PARSE does not remove any characters from the input string. LAMBDA transitions are useful in getting action routines called under otherwise awkward circumstances, providing unconditional GOTOs to link portions of a state table together, and providing default actions in certain cases. |
TPA$_EOS | The end of the input string. |
state label | The label of a state that functions as a subexpression. A subexpression is analogous to a subroutine within the state table. The subexpression facility permits complex syntactic constructs that appear in many places in grammar to appear only once in the state table. It also permits a degree of nondeterministic or pushdown parsing with a parser that is otherwise deterministic and finite-state. See Section 3.5 for detailed information about subexpressions and examples of their use. |
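The difference between the single-character symbol 'A' and the one-character keyword 'A*', described in the keyword entry of the table, can be illustrated with a small Python model. This is a sketch under stated assumptions: `match_char` and `match_keyword` are hypothetical names, the constituent set is limited to the ASCII characters listed for keywords, and keyword abbreviation is not modeled.

```python
import string

# Symbol constituent set as listed for keywords: ASCII letters, digits,
# "$" and "_". Multinational characters are left out of this model.
SYMBOL_CHARS = set(string.ascii_letters + string.digits + "$_")

def match_char(remaining, char):
    """Model of a single-character transition such as 'A'."""
    return remaining.startswith(char)

def match_keyword(remaining, keyword):
    """Model of a keyword transition (pass 'A' here for the keyword 'A*').

    The keyword must appear at the start of the input, case-sensitively,
    and be followed by a non-constituent character or the end of string.
    """
    if not remaining.startswith(keyword):
        return False
    rest = remaining[len(keyword):]
    return rest == "" or rest[0] not in SYMBOL_CHARS
```

With the input string `"AB+C"`, `match_char` accepts `"A"`, while `match_keyword` rejects `"A"` and accepts `"AB"`, because the plus sign is outside the symbol constituent set and so bounds the keyword.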
By default, LIB$T[ABLE_]PARSE treats blanks (defined to be either spaces or tabs) as though they belong to no symbol type constituent set. Effectively, this makes the blank a separator. LIB$T[ABLE_]PARSE begins its next comparison with the first nonblank character following the blanks. To have LIB$T[ABLE_]PARSE evaluate a blank as it would any other character in the input string, set the TPA$V_BLANKS flag in the argument block. Section 3.2 provides an example of the use of this flag.
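The default blank handling can be pictured as a step that advances past spaces and tabs before each comparison. The following Python sketch models only that step. The function name `next_compare_position` and the `blanks_significant` parameter are hypothetical, with the parameter standing in for the TPA$V_BLANKS flag.

```python
def next_compare_position(text, pos, blanks_significant=False):
    """Return the index at which the next comparison would begin.

    blanks_significant=True models the TPA$V_BLANKS flag being set:
    blanks are then left in place and compared like any other character.
    """
    if blanks_significant:
        return pos
    # Default behavior: skip over any run of spaces and tabs.
    while pos < len(text) and text[pos] in " \t":
        pos += 1
    return pos
```

In this model, a grammar that needs to see blanks (for example, one that uses TPA$_BLANK transitions) would run with `blanks_significant=True`.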
1.3 State Tables
This section describes state table generation and the macros used to
construct state tables. Section 2 explains how to use these
macros.
The state table must be set up using either MACRO or BLISS. Everything else, including any action routines, can be coded in the language of your choice. Simply compile the state table separately, then link it with your program.
The body of the state table consists of one or more states, each of which defines one or more transitions to the same or other states. The order of the states and the order of the transitions for each state are important:
The list of symbol types does not include subexpression calls, because the generality of these calls depends on the symbol types recognized within the subexpression. If you use action routines to reject certain transitions, you can change where a symbol type is placed in this order. In any case, LIB$T[ABLE_]PARSE executes the first transition listed in a state that you permit to match the leftmost portion of the remaining input string.
1.3.1 MACRO State Table Generation Macro Calls
The OpenVMS system MACRO library contains a set of assembler macros
that allow convenient and readable coding of a LIB$T[ABLE_]PARSE state
table. These macros generate symbol definitions and tables. They do not
produce any executable code or routine calls.
There are four MACRO state table generation macros: $INIT_STATE, $STATE, $TRAN, and $END_STATE.
A state table begins with a call to $INIT_STATE and ends with a call to $END_STATE. Within the state table, define each state by a call to $STATE immediately followed by as many calls to $TRAN as you need to define the transitions from that state.
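The nesting of states and transitions can be pictured with a small Python model, in which each dictionary key plays the role of a $STATE call and each tuple in its list plays the role of a $TRAN call. This is a simplified sketch, not the table format the macros generate: the state names, the `"EXIT"` target, the `"WORD"` symbol, and the functions `match` and `parse` are all assumptions made for illustration, and action routines, masks, and blank skipping are left out.

```python
import string

LETTERS = set(string.ascii_letters)

def match(symbol, text, pos):
    """Return the matched length, or None if the symbol does not match."""
    if symbol == "EOS":                      # end of input, like TPA$_EOS
        return 0 if pos == len(text) else None
    if symbol == "WORD":                     # hypothetical: one or more letters
        end = pos
        while end < len(text) and text[end] in LETTERS:
            end += 1
        return end - pos if end > pos else None
    # Any other symbol is treated as a literal single character.
    return 1 if text.startswith(symbol, pos) else None

# Each state lists its transitions in order: (symbol, target state).
STATE_TABLE = {
    "START": [("WORD", "AFTER_WORD")],
    "AFTER_WORD": [(",", "START"), ("EOS", "EXIT")],
}

def parse(text, table=STATE_TABLE, start="START"):
    """Return True if text is a comma-separated list of words."""
    state, pos = start, 0
    while state != "EXIT":
        for symbol, target in table[state]:
            length = match(symbol, text, pos)
            if length is not None:
                pos += length            # consume the matched token
                state = target           # move to the transition's target
                break
        else:
            return False                 # no transition in this state matched
    return True
```

In this model, `parse("A,BC,d")` succeeds, while `parse("A,,B")` fails in the START state because no transition there matches a comma.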
1.3.1.1 $INIT_STATE---Initializes the LIB$T[ABLE_]PARSE Macros
The $INIT_STATE macro declares the beginning of a state table. It
initializes the internals of the table generator macros and declares
the locations of the state table and the keyword table:
Section 4 provides specific information on the allocation and binary representations of the state table and the keyword table. This information may be useful in debugging your program.
$INIT_STATE state-table ,key-table
state-table
The name assigned to the state table. LIB$T[ABLE_]PARSE equates this label to the start of the first state in the state table.
key-table
The name assigned to the keyword table. LIB$T[ABLE_]PARSE equates this label to the start of the keyword table.
You must supply both the address of the state table and the address of the keyword table in the call to LIB$T[ABLE_]PARSE to perform a parse. The $INIT_STATE macro can appear more than once in a program. Each occurrence defines a separate state table. No part of any state table can refer to part of any other state table.
1.3.1.2 $STATE---Defines a State
The $STATE macro declares the beginning of a state.
$STATE [label]
label
An optional label for the state. LIB$T[ABLE_]PARSE equates the label, if present, to the starting address of the state.
1.3.1.3 $TRAN---Defines a State Transition
The $TRAN macro defines a transition from the state in which it is
defined to some other (or to the same) state. The arguments of the
macro define, among other things, the symbol type that causes the
transition to be executed, the state to which to transfer, and the
action routine to call, if any. The transition defined by a $TRAN macro
belongs to the state defined by the last preceding $STATE macro.
$TRAN type [,label] [,action] [,mask] [,msk-adr] [,argument]
type
The symbol type, taken from the LIB$T[ABLE_]PARSE alphabet, that is recognized by this transition. The transition is taken if the characters from the beginning of the remaining input string match the specified symbol type.