Lexical Rules

 

Overview

The parseLib provides both a default lexical analyzer, for those who do not want to write lexical rules, and a rule-based lexical sub-language for those who need to specify their own lexical rules. The lexical rule facility is feature rich, with BNF notation, argument passing, repeating rules, etc. The main purpose of the lexical analyzer is to parse the input string into a vector of feature rich tokens (usually the tokens are Structures, with the Structure attributes representing the features). This vector of feature rich tokens is then passed to the syntax analyzer.

It is possible to use the lexical analyzer for one's own purposes. One might eliminate the syntax and semantic rules, letting the lexical rules produce the final output; or, one might eliminate the lexical rules entirely so the resulting compiler becomes a stand-alone syntax analyzer or semantic analyzer.

This section contains initial working notes on the design of the lexical definition sub-language. The lexical definition sub-language is a combination of the lexical analysis ideas in [1.3.3] and the feature based grammar ideas in [2.7]. The main theme of this section is: input string in → vector of feature rich tokens out.

$ch

The $ch variable contains the current input character location, at the start of this lexical rule, during lexical analysis. The $ch variable can be used in connection with user defined condition rules, action rules, and in user defined functions, for example:

(writeln (mid $IN $ch 20))

The Lisp expression, shown above, will display the 20 characters, from the input source string, starting at the current character position.
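In Python terms, the same lookahead is simply a slice of the input string (a sketch for illustration; `mid` here is a hypothetical stand-in for the Lisp mid function, not the parseLib API):

```python
def mid(s: str, start: int, count: int) -> str:
    """Return `count` characters of `s` beginning at `start`,
    mirroring the (mid $IN $ch 20) lookahead idiom."""
    return s[start:start + count]

# Peek at the next 20 characters from the current position.
source = "NUMBER: Sign? Digit+ Period? Digit+"
print(mid(source, 8, 20))  # → Sign? Digit+ Period?
```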

$this

The $this variable contains the current input character, at each invocation of a lexical rule with the + and * BNF operators, during lexical analysis. The $this variable can be used in connection with user defined condition rules, for example:

MAIN: Any{(isNumber $this)}*

$IN

The $IN variable contains the current input to the new compiler. The $IN variable can be used in connection with user defined condition rules, action rules, and in user defined functions, for example:

(writeln (mid $IN $ch 20))

The Lisp expression, shown above, will display the 20 characters, from the input source string, starting at the current character position.

$LIST

The $LIST variable contains the current vector of feature rich tokens produced by the lexical analyzer rules. At the beginning of the lexical phase, the $LIST variable is an empty vector. At the end of the lexical phase, the $LIST variable should contain the feature rich tokens recognized by the lexer. The $LIST variable is usually returned by the final rule in the lexer, for example:

FINALRULE: <conditions> :: $LIST ::


The final lexical rule, shown above, returns a vector of feature rich tokens to the new compiler for use in the syntax analysis phase.

$OUT

The $OUT function provides a convenient way to put feature rich tokens, produced by the lexical analyzer rules, into the $LIST vector, for example:

NAME: Letter AlphaNum* :: ($OUT $ch (append $1 $2) Name: true) ::

The NAME lexical rule, shown above, recognizes any name beginning with a letter and continuing with none or more letters or digits. The $OUT function works in connection with the Syntax Features defined in the syntax analyzer (remember that the syntax analyzer is the customer of the lexical analyzer).

The $OUT function must receive two or more arguments. The first argument is the character position, in the input string $IN, which relates to this token (tokens without a Charpos feature will cause the generated compiler to abort in the syntax analyzer). The second argument is the Value feature of the token (tokens without a Value feature will cause the generated compiler to abort in the syntax analyzer).

The remaining arguments to the $OUT function (if any) must be in pairs (unpaired arguments will cause the generated compiler to abort in the lexical analyzer). Each argument pair begins with the feature name (feature names must begin with an uppercase letter and must contain one or more lowercase letters), and is followed by the feature's value.

The $OUT function uses the Value argument as a key into the directory of Syntax Features (see the Syntax chapter). If found, the remaining arguments beyond Charpos and Value are ignored and the new token defaults to the features defined in the Syntax Features directory; otherwise, the remaining arguments are added to the token as additional features.
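The $OUT contract described above can be sketched in Python (hypothetical names throughout; parseLib itself implements this in its own Lisp, and the real Syntax Features directory is defined in the Syntax chapter):

```python
# Stand-in for the directory of Syntax Features.
SYNTAX_FEATURES = {"+": {"Operator": True}}

def out(token_list, charpos, value, *features):
    """Append a feature rich token to the token vector.
    Extra arguments must come in (name, value) pairs."""
    if len(features) % 2 != 0:
        raise ValueError("unpaired feature arguments")  # lexer would abort
    token = {"Charpos": charpos, "Value": value}
    if value in SYNTAX_FEATURES:
        # Known Value: features default to the directory entry; extras ignored.
        token.update(SYNTAX_FEATURES[value])
    else:
        # Unknown Value: the caller's feature pairs are added to the token.
        token.update(dict(zip(features[0::2], features[1::2])))
    token_list.append(token)
    return token

tokens = []
out(tokens, 0, "myVar", "Name", True)  # unknown Value keeps the Name: true pair
out(tokens, 6, "+")                    # known Value gets Operator: true from the directory
```

The $ASIS variant described next behaves the same way except that the directory lookup is skipped and the caller's feature pairs are always used.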

$ASIS

The $ASIS function provides a convenient way to put feature rich tokens, produced by the lexical analyzer rules, into the $LIST vector, for example:

NAME: Letter AlphaNum* :: ($ASIS $ch (append $1 $2) Name: true) ::

The NAME lexical rule, shown above, recognizes any name beginning with a letter and continuing with none or more letters or digits. The $ASIS function IGNORES the Syntax Features defined in the syntax analyzer (remember that the syntax analyzer is the customer of the lexical analyzer).

The $ASIS function must receive two or more arguments. The first argument is the character position, in the input string $IN, which relates to this token (tokens without a Charpos feature will cause the generated compiler to abort in the syntax analyzer). The second argument is the Value feature of the token (tokens without a Value feature will cause the generated compiler to abort in the syntax analyzer).

The remaining arguments to the $ASIS function (if any) must be in pairs (unpaired arguments will cause the generated compiler to abort in the lexical analyzer). Each argument pair begins with the feature name (feature names must begin with an uppercase letter and must contain one or more lowercase letters), and is followed by the feature's value.

The $ASIS function IGNORES the directory of Syntax Features (see the Syntax chapter); the user supplied feature arguments are always added to the token as features, exactly as given.

Default Lexical Analyzer

The parseLib provides a default lexical analyzer, for those who do not want to write lexical rules. The default lexical analyzer attempts to parse the input stream into names, numbers, operators, and special characters. The main purpose of the default lexical analyzer is to parse the input string into a vector of feature rich tokens. This vector of feature rich tokens is then passed to the syntax analyzer. If there are no Lexical Rules specified, then the default lexical analyzer is in effect.

Attributing Tokens

The default lexical analyzer automatically converts the list of lexical tokens, output from the lisp function, into a list of attributed structures. The attributed structures are said to give the tokens features. At the very minimum, each token is converted into a structure with a Value attribute set equal to the original token.

t1 #{Value: t1}

Where t1 is a lexical token output from the lisp function, and #{Value: t1} is a structure containing at minimum the attribute Value. Obviously an attributed token may contain more than one feature. A token may contain the features of Verb, Noun, etc as follows.

t1 #{Verb: true Noun: true Value: t1}

The classes of lexical tokens which are automatically recognized by the system, and which attributes they are given, is defined in the Syntax section. In the event that a lexical token cannot be automatically recognized by the system, the token is passed to the default token rule for attributing.

Note: All attribute names must begin with an uppercase character and must contain at least one non uppercase character.

Default Tokens

The vocabulary of lexical tokens which are automatically recognized by the system, and which attributes they are given, is defined in the Syntax section of the parseLib definition language. In the event that a lexical token is not defined in the Syntax section, it cannot be automatically attributed by the system. All undefined lexical tokens are sent to the default token rule for attributing.

t1 (defaultTokenRule t1 )

defaultTokenRule

The default token rule is a user defined child Lambda of the resulting compiler, and can be overridden in the #UserFunction# section of the %DEFINITION file.

If the default token rule is not specified, the generated compiler will output a simple structure, with a Value attribute, for every unrecognized token encountered. The default token rule is a user defined child Lambda, of the resulting compiler, which accepts one argument and always returns a structure as in the following example:

  
;; Default token attributing function.
(defchild theparent:defaultTokenRule(token)
   pvars:(tokenDirectory)
   vars:(ret)
   (cond
     ;; Recognize numeric tokens.
     ((isNumber token)
      (setq ret (new Structure: Number: true)))
     ;; Recognize name tokens.
     ((and (isSymbol token) (isCharAlphanumeric token))
      (setq ret (new Structure: Name: true)))
     ;; Recognize operator tokens.
     ((isSymbol token)
      (setq ret (new Structure: Operator: true)))
     ;; Output an error token.
     (else
      (setq ret (new Structure: Error: "Illegal token")))
     ) ; end cond
   ret) ; end defaultTokenRule

defaultLexer.turnFractionsOnOff

The turnFractionsOnOff function allows the user, usually as a part of his/her startRule code, to tell the lexical analyzer whether or not to recognize fractions. If fraction recognition is turned off, the following lexical sequence:

23.456

will be recognized as one integer and a period symbol followed by another integer.
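The two behaviors can be sketched in Python with a toggle over the token pattern (a hypothetical illustration; `lex_numbers` is not a parseLib function):

```python
import re

def lex_numbers(src, fractions=True):
    """Minimal sketch of the turnFractionsOnOff behavior: with fractions
    on, '23.456' is one number; with them off, it is integer '.' integer."""
    pattern = r"\d+\.\d+|\d+|\." if fractions else r"\d+|\."
    return re.findall(pattern, src)

print(lex_numbers("23.456", fractions=True))   # ['23.456']
print(lex_numbers("23.456", fractions=False))  # ['23', '.', '456']
```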

defaultLexer.lowerCaseSW

The lowerCaseSW variable allows the user, usually as a part of his/her startRule code, to tell the lexical analyzer whether or not to convert all recognized names to lower case. If lower case conversion is turned on, the following lexical sequence:

HeLp

will be converted into the following sequence:

help

defaultLexer.keepWhitespaceSW

The keepWhitespaceSW variable allows the user, usually as a part of his/her startRule code, to tell the lexical analyzer whether or not to keep all recognized whitespace strings. If keep whitespace is not turned on, all delimited strings whose names begin with "Whitespace" will be thrown away.

Delimited Strings

The default lexical analyzer supports user defined delimited strings in the compiler definition as follows:

#DelimitedStrings#
Whitespace: {;} _eol
String: {"} {"}
#End#

The default lexical analyzer allows any number of valid delimited string definitions between the #DelimitedStrings# header and the #End# terminator. User defined delimited strings are attributed with the specified name preceding the start and end delimiters, and with Constant = true.

Note: The Whitespace name has a special meaning to the compiler (see the keepWhitespaceSW variable shown above). All delimited strings with names beginning with Whitespace (i.e. Whitespace, Whitespace1, Whitespace2, etc.) are ignored by the defaultLexer function and are removed from the parse tree. This means the _getToken function will never see them and user defined rules do not have to take whitespace tokens into consideration.

Note: The Delimited Strings definitions are only meaningful if the default lexical analyzer IS in effect. If there are any lexical rules defined, the delimited strings declarations are ignored.
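As a rough illustration of how a delimited string declaration might be applied, here is a Python sketch (`scan_delimited` is a hypothetical helper, not part of parseLib):

```python
def scan_delimited(src, pos, start, end):
    """If `src[pos:]` begins with `start`, return (text, new_pos) covering
    the delimited string up to and including `end`; otherwise return None.
    Sketches the #DelimitedStrings# behavior, e.g. String: {"} {"}."""
    if not src.startswith(start, pos):
        return None
    close = src.find(end, pos + len(start))
    if close < 0:
        return None                       # unterminated delimited string
    close += len(end)
    return src[pos:close], close

# Apply the String: {"} {"} definition from the example above.
text, nxt = scan_delimited('say "hello" now', 4, '"', '"')
print(text)  # "hello"
```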

Lexical Features

Collections of single characters can be grouped into sets to form lexical features as follows:

#LexicalFeatures#
Digit: [ |"0"-"9"| ]
Alphanum: [ |a-z| |A-Z| |"0"-"9"| ]
Name: [ |a-z| |A-Z| |"0"-"9"| _ ]
CharData: [ |0-255| ~ < > "[" 34 ]
#End#

In the example, shown above, a character has the feature "Digit" iff it is a 0 through a 9. Similarly a character has the feature "Alphanum" iff it is a letter or a digit. Specifying lexical features allows one to check for whole sets of characters with a single test in the lexical rules, for example:

NAME: Letter AlphaNum* :: ($OUT $ch (append $1 $2) Name: true) ::

The NAME lexical rule, shown above, recognizes any name beginning with a letter and continuing with none or more letters or digits.

Numbers

Numbers, specified in a lexical feature declaration, are always considered to be ascii character codes, not ascii characters (one must surround a number with quotes, as in "8", if one wishes to declare the character value), for example:

LineFeed: [ 10 ]

Not (~)

The not symbol ~, specified in a lexical feature declaration, indicates that all following characters are NOT included in the set, for example:

Whitespace: [ |0-32| ~ 13 ]

The Whitespace feature, shown above, includes all ascii codes between 0 and 32, but does not include the carriage return code.

Ranges

Character ranges can be specified by enclosing two character declarations between vertical bars, for example:

Alphanum: [ |a-z| |A-Z| |"0"-"9"| ]

Quotes

Characters can be quoted whenever they are numbers or special characters, for example:

Digit: [ |"0"-"9"| ]
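The set semantics of ranges, quoted characters, character codes, and the ~ exclusion marker can be sketched in Python (`make_feature` is a hypothetical helper modeling the notation, not the parseLib syntax itself):

```python
def make_feature(*specs):
    """Build a character membership set, sketching the [ ... ] lexical
    feature notation. Each spec is a (lo, hi) range, a single character,
    or an integer character code; every spec after "~" is excluded."""
    members, excluding = set(), False
    for spec in specs:
        if spec == "~":
            excluding = True              # all following specs are excluded
            continue
        if isinstance(spec, tuple):
            lo, hi = (ord(x) if isinstance(x, str) else x for x in spec)
            chars = {chr(c) for c in range(lo, hi + 1)}
        elif isinstance(spec, int):
            chars = {chr(spec)}           # bare numbers are character codes
        else:
            chars = {spec}
        if excluding:
            members -= chars
        else:
            members |= chars
    return members

# Whitespace: [ |0-32| ~ 13 ]  -- codes 0 through 32, minus carriage return
whitespace = make_feature((0, 32), "~", 13)
```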

Lexical Rules

The default lexical analyzer can be overridden by specifying Lexical Rules, which will perform the parsing operation. Reference [1.3.3] derives its rule syntax from regular definitions of the form:

d1 : r1 || c1 || :: a1 ::

d2 : r2 || c2 || :: a2 ::

. . .

dn : rn || cn || :: an ::

Where each di is a rule name, each ri is a rule expression, each ci is a Lisp conditional expression, and each ai is a Lisp action expression. The syntax for Lexical Rule definition is as follows:

#LexicalRules#

NUMBER: Sign? Digit+ Period? Digit+ || (<> (number $2) 0) || :: (number (append $1 $2 $3 $4)) ::

NUMBER: LONGHAND+ :: (number (apply append $1)) ::

 

LONGHAND: Whitespace* << true >>

LONGHAND: "one" :: 1 ::

LONGHAND: "two" :: 2 ::

LONGHAND: "three" :: 3 ::

#End#

The Lisp condition rule is optional. If present, it must be enclosed by the || symbol. The Lisp action rule is mandatory. It must be enclosed by the :: symbol. The rule variable $0 is the default return value initialized by the rule automatically to #void. The rule variables $1 through $9 correspond to the respective token expressions in the rule body.

Note1: All rule names must contain only uppercase characters; they must not contain lowercase characters, numerals, or underscores.

Note2: The $ symbol must not be used in an argument phrase, action, or condition rule anywhere except as a rule variable identifier $0 through $9. If the condition or action rule requires a $ symbol, for instance inside a string constant, place the $ symbol in a user defined function which is called by the argument phrase, action, or condition rule.

BNF Notation

Lexical rule names and lexical feature names, but not constants, may have trailing BNF operators of "*", "+", or "?". For example:

NUMBER: Sign? Digit+ Period? Digit+ || (<> (number $2) 0) || :: (number (append $1 $2 $3 $4)) ::

NUMBER: LONGHAND+ :: (number (apply append $1)) ::

Any lexical rule name and any lexical feature name (other than the special Eof and Nop features) may have trailing BNF operators. The user is required to make sure that the resulting rule does not cause the new compiler to loop endlessly on the input string. The BNF operators have the following meanings:

•        The "*" operator signifies none or more (may cause endless looping if specified inappropriately).

•        The "+" operator signifies one or more.

•        The "?" operator signifies none or one.

Note1: For lexical features, the BNF operators return a character string of the repeated characters appended together.

Note2: For lexical rules, the BNF operators return a vector of each repetition result.
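The difference between the two repetition results can be sketched in Python (hypothetical helpers for illustration only):

```python
def repeat_feature(src, pos, predicate):
    """'*' applied to a lexical feature: matching characters come back
    appended together as a single string (per Note1 above)."""
    out = []
    while pos < len(src) and predicate(src[pos]):
        out.append(src[pos])
        pos += 1
    return "".join(out), pos

def repeat_rule(src, pos, rule):
    """'*' applied to a lexical rule: each repetition's result is
    collected into a vector (per Note2 above)."""
    results = []
    while (matched := rule(src, pos)) is not None:
        value, pos = matched
        results.append(value)
    return results, pos

def digit_pair(src, pos):
    """Toy rule: recognize exactly two digits, or fail with None."""
    pair = src[pos:pos + 2]
    if len(pair) == 2 and pair.isdigit():
        return pair, pos + 2
    return None

print(repeat_feature("123abc", 0, str.isdigit))  # ('123', 3)
print(repeat_rule("1234a", 0, digit_pair))       # (['12', '34'], 4)
```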

 

Argument Passing

User defined rules may be passed arguments. A Lisp argument phrase, enclosed with the ( ) symbol pair, will cause the user defined rule to receive the specified argument. Within a user defined rule definition, the %0 through %9 variables represent any arguments which may have been passed to the rule as follows:

QUALIFY: DotOperator Name QUALIFY( (setq $0.Value (append |ref|: %0 $2)) ) :: $3 ::

QUALIFY: DotOperator Name :: (setq $0 (append |ref|: %0 $2)) ::

TERM: Name QUALIFY($1) :: $2 ::

TERM: Name :: $1 ::

The TERM rule will recognize all input of the form Name.Name.Name ... The rule returns when a Dot Operator no longer qualifies the name.
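The qualification loop performed by the TERM and QUALIFY rules can be sketched in Python (`parse_term` is a hypothetical illustration working over a pre-split token list, with nested ("ref", ...) tuples standing in for the |ref| forms built by the actions above):

```python
def parse_term(tokens):
    """Fold a Name . Name . Name ... token sequence into nested
    ('ref', lhs, rhs) forms, returning when no dot operator follows."""
    pos = 0
    result = tokens[pos]
    pos += 1
    while pos + 1 < len(tokens) and tokens[pos] == ".":
        result = ("ref", result, tokens[pos + 1])  # qualify the name so far
        pos += 2
    return result

print(parse_term(["a", ".", "b", ".", "c"]))  # → ('ref', ('ref', 'a', 'b'), 'c')
```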

Note: The % symbol must not be used in an argument phrase, action, or condition rule anywhere except as a rule variable identifier %0 through %9. If the argument phrase, action, or condition rule requires a % symbol, for instance inside a string constant, place the % symbol in a user defined function which is called by the argument phrase, action, or condition rule.

Iterative Rules

User defined rules may be repeated iteratively. A Lisp action rule, enclosed with the << >> symbol pair, will cause the user defined rule to repeat. The contents of the $0 variable remain intact. The builtin Eof attribute name allows a rule to test for End Of File in the following rule:

SEXPR: Term Operator Term << ($OUT $ch (list $2 $1 $3) BinaryOperator: true) >>

SEXPR: Eof :: $0 ::

The SEXPR rule will recognize all input of the form Term Operator Term ... The rule returns when the End Of File is reached.
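The iteration can be sketched in Python (a hypothetical illustration over a pre-split token list; the real rule emits tokens via $OUT rather than returning a list):

```python
def parse_sexpr(tokens):
    """Sketch of the iterative SEXPR rule: keep matching
    Term Operator Term until end of input (the Eof case), emitting a
    BinaryOperator-style (op, lhs, rhs) entry for each repetition."""
    out_list, pos = [], 0
    while pos < len(tokens):              # SEXPR: Eof stops the iteration
        lhs, op, rhs = tokens[pos:pos + 3]
        out_list.append((op, lhs, rhs))
        pos += 3
    return out_list

print(parse_sexpr(["a", "+", "b", "c", "*", "d"]))
```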

Note: The $n symbol contains the repetition count for the rule. During the first iteration through the rule, the $n variable is set to 1.

Term Conditions

User defined rules may also have user defined conditions attached. A Lisp condition phrase, enclosed with the { } symbol pair, will cause the user defined rule to receive the specified condition. Within a user defined rule condition, the %0 through %9 variables represent any arguments which may have been passed to the rule, while the $0 through $9 variables represent any terms which may have been recognized by the rule.

STRING: Quote{(= $n 1)} << true >>

STRING: Any << (setq $0 (appendList $0 $1)) >>

STRING: Quote{(> $n 1)} :: ($OUT $ch $0 String: true) ::

The STRING rule will recognize all input tokens enclosed within two quotes. The rule returns only when the second quote is recognized. User defined rules may have both argument passing and user defined conditions attached. The user defined condition is always last, as follows.

TERM: NAME(%0){(isLowerCase $1)} :: $1 ::

Note: The % symbol must not be used in an argument phrase, action, or condition rule anywhere except as a rule variable identifier %0 through %9. If the argument phrase, action, or condition rule requires a % symbol, for instance inside a string constant, place the % symbol in a user defined function which is called by the argument phrase, action, or condition rule.

MAIN Rule

The user must define a MAIN rule in the Lexical Rule definitions. The MAIN rule is the rule which the new compiler will invoke to start the lexical analysis phase. If there is no MAIN rule defined, the default lexical analyzer will be in effect.

MAIN: STATEMENT Semicolon << ($OUT $ch $1 Statement: true) >>

MAIN: Eof :: $LIST ::

This sample MAIN rule will recognize all syntax of the form statement; statement; statement; ... The rule returns when the End Of File is reached.

Note: Remember to avoid excessive recursion errors by making strategic rules iterative instead of right recursive.

Special Rule Syntax

Any

If a rule is to accept any token, use the Any attribute. For example:

;; This rule recognizes a plus sign between anything

RULE: Any + Any :: ($OUT $ch (list $2 $1 $3) BinaryOperator: true) ::

Eof

If a rule is to test for end of file, use the Eof attribute. For example:

;; This rule recognizes an end of file condition

RULE: Eof :: $0 ::

Nop

The special Nop attribute always returns a constant token of #void. The Nop attribute is designed to provide a test which always is true, but does not advance the input pointer (i.e. a no-operation rule).

MAIN: STATEMENT Semicolon << ($OUT $ch $1 Statement: true) >>

MAIN: Eof :: $LIST ::

MAIN: Nop :: (error "If we get here we have an invalid token") ::

This sample MAIN rule will recognize all syntax of the form statement; statement; statement; ... However, if the MAIN rule encounters anything else (other than statement;), then an error message will be returned. The rule returns when the End Of File is reached, or an error is generated.

Note: If the Nop test is used, careful ordering of the specified rules is almost always required.

$N

We may use a previously recognized parser variable to indicate a test for equality. For example:

;; These two rules are equivalent

RULE: Any $1 :: $0 ::

RULE: Any Any{(= $2 $1)} :: $0 ::

%N

We may use a previously passed parser argument variable to indicate a test for equality. For example:

;; These two rules are equivalent

RULE: Any %0 :: $0 ::

RULE: Any Any{(= $2 %0)} :: $0 ::