ACSC373 – Compiler Writing

Chapter 5 – Lexical Analysis

Dr. Stephania Loizidou Himona

Introduction

The lexical analyser (or scanner) has to recognise the basic tokens making up the source program input to the compiler.

Tokens are syntactically simple in structure, and so can be recognised by a comparatively simple algorithm.

The lexical analyser relieves the burden on the syntax analyser. If the syntax of the tokens can be expressed in terms of a regular grammar, the construction of a lexical analyser is straightforward.

  • The lexical analyser breaks up the program input into a sequence of tokens – the basic syntactic components of the language.

The lexical analyser is also responsible for decoding the lexical representation of symbols in the source program.

(e.g. it is possible to modify a Pascal compiler to accept the character @ instead of ^ simply by making a minor modification to the lexical analyser).

  • The spelling of reserved words should be of concern only to the lexical analyser (e.g. all Pascal reserved words could be replaced by their Welsh translations just by changing the lexical analyser’s table).

Tokens

The lexical analyser recognises the basic syntactic components of a programming language. For example, a lexical analyser for Pascal would recognise identifiers, reserved words, numerical constants, strings, punctuation symbols and special symbols. These tokens would be passed to the syntax analyser as single values (nature, value or identification).

So, the lexical analyser has to perform the dual task of recognising the token and sometimes evaluating it.

Significant Problem: the need for “look-ahead”

It may be necessary to read several characters of a token before the type of that token can be determined. The degree of lookahead needed depends on the particular token and on the language.

PASCAL – a good language in this respect (most tokens can be distinguished using just one- or two-character lookahead).

e.g. on reading a + character, a Pascal lexical analyser can immediately return the plus symbol to the syntax analyser;

or, a little lookahead is needed to determine that the token 12345.67 is going to be a real rather than an integer (after the digits, a '.' followed by a digit signals a real, whereas '..' would signal a subrange).

FORTRAN – DO 1 I = 1.5

The lexical analyser has to reach the end of the statement before it can determine that it should return the identifier DO1I (blanks are not significant in FORTRAN) rather than the keyword DO starting a DO loop, as in DO 1 I = 1,5.

Furthermore, in some languages, the role of the lexical analyser becomes merged with that of the syntax analyser.

e.g. IF (GT.GT.DO) IF = DO + GT

Here keywords are not reserved: deciding whether a name is a special keyword or an identifier is partially resolved by the syntax analyser offering context information to the lexical analyser when required.

Regular Grammars

Chomsky type 3 (finite-state or regular) grammars have special significance for lexical analysis.

Recall, type 3 grammars have productions of the form:

A → a or A → aB, i.e. syntactically simple structures.

Alternation and repetition can be specified by these grammars, but more complex structures such as balanced parentheses cannot be handled.

Strengths of these grammars: it is possible to construct simple and efficient parsers for them.

If the lexical tokens of a programming language can be defined in terms of a type 3 grammar, then an efficient lexical analyser for a compiler for that language can be constructed simply, derived directly from the set of production rules defining the tokens.

Notation:

Regular Expressions

  • A regular expression consists of symbols (in the alphabet of the language being defined) and a set of operators that allow repetition, concatenation and alternation to be expressed.

In order of precedence (highest first):

(1) – Repetition (*) – zero or more occurrences

(2) – Concatenation (adjacent units)

(3) – Alternation (units separated by |)

Example

  • a b denotes the set of strings {ab} – this set contains just one member
  • a | b denotes {a, b}
  • a* denotes {ε, a, aa, aaa, …}
  • a b* denotes {a, ab, abb, abbb, …}
  • (a | b)* denotes the set of strings made up of zero or more instances of an ‘a’ or a ‘b’
  • (a b | c)* d denotes {d, abd, cd, abcd, ababcd, …}

Two regular expressions are equivalent when they denote the same set of strings, that is, the same language.

e.g. a (b | c) and a b | a c are equivalent.

Regular grammars and regular expressions are equivalent notations.

i.e. given a set of productions defining a regular grammar, it is possible to convert them to an equivalent regular expression by an algorithmic process. Or, similarly, a regular expression can be converted into an equivalent set of production rules.

Regular expressions can be used to specify the syntax of the lexical tokens of a programming language in a clear and concise way.

e.g. the syntax of an integer constant:

digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

sign → + | - | ε

integerconstant → sign digit digit*

i.e. an optional sign followed by one or more digits.

An identifier

Digit  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Letter  a | b | c | … | z |A | B | C | … | Z

Identifier  letter (letter| digit)*

i.e. an initial letter followed by a string of letters or digits

Regular expressions can be represented diagrammatically,

e.g. (a b | c) * d

Finite-State Automata

It is possible to represent a regular expression as a transition diagram, that is, a directed graph having labelled branches

e.g. (a b | c ) * d

The nodes (states) are enclosed in circles labelled with state numbers. Double-circled states are accepting states (i.e. states reached when the expression to be parsed has been successfully recognised). Arrowed lines are the edges or transitions.

[Transition diagram of (a b | c)* d: from start state 1, an a leads to state 2, and a b leads from state 2 back to state 1; a c loops on state 1; a d leads to accepting state 3.]

Parser action (on input a b c d)

  1. Start in state 1
  2. Input a, transition to state 2
  3. Input b, transition to state 1
  4. Input c, transition to state 1
  5. Input d, transition to state 3, an accepting state → the parse succeeds

Another input

abc

  1. Starting in state 1
  2. Input a, transition to state 2
  3. Input c: there are no edges labelled c from state 2, and hence this parse fails.

(Since the parser knows that it is in state 2 when the parse fails, it can output the informative message that it was expecting the input of a b – the label on the only edge emerging from state 2.)

Transition table of (a b | c)* d:

State    a    b    c    d
  1      2    –    1    3
  2      –    1    –    –
  3    (accepting)

A transition table, or a transition diagram, describes a finite-state automaton.

Deterministic finite-state automaton (DFA): for any state, there can only be one possible next state for any given input symbol (i.e. no two edges emerging from a state can be labelled with the same symbol). No transitions can be labelled with ε (the empty string) – all the edges have to be labelled with non-null symbols, e.g. the diagram above.
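As an illustration, a DFA can be driven directly from its transition table. Below is a minimal sketch (Free Pascal style) of a recogniser for (a b | c)* d built from the table above; names such as trans are illustrative, and the input is assumed to contain only the characters a to d:

program DfaDemo;
type
  TState = 0..3;                    (* 0 serves as an error (dead) state *)
var
  trans: array [1..3, 'a'..'d'] of TState;
  s: TState;
  c: char;
  inp: string;
  i: integer;
begin
  (* empty the table, then enter the transitions of (a b | c)* d *)
  for s := 1 to 3 do
    for c := 'a' to 'd' do
      trans[s, c] := 0;
  trans[1, 'a'] := 2;  trans[1, 'c'] := 1;  trans[1, 'd'] := 3;
  trans[2, 'b'] := 1;

  inp := 'abcd';
  s := 1;                           (* start in state 1 *)
  for i := 1 to length(inp) do
    if s <> 0 then
      s := trans[s, inp[i]];        (* one table lookup per character *)

  if s = 3 then                     (* state 3 is the accepting state *)
    writeln('accepted')
  else
    writeln('rejected')
end.

Tracing the input abcd reproduces the parser action listed above: 1 → 2 → 1 → 1 → 3.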

Non-deterministic finite-state automaton (NFA): an input symbol can cause a transition to more than one next state. A state may have two or more edges emerging from it labelled with the same input symbol.

Example

a* b b* b a

[Transition diagram: state 1 loops on a and moves to state 2 on b; state 2 loops on b and also moves to state 3 on b; state 3 moves to accepting state 4 on a.]

Two edges labelled b emerge from state 2 → an NFA.

Its transition table,

State     a      b
  1      {1}    {2}
  2       –     {2, 3}
  3      {4}     –
  4    (accepting)

Sets of states have to be used (no longer single state numbers).

An NFA is somewhat more difficult to implement on a conventional computer than a DFA. However, NFAs have the advantage that they generally require fewer states than an equivalent DFA (one recognising the strings described by the same regular expression).
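An NFA can nevertheless be run on a conventional computer by keeping track of the set of states it could currently be in, exactly as the table suggests. A minimal sketch (Free Pascal style, illustrative names) for the a* b b* b a automaton above, assuming the input contains only a and b:

program NfaDemo;
type
  TNfaState = 1..4;
  TStateSet = set of TNfaState;
var
  nfaTrans: array [TNfaState, 'a'..'b'] of TStateSet;
  current, next: TStateSet;
  s: TNfaState;
  inp: string;
  i: integer;
begin
  (* the NFA transition table of a* b b* b a *)
  nfaTrans[1, 'a'] := [1];  nfaTrans[1, 'b'] := [2];
  nfaTrans[2, 'a'] := [];   nfaTrans[2, 'b'] := [2, 3];
  nfaTrans[3, 'a'] := [4];  nfaTrans[3, 'b'] := [];
  nfaTrans[4, 'a'] := [];   nfaTrans[4, 'b'] := [];

  inp := 'abba';
  current := [1];                   (* start in state 1 *)
  for i := 1 to length(inp) do
  begin
    next := [];
    for s := 1 to 4 do              (* take the union of every possible move *)
      if s in current then
        next := next + nfaTrans[s, inp[i]];
    current := next
  end;

  if 4 in current then              (* did any path reach the accepting state? *)
    writeln('accepted')
  else
    writeln('rejected')
end.

The state sets computed this way ({1}, {2}, {2, 3}, …) are exactly the states of the equivalent DFA, which is the idea behind the conversion mentioned next.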

How to convert an NFA into an equivalent DFA is covered later (elsewhere) – an exercise for you!

Implementing a Lexical Analyser

Writing a lexical analyser for Pascal (the techniques are relevant to a wide range of other languages).

Task of the lexical analyser: to recognise the basic tokens in its input and to return some encoded representation of them to the next part of the compiler.

First Steps:

-To decide on the precise set of tokens the lexical analyser should recognise.

-To choose the notation used to specify the syntax of these tokens, preferably a formal one.

Tokens are identified by an integer value or equivalent, e.g. an enumerated type in Pascal.

A convenient way is to implement the lexical analyser as a procedure or function called by the syntax analyser.

e.g. (a Pascal lexical analyser):

type
  lextoken = (beginsym, endsym, ifsym, dosym, whilesym, periodsym,
              commasym, semicolonsym, …);

var
  token: lextoken;  (* the lexical analyser updates this variable each time it is called *)

procedure NextToken;

NextToken has to read characters from the compiler’s input and return the identity of a single lexical token in the global variable token each time it is called.
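A caller in the syntax analyser might then drive it along these lines (a sketch only; using periodsym as the end marker is merely illustrative):

NextToken;                     (* token now identifies the first symbol *)
while token <> periodsym do
begin
  (* ... syntax analysis driven by the value of token ... *)
  NextToken                    (* advance to the next symbol *)
end;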

The first step: decide on the set of tokens.

Pascal has:

- 21 special symbols
- 35 word symbols or reserved words
- one directive
- a few alternative representations
- plus identifiers and numerical and character constants

The lexical analyser also has:

- to skip comments and non-significant spaces, tabs and newlines
- to handle erroneous input and issue informative error messages

Ex. Think of the corresponding tokens of either the C or C++ language.

The body of NextToken starts as follows:

while (ch = spacechar) or (ch = tabchar) do
  NextCh;  (* skip white space *)
case ch of
  '+': begin token := plussym; NextCh end;
  '-': begin token := minussym; NextCh end;
  (* and so on for all the other single-character tokens *)

(Possibly, use an array or table lookup to return the corresponding token value, as there is a fairly large number of single-character tokens in Pascal.)
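A sketch of that table-lookup variant (symtable is an illustrative name; it would be filled in once during initialisation):

var
  symtable: array [char] of lextoken;

(* during initialisation: *)
symtable['+'] := plussym;
symtable['-'] := minussym;
(* ... one entry per single-character token ... *)

(* inside NextToken, replacing the long case limbs: *)
token := symtable[ch];
NextCh;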

Other special symbols have to be handled more carefully, e.g. <, <= or <>.

'<':
  begin
    NextCh;
    if ch = '=' then
      begin
        token := lesseqsym;
        NextCh
      end
    else if ch = '>' then
      begin
        token := noteqsym;
        NextCh
      end
    else
      token := lesssym
  end;

Comments

Pascal comments are delimited by { … } or (* … *).

When NextToken encounters a { or a (* symbol, it must call NextCh repeatedly until the matching } or *) is found.

One way: remove comments entirely within NextCh. Then

Count {this is a comment} er := 0;

would reach NextToken as

Counter := 0;

Or, return a single-space character to the caller on encountering a comment construct in the input text, so that a comment always terminates the token preceding it.
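Whichever approach is taken, the skipping loop itself is simple. A sketch for { … } comments only ((* … *) is handled similarly; detection of an unterminated comment at end of file is omitted):

if ch = '{' then
begin
  repeat
    NextCh                 (* discard the characters inside the comment *)
  until ch = '}';
  NextCh                   (* step over the closing brace as well *)
end;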

Numerical constants

Integer, floating-point constants

'0', '1', '2', '3', '4',
'5', '6', '7', '8', '9':
  begin
    intval := 0;
    while ch in ['0'..'9'] do
      begin  (* accumulate decimal value *)
        intval := intval * 10 + ord(ch) - ord('0');
        NextCh
      end;
    token := integersym
  end;  (* value of constant returned in intval *)
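A continuation of this sketch for real constants such as 12345.67: after the digits of intval, a '.' followed by a digit signals a real, whereas '..' signals an integer followed by the subrange symbol. It assumes a hypothetical one-character lookahead function PeekCh, real globals realval and scale, and a realsym token value, none of which appear in the notes:

if (ch = '.') and (PeekCh in ['0'..'9']) then
begin
  NextCh;                                  (* consume the '.' *)
  realval := intval;
  scale := 1.0;
  while ch in ['0'..'9'] do
  begin
    scale := scale / 10.0;                 (* each digit is one place smaller *)
    realval := realval + (ord(ch) - ord('0')) * scale;
    NextCh
  end;
  token := realsym
end;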

Character and string constants

Character and string constants are enclosed by apostrophes, e.g. 'a'. An apostrophe within a string is denoted by writing two consecutive apostrophes ('').
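A sketch of the corresponding case limb (stringval, stringlen, finished and stringsym are illustrative names; overflow checks are omitted):

'''':                                 (* the opening apostrophe *)
  begin
    stringlen := 0;
    NextCh;
    finished := false;
    repeat
      if ch = '''' then
      begin
        NextCh;
        if ch = '''' then             (* '' stands for one apostrophe *)
        begin
          stringlen := stringlen + 1;
          stringval[stringlen] := '''';
          NextCh
        end
        else
          finished := true            (* a lone apostrophe ends the string *)
      end
      else
      begin
        stringlen := stringlen + 1;
        stringval[stringlen] := ch;
        NextCh
      end
    until finished;
    token := stringsym
  end;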

Identifiers and reserved words

while ch in validIdentifierChars do …
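A fuller sketch: accumulate the spelling, then distinguish reserved words from identifiers by a table lookup (IsReservedWord, ReservedWordToken and identsym are hypothetical names standing in for the reserved-word table mentioned earlier):

'a'..'z', 'A'..'Z':
  begin
    spelling := '';
    while ch in ['a'..'z', 'A'..'Z', '0'..'9'] do
    begin
      spelling := spelling + ch;             (* accumulate the characters *)
      NextCh
    end;
    if IsReservedWord(spelling) then
      token := ReservedWordToken(spelling)   (* beginsym, ifsym, ... *)
    else
      token := identsym                      (* an ordinary identifier *)
  end;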

Example

Consider the character string abc<def input to NextToken. The letter a guides NextToken into the routine to read an identifier; thus, the characters are read one by one until the < character is encountered. This character is retained as the first character to be analysed when NextToken is next called – there is no need to backspace on the input stream. This situation where a single character causes the termination of one token and starts another is very common and is handled naturally by the single-character lookahead.

Errors

Errors detected by the lexical analyser include: a letter not allowed to terminate a number, numerical overflow, end of line before end of string, and a null string.

Also, NextToken should report an error if it encounters a character that is not a member of the character set of Pascal

(e.g. an exclamation mark outside a character string or comment).

Automating the Production of Lexical Analysers

Writing a lexical analyser from scratch is a fairly demanding task; it is easier to modify an existing lexical analyser for a similar language.

Given the syntax specification of the lexical tokens, the coding of the lexical analyser should be mechanical.

Software tools can automate the production of lexical analysers.

e.g. Lex (one of the many utilities of UNIX, and the most famous such tool)

-Requires the syntax of each lexical token to be defined in terms of a regular expression.

-Lex generates a recogniser program – the lexical analyser.

-Lex transforms the regular expressions into an equivalent deterministic finite-state automaton.

-Rapidly produces reliable and efficient lexical analysers, while the control file for Lex provides clear and accurate documentation.
