CIS 324: Language Design and Implementation

Syntax Analysis

1. Context-Free Grammars (CFG)

A context- free grammar is a finite set of variables ( nonterminals )

each of which represents a language. The languages represented

by the nonterminals are described recursively in terms of each

other and primitive symbols called terminals. The rules relating

the nonterminals are called productions.

A production states that the language associated with a given

nonterminal contains strings that are formed by concatenating strings

from the languages of certain other nonterminals, possibly along

some terminals.

A context-free grammar has four components:

 terminals- basic symbols from whish sentences are formed;

 nonterminals- syntactic variables that denote strings and

define sentences by imposing hierarchical structure;

 productions- specify the way for combining terminals and

nonterminals into sentences;

 a start symbol

A context- free grammar is formally represented by the tuple:

CFG = ( N, T, P, s )

where: N and T are finite sets of nonterminals and terminals

/ under the assumption that N and T are disjoint/

P is a finite set of productions

/ each production is of the form: A

where A is a nonterminal and  is a sting of symbols /

s is the start symbol

Notational conventions:

 terminals are usually denoted by:

- lower-case letters early in the alphabet: a, b, c;

- operator symbols: +, -, *, etc.;

- punctuation symbols: (, ), {, }, ; etc.;

- digits: 0, 1, 2, ..., 9;

- boldface strings: if, else, etc.;

 nonterminals are usually denoted by:

- upper-case letters early in the alphabet: A, B, C;

- the letter S representing the start symbol;

- lower-case italic names: expr, stmt, etc.;

 grammar symbols, that is either terminals or nonterminals, are

represented by upper-case letters late in the alphabet: X, Y, Z

 strings of terminals only are represented by lower-case letters

late in the alphabet: u, v, w, ... z

 productions are represented in the following way: A 1, A 2 etc.

 alternatives in roductions are represented: A 1 | 2 etc.

2. Derivations

Derivations give explanations how a grammar defines a language:

A is a derivation if A  is a production

and  and  are grammar symbols

If 1 2 ... n then 1 derives n.

The symbol  means derives in one step,

* means derives in zero or more steps,

+ means derives in one or more steps.

Axioms:1. *  for any string 

2. if *  and *  then 

A grammar G defines a languege L: L( G ) so that each string

of terminals w is in L( G ) if and only if S + w, the

string w is called sentence of G.

If S *  , where  may contain nonterminals, then  is

a sentential form of G.

A sentence is a sentential form without nonterminals.

Example:

E  - E  - ( E )  - ( E + E )  - ( id + E )  - ( id + id )

E + - ( id + E )  - ( id + id )

2.1 Leftmost Derivations

Derivations in which only the leftmost nonterminal in any sentential

form is replaced at each step are called leftmost: lm.

if w A w  thenlm.

where A  is a production and w is a string of terminals ,

 is a string of grammar symbols.

To emphasize the fact that  derives  by a leftmost derivation

may be written: lm* .

If S lm*  then  is a left-sentential form of the grammar.

2.2 Rightmost Derivations

Derivations in which only the rightmost nonterminal in any sentential

form is replaced at each step are called rightmost: rm.

if  A w  w thenrm.

where A  is a production and w is a string of terminals ,

 is a string of grammar symbols.

To emphasize the fact that  derives  by a rightmost derivation

may be written: rm* .

If S rm*  then  is a right-sentential form of the grammar.

3. Parse Trees

A parse tree is a graphical representation for a derivation that

enables to identify the choice regarding the replacement order.

Each internal node of the parse tree is labeled by a nonterminal,

and its children are labeled from left to right by the symbols in

the right hand side of the production by which the node

nonterminal is replaced according to the derivation.


4. Elimination of Left Recursion

The left-recursive pair of productions

AA | 

could be replaced by the non-left-recursive productions

AA ' and A ' A ' | 

without changing the set of strings derivable from A.

The left-recursive productions

AA1 | A2 | A3 | ... | Am | 1 | 2 | 3 | ... | n

could be replaced by the non-left-recursive productions

A 1 A ' | 2 A ' | ... | n A ' and

A ' 1 A ' | 2 A ' | ... | m A ' | 

Algorithm for Elimination of Left Recursion

Initialize: Arrange the nonterminals in some order A1, A2, ..., An

Repeat: for i := 1 to n do

for j := 1 to i - 1 do

replace any production the form AiAj

by the productions Ai1 | 2 | ... | k

where Aj1 | 2 | ... | k

end

eliminate the immediate left-recursion among the Ai

productions( AA' and A' A' |  )

Example: SAa | b

AAc | Sd | 

for i = 1 nothing happens

for i = 2 we obtainAAc | Aad | bd | 

after eliminating the immediate left recursions

SAa | b

Abd A' | A'

A'cA' | adA' | 