CIS 324: Language Design and Implementation
Syntax Analysis
1. Context-Free Grammars (CFG)
A context- free grammar is a finite set of variables ( nonterminals )
each of which represents a language. The languages represented
by the nonterminals are described recursively in terms of each
other and primitive symbols called terminals. The rules relating
the nonterminals are called productions.
A production states that the language associated with a given
nonterminal contains strings that are formed by concatenating strings
from the languages of certain other nonterminals, possibly along
some terminals.
A context-free grammar has four components:
terminals- basic symbols from whish sentences are formed;
nonterminals- syntactic variables that denote strings and
define sentences by imposing hierarchical structure;
productions- specify the way for combining terminals and
nonterminals into sentences;
a start symbol
A context- free grammar is formally represented by the tuple:
CFG = ( N, T, P, s )
where: N and T are finite sets of nonterminals and terminals
/ under the assumption that N and T are disjoint/
P is a finite set of productions
/ each production is of the form: A
where A is a nonterminal and is a sting of symbols /
s is the start symbol
Notational conventions:
terminals are usually denoted by:
- lower-case letters early in the alphabet: a, b, c;
- operator symbols: +, -, *, etc.;
- punctuation symbols: (, ), {, }, ; etc.;
- digits: 0, 1, 2, ..., 9;
- boldface strings: if, else, etc.;
nonterminals are usually denoted by:
- upper-case letters early in the alphabet: A, B, C;
- the letter S representing the start symbol;
- lower-case italic names: expr, stmt, etc.;
grammar symbols, that is either terminals or nonterminals, are
represented by upper-case letters late in the alphabet: X, Y, Z
strings of terminals only are represented by lower-case letters
late in the alphabet: u, v, w, ... z
productions are represented in the following way: A 1, A 2 etc.
alternatives in roductions are represented: A 1 | 2 etc.
2. Derivations
Derivations give explanations how a grammar defines a language:
A is a derivation if A is a production
and and are grammar symbols
If 1 2 ... n then 1 derives n.
The symbol means derives in one step,
* means derives in zero or more steps,
+ means derives in one or more steps.
Axioms:1. * for any string
2. if * and * then
A grammar G defines a languege L: L( G ) so that each string
of terminals w is in L( G ) if and only if S + w, the
string w is called sentence of G.
If S * , where may contain nonterminals, then is
a sentential form of G.
A sentence is a sentential form without nonterminals.
Example:
E - E - ( E ) - ( E + E ) - ( id + E ) - ( id + id )
E + - ( id + E ) - ( id + id )
2.1 Leftmost Derivations
Derivations in which only the leftmost nonterminal in any sentential
form is replaced at each step are called leftmost: lm.
if w A w thenlm.
where A is a production and w is a string of terminals ,
is a string of grammar symbols.
To emphasize the fact that derives by a leftmost derivation
may be written: lm* .
If S lm* then is a left-sentential form of the grammar.
2.2 Rightmost Derivations
Derivations in which only the rightmost nonterminal in any sentential
form is replaced at each step are called rightmost: rm.
if A w w thenrm.
where A is a production and w is a string of terminals ,
is a string of grammar symbols.
To emphasize the fact that derives by a rightmost derivation
may be written: rm* .
If S rm* then is a right-sentential form of the grammar.
3. Parse Trees
A parse tree is a graphical representation for a derivation that
enables to identify the choice regarding the replacement order.
Each internal node of the parse tree is labeled by a nonterminal,
and its children are labeled from left to right by the symbols in
the right hand side of the production by which the node
nonterminal is replaced according to the derivation.
4. Elimination of Left Recursion
The left-recursive pair of productions
AA |
could be replaced by the non-left-recursive productions
AA ' and A ' A ' |
without changing the set of strings derivable from A.
The left-recursive productions
AA1 | A2 | A3 | ... | Am | 1 | 2 | 3 | ... | n
could be replaced by the non-left-recursive productions
A 1 A ' | 2 A ' | ... | n A ' and
A ' 1 A ' | 2 A ' | ... | m A ' |
Algorithm for Elimination of Left Recursion
Initialize: Arrange the nonterminals in some order A1, A2, ..., An
Repeat: for i := 1 to n do
for j := 1 to i - 1 do
replace any production the form AiAj
by the productions Ai1 | 2 | ... | k
where Aj1 | 2 | ... | k
end
eliminate the immediate left-recursion among the Ai
productions( AA' and A' A' | )
Example: SAa | b
AAc | Sd |
for i = 1 nothing happens
for i = 2 we obtainAAc | Aad | bd |
after eliminating the immediate left recursions
SAa | b
Abd A' | A'
A'cA' | adA' |