Context-Free Grammars and Languages

Derivations and Derivation Trees

A recursive definition of derivation trees from XN in a CFG G=(N,T,P,S)

(also called syntax trees) The yields of the trees will be defined at the same time.

1)Basis: X is a d.t. (a tree with a single node whose label is X) from X in G. The corresponding derivation in G is X (a derivation of length 0). X is the yield of the tree.

2)Inductive step: If X is a d.t. from X in G, where

 Y 

i) X is the label of the root,

ii) Y is the label of a leaf,

iii) the string obtained by taking the labels of the leaves from left to right is Y, i.e.,

the yield of the tree is Y,

iv) the corresponding derivation is X*Y, and

v) YY1…Yk is a production in P with each Yi in NT ,

then X is a d.t. from X in G, and a corresponding derivation is

 Y 

Y1 Yk

X*YY1…Yk. Y1…Yk is the yield of the resulting tree.

In particular, in case of k=0, i.e., Y1…Yk=, then X is a d.t. from X in G,

 Y 

and a corresponding derivation is X*Y.  is the yield of the resulting tree.

Examples

Leftmost Derivations and Ambiguity

Let G=(N,T,P,S) be a CFG.

01…n is called a leftmost derivation if for each i, i=xiXii and i+1=xiii for some xi in T*, i in V*, and Xii in P. For leftmost derivations, we sometimes use lm instead of . Rightmost derivations are defined similarly.

THEOREM Let G be a CFG and X be a nonterminal. Then

(1)X* holds in G iff there exists a d.t. from X in G whose yield is .

(2)There is a 1:1 correspondence between the leftmost derivations in G and the d.t.’s in G.

G is said to be ambiguous if there exists x in L(G) that is generated by two distinct leftmost derivations from S, in other words, if there are two different derivation trees for x.

Examples

  1. Identifiers: Gid=(Nid,Tid,Pid,id) Unambiguous

Nid={id,A,L,N}, Tid={a,…,z,A,…Z}{0,1,…9},

Pid={idL|idA, AL|N, La|b|…|z|A|…|Z, N0|1|…|9}.

id idlmid Almid AAlmL AA

id A lmAAAlmALAlmAbA

id A N lmAbNlmAb1

L L 1

A b

  1. Arithmetic expressions: Gexp1=({E},{x,y,z},Pexp1,E)

Pexp1={Ex|y|z|E+E|E-E|E*E|E/E}

Ambiguous  There are two different derivation trees for x*y+z:

E E

E E E E

E E E E

x * y + z x * y + z

(x*y)+z x*(y+z)

CFG s= BNFs

BNF (Backus-Naur form or Backus’ notation)

Expressions to describe the syntax of programming languages

…  metasymbol to represent grammatical element …

::=   is defined to be of the form 

  either  or 

Examples

1. Identifiers

<identifier>::=<letter>|<identifier<alphanumeral>

<alphanumeral>::=<letter>|<numeral>

<letter>::=a|b|…|z|A|B|…|Z

<numeral>::=0|1|…|9

2. Arith. Exps. (an ambiguous grammar)

<arith exp 1>::=x|y|z|<arith exp 1>+<arith exp 1>|

<arith exp 1><arith exp 1>|<arith exp 1>*<arith exp 1>|

<arith exp 1>/<arith exp 1>

3. Another definition of arith. exps. (unambiguous one!)

<arith exp 2>::=<arith exp 2>+<term>|<arit exp 2><term>|<term>

<term>::=<term>*<factor>|<term>/<factor>|<factor>

<factor>::=<factor>|(<arith exp 2>)|<variable>|<constant>

<variable>::=<identifier>

<constant>::=<numeral>

4. Statements in a Pascal-like language

<statement>::=<for statement>|<if statement>|<while statement>|…

for statement>::=for <variable><arith exp 2> to <arith exp 2> do <statement>

if statement>::=if <condition> then <statement>|

if <condition> then <statement> else <statement>

The dangling else problem

Which does “if p thenif q then a else b” mean,

if p then begin if q then a else b end” or “if p then begin if q then a end else b” ?

The Equivalence of BNFs and CFGs

metavariables  nonterminals

::= 

basic symbols  terminals

Exercises

  1. Define the following in terms of BNFs:

a)“Numbers” in language C

b)“while statements” in language C

c)Language AnBn

d)Even integers as strings over alphabet {0,1,…,9}

e)

f)

  1. Draw the syntax trees for the following:

a)Two different syntax trees for “if p then if q then a else b” according to the definition of if statements in the Pascal-like language given above.

b)Arith. exps. x+y*z/x, x+y*z, etc.

  1. Show an equivalent CFG for each of the following BNFs:

a)

b)

  1. Prove the following:

a)If L and L’ are CFLs, then so is LL’. In fact, ℒ2 is closed under , , *, R, substitution, homomorphism, intersection with regular sets, and so on.

b)

c)

  1. Give CFGs for the following languages:

a)

b)

The Pumping Lemma for CFLs

A CFG G=(N,T,P,S) is in Chomsky normal form if every production in P is either of the form XYZ or of the form Xa, where X,Y,Z are in N and a in T.

Lemma For each context-free language L there exists a CFG in Chomsky normal form generating L{}.

THEOREM For any CFL L, there exists an integer n satisfying the following condition: If zL and |z|n, then there exist u, v, w, and x which satisfy the following properties:

1)x=uvwxy,

2)|vwx|n,

3)vx,

4)uviwxiyL for any integer i0.

Proof. Let G=(N,T,P,S) be a CFG in Chomsky normal form generating L{}. Let k=|N|, the number of nonterminals of G, and put n=2k+1.

Assume that zL and |z|n and consider any derivation tree, say T, in G from S generating z. T is a binary tree, since G is in Chomsky normal form. Thus the height of T is at least k+1, which means that there is a path, say p, from the root to a leaf whose length  k+1. Then p must have two identical nonterminals, say X, on it:

Choose a pair of X’s such that the dotted segment of path p contains no identical nonterminals except for the two X’s. Then properties 1) to 4) hold.

u v w x y

Applications of the Pumping Lemma

  1. AnBnCn={anbncn | n0} is not a CFL.

Assume the contrary. Then there exists an integer n satisfying the P.L.

Consider z=anbncnAnBnCn. Since |z|=3nn, z can be decomposed into z=uvwxy such that properties 1) to 4) of the P.L. hold. Since |vwx|n, vx contains at most two letters from a,b,c, that is, vxa*b*b*c*.

Case 1: If vxa+b* (note that vx) then #a(uv2wx2y)>#c(uv2wx2y), which contradicts the fact that uv2wx2y is in AnBnCn (Property 4)). A similar contradiction can be derived from any one of the following possible assumptions: vxa*b+, vxb+c*, vxb*c+.

Case2: If vxa+b+, then uv2wx2y is not of the form anbncn, a contradiction to property 4). Similar contradictions can be derived from vxb+c+.

In any case we obatined a contradiction. Thus AnBnCn is not a CFL.

{aibjaibj | i,j0} can be shown not to be CF in a similar way.

  1. {ap | p is prime} is not a CFL.

Assume the contrary and consider z=an for the n in the P.L. For z=vuwxy,

|uviwxiy|=n(1+|vx|) for i=n+1, a contradiction, since n(1+|vx|) is not prime (note that vx).

  1. L={aibj | j=i2} is not CF.

If it were, there exists n satisfying the P.L. Consider z=anbmL, where m=n2. Consider any decomposition z=uvwxy satisfying the P.L. v cannot contain both a and b; because, otherwise we have a contradiction that uv2wx2y is not of the form aibj. A similar argument shows that x does not contain both a and b.

Case 1: If vwx is in a+, then uv2wx2y=an+|vx|bm and this word is in L by the P.L. However m=n2(n+|vx|)2, since |vx|1, a contradiction.

Case 2: If vwx is in b+, then uv2wx2y=anbm+|vx|and it is in L. However, m+|vx|=n2+|vx|n2, a contradiction.

Case 3: If v is in a+ and x in b+, then z’=uvi+1wxi+1y=an+i|v|bm+i|x| is in L for any i. Note that |v|>0 and |x|>0. Taking i=|x|, we have (n+i|v|)2=(n+|x||v|)2=n2+2n|x||v|+|x|2|v|2n2+|x|2=m+i|x|, a contradiction.

1

#007