
UNIT – 2

REGULAR EXPRESSIONS AND LANGUAGES

Objectives:

The objectives of this course are as follows:

To learn about the regular languages.

To learn about Pumping lemma for regular languages.

To learn about Closure properties of regular languages.

To learn about Decision properties of Regular languages.

To learn about Equivalence and Minimization of Finite Automata.

Regular Languages:

Operations on Languages

•Let L, L1, L2 be subsets of Σ*

•Concatenation: L1L2 = {xy | x is in L1 and y is in L2}

•Concatenating a language with itself: L^0 = {ε}

L^i = L L^{i-1}, for all i >= 1

•Kleene Closure: L* = U_{i>=0} L^i = L^0 U L^1 U L^2 U …

•Positive Closure: L+ = U_{i>=1} L^i = L^1 U L^2 U …

•Question: Does L+ contain ε?

Kleene closure

Say, L = {a, abc, ba}, on Σ = {a,b,c}, so that L^1 = L.

Then, L^2 = {aa, aabc, aba, abca, abcabc, abcba, baa, baabc, baba}

L^3 = {a, abc, ba} L^2

L* = {ε} U L^1 U L^2 U L^3 U …
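The operations above can be tried out on small finite languages. The following is a minimal sketch, assuming languages are represented as Python sets of strings with the empty string standing for ε; the function names (concat, power, closure_up_to) are made up for illustration. Since L* is infinite in general, only a finite approximation (the union of the powers up to a bound n) is computed.

```python
from itertools import product

def concat(l1, l2):
    """Concatenation L1L2 = {xy | x is in L1 and y is in L2}."""
    return {x + y for x, y in product(l1, l2)}

def power(lang, i):
    """L^i, using L^0 = {ε} and L^i = L L^(i-1)."""
    result = {""}              # L^0 = {ε}; "" stands for ε
    for _ in range(i):
        result = concat(lang, result)
    return result

def closure_up_to(lang, n, positive=False):
    """Finite approximation: union of L^i for i = 0..n (i = 1..n for L+)."""
    start = 1 if positive else 0
    out = set()
    for i in range(start, n + 1):
        out |= power(lang, i)
    return out

L = {"a", "abc", "ba"}
print(sorted(power(L, 2)))
print("" in closure_up_to(L, 3))                  # ε belongs to L* for every L
print("" in closure_up_to(L, 3, positive=True))   # ε belongs to L+ only if ε is in L
```

This also answers the question above for this particular L: since ε is not in L, it is not in any L^i with i >= 1, and so not in L+.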

Regular Expressions

•Highlights:

–A regular expression is used to specify a language, and it does so precisely.

–Regular expressions are very intuitive.

–Regular expressions are very useful in a variety of contexts.

–Given a regular expression, an NFA-ε can be constructed from it automatically.

–Thus, so can an NFA, a DFA, and a corresponding program, all automatically!

Definition of a Regular Expression

•Let Σ be an alphabet. The regular expressions over Σ are:

–Ø Represents the empty set { }

–ε Represents the set {ε}

–a Represents the set {a}, for any symbol a in Σ

Let r and s be regular expressions that represent the sets R and S, respectively.

–r+s Represents the set R U S (precedence 3)

–rs Represents the set RS (precedence 2)

–r* Represents the set R* (highest precedence)

–(r) Represents the set R (not an op, provides precedence)

•If r is a regular expression, then L(r) is used to denote the corresponding language.

•Examples: Let Σ = {0, 1}

(0 + 1)*    All strings of 0’s and 1’s

0(0 + 1)*    All strings of 0’s and 1’s, beginning with a 0

(0 + 1)*1    All strings of 0’s and 1’s, ending with a 1

(0 + 1)*0(0 + 1)*    All strings of 0’s and 1’s containing at least one 0

(0 + 1)*0(0 + 1)*0(0 + 1)*    All strings of 0’s and 1’s containing at least two 0’s

(0 + 1)*01*01*    All strings of 0’s and 1’s containing at least two 0’s

(1 + 01*0)*    All strings of 0’s and 1’s containing an even number of 0’s

1*(01*01*)*    All strings of 0’s and 1’s containing an even number of 0’s

(1*01*0)*1*    All strings of 0’s and 1’s containing an even number of 0’s
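Several of the examples above claim that different expressions denote the same language. A quick sanity check (not a proof) is to compare them on all short strings. The sketch below assumes Python's re module as a stand-in matcher, with the union operator + of these notes written as |; the helper names (all_strings, agrees) are made up for illustration.

```python
import re
from itertools import product

# The "+" (union) of the notes is written "|" in Python's re syntax.
EVEN_ZEROS = ["(1|01*0)*", "1*(01*01*)*", "(1*01*0)*1*"]
AT_LEAST_TWO_ZEROS = ["(0|1)*0(0|1)*0(0|1)*", "(0|1)*01*01*"]

def all_strings(max_len, alphabet="01"):
    """Yield every string over the alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)

def agrees(patterns, predicate, max_len=8):
    """True if every pattern matches exactly the strings satisfying predicate."""
    compiled = [re.compile(p) for p in patterns]
    return all(
        (c.fullmatch(s) is not None) == predicate(s)
        for s in all_strings(max_len)
        for c in compiled
    )

print(agrees(EVEN_ZEROS, lambda s: s.count("0") % 2 == 0))
print(agrees(AT_LEAST_TWO_ZEROS, lambda s: s.count("0") >= 2))
```

Agreement up to length 8 only rules out small counterexamples; equivalence in general follows from the automata constructions later in this unit.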

Equivalence of Regular Expressions and NFA-εs

•Note:

Throughout the following, keep in mind that a string x is accepted by an NFA-ε if there exists a path labeled x (possibly including ε-moves) from the start state to a final state.

•Lemma 1: Let r be a regular expression. Then there exists an NFA-ε M such that L(M) = L(r). Furthermore, M has exactly one final state with no transitions out of it.

•Proof: (by induction on the number of operators, denoted by OP(r), in r).

Basis: OP(r) = 0. Then r is Ø, ε, or a single symbol a in Σ, and in each case a two-state NFA-ε with exactly one final state accepts L(r).

Inductive Hypothesis: Suppose there exists a k ≥ 0 such that for any regular expression r where 0 ≤ OP(r) ≤ k, there exists an NFA-ε M such that L(M) = L(r). Furthermore, suppose that M has exactly one final state.

Inductive Step: Let r be a regular expression with k + 1 operators (OP(r) = k + 1), where k + 1 >= 1.

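Case 1) r = r1 + r2

Since OP(r) = k + 1, it follows that 0 <= OP(r1), OP(r2) <= k. By the inductive hypothesis there exist NFA-ε machines M1 and M2 such that L(M1) = L(r1) and L(M2) = L(r2). Furthermore, both M1 and M2 have exactly one final state.

Construct M as follows (one standard construction): add a new start state with ε-transitions to the start states of M1 and M2, and a new final state reached by ε-transitions from the final states of M1 and M2.

Case 2) r = r1r2

Since OP(r) = k + 1, it follows that 0 <= OP(r1), OP(r2) <= k. By the inductive hypothesis there exist NFA-ε machines M1 and M2 such that L(M1) = L(r1) and L(M2) = L(r2). Furthermore, both M1 and M2 have exactly one final state.

Construct M as follows (one standard construction): add an ε-transition from the final state of M1 to the start state of M2. The start state of M is that of M1, and the final state of M is that of M2.

Case 3) r = r1*

Since OP(r) = k + 1, it follows that 0 <= OP(r1) <= k. By the inductive hypothesis there exists an NFA-ε machine M1 such that L(M1) = L(r1). Furthermore, M1 has exactly one final state.

Construct M as follows (one standard construction): add a new start state and a new final state, with ε-transitions from the new start state to the start state of M1 and to the new final state, and from the final state of M1 back to the start state of M1 and to the new final state.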

Example:

r = 0(0+1)*

r = r1r2

r1 = 0

r2 = (0+1)*

r2 = r3*

r3 = 0+1

r3 = r4 + r5

r4 = 0

r5 = 1
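The decomposition of r = 0(0+1)* into r1, …, r5 mirrors the inductive proof of Lemma 1: build a machine for each leaf, then combine machines bottom-up. The following is a minimal sketch of that idea, assuming one common set of constructions (new start and final states joined by ε-moves); the figures of these notes may differ in detail. The class NFA, the state counter _fresh, and the function names are all made up for illustration, and None stands for ε on an edge.

```python
from itertools import count

_fresh = count()   # hypothetical supply of fresh state names 0, 1, 2, ...

class NFA:
    """NFA-ε with one start state and exactly one final state."""
    def __init__(self, start, final, edges):
        self.start, self.final = start, final
        self.edges = edges           # list of (source, symbol, target); symbol None means ε

def symbol(a):                       # basis: r = a
    s, f = next(_fresh), next(_fresh)
    return NFA(s, f, [(s, a, f)])

def union(m1, m2):                   # Case 1: r = r1 + r2
    s, f = next(_fresh), next(_fresh)
    eps = [(s, None, m1.start), (s, None, m2.start),
           (m1.final, None, f), (m2.final, None, f)]
    return NFA(s, f, m1.edges + m2.edges + eps)

def concat(m1, m2):                  # Case 2: r = r1 r2
    return NFA(m1.start, m2.final,
               m1.edges + m2.edges + [(m1.final, None, m2.start)])

def star(m1):                        # Case 3: r = r1*
    s, f = next(_fresh), next(_fresh)
    eps = [(s, None, m1.start), (s, None, f),
           (m1.final, None, m1.start), (m1.final, None, f)]
    return NFA(s, f, m1.edges + eps)

def eps_closure(m, states):
    """All states reachable from the given states using ε-moves only."""
    seen, stack = set(states), list(states)
    while stack:
        p = stack.pop()
        for src, a, dst in m.edges:
            if src == p and a is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(m, w):
    """True if some path labeled w leads from the start state to the final state."""
    current = eps_closure(m, {m.start})
    for ch in w:
        moved = {dst for src, a, dst in m.edges if src in current and a == ch}
        current = eps_closure(m, moved)
    return m.final in current

# r = 0(0+1)*, built bottom-up following the decomposition r1 ... r5
r4, r5 = symbol("0"), symbol("1")
r3 = union(r4, r5)
r2 = star(r3)
r1 = symbol("0")
r = concat(r1, r2)
print([w for w in ["", "0", "1", "01", "10", "0110"] if accepts(r, w)])
```

In each construction the final state of the combined machine has no outgoing transitions, which is the extra property stated in Lemma 1.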


Definitions Required for Converting a DFA to a Regular Expression

•Let M = (Q, Σ, δ, q1, F) be a DFA with state set Q = {q1, q2, …, qn}, and define:

R_{i,j} = { x | x is in Σ* and δ(qi,x) = qj }

R_{i,j} is the set of all strings that define a path in M from qi to qj.

•Note that states have been numbered starting at 1, not 0!

Example:

R_{2,3} = {0, 001, 00101, 011, …}

R_{1,4} = {01, 00101, …}

R_{3,3} = {11, 100, …}

•Another definition:

R^k_{i,j} = { x | x is in Σ*, δ(qi,x) = qj, and for every u where 1 ≤ |u| < |x| and x = uv, if δ(qi,u) = qp then p ≤ k }

•In words: R^k_{i,j} is the set of all the strings that define a path in M from qi to qj that passes through no state numbered greater than k.

•Note that it may be true that i > k or j > k; only the intermediate states may not be > k.

R^4_{2,3} = {0, 1000, 011, …}

R^1_{2,3} = {0}

111 is not in R^4_{2,3}

111 is not in R^1_{2,3}

101 is not in R^1_{2,3}

R^5_{2,3} = R_{2,3}

•Observations:

1) R^n_{i,j} = R_{i,j}

2) R^{k-1}_{i,j} is a subset of R^k_{i,j}

3) L(M) = U_{qj in F} R^n_{1,j} = U_{qj in F} R_{1,j}

4) R^0_{i,j} is easily computed from the DFA!

5) R^k_{i,j} = R^{k-1}_{i,k} (R^{k-1}_{k,k})* R^{k-1}_{k,j} U R^{k-1}_{i,j}

•Notes on 5:

5) R^k_{i,j} = R^{k-1}_{i,k} (R^{k-1}_{k,k})* R^{k-1}_{k,j} U R^{k-1}_{i,j}

•Consider paths represented by the strings in R^k_{i,j}:

•If x is a string in R^k_{i,j} then no state numbered > k is passed through when processing x and either:

–qk is not passed through, i.e., x is in R^{k-1}_{i,j}

–qk is passed through one or more times, i.e., x is in R^{k-1}_{i,k} (R^{k-1}_{k,k})* R^{k-1}_{k,j}

•Lemma 2: Let M = (Q, Σ, δ, q1, F) be a DFA. Then there exists a regular expression r such that L(M) = L(r).

•Proof:

First we will show (by induction on k) that for all i, j, and k, where 1 ≤ i,j ≤ n and 0 ≤ k ≤ n, there exists a regular expression r such that L(r) = R^k_{i,j}.

Basis: k = 0

R^0_{i,j} contains single symbols, one for each transition from qi to qj, and possibly ε if i = j.

case 1) No transitions from qi to qj and i != j

r^0_{i,j} = Ø

case 2) At least one (m ≥ 1) transition from qi to qj and i != j

r^0_{i,j} = a1 + a2 + a3 + … + am, where δ(qi, ap) = qj for all 1 ≤ p ≤ m

case 3) No transitions from qi to qj and i = j

r^0_{i,j} = ε

case 4) At least one (m ≥ 1) transition from qi to qj and i = j

r^0_{i,j} = a1 + a2 + a3 + … + am + ε, where δ(qi, ap) = qj for all 1 ≤ p ≤ m

Inductive Hypothesis:

Suppose that R^{k-1}_{i,j} can be represented by the regular expression r^{k-1}_{i,j} for all 1 ≤ i,j ≤ n, and some k ≥ 1.

Inductive Step:

Consider R^k_{i,j} = R^{k-1}_{i,k} (R^{k-1}_{k,k})* R^{k-1}_{k,j} U R^{k-1}_{i,j}. By the inductive hypothesis there exist regular expressions r^{k-1}_{i,k}, r^{k-1}_{k,k}, r^{k-1}_{k,j}, and r^{k-1}_{i,j} generating R^{k-1}_{i,k}, R^{k-1}_{k,k}, R^{k-1}_{k,j}, and R^{k-1}_{i,j}, respectively. Thus, if we let

r^k_{i,j} = r^{k-1}_{i,k} (r^{k-1}_{k,k})* r^{k-1}_{k,j} + r^{k-1}_{i,j}

then r^k_{i,j} is a regular expression generating R^k_{i,j}, i.e., L(r^k_{i,j}) = R^k_{i,j}.

•Finally, if F = {q_{j1}, q_{j2}, …, q_{jr}}, then

r^n_{1,j1} + r^n_{1,j2} + … + r^n_{1,jr}

is a regular expression generating L(M).

•Note: not only does this prove that the regular expressions generate the regular languages, but it also provides an algorithm for computing a regular expression from a given DFA!

•All remaining columns are computed from the previous column using the formula.

r^1_{2,3} = r^0_{2,1} (r^0_{1,1})* r^0_{1,3} + r^0_{2,3}

= 0 (ε)* 1 + 1

= 01 + 1

r^2_{1,3} = r^1_{1,2} (r^1_{2,2})* r^1_{2,3} + r^1_{1,3}

= 0 (ε + 00)* (1 + 01) + 1

= 0*1

•To complete the regular expression, we compute:

r^3_{1,2} + r^3_{1,3}
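The proof of Lemma 2 is effectively an algorithm: fill in the k = 0 expressions from the transitions, then apply the formula for k = 1, …, n. Below is a minimal sketch of that procedure, assuming expressions are built as pattern strings in Python's re syntax so the result can be checked against the DFA. The representation (None for Ø, the empty pattern for ε) and all function names are made up for illustration. The three-state DFA is an assumption as well: its transitions out of q1 and q2 agree with the r^0 values used in the computation above, but the transitions out of q3 are not given in the text and are chosen arbitrarily here. No simplification is attempted, so the output is far longer than a hand-derived expression.

```python
import re
from itertools import product

def union(r, s):
    """r + s on pattern strings; None stands for Ø and "" for ε."""
    if r is None:
        return s
    if s is None:
        return r
    if r == s:
        return r
    if r == "":
        return f"(?:{s})?"
    if s == "":
        return f"(?:{r})?"
    return f"(?:{r}|{s})"

def concat(r, s):
    """rs; concatenating anything with Ø gives Ø."""
    return None if r is None or s is None else r + s

def star(r):
    """r*; both Ø* and ε* denote {ε}."""
    return "" if r in (None, "") else f"(?:{r})*"

def dfa_to_regex(n, delta, finals):
    """States are 1..n with start state 1; delta maps (state, symbol) -> state."""
    r = {}
    for i in range(1, n + 1):            # basis, k = 0
        for j in range(1, n + 1):
            expr = "" if i == j else None
            for (p, a), q in delta.items():
                if p == i and q == j:
                    expr = union(expr, re.escape(a))
            r[i, j] = expr
    for k in range(1, n + 1):            # r^k_ij = r^(k-1)_ik (r^(k-1)_kk)* r^(k-1)_kj + r^(k-1)_ij
        r = {(i, j): union(concat(concat(r[i, k], star(r[k, k])), r[k, j]), r[i, j])
             for i in range(1, n + 1) for j in range(1, n + 1)}
    result = None
    for f in sorted(finals):             # union over the final states
        result = union(result, r[1, f])
    return result

def dfa_accepts(delta, finals, w, start=1):
    q = start
    for ch in w:
        q = delta[q, ch]
    return q in finals

# Assumed DFA; the rows for state 3 are hypothetical.
DELTA = {(1, "0"): 2, (1, "1"): 3, (2, "0"): 1, (2, "1"): 3, (3, "0"): 2, (3, "1"): 2}
FINALS = {2, 3}
PATTERN = dfa_to_regex(3, DELTA, FINALS)
print(len(PATTERN), "characters in the unsimplified expression")
```

A reasonable check is to compare re.fullmatch(PATTERN, w) with dfa_accepts(DELTA, FINALS, w) for every string w up to some small length.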

•Theorem: Let L be a language. Then there exists a regular expression r such that L = L(r) if and only if there exists a DFA M such that L = L(M).

•Proof:

(if) Suppose there exists a DFA M such that L = L(M). Then by Lemma 2 there exists a regular expression r such that L = L(r).

(only if) Suppose there exists a regular expression r such that L = L(r). Then by Lemma 1 there exists an NFA-ε M' such that L = L(M'), and since every NFA-ε can be converted into an equivalent DFA, there exists a DFA M such that L = L(M).

•Corollary: The regular expressions define the regular languages.

•Note: The conversion from a regular expression to a DFA and a program accepting L(r) is now complete, and fully automated!

Applications of Regular Expression

1.Regular expressions in Unix

In the UNIX operating system, various commands use an extended regular-expression language that provides shorthands for many common expressions. In it we can write character classes to represent large sets of characters. (A character class is a pattern that defines a set of characters and matches exactly one character from that set.) There are some rules for forming these character classes:

The dot symbol (.) is to represent ‘any character’.

The regular expression a+b+c+…+z is represented by [abc…z]

Within a character class representation, - can be used to define a set of characters in terms of a range. For example, a-z defines the set of lower-case letters and A-Z defines the set of upper-case letters. Some tools accept the endpoints of a range in either order (so that both 0-9 and 9-0 define the set of digits), but others, such as grep, reject a reversed range, so it is safest to write the smaller endpoint first.

If the class must contain a literal minus sign, place it first or last to avoid confusion with the range specifier, e.g. [-.0-9]. Special characters can be represented as ordinary characters using the \ symbol, i.e. \ provides the usual escapes within character class brackets. Thus [[\]] matches either [ or ], because \ causes the first ] in the character class representation to be taken as a normal character rather than the closing bracket of the representation.

Special notations

[:digit:] same as [0-9]

[:alpha:] same as [A-Za-z]

[:alnum:] same as [A-Za-z0-9]

Operators

| Used in place of +

? 0 or 1 of

R? means 0 or 1 occurrence of R

+ 1 or more of

R+ means 1 or more occurrences of R

{n} n copies of

R{3} means RRR

^ Complement of

If the first character after the opening bracket of a character class is ^, the set defined by the remainder of the class is complemented with respect to the computer's character set. Using this notation, the character class represented by ‘.’ can be described as [^\n]. If ^ appears as any character of a class except the first, it is not considered to be an operator. Thus [^abc] matches any character except a, b, or c but [a^bc] or [abc^] matches a, b, c or ^.

When more than one expression can match the current character sequence, a choice is made as follows:

  1. The longest match is preferred.
  2. Among rules, which match the same number of characters, the rule given first is preferred.
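The bracket and operator notation above can be experimented with in most regular-expression libraries. The sketch below uses Python's re module purely as a convenient stand-in; it is not the UNIX dialect described here (for instance, it has no [:digit:] notation, and for alternatives it takes the leftmost one that works rather than the longest match). The CHECKS table is made up for illustration; each row pairs a pattern and a text with whether the whole text is expected to match.

```python
import re

# (pattern, text, expected result of matching the whole text)
CHECKS = [
    (r"[a-z]",   "q",      True),    # range: any lower-case letter
    (r"[a-z]",   "Q",      False),
    (r"[-.0-9]", "-",      True),    # '-' placed first is a literal minus
    (r"[-.0-9]", "7",      True),
    (r"[^abc]",  "d",      True),    # '^' first: complemented class
    (r"[^abc]",  "a",      False),
    (r"[a^bc]",  "^",      True),    # '^' not first: an ordinary character
    (r"[\[\]]",  "]",      True),    # escaped brackets inside a class
    (r"ab?",     "a",      True),    # R? : 0 or 1 occurrence of R
    (r"ab+",     "a",      False),   # R+ : 1 or more occurrences of R
    (r"ab+",     "abbb",   True),
    (r"(ab){3}", "ababab", True),    # R{3} means RRR
    (r"a|b",     "b",      True),    # | used in place of +
]

RESULTS = [(p, t, re.fullmatch(p, t) is not None) for p, t, _ in CHECKS]
for (pattern, text, got), (_, _, expected) in zip(RESULTS, CHECKS):
    print(f"{pattern!r:12} {text!r:10} {got} (expected {expected})")
```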

2.Lexical analysis

Compilers – in a nutshell

Purpose: translate a program in some language (the source language) into a lower-level language (the target language).

Phases:

Lexical Analysis:

Converts a sequence of characters into words, or tokens

Syntax Analysis:

Converts a sequence of tokens into a parse tree

Semantic Analysis:

Manipulates parse tree to verify symbol and type information

Intermediate Code Generation:

Converts parse tree into a sequence of intermediate code instructions

Optimization:

Manipulates intermediate code to produce a more efficient program

Final Code Generation:

Translates intermediate code into final (machine/assembly) code

Overview of Lexical Analysis

  • Convert character sequence into tokens, skip comments & whitespace
  • Handle lexical errors
  • Efficiency is crucial
  • Tokens are specified as regular expressions, e.g. IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*
  • Lexical analyzers are implemented as finite automata generated from these regular expressions.

There is a problem that more than one token may be recognized at once. For example, the string else matches the regular expression for the keyword else as well as the expression for identifiers. This problem is resolved by giving priority to the first expression listed.
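The two disambiguation rules (longest match first, then the rule listed first) can be sketched directly. The following is a minimal illustration rather than how a generated lexer works internally: it tries every rule at the current position using Python's re module instead of running a single DFA. The token names and the RULES table are assumptions made for this example; only the IDENTIFIER pattern is taken from the notes.

```python
import re

# Rules in priority order; names and patterns are illustrative.
RULES = [
    ("ELSE",       re.compile(r"else")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("NUMBER",     re.compile(r"[0-9]+")),
    ("SKIP",       re.compile(r"[ \t\n]+")),      # whitespace is matched but not reported
]

def tokenize(text):
    """Split text into (token name, lexeme) pairs."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"lexical error at position {pos}: {text[pos]!r}")
        if best_name != "SKIP":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("else elsewhere x1 42"))
```

Here else is reported as a keyword because the ELSE rule is listed before IDENTIFIER and both match four characters, while elsewhere is an identifier because the identifier rule matches more characters.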

Regular Grammars

A grammar is a quadruple

G = (V, T, S, P) where

V is a finite set of variables

T is a finite set of symbols, called terminals

S is in V and is called the start symbol

P is a finite set of productions, which are rules of the form
α → β

where α and β are strings consisting of terminals and variables.

A grammar is said to be right-linear if every production in P is of the form

A → xB or

A → x

where A and B are variables (perhaps the same, perhaps the start symbol S) in V

and x is any string of terminal symbols (including the empty string λ)

An alternate (and better) definition of a right-linear grammar says that every production in P is of the form

A → aB or

A → a or

S → λ (to allow λ to be in the language)

where A and B are variables (perhaps the same, but B can't be S) in V

and a is any terminal symbol


A grammar is said to be left-linear if every production in P is of the form

A → Bx or

A → x

where A and B are variables (perhaps the same, perhaps the start symbol S) in V

and x is any string of terminal symbols (including the empty string λ)

The alternate definition of a left-linear grammar says that every production in P is of the form

A → Ba or

A → a or

S → λ

where A and B are variables (perhaps the same, but B can't be S) in V

and a is any terminal symbol

Any left-linear or right-linear grammar is called a regular grammar.

For brevity, we often write a set of productions such as

A → x1

A → x2

A → x3

As

A → x1 | x2 | x3

A derivation in grammar G is any sequence of strings over V and T, connected with ⇒, starting with S and ending with a string containing no variables, where each subsequent string is obtained by applying a production in P.

S ⇒ x1 ⇒ x2 ⇒ x3 ⇒ . . . ⇒ xn

abbreviated as:

S ⇒* xn

We say that xn is a sentence of the language generated by G, L(G).

We say that the other x's are sentential forms.

L(G) = {w | w ∈ T* and S ⇒* w}

We call L(G) the language generated by G.

L(G) is the set of all sentences over grammar G.

Example 1

S → abS | a is an example of a right-linear grammar.

Can you figure out what language it generates?

L = {(ab)^n a | n ≥ 0}, i.e., zero or more repetitions of ab followed by a single a

L = L((ab)*a)

Example 2

S → Aab
A → Aab | Ba
B → a
is an example of a left-linear grammar (every variable on a right-hand side appears at the left end).

Can you figure out what language it generates?

L = {w ∈ {a,b}* | w is aa followed by at least one ab}

L = L(aaab(ab)*)

Regular Grammars and NFA's

We get a feel for this by example.

Let

S → aA

A → abS | b
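One way to see the connection is to turn each variable into a state and each production into a path. The sketch below assumes the usual correspondence for right-linear grammars: a production A → xB becomes a path labeled x from state A to state B, and A → x becomes a path labeled x from A to a single accepting state. The state name FINAL, the tuple encoding of productions, and the function names are made up for illustration.

```python
def grammar_to_nfa(productions):
    """productions: list of (head, terminals, tail), tail being a variable or None.
    Returns (edges, final) where edges is a set of (state, symbol, state)
    and symbol None denotes an ε-move."""
    final = "FINAL"                          # hypothetical name of the accepting state
    edges = set()
    for n, (head, terminals, tail) in enumerate(productions):
        target = tail if tail is not None else final
        if terminals == "":
            edges.add((head, None, target))  # A -> B or A -> λ becomes an ε-move
            continue
        state = head
        for k, a in enumerate(terminals):
            is_last = (k == len(terminals) - 1)
            nxt = target if is_last else (n, k)   # (n, k) is a fresh intermediate state
            edges.add((state, a, nxt))
            state = nxt
    return edges, final

def closure(edges, states):
    """States reachable from the given states by ε-moves only."""
    seen, stack = set(states), list(states)
    while stack:
        p = stack.pop()
        for src, a, dst in edges:
            if src == p and a is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(edges, final, start, w):
    current = closure(edges, {start})
    for ch in w:
        moved = {dst for src, a, dst in edges if src in current and a == ch}
        current = closure(edges, moved)
    return final in current

# S -> aA,  A -> abS | b
G = [("S", "a", "A"), ("A", "ab", "S"), ("A", "b", None)]
EDGES, FINAL = grammar_to_nfa(G)
print([w for w in ["", "ab", "aab", "abab", "aabab", "aabaabab"]
       if accepts(EDGES, FINAL, "S", w)])
```

Reading derivations off the grammar, every sentence is some number of repetitions of aab followed by ab, and the simulated NFA can be compared against that description.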

Regular Grammars and Regular Expressions

 Example: L(aab*a)

We can easily construct a regular grammar for this expression:

S → aA

A → aB

B → bB

B → a

Types of grammars:

Key Terms:

Introduction to regular operators, regular languages, Precedence of regular operators

Regular expressions, Formal definition of regular expressions,

Equivalence of Regular Expressions and Finite Automata.

Theorem for conversion from regular expression to epsilon FA.

Application of regular expressions

Algebraic Laws for Regular Expressions.

Multiple choice questions:

1. The regular expression 01* denotes

a) The language consisting of strings of length 2.

b) The language consisting of all strings that are a single 0 followed by any number of 1’s.

c) The language consisting of the single string 0.

d) The language consisting of all strings that are a single 1 followed by any number of 0’s.

2. Union and concatenation are associative.

a) True b) False

3. The regular expression 10* denotes

a) The language consisting of strings of length 2.

b) The language consisting of all strings that are a single 0 followed by any number of 1’s.

c) The language consisting of the single string 0.

d) The language consisting of all strings that are a single 1 followed by any number of 0’s.

4. The Kleene closure is represented by

a) L b) L+ c) L* d) L1

5. In a regular expression L(E+F) is equal to

a) L(E) + L(F) b) L(E) U L(F) c) L(E U F) d) L(E ∩ F)

6. Union of regular expressions is commutative.

a) True b) False

7. In a regular expression L(E*) is equal to

a) L(E)* b) (L(E)*) c) (L(E))* d) all of the above

8. The regular expression operators are

a)union, intersection and concatenation

b) union, concatenation and closure

c) closure, intersection and concatenation

d) union, intersection and closure

9. Concatenation of regular expressions is commutative.

a) True b) False

10. The regular expression (10)* denotes

a) The language consisting of strings of length 2.

b) The language consisting of all strings that are a single 0 followed by any number of 1’s.

c) The language consisting of alternating strings that begin with 1 and end with 0.

d) The language consisting of all strings that are a single 1 followed by any number of 0’s.

11. Every language is a regular language.

a) True b) False

12. The inverse homomorphism of a regular language is regular.

a) True b) False

13. The positive closure is represented by

a) L b) L+ c) L* d) L1

14. The regular expression for the set of strings that end with ‘1’ and have no substring ‘00’ is given by

a) (0+1)*0101(0+1)*

b) 11(1+0+0)*11

c) (1+01)*(10+11)*1

d) none

Closure property : New recognizers for languages that are constructed from other languages by certain operations can be built.

Decision Property: This property gives algorithms for answering important questions about automata.

Pumping Lemma for Regular Languages

•Pumping Lemma relates the size of string accepted with the number of states in a DFA

•Lemma: (the pumping lemma)

Let M be a DFA with |Q| = n states. If there exists a string x in L(M) such that |x| ≥ n, then there exists a way to write it as x = uvw, where u, v, and w are all in Σ* and:

–1 ≤ |uv| ≤ n

–|v| ≥ 1

–such that the strings u v^i w are also in L(M), for all i ≥ 0

•Proof:

Let x = a1a2…am where m ≥ n, x is in L(M), and δ(q0, a1a2…ap) = q_{jp}. The path followed on x is:

q_{j0} -a1-> q_{j1} -a2-> q_{j2} -a3-> q_{j3} … -am-> q_{jm},   where m ≥ n and q_{j0} is q0

Consider the first n symbols, and first n+1 states on the above path:

q_{j0} -a1-> q_{j1} -a2-> q_{j2} -a3-> q_{j3} … -an-> q_{jn}

Since |Q| = n, it follows from the pigeon-hole principle that js = jt for some 0 ≤ s < t ≤ n, i.e., some state appears on this path twice (perhaps many states appear more than once, but at least one does).

•Let:

–u = a_1…a_s

–v = a_{s+1}…a_t

•Since 0 ≤ s < t ≤ n and uv = a_1…a_t it follows that:

–1 ≤ |v| and therefore 1 ≤ |uv|

–|uv| ≤ n and therefore 1 ≤ |uv| ≤ n

–In addition, let:

–w = a_{t+1}…a_m

–It follows that u v^i w = a_1…a_s (a_{s+1}…a_t)^i a_{t+1}…a_m is in L(M), for all i ≥ 0.

In other words, when processing the accepted string x, the loop was traversed once, but could have been traversed as many times as desired, and the corresponding strings would be accepted.

u = ε, v = b, w = bbab:   (b)^i bbab is in L(M), for all i ≥ 0

or

u = b, v = b, w = bab:   b(b)^i bab is in L(M), for all i ≥ 0

or

u = bb, v = b, w = ab
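The proof is constructive: following the path of x through the DFA and stopping at the first repeated state yields the split x = uvw. The sketch below assumes a DFA given as a dictionary from (state, symbol) to state; the two-state machine for an even number of 0's is a made-up example (not the machine from the figure), and the function names are illustrative.

```python
def run(delta, start, x):
    """Return the state reached from start after reading x."""
    q = start
    for a in x:
        q = delta[q, a]
    return q

def pump_split(delta, start, x):
    """Return (u, v, w) with x = uvw and v a loop on the path of x, or None.

    Stops at the first state that repeats, so |uv| is at most the number
    of states, as in the pigeon-hole argument."""
    seen = {start: 0}            # state -> position p at which it was first reached
    q = start
    for t, a in enumerate(x, start=1):
        q = delta[q, a]
        if q in seen:            # the state at position s equals the state at position t, s < t
            s = seen[q]
            return x[:s], x[s:t], x[t:]
        seen[q] = t
    return None                  # no repeat: only possible when |x| < |Q|

# Hypothetical DFA: strings over {0,1} containing an even number of 0's.
DELTA = {("even", "0"): "odd", ("even", "1"): "even",
         ("odd", "0"): "even", ("odd", "1"): "odd"}
START, FINALS = "even", {"even"}

x = "0110"
u, v, w = pump_split(DELTA, START, x)
print(repr(u), repr(v), repr(w))
print(all(run(DELTA, START, u + v * i + w) in FINALS for i in range(6)))
```

Because v labels a loop from a state back to itself, repeating or deleting it never changes the state in which the machine finishes.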

NonRegularity Example

•Theorem: The language:

L = {0^k 1^k | k ≥ 0}    (1)

is not regular.

•Proof: (by contradiction) Suppose that L is regular. Then there exists a DFA M such that:

L = L(M)    (2)

We will show that M accepts some strings not in L, contradicting (2).

Suppose that M has n states, and consider a string x = 0^m 1^m, where m > n.

By (1), x is in L.

By (2), x is also in L(M) (note that the machine accepts a language, not just a string).

Since |x| = 2m > n, it follows from the pumping lemma that:

–x = uvw

–1 ≤ |uv| ≤ n

–1 ≤ |v|, and

–u v^i w is in L(M), for all i ≥ 0

Since 1 ≤ |uv| ≤ n and n < m, it follows that 1 ≤ |uv| < m.

Also, since x = 0^m 1^m it follows that uv is a substring of 0^m.

In other words v = 0^j, for some j ≥ 1.

Since u v^i w is in L(M), for all i ≥ 0, it follows that 0^{m+cj} 1^m is in L(M), for all c ≥ 1.

But by (1) and (2), 0^{m+cj} 1^m is not in L(M), for any c ≥ 1, a contradiction.
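The heart of the argument is that every admissible split of x = 0^m 1^m puts v inside the block of 0's, so pumping v unbalances the string. For a fixed candidate number of states n this can be checked exhaustively. The sketch below is only an illustration of that step, not a replacement for the proof (which must hold for every n); the function names are made up, and m = n + 1 is one arbitrary choice satisfying m > n.

```python
def in_L(s):
    """Membership test for L = {0^k 1^k | k >= 0}."""
    k = len(s) // 2
    return len(s) % 2 == 0 and s == "0" * k + "1" * k

def splits(x, n):
    """All (u, v, w) with x = uvw, 1 <= |uv| <= n and |v| >= 1."""
    for uv_len in range(1, n + 1):
        for u_len in range(uv_len):
            yield x[:u_len], x[u_len:uv_len], x[uv_len:]

def every_split_fails(n, i=2):
    """True if, for x = 0^m 1^m with m = n + 1, no admissible split survives pumping v i times."""
    m = n + 1
    x = "0" * m + "1" * m
    return all(not in_L(u + v * i + w) for u, v, w in splits(x, n))

print(all(every_split_fails(n) for n in range(1, 8)))        # pumping up, i = 2
print(all(every_split_fails(n, i=0) for n in range(1, 8)))   # pumping down, i = 0
```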

•Note that L basically corresponds to balanced parentheses.

•Theorem: The language:

L = {0^k 1^k 2^k | k ≥ 0}    (1)

is not regular.

•Proof: (by contradiction) Suppose that L is regular. Then there exists a DFA M such that:

L = L(M)    (2)

We will show that M accepts some strings not in L, contradicting (2).

Suppose that M has n states, and consider a string x = 0^m 1^m 2^m, where m > n.

By (1), x is in L.

By (2), x is also in L(M) (note that the machine accepts a language, not just a string).

Since |x| = 3m > n, it follows from the pumping lemma that:

–x = uvw

–1 ≤ |uv| ≤ n

–1 ≤ |v|, and

–u v^i w is in L(M), for all i ≥ 0

Since 1 ≤ |uv| ≤ n and n < m, it follows that 1 ≤ |uv| < m.

Also, since x = 0^m 1^m 2^m it follows that uv is a substring of 0^m.

In other words v = 0^j, for some j ≥ 1.

Since u v^i w is in L(M), for all i ≥ 0, it follows that 0^{m+cj} 1^m 2^m is in L(M), for all c ≥ 1.

But by (1) and (2), 0^{m+cj} 1^m 2^m is not in L(M), for any c ≥ 1, a contradiction.

•Note that the above proof is almost identical to the previous proof.

•Theorem: The language:

L = {0^m 1^n 2^{m+n} | m,n ≥ 0}    (1)

is not regular.

•Proof: (by contradiction) Suppose that L is regular. Then there exists a DFA M such that:

L = L(M)    (2)

We will show that M accepts some strings not in L, contradicting (2).

Suppose that M has n states, and consider a string x = 0^m 1^n 2^{m+n}, where m > n (here the exponent of 1 is taken to be n, the number of states).

By (1), x is in L.

By (2), x is also in L(M).

Since |x| = 2(m+n) > n, it follows from the pumping lemma that:

–x = uvw

–1 ≤ |uv| ≤ n

–1 ≤ |v|, and

–u v^i w is in L(M), for all i ≥ 0

Since 1 ≤ |uv| ≤ n and n < m, it follows that 1 ≤ |uv| < m.

Also, since x = 0^m 1^n 2^{m+n} it follows that uv is a substring of 0^m.

In other words v = 0^j, for some j ≥ 1.

Since u v^i w is in L(M), for all i ≥ 0, it follows that 0^{m+cj} 1^n 2^{m+n} is in L(M), for all c ≥ 1. In other words v can be “pumped” as many times as we like, and we still get a string in L(M).

But by (1) and (2), 0^{m+cj} 1^n 2^{m+n} is not in L(M), for any c ≥ 1, because a string of this form belongs to L only if it has m+cj+n 2's, i.e., it would have to be 0^{m+cj} 1^n 2^{m+cj+n}, a contradiction.

•Note that the above proof is almost identical to the previous proof.

•Theorem: Let M = (Q, Σ, δ, q0, F) be a DFA. Then L(M) is finite iff |x| < |Q| for all x in L(M).

•Proof:

(if) Suppose that |x| < |Q| for all x in L(M). Since the number of states |Q| and the number of input symbols |Σ| are both fixed, it follows that there are at most a finite number of strings of length less than |Q|. It follows that L(M) is finite (exercise: give an upper bound on the number of such strings).

(only if) By contradiction. Suppose that L(M) is finite, but that |x| ≥ |Q| for some x in L(M). From the pumping lemma it follows that x = uvw, |v| ≥ 1 and u v^i w is in L(M) for all i ≥ 0. But then L(M) would be infinite, a contradiction.

•Theorem: Let M = (Q, Σ, δ, q0, F) be a DFA. Then L(M) is infinite iff there exists an x in L(M) such that |x| ≥ |Q|.

•Proof:

(if) Suppose there exists an x in L(M) such that |x| ≥ |Q|. From the pumping lemma it follows that x = uvw, |v| ≥ 1 and u v^i w is in L(M) for all i ≥ 0. Therefore L(M) is infinite.

(only if) By contradiction. Suppose that L(M) is infinite, but that there is no x in L(M) such that |x| ≥ |Q|. It follows that each x in L(M) has length less than |Q|. Since the number of states |Q| and the number of input symbols |Σ| are both fixed, it follows that there are at most a finite number of strings of length less than |Q|. It follows that L(M) is finite. A contradiction.
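The last theorem suggests a decision procedure: L(M) is infinite exactly when M accepts some string of length at least |Q|. To make the search finite, the sketch below relies on a standard refinement that is not stated in these notes: if M accepts any string of length at least n = |Q|, then it accepts one of length between n and 2n - 1 (remove loops from a longer accepted string until its length falls into that range). Rather than enumerating strings, the code tracks the set of states reachable by strings of each exact length. The function name, the argument layout, and both example machines are assumptions made for illustration.

```python
def is_infinite(states, alphabet, delta, start, finals):
    """Decide whether the DFA accepts infinitely many strings.

    delta must be total: it maps every (state, symbol) pair to a state."""
    n = len(states)
    reachable = {start}                       # states reachable by strings of length 0
    for length in range(1, 2 * n):            # lengths 1 .. 2n - 1
        reachable = {delta[q, a] for q in reachable for a in alphabet}
        if length >= n and reachable & set(finals):
            return True                       # an accepted string of length >= |Q| exists
    return False

# Hypothetical machine 1: even number of 0's (an infinite language).
EVEN = {("e", "0"): "o", ("e", "1"): "e", ("o", "0"): "e", ("o", "1"): "o"}

# Hypothetical machine 2: accepts only the string ab; state 3 is a dead state.
ONLY_AB = {(0, "a"): 1, (0, "b"): 3, (1, "a"): 3, (1, "b"): 2,
           (2, "a"): 3, (2, "b"): 3, (3, "a"): 3, (3, "b"): 3}

print(is_infinite({"e", "o"}, "01", EVEN, "e", {"e"}))
print(is_infinite({0, 1, 2, 3}, "ab", ONLY_AB, 0, {2}))
```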