University of Portsmouth

Theoretical Computer Science

Level 2

THCSA

Lecture notes

Ver 2.1

Last updated: Bruce Bassett 17/02/2003

(ver 1. Jonathan Britt)

http://www.tech.port.ac.uk/staffweb/bassettb/

Based on the books:

Hein, J.L. (1996), Theory of Computation: An Introduction, Jones & Bartlett Publishers, ISBN 0-86720-497-4

And

Feynman R., Feynman Lectures on Computation

ISBN: 0738202967 - ONLY CHAPTER 3.

Contents

Introduction to Languages

Regular Languages

Finite Automata

Regular Grammars

Turing Machines

Chapter 1.

Introduction to Languages

A language is a set of strings. If A is an alphabet, then a language over A is a collection of strings whose symbols come from A. Recall that A* denotes the set of all strings over A. So A* is the biggest possible language over A, and every other language over A is a subset of A*. Four simple examples of languages over an alphabet A are the sets ∅, {Λ}, A, and A*, where Λ denotes the empty string. For example, if A = {a}, then these four simple languages over A are

∅, {Λ}, {a}, and {Λ, a, aa, aaa, …}.

A string in a language is often called a well-formed formula, or wff for short (pronounced "woof"), because the definition of the language usually allows only certain well-formed strings.

The natural operation of concatenation of strings places two strings in juxtaposition. For example, if A = {a, b}, then the concatenation of the two strings aab and ba is the string aabba. We will use the name "cat" to explicitly denote this operation. For example, we'll write cat(aab, ba) = aabba.

Combining Languages

Since languages are sets of strings, they can be combined by the usual set operations of union, intersection, difference, and complement. Another important way to combine two languages L and M is to form the set of all concatenations of strings in L with strings in M. This new language is called the product of L and M and is denoted by L · M. A formal definition can be given as follows:

L · M = {cat(s, t) | s ∈ L and t ∈ M}.

For example, if L = {ab, ac} and M = {a, bc, abc}, then the product L · M is the language

L · M = {aba, abbc, ababc, aca, acbc, acabc}.
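Treating a language as a Python set of strings, the product operation can be sketched as follows (the function name `product` is ours, not standard notation; string concatenation plays the role of cat):

```python
# A sketch of the language product L · M, with languages as Python sets
# of strings; cat(s, t) is just string concatenation s + t.
def product(L, M):
    """Return L · M = {cat(s, t) | s in L and t in M}."""
    return {s + t for s in L for t in M}

L = {"ab", "ac"}
M = {"a", "bc", "abc"}
print(sorted(product(L, M)))
# → ['aba', 'ababc', 'abbc', 'aca', 'acabc', 'acbc']
```

Note that the result has six elements here, but in general |L · M| can be smaller than |L| · |M| when different concatenations produce the same string.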

It's easy to see, from the definition of product, that the following simple properties hold for any language L:

L · {Λ} = {Λ} · L = L.

L · ∅ = ∅ · L = ∅.

But it's also easy to see that the product is not commutative in general. In other words, we can find two languages L and M such that L · M ≠ M · L. For example, if L = {a} and M = {b}, then L · M = {ab} but M · L = {ba}.

It's easy to see that the product is associative. In other words, if L, M, and N are languages, then L · (M · N) = (L · M) · N. So we can write down products without using parentheses. If L is a language, then the product L · L is denoted by L^2. In fact, we'll define the language product L^n for every n ∈ ℕ as follows:

L^0 = {Λ}

L^n = L · L^(n-1) if n > 0.

For example, if L = {a, bb}, then the first few powers of L are

L^0 = {Λ}

L^1 = L = {a, bb}

L^2 = L · L = {aa, abb, bba, bbbb}

L^3 = L · L^2 = {aaa, aabb, abba, abbbb, bbaa, bbabb, bbbba, bbbbbb}
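The inductive definition of L^n translates directly into a recursive function. This is a sketch under our own naming; the base case returns {Λ} (the empty string in Python):

```python
# A sketch of language powers, following the definition L^0 = {Λ}
# and L^n = L · L^(n-1); the empty string Λ is Python's "".
def product(L, M):
    return {s + t for s in L for t in M}

def power(L, n):
    if n == 0:
        return {""}                        # L^0 = {Λ}
    return product(L, power(L, n - 1))     # L^n = L · L^(n-1)

L = {"a", "bb"}
print(sorted(power(L, 2)))
# → ['aa', 'abb', 'bba', 'bbbb']
```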

If L is a language over A (i.e., L ⊆ A*), then the closure of L is the language denoted by L* and is defined as follows:

L* = L^0 ∪ L^1 ∪ L^2 ∪ ….

The positive closure of L is the language denoted by L+ and defined as follows:

L+ = L^1 ∪ L^2 ∪ L^3 ∪ ….

It follows from the definitions that L* = L+ ∪ {Λ}. But it's not necessarily true that L+ = L* - {Λ}. For example, if we let our alphabet be A = {a} and our language be L = {Λ, a}, then L+ = L*, since Λ ∈ L forces Λ into L+ = L^1 ∪ L^2 ∪ ….

Can you find a condition on a language L such that L+ = L* - {L}?
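Since L* is infinite in general, we can only experiment with it up to a bounded power k. The following sketch (function names ours) builds truncated versions of L* and L+ and reproduces the example L = {Λ, a} above:

```python
# A bounded sketch of closure: L* is infinite in general, so we take
# the union L^0 ∪ … ∪ L^k for a small k, and similarly for L+.
def product(L, M):
    return {s + t for s in L for t in M}

def power(L, n):
    return {""} if n == 0 else product(L, power(L, n - 1))

def star(L, k):
    return set().union(*(power(L, n) for n in range(k + 1)))

def plus(L, k):
    return set().union(*(power(L, n) for n in range(1, k + 1)))

L = {"", "a"}                       # Λ ∈ L, so L+ and L* coincide
print(star(L, 3) == plus(L, 3))     # → True
```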

The closure of A coincides with our definition of A* as the set of all strings over A. In other words, we have a nice representation of A* as follows:

A* = A^0 ∪ A^1 ∪ A^2 ∪ …,

where A^n is the set of all strings over A having length n.

Some basic properties of the closure operation are given next. We’ll leave the proofs as exercises.

Properties of Closure

Let L and M be languages over the alphabet A. Then

a) {Λ}* = ∅* = {Λ}

b) L* = L* · L* = (L*)*

c) Λ ∈ L if and only if L+ = L*

d) (L* · M*)* = (L* ∪ M*)* = (L ∪ M)*

e) L · (M · L)* = (L · M)* · L
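Property (e) lends itself to a bounded experiment, because L · (M·L)^n = (L·M)^n · L holds power by power, so truncating both sides at the same k still gives equal sets. A sketch (names ours, sample L and M arbitrary):

```python
# A bounded check of closure property (e): L · (M · L)* = (L · M)* · L.
# Truncating both closures at the same power k preserves the equality,
# since L · (M·L)^n = (L·M)^n · L for every n.
def product(L, M):
    return {s + t for s in L for t in M}

def power(L, n):
    return {""} if n == 0 else product(L, power(L, n - 1))

def star(L, k):
    return set().union(*(power(L, n) for n in range(k + 1)))

L, M, k = {"a", "bb"}, {"c"}, 3
lhs = product(L, star(product(M, L), k))
rhs = product(star(product(L, M), k), L)
print(lhs == rhs)   # → True
```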

Grammars

Informally, a grammar is a set of rules used to define the structure of the strings in a language. Grammars are important in computer science not only to define programming languages, but also to define data sets for programs. Typical applications involve building algorithms to parse strings. To parse a string means to see whether it satisfies the rules of a grammar.

Let's describe the general structure of grammars for arbitrary languages. If L is a language over an alphabet A, then a grammar for L consists of a set of grammar rules of the following form:

α → β,

where α and β denote strings of symbols taken from A and from a set of grammar symbols that is disjoint from A. A grammar rule α → β is often called a production, and it can be read in any of several ways as follows:

"replace a by b", “a produces b," "a rewrites to b” , or "a reduces to b."

Every grammar has a special grammar symbol called a start symbol, and there must be at least one production with left side consisting of only the start symbol. For example, if S is the start symbol for a grammar, then there must be at least one production of the form

S → β.

Let's give an example of a grammar for a language and then discuss the process of deriving strings from the productions. Let A = {a, b, c}. Then a grammar for the language A* can be described by the following four productions:

S → Λ

S → aS

S → bS

S → cS.

How do we know that this grammar describes the language A*? We must be able to describe each string of the language in terms of the grammar rules. For example, let's see how we can use the productions above to show that the string aacb is in A*.

We'll begin with the start symbol S. Then we'll replace S by the right side of the production S → aS. We chose the production S → aS because aacb matches its right-hand side aS if we let S stand for the remaining string acb. The process of replacing S by aS is an example of a derivation, and we say, "S derives aS." We'll denote this derivation by writing

S ⇒ aS.

The symbol ⇒ means "derives in one step." The right-hand side of this derivation contains the symbol S. So we again replace S by aS, using the production S → aS a second time. This results in the derivation

S ⇒ aS ⇒ aaS.

The right-hand side of this derivation contains S. In this case we'll replace S by the right side of S → cS. This gives the derivation

S ⇒ aS ⇒ aaS ⇒ aacS.

Continuing, we replace S by the right side of S → bS. This gives the derivation

S ⇒ aS ⇒ aaS ⇒ aacS ⇒ aacbS.

Since we want this derivation to produce the string aacb, we now replace S by the right side of S → Λ. This gives the desired derivation of the string aacb:

S ⇒ aS ⇒ aaS ⇒ aacS ⇒ aacbS ⇒ aacbΛ = aacb.

To indicate that a derivation of aacb exists, we'll use the shorthand notation

S ⇒* aacb.

The symbol ⇒* means "derives in zero or more steps."
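For this particular grammar the derivation of any string is mechanical: apply S → aS, S → bS, or S → cS once per character, then finish with S → Λ. A sketch that prints each sentential form (function name ours):

```python
# A sketch of derivations in the grammar S → Λ | aS | bS | cS: peel off
# one terminal per step, recording each sentential form along the way.
def derive(target):
    steps = ["S"]
    prefix = ""
    for ch in target:               # apply the production S → chS
        prefix += ch
        steps.append(prefix + "S")
    steps.append(prefix)            # finish with S → Λ
    return steps

print(" => ".join(derive("aacb")))
# → S => aS => aaS => aacS => aacbS => aacb
```

Since this succeeds for every string over {a, b, c}, it illustrates why the grammar describes all of A*.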

Now that we've introduced the idea of a grammar, let's take a minute to describe the four main ingredients of any grammar.

The Four Parts of a Grammar

1. An alphabet N of grammar symbols called nonterminals.

2. An alphabet T of symbols called terminals. The terminals are distinct from the nonterminals.

3. A specific nonterminal called the start symbol.

4. A finite set of productions of the form α → β, where α and β are strings over the alphabet N ∪ T, with the restriction that α is not the empty string. There is at least one production with only the start symbol on its left side. Each nonterminal must appear on the left side of some production.

From now on all grammar productions will have a single nonterminal on the left side. Later on we will consider grammars that allow productions to have strings of more than one symbol on the left side.

When two or more productions have the same left side, we can simplify the notation by writing one production with alternate right sides separated by the vertical line |. For example, the four productions above can be written in the following shorthand form:

S → Λ | aS | bS | cS,

and we say, "S can be replaced by either L, or aS, or bS, or cS."

We can represent a grammar G as a 4-tuple G = ⟨N, T, S, P⟩, where P is the set of productions. For example, if P is the set of productions above, then the grammar can be represented by the 4-tuple

⟨{S}, {a, b, c}, S, P⟩.

The 4-tuple notation is useful for discussing general properties of grammars. But for a particular grammar it's common practice to only write down the productions of the grammar, where the nonterminals are capital letters and the first production listed contains the start symbol on its left side. For example, suppose we're given the following grammar:

S → AB

A → Λ | aA

B → Λ | bB.

We can deduce that the nonterminals are S, A, and B, the start symbol is S, and the terminals are a and b.
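The 4-tuple ⟨N, T, S, P⟩ maps naturally onto a plain data structure. A sketch for the grammar above, with each production stored as a (left, right) pair and Λ as the empty string (the field names are our own choice):

```python
# A sketch of a grammar as a 4-tuple ⟨N, T, S, P⟩; productions are
# (left, right) pairs and the empty string "" stands for Λ.
G = {
    "nonterminals": {"S", "A", "B"},
    "terminals": {"a", "b"},
    "start": "S",
    "productions": [("S", "AB"),
                    ("A", ""), ("A", "aA"),
                    ("B", ""), ("B", "bB")],
}

# Check the requirement that every nonterminal appears on the left
# side of some production:
lefts = {left for left, _ in G["productions"]}
print(lefts == G["nonterminals"])   # → True
```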

To discuss grammars further, we need to formalize things a bit. Suppose we're given some grammar. A string made up of terminals and/or nonterminals is called a sentential form. Now we can formalize the idea of a derivation. If x and y are sentential forms and α → β is a production, then the replacement of α by β in xαy is called a derivation, and we denote it by writing

xαy ⇒ xβy.

The following three symbols with their associated meanings are used quite often in discussing derivations:

⇒ derives in one step,

⇒+ derives in one or more steps,

⇒* derives in zero or more steps.

For example, suppose we have the following grammar:

S → AB

A → Λ | aA

B → Λ | bB.

Let's consider the string aab. The statement S ⇒+ aab means that there exists a derivation of aab that takes one or more steps. For example, we have

S ⇒ AB ⇒ aAB ⇒ aaAB ⇒ aaB ⇒ aabB ⇒ aab.

When a grammar contains more than one nonterminal, as the preceding grammar does, it may be possible to find several different derivations of the same string. Here's another derivation of aab:

S ⇒ AB ⇒ AbB ⇒ Ab ⇒ aAb ⇒ aaAb ⇒ aab.
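One way to test S ⇒+ w mechanically is to search the space of sentential forms, applying productions to every occurrence of a nonterminal. A bounded sketch (our own code, not a standard parser; the length cap works for this grammar because a form holds at most one A and one B):

```python
# A bounded search sketch: starting from S, repeatedly replace one
# occurrence of a nonterminal using a production, keeping forms of
# length at most len(target) + 2 (this grammar's forms contain at
# most one A and one B, so longer forms cannot derive the target).
def derivable(productions, target, start="S"):
    limit = len(target) + 2
    seen, frontier = {start}, [start]
    while frontier:
        form = frontier.pop()
        for left, right in productions:
            i = form.find(left)
            while i != -1:                      # try every occurrence
                new = form[:i] + right + form[i + 1:]
                if len(new) <= limit and new not in seen:
                    seen.add(new)
                    frontier.append(new)
                i = form.find(left, i + 1)
    return target in seen

P = [("S", "AB"), ("A", ""), ("A", "aA"), ("B", ""), ("B", "bB")]
print(derivable(P, "aab"))   # → True
```

Strings like ba are correctly rejected, since every string of this language has all a's before all b's.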

Sometimes it's easy to write a grammar, and sometimes it can be quite difficult. The most important aspect of grammar writing is knowledge of the language under discussion. So we had better nail down the idea of the language associated with a grammar. If G is a grammar, then the language of G is the set of terminal strings derived from the start symbol of G. The language of G is denoted by

L(G).

We can also describe L(G) more formally. If G is a grammar with start symbol S and set of terminals T, then the language of G is the following set:

L(G) = {s | s ∈ T* and S ⇒+ s}.

When we're trying to write a grammar for a language, we should at least check to see whether the language is finite or infinite. If the language is finite, then a grammar can consist of all productions of the form S → w for each string w in the language. For example, the language {a, ab} can be described by the grammar S → a | ab.

If the language is infinite, then some production or sequence of productions must be used repeatedly to construct the derivations. To see this, notice that there is no bound on the length of strings in an infinite language. Therefore there is no bound on the number of derivation steps used to derive the strings. If the grammar has n productions, then any derivation consisting of n + 1 steps must use some production twice. For example, the infinite language { a^n b | n ≥ 0 } can be described by the grammar

S → b | aS.

To derive the string a^n b, we would use the production S → aS repeatedly, n times to be exact, and then stop the derivation by using the production S → b. The production S → aS allows us to say, "If S derives w, then it also derives aw."
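The repeated use of S → aS has a direct counterpart in code: a recognizer for { a^n b | n ≥ 0 } that calls itself once per application of the production. A sketch (our own function name):

```python
# A sketch of a recognizer for { a^n b | n ≥ 0 }: the recursive
# production S → aS becomes a recursive call, and S → b the base case.
def match_S(s):
    if s == "b":                    # production S → b
        return True
    if s.startswith("a"):           # production S → aS
        return match_S(s[1:])
    return False

print(match_S("aaab"), match_S("b"), match_S("ba"))
# → True True False
```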

A production is called recursive if its left side occurs on its right side. For example, the production S → aS is recursive. A production A → α is indirectly recursive if A derives a sentential form that contains A. For example, suppose we have the following grammar:

S → b | aA

A → c | bS.

The productions S → aA and A → bS are both indirectly recursive because of the following derivations:

S ⇒ aA ⇒ abS,

A ⇒ bS ⇒ baA.

A grammar is recursive if it contains either a recursive production or an indirectly recursive production. We can make the following general statement about grammars for infinite languages:

A grammar for an infinite language must be recursive.

We should note that a language can have more than one grammar. So we shouldn't be surprised when two people come up with two different grammars for the same language. The following list contains a few languages along with a grammar for each one.

Language / Grammar

{a, ab, abb, abbb} / S → a | ab | abb | abbb

{Λ, a, aa, …, a^n, …} / S → Λ | aS

{b, bbb, …, b^(2n+1), …} / S → b | bbS

{b, abc, aabcc, …, a^n b c^n, …} / S → b | aSc

{ac, abc, abbc, …, a b^n c, …} / S → aBc, B → Λ | bB

Sometimes a language can be written in terms of simpler languages, and a grammar can be constructed for the language in terms of the grammars for the simpler languages. We'll concentrate here on the operations of union, product, and closure.

Combining Grammars

Suppose M and N are languages whose grammars have disjoint sets of nonterminals. (Rename them if necessary.) Suppose also that the start symbols for the grammars of M and N are A and B, respectively. Then we have the following new languages and grammars: