Notes on Programming Language Syntax

by Geoffrey Smith and Jai Navlakha

At this point in the semester, we shift our focus from programming in F# to a more general study of programming languages. We will consider two main questions:

·  How can we preciselyspecifya programming language?

·  How can weimplementa programming language?

For specification, we will explore three formalisms:

1.  context-free grammarsfor specifying syntax,

2.  natural semanticsfor specifying behavior, and

3.  type systemsfor specifying "well-formed" programs.

When we consider language implementation, we will see that each of these formalisms can help us in building interpreters or compilers.

In these notes, we discussprogramming language syntax. Any programming language has precise rules for the syntax of legal programs, which must be checked by an interpreter or compiler. Syntax specification is traditionally divided into two parts: the specification of the legal "words" ortokensof the language, and the specification of which combinationsof tokens are legal. This division leads to two tasks for a compiler:lexical analysisandparsing.

Lexical Analysis

Lexical analysisis the process of transforming the source file from a sequence of characters into a sequence oftokens, usually including the stripping out of white space and comments. For instance, a source file like

if x <= 27 (* comment *) else

would be transformed into the tokens

IF ID(x) LEQ INT(27) ELSE EOF

Note thatxand27are recognized asID(identifier) andINT(integer literal). As far as syntax analysis is concerned, it makes no differencewhichIDorINTappears. But for type checking and code generation later, we do need to remember thesemantic valuesxand27. Also note thatEOFdenotes theend of file.

Tools likelexmake it very easy to construct a lexical analyzer, often called alexer. One writesregular expressionsthat describe the legal tokens in the language, along with the action to be taken when they are found. For example, here is alex-like specification:

if {return IF;}

[a-z][a-z0-9]* {return ID;}

[0-9]+ {return INT;}

(" "|\n)+ { }

Given this specification,lexautomatically constructs a lexer.

Context-Free Grammars

Parsingis the problem of checking whether the sequence of tokens produced by the lexer is legal. To specify what is legal, we use the formalism ofcontext-free grammarsor CFGs.

A CFG consists of a set ofnonterminals, one for eachsyntactic categoryto be defined. For example, we might have one nonterminal forexpressions, another forcommands, and another fordeclarations. Then we have a set of rules that describe all of the possible forms that each nonterminal can take. Each rule is of the formA -> alpha, whereA is a nonterminal andalphais a string of nonterminals and/or tokens.

Suppose for example that we wish to specifyarithmetic expressionsbuilt from identifiers using the binary operators+,-,*, and/. Then we need just one nonterminalE, seven tokens (i,+,-,*,/,(, and)), and six rules:

E -> i

E -> E+E

E -> E-E

E -> E*E

E -> E/E

E -> (E)

Normally we would write the rules more compactly as follows:

E -> i | E+E | E-E | E*E | E/E | (E)

Using this CFG we can generate legal arithmetic expressions by repeatedly replacingEusing whichever rule we like. For example, here is a derivation of the expression(i-i)+i*i.

E => E+E

=> E+E*E

=> E+i*E

=> (E)+i*E

=> (E)+i*i

=> (E-E)+i*i

=> (i-E)+i*i

=> (i-i)+i*i

Note that in each step we replaceoneEwith one of the rule right-hand sides,copyingall the other symbols verbatim. To avoid all this copying, it is much nicer to show a derivation as aparse tree, in which each nonterminal has aschildreneach of the symbols of the right-hand side of the chosen rule:

The parse tree shows that a given expression is syntactically legal. More importantly, it reveals thestructureof the expression. For example, it shows that the+operation is to be donelast, taking as its two operandsi-iandi*i.

However, there is a major concern about the structuring of expressions. How do we know that every expression has a unique parse tree? If not, then there could be more than one way to structure it. This leads to an important definition:

Definition: A context-free grammar isambiguousif there is at least one string with more than one parse tree.

It is easy to see that our example grammar is ambiguous. For example, here are two parse trees fori+i*i:

Which of these two trees matches the usual mathematical conventions?

You should also verify that our grammar gives two parse trees toi-i-i.

The trouble with our grammar is that it does not enforce eitherprecedenceorassociativityamong the operators. However, it turns out that we can rewrite our grammar so that the following properties hold:

·  The set of legal expressions is not changed.

·  The new grammar is unambiguous.

·  The new grammar gives*and/higher precedence than+and-.

·  The new grammar makes all operators associate to theleft.

The new grammar is a bit more complicated, as it requires three nonterminalsE,T, andFto separately defineexpressions,terms, andfactors:

E -> E+T | E-T | T

T -> T*F | T/F | F

F -> i | (E)

To convince yourself that the new grammar has the properties given above, you should draw the parse trees for a variety of expressions, such asi+i*i,(i+i)*i, andi-i-i.

Recursive-Descent Parsing

Given a CFG, theparsing problemis to find an algorithm that takes as input a sequence of tokens, and either

·  builds a parse tree, or

·  reports a syntax error (at some token).

Parsing is a very well-studied problem, and there exist nice tools likeyaccthat take as input a CFG and try to build a parser automatically.

Here I will briefly describerecursive-descent parsing, a simple (but not too general) parsing method that is easy to code by hand. [A parser that uses a set of recursive procedures to recognize its input with no backtracking is called a recursive-descent parser.] It is based on the observation that for some CFGs, a singlelookaheadtoken is enough to determine which rule to use next. Consider for example the following grammar for a little programming language:

S -> if E then S else S | begin S L | print E

L -> end | ; S L

E -> i

(For simplicity, we are limiting expressions to just identifiers.) It can generate programs like

if a then begin print b; print c end else print d

Suppose we are trying to parse anS. Notice that there are only three tokens that anScan start with,if,begin, andprint, each corresponding to a different rule. Hence if the parser knows which token is coming next, it can know which rule it needs to use. The same property holds forLand forEHence we can get the following C pseudo-code:

// lookahead token

int tok = nextToken();

void advance() {tok = nextToken();}

// used whenever a specific token t must appear next

void eat(int t) {if (tok == t) advance(); else error();}

void S() {

switch (tok) {

case IF: advance(); E(); eat(THEN); S(); eat(ELSE); S(); break;

case BEGIN: advance(); S(); L(); break;

case PRINT: advance(); E(); break;

default: error();

}

}

void L() {

switch (tok) {

case END: advance(); break;

case SEMICOLON: advance(); S(); L(); break;

default: error();

}

}

void E() {

if (tok == ID) then advance(); else error();

}

void main() {

S();

if (tok == EOF) accept(); else error();

}

To get a feel for recursive-descent parsing, you should try running this code (by hand) on sample inputs like

begin print a; print b end

and

begin print a; print b; end

(Note that the latter input has a syntax error.)

Finally, as written, the parser code only tests whether the input is syntactically legal or not. But it is quite straightforward to modify the code so thatS(),T(), andE()return parse treesfor whatever they parse.

Recursive-Descent Parsing of Arithmetic Expressions

I mentioned that recursive-decent parsing is "not too general". The reason is that many CFGs do not have the property that a single lookahead token determines which rule to use next. In particular our unambiguous grammar for arithmetic expressions doesnothave the property. Consider the rules

E -> E+T | E-T | T

If the lookahead isID, we have no way of knowing which rule to use.

Nevertheless, we can "hack" a solution in this case. Observe that the grammar forEmust ultimately generate aTfollowed by zero or more occurrences of+Tor-T. Hence we can parse anEusing the following code:

void E() {

T();

while (tok == PLUS || tok == MINUS) {

advance();

T();

}

}