Abdullah Sheneamer 2012
DCSPM
Develop and Compile Subset of
PASCAL Language to MSIL
By
Abdullah Sheneamer
A project submitted to the Faculty of Graduate School of the
University of Colorado at Colorado Springs
in Partial Fulfillment of the Requirements
for the Degree of
Master of Science in Computer Science
Department of Computer Science
Fall 2012
© Copyright by Abdullah Sheneamer 2012
All Rights Reserved
This project for the Master of Science degree by
Abdullah Sheneamer
has been approved for the
Department of Computer Science
By
______
Dr. Albert Glock, Advisor
______
Dr. C. Edward Chow, Committee member
______
Albert Brouillette, Committee member
______
Date
DCSPM
Develop and Compile Subset of
PASCAL Language to MSIL
Abstract
The focus of this project is to design the Intermediate language (IL or MSIL) for PASCAL Language. This project aims to design a compiler, called DCSPM, that can compile a program written in subset of PASCAL Language to MSIL including, Assignment statement, Writelineinstructions, If statement, If/else statement, While statement, For statement, Switch statement, If logic statement, and One dimensional array. The compilation time is important so, we have evaluated these different implementations for their speed performance in Lexical Analysis and Parser which can become bottleneck. It is shown that the DCSPM implementation is pretty fast and the generated code is reliable and efficient. First, lexical analysis is built, which reads the Pascal source code and produces tokens to be passed to the parser. MSIL of PASCAL was generated as the output of the parser. One of the most difficulties in this research is to verify the correctness of the MSIL code generated by DSCPM. I need to compare with the MSIL generated by a similar C# code and verify if they generated the same execution results. DSCPM supports simple nested if statement, if statement of a complex condition with a single level, and a simple one dimensional array with limited operation such as inside a print statement. DSCPM produces efficient MSIL intermediate code which can then be assembled into .NET managed executable. DSCPM can serve as an education tool for students studying PASCAL, compiler technology, and MSIL. The lessons learned can be applied to other programming languages.
Acknowledgements
I would never have been able to finish my dissertation without the guidance of my committee members, and support from my family and wife.
I would like to express my deepest gratitude to my advisor, Dr. Albert Glock, for his excellent guidance, caring, patience, and providing me with an excellent atmosphere for doing research.
I offer my sincerest gratitude to Dr. Edward Chow, who let me experience the research of practical issues beyond the textbooks, patiently corrected my writing research and giving important questionable ideas to me through his comments on my proposal.
I would also like to thankAlbert Brouillette for being interested in getting my proposal succeeded and giving important questionable ideas to me through his comments on my proposal.
Contents
1Introduction
1.1Motivation:
2Background
2.1Overview of Compilation Process
3Related Works
3.1Research Paper: “The Design and Implementation of C-like Language Interpreter” [XX11]
3.2Research Paper: “Simple Calculator Compiler Using Lex and YACC” [U11]
4History
5Design
5.1Introduction to Symbol Table and Lexical Analysis
5.1.1Symbol Table Design
5.1.2Lexical Analysis Design
5.2Parser and MSIL (Microsoft Intermediate Language) of PASCAL Language Design
5.2.1Introduction to Parser (Syntax Analysis)
5.2.2Parser (Syntax Analysis) Design
5.2.3Introduction to MSIL (Microsoft Intermediate Language)
5.2.4Intermediate language Instructions
5.2.5MSIL (Microsoft Intermediate Language) Design
5.2.6Design Common Syntax Errors Table
6Implementation
7Improvements and Evaluations
7.1Improvements
7.1.1Lexical Analysis Improvement
7.1.2Microsoft Intermediate Language (MSIL) of If Statement Improvement
7.2Evaluations and performance
8Lessons Learned
9Future Works
10Conclusion
11Bibliography
Appendix A:
PASCAL Grammar BNF.
Appendix B:
List of Figures
Figure 1: The compilation and execution process of PASCAL programs
Figure 2: Compilation process
Figure 3: A Compiler
Figure 4 : Class Token in Lexical Analysis
Figure 5: State diagram for the lexical Analyzer (states 0,1,2)
Figure 6: State diagram for the lexical Analyzer (states 3, 4, 5)
Figure 7: Syntax Tree
Figure 8: Typical Data Structure for the given Syntax Tree
Figure 9: Steps in the top-down construction of Parse Tree
Figure 10: Method memory categories
Figure 19: MSIL of One dimensional Array has one element
Figure 20: MSIL of One dimensional Array has four elements
Figure 11: Application Code using .NET
Figure 12: JIT Compilation
Figure 13: NET CLR
Figure 14: Array list data structure vs. Dictionary data structure
Figure 15: Parser phase results
Figure 16: initial and improved IF/Else MSIL results
Figure 17: Benchmark between size files of initial and improved IF/Else.il
Figure 18: How Branches of If/else statements logic works
List of Tables
Table 1: Part of Symbol Table
Table 2 : Two Characters Tokens
Table 3: Array list data structure vs. Dictionary data structure
Table 4: Complexity of ArrayList vs. Dictionary
Table 5: Parser phase results
Table 6: benchmark between unimproved and improved IF/Else MSIL
Table 7: benchmark between initial and improved IF/Else.il files
1Introduction
In the computer world, techniques evolve rapidly from theories, algorithms, programming languages, software systems, and software engineering.
“Programming languages are notations for describing computations to people and to machines. The world as we know it depends on programming languages, because all the software running on all the computers was written in some programming language. But, before a program can be run, it first must be translated into a form in which it can be executed by a computer. The software systems that do this translation are called compilers.” [ASU11]
Fortunately, compilers allow programmers to write at a high level, and automated processing takes care of creating the machine-specific instructions. My project designs and createsa compiler that translates PASCAL source code into Microsoft Intermediate Language (MSIL). When compiling the source code to managed code in .Net environment, the compiler translates the source into Microsoft Intermediate Language (MSIL).MSIL includes instructions for loading, storing, initializing, and calling methods on objects, as well as instructions for arithmetic and logical operations. There is currently no PASCAL compiler which compiles to MSIL. The Just-in-time (JIT) compiler will convert the MSIL to CPU- Specific code [MC5tk].
The advantage in compiling to MSIL is that 1) legacy PASCAL can now be run on modern machines, 2) MSIL is platform independent and 3) JIT compilers can be optimized for specific machines and architectures. The JIT compiler can also do aggressive optimizations specifically for the machine where the code is running.
“Before you can run Microsoft intermediate language (MSIL), it must be converted by a .NET Framework just-in-time (JIT) compiler to native code, which is CPU-specific code that runs on the same computer architecture as the JIT compiler. Because the common language runtime supplies a JIT compiler for each supported CPU architecture, developers can write a set of MSIL that can be JIT-compiled and run on computers with different architectures. However, your managed code will run only on a specific operating system if it calls platform-specific native APIs, or a platform-specific class library.
JIT compilation takes into account the fact that some code might never get called during execution. Rather than using time and memory to convert all the MSIL in a portable executable (PE) file to native code, it converts the MSIL as needed during execution and stores the resulting native code so that it is accessible for subsequent calls. The loader creates and attaches a stub to each of a type's methods when the type is loaded. On the initial call to the method, the stub passes control to the JIT compiler, which converts the MSIL for that method into native code and modifies the stub to direct execution to the location of the native code. Subsequent calls of the JIT-compiled method proceed directly to the native code that was previously generated, reducing the time it takes to JIT-compile and run the code.
The runtime supplies another mode of compilation called install-time code generation. The install-time code generation mode converts MSIL to native code just as the regular JIT compiler does, but it converts larger units of code at a time, storing the resulting native code for use when the assembly is subsequently loaded and run. When using install-time code generation, the entire assembly that is being installed is converted into native code, taking into account what is known about other assemblies that are already installed. The resulting file loads and starts more quickly than it would have if it were being converted to native code by the standard JIT option.
As part of compiling MSIL to native code, code must pass a verification process unless an administrator has established a security policy that allows code to bypass verification. Verification examines MSIL and metadata to find out whether the code is type safe, which means that it only accesses the memory locations it is authorized to access. Type safety helps isolate objects from each other and therefore helps protect them from inadvertent or malicious corruption. It also provides assurance that security restrictions on code can be reliably enforced.
The runtime relies on the fact that the following statements are true for code that is verifiably type safe:
- A reference to a type is strictly compatible with the type being referenced.
- Only appropriately defined operations are invoked on an object.
- Identities are what they claim to be.
During the verification process, MSIL code is examined in an attempt to confirm that the code can access memory locations and call methods only through properly defined types. For example, code cannot allow an object's fields to be accessed in a manner that allows memory locations to be overrun. Additionally, verification inspects code to determine whether the MSIL has been correctly generated, because incorrect MSIL can lead to a violation of the type safety rules. The verification process passes a well-defined set of type-safe code, and it passes only code that is type safe. However, some type-safe code might not pass verification because of limitations of the verification process, and some languages, by design, do not produce verifiably type-safe code. If type-safe code is required by security policy and the code does not pass verification, an exception is thrown when the code is run.” [MHt8e].
-Compilation process:takes PASCAL source code and produces MSIL. The PASCAL compiler includes lexical and syntax analysis, and the creation of the symbol table. MSIL is created when compiling to manage native code.MSIL is a CPU-independent set of instructions that can be efficiently converted to native code. Such as Figure 2.
-Execution process:MSIL must be converted to CPU-specific code, usually by ajust-in-time (JIT) compiler. Native code is computer programming (code) that is compiled to run with a particular processor(such as an Intelx86-class processor) and its set ofinstructions.
1.1Motivation:
“During compilation of MSIL, thesource codeis translated into MSIL code rather than platform or processor-specificobject code. MSIL is aCPU- and platform-independent instruction set that can be executed in any environment supporting the Common Language Infrastructure,such as the.NET runtimeonWindows, or thecross-platformMonoruntime. In theory, this eliminates the need to distribute different executable files for different platforms and CPU types. MSIL code is verified for safety during runtime, providing better security and reliability than natively compiled executable files. ” [ K08].
Since, there is currently no PASCAL compiler which compiles to MSIL so, I designed MSIL of subset of PASCAL language which has the advantage in compiling to MSIL is that 1) legacy PASCAL can now be run on modern machines, 2) MSIL is platform independent and 3) JIT compilers can be optimized for specific machines and architectures.
2 Background
2.1Overview of Compilation Process
A compiler is a program that can read a program in one language- the source language – and translate it into equivalent program in another language as Figure 2. An important role of the compiler is to report any errors in the source program that it detects during the translation process [ASU11].
“Microsoft Intermediate Language (MSIL) is a language used as the output of a number of compilers (C#, VB, .NET, and so forth).The ILDasm(Intermediate Language Disassembler) tool that ships with the .NET Framework SDK (FrameworkSDK\Bin\ildasm.exe) allows the user to see MSIL code in human-readable format. By using this utility, we can open any .NET executable file (EXE or DLL) and see MSIL code.
TheILAsm (Intermediate Language Assembler) tool generates an executable file from the MSIL language. We can find this program in the WINNT\Microsoft.NET\Framework\vn.nn.nn directory. Any PASCAL programmer starting with .NET development is interested in what happens in the low level of the .NET Framework. Learning MSIL gives a user the chance to understand some things that are hidden from a programmer working with PASCAL or another language. Knowing MSIL gives more power to a .NET programmer. We never need to write programs in MSIL directly, but in some difficult cases it is very useful to open the MSIL code in ILDasm and see how things are done.” [ CodeMSIL].
3Related Works
3.1Research Paper: “The Design and Implementation of C-like Language Interpreter” [XX11]
This paper designs and implements a C-like language interpreter using C++ based on the idea of modularity. The function of lexical analyzer is to read character strings from the source program, split them into separate words, and constructs the internal expression of these words, that is, TOKEN. The basic idea of lexical analyzer design is: first, to judge the start and the end position of a word; second, to judge the attribute of a word. After a word is separated, the next thing is to determine its attribute [xx11]. Lexical Analyzer of DCSPM was presented the same technique in this paper but there are different in types of tokens. DCSPM reads Pascal code from a file, split it into separate words and types of tokens can be divided into:Language-defined tokens, Reserved words, sets of symbols defined by the language to have special meaning, e.g., “begin”, “defun”, etc. Operators: =, +, *, etc. Dividers: { }; User-defined tokens:Identifiers: names of variables, procedures, etc. Literals: numbers, strings, etc. specified in the program. The symbol table is concerned only with storing identifiersand their attributes.
3.2Research Paper: “Simple Calculator Compiler Using Lex and YACC” [U11]
Mohit Upadhyaya presented a This paper containings the details of how one can develop the simple compiler for procedural language usingLex (Lexical Analyzer Generator) and YACC (Yet Another Compiler-Compiler). Lex tool helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-scripts type transformations and for segmenting input in preparation for a parsing routine. Lex tool source is the table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. On the other hand YACC tool receives input of the user grammar. Starting from this grammar it generates the C source code for the parser. YACC invokes Lex to scan the source code and uses the tokens [Upad11]. In lexical analyzer of DCSPM reads identifier, reserved words, sets of symbols, operators, dividers, variables, procedures, numbers, and strings. In the parser of DCSPM, not just regular expression is constructed, but also most of the statements of Pascal language.
4History
“Pascalis an influential imperative and procedural programming language, designed in 1968–1969 and published in 1970 byNiklaus Wirtha small and efficient language intended to encourage good programming practices using structured programming and data structuring. A derivative known asObject Pascaldesigned for object-oriented programming was developed in 1985.
Pascal, named in honor of the French mathematician and philosopher Blaise Pascal, was developed by Niklaus Wirthand based on the ALGOL programming language
Prior to his work on Pascal, Wirth had developedEulerandALGOL Wand later went on to develop the Pascal-like languagesModula-2andOberon.
Initially, Pascal was largely, but not exclusively, intended to teach students structured programming.A generation of students used Pascal as an introductory language in undergraduate courses. Variants of Pascal have also frequently been used for everything from research projects to PC games and embedded systems.. Newer Pascal compilers exist which are widely used.” [WikiPascal].
Grace Murray Hopper coined the term compiler in the early 1950s. Translation was viewed as the “compilation” of a sequence of machine language subprograms selected from a library. One of the first real compilers was the FORTRAN compiler of the late 1950s. It allowed a programmer to use a problem-oriented source language. Ambitious “optimizations” were used to produce efficient machine code, which was vital for early computers with quite limited capabilities. Efficient use of machine resources is still an essential requirement for modern compilers [PagesCs].
5Design
5.1Introduction to Symbol Table and Lexical Analysis
A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier (information about storage allocation, type,…, etc.). When the lexical analyzer detects an identifier in the source, the identifier is entered into the symbol table. However, its attributes will be entered in the following phases. These attributes are also used later phases.
The lexical Analysis is the first phase of a compiler is called lexical analysis or scanning. The lexical Analysis reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.