Implementation of Microarray Data Analysis Algorithms As a Windows Application

TABLE OF CONTENTS

LIST OF FIGURES

ACKNOWLEDGEMENTS

1. INTRODUCTION

1.1 Introduction to microarray:

1.2 Image Analysis:

1.3 Data Analysis:

2. TECHNOLOGY:

2.1 Excel Database:

2.2 Microsoft .NET:

2.2.1) Why .NET?

2.2.2) VS.NET for Office Solutions:

2.2.3) Primary Interop Assemblies (PIA)

2.2.4) Coding with .NET Libraries:

3. IMPLEMENTATION OVERVIEW:

3.1 Application Design Overview:

3.2 Algorithm Outline

3.3 Data description

3.4 User Interface:

1) Parent Form:

2) Detection Calls Option:

3) T-test of Signal Log Ratios Option:

4) Set the criteria for Fold Change Option:

3.5 Why DLLs:

3.6 Class Diagram of the methods:

3.7 Description of methods used in the dlls:

3.7.1 VedmetDLL.dll – SetWorksheet ():

3.8 Description of methods used in the Options form:

3.8.1 Class: Form 1

3.9 Features:

3.10 Screen Shots

4. TESTING:

5. ENHANCEMENTS:

6. REFLECTIONS:

7. CONCLUSION:

8. REFERENCES:

9. APPENDIX

a) Transferring the GCOS CAB files into Excel:

LIST OF FIGURES

Figure 1: Gene Chip Array

Figure 2: Hybridization of tagged and untagged probes

Figure 3: Gene Chip Technology

Figure 4: Scanning of tagged and untagged probes

Figure 5: Excel Object Model

Figure 6: Application Design Overview

Figure 7: Algorithm Overview

Figure 8: Class Diagram

ACKNOWLEDGEMENTS

I sincerely thank Dr. Dan Andresen, my major professor, for his invaluable guidance and advice throughout the whole project. I also thank him for being flexible and adjusting during the course of the project.

I am grateful to Dr. Daniel Marcus, who helped me understand the background of this project. I thank him for his valuable suggestions and for being supportive.

I thank Dr. Mitch Neilsen for serving on my committee and agreeing to review my report.

I am indebted to my parents for their love and encouragement. Last but not least I thank all my friends including Palani and my brother Dinesh for their support.

1. INTRODUCTION

The analysis of microarray data has been a time-consuming task that involves applying different algorithms to genome databases. Previously, this was done with a collection of separate software packages, each suited to a particular purpose. As a step toward automating the procedure, a Windows application was developed that provides a biologist-friendly user interface to the genome databases (gene lists) obtained from the Gene Chip Operating Software (GCOS) 1.2. GCOS is a software package used specifically for acquiring and analyzing gene array data from the Affymetrix Gene Chip platform.

1.1 Introduction to microarray:

The fundamental basis of DNA microarrays is the process of hybridization. Two DNA strands hybridize (stick together) if they are complementary to each other. Complementarity reflects the Watson-Crick rule that adenine (A) binds to thymine (T) and cytosine (C) binds to guanine (G). Hybridization has for decades been used in molecular biology as the basis for such techniques as Southern blotting and Northern blotting. Whereas it was previously possible to run only a couple of Northern or Southern blots in a day to identify a few expressed genes, it is now possible with DNA arrays to run hybridizations that test for the expression of tens of thousands of genes. This has in some sense revolutionized molecular biology and medicine. Instead of studying one gene and one message at a time, experimentalists are now studying many genes and many messages at the same time. In fact, DNA arrays are often used to study all known messages for the genes of an organism. This has opened the possibility of an entirely new, systematic view of how cells react in response to certain stimuli. It is also an entirely new way to study human disease by viewing how it affects the expression of all genes inside the cell.
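The Watson-Crick pairing rule described above can be sketched in a few lines of code. The example below is written in Python purely for illustration (the application described in this report is written in C#), and the function names are hypothetical. It treats two strands as hybridizing only when one is the exact reverse complement of the other, which is a simplification of real hybridization chemistry.

```python
# Watson-Crick base pairing: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the strand that pairs with `seq`, read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def can_hybridize(probe: str, target: str) -> bool:
    """True if `target` is the exact reverse complement of `probe`.

    Assumption: perfect complementarity is required; partial matches,
    which can occur on a real array, are ignored in this sketch.
    """
    return reverse_complement(probe) == target.upper()
```

For instance, `can_hybridize("ATGC", "GCAT")` holds because reversing ATGC gives CGTA, whose complement is GCAT.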

The Technology behind DNA Microarrays:

A microarray is a solid support on which DNA of known sequence is deposited in a regular grid-like array. The DNA may take the form of cDNA or oligonucleotides, although other materials may be deposited as well. Typically, several nanograms (per chip) of DNA are immobilized on the surface of an array.

Figure 1: GeneChip Array

Image courtesy of Affymetrix - http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx

RNA is extracted from biological sources of interest, such as cell lines with or without drug treatment, tissues from wild-type or mutant organisms, or samples studied across a time course. The RNA (or mRNA) is often converted to cDNA, labeled with fluorescence or radioactivity, and hybridized to the array.

Figure 2: Hybridization of tagged probes

Image courtesy of Affymetrix - http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx

During this hybridization, cDNAs derived from RNA molecules in the biological starting material can hybridize selectively to their corresponding nucleic acids on the microarray surface. Following washing of the microarray, image analysis and data analysis are performed to quantitate the signals that are detected. Through this process, microarray technology allows the simultaneous measurement of the expression levels of thousands of genes represented on the array.

Figure 3: GeneChip Technology

Image courtesy of Affymetrix - http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx

Advantages of Microarrays:

1. It is fast; one can obtain data on the expression levels of over 50,000 genes within one week.

2. The entire genome can be represented on a chip and thus it is comprehensive.

3. It is flexible because cDNAs or oligonucleotides corresponding to any gene can be represented on a chip.

Disadvantages of Microarrays:

1. Many researchers find it prohibitively expensive to perform sufficient replicates and other controls.

2. There are many artifacts associated with image analysis and data analysis. Researchers are still figuring out how to get the “best” answers from microarray experiments.

3. Microarrays alone are not enough; the results usually have to be validated using a technique such as RT-PCR.

4. There is NO standard way to analyze microarray data.

5. Getting answers requires combining knowledge of biology, statistics, and computing, so the learning curve is steep.

Applications of Microarray:

1. Studying the effects of drug treatment

2. Gene knockout effects

3. Gene cloning

4. Cancer research

5. Developmental biology (like stem cell populations)

1.2 Image Analysis:

When the microarray chip is illuminated by a laser beam, the RNA that has been hybridized fluoresces, producing brightness proportional to the amount of hybridized RNA. This image is captured by a camera and it is then processed by a computer to get the expression levels of all the genes.

Figure 4: Scanning of tagged and untagged probes

Image courtesy of Affymetrix - http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx

Background subtraction:

The first step of data analysis is to correct for background across the entire array. The array is divided into equally spaced zones, and an average background is assigned to the center of each zone. The background calculated for each cell establishes an intensity floor that is subtracted from all intensity values.
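A minimal sketch of zone-based background subtraction follows, written in Python for illustration only. It rests on several assumptions that are not taken from this report: the array is a square grid of intensities, each zone's background is estimated as the mean of its dimmest cells (the `fraction` parameter), every cell uses the background of its own zone without any interpolation between zone centers, and corrected values are clamped at zero. The function names are hypothetical.

```python
def zone_backgrounds(grid, zones_per_side=2, fraction=0.02):
    """Estimate one background value per zone of a square intensity grid."""
    n = len(grid)
    if n % zones_per_side != 0 or any(len(row) != n for row in grid):
        raise ValueError("expected a square grid divisible into equal zones")
    size = n // zones_per_side
    backgrounds = {}
    for zr in range(zones_per_side):
        for zc in range(zones_per_side):
            # Collect and sort the intensities of all cells in this zone.
            cells = sorted(
                grid[r][c]
                for r in range(zr * size, (zr + 1) * size)
                for c in range(zc * size, (zc + 1) * size)
            )
            # Assumption: background = mean of the dimmest `fraction` of cells.
            k = max(1, int(len(cells) * fraction))
            backgrounds[(zr, zc)] = sum(cells[:k]) / k
    return backgrounds


def subtract_background(grid, zones_per_side=2, fraction=0.02):
    """Subtract each zone's background from its cells, clamping at zero."""
    backgrounds = zone_backgrounds(grid, zones_per_side, fraction)
    size = len(grid) // zones_per_side
    return [
        [max(0.0, value - backgrounds[(r // size, c // size)])
         for c, value in enumerate(row)]
        for r, row in enumerate(grid)
    ]
```

Using each cell's own zone keeps the sketch short; a smoother correction would weight the zone backgrounds by the cell's distance from each zone center.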

1.3 Data Analysis:

The data analysis starts with the normalization and/or scaling of the data, which is a built-in step when the microarray image is acquired by the Gene Chip Operating Software (GCOS) 1.2. (The normalization and scaling can be performed by either data acquisition software or data analysis software.) The correlation between samples can be examined with methods such as cluster analysis, condition trees, and principal component analysis. Once the data is acquired, it is generally exported to an Excel file, where the analysis is done. The post-analysis steps include Gene Ontology (i.e., classifying the genes into different functional groups) and inputting the genes into pathways from different databases (which can be performed in different software).

There are a number of commercially-available software packages that might include the analysis steps that are implemented in this Windows application. As such, there is no perfect method for data analysis and it is up to the investigators to decide which of the steps or algorithms to follow for their data analysis. This application is highly customized for use in the Cellular Biophysics laboratory of the Department of Anatomy and Physiology, College of Veterinary Medicine, Kansas State University.

2. TECHNOLOGY:

The front end has been implemented in Visual C# .NET as a Windows application. The back end is Microsoft Excel: the spreadsheet obtained from GCOS is an Excel file, and the output data sheet that biologists expect from a tool is also an Excel spreadsheet.

2.1 Excel Database:

An Excel spreadsheet is a straightforward solution for storing tabular data, and recent versions of the ubiquitous Microsoft Excel include some surprisingly sophisticated data access and manipulation functions.

Issues:

There are a number of reasons why Excel is not to be preferred for data management and/or statistical analysis. Some of the simpler reasons are:

a) There is no way to record what you have done.

b) Poor statistical routines: it is impossible to view the source code that implements the statistical routines, and several Excel procedures are misleading.

c) Routines for handling missing data were incorrect in versions prior to Excel 2000. [In reference to versions before Excel 2000: "Excel does not calculate the paired t-test correctly when some observations have one of the measurements but not the other." E. Goldwater, ref. (1)].

In addition, copying the formulas to all the cells is a manual process, and sorting the columns and the datasheets can sometimes be dangerous.
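Issue (c) above can be made concrete with a short sketch of a paired t statistic that handles missing observations explicitly. It is written in Python for illustration only and is not the routine used by Excel or by the application described in this report. It assumes that a missing measurement is represented by `None` and that any pair with a missing member is dropped before the statistic is computed; the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for paired samples.

    Assumption: None marks a missing measurement, and only pairs in
    which both measurements are present contribute to the statistic.
    """
    diffs = [b - a for a, b in zip(first, second)
             if a is not None and b is not None]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

Dropping incomplete pairs keeps the number of observations and the degrees of freedom consistent with the differences that are actually used.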

However, Microsoft Excel remains the conventional tool that biologists use to analyze data for large genomes. Calculating the cells for the complete genome, sorting and filtering the data, and copying different results to different worksheets for subsequent manipulation are menial tasks, yet Excel is still considered an easy solution. The solution I propose is therefore a Windows application that aids the analysis task by interfacing with and programming against Excel worksheets.

2.2 Microsoft .NET:

2.2.1) Why .NET?

The reasons for choosing .NET include the following:

· .NET provides the ability to create rich clients that execute within the Common Language Runtime. These applications utilize a new Windows forms processing engine, called Windows Forms. Any .NET language can use Windows Forms to build Windows applications. These applications have access to the complete .NET Framework of namespaces and objects, and have all of the advantages which the Framework can offer.

· It is object-oriented and has many programming tools that allow for faster development and more functionality.

· All applications in .NET are "garbage-collected", which means that objects are destroyed automatically when they are no longer in use.

2.2.2) VS.NET for Office Solutions:

The key benefits of choosing VS.NET as the development environment for Office solutions are:

· Power of writing managed .NET code that executes behind Word and Excel documents

· Developers get the full, robust advantages of the Visual Studio .NET environment

· Allows developers to create applications with a more robust security model, restricting code so that it can execute only from a fully trusted corporate server.

· Code-behind .NET projects can be started in .NET with new Office documents, applied to existing Excel spreadsheets or Word documents and templates, and even co-exist with current VBA-based logic.

· Using VS.NET facilitates language freedom, easier debugging, better memory management, and a more robust security model.

The VBA programming model still exists; the .NET Office tools are simply another choice.

2.2.3) Primary Interop Assemblies (PIA)

Microsoft provides official wrapper assemblies for writing managed code against the programmable unmanaged Microsoft Office libraries. These are called the primary interop assemblies.

The Office PIAs are installed in order to:

· Develop managed Office applications within the robust Visual Studio development environment

o Create Excel and Word solutions directly from within Visual Studio

· Develop applications more quickly with less code

· Utilize Visual Studio’s vast array of tools, easy access to web services, and access to the .NET Framework

· Leverage existing Visual Studio/VB/C# experience

2.2.4) Coding with .NET Libraries:

Here is a quick walkthrough of the Excel object model. The complete model is somewhat complicated, but only a few objects are required for our Windows application:

Figure 5 – Excel Object Model

· The Application object is the controller object of all other subsystems in the Excel application.

· Each application can have multiple Workbooks. There is one default workbook for each application, and it is returned to the ‘ThisWorkbook’ object variable.

· By default, each new Workbook has 3 Worksheets (the actual data presentation area). Developers can present data in any one of these sheets, or they can create their own additional worksheets.

3. IMPLEMENTATION OVERVIEW:

3.1 Application Design Overview:

The application has references to the Microsoft Excel 11.0 Object Library and to the DLLs that contain the analysis modules and their user interfaces. All the references and their associated items can be managed through Solution Explorer, which is provided as part of the integrated development environment (IDE). The solution is a container for the projects and solution items that can be built into the application. The Windows application interacts with the Excel database through the solution level, which holds all the build configurations.

Figure 6 – Application Design Overview

3.2 Algorithm Outline

The workflow of this application is shown in the architecture diagram below. The user imports the Excel file containing the gene list to be processed (see Appendix (a) for transferring a CAB file into Excel) and builds the analysis strategy by selecting options in the first form. The analysis modules and their user interfaces are placed in separate dynamic link libraries (DLLs), which are added as references to the Windows application. Depending on the steps the user selects, the appropriate forms are opened by calling the DLLs, and in each form the user sets the criteria for that step. The actions in those forms call the corresponding functions in the DLLs. The modules in the DLLs are interfaced with the Excel spreadsheet: the analysis module for the first option selected by the user filters the data set and returns an array list containing the row numbers of the genes that match the criteria. This array list is passed to the analysis module for the second selected option, and in this way the filtering proceeds step by step. Finally, the data set that has passed all the analysis steps is exported as an Excel worksheet. It can also be previewed within the “Preview Result” textbox of the form.
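The chained filtering described above can be sketched as follows. This is an illustration in Python, not the C# code of the application: the column names (`detection`, `fold_change`), the threshold values, and all function names are hypothetical, and the gene list is modeled as a plain list of dictionaries rather than an Excel worksheet. Each step receives the row numbers that survived the previous step and returns the subset that also meets its own criterion.

```python
def detection_filter(rows, candidates, call="P"):
    """Keep row numbers whose (hypothetical) detection call equals `call`."""
    return [i for i in candidates if rows[i]["detection"] == call]


def fold_change_filter(rows, candidates, minimum=2.0):
    """Keep row numbers whose absolute fold change is at least `minimum`."""
    return [i for i in candidates if abs(rows[i]["fold_change"]) >= minimum]


def run_pipeline(rows, steps):
    """Apply the selected steps in order, passing row numbers along."""
    candidates = list(range(len(rows)))  # start with every row
    for step in steps:
        candidates = step(rows, candidates)
    return candidates


def export_rows(rows, candidates):
    """Collect the surviving rows, standing in for the Excel export."""
    return [rows[i] for i in candidates]
```

Passing row numbers rather than copies of the rows means each step only reads the source data, and the order of the steps follows the order in which the user selected them.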