UNIVERSITY OF SOUTH AUSTRALIA
Assignment Cover Sheet – Internal

Name: Phirun Son
Student ID: 100064517
Email:
Course code and title: CIS Research Methods INFT 4017
School: CIS / Program code: LHIS
Course Coordinator: Prof. Paul Swatman / Tutor: Prof. Paul Swatman
Day, Time, Location of Tutorial/Practical: Thursday 9-12
Assignment number: / Due date: 14 June 2009
Assignment topic as stated in Course Information Booklet:
Research Proposal

I declare that the work contained in this assignment is my own, except where acknowledgement of sources is made.

I authorise the University to test any work submitted by me, using text comparison software, for instances of plagiarism. I understand this will involve the University or its contractor copying my work and storing it on a database to be used in future to test work submitted by others.

I understand that I can obtain further information on this matter at http://www.unisanet.unisa.edu.au/learningconnection/student/studying/integrity.asp

Note: The attachment of this statement on any electronically submitted assignments will be deemed to have the same authority as a signed statement.

Signed: Phirun Son / Date: 14 June 2009

University of South Australia

School of Computer and Information Science

Bachelor of Information Science

(Advanced Computer and Information Science)

CIS Research Methods

Research Proposal

Modelling Microarray Data with Bayesian Networks Using Structure Learning

Student: Phirun Son

Student ID: 100064517

Supervisor: Dr Lin Liu

Table of Contents

Abstract

Introduction

Background

Research Question

Purpose and Limitations

Literature Review

DNA Microarray

Bayesian Networks

Structure Learning

Summary

Research Method

Platform

Expected Outcome

References

Abstract

Gene microarrays are a technology that allows the expression levels of several thousand genes to be measured at once. The result is captured as an array of thousands of small dots, which together represent a snapshot of the expression levels of those genes. This allows the acquisition of large amounts of data about what individual genes are responsible for (Shalon et al, 1996). However, as this data becomes more prevalent, its volume and complexity mean that some degree of automation is required in order to analyse it.

The challenge of finding methods to analyse these vast amounts of data is important, as such methods may allow the discovery of causal dependencies and relations between different genes, and may eventually allow the overall structure of the gene network to be modelled. There have been past attempts to model microarray data as causal networks, including the use of Bayesian networks. Bayesian networks have characteristics that are desirable when working with microarray data, as they can be used to learn network structure from data that is incomplete or noisy, as microarray data is (Heckerman, 1998). The difficulty lies in determining the correctness of the structures found, but even partially correct graphs can guide further research aimed at validating the results.

To determine the correctness of structures learned from data, existing networks whose structure is already known can be taken, and sample data from them used to try to recreate that structure. This can help validate the accuracy of the algorithms used to determine the structure, or expose flaws in them. By testing and comparing various proposed algorithms, the best-performing one may be identified and then applied to new or existing gene data.

Introduction

Background

Genetic research has recently become a major area of bioinformatics. In particular, growing attention has been paid to understanding gene regulatory networks, also known as gene expression networks, which govern the way our bodies control cellular functions.

Deoxyribonucleic acid (DNA) is the template that our bodies use to create our cell structures. It is a form of nucleic acid which contains genetic instructions, which are relayed to other components of the cell for the purpose of protein creation. This is done by creating messenger ribonucleic acid (mRNA): the DNA sequence information is transcribed into an RNA sequence, and the mRNA carries the information needed to create cellular proteins (Knox et al, 2001).

The process by which this information is transcribed and used is known as gene expression, and the control of that process is known as gene regulation. The levels at which genes are expressed can be measured in a variety of ways, but one method that has allowed an unprecedented scope of data is the use of microarrays (Shalon et al, 1996).

A gene microarray is a technology which is able to capture the expression levels of thousands of genes at a time, giving a huge amount of data. This is done using a solid surface, such as a glass or silicon chip, to which thousands of microscopic spots are attached. These spots are known as probes, and each is generally a short section of a gene, which can be processed using chemicals such as fluorophores in order to determine the level of expression of that gene. The expression levels of the targeted genes are generally captured over a period of time during which the genes are exposed to some form of stimulus, such as a drug treatment, giving the progression of expression levels as a reaction to that stimulus (Shalon et al, 1996).

This method of capturing gene expression levels clearly creates a huge amount of data to interpret. There have been many attempts at fitting this data to a formal structure so that causal relationships between genes can be found, including the use of Boolean networks and linear structures. However, a large focus has been put on Bayesian networks as a means of constructing the gene regulatory network. Bayesian networks have several desirable features, described later, which give them an advantage in modelling the data produced by microarray experiments.

Research Question

Many novel techniques have been proposed which aim to ascertain a gene regulatory network by evaluating microarray data. However, no single technique is foolproof in creating an accurate model, and too little is known about real gene network structures to validate whether the results of these techniques are correct. One way to move forward in this area is to show that certain techniques are viable on networks whose structure is already known, by trying to recreate those structures using only data and seeing whether the results are sufficiently close to the original structure.
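As a rough illustration of this kind of comparison, the sketch below scores a learned structure against a known one by counting correct, reversed, extra and missing directed edges. It is written in Python purely for illustration (the proposal itself plans to use MATLAB), and the function name `compare_structures`, the edge-set representation and the choice of precision and recall as measures are all assumptions made for this example rather than part of the proposed method.

```python
from typing import Dict, Set, Tuple

# A directed edge (regulator, target). This representation is an assumption
# made for the example, not something prescribed by the proposal.
Edge = Tuple[str, str]


def compare_structures(true_edges: Set[Edge],
                       learned_edges: Set[Edge]) -> Dict[str, float]:
    """Compare a learned directed graph against a known reference graph."""
    # Edges found with the correct direction.
    correct = true_edges & learned_edges
    # Edges found between the right pair of genes but pointing the wrong way.
    reversed_edges = {(a, b) for (a, b) in learned_edges
                      if (b, a) in true_edges and (a, b) not in true_edges}
    # Learned edges that have no counterpart at all in the reference graph.
    extra = learned_edges - true_edges - reversed_edges
    # Reference edges that were not found in either direction.
    missing = {(a, b) for (a, b) in true_edges
               if (a, b) not in learned_edges and (b, a) not in learned_edges}

    precision = len(correct) / len(learned_edges) if learned_edges else 0.0
    recall = len(correct) / len(true_edges) if true_edges else 0.0
    return {
        "correct": len(correct),
        "reversed": len(reversed_edges),
        "extra": len(extra),
        "missing": len(missing),
        "precision": precision,
        "recall": recall,
    }


if __name__ == "__main__":
    # Hypothetical gene names, chosen only to exercise each category.
    known = {("A", "B"), ("B", "C"), ("A", "D")}
    learned = {("A", "B"), ("C", "B"), ("D", "E")}
    print(compare_structures(known, learned))
```

Counting reversed edges separately is one possible design choice: an algorithm that finds the right pair of genes but the wrong direction has still recovered part of the structure, which is consistent with the idea that partially correct graphs remain useful.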

This paper will aim to examine the following questions:

Are Bayesian network structures suitable for modelling gene regulatory networks?

This question will determine whether Bayesian networks are truly a viable method of determining the structure of a gene network.

Which methods of structure learning are best used to create an accurate model of a gene network?

This question will help to identify the most promising of the current Bayesian network techniques for determining a network structure from data.

Are these methods viable for use in helping to determine the network structure of an unknown network?

This question will illustrate the practical use of Bayesian networks in discovering actual gene networks, and help validate that the techniques used are feasible for real-world use.

In order to answer these questions, a testing environment will be set up in MATLAB, an established programming environment used in scientific and mathematical research.

Purpose and Limitations

Gene research is an important element in discovering more about how our bodies function. It can help to found new medical research and to develop treatments and cures for diseases. If we could better understand the way that our genes interact with one another to produce the building blocks of our bodies, further research in genetic science would become possible.

This can be achieved by using the data collected from microarray experiments to estimate a causal network describing which genes interact with and depend on other genes. This is a large problem: although there are established methods of modelling data, working with gene data poses many challenges, the most daunting of which is the sheer number of genes present in our DNA, of which a single experiment covers only a few thousand.

As stated earlier, there are already a good many novel approaches to modelling gene networks. Most of these techniques take existing algorithms for learning the structure of a Bayesian network and modify them in order to produce more accurate results. Creating such a modification is too large a task for the time available to this research; therefore this research will focus on taking known existing techniques and applying them in order to compare them and determine which are most viable.

The work done will contribute to the knowledge of which structure learning techniques are most suitable, and whether certain techniques should be pursued further. It may help future researchers to decide which techniques are worth focusing on and extending, and which should possibly be set aside because of their inaccuracy or infeasibility.

Literature Review

DNA Microarray

DNA microarray technology is fairly recent, having emerged in the mid-1990s. Along with allowing an unprecedented number of genes to be monitored at once, microarray technology solved problems faced by earlier methods of monitoring gene expression levels, such as blotting. For instance, blotting required the use of porous membranes to which the gene probes were attached. This limited the number of genes that could be measured, because it required radioactive, chemiluminescent or colorimetric detection methods, which cause the probe readings to scatter and disperse. This is not the case with microarray technology, which can be applied on a glass surface and uses fluorescent detection methods, resulting in lower background interference and greater probe density (Shalon et al, 1996). This not only allows more genes to be measured at once, but also increases consistency, as they are measured during the same experiment rather than across several experiments whose results would later have to be normalised.

Microarray technology is not the only technology of this kind to have emerged in recent years; other methods include serial analysis of gene expression (SAGE) (Velculescu et al, 1995). However, microarray technology has been the most widely adopted because it is relatively easy to use: it does not require radioactive materials, and therefore does not require specialised labs (Russo et al, 2003). Another advantage is that microarray technology is relatively cheap, and there are various types of microarray technology at a range of costs, allowing firms of various sizes to afford some kind of microarray-based experiment (Granjeaud et al, 1999).

Due to the popularity and relative ease of microarray technology, there has been a huge influx of microarray data, with little ability to process it for valuable information (Granjeaud et al, 1999). To compound this, there was initially little standardisation of the format of microarray data. Many different manufacturers and procedures are used to collect the gene data, so naturally there is variation in how the final data is represented. There have been attempts to standardise the format of microarray data to relieve this problem, with the most widely accepted being ‘Minimum Information About a Microarray Experiment’ (MIAME), developed by the Microarray and Gene Expression Data Society (MGED) (Brazma et al, 2001). This standardisation has allowed public repositories of microarray data to be formed. Such databases include ArrayExpress from the European Bioinformatics Institute (Brazma et al, 2003) and Gene Expression Omnibus (GEO) from the National Center for Biotechnology Information (Edgar et al, 2002), both of which contain an abundance of freely available microarray data.

Although an abundance of data is available in a standard format, there is still the problem of modelling this data. This is difficult due to the nature of gene microarray data, but many approaches have been put forward.

Bayesian Networks

There have been different ideas as to which model would be best suited to modelling gene networks, such as neural networks (Xu et al, 2007), linear equations (Gebert et al, 2007) and Boolean networks (Shmulevich et al, 2002), but the model that seems to hold the most promise is the Bayesian network.

Bayesian networks, also known as belief networks, represent a set of variables as nodes on a directed acyclic graph (DAG), whose structure encodes the conditional independencies between these variables. According to Heckerman (1998), Bayesian networks offer four advantages as a data modelling tool, three of which are directly beneficial when working with gene microarray data. Firstly, Bayesian networks are able to handle incomplete or noisy data, which is common in microarray experiments due to the way the data is captured. Secondly, Bayesian networks are able to ascertain causal relationships through conditional independencies, allowing relationships between genes to be modelled. Thirdly, Bayesian networks are able to incorporate existing knowledge into their learning, allowing more accurate results by using what we already know. These points are reiterated in many papers as reasons why Bayesian networks are a viable solution for modelling microarray data (Chen et al, 2006; Spirtes et al, 2001; Wang et al, 2007; Yavari et al, 2008).
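To make the DAG representation more concrete, the following minimal sketch builds a three-gene Bayesian network in which gene A regulates genes B and C. It computes the joint probability of a full set of expression states as the product of each node's probability given its parents, and sums over unobserved genes to handle an incomplete observation. The gene names, the binary on/off states and every probability value are hypothetical, and Python is used only for illustration (the proposal itself plans to use MATLAB).

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical network structure: A -> B and A -> C.
# Each node maps to the list of its parents in the DAG.
parents: Dict[str, List[str]] = {"A": [], "B": ["A"], "C": ["A"]}

# Hypothetical conditional probability tables.
# cpt[node][parent_states] is P(node = 1 | parents = parent_states).
cpt: Dict[str, Dict[Tuple[int, ...], float]] = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.2, (1,): 0.7},
}


def joint_probability(state: Dict[str, int]) -> float:
    """P(state), computed as the product of P(node | parents) over all nodes."""
    prob = 1.0
    for node, node_parents in parents.items():
        p_on = cpt[node][tuple(state[p] for p in node_parents)]
        prob *= p_on if state[node] == 1 else 1.0 - p_on
    return prob


def marginal_probability(observed: Dict[str, int]) -> float:
    """P(observed), summing the joint over every unobserved (missing) gene."""
    hidden = [node for node in parents if node not in observed]
    total = 0.0
    for values in product((0, 1), repeat=len(hidden)):
        state = dict(observed)
        state.update(zip(hidden, values))
        total += joint_probability(state)
    return total


if __name__ == "__main__":
    # Full observation: A on, B on, C off.
    print(joint_probability({"A": 1, "B": 1, "C": 0}))
    # Incomplete observation: only B was measured.
    print(marginal_probability({"B": 1}))
```

Summing over all states of the unobserved genes is only practical for very small networks such as this one; it is shown here simply to indicate how a missing measurement can still be accommodated by the model.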