CS445/545, BIO495/695 Introduction to Bioinformatics

Sample Project 3

Microarray Data Analysis

This project involves the analysis of a set of real microarray data and is to be done in your project group. Attach a statement to your project report that clearly describes the work and responsibilities of each student in the group, so that each of you can be graded fairly and separately.

Each group will give a 10-12 minute oral presentation of the results of its analysis. Prepare the presentation in Microsoft PowerPoint, with about 10 slides. Dedicate the first few slides to the background, the data, and the problem you are solving, and the rest to the method you use, the results, etc.

Project 3 – Part I:

1.  Download the microarray gene expression data for Leukemia class classification

a.  Read the information on the web site:

http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43

b.  Read the file descriptions:

http://www.broad.mit.edu/mpr/publications/projects/Leukemia/Files_descriptions.txt

c.  Download the expression data and class files:

·  Training dataset (ALL_vs_AML_train_set_38_sorted.res)

·  Testing dataset (Leuk_ALL_AML.test.res)

·  Training dataset class vector (ALL_vs_AML_train_set_38_sorted.cls)

·  Test dataset class vector (Leuk_ALL_AML.test.cls)

Note: The .cls files give the subtype of each sample: 0 for ALL and 1 for AML.

2.  Write a Perl program to preprocess the data

·  Eliminate the “endogenous control” genes (housekeeping genes);

·  Eliminate the genes whose calls are all A (absent) across the experiments;

·  Replace every expression value below a threshold cut-off with that threshold value (use 20 as the cut-off);

·  Eliminate the genes with less than a two-fold change across the experiments (max/min < 2);

·  Save the Affymetrix IDs of the genes and their expression values into a file. The file should look like the following:

               Exp1 (ALL)   Exp2 (ALL)   Exp3 (ALL)   Exp4 (AML)   Exp5 (AML)   …
AF007111_at    330          449          122          144          124          …
AF007551_at    141          87           244          465          398          …
AF007875_at    260          361          389          283          345          …
AF008445_at    124          20           24           20           56           …
AF008937_at    339          403          440          317          678          …
AF009301_at    288          142          214          77           190          …
AF009368_at    1032         1079         658          1424         1786         …
AF009426_at    36           38           120          16           97           …
AF010193_at    87           20           159          119          299          …
AF014958_at    318          108          40           329          544          …

3.  Preprocess the training dataset using your Perl program

(The preprocessed data and the testing dataset will be used in part II of the project.)
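
The four filtering rules in step 2 can be sketched as follows. This is only an illustration of the logic, written in Python; it is not the required Perl program. It assumes that each gene has already been parsed into a hypothetical (gene_id, values, calls) triple, that the control genes are the probes whose IDs start with "AFFX", and that the cut-off is applied before the fold-change test. Check all three assumptions against the file descriptions before relying on them.

```python
THRESHOLD = 20   # cut-off value given in the project description
MIN_FOLD = 2.0   # genes with max/min below this ratio are dropped


def preprocess(rows, threshold=THRESHOLD, min_fold=MIN_FOLD):
    """Filter genes; rows is an iterable of (gene_id, values, calls).

    values: expression values, one per experiment
    calls:  call letters, one per experiment (for example "P", "M", "A")
    """
    kept = []
    for gene_id, values, calls in rows:
        # Assumption: control probes carry the "AFFX" prefix.
        if gene_id.startswith("AFFX"):
            continue
        # Drop genes that are called absent in every experiment.
        if all(call == "A" for call in calls):
            continue
        # Raise every value below the cut-off up to the cut-off.
        floored = [max(value, threshold) for value in values]
        # Drop genes that change less than two-fold across experiments.
        if max(floored) / min(floored) < min_fold:
            continue
        kept.append((gene_id, floored))
    return kept
```

Because the cut-off is positive, min(floored) can never be zero, so the ratio is always defined.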

Project 3 – Part II:

4.  Feature selection using the training data set (to be done by the biologists in each group)

a.  Compute a t-test p-value (ALL samples vs. AML samples) for each gene in your preprocessed training dataset, then sort the genes by p-value

You may use an Excel spreadsheet to do this.

·  Syntax: TTEST(array1,array2,tails,type)

·  Use 2 for tails (two tailed distribution) and 3 for type (assuming unequal variances)

·  Search for "t-test" in Excel Help if you need more detail

b.  Select the top 50 genes (those with the smallest p-values) and save their expression data. The number of top genes is a parameter and can be varied; if you prefer, plot the p-values first and then decide how many features to keep.

Ex:

               Exp1 (ALL)    Exp2 (ALL)    Exp… (ALL)    Exp15 (AML)   Exp16 (AML)   Exp… (AML)    p-val (ALL vs AML)
AF008445_at    1124          1068          …             24            20            …             0.001
AF008937_at    339           403           …             440           317           …             0.65
AF009301_at    288           242           …             2113          1977          …             0.0002
AF009368_at    1032          1079          …             58            144           …             0.05
AF009426_at    36            38            …             120           126           …             0.005
AF010193_at    57            20            …             29            139           …             0.87
…              …             …             …             …             …             …             …
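
For groups that want to see what the unequal-variance option (type 3) computes, the following Python sketch shows the Welch t statistic and its approximate degrees of freedom for one gene, plus a small helper that picks the genes with the smallest p-values. It is an illustration under assumptions, not a replacement for Excel: the function names are hypothetical, and the p-values passed to select_top are assumed to come from TTEST or a statistics package, since the sketch does not evaluate the t-distribution itself.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for two samples with possibly unequal variances.

    a, b: expression values of one gene in the two classes.
    Raises ZeroDivisionError if both samples have zero variance.
    """
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def select_top(p_values, n=50):
    """Return the n gene IDs with the smallest p-values.

    p_values: dict mapping gene ID to p-value.
    """
    return sorted(p_values, key=p_values.get)[:n]
```

The two-tailed p-value follows from t and df through the t-distribution, which is the part TTEST handles for you.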

5.  Write a Perl program to process the testing data set

a.  Obtain the gene expression values of the selected top genes;

b.  Replace every expression value below a threshold cut-off with that threshold value (use 20 as the cut-off);

c.  Save the Affymetrix IDs of the genes and their expression values into a file. The file should look like the following:

               Test_Exp1   Test_Exp2   Test_Exp3   Test_Exp4   …
AF008445_at    124         20          24          29          …
AF008937_at    339         403         440         317         …
AF009301_at    288         142         214         77          …
AF009368_at    1032        1079        658         1424        …
AF009426_at    36          38          120         16          …
AF010193_at    87          20          159         119         …
…              …           …           …           …           …
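
Step 5 amounts to a lookup followed by the same cut-off used on the training data. A minimal Python sketch of that logic is shown here for illustration only (the deliverable is still a Perl program). It assumes the testing data has already been parsed into hypothetical (gene_id, values) pairs and that selected_ids lists the top genes in the order chosen in step 4.

```python
def extract_selected(test_rows, selected_ids, threshold=20):
    """Return [(gene_id, values)] for the selected genes, in selection order.

    test_rows:    iterable of (gene_id, values) pairs from the testing dataset
    selected_ids: gene IDs chosen during feature selection
    threshold:    values below this cut-off are raised to the cut-off
    """
    by_id = {gene_id: values for gene_id, values in test_rows}
    result = []
    for gene_id in selected_ids:
        if gene_id not in by_id:
            # Assumption: a selected gene missing from the test file is skipped.
            continue
        result.append((gene_id, [max(v, threshold) for v in by_id[gene_id]]))
    return result
```

Keeping the genes in the same order as the training file makes it easier to line up the two datasets when computing distances.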

Project 3 – Part III:

6.  Write a Perl program to implement the kNN classifier. Each group may choose its own distance measure and its own values for the kNN parameters. For example:

a.  Use Euclidean distance to calculate the distance between samples and pick 3 for k

b.  Use Euclidean distance to calculate the distance between samples and pick 5 for k

c.  Use Manhattan distance to calculate the distance between samples and pick 3 for k

d.  Calculate distances based on one individual gene, pick 3 for k and then use jury decision (majority rule) for the classification

e.  Calculate distances based on one individual gene, pick 5 for k and then use jury decision (majority rule) for the classification

7.  Use your classifier to classify the testing samples

8.  Report your results. Your report should include:

a.  Description of the problem and data

b.  Description of the algorithm you implemented

c.  Description of the accuracy of your classifier on the data you used

9.  Project presentation (10 min), Q&A (2 min)

Note: This is a research project. There are many parameter values that you can vary to see their impact on the final results. You may also have better ideas for the classification problem, and you are welcome to implement your own ideas instead of the ones specified above.
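
As a starting point for experimenting with those parameters, here is a minimal Python sketch of the kNN variants listed above: whole-sample kNN with a choice of distance and k, the per-gene jury vote, and an accuracy measure. It only illustrates the logic and is not the required Perl program. All function names are hypothetical, and each sample is assumed to be a list of expression values over the selected genes (the columns of your data files, not the rows), with labels coded 0 for ALL and 1 for AML.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, labels, sample, k=3, dist=euclidean):
    """Classify one sample by majority vote among its k nearest neighbours.

    train:  list of training samples (each a list of expression values)
    labels: class label of each training sample, in the same order
    """
    order = sorted(range(len(train)), key=lambda i: dist(train[i], sample))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def jury_predict(train, labels, sample, k=3):
    """Run kNN on each gene separately, then take the majority of the votes.

    With an even number of genes the jury can tie; in that case the class
    that received its first vote earliest is returned.
    """
    votes = Counter()
    for g in range(len(sample)):
        column = [[row[g]] for row in train]   # one-gene view of the data
        votes[knn_predict(column, labels, [sample[g]], k, manhattan)] += 1
    return votes.most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

With two classes, an odd k such as 3 or 5 avoids ties in the neighbour vote. On a single gene the Euclidean and Manhattan distances coincide, so the choice of distance inside jury_predict does not affect its result.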

Submission instructions

When you are ready to submit, print a copy of your report and your Perl program. Remember to sign and attach the required Academic Integrity Pledge cover sheet. The printed report and program must be turned in by the start of class on the date the program is due. You also need to email a zipped copy of the program to me. Use the subject "Bioinformatics Project 3" and make sure that your name appears on the subject line and inside the message. The attached zip file must be named with your last name and the program number (example: smith_3.zip). The zip file must contain a folder with the same name (smith_3), and all program files and related files should be inside this folder. Be sure to email your working solution before the due date!