CIS595: Homework 5

Assigned: April 05, 2005

Due: in class, Monday, April 11, 2005

Homework Policy

  • All assignments are INDIVIDUAL! You may discuss the problems with your colleagues, but you must solve the homework by yourself. Please acknowledge all sources you use in the homework (papers, code or ideas from someone else).
  • Assignments should be submitted in class on the day when they are due. No credit is given for assignments submitted at a later time, unless you have a medical problem.

Problems

(40 points) 1. Write two 1-page reports describing the motivation, methodology, and experimental results of the following two papers (you can find them easily by pasting their titles into a Google search):

a) Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292(2):195-202.

b) Peng, K., Vucetic, S., Radivojac, P., Brown, C.J., Dunker, A.K., Obradovic, Z. (2005) Optimizing long intrinsic disorder predictors with protein evolutionary information. Journal of Bioinformatics and Computational Biology, 3(1):xx-xx.

Download all the provided files into a separate folder. To solve this problem you will have to be able to use and modify the file example_disorder.m.

(20 points) 2. Study the program example_disorder.m. What is the purpose of the program? Going through the program line by line, explain the whole process of constructing the predictor and applying it to the unlabeled.txt sequences, starting from the order.txt and disorder.txt sequences in FASTA format. The Peng et al. paper you read should be helpful in understanding the program.

(10 points) 3. a) Repeat the training and testing of a neural network for disorder prediction 10 times (simply run the file example_disorder.m 10 times) and report the testing accuracies obtained in each of the 10 experiments.
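
For reference, a minimal Matlab sketch of such a loop is given below. It assumes example_disorder.m leaves its testing accuracy in a workspace variable (test_acc is a hypothetical name here; check the script for the actual variable, and comment out any clear statement at the top of the script so the loop's bookkeeping is not erased):

    accs = zeros(1, 10);
    for i = 1:10
        example_disorder;        % one full training + testing run of the script
        accs(i) = test_acc;      % hypothetical name of the testing-accuracy variable
    end
    fprintf('accuracies: '); fprintf('%.3f ', accs); fprintf('\n');
    fprintf('mean = %.3f, std = %.3f\n', mean(accs), std(accs));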

(5 points) b) Repeat problem 3.a), but this time use a neural network with 15 hidden nodes (change the number of hidden nodes in line 79 of example_disorder.m). Report the results. Discuss the differences and similarities with the accuracies obtained in 3.a). Discuss the difference in the time needed to learn one neural network in 3.a) and 3.b).
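
For orientation only (the exact call on line 79 may differ), constructing a feed-forward network with 15 hidden nodes in the Matlab Neural Network Toolbox typically looks like:

    % Illustrative sketch: Xtrain is assumed to hold one training example per row;
    % [15 1] requests 15 hidden nodes and 1 output node.
    net = newff(minmax(Xtrain'), [15 1], {'tansig', 'logsig'});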

(5 points) c) Remove line 69 from example_disorder.m, so that the normalization step is not performed. Repeat the experiment described in 3.a) using the modified file. Discuss the differences and similarities with the accuracies obtained in 3.a).
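
For context, a normalization step of this kind typically standardizes each attribute to zero mean and unit standard deviation using statistics computed on the training set; the sketch below shows the general idea and is not necessarily the exact code on line 69. Removing the step means the network is trained on attributes in their original, unnormalized scales.

    % Assumes examples are rows and attributes are columns of Xtrain/Xtest.
    mu = mean(Xtrain);
    sd = std(Xtrain);
    Xtrain = (Xtrain - repmat(mu, size(Xtrain, 1), 1)) ./ repmat(sd, size(Xtrain, 1), 1);
    Xtest  = (Xtest  - repmat(mu, size(Xtest, 1), 1))  ./ repmat(sd, size(Xtest, 1), 1);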

(20 points) 4. Apply a neural network obtained using example_disorder.m to the myco.txt sequences, which correspond to the Mycoplasma bacterial genome, the smallest known bacterial genome with about 500 genes. For each myco.aa sequence calculate the number and percentage of amino acids with predicted disorder. Summarize your results (for example: out of xx myco.aa sequences, xx% have more than xx% predicted disordered amino acids, while …).
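
A hedged Matlab sketch of the per-sequence summary is given below. It assumes net is the network trained by example_disorder.m, that the input attributes for each myco sequence were built the same way example_disorder.m builds them (stored here in a hypothetical cell array myco_feats, one attribute matrix per sequence, attributes in rows and residues in columns), and that outputs above 0.5 are treated as disordered:

    nSeq = numel(myco_feats);
    numDisorder = zeros(nSeq, 1);
    pctDisorder = zeros(nSeq, 1);
    for k = 1:nSeq
        out = sim(net, myco_feats{k});           % network output for every residue
        pred = out > 0.5;                        % assumed decision threshold
        numDisorder(k) = sum(pred);
        pctDisorder(k) = 100 * numDisorder(k) / numel(pred);
    end
    % Example summary statistic: sequences with more than 30% predicted disorder
    fprintf('%d of %d sequences have >30%% predicted disorder\n', sum(pctDisorder > 30), nSeq);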

Download and install WEKA 3 (Data Mining Software in Java). Learn how to use the “Explorer GUI” – it is very user friendly and should not take long. Start from the data set you obtained in line 41 of example_disorder.m.

(10 points) 5. Reformat the data to the WEKA format (save the data in ASCII format and make a few minor changes to the resulting file to adjust it to WEKA's ARFF format).
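
If you prefer to do the conversion directly from Matlab, a minimal sketch is given below. It assumes the line-41 data set is available as a matrix X with one example per row and a 0/1 label vector y (hypothetical names; adjust them, and the attribute orientation, to what example_disorder.m actually uses):

    fid = fopen('disorder.arff', 'w');
    fprintf(fid, '@relation disorder\n');
    for j = 1:size(X, 2)
        fprintf(fid, '@attribute attr%d numeric\n', j);    % one numeric attribute per column
    end
    fprintf(fid, '@attribute class {0,1}\n');
    fprintf(fid, '@data\n');
    for i = 1:size(X, 1)
        fprintf(fid, '%g,', X(i, :));                      % attribute values, comma-separated
        fprintf(fid, '%d\n', y(i));                        % class label ends the line
    end
    fclose(fid);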

(20 points) 6. Run 5-fold cross-validation classification experiments using the following algorithms (you can leave the default parameters of each algorithm):

ZeroR (trivial predictor)

J48 (decision tree)

NaiveBayes (naive Bayes classifier)

IBk (k-nearest neighbor)

MultilayerPerceptron (neural network)

SMO (support vector machine)

Bagging of 30 decision trees (meta learning algorithm)

Based on the J48 tree result, discuss which attributes are important for classification and which are not. Comment on whether this agrees with your intuition. Report the accuracy of each algorithm. Rank the algorithms by their speed. Compare the MultilayerPerceptron accuracy with the result you obtained using Matlab.

(20 points) 7. Try to improve the accuracy of each of the above algorithms by changing some of the default parameters. Explain your choices and present the results. Hopefully, you will be able to improve the accuracy of every algorithm other than ZeroR.

GOOD LUCK!!!