Due: in class, Tuesday, October 7, 2003

CIS526: Homework 4

Assigned: October 1, 2003

Homework Policy

All assignments are INDIVIDUAL! You may discuss the problems with your colleagues, but you must solve the homework by yourself. Please acknowledge all sources you use in the homework (papers, code or ideas from someone else).

Assignments should be submitted in class on the day when they are due. No credit is given for assignments submitted at a later time, unless you have a medical problem.

Problems

In this homework you are expected to perform a number of experiments in Matlab on two datasets: the Boston Housing dataset and the Pima Indians dataset. You can download the datasets (housing.txt and pima.txt) and their descriptions (housing_desc.txt and pima_desc.txt) from

At the same location you are provided with Matlab .m files (divideset, hw4_regression, hw4_classification, neural_simple, neural_simple_class, normalize) that should be very useful for solving the given problems. For people without the Neural Network Toolbox, there are several decent, freely available toolboxes with neural networks, such as Netlab and PRTools.

Problems 1-6 are related to regression using the Boston Housing dataset.

1. (5 points) Initial analysis of the housing_desc.txt.

Read the description of the data set in housing_desc.txt. What relationships do you expect between the attributes CRIM, INDUS, DIS and the housing prices (target attribute MEDV)? Report your conclusions in 1-2 sentences for each of the relationships: CRIM-MEDV, INDUS-MEDV, DIS-MEDV.

2. (10 points) Exploratory analysis of the housing.txt.

Examine the dataset housing.txt using Matlab. How many binary attributes are there in the data set? Calculate and report the correlations between the first 13 attributes (columns) and the target attribute (column 14). Which attributes have the highest positive and the highest negative correlation with the target attribute? Calculate all correlations between the 14 columns (using the corrcoef function). Which two attributes have the largest mutual correlation in the dataset? Note that correlation is a linear measure of similarity. Examine scatter plots (using the plotmatrix function) between the 13 attributes and the target attribute. Which scatter plot looks the most linear, and which looks the most nonlinear? Plot these scatter plots and briefly (in 1-2 sentences) explain your choice.
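The correlation steps described above can be sketched as follows. This is only an illustration on synthetic data: the matrix, its size, and all variable names are assumptions, and rng requires a reasonably recent Matlab release. Replace the synthetic matrix with the loaded housing data to apply it to the problem.

```matlab
% Synthetic stand-in for the data: 100 rows, 3 attributes + 1 target.
% All names and values here are illustrative assumptions.
rng(0);                                   % reproducible random data
X = randn(100, 3);                        % hypothetical attributes
y = 2*X(:,1) - X(:,2).^2 + 0.1*randn(100, 1);   % hypothetical target
data = [X y];

R = corrcoef(data);                       % full correlation matrix (columns vs. columns)
r_target = R(1:end-1, end);               % correlation of each attribute with the target

[~, i_pos] = max(r_target);               % attribute with the largest correlation
[~, i_neg] = min(r_target);               % attribute with the smallest (most negative) correlation
fprintf('Largest: column %d, smallest: column %d\n', i_pos, i_neg);

% Largest mutual correlation: zero out the diagonal, then search |R|.
A = abs(R) - eye(size(R));                % diagonal of R is 1, so it becomes 0
[~, k] = max(A(:));
[row, col] = ind2sub(size(A), k);
fprintf('Strongest pair: columns %d and %d (r = %.3f)\n', row, col, R(row, col));

plotmatrix(X, y);                         % scatter plots of each attribute against the target
```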

3. (10 points) Learning a neural network to predict housing prices

hw4_regression.m is a Matlab file you can use to load the data, normalize it, prepare and train a neural network, and report its accuracy on a separate test set. Note that from neural_simple.m (line 10) you can see that a neural network with one hidden layer is used, where the logsig nonlinearity is used for the hidden neurons and the output neuron is linear. “trainlm” denotes Levenberg-Marquardt backpropagation, a very fast algorithm that uses approximate second-derivative information. (Note: people not using the Matlab Neural Network Toolbox can choose any available training algorithm.)
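To make the architecture concrete, here is a minimal sketch of the forward pass of such a network, written without the toolbox. The layer sizes, the random weights, and the helper name logsig_fn are assumptions for illustration; this is not the code in neural_simple.m.

```matlab
% Forward pass of a one-hidden-layer network: logsig hidden units, linear output.
logsig_fn = @(z) 1 ./ (1 + exp(-z));      % logistic sigmoid, written out by hand

n_in = 13;                                % number of inputs (illustrative)
n_hidden = 5;                             % number of hidden neurons (illustrative)
W1 = 0.1*randn(n_hidden, n_in);           % input-to-hidden weights (random, untrained)
b1 = zeros(n_hidden, 1);                  % hidden biases
W2 = 0.1*randn(1, n_hidden);              % hidden-to-output weights
b2 = 0;                                   % output bias

x = randn(n_in, 1);                       % one (normalized) input example
h = logsig_fn(W1*x + b1);                 % hidden activations, each in (0, 1)
y_hat = W2*h + b2;                        % linear output neuron: unbounded prediction
```

Because the output neuron is linear, the prediction is not squashed into (0, 1), which is what a regression target such as a price requires.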

Run the hw4_regression.m once and report the following:

a) What are the mean and standard deviation of each column in data after line 46?

b) How many data points were assigned to tr, val, and test after line 52?

c) How many hidden neurons were assigned to the neural network? (Although this is a very important design parameter, we will not change this number in the following problems.)

d) What is the performance of the resulting neural network: mean squared error (mse) and R^2 (R_square)?

e) Locate and report the data example in the test set where the neural network has the highest absolute error. How large is the error in $1,000s?
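The quantities in d) and e) can be computed along the following lines. The numbers below are made up, the variable names are hypothetical, and R_square is computed with the common definition 1 - SSE/SST, which may differ from the one used in the provided script.

```matlab
% Hypothetical targets and predictions (in $1,000s); not real results.
y_true = [24.0; 21.6; 34.7; 33.4; 36.2];
y_pred = [25.1; 20.9; 31.2; 33.0; 30.0];

err = y_true - y_pred;                    % per-example prediction error
mse_val = mean(err.^2);                   % mean squared error
sse = sum(err.^2);                        % sum of squared errors
sst = sum((y_true - mean(y_true)).^2);    % total sum of squares around the mean
R_square = 1 - sse/sst;                   % assumed definition of R^2

[max_err, idx] = max(abs(err));           % example with the largest absolute error
fprintf('MSE = %.3f, R^2 = %.3f\n', mse_val, R_square);
fprintf('Largest |error| = %.2f (in $1,000s) at example %d\n', max_err, idx);
```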

NOTE: partitioning into tr:val:test is a standard procedure: tr is used for neural network training, val is used for early stopping of training, test is used to check the accuracy of trained neural network on unseen data.
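A random three-way partition of this kind can be sketched as below. The number of examples and the split fractions are illustrative assumptions; the provided divideset.m may use different proportions and a different interface.

```matlab
% Random partition of example indices into tr : val : test.
n = 100;                                  % number of examples (illustrative)
frac = [0.5 0.25 0.25];                   % assumed split fractions

idx = randperm(n);                        % random permutation of 1..n
n_tr  = round(frac(1) * n);
n_val = round(frac(2) * n);

tr_idx   = idx(1 : n_tr);                 % used to fit the weights
val_idx  = idx(n_tr + 1 : n_tr + n_val);  % used only to decide when to stop training
test_idx = idx(n_tr + n_val + 1 : end);   % never seen during training

% Sanity check: the three subsets are disjoint and cover every example.
assert(isequal(sort([tr_idx val_idx test_idx]), 1:n));
```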

3. (5 points) Accuracy on training and validation data

Calculate the accuracy of the trained neural network (from problem 3) on the training and validation subsets (tr and val).

4. (10 points) Learning 30 neural networks on different partitions into tr, val, test from problem 3.

Note that divideset.m randomly divides a given set into subsets. Repeat the whole experiment in hw4_regression.m 30 times with different random partitions of the data into tr, val, and test sets. What are the average MSE and R_squared and their standard deviations over the 30 experiments? What can you conclude from these results?
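The bookkeeping for the repeated experiment might look like the sketch below. Here run_once is a hypothetical stand-in for one complete run (random partition, training, testing); in this sketch it only returns random numbers so that the code runs on its own.

```matlab
% Collect MSE and R^2 over repeated runs and summarize them.
% run_once is a placeholder that returns [mse, R_square] for one run.
run_once = @() [20 + 5*randn, 0.8 + 0.05*randn];   % fake results, for illustration only

n_runs = 30;
results = zeros(n_runs, 2);               % column 1: MSE, column 2: R^2
for k = 1:n_runs
    results(k, :) = run_once();           % each call would use a new random partition
end

avg = mean(results);                      % column-wise averages
sd  = std(results);                       % column-wise standard deviations
fprintf('MSE: %.3f +/- %.3f\n', avg(1), sd(1));
fprintf('R^2: %.3f +/- %.3f\n', avg(2), sd(2));
```

A large standard deviation relative to the mean indicates that a single run is a poor estimate of the accuracy, which is the motivation for repeating the experiment.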

5. (10 points) Effect of normalization.

Briefly describe the purpose of the normalize.m function (used in line 46). To better understand its benefits, remove the call to it and see how the neural network performs. Repeat the previous problem, but do not use normalization prior to training the neural networks. What are the average MSE and R_squared and their standard deviations over the 30 experiments? What can you conclude from these results?
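As a point of reference, one common form of normalization is the z-score, which rescales each column to zero mean and unit standard deviation. The sketch below uses toy data and is an assumption about the general technique; the provided normalize.m may do something different.

```matlab
% Z-score normalization of each column (toy data with very different scales).
X = [1 200; 2 400; 3 600; 4 800];

mu    = mean(X);                          % row vector of column means
sigma = std(X);                           % row vector of column standard deviations
n     = size(X, 1);

Xn = (X - repmat(mu, n, 1)) ./ repmat(sigma, n, 1);

disp(mean(Xn));                           % each column mean is now (numerically) zero
disp(std(Xn));                            % each column standard deviation is now one
```

Without such rescaling, attributes with large numeric ranges dominate the weighted sums at the hidden neurons and can push the logsig units into their flat regions.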

6. (10 points) Effect of training algorithm.

For people using the Neural Network Toolbox:

Let us try to change the training algorithm from “trainlm” to some other type. Examine how to change neural_simple.m to use the training algorithms “trainrp” (resilient backpropagation, a robust method), “traingd” (standard gradient descent), “traingdm” (gradient descent with momentum), and “trainbr” (backpropagation with regularization of weights). For each of these training algorithms, repeat problem 4 (train 30 neural networks on different partitions into tr:val:test).

For people not using the Neural Network Toolbox: Depending on the software you use, examine how changes in some parameters of the training algorithm influence accuracy and training speed.

Problems 7-10 are related to classification using the Pima Indians dataset.

7. (10 points) Briefly explain the properties of the data set “pima.txt” (how many examples, what is the meaning of the attributes and the target, correlations between attributes, presence of missing data…).

8. (10 points) Examine the hw4_classification.m file. hw4_classification.m is a Matlab file you can use to load the data, normalize it, prepare and train a neural network for classification, and report its accuracy on a separate test set. Note that from neural_simple_class.m you can see that a neural network with one hidden layer is used, where the logsig nonlinearity is used for both the hidden and the output neurons. Note that the neural network for classification has 2 outputs. Modify the hw4_classification.m file to perform 5-fold cross-validation (5-CV) of a given neural network. Report the obtained accuracy and provide a printout of the modified program.

9. (10 points) Repeat the 5-CV procedure using neural networks with 2, 10, and 50 hidden nodes. Report the accuracy on the training and test sets. Compare the accuracies, and discuss the best choice with respect to both the accuracy and the time needed for training.
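The fold bookkeeping of 5-CV can be sketched as follows. The labels are synthetic and the "model" is a placeholder that predicts the majority class of the training folds; in the homework, the neural network from hw4_classification.m would take its place. All names are illustrative assumptions.

```matlab
% 5-fold cross-validation skeleton on synthetic 0/1 labels.
n = 100;                                  % number of examples (illustrative)
k = 5;                                    % number of folds
labels = double(rand(n, 1) > 0.65);       % synthetic targets

idx  = randperm(n);                       % shuffle the examples once
fold = mod(0:n-1, k) + 1;                 % fold number for each shuffled position

acc = zeros(k, 1);
for f = 1:k
    test_idx  = idx(fold == f);           % one fold is held out for testing
    train_idx = idx(fold ~= f);           % the other k-1 folds are used for training

    % Placeholder model: predict the majority class of the training folds.
    majority = double(mean(labels(train_idx)) >= 0.5);
    pred = repmat(majority, numel(test_idx), 1);

    acc(f) = mean(pred == labels(test_idx));   % accuracy on the held-out fold
end
fprintf('%d-CV accuracy: %.3f +/- %.3f\n', k, mean(acc), std(acc));
```

Every example is used for testing exactly once, so the averaged accuracy uses all of the data while keeping training and test examples separate within each fold.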

10. (10 points) Note that several attributes in pima.txt are likely to have missing values. Detect examples and attributes with possible missing values. Propose a solution for dealing with missing values (for example, exclude examples with missing data from training, replace missing values with some reasonable guess, etc.). Apply this solution, and examine the accuracy of the improved neural network.
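The two strategies mentioned above can be sketched on toy data as follows. The sketch assumes that a missing value is encoded as 0 in columns where 0 is not a plausible measurement; both that encoding and the choice of suspect columns are assumptions to be checked against pima_desc.txt.

```matlab
% Toy data: rows are examples, columns are attributes.
X = [6 148 72; 1 0 66; 8 183 0; 1 89 66; 0 137 40];
suspect_cols = [2 3];                     % hypothetical columns where 0 is implausible

% Mark zeros in the suspect columns as missing.
missing = false(size(X));
missing(:, suspect_cols) = (X(:, suspect_cols) == 0);
fprintf('Missing values per column: %s\n', mat2str(sum(missing)));

% Option A: exclude every example that has at least one missing value.
X_drop = X(~any(missing, 2), :);

% Option B: replace each missing value with the mean of the observed values
% in the same column.
X_fill = X;
for j = suspect_cols
    observed = ~missing(:, j);
    X_fill(missing(:, j), j) = mean(X(observed, j));
end
```

Option A keeps only clean data but shrinks the training set; option B keeps every example at the cost of introducing guessed values.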

Deliverables: Bring a hard copy of your homework with representative parts of the code. Also, zip all the code you write and send it by e-mail with the subject: “hw4 code”

Good Luck!!