Using Clustering to Make Prediction Intervals

For Neural Networks

Claus Benjaminsen

ECE 539, 12/21/2005

Abstract

This project describes a way of estimating prediction intervals using clustering. The method is developed and implemented in Matlab and tested on a small 2D dataset. Its performance is compared with the standard method of making prediction intervals using several identical neural networks, and with a baseline of simply applying equal-sized intervals to all predictions. The results show that the clustering method can indeed be used to estimate prediction intervals, and that its performance can even exceed that of the standard method.

Table of Contents

Abstract

Table of Contents

Introduction

Motivation

Approach

Data set

How the models are set up

Implementation

Results

Comparing the 3 methods

Testing with different values of the bias and slag parameters

Discussion

Conclusion

References

Introduction

Neural networks are widely used nowadays to solve all kinds of problems related to classification and regression. A special group of problems has to do with prediction: being able to estimate the output of an unknown function given a set of new inputs. This could for instance be forecasting the total snowfall overnight using today's weather data, or predicting how many hats of a certain brand will be sold this Christmas, given the money spent on marketing and the average temperature in December. Both these examples are regression problems, in that the output is a numerical value, and they are typical problems that could be solved by neural networks. Taking the first example, a trained neural network might predict that the total snowfall will be 2 inches. This estimate might be right or wrong, close to the actual snowfall or far from it. Sometimes the neural network will be very accurate, and sometimes it won't, but for the person who wants to plan when to get up tomorrow in order to make it to work on time, the estimate alone might not be sufficient information. For him it could be more relevant to know what the maximum possible snowfall could be, so that he can take precautions and get up early if there is a chance that more than, for instance, 5 inches of snow could fall.

This kind of inference has to do with prediction intervals. Instead of predicting only an estimate of the actual snowfall, an interval covering the possible range of snowfall could be estimated. These intervals can be given in many forms, for instance by a max and min value and possibly a percentage of confidence, or by a mean value and a variance of a Gaussian distribution. Either way, the output is supplied with a lot of extra information and can be used to determine how precise the estimated output value is (the width of the prediction interval), and what the possible max and min values of the actual output are (the limits of the prediction interval).

The normal way of estimating prediction intervals using neural networks is to train multiple identical networks and use the variance of the predicted outputs to estimate a prediction interval for the given output. This method normally imposes some requirements on the noise added to the unknown function, and it can have problems in certain situations. Therefore this project will investigate the possibility of implementing interval prediction with the help of clustering.

The idea is that if the noise in the unknown function is input dependent, inputs close together might be subject to the same underlying noise function, and hence their outputs might be about equally hard to predict. This can be used to give the prediction intervals for similar inputs an equal size, and in that way give a good estimate of the possible range of output values.

Motivation

The motivation for this project is to test the possibility of using clustering in interval prediction. In many real-world problems prediction intervals can be a big help in making prediction-based decisions, so a good method of estimating prediction intervals could be used extensively. Also, in many situations the standard method of training several identical neural networks is infeasible, because the model easily becomes very large, and training several neural networks on a large dataset can take a long time. The proposed method only requires training one neural network and clustering the training data, which would normally be much faster than training many neural networks.

Approach

In this project the main focus has been on implementing and testing the performance of interval prediction using a neural network and clustering. To evaluate the performance, two other methods of making prediction intervals have been included. The first is the baseline, which consists of simply applying equally large prediction intervals to all predictions. The second is the standard way of implementing prediction intervals using neural networks. In this project it uses 10 identical neural networks, trained on the same data, to estimate the individual variance of each predicted output, and from these variances it estimates the corresponding prediction intervals.

Data set

For this project I have chosen to use some easily accessible data from a neural network competition[1] set up on the internet. The reason is twofold: First, it meant that I didn't have to spend a lot of time collecting data and transforming it into a format that can easily be processed. Secondly, the competition the data comes from was what gave me the inspiration for this project. It focuses exactly on making good prediction intervals using neural networks, and so the datasets from the competition are very relevant for testing the implementations in this project.

The competition has 4 different datasets:

  • Synthetic – a small synthetic dataset
  • Precip – a large dataset consisting of precipitation data
  • Temp – a large dataset with temperature data
  • SO2 – a medium-sized dataset of various measurements for predicting SO2 levels

I have only used the Synthetic dataset to develop the methods, as it is 2D (one input and one output) and therefore easy to work with and plot. It consists of 256 points divided into two files, one containing all the x-coordinates (synthetic_train.inputs) and one with the corresponding y-coordinates (synthetic_train.targets). A plot of this dataset is shown below in figure 1.

Figure 1. A Matlab plot of the Synthetic dataset used in this project.

How the models are set up

When training a neural network it is often desirable that the feature vectors (the input part) of the training/testing data are scaled into a certain range. This is both to give the same weight to different features and to make it easier for the neural network to adapt to the inputs. Also, since the output of a neural network is computed by a nonlinear function with a limited range of output values, the targets (the output values of the training/test set) need to be scaled into this range in order for the network to be able to make predictions. Therefore the Synthetic dataset is randomized and then scaled before it is divided into a training set and a testing set.
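
As an illustration, a minimal Matlab sketch of this preprocessing could look as follows (the ranges and the test set size of 100 are the ones used in the Results section; the variable names x and y are just placeholders):

    % Min-max scale inputs x to [-5, 5] and targets y to [0.2, 0.8]
    xs = -5 + 10 * (x - min(x)) ./ (max(x) - min(x));
    ys = 0.2 + 0.6 * (y - min(y)) ./ (max(y) - min(y));
    % Randomize the order and split off 100 samples for testing
    idx = randperm(length(xs));
    x_test  = xs(idx(1:100));    y_test  = ys(idx(1:100));
    x_train = xs(idx(101:end));  y_train = ys(idx(101:end));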

Using only the training set, 10 identical neural networks are trained using back-propagation with a small random perturbation of the weights, and the weights giving the lowest training error are stored. Then both the training feature vectors and the test feature vectors are fed to all 10 networks, and the mean and the variance of each predicted output are calculated from the 10 individual predictions. The mean values serve as the predicted point estimates of the targets. The training error is calculated as the sum of the distances between these predicted point estimates and the training targets. This value gives the minimum total sum of prediction intervals needed to cover all the training data for the found point predictions. Since all prediction intervals in this project are symmetric, the sum of the prediction intervals is calculated as the sum of the half-interval lengths.
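
The ensemble statistics described above could be computed along these lines. This is only a sketch: train_nn and nn_forward are hypothetical stand-ins for NNfunction (the modified bp.m) and a forward pass with the stored weights.

    K = 10;                                 % number of identical networks
    pred_tr = zeros(K, length(x_train));    % one row of predictions per network
    pred_te = zeros(K, length(x_test));
    for k = 1:K
        [w, pred_tr(k,:)] = train_nn(x_train, y_train);  % hypothetical helper
        pred_te(k,:) = nn_forward(w, x_test);            % hypothetical helper
    end
    y_hat_tr = mean(pred_tr, 1);            % point estimates (training)
    y_hat_te = mean(pred_te, 1);            % point estimates (test)
    var_tr   = var(pred_tr, 0, 1);          % per-sample prediction variances
    var_te   = var(pred_te, 0, 1);
    err      = abs(y_hat_tr - y_train(:)'); % half-interval needed per sample
    E_train  = sum(err);                    % total training error (interval budget)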

Given the training error plus an optional slag parameter, the baseline prediction intervals are formed as equal-sized intervals with a total sum equal to the training error. The standard method (from now on referred to as the variance method) scales the variance of each prediction and adds a possible bias term, to make the sum of the variances (prediction intervals) equal to the training error.
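
In the notation of the previous sketch, these two sets of training intervals could be formed as below. The handling of the bias term is one plausible reading of the description given later under test_bias.m: a fraction of the total budget is handed out equally, and the rest by scaling.

    e_slag = 0;  bias = 0;                 % parameter values used in the experiments
    N       = length(err);
    E_total = E_train + e_slag;            % total interval budget (slag optional)
    % Baseline: the same half-interval size for every sample
    int_base = (E_total / N) * ones(1, N);
    % Variance method: equal bias share plus scaled per-sample variances,
    % so that the intervals sum to the total budget
    share   = bias * E_total / N;
    sc_var  = (E_total - N*share) / sum(var_tr);
    int_var = share + sc_var * var_tr;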

Finally the clustering method is applied, in which a k-means clustering algorithm is run on the input feature vectors to find a number of cluster centers. The membership of each training feature vector to these cluster centers is then found, and the mean of the training errors for all the feature vectors in a given cluster is assigned to that cluster center. These mean errors are then scaled, and possibly added to a bias term, to have a total sum equal to the total training error, and thereby they can serve as prediction intervals for the training data.
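
A sketch of the clustering step, continuing the code above and using Matlab's built-in kmeans (Statistics Toolbox) as a stand-in for the project's cluster.m and kmeantest.m:

    n_centers = 52;                        % number of cluster centers
    [memb, centers] = kmeans(x_train(:), n_centers);  % stand-in for cluster.m
    mean_err = zeros(n_centers, 1);
    for c = 1:n_centers
        mean_err(c) = mean(err(memb == c));  % mean training error per cluster
    end
    raw      = mean_err(memb)';            % one mean error per training sample
    sc_clus  = (E_total - N*share) / sum(raw);  % same budget as above
    int_clus = share + sc_clus * raw;      % training prediction intervals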

When calculating the prediction intervals for the test data, the baseline method just assigns the size of the training prediction intervals to all the test prediction intervals. The variance method scales the variances of the test predictions by the same amount as the training set variances were scaled. Finally, the clustering method first determines the cluster membership of the given test feature vector and then assigns it the mean error associated with the found cluster center. This mean error is then scaled by the same factor as during training, and the resulting value defines the test prediction interval.
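
Continuing the sketch, the test intervals could then be assigned as:

    % Baseline: same size as in training
    int_base_te = (E_total / N) * ones(1, length(x_test));
    % Variance method: reuse the training scale factor on the test variances
    int_var_te  = share + sc_var * var_te;
    % Clustering method: the nearest center decides the (scaled) mean error
    [~, memb_te] = min(abs(x_test(:) - centers(:)'), [], 2);  % 1-D distances
    int_clus_te  = share + sc_clus * mean_err(memb_te)';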

The performance measures used in this project are the number of targets that fall within the prediction intervals, and the cost function, which is the mean squared distance from the edge of the prediction interval to the corresponding target, taken over all targets that fall outside their prediction intervals.
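
In code, these two measures could be computed as follows (shown for the clustering intervals on the test set; the cost is read here as the mean over the targets that fall outside):

    d      = abs(y_hat_te - y_test(:)');   % distance from target to point estimate
    c_test = sum(d <= int_clus_te);        % targets inside their intervals
    over   = d - int_clus_te;              % overshoot beyond the interval edge
    cost_test = mean(over(over > 0).^2);   % mean squared overshoot (outside only)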

Implementation

The implementation of the methods described above is done in Matlab and makes extensive use of the Matlab programs developed by professor Yu Hen Hu[2], or modifications of these programs. The main file is IntervalNN, which sets up all the parameters for training the 10 neural networks, scales the data and divides it into a training set and a test set. It then calls NNfunction, which is a modified version of bp.m; it takes care of training a neural network on the given training data and returns the found weights and the training data predictions. Given the weights, the test set predictions are calculated, and this procedure is repeated 10 times. Then the means and variances of the training and test predictions are determined, along with the training and test errors. Finally all the values are saved in IntNNvar.mat and the program c_test is called.

In c_test the prediction intervals for the baseline method and the variance method are calculated, and the performance of these intervals on both the training and test data is found. Then the cluster centers of the training feature vectors are found using the function cluster.m, which is a modified version of clusterdemo.m by professor Yu Hen Hu, and the membership of each of the training features is determined using kmeantest.m, also developed by professor Yu Hen Hu. Then the prediction intervals are formed and the training performance evaluated, after which the same is done for the test set. The whole scheme is run multiple times for different numbers of cluster centers, so the performance can be compared.

Finally disp_performance.m can be run. It uses the prediction intervals given by the number of cluster centers that gives the best performance in terms of the minimum cost function for the test data. These prediction intervals are plotted along with the training data and training predictions, and the same is done for the other two methods. Similar plots are made for the test data.

Another small program, test_bias.m, can also be run. This program tests the performance of each of the three methods when the bias term in the prediction intervals and the training error slag parameter are changed. The bias term alters how much of the total available prediction interval is divided as an equal amount among all the individual prediction intervals (the rest is divided by scaling the prediction intervals). The slag parameter changes the total sum of available prediction interval to be larger or smaller than the total training error. For each new combination of the slag parameter and the bias value, c_test2.m is called and the performance results are recorded. c_test2.m is a very slightly modified version of c_test.m, made to be called from test_bias.m.

Results

I have run the programs described above multiple times with different parameters, and here I will present some of the results. The training and test feature vectors were scaled to the range from -5 to 5, and the targets to the range from 0.2 to 0.8. Out of the total 256 samples in the Synthetic dataset, 100 samples were randomly picked out for testing. The 10 neural networks were set up with 3 layers (1 input, 1 hidden and 1 output layer) with 5 neurons in the hidden layer. Sigmoidal activation functions were used for all the hidden neurons, and a tangent hyperbolic function for the output neuron. The step size alpha was set to 0.1, the momentum term to 0.8 and the epoch size to 20. The stopping criterion was no improvement in training error for 200 consecutive iterations.
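
Collected in one place, a hypothetical parameter block for this setup might look like (the struct and field names are illustrative, not the actual IntervalNN variables):

    % Experimental settings for each of the 10 networks (hypothetical struct)
    cfg.layers   = [1 5 1];    % input, hidden and output neurons
    cfg.act_hid  = 'sigmoid';  % hidden layer activation
    cfg.act_out  = 'tanh';     % output neuron activation
    cfg.alpha    = 0.1;        % back-propagation step size
    cfg.momentum = 0.8;        % momentum term
    cfg.epoch    = 20;         % epoch size
    cfg.maxstall = 200;        % stop after 200 iterations without improvement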

Comparing the 3 methods

After obtaining the estimated point predictions, I evaluated the prediction interval performance using n = 4, 8, …, 100 cluster centers. The best setting, in terms of the minimum cost function for the test data, was picked out, and the training and test data along with the point predictions and prediction intervals are shown in the figures below for each of the 3 methods.

Figures 2 and 3. Left figure: Training set and prediction intervals using clustering with 52 cluster centers. Right figure: The test set and corresponding prediction intervals using the same cluster centers.

Figures 4 and 5. Left figure: Training set and prediction intervals using the variance method. Right figure: Test set and prediction intervals using the variance method.

Figures 6 and 7. Left figure: Training set and prediction intervals using the baseline method. Right figure: Test set and prediction intervals using the baseline method.

From the plots it can be seen how the clustering and variance methods try to fit the sizes of the prediction intervals to match the easy and difficult parts of the data. For the clustering method it can be seen how the training areas (in the feature vector space) that give large prediction intervals also give large prediction intervals for the test data. The same is to some extent true for the variance method.

Looking at the corresponding performance data for this experiment gives an easier way of comparing the 3 methods. The performance measures are shown in the table below.

Results for the clustering method as a function of the number of cluster centers:

n_centers / c_clus / c_test_clus / cost_clus / cost_test_clus
4 / 98 / 61 / 0.001842 / 0.001679
8 / 90 / 59 / 0.001762 / 0.001547
12 / 88 / 58 / 0.001702 / 0.001421
16 / 83 / 54 / 0.001430 / 0.001480
20 / 86 / 57 / 0.001372 / 0.001415
24 / 84 / 56 / 0.001387 / 0.001506
28 / 89 / 58 / 0.001092 / 0.001620
32 / 89 / 56 / 0.001112 / 0.001322
36 / 82 / 56 / 0.001105 / 0.001595
40 / 86 / 55 / 0.001067 / 0.001314
44 / 89 / 57 / 0.000986 / 0.001682
48 / 90 / 52 / 0.001038 / 0.001345
52 / 89 / 55 / 0.000960 / 0.001277
56 / 79 / 54 / 0.000926 / 0.001617
60 / 89 / 50 / 0.000898 / 0.001845
64 / 76 / 54 / 0.000908 / 0.001887
68 / 94 / 53 / 0.000851 / 0.002278
72 / 97 / 55 / 0.000807 / 0.002155
76 / 96 / 56 / 0.000866 / 0.002276
80 / 66 / 52 / 0.000503 / 0.002347
84 / 101 / 55 / 0.000525 / 0.002298
88 / 105 / 54 / 0.000524 / 0.002301
92 / 110 / 55 / 0.000513 / 0.002278
96 / 110 / 52 / 0.000472 / 0.002425
100 / 111 / 55 / 0.000431 / 0.002969

Results for the variance and baseline methods, and the experiment parameters:

c_var = 79 / c_basis = 108
c_test_var = 56 / c_test_basis = 62
cost_var = 0.002516 / cost_basis = 0.003075
cost_test_var = 0.004856 / cost_test_basis = 0.002414
e_slag (the slag parameter) = 0 / bias (the bias term) = 0
Number of training targets = 156 / Number of test targets = 100

n_centers = number of cluster centers

c_ = number of training targets inside prediction intervals

c_test_ = number of test targets inside prediction intervals

cost_ = cost function for training data

cost_test_ = cost function for test data.

From the values in the table it can be seen that the number of cluster centers has a significant influence on the performance of the clustering method. Comparing the number of targets found inside the prediction intervals, it ranges quite a lot, from 66 to 111, for the training data for the clustering method, a range that includes the 79 and 108 of the variance and baseline methods respectively. For the test data the range of values is smaller, but the maximum is only 61, compared to the variance method's 56 and the baseline's 62. So the baseline method actually has a higher number of test targets that fall within the prediction intervals. The values found might seem pretty low; the maximum percentage of data inside the prediction intervals is (111/156 ≈) 71%, but this has to be weighed against the fact that the total sum of the prediction intervals is equal to the minimum needed in order to include all the targets in the intervals[3].

On the other hand, when comparing the cost functions (precision) of the different methods, the clustering method turns out to perform far better than the other methods. For the training data the cost generally decreases with the number of cluster centers, and it should theoretically be able to reach zero for n_centers = 156 (the number of training targets). For the test data it has a minimum at 52 cluster centers, where the value is about half that of the baseline method and about one fourth that of the variance method. The increase in the cost function for the test data at high numbers of cluster centers can be viewed as a form of overfitting. It comes from the fact that using too many cluster centers relates the size of the prediction interval to specific training samples instead of areas of the input feature space. So if a sample with low noise lies in an area with generally high noise levels, and a test sample is close to it, the test sample will get a small prediction interval, even though it might have a high level of noise, because it comes from an area with generally high noise levels.