3.3 Training Model

A. Estimation

(Ordinary) Least Squares

$E(\theta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

where $y_i$ is the observed target value and $\hat{y}_i$ is the fitted value for case $i$.
Training (learning): Estimating the unknown parameters in a neural network.

Error function: a measure of discrepancy between the observed and fitted values. The objective of training a model is to find the parameters that minimize the error function.

Two well-known error functions:

  • OLS (ordinary least squares): minimize $\sum_i (y_i - \hat{y}_i)^2$
  • LAD (least absolute deviations): minimize $\sum_i |y_i - \hat{y}_i|$

The choice among different error functions depends on statistical considerations such as the distribution of the target variable. Such considerations are studied in the theory of point estimation in mathematical statistics.

Maximum Likelihood Estimation

Intuition: The likelihood function is the joint probability of the observed data, expressed as a function of the parameters. The ML estimates are located at the mode of this distribution. Consequently, the ML estimates are the parameter values that have the highest probability of having generated the observed data. ML estimates have many desirable statistical properties such as asymptotic efficiency and consistency.

Error function based on the likelihood function: the negative log likelihood,

$E(\theta) = -\ln L(\theta) = -\sum_{i=1}^{n} \ln f(y_i;\theta)$

Note: To form the error function based on the likelihood function, we need to know the probability distribution of the observed data. Moreover, assuming the cases are independent, the likelihood is formed by taking the product of the individual probability density functions (or probability mass functions).
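For example, if the target is normally distributed with constant variance $\sigma^2$, the negative log likelihood is

$-\ln L(\theta) = \frac{n}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2,$

which differs from the OLS error function only by constants, so in this case maximizing the likelihood is the same as ordinary least squares.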

Deviance

Intuition: The deviance is scaled to be nonnegative by subtracting the negative log likelihood of the fully saturated model from that of the fitted model; this scaling is numerically advantageous when training a model.

The fully saturated model: estimates the mean of each case by its observed target value and thus gives a perfect fit, i.e. $\hat{\mu}_i = y_i$.

Deviance:

$D(y;\hat{\mu}) = 2\phi\,[\,\ell_{\mathrm{sat}} - \ell(\hat{\mu};y)\,]$

where $\phi$ is the scale parameter and $\ell_{\mathrm{sat}}$, $\ell(\hat{\mu};y)$ are the log likelihoods of the saturated and fitted models.

For example, if $y_i \sim N(\mu_i,\sigma^2)$, then $\ell(\mu;y) = -\tfrac{n}{2}\ln(2\pi\sigma^2) - \sum_i (y_i-\mu_i)^2/(2\sigma^2)$ and $\ell_{\mathrm{sat}} = -\tfrac{n}{2}\ln(2\pi\sigma^2)$.

So if we take the scale parameter $\phi = \sigma^2$, it yields $D(y;\hat{\mu}) = \sum_i (y_i-\hat{\mu}_i)^2$, the least squares criterion.

Upshot: If the distribution of the target is a member of the exponential family, then the deviance can be used as the error function. Maximizing the likelihood is equivalent to minimizing the deviance.

Applications: Many useful probability distributions are in the exponential family. They have a clean form for the deviance.

Normal: $D = \sum_i (y_i - \hat{\mu}_i)^2$

Poisson: $D = 2\sum_i \left[\, y_i \ln(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i) \,\right]$

Gamma: $D = 2\sum_i \left[\, (y_i - \hat{\mu}_i)/\hat{\mu}_i - \ln(y_i/\hat{\mu}_i) \,\right]$

Bernoulli: $D = 2\sum_i \left[\, y_i \ln(y_i/\hat{\mu}_i) + (1-y_i)\ln\!\big((1-y_i)/(1-\hat{\mu}_i)\big) \,\right]$, which for 0/1 targets reduces to $-2\sum_i \left[\, y_i \ln\hat{\mu}_i + (1-y_i)\ln(1-\hat{\mu}_i) \,\right]$
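As an illustration, the following DATA step sketch accumulates the Poisson deviance from a scored data set; the data set name scored and the variables y (observed count) and p_y (predicted mean) are hypothetical.

data _null_;
   set scored end=last;                          /* hypothetical scored data set   */
   term = 2*(p_y - y);                           /* contribution when y = 0        */
   if y > 0 then term = 2*(y*log(y/p_y) - (y - p_y));
   dev + term;                                   /* running total of the deviance  */
   if last then put "Poisson deviance = " dev;
run;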

PROC NEURAL

  • The deviance is the default error function when an exponential family distribution is specified.
  • For interval targets (LEVEL=INT), the default is the normal deviance (least squares).
  • For nominal or ordinal categorical targets (LEVEL=NOM or LEVEL=ORD), the default is the multiple Bernoulli deviance.

(Iterative) Weighted Least Squares

The nature of least squares:

  • A good measure of discrepancy between the observed and fitted values
  • Relatively easy to implement
  • An intrinsic connection to maximum likelihood estimation (the least squares estimate is the same as the ML estimate if the target variable is normally distributed)

Weighted least squares:

$E(\theta) = \sum_{i=1}^{n} w_i\,(y_i - \hat{y}_i)^2$

Intuition: It seems sensible to allow cases that are observed with more precision to have more influence on the estimation of the parameters. This consideration leads to weights that are inversely proportional to the variance of the target.

Feature of WLS: When the target distribution is from the exponential family, maximizing the likelihood (or minimizing the deviance) is equivalent to WLS.

Applications: the weights are taken inversely proportional to the variance implied by the assumed distribution, e.g. $w_i \propto 1/\mu_i$ for a Poisson target and $w_i \propto 1/\mu_i^2$ for a gamma target.

Computational difficulty: Since the weights in general are functions of the unknown parameters, we cannot solve for the parameters directly as in the case of OLS. Instead, the parameters are solved for iteratively, reapplying the OLS algorithm with updated weights.

Iterative Algorithm:

Step 1. Start with OLS ($w_i = 1$) and obtain the initial estimate of $\theta$.

Step 2. Update the weights and proceed with (weighted) OLS to get the updated estimate of $\theta$.

Step 3. Repeat Step 2 until $\theta$ converges.
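A minimal numerical sketch of Steps 1-3, using toy data and a simple linear mean rather than a neural network (PROC IML is used only to make the loop explicit; the weights follow a Poisson-type variance, $w_i = 1/\mu_i$):

proc iml;
   /* toy data: a positive target whose variability grows with its mean */
   x = {1, 2, 3, 4, 5, 6, 7, 8};
   y = {2, 3, 3, 5, 4, 8, 7, 11};
   des = j(nrow(x), 1, 1) || x;                 /* design matrix with intercept column      */
   w   = j(nrow(x), 1, 1);                      /* Step 1: all weights equal to 1, i.e. OLS */
   do iter = 1 to 20;                           /* Step 3: iterate (a convergence test on   */
      wmat = diag(w);                           /* the estimates would replace the fixed 20)*/
      beta = solve(des`*wmat*des, des`*wmat*y); /* weighted least squares estimate          */
      mu   = des*beta;                          /* fitted means                             */
      w    = 1/(abs(mu) + 1e-6);                /* Step 2: weight = 1/variance (Var = mu)   */
   end;
   print beta;
quit;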

Nonnegative Targets

  • For an integer (count) target, the Poisson distribution is recommended.
  • For a nonnegative continuous target, the gamma distribution is recommended.
  • Both distributions are positively skewed and have greater inherent variability at larger means.

The approaches for training the model:

  1. Error function & link function: couple the Poisson or gamma error function with a link function that constrains the expected target to be nonnegative.

PROC NEURAL

  • The Poisson and gamma deviances are specified in the TARGET statement with the ERROR= option.
  • Use ACT=EXP on the TARGET statement to specify the log link. The square root link function (ACT=SQUARE) could also be used. (See the sketch after this list.)
  2. Transformation & OLS

The purpose of the transformation is to stabilize the variance and then apply OLS. The rule of thumb:

  • Poisson-like targets (variance proportional to the mean): use the square root transformation.
  • Gamma-like targets (variance proportional to the square of the mean): use the log transformation.
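A minimal code sketch of approach 1, following the PROC NEURAL syntax used in the examples below; the data set work.claims and the variables x1, x2, and y are hypothetical.

proc dmdb data=work.claims out=dclaims dmdbcat=cclaims;
   var x1 x2 y;
run;

proc neural data=dclaims dmdbcat=cclaims;
   input x1 x2 / level=int;
   target y / level=int error=poisson act=exp;   /* Poisson deviance with the log link */
   archi mlp hidden=3;
   train;
   score data=work.claims out=pred nodmdb;
run;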

Example 3.8 (Diagnostics) The Boston housing data concerns 506 census tracts. The target is median home value. The 13 inputs are attributes of the tracts thought to affect housing values. One of the inputs is a binary indicator; all others are continuous. In the original analysis, a linear model was fitted by least squares to the log-transformed target. This transformation is appropriate for positive responses where the variance is proportional to the square of the mean. Here we explore different link functions and error functions to account for the alleged nonnormality.

  1. Input Data Source node:
  • Select the Neuralnt.HOUSING data set.
  • Change the model role of MEDV to target and the measurement scales of RAD and CHAS to interval. CHAS could have been left with a binary measurement scale. PROC NEURAL will automatically create a dummy variable for it. However, it is already coded numerically (0/1).
  2. Neural Network node:
  • In the Basic tab, set the number of neurons to 3.
  • In the Output tab, check the Process or Score box and check the Residuals box in the Variables subtab.
  3. SAS Code node:
  • Type in the following program:

proc gplot data=&_train;
   symbol v=circle;
   plot r_medv*p_medv / vref=0 frame;
run;
quit;

data &_train;
   set &_train;
   ar = abs(r_medv);
run;

proc corr spearman nosimple data=&_train;
   var ar;
   with p_medv;
run;

This program produces a plot of the residuals versus the predicted values and the Spearman correlation between the absolute residuals and the predicted values.

  4. Run the flow from the SAS Code node and view the results.

The residuals versus predicted values plot is a standard diagnostic tool. The absolute value of the residuals measures the variability. A flaring pattern indicates that the variance is related to the mean. In this problem, there appears to be a slight increase in variance as the mean increases.

Spearman correlation:

Let $(x_i, y_i),\ i=1,\dots,n$, be a paired sample and $(R_i, S_i)$ be the corresponding ranks within each variable. The Spearman correlation is an ordinary (Pearson) correlation coefficient computed on the ranks, i.e.

$r_S = \dfrac{\sum_i (R_i-\bar{R})(S_i-\bar{S})}{\sqrt{\sum_i (R_i-\bar{R})^2 \,\sum_i (S_i-\bar{S})^2}}$

where $\bar{R}$ and $\bar{S}$ are the mean ranks, $\bar{R} = \bar{S} = (n+1)/2$.

In this problem, the Spearman correlation between the absolute residuals and the predicted values is small and positive, indicating a slight increase in variability.

  5. Reopen the Neural Network node:

In the original analysis, MEDV was log-transformed. An asymptotically equivalent analysis uses the log link function and the gamma distribution.

  • In the General tab, check the Advanced User Interface box.
  • In the Advanced tab, double-click on the output layer. In the Target tab, set the activation function to exponential (the log link) and the error function to gamma.
  6. Revise the program in the SAS Code node:

data &_train;
   set &_train;
   wr = r_medv/p_medv;
   ar = abs(wr);
run;

proc gplot data=&_train;
   symbol v=circle;
   plot wr*p_medv / vref=0 frame;
run;
quit;

proc corr spearman nosimple data=&_train;
   var ar;
   with p_medv;
run;

The weighted residuals (Pearson residuals) are the product of the ordinary residuals and the square root of the weight. They reflect the effect of the estimation method.
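In symbols, with ordinary residual $r_i = y_i - \hat{\mu}_i$ and weight $w_i \propto 1/\mathrm{Var}(y_i)$, the Pearson residual is $r_i\sqrt{w_i}$:

  • Gamma error ($\mathrm{Var} \propto \mu^2$): $r_i\sqrt{w_i} \propto (y_i - \hat{\mu}_i)/\hat{\mu}_i$, the wr=r_medv/p_medv computed above.
  • Poisson error ($\mathrm{Var} \propto \mu$): $r_i\sqrt{w_i} \propto (y_i - \hat{\mu}_i)/\sqrt{\hat{\mu}_i}$, the wr=r_medv/sqrt(p_medv) used in the next step.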

The residual plot and Spearman correlation show that the gamma distribution overcorrected the variance heterogeneity.

  7. Reopen the Neural Network node:
  • Change the error function from gamma to Poisson.
  • In the SAS Code node, modify the program to reflect the new weights:

wr=r_medv/sqrt(p_medv);

  8. Run the flow and view the results.

The residual plot and correlation show that the Poisson distribution slightly overcorrected the variance heterogeneity. Both the normal and Poisson models have adequately constant variance. The residual diagnostics give little indication of a need for a “transformation” analysis.

Binary Targets

The usual distribution for binary targets is the Bernoulli distribution. The Bernoulli deviance favors small predicted values when y=0 and large predicted values when y=1. The Bernoulli distribution is a member of the exponential family. Thus ML estimation is equivalent to weighted least squares where cases with probabilities close to zero or one are given greater weight because their variance is small.

A binary target is a special case of a polychotomous target. The Bernoulli distribution is a special case of the multiple Bernoulli distribution (multinomial with one trial). The mean of a (multiple) Bernoulli random variable needs to be constrained to be between zero and one. Consequently, the (multiple) Bernoulli error function needs to be coupled with the (generalized or cumulative) logistic link function.

The entropy error function (ENT) is identical to the Bernoulli when the target is binary (0/1). However, the entropy error function can also be used with continuous targets that are between zero and one (proportions).

The multiple entropy error function (MEN) is identical to the multiple Bernoulli when the targets are a set of 0/1 indicators for a polychotomous outcome. However, the entropy error function can also be used with a set of proportions that sum to one (compositional data).
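Written as error functions (the standard cross-entropy forms, up to scaling), with predicted value $\hat{\mu}$:

Entropy (binary or proportion target, $0 \le y \le 1$): $E = -\sum_i \left[\, y_i \ln \hat{\mu}_i + (1-y_i)\ln(1-\hat{\mu}_i) \,\right]$

Multiple entropy (proportions $y_{ij}$ that sum to one over $j$): $E = -\sum_i \sum_j y_{ij} \ln \hat{\mu}_{ij}$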

Example 3.9 (Compositional Data) Proportions are interval-scaled variables that are bounded by zero and one. The PRODMIX data set consists of 200 cases (commercial customers). The customers have purchased at least one of three different products. The products are not mutually exclusive. The targets are the proportions of sales for each of these three products (PROD1, PROD2, PROD3). The sum of the three proportions is 1 for all cases. A single input variable is available: a scaled measure of the customer's tenure with the company. (Other inputs were available, but only one is used for pedagogic purposes.)

Univariate Analysis:

The default identity link and normal error function could be used for these interval targets. A more appropriate analysis would constrain the expected target to be between zero and one and account for the likely occurrence that proportions closer to ½ have greater variability. Consequently, the logit link and cross-entropy error function are specified.

proc dmdb data=neuralnt.prodmix out=dmdpm dmdbcat=dmcpm;
   var prod1 prod2 prod3 cage;
run;

proc neural data=dmdpm dmdbcat=dmcpm graph;
   input cage / level=int;
   target prod1 / level=int act=logistic error=entropy;
   archi mlp hidden=4;
   train;
   score data=neuralnt.prodmix out=pred1 nodmdb;
run;

proc neural data=dmdpm dmdbcat=dmcpm graph;
   input cage / level=int;
   target prod2 / level=int act=logistic error=entropy;
   archi mlp hidden=4;
   train;
   score data=neuralnt.prodmix out=pred2 nodmdb;
run;

proc neural data=dmdpm dmdbcat=dmcpm graph;
   input cage / level=int;
   target prod3 / level=int act=logistic error=entropy;
   archi mlp hidden=4;
   train;
   score data=neuralnt.prodmix out=pred3 nodmdb;
run;

The predicted values from the three neural networks are merged into one data set and plotted. Additionally, the three predicted values are overlaid with their sum.

data pred;
   merge pred1 pred2 pred3;
   tot = sum(of p_prod1-p_prod3);
run;

goptions reset=all;

proc gplot data=pred;
   symbol1 c=b v=none i=spline;
   symbol2 c=bl v=circle i=none;
   symbol3 c=r v=none i=spline;
   plot p_prod1*cage=1 prod1*cage=2 / overlay frame;
   plot p_prod2*cage=1 prod2*cage=2 / overlay frame;
   plot p_prod3*cage=1 prod3*cage=2 / overlay frame;
   plot (p_prod1 p_prod2 p_prod3)*cage=1 tot*cage=3 / overlay frame vaxis=0 to 1.1 by .1;
run;
quit;

The three predicted proportions do not sum to one as they should.

Multivariate Models:

The softmax output activation function constrains the predicted proportions to sum to one. The multiple entropy function is the interval variable analogue to the multiple Bernoulli error function.
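The softmax activation maps the output-layer linear combinations $\eta_{i1}, \eta_{i2}, \eta_{i3}$ into predicted proportions that are positive and sum to one:

$\hat{\mu}_{ij} = \dfrac{\exp(\eta_{ij})}{\sum_{k=1}^{3}\exp(\eta_{ik})}, \qquad \sum_{j=1}^{3}\hat{\mu}_{ij} = 1.$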

proc neural data=dmdpm dmdbcat=dmcpm graph;
   input cage / level=int;
   target prod1 prod2 prod3 / level=int act=softmax error=mentropy;
   archi mlp hidden=4;
   train;
   score data=neuralnt.prodmix out=pred nodmdb;
run;

data pred;
   set pred;
   tot = sum(of p_prod1-p_prod3);
run;

goptions reset=all;

proc gplot data=pred;
   symbol1 c=b v=none i=spline;
   symbol2 c=bl v=circle i=none;
   symbol3 c=r v=none i=spline;
   plot p_prod1*cage=1 prod1*cage=2 / overlay frame;
   plot p_prod2*cage=1 prod2*cage=2 / overlay frame;
   plot p_prod3*cage=1 prod3*cage=2 / overlay frame;
   plot (p_prod1 p_prod2 p_prod3)*cage=1 tot*cage=3 / overlay frame vaxis=0 to 1.1 by .1;
run;
quit;

Optimization Results
Parameter Estimates

                                      Gradient of
 N  Parameter       Estimate    Objective Function
 1  cage_H11        0.384378       -0.000002954
 2  cage_H12       -0.280187        0.000130
 3  cage_H13       -1.095511        0.000009684
 4  cage_H14       -1.774301       -0.000067746
 5  BIAS_H11       -1.211838       -0.000213
 6  BIAS_H12        0.904729       -0.000732
 7  BIAS_H13       -4.650418       -0.000000629
 8  BIAS_H14        4.377352       -0.000064350
 9  H11_prod1       2.901825       -0.000070088
10  H12_prod1       1.403374        0.000061612
11  H13_prod1      -1.568585       -0.000082627
12  H14_prod1       2.984321        0.000079331
13  H11_prod2       1.229019        0.000688
14  H12_prod2       2.256894       -0.000602
15  H13_prod2       3.123593        0.000800
16  H14_prod2       4.000011       -0.000799
17  BIAS_prod1     -3.629372        0.000078869
18  BIAS_prod2     -2.159585       -0.000802

Value of Objective Function = 0.0223055805

M-Estimation

General error function: $E(\theta) = \sum_{i=1}^{n} \rho(y_i - \hat{y}_i)$, for some function $\rho$ of the residuals.

Special Cases:

  • OLS (ordinary least squares): $\rho(r) = r^2$
  • LAD (least absolute deviation): $\rho(r) = |r|$
  • Huber's M-estimation: $\rho(r)$ is quadratic for small residuals and linear for large residuals (see below)

Origin of Huber’s M-estimation method-

It is a robust alternative to the ordinary least squares method.

Essence of Huber’s M-estimation method-

It is designed to reduce the influence of extreme observations (outliers). It is robust in the sense that it is nearly as efficient as least squares when the data are normal, and it can be much more efficient when the distribution has relatively more mass concentrated in the tails (i.e., when the data may be contaminated).

Characteristics of Huber’s M-estimation method-

M-estimation can be thought of as a special case of ML estimation where the probability density function is proportional to $\exp(-\rho(r))$. The normal and Laplace (double exponential) distributions are special cases where the $\rho$ function is the square and the absolute value, respectively. M-estimation can be used even when $\exp(-\rho(r))$ is not a proper density. In robust estimation the $\rho$ function is chosen to be insensitive to extreme values. Huber's function corresponds to a combination of the normal and Laplace distributions: it behaves like least squares for small residuals and like LAD for large residuals.
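Written out (a standard form of Huber's function, where $c$ is a tuning constant set by the software or the user):

$\rho(r) = \begin{cases} \tfrac{1}{2}r^2, & |r| \le c \\ c\,|r| - \tfrac{1}{2}c^2, & |r| > c \end{cases}$

The function is quadratic (least-squares-like) for small residuals and linear (LAD-like) for large residuals.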

Example 3.10 (Robust Regression) The normal distribution is symmetric, unimodal, and light-tailed. Consequently, ordinary least squares estimation is not robust to outliers. To examine the effect of outliers, two cases in the EXPCAR data set are altered. In the 45th case, the input is increased by an order of magnitude. In the 85th case, the target is reduced by an order of magnitude.

data a;
   set neuralnt.expcar;
   if _n_ = 45 then do;
      x = 10*x;
      ey = .;
   end;
   if _n_ = 85 then do;
      y1 = y1/10;
      ey = .;
   end;
run;

proc sort data=a;
   by x;
run;

The default estimation for an interval target is ordinary least squares.

proc dmdb data=a out=dexp dmdbcat=cexp;
   var x y1;
run;

proc neural data=dexp dmdbcat=cexp graph;
   input x / level=int;
   target y1 / level=int;
   archi mlp hidden=3;
   train;
   score data=a out=pred nodmdb;
run;

goptions reset=all;

proc gplot data=pred;
   symbol1 c=b v=none i=join;
   symbol2 c=bl v=circle i=none;
   symbol3 c=r v=none i=join;
   plot ey*x=1 y1*x=2 p_y1*x=3 / frame overlay;
run;
quit;

The fitted curve shows the disastrous influence of these outliers. To fix this problem, we consider Huber's M-estimation method.

Huber’s M-estimate is specified with the ERROR=HUBER on the TARGET statement.

proc neural data=dexp dmdbcat=cexp graph;
   input x / level=int;
   target y1 / level=int error=huber;   /* M-estimate */
   archi mlp hidden=3;
   train;
   score data=a out=pred nodmdb;
run;

goptions reset=all;

proc gplot data=pred;
   symbol1 c=b v=none i=join;
   symbol2 c=bl v=circle i=none;
   symbol3 c=r v=none i=join;
   plot ey*x=1 y1*x=2 p_y1*x=3 / frame overlay;
run;
quit;

The NEURAL Procedure

Optimization Results
Parameter Estimates

                                      Gradient of
 N  Parameter       Estimate    Objective Function
 1  x_H11          -3.660918        0
 2  x_H12          -6.187237        0.001458
 3  x_H13          -0.607332        0.002924
 4  BIAS_H11      -29.728186        0
 5  BIAS_H12       -3.776346        0.000006314
 6  BIAS_H13        0.181101        0.010342
 7  H11_y1          5.060957       -0.000561
 8  H12_y1        -50.308829       -0.000593
 9  H13_y1         39.264104       -0.000763
10  BIAS_y1        30.205240        0.000561
11  SCALE_y1        4.729456        0.000166

Value of Objective Function = 6.3859385772

Link/Error Combinations

Target type      | Link        | Output Act. | Error
-----------------|-------------|-------------|------------
Amounts          | Identity    | Identity    | Normal
                 | Identity    | Identity    | Huber
                 | Log         | Exponential | Gamma
                 | Log         | Exponential | Poisson
Names & Grades   | Logit       | Logistic    | Bernoulli
                 | Gen. logit  | Softmax     | MBernoulli
                 | Cum. logit  | Logistic    | MBernoulli
Proportions      | Logit       | Logistic    | Entropy
                 | Gen. logit  | Softmax     | Mentropy

B. Optimization

Once an appropriate error function is specified, it needs to be optimized. For some simple statistical models (e.g. linear regression) there exist closed-form analytical formulas for the optimum parameter estimates. For nonlinear models like neural networks, the parameter estimates need to be determined numerically, using an iterative algorithm. The error function defines a surface in the parameter space. Numerical optimization algorithms use features of the surface to intelligently find the optimum.

  1. Basis of Iterative Algorithms

Algorithm Ingredients:

  • Initial estimates-the p-dimensional vector $\theta^{(0)}$. (Assume there are p unknown parameters to be estimated.)
  • Recursive update formula-the p-vector $\Delta^{(k)} = \theta^{(k+1)} - \theta^{(k)}$, which gives the step size and direction for the (k+1)th iteration (see the sketch after this list).
  • Convergence criterion.
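Putting the ingredients together, a typical iteration and stopping rule look like the following (the exact criterion and tolerance $\epsilon$ vary by algorithm):

$\theta^{(k+1)} = \theta^{(k)} + \Delta^{(k)}, \qquad \text{stop when } \lVert\theta^{(k+1)} - \theta^{(k)}\rVert < \epsilon \ \text{ or } \ \lvert E(\theta^{(k+1)}) - E(\theta^{(k)})\rvert < \epsilon.$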

Neural Network Difficulties:

  • Large number of parameters.
  • Non-parabolic features such as flat regions and troughs.
  • Inferior local minima.
  2. Some Well-Known Numerical Algorithms
Gradient Descent

  • The gradient, $\nabla E(\theta^{(k)})$, is the vector of partial derivatives of the objective function with respect to the parameters, evaluated at the kth estimates.
  • The gradient points in the steepest direction uphill. Consequently, the step is made in the direction that is locally steepest downhill.
  • The step length $\eta$ is called the learning rate in the neural network literature. If $\eta$ is too large, the algorithm will diverge; if $\eta$ is too small, progress is excruciatingly slow.
  • Unfortunately, the local descent direction does not usually point toward the minimum. Consequently, gradient descent is prone to large switchbacks (hemstitching) of the step direction across troughs. (The update rule is written out below.)
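The gradient descent update referred to above is

$\theta^{(k+1)} = \theta^{(k)} - \eta\,\nabla E(\theta^{(k)}).$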

Steepest Descent

  • At each iteration, steepest descent incorporates a line search, $\eta_k = \arg\min_{\eta} E\!\left(\theta^{(k)} - \eta\,\nabla E(\theta^{(k)})\right)$, for the minimum in the descent direction.
  • The process ensures descent and gives pairwise orthogonal steps.
  • However, the line search may require many function evaluations. Furthermore, hemstitching can still be a problem.

Back-propagation