Distributions, Anomalies, and Econometric Definitions
Normal Distribution: (μ = population mean, σ² = population variance)
Probability density function for a normally distributed random variable, x, with a population mean of μ and a population variance of σ²:
f(x) = [1 / (σ√(2π))] · exp[−(x − μ)² / (2σ²)]
Central Limit Theorem: Regardless of the distribution from which a random variable, X, originates, the distribution of the sample mean, X̄, approaches a normal distribution with mean μ and variance σ²/n as the number of observations, n, approaches infinity.
Standard Normal Distribution:
Let X be a normally distributed random variable with population mean μ and population variance σ². Then Z = (X − μ) / σ is distributed standard normal, with mean 0 and variance 1.
Chi-Square Distribution:
Let Z1, Z2, ..., Zk be independent standard normal random variables. Let Q be defined as Q = Z1² + Z2² + ... + Zk². Then Q is distributed chi-square with k degrees of freedom.
The population mean of Q is k.
The population variance of Q is 2k.
The chi-square distribution is skewed to the right. As k approaches infinity, the distribution approaches symmetry.
For k > 100, the transformation √(2Q) − √(2k − 1) is approximately distributed standard normal.
t - Distribution:
Let Z be a standard normal distributed variable. Let Q be a chi-square distributed variable with k degrees of freedom, independent of Z. Then T = Z / √(Q/k) is distributed t with k degrees of freedom.
The population mean of T is zero.
The population variance of T is k / (k − 2) (for k > 2).
As k approaches infinity, the t-distribution approaches the standard normal distribution.
F Distribution:
Let Q1 and Q2 be independently distributed chi-square variables with k1 and k2 degrees of freedom, respectively. Then F = (Q1/k1) / (Q2/k2) is distributed F with k1 and k2 degrees of freedom.
The population mean of F is k2 / (k2 − 2) (for k2 > 2).
The population variance of F is 2k2²(k1 + k2 − 2) / [k1(k2 − 2)²(k2 − 4)] (for k2 > 4).
The F distribution is skewed to the right. As k1 and k2 approach infinity, the distribution approaches the normal distribution.
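The relationships above can be checked by simulation. The following is a minimal sketch (assuming Python with NumPy; the data are simulated and the sample sizes are arbitrary) that builds chi-square, t, and F variables from standard normals and compares their sample moments with the population values given above.

```python
# Minimal simulation sketch (NumPy assumed) of how the chi-square, t, and F
# distributions are built from standard normal variables.
import numpy as np

rng = np.random.default_rng(0)
n, k1, k2 = 200_000, 5, 10

Z = rng.standard_normal((n, k1))                         # k1 independent standard normals per draw
Q1 = (Z ** 2).sum(axis=1)                                # chi-square with k1 degrees of freedom
Q2 = (rng.standard_normal((n, k2)) ** 2).sum(axis=1)     # chi-square with k2 degrees of freedom

T = rng.standard_normal(n) / np.sqrt(Q1 / k1)            # t with k1 degrees of freedom
F = (Q1 / k1) / (Q2 / k2)                                # F with (k1, k2) degrees of freedom

print(Q1.mean(), Q1.var())   # approx. k1 and 2*k1
print(T.mean(), T.var())     # approx. 0 and k1/(k1 - 2)
print(F.mean())              # approx. k2/(k2 - 2)
```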
Statistical Anomalies: Problem, Detection, and Correction

Omitted Variable: A significant exogenous regressor has been omitted.
Problem: Parameter estimates are biased and inconsistent.
Detection: A new exogenous regressor can be found which is statistically significant.
Correction: Include the missing regressor.

Extraneous Variable[*]: An insignificant exogenous regressor has been included.
Problem: Parameter estimates are inefficient.
Detection: One or more of the exogenous regressors are statistically insignificant.
Correction: Exclude the extraneous regressor.

Regime Shift: The values of the parameters change at some point(s) in the data set.
Problem: Parameter estimates are biased and possibly inconsistent.[§]
Detection: Parameter estimates change significantly when the sample is split.
Correction: Determine the type of shift (slope or intercept) and location. Include a dummy variable to account for the shift.

Serial Correlation[†]: Current values of the error term are correlated with past values.
Problem: Standard errors of parameter estimates are biased (and thus t-statistics are invalid).[**]
Detection: Durbin-Watson statistic is significantly different from 2.[††]
Correction: Perform AR and/or MA correction.

Non-Zero Errors: The expected value of the error term is not zero.
Problem: Standard errors are biased. Correlation coefficient is biased.
Detection: Mean of the estimated residuals is not equal to zero.
Correction: Include a constant term in the regression.

Non-Linearity: The model is of a different functional form than the equation that generated the data.[‡]
Problem: Parameter estimates are biased and inconsistent.
Detection: A different functional form can be found which yields a greater adjusted R2 while using the same exogenous regressors.[‡‡]
Correction: Alter the functional form of the regression equation.

Non-stationarity: The dependent variable is a random walk.
Problem: Parameter estimates are biased and inconsistent. Standard errors are biased. Correlation coefficient is biased.
Detection: Regress the dependent variable on a constant and itself lagged. The slope coefficient will be equal to 1. Also, D.W. is usually less than R2.
Correction: Difference the dependent variable until it becomes stationary.

Multicollinearity: One or more of the exogenous regressors are significantly correlated.
Problem: When combined with the extraneous variable anomaly, parameter estimates are biased but consistent; parameter estimates are inefficient.
Detection: Correlation between exogenous regressors is significant. Parameter estimates are highly sensitive to changes in observations.
Correction: Remove one of the multicollinear regressors from the regression. Not a problem unless the correlation of the exogenous variables exceeds the regression correlation.
Statistical Anomalies: Problem, Detection, and Correction (continued)

Heteroskedasticity: The variance of the error term is not constant over the data set.
Problem: Standard errors of parameter estimates are biased.
Detection: Regress the squared residuals on a time dummy and the exogenous regressors. The presence of significant coefficients indicates heteroskedasticity.
Correction: Divide the regression by a regressor that is correlated with the squared error term. Iterated WLS: obtain estimates of the error standard deviations, divide the equation by those estimates, re-estimate, and repeat; the resulting standard errors are unbiased and asymptotically efficient.

Measurement Error: There is a random component attached to one or more of the exogenous regressors.[§§]
Problem: Parameter estimates are biased downward.
Detection: Examination of the construction of the exogenous variable.
Correction: Two-stage least squares procedure.

Truncated Regression: No data is drawn when values of y are beyond certain limits.
Problem: Parameter estimates are biased.
Detection: Examination of the criteria for data selection.
Correction: Include truncated observations.

Suppressor Variable: Independent variable is uncorrelated with the dependent variable, but appears significantly in a multiple regression model.
Problem: Parameter estimates are biased and inconsistent.
Detection: Significance when the variable appears in a multiple regression, but no significance when the variable appears in a single regression.
Correction: Eliminate the variable from the model.
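As an illustration of the heteroskedasticity detection step above, the following is a minimal sketch, assuming Python with NumPy, of regressing the squared OLS residuals on a time dummy (interpreted here as a time trend) and the exogenous regressors; the data and variable names are simulated and purely illustrative.

```python
# Minimal sketch (NumPy assumed): auxiliary regression of squared residuals on a
# time trend and the exogenous regressor to flag heteroskedasticity.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.standard_normal(n)
u = rng.standard_normal(n) * (1 + 0.8 * np.abs(x))   # error variance grows with |x|
y = 1.0 + 2.0 * x + u

def ols(X, y):
    """Return coefficient estimates, their standard errors, and residuals."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return beta, se, resid

X = np.column_stack([np.ones(n), x])
_, _, resid = ols(X, y)

# Auxiliary regression: squared residuals on a constant, a time trend, and x.
Z = np.column_stack([np.ones(n), np.arange(n), x])
gamma, se_gamma, _ = ols(Z, resid ** 2)
print(gamma / se_gamma)   # a large |t| on the x term suggests heteroskedasticity
```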
Unbiasedness:
The estimate of β is equal to β (on average); that is, E[β̂] = β.
Efficiency:
The standard error of the estimate of β is the smallest attainable standard error among the class of standard errors associated with linear unbiased estimators.
Consistency:
The estimate of β approaches β as the number of observations approaches infinity.
Parts of the Regression Equation: Y = α + βX + u
In the above model, Y is the dependent variable (also called the endogenous variable), X and a constant term are independent variables (also called exogenous variables, explanatory variables, or exogenous regressors), u is the error term (also called the stochastic term), and α and β are unknown parameters (or regression coefficients) that can be estimated with some known error. The estimates of the parameters are called α̂ (alpha hat) and β̂ (beta hat) and are, collectively, called parameter estimates or regression coefficient estimates.
The ordinary least squares (OLS) method of estimation calculates the parameter estimates of the model (and the standard errors of the parameter estimates) as follows (where N = # of observations, df = degrees of freedom):
β̂ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²,  α̂ = Ȳ − β̂X̄
s² = Σûᵢ² / df,  se(β̂) = √[s² / Σ(Xᵢ − X̄)²],  se(α̂) = √[s²(1/N + X̄² / Σ(Xᵢ − X̄)²)]
u is called the regression error; û is called the regression residual.
In matrix notation, β̂ = (X'X)⁻¹X'Y and Var(β̂) = s²(X'X)⁻¹
for the multiple regression model: Y = α + β1X1 + β2X2 + u
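A minimal sketch of these OLS formulas in matrix notation, assuming Python with NumPy and simulated data for the two-regressor model:

```python
# Minimal sketch (NumPy assumed): beta_hat = (X'X)^{-1} X'Y, Var(beta_hat) = s^2 (X'X)^{-1},
# with s^2 = (sum of squared residuals) / df and df = N - (# of parameters).
import numpy as np

rng = np.random.default_rng(2)
N = 100
x1, x2 = rng.standard_normal(N), rng.standard_normal(N)
y = 0.5 + 1.5 * x1 - 2.0 * x2 + rng.standard_normal(N)

X = np.column_stack([np.ones(N), x1, x2])            # constant, X1, X2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)         # parameter estimates
resid = y - X @ beta_hat                             # regression residuals
df = N - X.shape[1]                                  # degrees of freedom
s2 = resid @ resid / df                              # estimated error variance
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))   # standard errors
t_stats = beta_hat / se
print(beta_hat, se, t_stats)
```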
Salkever’s Method of Computing Forecasts and Forecast Variances
Regress the dependent variable, augmented with observations for the forecast periods, on the regressors augmented with one dummy variable for each forecast period. This generates the LS coefficient vector followed by the predictions. Residuals are 0 for the prediction observations, so the estimated coefficient covariance matrix contains both the covariance matrix of the coefficient estimates and the variances of the forecasts.
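The following is a minimal sketch of one common variant of Salkever's method, assuming Python with NumPy and simulated data: the dependent variable is set to zero in the forecast periods and the forecast-period dummies are set to −1, so the dummy coefficients come out as the forecasts and their standard errors as the forecast standard errors.

```python
# Minimal sketch (NumPy assumed) of Salkever's dummy-variable method for forecasts
# and forecast variances.  Data and forecast points are simulated for illustration.
import numpy as np

rng = np.random.default_rng(3)
n, m = 80, 3                                   # sample size, number of forecast periods
x = rng.standard_normal(n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)
x0 = np.array([0.5, 1.0, 1.5])                 # forecast-period regressor values

X = np.column_stack([np.ones(n), x])
X0 = np.column_stack([np.ones(m), x0])

# Augmented system: [X  0; X0  -I] with dependent variable [y; 0].
X_aug = np.block([[X, np.zeros((n, m))], [X0, -np.eye(m)]])
y_aug = np.concatenate([y, np.zeros(m)])

coef = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_aug)
resid = y_aug - X_aug @ coef
s2 = resid @ resid / (n - X.shape[1])          # forecast-period residuals are zero
cov = s2 * np.linalg.inv(X_aug.T @ X_aug)

beta_hat, forecasts = coef[:2], coef[2:]       # OLS coefficients, then the forecasts
forecast_se = np.sqrt(np.diag(cov)[2:])        # standard errors of the forecast errors
print(beta_hat, forecasts, forecast_se)
```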
Degrees of Freedom: df = # of observations - # of parameters in the model
t-Stat: t = β̂ / se(β̂), the ratio of a parameter estimate to its estimated standard error.
The probability of a parameter estimate being significantly different from zero is at least x% when the absolute value of the t-Stat associated with that parameter is greater than or equal to the critical value associated with the probability x%.
For example, the critical value associated with a probability of 95% (when there are 30 degrees of freedom) is 2.042. Thus, there is a 95% chance that any parameter estimate with a t-Stat greater than 2.042 (in absolute value) is not equal to zero.
Durbin-Watson Statistic: D.W. = Σ(ûₜ − ûₜ₋₁)² / Σûₜ²
To use the D.W. statistic, find the values of du and dl associated with the number of observations and degrees of freedom in the regression. The following are true:
D.W. < dl ⇒ positive serial correlation
D.W. > 4 − dl ⇒ negative serial correlation
du < D.W. < 4 − du ⇒ no serial correlation
(Values between dl and du, or between 4 − du and 4 − dl, are inconclusive.)
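A minimal sketch of the Durbin-Watson calculation, assuming Python with NumPy; the data are simulated with an AR(1) error so the statistic falls well below 2:

```python
# Minimal sketch (NumPy assumed): D.W. = sum((u_t - u_{t-1})^2) / sum(u_t^2),
# computed from OLS residuals.
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):                      # build a serially correlated (AR(1)) error
    u[t] = 0.7 * u[t - 1] + rng.standard_normal()
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)

dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(dw)    # well below 2 here, signalling positive serial correlation
```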
Correlation and Adjusted Correlation:
R² = 1 − Σûᵢ² / Σ(Yᵢ − Ȳ)²
Adjusted R² (R̄²) = 1 − [Σûᵢ² / (N − k)] / [Σ(Yᵢ − Ȳ)² / (N − 1)]
Note: R² uses var(resid) where df = N − 1; R̄² uses var(resid) where df = N − k, which is proportional to the mean squared error of the regression.
F-statistic: F = [R² / (k − 1)] / [(1 − R²) / (N − k)], distributed F with k − 1 and N − k degrees of freedom under the null hypothesis that all slope coefficients are zero.
Cross-Validity Correlation Coefficient:
Correlation indicates the percentage of variation in the dependent variable that is explained by variations in all of the independent variables combined. Because correlation always increases when the number of independent variables increases, the adjusted correlation provides a measure of correlation that accounts for the number of independent regressors. Criticisms of the adjusted correlation measure claim that the measure may not penalize a model enough for decreases in degrees of freedom. Because this issue is more acute in out-of-sample analysis, the cross-validity correlation coefficient has been proposed as an alternate measure of adjusted correlation.
Correlations (and adjusted correlations) from different models can be compared only if the dependent variables in the models are the same.
White Noise Error: An error term that is uncorrelated with all explanatory variables, is uncorrelated with past values of itself, and has a constant variance over time.
Autocorrelation of Order k: One of two types of serial correlation.
When the error term is autocorrelated of order k, the error term is not white noise, but is generated by the process:
uₜ = ρuₜ₋ₖ + εₜ
where ρ is a constant less than one in absolute value and εₜ is a white noise error.
Moving-Average of Order k: One of two types of serial correlation.
When the error term is moving average of order k, the error term is not white noise, but is generated by the process:
uₜ = εₜ + λεₜ₋ₖ
where λ is a constant less than one in absolute value and εₜ is a white noise error.
Geometric Lag Model: A model that allows us to express an infinite series of lags as a function of a few parameters. The model achieves this by assigning geometrically declining weights, which become arbitrarily small, to more distant lags:
yₜ = α + β(1 − λ)(xₜ + λxₜ₋₁ + λ²xₜ₋₂ + …) + uₜ,  0 < λ < 1
The implied coefficients from this model are: β(1 − λ), β(1 − λ)λ, β(1 − λ)λ², …. The model is estimated by regressing y on x and y lagged (the Koyck transformation). We have:
yₜ = α(1 − λ) + β(1 − λ)xₜ + λyₜ₋₁ + (uₜ − λuₜ₋₁)
so the estimated coefficient on y lagged is an estimate of λ and the estimated coefficient on x is an estimate of β(1 − λ).
The mean lag of the model is λ / (1 – λ)
The median lag of the model is ln(0.5) / ln λ
The short-run multiplier is β(1 – λ)
The long-run multiplier is β
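A minimal sketch of estimating the geometric lag model by regressing y on x and y lagged, assuming Python with NumPy and simulated data; the short- and long-run multipliers and the mean and median lags are then recovered from the estimated coefficients.

```python
# Minimal sketch (NumPy assumed): estimate the Koyck/geometric lag model and recover
# the multipliers and lag measures.  Data are simulated from the transformed model.
import numpy as np

rng = np.random.default_rng(5)
n, alpha, beta, lam = 500, 1.0, 2.0, 0.6
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):   # y_t = a(1-lam) + b(1-lam)x_t + lam*y_{t-1} + e_t
    y[t] = alpha * (1 - lam) + beta * (1 - lam) * x[t] + lam * y[t - 1] \
           + 0.2 * rng.standard_normal()

X = np.column_stack([np.ones(n - 1), x[1:], y[:-1]])   # constant, x_t, y_{t-1}
b = np.linalg.solve(X.T @ X, X.T @ y[1:])
lam_hat = b[2]                              # coefficient on lagged y estimates lambda
short_run = b[1]                            # beta(1 - lambda): within-period effect of x
long_run = b[1] / (1 - lam_hat)             # beta: cumulative long-run effect
mean_lag = lam_hat / (1 - lam_hat)
median_lag = np.log(0.5) / np.log(lam_hat)
print(short_run, long_run, mean_lag, median_lag)
```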
Two-Stage Least Squares Procedure: A procedure that corrects the errors-in-variables (measurement error) anomaly.
Suppose we are attempting to estimate the regression equation Y = α + βX + u, where there is a random component embedded in the exogenous variable X.
(1) Find another variable, Z, which is correlated with X and which does not contain a random component.
(2) Regress X on a constant and Z and compute the fitted values, X̂.
(3) Regress Y on a constant and X̂.
The resulting standard errors will be more efficient, although still less efficient than the straight OLS standard errors.
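A minimal sketch of the procedure, assuming Python with NumPy; the true regressor, the instrument, and the error variances are simulated and purely illustrative.

```python
# Minimal sketch (NumPy assumed): two-stage least squares when x is measured with error,
# using an instrument z correlated with the true regressor but free of that error.
import numpy as np

rng = np.random.default_rng(6)
n = 2000
x_true = rng.standard_normal(n)
z = x_true + 0.5 * rng.standard_normal(n)      # instrument: correlated with x_true
x = x_true + 0.8 * rng.standard_normal(n)      # observed x contains measurement error
y = 1.0 + 2.0 * x_true + rng.standard_normal(n)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Naive OLS on the mismeasured regressor: slope is biased toward zero.
print(ols(np.column_stack([np.ones(n), x]), y))

# Stage 1: regress x on a constant and z, keep the fitted values x_hat.
stage1 = np.column_stack([np.ones(n), z])
x_hat = stage1 @ ols(stage1, x)

# Stage 2: regress y on a constant and x_hat.
print(ols(np.column_stack([np.ones(n), x_hat]), y))   # slope close to the true value 2
```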
Test for Parameter Restrictions: A method for testing the hypothesis that a set of parameter restrictions are valid.
- Run the regression without the restriction (the unrestricted model). Capture the sum of squared residuals, RU.
- Run the regression with the restriction imposed (the restricted model). Capture the sum of squared residuals, RR.
- The test statistic F = [(RR − RU) / (KU − KR)] / [RU / (N − KU)] is distributed F with KU − KR and N − KU degrees of freedom under the null hypothesis that the restriction(s) are valid (where KU is the number of parameters in the unrestricted model, KR is the number of parameters in the restricted model, and N is the number of observations).
- Note: If the errors are not normally distributed, then via the central limit theorem, (KU − KR)·F is asymptotically distributed chi-square with KU − KR degrees of freedom.
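A minimal sketch of the restriction test, assuming Python with NumPy and simulated data; the restriction imposed here is that the coefficient on the second regressor is zero.

```python
# Minimal sketch (NumPy assumed): F-test of a parameter restriction by comparing
# restricted and unrestricted sums of squared residuals.
import numpy as np

rng = np.random.default_rng(7)
N = 150
x1, x2 = rng.standard_normal(N), rng.standard_normal(N)
y = 1.0 + 2.0 * x1 + 0.0 * x2 + rng.standard_normal(N)   # restriction (beta2 = 0) is true

def ssr(X, y):
    resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    return resid @ resid

X_u = np.column_stack([np.ones(N), x1, x2])   # unrestricted model (KU = 3)
X_r = np.column_stack([np.ones(N), x1])       # restricted model   (KR = 2)
RU, RR = ssr(X_u, y), ssr(X_r, y)

KU, KR = X_u.shape[1], X_r.shape[1]
F = ((RR - RU) / (KU - KR)) / (RU / (N - KU))
print(F)    # compare with the F(KU - KR, N - KU) critical value
```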
Seemingly Unrelated Regression (SUR): A system of regression equations whose error terms are correlated across equations. Estimating the equations jointly (generalized least squares applied to the stacked system) yields more efficient parameter estimates than estimating each equation separately by OLS, unless the regressors are identical in every equation or the cross-equation error correlations are zero.
Forecast Errors:
Let the estimated relationship between X and Y be: Ŷ = α̂ + β̂X
The forecasted value of Y when X is equal to some pre-specified value, X0, is Yf and is given by: Yf = α̂ + β̂X0
The variance of the mean value of the Y’s when X = X0 is given by: s²[1/N + (X0 − X̄)² / Σ(Xᵢ − X̄)²]
The variance of one observation of Y when X = X0 is given by: s²[1 + 1/N + (X0 − X̄)² / Σ(Xᵢ − X̄)²]
In matrix notation, the variance of the mean prediction is given by: s² x0'(X'X)⁻¹x0
In matrix notation, the variance of the individual prediction is given by: s²[1 + x0'(X'X)⁻¹x0]
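A minimal sketch of the two forecast variances in matrix notation, assuming Python with NumPy and simulated data; the point x0 is made up for illustration.

```python
# Minimal sketch (NumPy assumed): variance of the mean prediction, s^2 * x0'(X'X)^{-1} x0,
# and of an individual prediction, s^2 * (1 + x0'(X'X)^{-1} x0).
import numpy as np

rng = np.random.default_rng(8)
N = 60
x = rng.standard_normal(N)
y = 1.0 + 2.0 * x + rng.standard_normal(N)

X = np.column_stack([np.ones(N), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (N - X.shape[1])

x0 = np.array([1.0, 0.8])                 # constant and the pre-specified X0
y_f = x0 @ beta_hat                       # forecasted value of Y at X = X0
h = x0 @ np.linalg.inv(X.T @ X) @ x0
var_mean = s2 * h                         # variance of the mean prediction
var_indiv = s2 * (1 + h)                  # variance of an individual prediction
print(y_f, var_mean, var_indiv)
```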
Transformed Dependent Variables
R2’s from models with different dependent variables cannot be directly compared. Suppose Z = f(Y) is a transformation of Y. Using Z as the dependent variable, obtain fitted values Ẑ. Apply the inverse function to the fitted values, Ŷ = f⁻¹(Ẑ), to obtain Ŷ. Find the squared correlation between Ŷ and Y. This is the transformed R².
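A minimal sketch of the transformed R² for the transformation Z = ln(Y), assuming Python with NumPy and simulated data.

```python
# Minimal sketch (NumPy assumed): fit the model with Z = ln(Y) as the dependent variable,
# back-transform the fitted values with exp(.), and take the squared correlation with Y.
import numpy as np

rng = np.random.default_rng(9)
N = 200
x = rng.standard_normal(N)
y = np.exp(0.5 + 0.8 * x + 0.3 * rng.standard_normal(N))   # positive Y

X = np.column_stack([np.ones(N), x])
z = np.log(y)                                   # Z = f(Y) = ln(Y)
z_hat = X @ np.linalg.solve(X.T @ X, X.T @ z)   # fitted values of Z
y_hat = np.exp(z_hat)                           # inverse transformation
transformed_r2 = np.corrcoef(y_hat, y)[0, 1] ** 2
print(transformed_r2)                           # comparable with R^2 from models in Y
```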
OLS Assumptions:
The error term has a mean of zero.
There is no serial correlation in the error term.
The error term is homoskedastic.
There are no extraneous regressors.
There are no omitted regressors.
The exogenous regressors are not systematically correlated.
The exogenous regressors have no random component.
The true relationship between the dependent and independent variables is linear.
The regression coefficients are constant over the sample.
Test for significance of a correlation (or multiple correlation coefficient)
t = r√(n − 2) / √(1 − r²), distributed t with n − 2 degrees of freedom, for correlation, r, and n observations.
Three Possible Goals of Regression Analysis
Goal: Find determination equation for Y. Seek: All regressors that jointly yield statistically significant regression coefficients.
Goal: Measure the effect of X on Y. Seek: Coefficient for X (regardless of its significance) and all other regressors that jointly yield statistically significant regression coefficients.
Goal: Forecast Y. Seek: High adjusted R², all t-stats greater than or equal to 1, statistically significant F statistic.
[*] Research indicates that exogenous regressors which produce t-stats which are insignificant though greater than 1 should be included in the regression when the purpose of the regression is to produce a forecast.
[†] Serial correlation can sometimes be caused by an omitted variable and is corrected by including the omitted variable in the regression.
[‡] In the model y = α + βX + u, β is interpreted as the change in y given a unit change in X. In the model ln(y) = α + βX + u, β is interpreted as the relative change in y given a unit change in X. In the model y = α + βln(X) + u, β is interpreted as the change in y given a relative change in X. Multiplying a relative change by 100 yields a percentage change.
[§] Whether or not the parameter estimates are to be considered “consistent” depends on where in the data set the regime shift occurs. If, for example, the regime shift occurs close to the beginning of the data set, then as more observations are included after the regime shift the parameter estimates become (on average) less biased estimates of the true parameters that exist after the regime shift, but no less biased (and possibly more biased) estimates of the true parameters that exist before the regime shift.
[**] When the standard errors of a parameter estimate are inefficient, they are larger than the OLS standard errors in the absence of any statistical anomalies. Thus, a parameter estimate that is significant even in the presence of inefficient standard errors would be even more significant in the presence of efficient standard errors. By contrast, biased standard errors could be larger or smaller (we usually don’t know which is the case) than the OLS standard errors in the absence of statistical anomalies. Thus, we can conclude nothing about the significance of a parameter estimate in the presence of biased standard errors.
[††] Note the following: (1) failure to use a constant term in a regression makes the Durbin-Watson statistic biased, (2) the D-W statistic tests only for serial correlation of order 1, (3) the D-W statistic is unreliable in the presence of regressors with stochastic components (in fact, the combination of a lagged dependent variable as a regressor and a positively autocorrelated error term will bias the D-W statistic upward), (4) it has been shown that when the regressors are slowly changing series (as is the case with many economic series), the true critical value of the D-W statistic will be close to the Durbin-Watson upper bound.
[‡‡] The correlation coefficients from different regressions can only be compared if the dependent variable is the same in both regressions.
[§§] This problem also exists when the lagged dependent variable is used as an explanatory regressor and when the error term from that regression exhibits a degree of serial correlation greater than or equal to the lag.