Department of Statistics, Yale University

STAT242b Theory of Statistics

Suggested Solutions to Homework 10

Compiled by Marco Pistagnesi

Problem 1

a) We consider standardizing the probability:

(1)

Now, we consider the limiting case of the random variable within (1). We consider its mean and its variance:

(2)

(3)

Thus, we know that the random variable in (1) is approximately distributed:

(4)

And thus the proposed confidence interval is not accurate any longer because of the introduction of some error in the individual measurement.

(5)

b) We start at the same point, by standardizing the proposed confidence interval:

(6)

We consider the mean and variance of the random variable inside (6):
(7)

(8)

Thus we may define the 95% confidence interval

(9)

Problem 2

We note that this relationship is the same as saying that

(1)

Where in (1) and . Given this simple linear regression model, we use the result derived in class: the values that minimize the RSS are precisely those in (2) below.

and (2)
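As a small illustration of the closed-form least-squares estimates, here is a minimal sketch in Python. The function name `fit_simple_linear` and the toy data are hypothetical and are not the homework data; the formulas are the standard ones, slope = Sxy / Sxx and intercept = ybar - slope * xbar.

```python
def fit_simple_linear(x, y):
    """Return (b0, b1) minimizing the RSS of the model y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered sums of squares and cross products
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx          # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


# Hypothetical toy data lying exactly on y = 1 + 2x,
# so the fit should recover intercept 1 and slope 2.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_simple_linear(x, y)
print(b0, b1)
```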

And given our data, the estimated values are

and , thus (3)

Now we also take the variance of each of the estimators:

(4)

And from this we may construct 95 percent confidence intervals for the estimators of β0 and β1. First, however, we estimate the standard deviation of the independent errors and calculate some other useful quantities:

(5)

(6a)

(6b)

(6c)

Then we may continue with the confidence intervals for the estimators:

(7a)

(7b)
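The interval construction can be sketched in Python as follows. This is an illustration under stated assumptions, not the computation used for the homework data: the function name `coefficient_intervals` and the toy data are hypothetical, and a normal quantile is used for simplicity (with a small sample, a t quantile on n - 2 degrees of freedom would be the usual choice).

```python
import math
from statistics import NormalDist


def coefficient_intervals(x, y, level=0.95):
    """Approximate confidence intervals for the intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Estimate the error variance from the residuals, on n - 2 df
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)
    # Standard errors of the two estimators
    se_b0 = math.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / s_xx))
    se_b1 = math.sqrt(sigma2_hat / s_xx)
    # Normal approximation to the quantile (an assumption of this sketch)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return {
        "b0": (b0 - z * se_b0, b0 + z * se_b0),
        "b1": (b1 - z * se_b1, b1 + z * se_b1),
    }


# Hypothetical toy data, roughly y = 2x with small noise
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ci = coefficient_intervals(x, y)
print(ci["b0"], ci["b1"])
```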

Above we examine plots of the data with the regression line and then of the residuals. The model is a good fit to the data: disregarding the outlier, nearly every fitted value is within 0.2 of the observed value, the RSE is 0.1689 on 31 df, and the R-squared value is 0.9975.

Finally, we consider the prediction interval for a temperature value of 1165 given the estimation model we have determined. We form an approximate 95% prediction interval:

The prediction is (8)

The variance is

(9)

Thus the prediction interval for Y is

(10)

And, because Y is the log of the pressure, we consider the prediction interval for the pressure:

(11)
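The steps of the prediction interval can be sketched in Python as below. The function name `prediction_interval` and the toy data are hypothetical and do not reproduce the homework numbers. The sketch uses the standard prediction variance sigma^2 * (1 + 1/n + (x0 - xbar)^2 / Sxx) and a normal quantile, in keeping with an approximate interval; the endpoints on the log scale are then exponentiated to get an interval on the original scale.

```python
import math
from statistics import NormalDist


def prediction_interval(x, y, x0, level=0.95):
    """Return (y_hat, lower, upper) for a new response at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)
    y_hat = b0 + b1 * x0
    # Variance of the prediction error: new noise plus estimation error
    var_pred = sigma2_hat * (1.0 + 1.0 / n + (x0 - x_bar) ** 2 / s_xx)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(var_pred)
    return y_hat, y_hat - half_width, y_hat + half_width


# Hypothetical toy data; y plays the role of a log-scale response
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat, lower, upper = prediction_interval(x, y, x0=6.0)

# Back-transform the endpoints to the original (unlogged) scale
original_scale = (math.exp(lower), math.exp(upper))
print(y_hat, (lower, upper), original_scale)
```

Because the exponential is increasing, transforming the two endpoints preserves the coverage of the interval.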

And because the observed value, 1.922, is outside the 95% prediction interval for the pressure, we may say that it is not a statistically probable outcome.

Problem 3

We define and . We construct a simple linear model

The confidence intervals for the estimated parameters and the prediction interval for the given observation are constructed exactly as in Problem 2, so please refer to that solution.

Problem 4

In this problem there were some misinterpretations and wrong approaches. You should recognize that the linear regression model’s capability goes well beyond accounting for linear relations between certain variables. It has to be viewed more as a framework within which to analyze more complicated relations that can be made linear by means of appropriate transformations. To get into the specifics of the problem, you know (since 3rd grade?) that the volume of a cylinder is proportional to the square of the radius times the height. Is it a linear relation, or close to being linear? Is the simple linear model a good fit to such a relation? No to both. The many who thought yes didn’t think carefully. Second “challenging” question: what suitable transformation could we apply to the data so that the transformed data are linearly related? Alas, in your future, if you ever work with data, pray you get such easy situations! If we have a model like a=b*c^2, to make it linear just take the log, to get log a = log b + 2 log c. With both the diameter and the height as logged predictors, this is a simple multivariate linear model to be analyzed by the usual techniques. The confidence and prediction intervals are always constructed the same way, as discussed in Problem 2.
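To make the log transformation concrete, here is a minimal Python sketch that regresses log volume on log diameter and log height by solving the normal equations. All names (`solve_linear_system`, `fit_log_linear`) and the data are hypothetical: the volumes are generated as exact cylinders, V = (pi/4) * d^2 * h, purely to show that the fitted coefficients come out close to 2 and 1. Real data would include noise and give estimates only near those values.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_log_linear(diameters, heights, volumes):
    """Least squares for log V = b0 + b1 * log d + b2 * log h."""
    rows = [[1.0, math.log(d), math.log(h)]
            for d, h in zip(diameters, heights)]
    targets = [math.log(v) for v in volumes]
    p = 3
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free cylinders
diameters = [1.0, 2.0, 3.0, 4.0, 2.5]
heights = [5.0, 3.0, 8.0, 2.0, 6.0]
volumes = [math.pi / 4.0 * d ** 2 * h for d, h in zip(diameters, heights)]

b0, b1, b2 = fit_log_linear(diameters, heights, volumes)
print(b0, b1, b2)
```

The intercept estimates the log of the proportionality constant, here log(pi/4), while the two slopes are free to differ from 2 and 1 when the objects are not perfect cylinders.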

People tried a bunch of different models, all unacceptable apart from one: constructing a new variable as the product of the square of the diameter and the height, and then regressing the volume on it with a univariate model. It is sensible but less appropriate than the one I proposed. Roughly speaking, the reason is that you provide the model with only one coefficient (instead of 2) to account for the effect of 2 variables. Hence the estimate would be more variable and the prediction less precise. It is no accident that the people who used this model (or any other wrong one) got a very wide prediction interval, as a consequence of the high s.e. of the estimate.

Note that this problem of wide prediction intervals also arose for many in the previous problems. The reason is (most likely) the same: poorly specified models with highly volatile estimates.