Week 3--IES 612-STA 4-573-STA 4-576.doc

IES 612/STA 4-573/STA 4-576 Winter 2009

Checking Model Assumptions (OL 13.4) – an initial visit

RECALL: Basic Model

Yi = 0 +1Xi + i [“simple linear regression”]

i ~ indep. N(0, 2)

[Defn 1] (Raw) Residuals: ei = Yi − Ŷi, i.e., observed response – predicted response

[Defn 2] (Standardized Residuals): ei / s, where s is the root MSE from the fit

[Defn 3] (Studentized Residuals): ei / (s·√(1 − hii)), where hii is the leverage of observation i
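These three flavors of residual can be computed directly from a fitted lm object; a minimal sketch using R's built-in cars data (note that R's rstandard() corresponds to Defn 3, the internally studentized residual):

```r
# residuals for a simple linear regression of stopping distance on speed
fit <- lm(dist ~ speed, data = cars)   # 'cars' ships with R

e.raw <- resid(fit)                    # Defn 1: observed - predicted
s     <- summary(fit)$sigma            # root MSE
e.std <- e.raw / s                     # Defn 2: residual scaled by root MSE
h     <- hatvalues(fit)                # leverages h_ii
e.stu <- e.raw / (s * sqrt(1 - h))     # Defn 3: residual scaled by its SE

# R's rstandard() matches Defn 3; rstudent() additionally re-estimates
# s with observation i held out ("externally" studentized)
all.equal(unname(e.stu), unname(rstandard(fit)))
```

rstudent(fit) is the version usually screened against cutoffs such as 2 or 3.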

Assumption | How do you check the assumption (diagnostics)? | Remediation

1. E(εi) = 0 –> E(Yi) = β0 + β1Xi –> the line is a reasonable model for describing mean change as a function of x
D1.1: Plot ei vs. ŷi
D1.2: Plot ei vs. xi [check to see if a pattern exists]
D1.3: Plot Yi vs. xi and superimpose a plot of Ŷi vs. xi
D1.4: Large R²/significant slope
Remediation: Curvature? Polynomial regression model or nonlinear regression model. Smooth regression? LOWESS. Transformation? Log/square root.

2. V(εi) = σ² –> V(Yi) = σ² –> constant variance –> scatter about the line is the same regardless of the value of x
D2.1: Plot ei vs. ŷi [check to see if you have a constant band about zero]
Remediation: Weighted least squares? Transformation.

3. εi ~ Normal
D3.1: Normal probability plot of ei [see if linear]
D3.2: Histogram of residuals [bell-shaped?]
Remediation: Transformation? Generalized linear models (e.g., logistic/probit regression for dichotomous responses; Poisson regression for count responses).

4. εi independent
D4.1: Generally, examining the design can suggest whether this is true
D4.2: Durbin–Watson test
Remediation: Correlated-error regression models? Time series/spatial methods.

5.* No important omitted variables {relates to pt. 1}
D5.1: Plot ei vs. omitted variables [see if a pattern exists]
Remediation: Add the omitted variable to the model (multiple regression).

6.* No points exerting undue influence
D6.1: Look at statistics that quantify influence (e.g., DFBETAS, DFFITS, etc.)
D6.2: Look for extreme X values (break in stemplots of X)
Remediation: Model-robust fitting procedure (e.g., Least Absolute Value regression).

7.* No extreme outliers impacting inference
D7.1: Large residual (e.g., standardized/studentized residual > 3? > 2?)
D7.2: “Break” in stemplot of residuals
Remediation: Check whether the data sheet is correct – fix? Don’t simply omit. Report the analysis both including and excluding the point?

Example: Manatee Deaths predicted from Number of Boats Registered

Using R and RCommander

Manatee.lmfit = lm(Manatees.Killed ~ Number.Boats, data=Manatee.df)

Regression plot – model adequate? Constant variance?

plot(Manatees.Killed ~ Number.Boats, data=Manatee.df)

abline(Manatee.lmfit)

Residuals plot – model adequate? Constant variance?

Studentized Residuals – outliers? Definition way back when was:______

Normal errors? - Normal quantile-quantile plot

par(mfrow=c(3,2))

plot(Manatee.lmfit, which=1:6)

par(mfrow=c(1,1))

Example when Assumptions go South—Exercise 11.21

Using R and RCommander

Regression plot – model adequate? Constant variance?

# OL 11.21 example

x.speeds <- rep(c(60,80,100,120,140),c(4,4,4,4,4))

y.holes <- scan()

4.6 3.8 4.9 4.5 4.7 5.8 5.5 5.4 5.0 4.5

3.2 4.8 4.1 4.5 4.0 3.8 3.6 3.0 3.5 3.4

OL11p21.df <- data.frame(x.speeds, y.holes)

scatterplot(y.holes~x.speeds, data=OL11p21.df, reg.line=lm, smooth=TRUE,
labels=FALSE, boxplots="xy", span=0.5)

OL11p21.lmfit <- lm(y.holes ~ x.speeds, data=OL11p21.df)

scatter.smooth(predict(OL11p21.lmfit), resid(OL11p21.lmfit), xlab="Fitted Value", ylab="Residual")

abline(h=0)

Residuals plot – model adequate? Constant variance?

Studentized Residuals – outliers? Definition way back when was:______

Normal errors? - Normal quantile-quantile plot

par(mfrow=c(3,2))

plot(OL11p21.lmfit, which=1:6)

par(mfrow=c(1,1))

Using SAS on the Manatee Data

ODS RTF BODYTITLE;

*file='D:\baileraj\Classes\Fall 2003\sta402\SAS-programs\linreg-output.rtf';

proc reg data=manatee_deaths;

title "";

model manatees = nboats / p r cli clm;

symbol1 value=dot color=black ;

symbol2 value=plus color=black ;

plot manatees*nboats p.*nboats / overlay;

plot r.*nboats r.*p.; * residuals vs. x and yhat;

plot r.*nqq.; * normal qqplot;

run;

ODS RTF CLOSE;

Residuals plot – model adequate? Constant variance?

* now in Excel

Studentized Residuals – outliers?

Output Statistics

Obs   Studentized residual (−2 to 2)   Cook's D
 1    |      |      |                   0.017
 2    |      |**    |                   0.178
 3    |      |**    |                   0.149
 4    |    **|      |                   0.091
 5    |      |      |                   0.006
 6    |     *|      |                   0.021
 7    |  ****|      |                   0.244
 8    |      |**    |                   0.073
 9    |      |      |                   0.005
10    |      |      |                   0.015
11    |      |      |                   0.000
12    |      |      |                   0.000
13    |      |*     |                   0.091
14    |      |      |                   0.027

Normal errors? - Normal quantile-quantile plot

EXAMPLE: Quadratic regression and unequal variances

* fit a linear model

* examine residuals

* use quadratic regression (polynomial regression is a form of multiple regression)

* use weighted least squares to deal with non-constant variance

#

# quadratic-reg-var-R.doc

#

data() # list data sets in R

data(Loblolly)

plot(height~age, data=Loblolly)

lob.lmfit <- lm(height~ age + I(age^2), data=Loblolly)

plot(predict(lob.lmfit),resid(lob.lmfit))

abline(h=0)

qqnorm(resid(lob.lmfit))

# get VAR and N for "height" at each AGE

var.ht<- unlist(lapply(split(Loblolly$height,Loblolly$age),var))

var.ht

# 3 5 10 15 20 25

#0.1628951 0.6651654 2.3650951 3.8057940 4.8921824 5.1476071

rep.ht<- unlist(lapply(split(Loblolly$height,Loblolly$age),length))

#rep.ht

# 3 5 10 15 20 25

#14 14 14 14 14 14

Loblolly$wts<- rep(1/var.ht, rep.ht)

# wt=1/var(Ht) at each age

lob.wfit <- lm(height~ age + I(age^2), data=Loblolly, weights=wts)

# wt=1/(predicted HT) with predictions from OLS

lob.wfit2 <- lm(height~ age + I(age^2), data=Loblolly, weights=1/predict(lob.lmfit))

# compare the SE from the different fits

summary(lob.lmfit)

summary(lob.wfit)

summary(lob.wfit2)

plot(height~age, data=Loblolly)

age.span <- seq(from=min(Loblolly$age), to=max(Loblolly$age), length.out=250)

lines(age.span, predict(lob.lmfit,newdata=data.frame(age=age.span)))

lines(age.span, predict(lob.wfit,newdata=data.frame(age=age.span)),lty=2)

lines(age.span, predict(lob.wfit2,newdata=data.frame(age=age.span)),lty=2,col=2)

#

# output excerpts

#

summary(lob.lmfit)

Call:

lm(formula = height ~ age + I(age^2), data = Loblolly)

Residuals:

Min 1Q Median 3Q Max

-3.7902 -1.1496 0.0183 0.9401 4.1815

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -7.607232 0.608017 -12.51 <2e-16 ***

age 3.959044 0.109236 36.24 <2e-16 ***

I(age^2) -0.049838 0.003884 -12.83 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.702 on 81 degrees of freedom

Multiple R-squared: 0.9934, Adjusted R-squared: 0.9932

F-statistic: 6079 on 2 and 81 DF, p-value: < 2.2e-16

summary(lob.wfit)

Call:

lm(formula = height ~ age + I(age^2), data = Loblolly, weights = wts)

Residuals:

Min 1Q Median 3Q Max

-3.44569 -1.33317 -0.52965 0.09271 4.25344

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -7.658170 0.440810 -17.37 <2e-16 ***

age 4.096511 0.083095 49.30 <2e-16 ***

I(age^2) -0.052930 0.003005 -17.62 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.56 on 81 degrees of freedom

Multiple R-squared: 0.9964, Adjusted R-squared: 0.9963

F-statistic: 1.12e+04 on 2 and 81 DF, p-value: < 2.2e-16

summary(lob.wfit2)

Call:

lm(formula = height ~ age + I(age^2), data = Loblolly, weights = 1/predict(lob.lmfit))

Residuals:

Min 1Q Median 3Q Max

-0.59995 -0.18919 -0.02848 0.21584 0.65256

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -7.088373 0.330465 -21.45 <2e-16 ***

age 3.850410 0.089583 42.98 <2e-16 ***

I(age^2) -0.045965 0.003633 -12.65 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2989 on 81 degrees of freedom

Multiple R-squared: 0.9954, Adjusted R-squared: 0.9953

F-statistic: 8839 on 2 and 81 DF, p-value: < 2.2e-16

Multiple Regression (OL Chapter 12)

* More than one predictor variable

Example: Polynomial regression

Yi = β0 + β1Xi + β2Xi² + εi

or

Yi = β0 + β1(Xi − X̄) + β2(Xi − X̄)² + εi

(Q: how does the interpretation of the intercept vary between these 2 models?)
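One way to see the answer is to fit the raw and the mean-centered quadratic side by side; a sketch using the built-in Loblolly data from the example above (assuming the two models in question are the raw and centered quadratics):

```r
# raw vs. mean-centered quadratic in age (Loblolly, as in the example above)
fit.raw <- lm(height ~ age + I(age^2), data = Loblolly)

age.c   <- Loblolly$age - mean(Loblolly$age)
fit.ctr <- lm(height ~ age.c + I(age.c^2), data = Loblolly)

coef(fit.raw)[1]  # intercept = mean height at age 0 (an extrapolation here)
coef(fit.ctr)[1]  # intercept = mean height at the average age

# the two parameterizations trace out exactly the same fitted curve
all.equal(fitted(fit.raw), fitted(fit.ctr))
```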

Example: Indicator variables – e.g., different lines in different groups

Yi = β0 + β1·Igroup2,i + β2·Xi + β3·Xi·Igroup2,i + εi

where Igroup2 = 1 (group 2) and Igroup2 = 0 (group 1)

So,

GROUP 2 INTERCEPT differs from GROUP 1 intercept by β1

GROUP 2 SLOPE differs from GROUP 1 slope by β3
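A simulated sketch of this setup (group sizes, coefficient values, and the seed are made up for illustration):

```r
# two groups sharing an X, with different intercepts and slopes:
# Y = b0 + b1*I2 + b2*X + b3*(X*I2) + error
set.seed(612)
x  <- rep(1:10, times = 2)
I2 <- rep(c(0, 1), each = 10)       # indicator: 0 = group 1, 1 = group 2
y  <- 5 + 2*I2 + 1*x + 0.5*x*I2 + rnorm(20, sd = 0.1)

fit <- lm(y ~ I2 + x + I2:x)
round(coef(fit), 2)
# the I2 coefficient (~2) is the group-2 vs. group-1 INTERCEPT difference;
# the I2:x coefficient (~0.5) is the group-2 vs. group-1 SLOPE difference
```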

Example: Lung function in miners exposed to coal dust (aside: guess the R2 for this analysis. . .)

GENERAL FORM: Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

Comments:

1. “LINEAR” model because the regression coefficients enter the model in a linear way – compare with a model such as Y = β0·exp(β1X) + ε, which is nonlinear in β1

So, how does a multiple regression model (MR) differ from simple linear regression (SLR)?

i. SLR is the equation of a LINE; MR is the equation of a (hyper-)PLANE

ii. β0 is the mean response when X = 0 in SLR, while β0 is the mean response when ALL X’s = 0 in MR

iii. 2 regression coefficients in SLR; k+1 regression coefficients in MR

iv. interpretation of coefficients? Partial coefficients in MR

v. Model scope (space covered by the X’s)

Estimating regression coefficients

Least squares – minimize: Σi (Yi − β0 − β1X1i − … − βkXki)²

Estimate of σ²: σ̂² = MS(Residual) = SS(Residual)/(n − k − 1)

F Test of any relationship between Y and the set of predictor variables

Ho: β1 = β2 = … = βk = 0 [i.e., none of the X’s is linearly associated with Y]

HA: at least one βi ≠ 0

TS: Fobs = MS(Regression)/MS(Residual)

RR: Reject H0 if Fobs > Fα, k, n−k−1 or p-value < α

Conclusions

(Partial) Test of βj

Ho: βj = 0 [i.e., Xj adds nothing given the other X’s in the model]
HA: βj ≠ 0 [some assoc.] / HA: βj < 0 [negative assoc.] / HA: βj > 0 [positive assoc.]
TS: tobs = β̂j / SE(β̂j)
RR: Reject H0 if / |tobs| > tα/2, n−k−1 / tobs < −tα, n−k−1 / tobs > tα, n−k−1
P-value: / 2·P(tn−k−1 > |tobs|) / P(tn−k−1 < tobs) / P(tn−k−1 > tobs)
Conclusions:

where SE(β̂j)² = MSE · VIFj / Σi(Xij − X̄j)², with Rj² the R² from regressing Xj on all of the other predictors

==> Rj² here is the % of variation in one predictor variable accounted for by all of the other predictors

==> VIFj = 1/(1 − Rj²) is a diagnostic of collinearity (a maximum > 10 is a concern – Neter et al.)
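The VIF definition can be verified by hand: regress one predictor on the rest and plug that R² into 1/(1 − R²). A sketch using R's built-in mtcars data (car::vif() on the full model should report the same value):

```r
# VIF for 'disp' in a model whose predictors are disp, hp, and wt
r2.disp  <- summary(lm(disp ~ hp + wt, data = mtcars))$r.squared
vif.disp <- 1 / (1 - r2.disp)
vif.disp   # values above ~10 flag serious collinearity (Neter et al.)
```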

Testing a subset of the predictors

Model: y = 0 +  1X1+ 2X2 + … +gXg + g+1Xg+1+g+2Xg+2 + … +kXk +

Ho: g+1 = g+2 = … = k = 0 [implies only need “g” of the “k” predictor variables]

HA: not H0

TS:

Two models (i.e., multiple regression models) are fit.

The COMPLETE model is the model with ALL “k” predictor/independent/X variables in it.

The REDUCED model is the multiple regression model with only the first “g” of the X’s in it.

For example, suppose our Complete/Full model has 5 independent variables, X1, X2, X3, X4, and X5 in it.

So COMPLETE MODEL = y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + ε

Suppose we wish to test whether X2, X3, and X5 can be dropped from model. Our hypotheses become:

Ho: 2= 3= 5= 0

HA: at least one of 2,3,or 5is not zero

If Ho is true, how does our model change? IE, does anything “drop out” of our COMPLETE MODEL?

So the REDUCED MODEL = ______

Hence we fit two models, one with all five X’s in the model and another with only X1 and X4.

Our test then compares the Residual Sums of Squares for the two models. The thought is that if the variables of interest ARE NOT IMPORTANT, the two Residual Sums of Squares should be approximately equal; if any of the variables IS IMPORTANT, the Residual Sum of Squares for the COMPLETE model should be much smaller than SS(Resid, Reduced).
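In R, this complete-versus-reduced comparison is what anova() performs when given two nested fits; a sketch on the built-in mtcars data (both fits must use the same observations):

```r
complete <- lm(mpg ~ wt + hp + disp, data = mtcars)
reduced  <- lm(mpg ~ wt,             data = mtcars)  # H0: beta_hp = beta_disp = 0

# F = [(SS(Resid, Reduced) - SS(Resid, Complete)) / 2] / MS(Resid, Complete)
anova(reduced, complete)
```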

Example: Life Expectancy across different countries

Using R

* put “country-csv.csv” in the LECTURES folder link on my home page

# using Rcmdr

country <- read.table("C:/Users/baileraj/BAILERAJ/Classes/Web-CLASSES/ies-612/lectures/country-csv.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)

showData(country, placement='-20+200', font=getRcmdr('logFont'), maxwidth=80, maxheight=30)

fix(country) # to fix some of the variable types

names(country)

country$log10gnp <- log10(country$pcgnp)

Country = read.csv(file("C:\\MyDocs\\Class\\1 Winter 2008\\IES 612\\Data\\Bailer Data Sets\\Country.csv"), header=TRUE)

scatterplot.matrix(~lifewom+lifemen+liter+pcturban+area+popnsize+pcgnp, reg.line=lm, smooth=TRUE, span=0.5, diagonal = 'density', data=Country)

Country$logarea = log10(Country$area)

Country$logpopn = log10(Country$popnsize)

Country$loggnp = log10(Country$pcgnp)

Country$ienglish = as.numeric(Country$lang=="English")

scatterplot.matrix(~lifewom+lifemen+liter+pcturban+logarea+logpopn+loggnp,reg.line=lm, smooth=TRUE, span=0.5, diagonal = 'density', data=Country)

SLR of Life Expectancy of Women versus LogGNP

Reg.1 = lm(lifewom ~ loggnp, data=Country)

summary(Reg.1)

Call:

lm(formula = lifewom ~ loggnp, data = Country)

Residuals:

Min 1Q Median 3Q Max

-11.39129 -3.61693 -0.06252 4.03316 14.61683

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 19.425 3.778 5.142 2.06e-06 ***

loggnp 14.834 1.216 12.204 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.673 on 76 degrees of freedom

(1 observation deleted due to missingness)

Multiple R-Squared: 0.6621, Adjusted R-squared: 0.6577

F-statistic: 148.9 on 1 and 76 DF, p-value: < 2.2e-16

par(mfrow=c(3,2))

plot(Reg.1, which=1:6)

par(mfrow=c(1,1))

Have we missed an important variable?

plot(Country$liter, residuals(Reg.1))

In what form should the variable be included in the model?

scatterplot(Country$liter, residuals(Reg.1), smooth=TRUE)

How about both variables in the model?

Reg.2 = lm( lifewom ~ liter, data=Country)

summary(Reg.2)

Call:

lm(formula = lifewom ~ liter, data = Country)

Residuals:

Min 1Q Median 3Q Max

-18.979 -2.834 1.527 3.852 10.925

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 41.85909 1.97590 21.18 <2e-16 ***

liter 0.32529 0.02664 12.21 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.614 on 75 degrees of freedom

(2 observations deleted due to missingness)

Multiple R-Squared: 0.6654, Adjusted R-squared: 0.6609

F-statistic: 149.1 on 1 and 75 DF, p-value: < 2.2e-16

Reg.3 = lm( lifewom ~ liter + loggnp, data=Country)

summary(Reg.3)

Call:

lm(formula = lifewom ~ liter + loggnp, data = Country)

Residuals:

Min 1Q Median 3Q Max

-10.8597 -2.6123 0.9206 3.1004 10.3162

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 23.51270 2.96162 7.939 1.69e-11 ***

liter 0.20117 0.02678 7.513 1.07e-10 ***

loggnp 8.86394 1.22709 7.224 3.76e-10 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.328 on 74 degrees of freedom

(2 observations deleted due to missingness)

Multiple R-Squared: 0.8038, Adjusted R-squared: 0.7984

F-statistic: 151.5 on 2 and 74 DF, p-value: < 2.2e-16

How about all of the variables in the model?

Reg.4 = lm( lifewom ~ pcturban + liter + logarea + logpopn + loggnp, data=Country)

summary(Reg.4)

Call:

lm(formula = lifewom ~ pcturban + liter + logarea + logpopn +

loggnp, data = Country)

Residuals:

Min 1Q Median 3Q Max

-11.489 -2.379 0.605 3.008 11.165

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 27.79999 4.53708 6.127 7.13e-08 ***

pcturban 0.02241 0.03757 0.597 0.553

liter 0.19211 0.03180 6.042 9.95e-08 ***

logarea -0.41442 0.93342 -0.444 0.659

logpopn -0.26259 1.06069 -0.248 0.805

loggnp 7.73888 1.81985 4.252 7.38e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.512 on 61 degrees of freedom

(12 observations deleted due to missingness)

Multiple R-Squared: 0.7827, Adjusted R-squared: 0.7649

F-statistic: 43.96 on 5 and 61 DF, p-value: < 2.2e-16

Conclusions?

And a check of assumptions! What can we conclude?

par(mfrow=c(3,2))

plot(Reg.4, which=1:6)

par(mfrow=c(1,1))

Testing a Subset of the Predictors

Let’s test whether we could drop/eliminate the following three predictors: pcturban, logarea, and logpopn.

Our hypotheses become:

Ho: pcturban= logarea= logpopn= 0

HA: at least one of these ’s is not zero

So COMPLETE MODEL is

lifewom = 0 + pcturbanpcturban+liter liter+logarea logarea+ logpoplogpop +loggnp loggnp + 

and our REDUCED MODEL is

lifewom = 0 + liter liter+loggnp loggnp+ 

Thus we fit two regression models, obtaining:

Complete = lm( lifewom ~ pcturban + liter + logarea + logpopn + loggnp, data=Country)

Reduced = lm( lifewom ~ liter + loggnp, data=Country)

anova(Complete)

Analysis of Variance Table

Response: lifewom

Df Sum Sq Mean Sq F value Pr(>F)

pcturban 1 2653.84 2653.84 130.3680 < 2.2e-16 ***

liter 1 1404.05 1404.05 68.9731 1.316e-11 ***

logarea 1 45.08 45.08 2.2145 0.1419

logpopn 1 2.80 2.80 0.1374 0.7121

loggnp 1 368.12 368.12 18.0837 7.379e-05 ***

Residuals 61 1241.75 20.36

anova(Reduced)

Analysis of Variance Table

Response: lifewom

Df Sum Sq Mean Sq F value Pr(>F)

liter 1 4700.5 4700.5 250.89 < 2.2e-16 ***

loggnp 1 977.6 977.6 52.18 3.756e-10 ***

Residuals 74 1386.4 18.7
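Mechanically, plugging the printed residual sums of squares into the subset-F formula gives the following; note, however, that the two fits above were made on different observations (74 vs. 61 error df) because of missing data, so a clean test should refit both models on the common complete cases and use anova(Reduced, Complete):

```r
# illustration only: the two printed fits do not use the same observations
sse.reduced  <- 1386.4    # Residual SS from the reduced-model ANOVA table
sse.complete <- 1241.75   # Residual SS from the complete-model ANOVA table
mse.complete <- 20.36     # Residual MS from the complete-model ANOVA table

F.obs <- ((sse.reduced - sse.complete) / 3) / mse.complete  # 3 dropped predictors
round(F.obs, 2)   # about 2.37
```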

And we conclude:

Multicollinearity

Reg.3 = lm( lifewom ~ liter + loggnp, data=Country)

summary(Reg.3)

Call:

lm(formula = lifewom ~ liter + loggnp, data = Country)

Residuals:

Min 1Q Median 3Q Max

-10.8597 -2.6123 0.9206 3.1004 10.3162

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 23.51270 2.96162 7.939 1.69e-11 ***

liter 0.20117 0.02678 7.513 1.07e-10 ***

loggnp 8.86394 1.22709 7.224 3.76e-10 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.328 on 74 degrees of freedom

(2 observations deleted due to missingness)

Multiple R-Squared: 0.8038, Adjusted R-squared: 0.7984

F-statistic: 151.5 on 2 and 74 DF, p-value: < 2.2e-16

Assumptions?

par(mfrow=c(3,2))

plot(Reg.3, which=1:6)

par(mfrow=c(1,1))

How to document/identify multicollinearity? Variance Inflation Factors (VIFs)

vif(Reg.3)

liter loggnp

1.700008 1.700008

What about Categorical Independent Variables? Indicator/Dummy Variables!

Reg.5 = lm(lifewom ~ liter + logarea + loggnp + logpopn + pcturban, data = Country)

plot(fitted(Reg.5), residuals(Reg.5), pch=0+17*Country$ienglish, col=10+2*Country$ienglish)

plot(Country$liter, residuals(Reg.5), pch=0+17*Country$ienglish, col=10+2*Country$ienglish)

plot(Country$loggnp, residuals(Reg.5), pch=0+17*Country$ienglish, col=10+2*Country$ienglish)

What would happen if we added the “English” variable alone?

RegModel.1 <- lm(lifewom~ienglish+liter+logarea+loggnp+logpopn+pcturban, data=Country)

summary(RegModel.1)

Call:

lm(formula = lifewom ~ ienglish + liter + logarea + loggnp +

logpopn + pcturban, data = Country)

...

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 28.59519 4.58585 6.236 4.95e-08 ***

ienglish 1.79633 1.62689 1.104 0.273938

liter 0.19383 0.03178 6.099 8.38e-08 ***

logarea -0.60299 0.94728 -0.637 0.526836

loggnp 7.16271 1.89005 3.790 0.000352 ***

logpopn 0.09585 1.10744 0.087 0.931317

pcturban 0.03295 0.03869 0.851 0.397911

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.504 on 60 degrees of freedom

(12 observations deleted due to missingness)

Multiple R-Squared: 0.7871, Adjusted R-squared: 0.7658

F-statistic: 36.96 on 6 and 60 DF, p-value: < 2.2e-16

What would happen if we added the “English” variable and its interaction with logGNP?

RegModel.2 <- lm(lifewom~ienglish+liter+logarea+loggnp+logpopn+pcturban+ienglish:loggnp, data=Country)

summary(RegModel.2)

Call:

lm(formula = lifewom ~ ienglish + liter + logarea + loggnp +

logpopn + pcturban + ienglish:loggnp, data = Country)

...

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 26.23133 4.70142 5.579 6.37e-07 ***

ienglish 19.83624 10.33751 1.919 0.059844 .

liter 0.19816 0.03133 6.325 3.71e-08 ***

logarea -0.92091 0.94821 -0.971 0.335411

loggnp 7.69369 1.88168 4.089 0.000133 ***

logpopn 0.75117 1.14987 0.653 0.516126

pcturban 0.03913 0.03819 1.025 0.309677

ienglish:loggnp -5.60205 3.17155 -1.766 0.082511 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.426 on 59 degrees of freedom

(12 observations deleted due to missingness)

Multiple R-Squared: 0.7978, Adjusted R-squared: 0.7738

F-statistic: 33.25 on 7 and 59 DF, p-value: < 2.2e-16

Using SAS

data country;

title 'country data analysis';

infile "\\Casnov5\MST\MSTLab\Baileraj\country.data"; * reads a data file;

input name $ area popnsize pcturban lang $ liter lifemen

lifewom pcGNP;

logarea = log10(area);

logpopn = log10(popnsize);

loggnp = log10(pcGNP);

ienglish = (lang="English");

drop area popnsize pcgnp;

proc print;

run;

/*

to generate a scatterplot matrix

Solutions > Analysis > Interactive Data Analysis

- open data set WORK > COUNTRY

- select columns (CTRL and click column labels)

- Analyze > Scatter Plot (YX)

to generate regression fit via this interactive data analysis

Analyze > Fit

*/

ods html;

proc reg data=country;

title "predicting life expectancy of women in different countries";

model lifewom = loggnp;

output out=new1 p=yhat r=resid;

run;

proc plot data=new1 hpercent=50 vpercent=75;

title "residual plots for LIFEWOM = LOGGNP model";

plot resid*(yhat liter);

run;

proc reg data=country;

title "LITER and LOGGNP as predictors of Life expectancy of women";

model lifewom = liter;

model lifewom = loggnp;

model lifewom = liter loggnp;

run;

proc reg;

title "LIFEWOM predicted from PCTURBAN LITER LOGAREA LOGPOPN LOGGNP";

model lifewom = pcturban liter logarea logpopn loggnp;

plot r.*p. nqq.*r.;

run;

proc reg data=country;

title "LITER and LOGGNP as predictors of Life expectancy of women";

model lifewom = liter loggnp/ tol vif collinoint;

output out=new p=yhat r=resid;

run;

proc univariate data=new plot;

id name;

var resid;

run;

proc plot hpercent=50 vpercent=50;

plot resid*yhat=ienglish resid*liter=ienglish resid*loggnp=ienglish;

run;

ods html close;

predicting life expectancy of women in different countries

The REG Procedure

Model: MODEL1

Dependent Variable: lifewom

Number of Observations Read                79
Number of Observations Used                78
Number of Observations with Missing Values  1

Analysis of Variance

Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              1        4793.33759     4793.33759     148.93    <.0001
Error             76        2446.11113       32.18567
Corrected Total   77        7239.44872

Root MSE           5.67324    R-Square    0.6621
Dependent Mean    64.85897    Adj R-Sq    0.6577
Coeff Var          8.74704

Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    1              19.42550           3.77797       5.14      <.0001
loggnp       1              14.83433           1.21557      12.20      <.0001
residual plots for LIFEWOM = LOGGNP model
[proc plot output: residuals vs. predicted value of lifewom (left panel) and residuals vs. liter (right panel), residual scale −15 to 15. NOTE: 1 obs had missing values (left panel); 2 obs had missing values (right panel).]
LITER and LOGGNP as predictors of Life expectancy of women

The REG Procedure

Model: MODEL1

Dependent Variable: lifewom

Number of Observations Read                79
Number of Observations Used                77
Number of Observations with Missing Values  2

Analysis of Variance

Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              1        4700.51263     4700.51263     149.13    <.0001
Error             75        2364.00685       31.52009
Corrected Total   76        7064.51948

Root MSE           5.61428    R-Square    0.6654
Dependent Mean    64.68831    Adj R-Sq    0.6609
Coeff Var          8.67896

Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    1              41.85909           1.97590      21.18      <.0001
liter        1               0.32529           0.02664      12.21      <.0001
LITER and LOGGNP as predictors of Life expectancy of women

The REG Procedure

Model: MODEL2

Dependent Variable: lifewom

Number of Observations Read                79
Number of Observations Used                77
Number of Observations with Missing Values  2

Analysis of Variance

Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              1        4620.58250     4620.58250     141.80    <.0001
Error             75        2443.93698       32.58583
Corrected Total   76        7064.51948

Root MSE           5.70840    R-Square    0.6541
Dependent Mean    64.68831    Adj R-Sq    0.6494
Coeff Var          8.82447

Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    1              19.57316           3.84413       5.09      <.0001
loggnp       1              14.77981           1.24118      11.91      <.0001
LITER and LOGGNP as predictors of Life expectancy of women

The REG Procedure