Stat 521: Notes 11.
Panel Data: Random Effects
Reading: Wooldridge, Chapter 10.1-10.3
I. Panel Data
Panel, or longitudinal, data sets consist of repeated observations on the same units: firms, individuals, or other economic agents. Typically the observations are at different points in time. Let $Y_{it}$ denote the outcome for unit i in period t and $X_{it}$ a vector of explanatory variables. The index i denotes the unit and runs from 1 to N, and the index t denotes time and runs from 1 to T. Typically T is relatively small (as small as two) and N is relatively large. As a result, when we approximate sampling distributions for estimators, we typically use large-N asymptotics, keeping T fixed. (Recently there has been some work on asymptotic approximations in which T grows with N.)
Example 1: The Panel Study of Income Dynamics (PSID), begun in 1968, is a longitudinal study of a representative sample of U.S. individuals (men, women, and children) and the family units in which they reside. It emphasizes the dynamic aspects of economic and demographic behavior, but its content is broad, including sociological and psychological measures. One analysis we could do with the PSID data is to let $Y_{it}$ be the earnings of the head of household i in year t and to let $X_{it}$ consist of education and other covariates.
Example 2: Bouis and Haddad (1990, Agricultural Commercialization, Nutrition and the Rural Poor) conducted a panel study of the effect of agricultural commercialization on rural households in the Philippines. They surveyed 406 households at four time points (each four months apart). We will consider estimating the effect of increased income on food expenditures. $Y_{it}$ is household i's food expenditures per capita per week in pesos. $X_{it}$ consists of logincome = log(income in pesos per capita per week), mothed = years of formal education of the mother, fathed = years of formal education of the father, mothage = age of mother in months, fathage = age of father in months, nutrsc1 = measure of nutritional knowledge of the mother, qrpcorn = quality-adjusted real price of corn, qrprice = quality-adjusted real price of rice, pcthomes = percentage of food expenditures coming from own-farm production, popden = population density of the municipality, adeqvhh = number of household members expressed in adult-equivalents, rd1 = dummy variable for round 1, rd2 = dummy variable for round 2, and rd3 = dummy variable for round 3. The data are in incomedataset.raw.
# Bouis-Haddad panel data set
philipptable=read.table("incomedataset.raw",col.names=c("hhid","ffexwkpc","lny","mothed","fathed","mothage","fathage","nutrsc1","qrpcorn","qrprice","pcthomes","popden","adeqvhh","rd1","rd2","rd3","cultarpc","avnetwth","roofdummy","croptype","i_empmin"));
y=philipptable$ffexwkpc;
logincome=philipptable$lny;
mothed=philipptable$mothed;
fathed=philipptable$fathed;
mothage=philipptable$mothage;
fathage=philipptable$fathage;
nutrsc1=philipptable$nutrsc1;
qrprice=philipptable$qrprice;
qrpcorn=philipptable$qrpcorn;
pcthomes=philipptable$pcthomes;
popden=philipptable$popden;
adeqvhh=philipptable$adeqvhh;
rd1=philipptable$rd1;
rd2=philipptable$rd2;
rd3=philipptable$rd3;
# First four observations are from first household, second four are from
# second household... Construct household id
hhid=rep(1:406,each=4)
Here we will look mainly at balanced panels, where each unit is observed for the same number of time points T. An unbalanced panel has potentially different numbers of observations for each unit. This may arise because of units dropping out of the sample, or more directly for firms going out of business.
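Whether a panel is balanced can be checked by counting the observations per unit. The following is a minimal sketch with a made-up identifier vector (the vector id and its values are hypothetical, not taken from the data sets above):

```r
# Hypothetical unit identifiers: unit 3 drops out after two periods
id <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3)

# Number of observations for each unit
obs.per.unit <- table(id)

# The panel is balanced if every unit has the same number of observations
is.balanced <- length(unique(as.vector(obs.per.unit))) == 1
is.balanced
```

Applied to the household identifier of Example 2, the same check should report a balanced panel, since each household is observed in all four rounds.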
The key issue with panel data is that $Y_{it}$ and $Y_{is}$ tend to be correlated even conditional on the covariates $X_{it}$ and $X_{is}$. Let us look at this in a linear model setting. Let $c_i$ be an unobservable individual effect for unit i that represents time-invariant characteristics of unit i, e.g.,
- In Example 1, $c_i$ could summarize the ith person's ability.
- In Example 2, $c_i$ could summarize the household's preference for food versus leisure in the Philippine data set.
Consider the following model:
$$Y_{it} = X_{it}'\beta + c_i + \varepsilon_{it} \qquad (1.1)$$
The presence of $c_i$, the unobserved individual effect, creates the correlation between $Y_{it}$ and $Y_{is}$ conditional on $X_{it}$ and $X_{is}$.
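A small simulation can illustrate how a time-invariant individual effect induces within-unit correlation. This is only a sketch under assumed values: two periods, a scalar covariate with coefficient 1, an individual effect with variance 4, and a shock with variance 1. All object names are hypothetical.

```r
set.seed(1)
N <- 2000          # number of units
T.periods <- 2     # number of periods (named T.periods to avoid masking TRUE)

c.i <- rnorm(N, sd = 2)                             # individual effect, variance 4
x   <- matrix(rnorm(N * T.periods), N, T.periods)   # covariate, one column per period
eps <- matrix(rnorm(N * T.periods), N, T.periods)   # time-varying shock, variance 1
beta <- 1

# c.i has length N = nrow(x), so it is recycled down each column:
# row i of y receives the same c.i[i] in both periods
y <- beta * x + c.i + eps

# Remove the covariate part; what remains is c_i + eps_it
nu <- y - beta * x
within.cor <- cor(nu[, 1], nu[, 2])
within.cor
```

Under these assumed variances the population correlation between the two periods is 4/(4+1) = 0.8, so the sample correlation should be close to that value even though the shocks themselves are independent across periods.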
The two main approaches to dealing with panel data with unobserved individual effects are fixed effects and random effects. These effects are not fixed or random. They are both unobserved, individual-specific, time invariant quantities that can be viewed as random variables. The key distinction is whether we allow for dependence between the individual effects and the observed covariates (typically without modeling this dependence parametrically) [fixed effects] or are willing to assume independence of the unobserved components and the observed covariates [random effects]. Better labels would therefore be correlated and uncorrelated random effects, but the labels fixed and random effects are widely used by this point, and we will follow the literature in doing so.
For the random effects model, we make the following assumptions:
- Assumption 1 (Independence Between Units): The vectors of individual outcomes $(Y_{i1}, \ldots, Y_{iT})$ and $(Y_{j1}, \ldots, Y_{jT})$ are independent for $i \neq j$.
- Assumption 2 (Strict Exogeneity): $E[\varepsilon_{it} \mid X_{i1}, \ldots, X_{iT}, c_i] = 0$. $\varepsilon_{it}$ can be thought of as a time-varying shock that is independent of the unobserved individual characteristic $c_i$ and the observed characteristics $X_{i1}, \ldots, X_{iT}$.
- Assumption 3 (Uncorrelated Effects): $E[c_i \mid X_{i1}, \ldots, X_{iT}] = 0$. This is the random effects assumption. In Example 1, this would mean ability is uncorrelated with education. In Example 2, this would mean a household's preference for food versus leisure is uncorrelated with the household income.
II. OLS Estimation of Panel Data Model
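Given Assumptions 1-3, we can write
$$Y_{it} = X_{it}'\beta + \nu_{it} \qquad (1.2)$$
with $\nu_{it} = c_i + \varepsilon_{it}$ uncorrelated with the covariates $X_{it}$. Thus,
$$E[Y_{it} \mid X_{it}] = X_{it}'\beta$$
and we can use ordinary least squares (OLS) to estimate $\beta$:
$$\hat\beta_{OLS} = \left(\sum_{i=1}^N \sum_{t=1}^T X_{it} X_{it}'\right)^{-1} \left(\sum_{i=1}^N \sum_{t=1}^T X_{it} Y_{it}\right).$$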
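The usual OLS standard errors are the square roots of the diagonal elements of
$$\hat\sigma_\nu^2 \left(\sum_{i=1}^N X_i' X_i\right)^{-1}, \qquad \hat\sigma_\nu^2 = \frac{1}{NT-K} \sum_{i=1}^N \sum_{t=1}^T \left(Y_{it} - X_{it}'\hat\beta_{OLS}\right)^2,$$
where $X_i$ is the T x K matrix with tth row equal to $X_{it}'$. However, these standard errors may not be valid for panel data because they assume the observations are independent, whereas for panel data we expect $\nu_{it}$ and $\nu_{is}$ to be correlated.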
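We can get a robust variance estimate for panel data by first estimating the residuals as
$$\hat\nu_{it} = Y_{it} - X_{it}'\hat\beta_{OLS}.$$
The robust variance is then
$$\left(\sum_{i=1}^N X_i' X_i\right)^{-1} \left(\sum_{i=1}^N X_i' \hat\nu_i \hat\nu_i' X_i\right) \left(\sum_{i=1}^N X_i' X_i\right)^{-1},$$
where $\hat\nu_i$ is the T-vector with tth element equal to $\hat\nu_{it}$.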
The following R function that I wrote computes the robust variance for panel data
# Robust variance estimator for OLS estimate for panel data
# y is (NT)x1, X is (NT)xK (to include an intercept in the model, include a
# column of 1's in X), and i identifies the unit that each row belongs to;
# units should be labeled 1,...,N
robustvar.panel=function(y,X,i){
  # Number of units
  nounits=length(unique(i));
  # OLS estimate; X already holds any intercept column, so suppress lm's own
  olsmodel=lm(y~X-1);
  nuhat=resid(olsmodel);
  # outmat.inv is sum_i X_i'*X_i, where X_i is the TxK matrix with tth row equal to
  # X_{it}', and innermat is sum_i X_i'*nuhat_i*nuhat_i'*X_i, where nuhat_i is the
  # T-vector with tth element equal to nuhat_{it}
  outmat.inv=matrix(rep(0,ncol(X)^2),ncol=ncol(X));
  innermat=matrix(rep(0,ncol(X)^2),ncol=ncol(X));
  for(j in 1:nounits){
    # drop=FALSE keeps Xi a matrix even if a unit has a single row
    Xi=X[(i==j),,drop=FALSE];
    nuhati=matrix(nuhat[(i==j)],ncol=1);
    outmat.inv=outmat.inv+t(Xi)%*%Xi;
    innermat=innermat+t(Xi)%*%nuhati%*%t(nuhati)%*%Xi;
  }
  # Sandwich formula: (sum X_i'X_i)^{-1} * innermat * (sum X_i'X_i)^{-1}
  varbetahat.robust=solve(outmat.inv)%*%innermat%*%solve(outmat.inv);
  varbetahat.robust;
}
Let’s consider the Philippine panel data set
X=cbind(rep(1,length(y)),logincome,mothed,fathed,mothage,fathage,nutrsc1,qrprice,qrpcorn,pcthomes,popden,adeqvhh,rd1,rd2,rd3);
olsmodel=lm(y~X-1);
betahat=coef(olsmodel);
var.ols=vcov(olsmodel);
robust.var=robustvar.panel(y,X,i=hhid);
betahat;
# Display usual OLS se’s along side robust se’s
cbind(sqrt(diag(var.ols)),sqrt(diag(robust.var)))
> betahat;
          X  Xlogincome     Xmothed     Xfathed    Xmothage    Xfathage
 8.08299204  8.78794478  1.06674329  0.31557837 -0.06834061  0.21702956
   Xnutrsc1    Xqrprice    Xqrpcorn   Xpcthomes     Xpopden    Xadeqvhh
 0.14172593 -3.38843516  0.40391388 -0.03751340 -0.01210908 -1.11486530
       Xrd1        Xrd2        Xrd3
 3.91973410  0.13772817  0.82893056
> cbind(sqrt(diag(var.ols)),sqrt(diag(robust.var)))
[,1] [,2]
X 4.262328429 6.52419270
Xlogincome 0.552353048 1.09706314
Xmothed 0.156862226 0.25027717
Xfathed 0.142460577 0.23611892
Xmothage 0.084597052 0.15916786
Xfathage 0.067365367 0.11508164
Xnutrsc1 0.117224524 0.17719815
Xqrprice 0.475843313 0.70413142
Xqrpcorn 0.629908138 0.71851590
Xpcthomes 0.014026902 0.01776399
Xpopden 0.007938146 0.01081144
Xadeqvhh 0.226429059 0.42094063
Xrd1 0.975265171 0.74518875
Xrd2 1.092749397 0.88230356
Xrd3 1.066419510 0.77339920
The coefficient on logincome, $\beta_2$, is of particular interest. The percentage change in food expenditure associated with a 1% increase in income, for a household with the mean level of food expenditure $\bar{Y}$, is the following:
$$\eta = 100 \cdot \frac{\beta_2 \log(1.01)}{\bar{Y}}.$$
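The reasoning behind this quantity can be sketched as follows, writing $\beta_2$ for the coefficient on logincome and $\bar{Y}$ for the sample mean of food expenditure (notation assumed here to match the R code in these notes). Raising income by 1% while holding the other covariates fixed changes the expected outcome by

```latex
\begin{align*}
\Delta Y &= \beta_2 \left[\log(1.01 \cdot \text{income}) - \log(\text{income})\right]
          = \beta_2 \log(1.01), \\
\eta &= 100 \cdot \frac{\Delta Y}{\bar{Y}}
      = 100 \cdot \frac{\beta_2 \log(1.01)}{\bar{Y}}
      \approx 0.995 \cdot \frac{\beta_2}{\bar{Y}} .
\end{align*}
```

Since $\log(1.01) \approx 0.00995$, the quantity is very nearly $\beta_2 / \bar{Y}$. Because it is linear in $\beta_2$, a confidence interval for it is obtained by applying the same transformation to the endpoints of the interval for $\beta_2$.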
Here are the 95% confidence intervals for $\eta$ using the usual OLS se's and the robust se's:
> # Confidence interval for eta using usual OLS SE's
> lci.betahat.logincome.ols=betahat[2]-1.96*sqrt(var.ols[2,2]);
> uci.betahat.logincome.ols=betahat[2]+1.96*sqrt(var.ols[2,2]);
> lci.eta.ols=100*lci.betahat.logincome.ols*log(1.01)/mean(y);
> uci.eta.ols=100*uci.betahat.logincome.ols*log(1.01)/mean(y);
> lci.eta.ols
Xlogincome
0.246192
> uci.eta.ols
Xlogincome
0.3153728
> lci.betahat.logincome.robust=betahat[2]-1.96*sqrt(robust.var[2,2]);
> uci.betahat.logincome.robust=betahat[2]+1.96*sqrt(robust.var[2,2]);
> lci.eta.robust=100*lci.betahat.logincome.robust*log(1.01)/mean(y);
> uci.eta.robust=100*uci.betahat.logincome.robust*log(1.01)/mean(y);
> lci.eta.robust
Xlogincome
0.2120803
> uci.eta.robust
Xlogincome
0.3494845
Using the robust se's widens the confidence interval substantially, from (0.246, 0.315) to (0.212, 0.349).
Note: These robust se’s also provide correct se’s for clustered data (section 6.3.4 of Wooldridge). In clustered data, the data can be divided into clusters, where observations within a cluster can be dependent, but observations in different clusters are independent. An example is data on student achievement from different schools; the students within a school form a cluster. Panel data can be thought of as a special case of clustered data in which the clusters are individual units.
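The cluster-robust sandwich formula can also be computed without an explicit loop over clusters. The sketch below uses simulated data (all names and parameter values are hypothetical) and base R only: rowsum() adds up the rows of X weighted by the residuals within each cluster, which gives the vectors $X_i'\hat\nu_i$ directly. The loop version is included only to check that the two computations agree.

```r
set.seed(2)
N <- 200; T.periods <- 4
id  <- rep(1:N, each = T.periods)                     # cluster (unit) identifier
c.i <- rep(rnorm(N), each = T.periods)                # cluster-level effect
x   <- rnorm(N * T.periods) + rep(rnorm(N), each = T.periods)
y   <- 1 + 2 * x + c.i + rnorm(N * T.periods)
X   <- cbind(1, x)                                    # intercept column included by hand

fit   <- lm(y ~ X - 1)
nuhat <- resid(fit)

# "Bread": (sum_i X_i'X_i)^{-1}
bread <- solve(crossprod(X))

# Row r of X * nuhat is X_r * nuhat_r; summing rows within each cluster
# gives an N x K matrix whose ith row is (X_i' nuhat_i)'
scores <- rowsum(X * nuhat, group = id)
meat   <- crossprod(scores)                           # sum_i X_i' nuhat_i nuhat_i' X_i

V.robust <- bread %*% meat %*% bread

# Same "meat" computed with an explicit loop over clusters, as a check
meat.loop <- matrix(0, ncol(X), ncol(X))
for (j in unique(id)) {
  Xj <- X[id == j, , drop = FALSE]
  nj <- nuhat[id == j]
  meat.loop <- meat.loop + t(Xj) %*% nj %*% t(nj) %*% Xj
}

# Usual OLS se's next to cluster-robust se's
cbind(sqrt(diag(vcov(fit))), sqrt(diag(V.robust)))
```

No finite-sample degrees-of-freedom correction is applied here, to match the formula in these notes; packaged implementations often rescale the result by a factor close to one.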
III. Random Effects
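To exploit some of the random effects structure, define $\Omega = E[\nu_i \nu_i']$, where $\nu_i = (\nu_{i1}, \ldots, \nu_{iT})'$ is the vector of all residuals for unit i.
We consider an assumption on the random effects structure:
$$\Omega = \sigma_\varepsilon^2 I_T + \sigma_c^2 \iota_T \iota_T',$$
where $I_T$ is the T x T identity matrix and $\iota_T$ is a T-vector of ones, so that $\Omega$ has $\sigma_c^2 + \sigma_\varepsilon^2$ on the diagonal and $\sigma_c^2$ everywhere else.
This random effects structure follows from Assumptions 1-3 and the additional assumptions that $\varepsilon_{it}$ is uncorrelated with $\varepsilon_{is}$ for $t \neq s$ and that the variances $\sigma_c^2$ of $c_i$ and $\sigma_\varepsilon^2$ of $\varepsilon_{it}$ are constant.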
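The entries of this covariance matrix can be derived in two lines. The sketch below writes $\nu_{it} = c_i + \varepsilon_{it}$, $\sigma_c^2 = \mathrm{Var}(c_i)$ and $\sigma_\varepsilon^2 = \mathrm{Var}(\varepsilon_{it})$, and uses the fact that $c_i$ is uncorrelated with $\varepsilon_{it}$ and that the shocks are uncorrelated across periods; T = 3 is chosen only for illustration.

```latex
\begin{align*}
\mathrm{Var}(\nu_{it}) &= \mathrm{Var}(c_i) + \mathrm{Var}(\varepsilon_{it})
                        = \sigma_c^2 + \sigma_\varepsilon^2, \\
\mathrm{Cov}(\nu_{it}, \nu_{is}) &= E\left[(c_i + \varepsilon_{it})(c_i + \varepsilon_{is})\right]
                        = \sigma_c^2 \qquad (t \neq s), \\
\mathrm{Corr}(\nu_{it}, \nu_{is}) &= \frac{\sigma_c^2}{\sigma_c^2 + \sigma_\varepsilon^2}, \\
\Omega &=
\begin{pmatrix}
\sigma_c^2 + \sigma_\varepsilon^2 & \sigma_c^2 & \sigma_c^2 \\
\sigma_c^2 & \sigma_c^2 + \sigma_\varepsilon^2 & \sigma_c^2 \\
\sigma_c^2 & \sigma_c^2 & \sigma_c^2 + \sigma_\varepsilon^2
\end{pmatrix}
\quad \text{for } T = 3 .
\end{align*}
```

The within-unit correlation is therefore the same for every pair of periods and depends only on the share of the total error variance that is due to the individual effect.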
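We can exploit this error structure by estimating the variance-covariance matrix $\Omega$ and then using weighted (generalized) least squares:
$$\hat\beta_{RE} = \left(\sum_{i=1}^N X_i' \hat\Omega^{-1} X_i\right)^{-1} \left(\sum_{i=1}^N X_i' \hat\Omega^{-1} Y_i\right).$$
Here, $X_i$ is the T x K matrix with tth row equal to $X_{it}'$ and $Y_i$ is the T-vector with tth element equal to $Y_{it}$. The consistency of this estimator, like that of the OLS estimator, does not depend on the random effects error structure being true.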
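The estimator for $\Omega$ is
$$\hat\Omega = \hat\sigma_\varepsilon^2 I_T + \hat\sigma_c^2 \iota_T \iota_T'.$$
To get estimates for $\sigma_c^2$ and $\sigma_\varepsilon^2$, first estimate $\beta$ by OLS, calculate the residuals
$$\hat\nu_{it} = Y_{it} - X_{it}'\hat\beta_{OLS},$$
estimate the residual variance as
$$\hat\sigma_\nu^2 = \frac{1}{NT-K} \sum_{i=1}^N \sum_{t=1}^T \hat\nu_{it}^2,$$
estimate the variance of the unobserved individual effect as
$$\hat\sigma_c^2 = \frac{1}{NT(T-1)/2 - K} \sum_{i=1}^N \sum_{t=1}^{T-1} \sum_{s=t+1}^{T} \hat\nu_{it} \hat\nu_{is}$$
and
$$\hat\sigma_\varepsilon^2 = \hat\sigma_\nu^2 - \hat\sigma_c^2,$$
or zero if this is negative.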
# Random effects variance structure estimate
X=cbind(rep(1,length(y)),logincome,mothed,fathed,mothage,fathage,nutrsc1,qrprice,qrpcorn,pcthomes,popden,adeqvhh,rd1,rd2,rd3);
olsmodel=lm(y~X-1);
olsmodel.resid=resid(olsmodel);
# Number of periods per household (named T as in the notes; this masks R's
# shorthand T for TRUE, so TRUE is spelled out below)
T=4;
N=length(y)/T;
K=ncol(X);
# Residual variance sigma_nu^2
sigmahatnusq=(1/(N*T-K))*sum(olsmodel.resid^2);
# Rearrange residuals into an NxT matrix with each row corresponding to a household
residmat=matrix(olsmodel.resid,byrow=TRUE,nrow=N);
# Sum of within-household cross products nuhat_it*nuhat_is over all pairs t<s
tempsum=0;
for(i in 1:N){
  for(t in 1:(T-1)){
    for(s in (t+1):T){
      tempsum=tempsum+residmat[i,t]*residmat[i,s];
    }
  }
}
sigmahatcsq=(1/(N*T*(T-1)/2-K))*tempsum;
sigmahatepssq=sigmahatnusq-sigmahatcsq;
> sigmahatcsq
[1] 86.42415
> sigmahatepssq
[1] 103.9567
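If the model is correct, including the specification for $\Omega$, then the variance of $\hat\beta_{RE}$ is approximately
$$\left(\sum_{i=1}^N X_i' \Omega^{-1} X_i\right)^{-1},$$
which is estimated by replacing $\Omega$ with $\hat\Omega$.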
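The full procedure (OLS residuals, variance components, weighted least squares, and the model-based variance) can be sketched by hand in base R. The example below uses simulated data with assumed true values (intercept 1, slope 2, individual-effect variance 2.25, shock variance 1). All object names are hypothetical, and truncating a negative estimate of the individual-effect variance at zero is a choice made for this sketch.

```r
set.seed(3)
N <- 500; T.periods <- 4; K <- 2
id  <- rep(1:N, each = T.periods)
x   <- rnorm(N * T.periods)
c.i <- rep(rnorm(N, sd = 1.5), each = T.periods)   # individual effect, variance 2.25
y   <- 1 + 2 * x + c.i + rnorm(N * T.periods)      # shock variance 1
X   <- cbind(1, x)

# Step 1: pooled OLS residuals and overall residual variance
nuhat <- resid(lm(y ~ X - 1))
sigma.nu.sq <- sum(nuhat^2) / (N * T.periods - K)

# Step 2: variance of the individual effect from within-unit cross products.
# For each unit, sum_{t<s} nu_t nu_s = ((sum_t nu_t)^2 - sum_t nu_t^2) / 2
resid.mat <- matrix(nuhat, nrow = N, byrow = TRUE)
cross.sum <- sum((rowSums(resid.mat)^2 - rowSums(resid.mat^2)) / 2)
sigma.c.sq <- cross.sum / (N * T.periods * (T.periods - 1) / 2 - K)
sigma.c.sq <- max(sigma.c.sq, 0)       # truncation at zero: an assumption of this sketch
sigma.eps.sq <- sigma.nu.sq - sigma.c.sq

# Step 3: Omega-hat and the weighted least squares estimator
Omega.hat <- sigma.eps.sq * diag(T.periods) +
  sigma.c.sq * matrix(1, T.periods, T.periods)
Omega.inv <- solve(Omega.hat)

A <- matrix(0, K, K)   # accumulates sum_i X_i' Omega^{-1} X_i
b <- matrix(0, K, 1)   # accumulates sum_i X_i' Omega^{-1} Y_i
for (j in 1:N) {
  rows <- which(id == j)
  Xj <- X[rows, , drop = FALSE]
  A <- A + t(Xj) %*% Omega.inv %*% Xj
  b <- b + t(Xj) %*% Omega.inv %*% y[rows]
}

beta.re <- solve(A, b)        # random effects estimate of beta
var.re  <- solve(A)           # model-based variance estimate
se.re   <- sqrt(diag(var.re))
cbind(beta.re, se.re)
```

The estimates should land near the assumed true values, and the standard errors are only trustworthy when the assumed covariance structure actually holds, as it does by construction in this simulation.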
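To fit a random effects model in R, we can use the lme4 package, which needs to be installed first. Note that lmer estimates the variance components by restricted maximum likelihood (REML) by default rather than by the moment estimators above, so its variance estimates need not match sigmahatcsq and sigmahatepssq exactly.
# Random effects estimator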
library(lme4);
remodel=lmer(y~X-1+(1|hhid));
betahat=fixef(remodel);
vcov.betahat=vcov(remodel);
se.betahat=sqrt(diag(vcov.betahat));
# Calculate point estimate and 95% CI for elasticity of food demand with
# respect to income at mean food expenditure
beta2hat=betahat[2];
lci.beta2hat=beta2hat-1.96*se.betahat[2];
uci.beta2hat=beta2hat+1.96*se.betahat[2];
etahat=100*betahat[2]*log(1.01)/mean(y);
lci.etahat=100*lci.beta2hat*log(1.01)/mean(y);
uci.etahat=100*uci.beta2hat*log(1.01)/mean(y);
> lci.etahat
Xlogincome
0.2366260
> uci.etahat
Xlogincome
0.3428723
The random effects estimator produces a shorter CI for $\eta$ than the OLS estimator with robust SEs [(0.237, 0.343) compared to (0.212, 0.349)]. However, the random effects CI is valid only if the random effects structure is correct.