PS 245 Computer Assignment #4 -- Regression
Our regression assignment is going to pay a casual visit to a controversy that roiled the public policy literature back in the 1960s. At that time, scholars were embroiled in a debate over whether political institutions really mattered: whether which party controlled the government made any difference to what government actually did in people's lives.
One side, consisting of economists and people who wished they were economists, argued that government institutions influenced very little. So much of a state's policy was determined by its economic condition that elected decision makers were mostly window dressing.
On the other side was a group of stubborn political scientists who could not believe that Democratic control of a state's political institutions produced the same policy as Republican control. To them this idea was dangerous -- not only did it place election battles in an incredibly cynical light, but it called into question whether the discipline of political science was even relevant.
We will test an extremely narrow slice of this argument, just to have an hypothesis to explore...
So far all of our data have been at the individual level. We either had survey responses, for the NES data, or stupid drivers with the highway driving data. For the regression assignments, though, we will use something called "aggregate" data. That is, each row in our data matrix does not represent a single person, but an aggregate (or collection) of people with a single number computed to represent them. In particular, each row in the dataset represents a single U.S. state. Therefore we have 50 rows for each variable.
Copy the data file over from the Drop folder to your computer’s temporary D: drive. It is called Salary.dta. Read that file into Stata.
Our dependent variable, salary90, is the average teacher salary paid in the state in 1990. Measured in dollars, this variable is our policy indicator -- we will see how much economic constraints and political traits of the state seem to correlate with the measure, both in the bivariate and later in the multivariate sense.
For the purposes of this first portion, we will not even bother with the political variables. Rather, we will set up a single, default explanation for teacher salaries -- the state's wealth. The richer a state is, the hypothesis runs, the more it can pay its teachers. This is a brainless hypothesis, of course; no one would be surprised to learn that how much money you have determines how much you spend.
Rather, this first bivariate regression represents the obvious starting point. The challenge, which we will get to in the multivariate regression assignments, is to see if anything *else* helps explain salary, to see if we can say anything interesting about what determines a state's policies beyond wealth. State wealth will become our “control” variable.
For the measure of wealth, we will use dpinc88, which is the state's disposable personal income in 1988 (one of the best measures of taxable wealth). Obviously, the year for my dependent variable does not match that for my independent variable. Yet this makes some sense; no one knows how much tax money will come in for 1990 until after teacher salaries are locked in based upon previous years' budgets.
As a first step, summarize all of the variables (using "summ" without specifying a variable).
(1) Now, using no additional information, if I regress salary90 on dpinc88, what will the output predict for average teacher salary in a state for which the disposable personal income is $13,360?
Let's run this bivariate regression first using the manual methods introduced in Wonnacott and Wonnacott. Look at Formula 11-3 on page 361, and Formula 11-6 on page 362. We can compute the intercept and slope using a series of Stata commands:
egen xbar=mean(dpinc88)
egen ybar=mean(salary90)
egen coefnum=sum((dpinc88-xbar)*(salary90-ybar))
egen coefden=sum((dpinc88-xbar)^2)
generate b=coefnum/coefden
generate a=ybar-(b*xbar)
By the way, the "egen" command generates a variable, just as the "generate" command does. The only difference is that we use it for computations that involve more than one row at a time; it can make computations across a whole variable, such as summing it, taking its average, computing its standard deviation, etc. When you finish these steps, type:
list a b
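For reference, the egen steps above correspond to the standard least-squares formulas for a bivariate regression, written here with X standing for dpinc88 and Y for salary90. The notation is an assumption for illustration and may differ slightly from the textbook's:

```latex
b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
a = \bar{Y} - b\,\bar{X}
```

In the commands, coefnum holds the numerator of b, coefden holds the denominator, and xbar and ybar hold the two means.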
(2) What is the intercept for this bivariate regression? What is the slope?
But that was too labor intensive. We do not want to go through so many steps every time we run a regression, especially because later we would like to run regressions with additional variables. So let's use the command Stata provides to do this:
regress salary90 dpinc88
Now you see the normal output we will have to interpret for regressions. Let’s have you discuss in more detail what the coefficients mean.
(3) For question 1, I asked you what the model would predict for salary when dpi was at its mean, roughly 13360. Compute this now using the coefficients provided, and confirm that your earlier answer was correct. Show the work.
(4) If the average person in Alabama has an additional $1,000 of disposable personal income relative to the average person in Mississippi, how much more do we predict Alabama will pay teachers?
(5) How much more, on average, does a state at the upper quartile of dpi pay teachers relative to a state at the lower quartile of income? (Hint: "summ dpinc88, detail").
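Questions like these can be worked by hand from the output table, but Stata also keeps the coefficients from the last regression in _b[varname] (with _b[_cons] for the intercept). The following is a minimal sketch, assuming Salary.dta is loaded and the variable names are as given above; it shows the mechanics and is not a substitute for showing your work:

```stata
* Assumes Salary.dta is loaded; variable names are those used in this assignment.
regress salary90 dpinc88

* Predicted salary at a chosen income level (13360 is the value from question 1)
display _b[_cons] + _b[dpinc88]*13360

* Predicted salary difference for a $1,000 gap in income
display _b[dpinc88]*1000

* Predicted salary difference between upper- and lower-quartile states;
* summarize with the detail option leaves the quartiles in r(p25) and r(p75)
quietly summarize dpinc88, detail
display _b[dpinc88]*(r(p75) - r(p25))
```

Because the slope is constant, any difference in predicted salary is just the slope times the difference in income; the intercept drops out.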
Let’s explore the “fit” of this regression – that is, how well our predictions fit reality.
(6) What percent of the variance in average teacher salaries does this income variable explain, when considered alone? Is this number high or low, or can you tell?
(7) By how much, on average, do we miss a given state's mean salary with our model's estimates? (HINT: Root Mean Square Error = Standard Error of the Regression.)
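Both fit measures appear in the header of the regression output, and Stata also stores them after the command runs. A short sketch, assuming the bivariate regression is rerun first, shows where they live and how Root MSE is built from the residual sum of squares:

```stata
* Assumes Salary.dta is loaded.
regress salary90 dpinc88

* Fit statistics stored by regress
display "R-squared: " e(r2)
display "Root MSE:  " e(rmse)

* Root MSE recomputed as the square root of the residual sum of squares
* divided by the residual degrees of freedom
display sqrt(e(rss)/e(df_r))
```

The last line should reproduce the Root MSE reported in the output.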
The coefficient on dpinc88 captures the raw, bivariate relationship between teacher salaries and income, with nothing else held constant. As a description of how these two items move together at the state level, it may be a nice measure. But if we are interested in the causes of this salary level, we need to consider that some of that relationship may arise for reasons unrelated to wealth itself. For example, wealthier states may be more likely to elect Democrats (or Republicans), and what really determines teacher salaries is which party's elected officials control the government.
So we need to introduce a few other variables, to see how they stack up next to the economic data in explaining salaries.
In particular, we will add the following:
* unemp88 -- The unemployment rate in 1988. Logic: The overall wealth of a state may have less influence on the generosity of its social welfare policies than the distribution. If we were to compare two states of equivalent wealth, one of which had few unemployed and one of which had many, that would indicate the latter's wealth was more poorly distributed.
* gradrate -- The % of adults in the state who have graduated from high school. Logic: One political influence on education spending might be the popular demand for education. We expect states with more educated people to demand more generous spending on education.
* pop1k -- The state population in 1,000s. Logic: Bigger states require a larger state education bureaucracy, which will draw on educational funds that otherwise might have gone into the classroom.
* public88 -- The % of the state's high school students who are in public schools. Logic: Another proxy representing demand for public education. In a state where many students have fled to the private-school system, public education probably enjoys less widespread political support than in a state where spending influences a broader base of children.
* plgdem90 -- The % of the state legislature's lower house (where most spending bills originate) that is Democratic. Logic: The Democratic party is usually affiliated with social-welfare spending, and often is affiliated with teachers' unions desiring salary increases for their membership. Therefore, a state with strong Democratic influence should spend more on teacher salaries.
The multivariate regression command looks like this:
regress salary90 dpinc88 unemp88 gradrate pop1k public88 plgdem90
(8) (a) If the average person in Alabama has an additional $1,000 of disposable personal income relative to the average person in Mississippi, how much more do we predict Alabama will pay teachers under this new model? Is this substantively different from what we found with the bivariate model (a judgment call, not a statistical test)? (b) If 10 percentage points more people in Alabama were unemployed than in Mississippi, what would we predict as the difference between the two states' average salaries, other things being equal? (Note: You may have to use "summ" to determine how unemp88 is scaled.) (c) If 5 percentage points more adults in Alabama have graduated from high school than in Mississippi, what would we predict as the difference between the two states' average salaries, other things being equal?
(9) How much more of the variance in average teacher salaries can we explain now? How much closer, on average, are our guesses to each state's real salary?
(10) Each coefficient represents a test of the hypothesis I listed for that variable (under "Logic"). For which variables is the estimated relationship with average salary in the direction implied by the hypothesis? For which variables is the estimated effect opposite from what the hypothesis implies?
(11) Let's say we required each coefficient to pass an hypothesis test at 90% confidence. Which coefficients pass the test, judging from their p-values? Do these coefficients also pass a 95% confidence test?
(12) Another way to conduct this hypothesis test would be to compute the 90% confidence interval, which we get by adding an option after the regression command (remember, options are set off from the main command by a comma). The option we need is: level(90). Rerun the regression with that option added. What is the 90% confidence interval for our coefficient on unemp88? What about for gradrate?
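As a sketch of what the level(90) option does, the commands below rerun the model and then rebuild one interval by hand from the stored coefficient and standard error. The scalar name tcrit is made up for this illustration, and the invttail() function is assumed to be available in your version of Stata:

```stata
* Assumes Salary.dta is loaded. Rerun the model asking for 90% intervals.
regress salary90 dpinc88 unemp88 gradrate pop1k public88 plgdem90, level(90)

* Rebuild the 90% interval for unemp88: coefficient +/- t * standard error,
* where t leaves 5% in each tail at the residual degrees of freedom.
scalar tcrit = invttail(e(df_r), 0.05)
display _b[unemp88] - tcrit*_se[unemp88]
display _b[unemp88] + tcrit*_se[unemp88]
```

The two displayed numbers should match the interval in the output table. A 90% interval that excludes zero corresponds to a two-sided p-value below 0.10.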
Now let's look at how well the regression performs from another point of view, by comparing the real salaries to the predicted salaries that we would produce using our coefficients and explanatory variables alone (i.e., leaving out the unexplained variation around our prediction line, y-hat). Type the following commands, which always capture the predictions and errors from the last regression Stata has run:
predict salguess
predict salerror, resid
(13) What is the average residual (i.e., remaining error)? (Hint: "summ salerror"). Can you explain this result?
Note that the Root MSE from the last regression is quite similar to the standard deviation of "salerror," although not identical. Both capture how our predictions vary from reality, but not in exactly the same way. Let's see if a pattern appears in these errors to indicate that our regression assumptions do not hold. First, let's graph the errors across the predictions:
graph salerror salguess
(14) Are the errors notably more or less spread out when our predictions are in a particular range (contrary to our regression assumptions)? Or do our errors follow the same rough distribution across all predicted values (as we assumed they would)?
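The earlier note that Root MSE and the standard deviation of salerror are similar but not identical can be checked directly. This is a sketch only, assuming the multivariate regression is still the most recent estimation and that salerror was created as shown above:

```stata
* Assumes the multivariate regression is the most recent estimation
* and that salerror was created with "predict salerror, resid".
quietly summarize salerror
display "SD of residuals: " r(sd)
display "Root MSE:        " e(rmse)

* The SD divides the squared deviations of the residuals by N-1;
* Root MSE divides by the residual degrees of freedom instead.
* Rescaling the SD by that ratio should come very close to Root MSE.
display r(sd)*sqrt((r(N)-1)/e(df_r))
```

The two measures differ only in the denominator, which is why the gap between them grows as more explanatory variables are added to a small dataset.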
More interestingly, let's graph the predictions and the actual values together. This time I will fancy up the graph a bit by picking the scales for each axis and using the state abbreviations instead of just dots:
graph salary90 salguess, xscale(20000,45000) yscale(20000,45000) symbol([state])
(15) Copy this graph to a Microsoft Word file, and print it out. Draw a 45-degree angle line from the bottom left corner up to the top. What would it mean if a point fell on this line? (Note: Stata can produce a line like this, but it's too complicated to do here; type "help graph" if you feel like playing with this idea.) Judging from the graph, which state represents the largest positive residual (i.e., the state in which our guess falls the farthest below the truth)? Which one represents the most negative residual (i.e., our guess is much higher than the truth)? Circle each one.
Note that you would not be required to use graphs to evaluate your predictions. You could always look at the salerror and salguess variables directly to figure this sort of thing out. I won’t bother to walk you through that, but the only new command you would need for that purpose is “sort”. For example, the following would rearrange your dataset according to the size of the error:
sort salerror
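A minimal sketch of that approach follows. It assumes the dataset contains the variable state (the one used to label the graph above) and that salguess and salerror have already been created:

```stata
* Assumes salguess and salerror exist from the predict commands above.
* Sort from the most negative residual to the most positive.
sort salerror

* Stata sorts missing values last, so restrict the listing to
* states that actually have a residual.
list state salary90 salguess salerror if salerror != .
```

The first rows listed are the states where the guess is furthest above the truth, and the last rows are those where it falls furthest below.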
(16) BONUS -- Can you figure out why Nebraska has missing data?
This computer assignment has (in addition to encouraging you to practice interpreting regression coefficients and measures of fit):
(A) Shown you how to run a multiple regression in Stata.
(B) Shown you how to adjust the confidence intervals for regressions.
(C) Shown you how to analyze the residuals from a regression using graphical techniques.