90-776 Manipulation of Large Data Sets
Lab 6
April 21, 1999
Major Skills covered in today’s lab:
· Using SAS to compute statistics
Today’s hint:
“Man without statistics is like a fish without a bicycle.”
Yet another data set! For this lab, we will use the data set CEN80.SD2, which is on the course LAN account: l:\academic\90776\data\. This data set contains 1980 decennial census data at the ZIP code level for approximately 10 states. All of the variables in the data set are labeled, so the first thing to do is run a contents procedure.
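A minimal sketch of that first step, assuming the data set is read straight from the course directory. The libref LAB is an arbitrary name chosen for this sketch; any valid libref will do.

```sas
/* Assumption: LAB is a libref picked for this sketch, not a course standard. */
libname lab 'l:\academic\90776\data\';

/* List every variable in CEN80 with its type, length and label. */
proc contents data=lab.cen80;
run;
```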
I. Correlations
1) Find the correlation among the total number of people in poverty (BPOV80), the total value of owner-occupied housing (VLOOC80), and the total gross rent of renter-occupied housing (VLRN80). What do you expect the relationship to be among the three variables? What do you find? Are the correlations significant (P-values small)?
2) Why do you think having more people in poverty is positively correlated with higher housing values and rents?
3) Next, create new variables (in a new temporary data set) that measure
A) The percent of people in poverty: POVPCT = BPOV80/BSPOV80
B) Average housing value: VLOOCAV = VLOOC80/HUOOC80
C) Average rental price: VLRNAV = VLRN80/RNTOT80
You may want to label these variables.
These variables are the values divided by the totals, and they give an average value for that ZIP code. This solves the problem from above (more people means more people in poverty; more houses means greater total housing value, etc.). When working with aggregate data, it is very important to create the proper averages!
4) Check your log file. Notice all of the division-by-zero messages. For some of the ZIPs, the denominators from step 3 (BSPOV80, HUOOC80 and RNTOT80) are zero, and SAS does not like to divide by zero. Change your program so that it keeps an observation from the permanent data set only if all three of those variables are greater than zero. (Do this with a subsetting IF statement right after the SET statement.)
5) Next, calculate the correlations among the three variables you created in (3). Do the correlations now appear to be more believable?
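One way the steps in part I might fit together is sketched below. It assumes the libref LAB already points at the course data directory and uses CEN80AVG as a hypothetical name for the temporary data set. The denominator names are taken from the formulas in step 3; check them against your contents listing before running.

```sas
/* Step 1: correlations among the raw totals. */
proc corr data=lab.cen80;
  var bpov80 vlooc80 vlrn80;
run;

/* Steps 3 and 4: temporary data set with per-ZIP averages. */
/* CEN80AVG is a hypothetical name chosen for this sketch.  */
data cen80avg;
  set lab.cen80;
  /* Subsetting IF: keep a ZIP only when all three denominators are positive. */
  if bspov80 > 0 and huooc80 > 0 and rntot80 > 0;
  povpct  = bpov80  / bspov80;
  vloocav = vlooc80 / huooc80;
  vlrnav  = vlrn80  / rntot80;
  label povpct  = 'Share of people in poverty'
        vloocav = 'Average value of owner-occupied housing'
        vlrnav  = 'Average gross rent of renter-occupied housing';
run;

/* Step 5: correlations among the averages. */
proc corr data=cen80avg;
  var povpct vloocav vlrnav;
run;
```

Because a subsetting IF drops the observation before the assignment statements run, the division-by-zero messages should disappear from the log.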
II. Testing Means
Let’s see if average poverty rates, housing values, and rents in Pennsylvania are different from those in the rest of the country.
1) Create a new temporary data set that reads in (with a SET statement) the data set you created in part I. In this data set, create a dummy variable called PA that equals 1 if the observation is from Pennsylvania, and equals 0 otherwise. If the observation is from Pennsylvania, its ZIP code will be between 15000 and 19699.
2) What percent of the observations are from PA? Do a means procedure of the PA dummy to find out.
3) Use PROC TTEST to test whether POVPCT, VLOOCAV, and VLRNAV are different in PA than in the rest of the states. (Use CLASS PA to tell SAS to test the means from PA against the means from the rest of the country.) Are the means significantly different?
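A sketch of part II follows, under several assumptions: the part I data set is called CEN80AVG, the new one CEN80PA (both hypothetical names), and the ZIP code is stored in a numeric variable called ZIP. Check the contents listing for the real variable name and type. Pennsylvania ZIP codes begin at 15000, so that is the lower bound used here.

```sas
/* Step 1: add the Pennsylvania dummy. */
data cen80pa;
  set cen80avg;
  /* ZIP is an assumed variable name; a character ZIP would need INPUT() first. */
  /* The comparison evaluates to 1 when true and 0 when false.                  */
  pa = (15000 <= zip <= 19699);
run;

/* Step 2: the mean of a 0/1 dummy is the share of observations with PA = 1. */
proc means data=cen80pa n mean;
  var pa;
run;

/* Step 3: two-sample t tests, PA versus all other states. */
proc ttest data=cen80pa;
  class pa;
  var povpct vloocav vlrnav;
run;
```

Multiply the mean of PA by 100 to read it as a percent.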
III. Estimating regressions
Let’s see if we can predict the price of housing based on the income in a ZIP.
1) Create another temporary data set that sets the data set you created in part II. In this data set, create an average or per-capita income variable: INCAV = TINC80/TOT_P80.
2) Regress VLOOCAV on INCAV (in your PROC REG step, the syntax is MODEL VLOOCAV = INCAV;). Also, produce a plot of housing values against income (in the same PROC REG step, the syntax is PLOT VLOOCAV*INCAV;). Is the coefficient on INCAV significant?
3) Re-run your program, this time including the predicted regression line in your plot. The predicted values are referred to by the keyword PREDICTED. (with the trailing period), and the /OVERLAY option lets you plot both the actual data points and the regression line on the same plot. Also, add another plot to the same regression step that plots the residuals: PLOT RESIDUAL.*INCAV = "o";
4) Next, include the percent of people in poverty (POVPCT) as a second explanatory variable in your regression. Is the coefficient on this variable significantly different from zero? Also, include a test of the null hypothesis that both coefficients are jointly equal to zero (in your PROC REG step, include the statement TEST INCAV=0, POVPCT=0;). Do you reject the null hypothesis? Look at the value of the F-statistic that is calculated as a result of the test. Compare its value to the F VALUE that is reported in the SAS regression output. What test do you think this F value is reported for? (Hint: it is a test of the overall significance of the regression.)
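The regression steps could be laid out roughly as follows. This sketch assumes the part II data set is called CEN80PA and names the new one CEN80INC (both hypothetical). The check on TOT_P80 is an added precaution, not something the lab asks for, and the label JOINT on the TEST statement is likewise just a name chosen here.

```sas
/* Step 1: per-capita income. */
data cen80inc;
  set cen80pa;
  /* Added precaution: leave INCAV missing when the population is zero. */
  if tot_p80 > 0 then incav = tinc80 / tot_p80;
run;

/* Steps 2 and 3: simple regression with plots. */
proc reg data=cen80inc;
  model vloocav = incav;
  /* Actual values and the fitted line on the same plot. */
  plot vloocav*incav predicted.*incav / overlay;
  /* Residuals against income, using "o" as the plotting symbol. */
  plot residual.*incav = 'o';
run;
quit;

/* Step 4: add POVPCT and test both slopes jointly. */
proc reg data=cen80inc;
  model vloocav = incav povpct;
  joint: test incav = 0, povpct = 0;
run;
quit;
```

Observations with a missing INCAV are left out of the regression automatically.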