UNIVERSITY OF OSLO
DEPARTMENT OF ECONOMICS
Exam: ECON4135 - Applied statistics and econometrics, fall 2005, continuation exam
Date of exam: Monday, January 16, 2006
Time for exam: 2:30 p.m. – 5:30 p.m.
The problem set covers 4 pages
Resources allowed:
· All written and printed resources, as well as calculators, are allowed
Grades given: A (best), B, C, D, E and F, with E as the weakest passing grade.
Comments in arial font
Scientific journals constitute the medium of communication between scientists, and also the memory (storage) of science. The economics of (scientific) journals is interesting. Bergstrom[1] argues that journals owned by private publishers are grossly overpriced, and he recommends several actions to reduce the large profits made by these publishers. Bergstrom provides data to substantiate his case. There are 180 economic journals in his database, of which 16 are published by scholarly societies such as the American Economic Association. These 16 journals are published on a non-profit basis, as opposed to the remaining journals that have private publishers. We shall particularly be interested in the separation between society journals and privately published journals. Consider the variables:
$P$: Library subscription price for the journal per year (USD).
$Y$: Number of libraries subscribing to the journal.
$C$: Total number of times papers in the journal were cited in 1998.
$A$: Age of the journal (years).
$N$: Number of pages in the journal in 1998.
$S$: Binary variable (dummy); 1 if non-profit (scholarly society), 0 otherwise.
1. Figure 1 shows the dummy for society journal $S$ plotted against age $A$. Would you think that the age of journals is normally distributed within the two groups of journals? Explain what it would mean that $A$ and $S$ are stochastically independent. Would $E(A \mid S = 1) = E(A \mid S = 0)$ if $A$ and $S$ are independent?
No, the distribution seems skewed, with a long tail towards high ages, particularly for privately published journals. Yes, since under independence the conditional densities of $A$ given $S$ for the two values of $S$ are equal (and equal the marginal density), the conditional expected values must be equal.
2. To estimate the mean journal age in the two groups one could consider the regression presented in R1 below. What is the estimated mean age for privately owned journals, and what is it for society journals? It is of interest to estimate the difference in age distribution between the two groups. What is the p-value for testing the null hypothesis $H_0: E(A \mid S = 1) = E(A \mid S = 0)$ versus a two-sided alternative? What would the p-value be for testing $H_0$ versus the one-sided alternative? Can you find a 95% confidence interval for the difference in mean age?
Let $E(A \mid S) = \beta_0 + \beta_1 S$. Then $\hat{E}(A \mid S = 0) = \hat{\beta}_0 = 33.98$ years, and $\hat{E}(A \mid S = 1) = \hat{\beta}_0 + \hat{\beta}_1 = 33.98 + 12.52 = 46.50$ years. The two-sided p-value is 0.032, and the one-sided is half of that, about 0.016. The 95% confidence interval for $\beta_1$ is $(1.12, 23.91)$ years.
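The arithmetic behind these numbers can be checked with a short Python sketch. It uses only the coefficients and the robust standard error printed in R1. The critical value 1.9734 (the 97.5% quantile of the t distribution with 178 degrees of freedom) is not printed in R1 and is hard-coded here as an assumption taken from standard tables; the variable names are illustrative, and the one-sided alternative is assumed to be that society journals are older.

```python
# Coefficients and robust standard error copied from regression R1.
b_cons = 33.98171   # _cons: estimated mean age when S = 0
b_soc = 12.51829    # Society S: estimated difference in mean age
se_soc = 5.774714   # robust standard error of the difference

# Assumption: 97.5% quantile of t with 178 degrees of freedom,
# taken from standard tables (it is not part of the R1 output).
T_CRIT = 1.9734

mean_private = b_cons           # S = 0
mean_society = b_cons + b_soc   # S = 1

# 95% confidence interval for the difference in mean age.
ci_low = b_soc - T_CRIT * se_soc
ci_high = b_soc + T_CRIT * se_soc

# With a single regressor, Prob > F is the two-sided p-value of the t test.
p_two_sided = 0.0315
# Halving is valid because the estimate is positive, i.e. on the side of
# the assumed one-sided alternative (society journals older).
p_one_sided = p_two_sided / 2

print(f"mean age, private journals: {mean_private:.2f} years")
print(f"mean age, society journals: {mean_society:.2f} years")
print(f"95% CI for the difference: ({ci_low:.2f}, {ci_high:.2f}) years")
print(f"one-sided p-value: {p_one_sided:.4f}")
```

The interval reproduces the one printed in R1 up to rounding of the critical value.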
3. Regression R2 is similar to R1, but now the response variable is $LA = \log(A)$. Histograms are shown in Figure 2. Calculate the estimated mean log age in the two groups of journals. Do your results agree with those in point 2? If you now want to test the independence between $A$ and $S$ you might test $H_0: E(LA \mid S = 1) = E(LA \mid S = 0)$ versus a two-sided alternative. What is the p-value? Why is it different from what you found in point 2? Would you prefer to compare age between the two groups on the arithmetic scale (age in years) or on the logarithmic scale?
The estimated mean log ages are $3.315$ for private and $3.315 + 0.407 = 3.722$ for society journals (in log-year units). The two sets of results do not quite agree: the estimated centre of the distribution on the arithmetic scale is systematically higher than that obtained from the mean log age, because $\log(a)$ is a concave function of $a$, and thus $E(\log A) \le \log E(A)$ by Jensen's inequality. The two-sided p-value is now 0.003. I would rather use the logarithmic scale because the distribution is more symmetric on that scale (Figure 2), and the mean is thus a more meaningful measure of the centre of the distribution. Also, the two conditional distributions separate better on the log scale, as indicated by the smaller two-sided p-value.
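As a rough illustration of the Jensen's inequality argument, the sketch below back-transforms the mean log ages from R2 and compares them with the arithmetic means from R1. Only coefficients printed in R1 and R2 are used; the variable names are illustrative.

```python
import math

# Mean log ages from R2 (intercept, and intercept plus coefficient on S).
mean_log_private = 3.314908
mean_log_society = 3.314908 + 0.4074081

# Arithmetic mean ages from R1.
arith_private = 33.98171
arith_society = 33.98171 + 12.51829

# exp(mean of log A) is the geometric mean of A; by Jensen's inequality
# it cannot exceed the arithmetic mean.
geo_private = math.exp(mean_log_private)
geo_society = math.exp(mean_log_society)

# On the log scale the group difference becomes a ratio of geometric means.
ratio = math.exp(0.4074081)

print(f"private: geometric {geo_private:.1f} vs arithmetic {arith_private:.1f} years")
print(f"society: geometric {geo_society:.1f} vs arithmetic {arith_society:.1f} years")
print(f"ratio of geometric means (society / private): {ratio:.2f}")
```

The back-transformed means (about 27.5 and 41.4 years) lie below the arithmetic means in both groups, as the inequality requires.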
4. If the issue now is to find determinants of what makes a journal society published rather than privately published, logistic regression might be useful. Regression R3 shows the result of fitting the equation $P(S = 1 \mid A) = F(\beta_0 + \beta_1 LA)$, where $F$ is the cumulative logistic distribution function. Explain why the estimated probability of a one year old journal being society published is 0.00275. What is the estimated probability of a hundred year old journal being society published? What is the age which would make the probability about ½ for the journal being society published? Note that natural logarithms are used.
For $A = 1$, $LA = 0$ and $P(S = 1 \mid A = 1) = F(\beta_0)$, which is estimated as $F(-5.8916) = e^{-5.8916}/(1 + e^{-5.8916}) = 0.00275$. For $A = 100$, $LA = \log(100) = 4.605$. Therefore, the estimated probability is $F(-5.8916 + 1.01383 \cdot 4.605) = F(-1.223) = 0.227$. To get the estimated probability equal to ½, the exponent must be 0. That is, $LA = 5.8916/1.01383 = 5.811$ and $A = e^{5.811} \approx 334$ years, which is well outside the range of the data.
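The three calculations for R3 can be reproduced with a few lines of Python. This is a minimal sketch using only the two coefficients printed in R3; the function names `logistic_cdf` and `prob_society` are illustrative.

```python
import math

# Coefficients copied from regression R3.
B0 = -5.891597   # _cons
B1 = 1.01383     # coefficient on LA = log(age)

def logistic_cdf(x):
    """Cumulative logistic distribution function F(x) = e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))

def prob_society(age):
    """Estimated probability that a journal of the given age is society published."""
    return logistic_cdf(B0 + B1 * math.log(age))

p_age_1 = prob_society(1)      # log(1) = 0, so this is F(B0)
p_age_100 = prob_society(100)

# F(x) = 1/2 exactly when x = 0, i.e. log(age) = -B0 / B1.
age_half = math.exp(-B0 / B1)

print(f"P(S=1 | age 1)   = {p_age_1:.5f}")
print(f"P(S=1 | age 100) = {p_age_100:.3f}")
print(f"age giving probability 1/2: {age_half:.0f} years")
```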
5. A more complex logistic regression is shown in R4, where $LP = \log(P)$ and $LY = \log(Y)$ etc. How would you explain to your fellow economist who has never heard of logistic regression what these results mean? Are the data markedly better fitted by this regression than by R3?
I would only try to explain what we can read from the signs of the estimated regression coefficients. Everything else being constant, if the age increases, the probability of the journal being society published is reduced; if the number of subscribing libraries increases, it is also reduced; but if the number of pages increases, the probability increases, as it does when the number of citations increases. The most important determinant is the price: ceteris paribus, if the price increases, the probability of the journal being non-profit decreases, as expected. In R3, $LA$ is clearly significant, but not in model R4, presumably since $LA$, $LY$, $LN$ and $LC$ are quite strongly correlated. None of these has a significant logistic regression effect on its own, but collectively they are strongly significant, with p-value 0.0058. Model R4 certainly fits the data better, with the log likelihood increased by 13 units on 4 extra parameters. The overall significance of the set of covariates is increased, with the p-value decreased from 0.0144 to 0.0000.
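One way to quantify "markedly better" is a likelihood-ratio comparison of R4 with R3, which the answer above does not carry out explicitly. The sketch below uses only the two log likelihoods printed in the output. The helper `chi2_sf_even_df` is a hypothetical name; it relies on the closed form of the chi-square tail that holds for even degrees of freedom, and as a sanity check it is also applied to the Wald statistic 14.54 from the `test` command.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail P(X > x) of a chi-square variable with even df.

    For even df the tail has the closed form
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df % 2 != 0:
        raise ValueError("closed form only holds for even df")
    half = x / 2.0
    terms = (half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)

# Log likelihoods copied from the R3 and R4 output.
LL_R3 = -51.000589
LL_R4 = -37.935238

# Likelihood-ratio statistic for the 4 extra parameters in R4.
lr_stat = 2.0 * (LL_R4 - LL_R3)
p_lr = chi2_sf_even_df(lr_stat, 4)

# Sanity check: the Wald test reported after R4 has chi2(4) = 14.54.
p_wald = chi2_sf_even_df(14.54, 4)

print(f"LR statistic: {lr_stat:.2f} on 4 df, p-value {p_lr:.1e}")
print(f"Wald p-value recomputed from chi2(4) = 14.54: {p_wald:.4f}")
```

The LR statistic also equals the difference between the two reported LR chi2 values, 32.11 - 5.98 = 26.13.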
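6. Let the odds of a journal being society published be $O = p/(1-p)$, where $p$ is the probability of being society published. Show that $O = e^{\beta_0} A^{\beta_A} Y^{\beta_Y} N^{\beta_N} C^{\beta_C} P^{\beta_P}$ when $p = F(\beta_0 + \beta_A LA + \beta_Y LY + \beta_N LN + \beta_C LC + \beta_P LP)$. Could you interpret $\beta_P$ as the price elasticity of the odds for a journal being society published? What is the 95% confidence interval for the price elasticity of the odds?
Since $F(x) = e^x/(1 + e^x)$, we have $O = F(x)/(1 - F(x)) = e^x$ with $x = \beta_0 + \beta_A LA + \beta_Y LY + \beta_N LN + \beta_C LC + \beta_P LP$. Thus the expression for $O$, and $\partial \log O / \partial \log P = \beta_P$. $\beta_P$ is therefore the price elasticity of the odds for being society- rather than privately published. The 95% confidence interval is given in R4, and is $(-2.27, -0.61)$.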
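To make the elasticity interpretation concrete, the sketch below computes how the odds change when the price is multiplied by a given factor, everything else held constant. It uses the LP coefficient and its confidence limits from R4; the function name `odds_multiplier` is illustrative.

```python
# Coefficient on LP = log(price) and its 95% confidence limits, from R4.
B_P = -1.438341
CI_LOW, CI_HIGH = -2.270081, -0.6066017

def odds_multiplier(price_factor, beta=B_P):
    """Factor by which the odds change when price is multiplied by price_factor.

    Since the odds are proportional to P**beta, scaling P by a factor c
    scales the odds by c**beta.
    """
    return price_factor ** beta

# A 1% price increase changes the odds by roughly beta percent.
one_percent = odds_multiplier(1.01)
pct_change = 100.0 * (one_percent - 1.0)

# Doubling the price, at the point estimate and at the confidence limits.
double_point = odds_multiplier(2.0)
double_low = odds_multiplier(2.0, CI_LOW)
double_high = odds_multiplier(2.0, CI_HIGH)

print(f"1% higher price: odds change by {pct_change:.2f}%")
print(f"doubling the price multiplies the odds by {double_point:.2f} "
      f"(95% CI {double_low:.2f} to {double_high:.2f})")
```

For a 1% price increase the change in the odds is close to the elasticity itself (about -1.4%), which is the usual first-order reading of an elasticity.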
7. Explain, with your statistically ignorant fellow economist in mind, what is meant by the interval having degree of confidence 95%. Our sample of 180 economics journals is about the total population of English language journals in the field of economics. The sample can therefore not rightly be thought of as a random sample from a big population. Is this a problem for your interpretation of the confidence interval?
If the experiment were repeated many times, and if the same method were used to calculate the 95% interval, it would cover the true value in 95% of the replicates in the long run. But what should the experiment be (my dear friend)? The history of the field of economics has had its realization so far, with its 180 journals in the English language. You can probably not envisage the development being repeated independently a large number of times. But you might think hypothetically. If the logistic model is true, in the hypothetical sense that age, subscription etc. developed as they did, but that for each journal a coin is flipped to determine whether it should be an $S = 1$ or an $S = 0$ journal, with success probability determined by the logistic equation, then replicated data from the assumed model can be simulated. And if simulated a good many times, the 95% intervals calculated by the method will, if Stata is right, cover the true value of the parameter in about 95% of the replications. From this explanation, it is really no problem that the sample is the complete population of economics journals in the English language.
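The coin-flipping thought experiment can be sketched as a small simulation. This is only an illustration under stated assumptions, not the analysis behind the exam: the journal data are not available here, so the 180 log ages are hypothetical draws (normal with mean 3.35 and standard deviation 0.63, loosely matched to R2), the "true" parameters are set to the R3 estimates, only the one-covariate model of R3 is used, and the interval is the usual Wald interval (estimate plus or minus 1.96 standard errors). The functions `fit_logit` and `simulate_coverage` are illustrative names.

```python
import math
import random

TRUE_B0, TRUE_B1 = -5.891597, 1.01383   # assumed truth: the R3 estimates

def logistic_cdf(x):
    """Numerically stable logistic distribution function F(x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def fit_logit(x, y, iters=15):
    """Newton-Raphson fit of P(y=1|x) = F(b0 + b1*x).

    Returns the slope estimate and its standard error, the latter from the
    inverse information matrix computed in the final Newton step.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = logistic_cdf(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p                # score for the intercept
            g1 += (yi - p) * xi         # score for the slope
            h00 += w                    # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b1, math.sqrt(h00 / det)

def simulate_coverage(reps=200, seed=1):
    """Share of replications whose 95% Wald interval covers TRUE_B1."""
    rng = random.Random(seed)
    # Hypothetical log ages, held fixed across replications ("age developed
    # as it did"); only the society/private coin flips are redrawn.
    log_age = [rng.gauss(3.35, 0.63) for _ in range(180)]
    covered = 0
    for _ in range(reps):
        y = [1 if rng.random() < logistic_cdf(TRUE_B0 + TRUE_B1 * x) else 0
             for x in log_age]
        b1, se = fit_logit(log_age, y)
        if b1 - 1.96 * se <= TRUE_B1 <= b1 + 1.96 * se:
            covered += 1
    return covered / reps

coverage = simulate_coverage()
print(f"share of 95% intervals covering the true slope: {coverage:.3f}")
```

With a finite number of replications the reported share fluctuates around the nominal level, so it should be read as approximately, not exactly, 95%.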
R1
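Regression with robust standard errors          Number of obs =     180
                                                F(  1,   178) =    4.70
                                                Prob > F      =  0.0315
                                                R-squared     =  0.0193
                                                Root MSE      =  25.534
------------------------------------------------------------------------------
             |               Robust
       Age A |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   Society S |   12.51829   5.774714     2.17   0.032     1.122582      23.914
       _cons |   33.98171    2.02106    16.81   0.000     29.99339    37.97003
------------------------------------------------------------------------------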
R2
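Regression with robust standard errors          Number of obs =     180
                                                F(  1,   178) =    9.05
                                                Prob > F      =  0.0030
                                                R-squared     =  0.0335
                                                Root MSE      =   .6261
------------------------------------------------------------------------------
             |               Robust
  Log age LA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .4074081   .1354304     3.01   0.003     .1401523    .6746639
       _cons |   3.314908   .0497229    66.67   0.000     3.216786    3.413031
------------------------------------------------------------------------------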
R3
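Logit estimates                                 Number of obs =     180
                                                LR chi2(1)    =    5.98
                                                Prob > chi2   =  0.0144
Log likelihood = -51.000589                     Pseudo R2     =  0.0554
------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          LA |    1.01383   .4217932     2.40   0.016     .1871301    1.840529
       _cons |  -5.891597   1.571597    -3.75   0.000     -8.97187   -2.811323
------------------------------------------------------------------------------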
R4
Logit estimates                                 Number of obs =     180
                                                LR chi2(5)    =   32.11
                                                Prob > chi2   =  0.0000
Log likelihood = -37.935238                     Pseudo R2     =  0.2974
------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          LA |  -.0690083   .5582708    -0.12   0.902    -1.163199    1.025182
          LY |  -.4324953   .4960825    -0.87   0.383    -1.404799    .5398084
          LN |   1.825738   .9343176     1.95   0.051    -.0054905    3.656967
          LC |   .6336932   .4133132     1.53   0.125    -.1763857    1.443772
          LP |  -1.438341   .4243647    -3.39   0.001    -2.270081   -.6066017
       _cons |  -8.368748   5.135016    -1.63   0.103    -18.43319    1.695699
------------------------------------------------------------------------------
. test LC LN LA LY
 ( 1)  LC = 0
 ( 2)  LN = 0
 ( 3)  LA = 0
 ( 4)  LY = 0
           chi2(  4) =   14.54
         Prob > chi2 =    0.0058
Figure 1. Dummy for society journal $S$ plotted against age $A$.
Figure 2. Histogram of log age by journal type (privately published to the left)
[1] Bergstrom, T.C. 2000. Free Labor for Costly Journals? Journal of Economic Perspectives. 15: 183-198.