General

Though it is not a requirement for this exercise, we can save a lot of time by combining the two data sets before we do the exercises. To do that, we first have to save the two data files separately, for example as:

"M:\pc\Desktop\2007 host 120907 120907 earningsdata_males.dta" and

"M:\pc\Desktop\2007 host 120907 earningsdata_females.dta"

We can then combine the two data sets with the ‘append’ command: we open one of the two files and then append the other. Here is an example:

. use "M:\pc\Desktop\2007 host 120907 120907 earningsdata_males.dta", clear

. append using "M:\pc\Desktop\2007 host 120907 earningsdata_females.dta"

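Next, we need to generate an identifier variable to distinguish the two data sets. We can call the variable gender. Because the male file (5859 observations) was opened first, the appended female observations occupy rows 5860 to 9106, so we can generate the variable as follows.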

. generate gender = "male"

. replace gender = "female" in 5860/9106

gender was str4 now str6

(3247 real changes made)
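
The row-range approach above depends on knowing exactly where the appended observations start. As a hedged alternative, the sketch below lets append mark the source of each observation through its generate() option (available in Stata 11 and later). The variable name source is an illustrative choice, not part of the original exercise.

```stata
* Sketch only: "source" is a hypothetical helper variable name.
use "M:\pc\Desktop\2007 host 120907 120907 earningsdata_males.dta", clear
append using "M:\pc\Desktop\2007 host 120907 earningsdata_females.dta", generate(source)

* source is 0 for the data already in memory (males) and 1 for the appended file (females)
generate gender = cond(source == 1, "female", "male")

* check that the group sizes match the two original files
tabulate gender
```

With this approach the identifier stays correct even if the data are later sorted or the files change in size.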

Problem 1

. sort gender

. by gender: sum ln_y_

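------------------------------------------------------------------------------
-> gender = female

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       ln_y_ |      3247    7.688559     .2854469   6.43294   9.841133

------------------------------------------------------------------------------
-> gender = male

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       ln_y_ |      5859     7.94615     .3150719  6.398595   9.806811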

. by gender: ci ln_y_

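------------------------------------------------------------------------------
-> gender = female

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
       ln_y_ |       3247    7.688559     .0050094      7.678737      7.69838

------------------------------------------------------------------------------
-> gender = male

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
       ln_y_ |       5859     7.94615     .0041162      7.938081     7.954219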

. by gender: ci ln_y_, level(90)

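------------------------------------------------------------------------------
-> gender = female

    Variable |        Obs        Mean    Std. Err.       [90% Conf. Interval]
-------------+---------------------------------------------------------------
       ln_y_ |       3247    7.688559     .0050094      7.680317     7.696801

------------------------------------------------------------------------------
-> gender = male

    Variable |        Obs        Mean    Std. Err.       [90% Conf. Interval]
-------------+---------------------------------------------------------------
       ln_y_ |       5859     7.94615     .0041162      7.939378     7.952921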

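. ttest ln_y_, by(gender) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  female |    3247    7.688559     .0050094    .2854469    7.678737    7.69838
    male |    5859     7.94615     .0041162    .3150719    7.938081   7.954219
---------+--------------------------------------------------------------------
combined |    9106    7.854298     .0034461    .3288497    7.847543   7.861054
---------+--------------------------------------------------------------------
    diff |           -.2575912     .0064836                -.270301  -.2448815
------------------------------------------------------------------------------
    diff = mean(female) - mean(male)                              t = -39.7296
Ho: diff = 0                     Satterthwaite's degrees of freedom =  7272.12

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000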

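We can also compute the test statistic manually using the display command. Here the difference is taken as male minus female, so the statistic has the opposite sign to the one in the ttest output.

. display (7.94615 - 7.688559) / ((.3150719^2/5859) + (.2854469^2/3247))^(1/2)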

39.729598
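
The same manual approach can be extended to the degrees of freedom and the p-value. The sketch below uses the standard Satterthwaite formula with the group statistics reported above; the scalar names v_m and v_f are illustrative. The second display should reproduce the degrees of freedom reported by ttest up to rounding.

```stata
* Sketch only: v_m and v_f are hypothetical scalar names
* holding each group's variance of the mean (s^2 / n).
scalar v_m = .3150719^2 / 5859
scalar v_f = .2854469^2 / 3247

* t statistic, female minus male as in the ttest output
display (7.688559 - 7.94615) / sqrt(v_m + v_f)

* Satterthwaite's degrees of freedom
display (v_m + v_f)^2 / (v_m^2/(5859 - 1) + v_f^2/(3247 - 1))

* two-sided p-value from the upper tail of the t distribution
display 2 * ttail(7272.12, 39.7296)
```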

Problem 2

. generate y = exp(ln_y_)

. sort gender

. by gender: sum y

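------------------------------------------------------------------------------
-> gender = female

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           y |      3247    2278.268     760.6658  621.9999      18791

------------------------------------------------------------------------------
-> gender = male

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           y |      5859    2980.621     1123.488  600.9999      18157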

. sum y

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           y |      9106    2730.177      1063.75  600.9999      18791

. sum ln_y_

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       ln_y_ |      9106    7.854298     .3288497  6.398595   9.841133

. display exp(7.854298)

2576.7856
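
Exponentiating the mean of ln_y_ gives the geometric mean of y, which is why it is smaller than the arithmetic mean of y reported by sum y. As a hedged cross-check, the sketch below uses the ameans command, which reports arithmetic, geometric, and harmonic means, and then repeats the calculation from the stored result r(mean).

```stata
* Sketch only: cross-check of the geometric mean of y.
* ameans lists arithmetic, geometric, and harmonic means with confidence intervals
ameans y

* the same geometric mean computed from the mean of the logs
quietly summarize ln_y_
display exp(r(mean))
```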

Problem 3

. sort gender

. by gender: regress ln_y_ s

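------------------------------------------------------------------------------
-> gender = female

      Source |       SS       df       MS              Number of obs =    3247
-------------+------------------------------           F(  1,  3245) =  534.04
       Model |  37.3757573     1  37.3757573           Prob > F      =  0.0000
    Residual |  227.108137  3245  .069987099           R-squared     =  0.1413
-------------+------------------------------           Adj R-squared =  0.1411
       Total |  264.483894  3246  .081479943           Root MSE      =  .26455

------------------------------------------------------------------------------
       ln_y_ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           s |   .0438482   .0018974    23.11   0.000     .0401279    .0475684
       _cons |   7.449845   .0113251   657.81   0.000      7.42764     7.47205
------------------------------------------------------------------------------

------------------------------------------------------------------------------
-> gender = male

      Source |       SS       df       MS              Number of obs =    5859
-------------+------------------------------           F(  1,  5857) =  810.37
       Model |  70.6800466     1  70.6800466           Prob > F      =  0.0000
    Residual |  510.845549  5857   .08721966           R-squared     =  0.1215
-------------+------------------------------           Adj R-squared =  0.1214
       Total |  581.525595  5858   .09927033           Root MSE      =  .29533

------------------------------------------------------------------------------
       ln_y_ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           s |   .0470292   .0016521    28.47   0.000     .0437906    .0502679
       _cons |   7.703346   .0093614   822.88   0.000     7.684994    7.721698
------------------------------------------------------------------------------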

Problem 4

. display 7.449845 + .0438482 * 5

7.669086

. display 7.703346 + .0470292 * 5

7.938492
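
The two display lines evaluate the fitted value of ln_y_ at s = 5 from the Problem 3 estimates (females first, then males). As a hedged alternative, the sketch below obtains the same fitted values with lincom, which also reports a standard error and confidence interval for each prediction. It assumes the combined data set with the string variable gender created in the General section.

```stata
* Sketch only: fitted ln_y_ at s = 5, with standard errors, by group.
quietly regress ln_y_ s if gender == "female"
lincom _cons + 5*s

quietly regress ln_y_ s if gender == "male"
lincom _cons + 5*s
```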

Problem 6

. display (.0470292 - .0438482) / ((.0016521^2 + .0018974^2)^(1/2))

1.2643778
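
The statistic above divides the difference between the male and female coefficients on s by the square root of the sum of their squared standard errors, treating the two samples as independent. Since 1.26 is below the 5% two-sided critical value of 1.96, equality of the two coefficients is not rejected at that level. The sketch below adds the normal-approximation p-value and, as an assumption-laden alternative, a pooled regression with an interaction term. The dummy name female is illustrative, and the pooled model imposes a common error variance, so its t statistic need not match the manual one exactly.

```stata
* Sketch only: two-sided p-value under the normal approximation
display 2 * (1 - normal(1.2643778))

* Alternative: pooled regression with an interaction ("female" is a hypothetical dummy)
generate female = (gender == "female")
regress ln_y_ c.s##i.female
* the coefficient on 1.female#c.s estimates the female minus male difference in slopes
```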

Problem 7

When s is the only regressor, its coefficient very likely suffers from omitted variable bias. One candidate omitted variable is e, provided that e affects ln_y_ and is also correlated with s. Let us see what happens to the coefficient of s when we include e as an additional regressor.

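. by gender: regress ln_y_ s e

------------------------------------------------------------------------------
-> gender = female

      Source |       SS       df       MS              Number of obs =    3247
-------------+------------------------------           F(  2,  3244) =  301.43
       Model |  41.4488911     2  20.7244456           Prob > F      =  0.0000
    Residual |  223.035003  3244  .068753084           R-squared     =  0.1567
-------------+------------------------------           Adj R-squared =  0.1562
       Total |  264.483894  3246  .081479943           Root MSE      =  .26221

------------------------------------------------------------------------------
       ln_y_ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           s |   .0526224   .0021992    23.93   0.000     .0483105    .0569342
           e |   .0065945   .0008568     7.70   0.000     .0049147    .0082744
       _cons |    7.30401   .0220225   331.66   0.000      7.26083    7.347189
------------------------------------------------------------------------------

------------------------------------------------------------------------------
-> gender = male

      Source |       SS       df       MS              Number of obs =    5859
-------------+------------------------------           F(  2,  5856) =  585.91
       Model |  96.9632034     2  48.4816017           Prob > F      =  0.0000
    Residual |  484.562392  5856   .08274631           R-squared     =  0.1667
-------------+------------------------------           Adj R-squared =  0.1665
       Total |  581.525595  5858   .09927033           Root MSE      =  .28766

------------------------------------------------------------------------------
       ln_y_ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           s |   .0603534   .0017743    34.01   0.000      .056875    .0638317
           e |   .0124004   .0006958    17.82   0.000     .0110364    .0137644
       _cons |   7.447337   .0170141   437.72   0.000     7.413983    7.480691
------------------------------------------------------------------------------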

. correlate s e

(obs=9106)

             |        s        e
-------------+------------------
           s |   1.0000
           e |  -0.4584   1.0000

. correlate s e ln_y_

(obs=9106)

             |        s        e    ln_y_
-------------+---------------------------
           s |   1.0000
           e |  -0.4584   1.0000
       ln_y_ |   0.3093   0.0061   1.0000
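
The direction of the bias follows from the OLS omitted-variable identity: the coefficient on s in the short regression equals the coefficient on s in the long regression plus the coefficient on e times the slope from regressing e on s. The sketch below backs out that implied slope from the estimates reported in Problems 3 and 7. Both values are negative, in line with the negative correlation between s and e, so omitting e pushes the coefficient on s downward. The auxiliary regression in the last line is an assumption about how one would verify this directly; it should reproduce the implied slopes up to rounding.

```stata
* Sketch only: implied slope of e on s = (short coef. on s - long coef. on s) / coef. on e
* females
display (.0438482 - .0526224) / .0065945
* males
display (.0470292 - .0603534) / .0124004

* direct check with the auxiliary regression of e on s within each group
by gender: regress e s
```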

Problem 8

We can run different univariate regressions and compare their R-squared with that of the model that uses s as the regressor. Let us try one:

. sort gender

. by gender: regress ln_y_ e

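------------------------------------------------------------------------------
-> gender = female

      Source |       SS       df       MS              Number of obs =    3247
-------------+------------------------------           F(  1,  3245) =   25.76
       Model |  2.08275579     1  2.08275579           Prob > F      =  0.0000
    Residual |  262.401138  3245  .080863217           R-squared     =  0.0079
-------------+------------------------------           Adj R-squared =  0.0076
       Total |  264.483894  3246  .081479943           Root MSE      =  .28436

------------------------------------------------------------------------------
       ln_y_ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           e |  -.0040326   .0007946    -5.08   0.000    -.0055906   -.0024747
       _cons |   7.748528   .0128269   604.08   0.000     7.723378    7.773677
------------------------------------------------------------------------------

------------------------------------------------------------------------------
-> gender = male

      Source |       SS       df       MS              Number of obs =    5859
-------------+------------------------------           F(  1,  5857) =   12.37
       Model |  1.22562115     1  1.22562115           Prob > F      =  0.0004
    Residual |  580.299974  5857  .099078022           R-squared     =  0.0021
-------------+------------------------------           Adj R-squared =  0.0019
       Total |  581.525595  5858   .09927033           Root MSE      =  .31477

------------------------------------------------------------------------------
       ln_y_ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           e |   .0024285   .0006905     3.52   0.000     .0010749     .003782
       _cons |   7.909485   .0112063   705.81   0.000     7.887517    7.931454
------------------------------------------------------------------------------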
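
With e as the only regressor, R-squared is 0.0079 for females and 0.0021 for males, far below the 0.1413 and 0.1215 obtained with s in Problem 3. To repeat this comparison for several candidate regressors without reading each table, a loop along the following lines may help. It is a sketch that assumes the combined data set with the string variable gender; the variable list contains only s and e because those are the only regressors used here.

```stata
* Sketch only: print R-squared of each univariate regression by group.
foreach x of varlist s e {
    foreach g in female male {
        quietly regress ln_y_ `x' if gender == "`g'"
        display "`x' (`g'): R-squared = " %6.4f e(r2)
    }
}
```

Other candidate variables from the data set can be added to the varlist in the first line.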