Department of Sociology and Anthropology

Carleton College

Statistical Tools for Quantitative Reasoning Peter D. Brandon

Soan 280 Leighton Hall 229

Winter Term 2009 Phone: 222-7199

Email:

DATA SOURCES FOR ASSIGNMENTS, FINAL PAPERS, AND TEAM POSTERS

Notes: *You cannot use the assignment data sets for either final papers or team posters.

*You cannot use the data set your team uses for the poster for your final paper.

*The hints are guides only; you can make other discoveries from which to argue.

A: DATA SETS FOR TEAM POSTERS OR FINAL PAPERS

1.  AIDS Survival in Australia (AidsinAustralia.dta)

(Hint: What factors could you assert make death less likely?)

The variables include state (NSW, QLD, VIC and other), sex (M, F), date of diagnosis, date of death, status (A for alive, D for dead), transmission category (as explained in the table below), age (in completed years), died (again, whether the patient died), days of survival, years of survival, identification number, exposure time to the AZT drug, and whether or not the patient received the AZT drug. Note that state, sex, status, and transmission are string variables.

Code / Description of transmission
Hs / male homosexual or bisexual contact
hsid / as above and also intravenous drug user
Id / female or heterosexual male intravenous drug user
Het / heterosexual contact
haem / haemophilia or coagulation disorder
blood / receipt of blood, blood components or tissue
mother / mother with or at risk of HIV infection
other / other or unknown

The data were assembled in January 1992, but to allow for delays in notification of deaths the effective ending date is six months earlier. The file includes all patients diagnosed prior to July 1991, with their status as of that date. There are 2843 patients and 1761 deaths. All dates are coded as elapsed days since Jan 1, 1960, which also happens to be the way Stata stores dates. There are 29 cases that were diagnosed after death and are coded as zero survival.
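The date coding described above can be illustrated with a short Python sketch. It assumes day 0 is January 1, 1960 (the origin Stata uses for daily dates); the function names are hypothetical and are not part of the data file.

```python
from datetime import date, timedelta

# Assumption: day 0 is January 1, 1960, the origin Stata uses for daily dates.
ORIGIN = date(1960, 1, 1)

def days_to_date(elapsed_days: int) -> date:
    """Convert an elapsed-day count into a calendar date."""
    return ORIGIN + timedelta(days=elapsed_days)

def date_to_days(d: date) -> int:
    """Convert a calendar date into elapsed days since the origin."""
    return (d - ORIGIN).days

def survival_days(diagnosis_day: int, end_day: int) -> int:
    """Days from diagnosis to death or end of follow-up.

    Cases diagnosed after death are coded as zero survival.
    """
    return max(end_day - diagnosis_day, 0)

# The effective ending date of the study, July 1, 1991, as an elapsed-day count.
study_end = date_to_days(date(1991, 7, 1))
print(study_end)  # 11504
```

Under this assumption, a patient still alive at the effective ending date would have survival measured up to day 11504.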

2.  Divorce in America (Divorceinamerica.dta)

(Hint: Are black males more likely to divorce than others?)

You should assess this data set on its strengths and weaknesses. There is no link to a data description for confidentiality reasons. However, it is a rich data set that still allows you to evaluate its strengths and weaknesses. The data set has 3,371 couple observations and 18 variables as described below.

Variable Name Variable Description

id Unique respondent's id

marnum Indicator of first marriage

censor Censoring indicator (1=censored,0=divorced)

hiseduc Husband's education (in years of schooling)

hereduc Wife's education (in years of schooling)

heblack Indicator for whether the husband is African American

sheblack Indicator for whether the wife is African American

age Age of husband (at marriage)

agediff Age difference between husband and wife

durat Duration of marriage in years

fail Marriage ends in divorce

husbandeduc Husband graduated from high school

wifeeduc Wife graduated from high school

heold Husband is at least 3 years older than his wife

mixmar 1 if a mixed racial marriage

whitemarr Couple are both white

blackmarr Couple are both black

educdiff Years of education differs by at least 2 years between spouses
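Several of the indicators above appear to be derived from other variables in the file. The Python sketch below shows one plausible construction. It rests on assumptions that the data description does not confirm: that agediff is the husband's age minus the wife's age, that fail is the complement of censor, and that mixmar can be approximated from the two race indicators. Check these against the data before relying on them.

```python
def derive_indicators(couple: dict) -> dict:
    """Rebuild derived indicators for one couple (a sketch under stated assumptions)."""
    return {
        # Assumption: censor = 1 means censored, so a divorce is its complement.
        "fail": 1 - couple["censor"],
        # Assumption: agediff = husband's age minus wife's age, in years.
        "heold": int(couple["agediff"] >= 3),
        # Education differs by at least 2 years in either direction.
        "educdiff": int(abs(couple["hiseduc"] - couple["hereduc"]) >= 2),
        # Assumption: a mixed marriage is one where the two race indicators differ.
        "mixmar": int(couple["heblack"] != couple["sheblack"]),
        "blackmarr": int(couple["heblack"] == 1 and couple["sheblack"] == 1),
    }

# A made-up couple, for illustration only.
example = {"censor": 0, "agediff": 4, "hiseduc": 12, "hereduc": 16,
           "heblack": 1, "sheblack": 1}
print(derive_indicators(example))
```

Comparing rebuilt indicators like these with the ones stored in the file is one way to evaluate the data set's internal consistency.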

3.  Recidivism (Recid.dta)

(Hint: Are minorities more likely to return to U.S. prisons?)

The dataset considered here is analyzed in Wooldridge (2002) and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest for our purposes is whether they returned to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum possible length of observation is 81 months. To learn more about these data and to evaluate the data source, refer to the citations above.

Black / =1 if black
Alcohol / =1 if alcohol problems
Drugs / =1 if drug history
Super / =1 if release supervised
Married / =1 if married when incarc.
Felon / =1 if felony sentence
Workprg / =1 if in N.C. pris. work prg.
Property / =1 if property crime
Person / =1 if crime against person
Priors / # prior convictions
Educ / years of schooling
Rules / # rules violations in prison
Age / in months
Tserved / time served, rounded to months
Follow / length follow period, months
Durat / min(time until return, follow)
backinprison / =1 if duration right censored
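The follow-up window and the censored duration can be worked through in a short Python sketch. It assumes the usual right-censoring construction, in which the observed duration is the smaller of the time until return and the follow period. The function names are hypothetical, and the month arithmetic ignores the day of the month.

```python
def months_between(start_year, start_month, end_year, end_month):
    """Whole calendar months from a start month to an end month."""
    return (end_year - start_year) * 12 + (end_month - start_month)

def observed_duration(months_until_return, follow):
    """Return (duration, censored) for one released convict.

    months_until_return is None when no return appears in the records.
    Assumption: a spell is right censored when no return is observed
    within the follow period.
    """
    if months_until_return is None or months_until_return > follow:
        return follow, 1
    return months_until_return, 0

# Earliest release (July 1977) and latest release (June 1978),
# both followed until the records were examined in April 1984.
longest_follow = months_between(1977, 7, 1984, 4)
shortest_follow = months_between(1978, 6, 1984, 4)
print(longest_follow, shortest_follow)  # 81 70
```

The 81-month figure matches the maximum length of observation stated above; the 70-month figure is the corresponding minimum implied by the release dates.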

4.  Women and Partnering Arrangements (Livingarrangement.dta)

(Hint: Are poor white women more likely to cohabit than any other group of people?)

The dataset considered here is drawn from the 2001 Survey of Income and Program Participation, Wave 1 data. Go to the Census Bureau website to learn more about the strengths and weaknesses of the SIPP and its nature. In this data set there are 23 variables and 5,925 observations of adult men and women. The SIPP is a rich source of data. Again, please go to the Census Bureau website to learn more about the SIPP; it is well documented!

Variable name / Variable label
nkid014 / Number of kids in house age 0 to 14
nkid017 / Number of kids in house age 0 to 17
id / Person identifier
base_sex / 1 if sex =
base_age / Age measured in years
base_race / Race of respondents
base_ms / Marital status of respondent
base_state / State residing in
emp_status / Employment status
emp_njobs / Number of jobs currently working
emp_disab / Person is unable to work due to disability
hh_hnf / number of families in household
hh_fnp / number of persons in family
hh_fkind / Kind of family
hh_fnkids / Number of children in household
hh_htenure / housing tenure
hh_hpubhs / yesno 1 or 0 living in public housing
inc_htotinc / total household income
inc_hpov / yesno 1 or 0 household poverty threshold
inc_hIsPov / yesno 1 or 0 household is in poverty
inc_hRelPov / relpov relative household poverty status
educ_highQual1 / Highest level of education
cohabiting / Cohabit rather than marry

5.  Allocation of time to house work (Timeuse.dta)

(Hint: Is there gender equity in the allocation of time to household work?)

This time use dataset is from Australia. It is the 1997 Australian Time Use Study. It is a famous data set that can be used to examine the time allocations of Australians, i.e., how they spend their time. Go to the Australian Bureau of Statistics website for many more details about this data set. In this data set there are 19 variables and 4,926 observations of adult men and women.

Variable / Variable label
randidp / random identifier at person level
famtype / family type
areacd / capital city and balance of state
sex / sex of person
marstat / marital status
coubirth / Birthplace
curstu / currently studying
anyben / Received gov’t pensions, benefits or allowance
totinc / weekly income
fulstat / full-time/part-time status
statwork / status in employment
uhw / hours worked per week
houtypeh / household structure
famsu / number families in household
pers / number of persons including children
depkids / number of dependants in household
emphh / labour force status of reference person
empsphh / labour force status of spouse
pinddom2 / Average time spent doing housework

6.  Out-of-School Child Care for School-Aged Children (SchAgechildcare.dta)

(Hint: Do single mothers leave their children to care for themselves after school?)

The dataset considered here is drawn from the 2001 Survey of Income and Program Participation. It is from the Child Care module data. Go to the Census Bureau website to learn more about the strengths and weaknesses of the SIPP and its nature. In this data set there are 30 variables and 43,388 observations of women with dependent children. The SIPP is a rich source of data. There are many questions you could ask in addition to the one above. Because this is a larger data set with more variables, please open the data set to examine the variables in the file.

7.  Labor unions in America (CPS88.dta)

(Hints: Do union workers earn premium wages? What determines union membership in America?)

CPS88 comes from the Current Population Survey. It is a random sample, with replacement, of 1,000 observations from a sample of males with non-missing information on all the 11 variables in the data set.

Variable Variable label

AGE (you know what)

LNWAGE Log of wage

OCC1 Dummy variable for occupational category (see below)

IND1 Dummy variable for industrial category (see below)

UNION 1 if union member, 0 otherwise

GRADE highest educational grade completed

MARRIED 1 if married, 0 otherwise

PARTT 1 if part-time worker, 0 otherwise

POTEXP Years of potential experience

EXP2 POTEXP squared

WEIGHT Sampling weight

HIGH "Highly" unionized industry (IND1 equals 1,2,3,4,5,10,11, or 14)

Categories for OCC1 are:
1 Managers and administrators
2 Professionals
3 Nurses and other non-doctors
4 Clerical
5 Sales people
6 Service workers
7 Manual workers
8 Craft workers

Categories for IND1 are:
1 Natural resources
2 Durables
3 Non-durables
4 Construction
5 Transportation
6 Communication and utilities
7 Wholesale trade
8 Retail trade
9 Finance, Insurance, Real Estate
10 Education
11 Health and Welfare
12 Business services
13 Personal and other services
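The derived variables in CPS88 can be sketched in Python as follows. The set of industry codes is copied from the description of HIGH above. The function names are hypothetical. The conversion from a log-wage gap to a percentage is the standard exp(gap) - 1 formula, offered here as one way to read a union coefficient on LNWAGE rather than as the required method.

```python
import math

# Industry codes listed in the description of HIGH.
HIGH_UNION_INDUSTRIES = {1, 2, 3, 4, 5, 10, 11, 14}

def high(ind1: int) -> int:
    """1 if the industry code is in the "highly" unionized group, else 0."""
    return int(ind1 in HIGH_UNION_INDUSTRIES)

def exp2(potexp: float) -> float:
    """Potential experience squared, as in EXP2."""
    return potexp ** 2

def log_gap_to_percent(log_gap: float) -> float:
    """Convert a difference in log wages into a percentage wage difference."""
    return (math.exp(log_gap) - 1) * 100

print(high(4), high(8))          # Construction is in the group; Retail trade is not
print(exp2(10))
print(log_gap_to_percent(0.0))   # a zero log gap means no premium
```

For small gaps the log difference and the percentage are close, but they diverge as the gap grows, so the conversion matters when you report a union premium.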

8.  Affording non-parental child care (Childcare.dta)

(Hint: Do more educated mothers spend more on non-parental child care?)

The dataset considered here is drawn from the 1996 Survey of Income and Program Participation. It is from the Child Care module data. Go to the Census Bureau website to learn more about the strengths and weaknesses of the SIPP and its nature. In this data set there are 30 variables and 3,242 observations of mothers with dependent children. Some use non-parental child care. Note that the variable labels are not on the file. You can add them easily in Stata.

Variable / Variable label
totsibs / Total number of siblings in the family
faminc / Family income
eduksun / Mother’s education
avccpce / Average price of child care per hour
hersalre / Mother’s monthly salary
famsize / Number of persons in family
race / Race of mother
povlev / Whether below poverty or not
dumysth / Live in South
dumywst / Live in West
dumymws / Live in Mid-West
dumynre / Live in North-East
herwkhrs / Mother’s weekly work hours
chcrcs / Monthly expenditure on hours of child care
totcchrs / Total number of hours per month spent in non-parental child care
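Because the variable labels are not on the file, one convenient approach is to generate the Stata labeling commands from the table above. The Python sketch below builds one `label variable` command per variable, which you could paste into a Stata do-file. The dictionary and function names are hypothetical; the label text is taken from the table.

```python
# Variable names and labels copied from the table above.
LABELS = {
    "totsibs": "Total number of siblings in the family",
    "faminc": "Family income",
    "eduksun": "Mother's education",
    "avccpce": "Average price of child care per hour",
    "hersalre": "Mother's monthly salary",
    "famsize": "Number of persons in family",
    "race": "Race of mother",
    "povlev": "Whether below poverty or not",
    "dumysth": "Live in South",
    "dumywst": "Live in West",
    "dumymws": "Live in Mid-West",
    "dumynre": "Live in North-East",
    "herwkhrs": "Mother's weekly work hours",
    "chcrcs": "Monthly expenditure on hours of child care",
    "totcchrs": "Total number of hours per month spent in non-parental child care",
}

def label_commands(labels: dict) -> list:
    """Build one Stata `label variable` command per variable."""
    return [f'label variable {name} "{text}"' for name, text in labels.items()]

commands = label_commands(LABELS)
for command in commands:
    print(command)
```

Each printed line has the form `label variable varname "label text"`, which is the Stata syntax for attaching a label to a variable.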

B: DATA SETS FOR ASSIGNMENTS

B.1.a CPS78.dta

This data set consists of 550 randomly selected employed workers from the May 1978 Current Population Survey conducted by the U.S. Department of Commerce. This is a survey of over 50,000 households conducted monthly, and it serves as the basis for the national employment and unemployment statistics. Data are collected on a number of individual characteristics as well as employment status. This data extract contains information on twenty-one variables for the 550 employed workers in the sample.

1. ED = years of education

2. SOUTH = 1 if lives in south

3. NONWH = 1 if nonwhite

4. HISP = 1 if Hispanic

5. FE = 1 if female

6. MARR = 1 if married with spouse present (in household)

7. MARRFE = 1 if married female with spouse present

8. EX = years of labor market experience (= AGE-ED-6)

9. EXSQ = years of labor market experience squared

10. UNION = 1 if working on a union job

11. LNWAGE = natural logarithm of average hourly earnings

12. AGE = age in years

13. NDEP = # of dependent children under 18 in household

14. MANUF = 1 if working in manufacturing industry

15. CONSTR = 1 if working in construction industry

16. MANAG = 1 if occupation is managerial or administrative

17. SALES = 1 if occupation is sales worker

18. CLER = 1 if occupation is clerical worker

19. SERV = 1 if occupation is service worker

20. PROF = 1 if occupation is professional/technical worker

B.1.b CPS85.dta. Same principle as above. Number of Observations: 534

1. ED = years of education

2. SOUTH = 1 if lives in south

3. NONWH = 1 if nonwhite

4. HISP = 1 if Hispanic

5. FE = 1 if female

6. MARR = 1 if married with spouse present (in household)

7. MARRFE = 1 if married female with spouse present

8. EX = years of labor market experience (= AGE-ED-6) (minimum = 0 imposed ex post)

9. EXSQ = years of labor market experience squared

10. UNION = 1 if working on a union job

11. LNWAGE = natural logarithm of average hourly earnings

12. AGE = age in years

13. MANUF = 1 if working in manufacturing industry

14. CONSTR = 1 if working in construction industry

15. MANAG = 1 if occupation is managerial or administrative

16. SALES = 1 if occupation is sales worker

17. CLER = 1 if occupation is clerical worker

18. SERV = 1 if occupation is service worker

19. PROF = 1 if occupation is professional/technical worker
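The constructed variables in the CPS78 and CPS85 extracts follow simple formulas, sketched below in Python. The function names and the `floor_at_zero` flag are hypothetical. The flag reflects the note that CPS85 imposes a minimum of 0 on EX after the fact. Treating MARRFE as the product of MARR and FE is an assumption that is consistent with the variable definitions above.

```python
import math

def potential_experience(age: int, ed: int, floor_at_zero: bool = False) -> int:
    """EX = AGE - ED - 6, optionally floored at zero as in CPS85."""
    ex = age - ed - 6
    return max(ex, 0) if floor_at_zero else ex

def derived_variables(age, ed, fe, marr, hourly_wage, floor_at_zero=False):
    """Build EX, EXSQ, MARRFE and LNWAGE for one worker."""
    ex = potential_experience(age, ed, floor_at_zero)
    return {
        "EX": ex,
        "EXSQ": ex ** 2,
        # Assumption: married female with spouse present = MARR * FE.
        "MARRFE": marr * fe,
        "LNWAGE": math.log(hourly_wage),
    }

# A made-up worker: age 30, 12 years of education, married female.
print(derived_variables(age=30, ed=12, fe=1, marr=1, hourly_wage=1.0))
# A young worker with long schooling: the raw formula goes negative.
print(potential_experience(20, 16), potential_experience(20, 16, floor_at_zero=True))
```

The second print shows why the floor matters: without it, a 20-year-old with 16 years of schooling would have negative experience.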

B.2 Whether Women Work and How Much They Get Paid (Womenandwork.dta)

This is the famous Mroz data file taken from the 1976 Panel Study of Income Dynamics, and is based on the data for the previous year, 1975. This data file contains 753 observations on married white women aged 30-60 in 1975 for 19 variables. The first 428 observations are those for women whose hours of work in 1975 were positive, while the final 325 observations are those for women who did not work for pay in 1975. The first variable, LFP, is a labor force participation dummy variable that equals 1 if the woman's hours of work in 1975 were positive; otherwise, it equals zero. WHRS is the wife's hours of work in 1975, while KL6 and K618 indicate the number of children in the household under age six and between ages six and 18, respectively. WA is the wife's age in years, WE is the wife's educational attainment in years of schooling, WW is the wife's 1975 average hourly earnings in 1975 dollars, and RPWG is the wife's wage reported at the time of the 1976 interview, in dollars. The HHRS variable is the husband's hours worked in 1975, HA is his age, HE is his educational attainment in years of schooling, and HW is his 1975 wage in 1975 dollars. FAMINC is the family income in 1975 dollars; hence to calculate the wife's property income, one must subtract the product of WW and WHRS from FAMINC. MTR is the wife's marginal tax rate evaluated if her hours of work were zero. MTR is taken from published federal tax tables (it excludes state and local income taxes but includes any applicable social security benefits). WMED is the wife's mother's years of schooling, and WFED is the wife's father's years of schooling. UN is the unemployment rate in the county of residence, in percentage points, while CIT is a dummy variable that equals 1 if the family lives in a large city (a Standard Metropolitan Statistical Area, SMSA); otherwise, it equals zero. Finally, AX is the wife's previous labor market experience, in years.
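The property income calculation described above can be written out as a small Python sketch. The function names and the example figures are hypothetical; the formula (FAMINC minus the product of WW and WHRS) and the definition of LFP come from the description.

```python
def labor_force_participation(whrs: float) -> int:
    """LFP: 1 if the wife's 1975 hours of work were positive, else 0."""
    return int(whrs > 0)

def wife_earnings(ww: float, whrs: float) -> float:
    """Wife's 1975 earnings: average hourly earnings times hours worked."""
    return ww * whrs

def property_income(faminc: float, ww: float, whrs: float) -> float:
    """Family income minus the wife's earnings, as described above."""
    return faminc - wife_earnings(ww, whrs)

# Made-up figures for illustration: family income of $20,000,
# hourly earnings of $4.00, and 1,500 hours worked.
print(property_income(20000, 4.0, 1500))  # 14000.0
print(labor_force_participation(1500), labor_force_participation(0))
```

For women who did not work for pay, WHRS is zero, so the calculation returns FAMINC unchanged.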