Econ 253

Data Downloading Workshop: Emphasis on Stata

Steps:

Select variables in DataFerrett

Download data: two options

Tab delimited format -> Excel

Stata format -> Stata (.do and .asc files) [Discussed in this handout]

Recode the data

Excel: IF() and VLOOKUP() functions

Stata: Generate, Replace, Drop statements

Keep a careful record of what you have done!

Excel documentation

Stata log files

Resources:

N:\eco253\commons\Basic Tools\InternetData\CPS has several useful files.

CPS.doc: general discussion of using DataFerrett

CPSEducRecode.xls: how to recode the education variable

MeasuringPayCPS.doc and ObtainingHourlyWages.xls: How to get hourly wage data from the CPS.

An example:

Research Question: What are the determinants of health status?

Data set: Behavioral Risk Factor Surveillance System 1996 Wave

Variables: Health, Age, Education level, Sex.

Steps:

  • Open DataFerrett
  • Find the data set.
  • Find the variables
  • Download the variables and the codebook.
  • Decide whether to download in Excel (tab-delimited) or Stata

To download in Stata, hit the Download button under Step 2 and choose the following options:

You will see something like the following in an Internet Explorer Window:

NOTE: Use the right mouse button on the file link to "Save Target As:/ Save Link As:" which will download your file(s) to your computer

Extracted: 123579 record(s) for 1996

  • Data File for 1996 howlandffu1uyjya1.asc
  • STATA Command File for 1996 howlandffu1uyjya1.do
    Extracted: 123579 record(s) for 1996
  • Data File for 1996 howlandffu1uykdb2.asc
  • STATA Command File for 1996 howlandffu1uykdb2.do

Each Data File is a flat ASCII file that will be converted into a STATA dataset when you submit the corresponding STATA Command File.

Your task is to download these files, rename them to something easier to work with, alter the do file by changing two names of files, then run the do file to read the data into Stata.

(1)Open Stata.

(2)Open a log file via a pull down menu

(3)Give the log file an easy to work with name:

(4)Set memory to 100 megabytes or more:

set mem 100m

(5)Open a new do file editor:

(6)Within the editor open up the do file you downloaded

(7)The do file looks like this (I leave out part of the file and highlight the things to change in bold):

(8)*/************************************************/

(9)*/* Please scroll down to the INFIX step for */

(10)*/* instructions related to providing full path */

(11)*/* name for the input data file. */

(12)*/************************************************/

(13)*/ then scoll to the end of the file and provide the */

(14)*/ path for where to store the data when finished */

(15)

(16)#delimit ;

(17)

(18)

(19)label define FM0X

(20)1 "Excellent"

(21)2 "Very Good"

(22)3 "Good"

(23)4 "Fair"

(24)5 "Poor"

(25)7 "Don''t know/Not sure"

(26)9 "Refused"

(27);[I’ve left out many lines of code]

(28)/************************************************/

(29)/* You will need to change the "using" */

(30)/* statement at the end of the "infix" command */

(31)/* statement to include the full directory path */

(32)/* of the directory where you saved the ASCII */

(33)/* data file. For example: */

(34)/* infix... */

(35)/* using "C:\My Documents\bearhardlo5u2ic1.asc";*/

(36)/********************************************************/

(37)

(38)infix

(39)doubleGENHLTH 1 - 11

(40)AGE 12 - 22

(41)EDUCA 23 - 33

(42)SEX 34 - 44

(43)

(44)using "c:\Documents and Setting\bearnhardlo5u2ic1.asc"

(45);

(46)

(47)label variable GENHLTH "General Health";

(48)label variable AGE "Reported Age in Years";

(49)label variable EDUCA "Education Level";

(50)label variable SEX "Sex";

(51)label values GENHLTH FM0X;

(52)label values AGE FM1X;

(53)label values EDUCA FM2X;

(54)label values SEX FM3X;

(55)

(56)save "c:\documents and setting\bearhardlo5u2ic1",

(57)replace;

(58)

(59)describe;

Note the following:

  • There are no lines (1) to (7)—that’s just Microsoft Windows adding numbers to my list. I don’t know how to change this.
  • At line (27) I deleted a bunch of code which creates labels.
  • Line (39) contains a typo which appears in all the do files I have seen. This is a bug in the program. You want to separate the word “double” which tells Stata what kind of variables we are dealing with from the variable name “GENHELTH”. Put in a return so that it reads like this:

double

GENHLTH 1 - 11

I stress that this happens at this place in the code no matter what data set you download.

You also need to change the location where the dataset has been downloaded. That’s at line (44). I renamed the file healtha1.asc and put it in the N: drive . To get the exact code I tried to open the data set:

This command produced the following failure in Stata:

use "N:\eco253\howlandf\healtha1.asc", clear

file N:\eco253\howlandf\health1.asc not Stata format

but I don’t care. I now know two things: first what I should type in to my do file, second what directory I am working in.

I see that I am in the N drive. Execute a command like this:

cd N:\eco253\howlandf\

Of course you’ll put in your directory address, which will be different from N:\eco253\howlandf\

Having changed the directory, you can now make line (44) really simple:

using "healtha1.asc"

Finally we need to tell Stata where to save the Stata .dta file (a Stata data file). This is in line (56), which is just

save "health", replace;

Make the changes to the do file and save the do file:

Then run the do file:

In my case this worked and Stata spat out the program in the results window. At the end, I got text like this:

. save "health",

> replace;

file health.dta saved

. describe;

Contains data from health.dta

obs: 123,579

vars: 4 27 Apr 2009 21:04

size: 2,965,896 (97.2% of memory free)

------

storage display value

variable name type format label variable label

------

GENHLTH double %20.0g FM0X General Health

AGE float %20.0g FM1X Reported Age in Years

EDUCA float %55.0g FM2X Education Level

SEX float %9.0g FM3X Sex

------

Sorted by:

.

end of do-file

One thing to note is that Stata data reports (like tabulate) will use the labels rather than the original numbers. Thus, in our example, sum and tab give different results for the variable SEX:

. sum SEX

Variable | Obs Mean Std. Dev. Min Max

------+------

SEX | 123579 1.587632 .4922627 1 2

. tab SEX

Sex | Freq. Percent Cum.

------+------

Male | 50,960 41.24 41.24

Female | 72,619 58.76 100.00

------+------

Total | 123,579 100.00

Male is 1 and Female is 2 in the variable SEX.

One way to find out what the numbers behind a label are is to use the label list command. You have to know the label name. In this case the do file defined a label called FM2X and applied it to the variable EDUCA:

. label list FM2X

Here is the output:

FM2X:

1 Never attended school or only kindergarten

2 Grades 1 through 8 (Elementary)

3 Grades 9 through 11 (Some high school)

4 Grade 12 or GED (High school graduate)

5 College 1 to 3 years (Some college or technical school)

6 College 4 years or more (College graduate)

9 Refused

1