Stata Tutorial
Dan Edelstein
Kristi Thompson
Data and Statistical Services
Stata
Stata is a statistical analysis package, used for exploring, graphing, summarizing and manipulating data files. There are commands built into Stata that allow the user to do statistical analysis such as cross-tabulation and regression on data sets.
Data Sets
A data set is just a file organized into rows and columns. The rows represent individual units, or observations. An observation might be a person, a stock, a district, or a company. The columns represent variables. A variable is a piece of information that has been recorded about each observation.
As an example, we will be using a file with some data about people. Each observation is a person, and the variables include each person's name, income, age, and race.
Getting Data into Stata
Stata has its own native format for datasets, and files in this format can be read directly into Stata with the "use" command or by using the "Open" command on the File menu. Files in Stata format have the file extension ".dta".
To read Excel Files into Stata:
- Open the file in Excel.
- Make sure that missing data values are coded as blank or as numeric values (e.g., 999 or -1). Do not use periods (.) or any other character values (e.g., N/A) to represent missing data. Simply leave the cells empty or code a number.
- Make sure that there are no commas in the numbers. You can change this under Format menu, then select Cells....
- Make sure that variable names are included only in the first row of your spreadsheet. There should be only one row of variable names (some files produced by databases have several header rows). Variable names should be 32 characters or less, start with a letter and contain no special characters, such as $ or &, except the underscore [_]. You should eliminate embedded blanks (spaces).
- Under the File menu, select Save As. Then Save as type 'CSV' (comma separated values). The file will be saved with a .csv extension, for example "sample.csv."
- Read the file in to Stata using the "insheet" command
Other file types can be read into Stata, or converted to Stata format using conversion programs, depending on the type of file. The data lab can provide assistance.
Problem Spreadsheet
Cleaned Spreadsheet
For rest of this tutorial we will be using a sample spreadsheet, sample.csv, that can be downloaded from the AP501 blackboard site. Start by downloading this spreadsheet from the web site and save it to your H: drive. In this tutorial, commands you should type will be show like this:
.stata command
Output that stata gives you will be shown below the commands, on lines that do not start with periods.
Commands, logs, help
Commands can be executed one at a time at the Stata prompt (command window in Stata for Windows). Just type the command and hit the "Enter" key. Alternatively, groups of commands can be entered into do files which can then be executed.
If you are not sure which command you need, you can type "search" and a keyword. Stata will display a list of commands and other resources associated with that keyword, if there are any. Click on the name of one of the commands or resources to display the help screen.
Alternatively, you can use the "Help" menu. Click on "Stata Command" if you know the command or "Search" if you don't.
Opening Stata
Stata starts in its default working folder or directory, typically C:\data or C:\stata. If you don't change to something else, Stata will assume that any file name you type is in the default directory. Since normally your data will be in other directories, you need the cd (Change Directory) command:
. search cd
. cd "h:\"
"h:\" is where you saved the data spreadsheet. Now that you have told stata to work there, you can read it in with a simple "insheet" command. But before you do that, you need to open a log file:
. log using stata1.log
Because you first cd'd to your h:\ drive, the log file "stata1.log" will be saved there. Everything you type into Stata (commands) and all the output Stata gives you will be saved in this file, exactly as it appeared on the screen. This is important for two reasons:
- You will have a record of your results to refer to and paste in to your report when you write up your analysis.
- You will be able to see exactly what happened if you realize you made a mistake.
A log file with extension ".log" is a plain text file. This means you can open and read it in MS Word or almost any other program. If you just issued the command
. log using stata1
Stata would create the file "stata1.smcl", which is a file type that can only be opened by Stata. This is usually inconvenient, so don't.
Describing and listing data
Now that a log file is open, read the data into Stata, then use the "describe" command to find out something about it.
. insheet using sample.csv
. describe
The describe command gives the names of each variable in your dataset and also gives the variable type.
There are two types of variables in Stata: numeric and string. String variables are indicated by variable types that begin with "str", followed by a number indicating the maximum length of the string: str9, str5, etc. Different sizes of number have different labels in the variable type column: byte, int, long, double, float. Numeric variables are simple – they contain numbers. String variables contain text which can contain any characters on the keyboard: letters, numbers, and special characters. We can do numeric calculations and statistical analysis on numeric variables – we can’t on string variables. So it's important to check your data to make sure that variables that should be numeric actually are!
The list command lists values of the different variables in your dataset.
. list name
Lists names for everyone in your dataset.
. list in 1/2
Lists the values of all variables for the first two observations. The lower number / higher number syntax can be used with other commands too.
. list in 5/10
. list
Lists the value of every variable for every observation in the dataset-- usually not a good idea!
. l sex income
Lists the value of sex and income for every observation. l (the letter L) is an abbreviation for list.
. l name if income<50000
Lists the names of people with an income of less than 50,000. Note the use of the less-than symbol to specify a condition. We'll see more of this later.
Basic descriptive statistics
Two commands that are useful for getting basic descriptive statistics for your variables are summarize and tabulate (abbreviated sum and tab respectively). sum gives the number of valid observations, mean, standard deviation, minimum and maximum values for any variables you specify. You can do the entire dataset at once:
. sum
or just a subset of variables:
. sum income age
The tabulate (or tab) command gives you a frequency distribution (for one variable) or a crosstabulation (for two).
. tab sex
sex | Freq. Percent Cum.
------+------
Female | 10 50.00 50.00
Male | 10 50.00 100.00
------+------
Total | 20 100.00
The summarize command can be combined with the tabulate command to produce summaries of one variable for each value of another. The following table shows separate summaries of income for males and females.
. tab sex, sum(income)
| Summary of income
sex | Mean Std. Dev. Freq.
------+------
Female | 27100 12251.531 10
Male | 45444.444 24734.142 9
------+------
Total | 35789.474 20868.847 19
. tab sex race
Crosstabulation showing values of race for each level of sex.
. tab sex race, row col
Same as above, but with row and column percentages.
You can combine a crosstabulation with the summarize command, just like we earlier combined a simple tab with summarize:
. tab sex race, sum(income)
Means, Standard Deviations and Frequencies of income
| race
sex | Black Other White | Total
------+------+------
Female | 19666.667 28500 31000 | 27100
| 577.35027 2121.3203 16583.124 | 12251.531
| 3 2 5 | 10
------+------+------
Male | 32666.667 50000 52200 | 45444.444
| 12220.202 0 31051.57 | 24734.142
| 3 1 5 | 9
------+------+------
Total | 26166.667 35666.667 41600 | 35789.474
| 10515.069 12503.333 25992.307 | 20868.847
| 6 3 10 | 19
. tab sex race, row col chi2
Same as the earlier crosstab, but calculates Pearson's chi-squared for the hypothesis that race and sex are independent.
| race
sex | White Black Other | Total
------+------+------
Female | 5 3 2 | 10
| 50.00 30.00 20.00 | 100.00
| 50.00 42.86 66.67 | 50.00
------+------+------
Male | 5 4 1 | 10
| 50.00 40.00 10.00 | 100.00
| 50.00 57.14 33.33 | 50.00
------+------+------
Total | 10 7 3 | 20
| 50.00 35.00 15.00 | 100.00
| 100.00 100.00 100.00 | 100.00
Pearson chi2(2) = 0.4762 Pr = 0.788
(Notice that we've started getting into performing statistical tests!) The tab command usually only makes sense for categorical variables. Trying to tab an income variable with hundreds of different values wouldn't be very enlightening. Stata sensibly refuses to do a two-way table when the table would be excessively large.
Statistical tests
To test whether there is a significant difference in the means between two groups, use the command ttest:
. ttest income, by(sex)
Two-sample t test with equal variances
------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
------+------
Female | 10 27100 3874.274 12251.53 18335.78 35864.22
Male | 9 45444.44 8244.714 24734.14 26432.1 64456.79
------+------
combined | 19 35789.47 4787.642 20868.85 25731.01 45847.94
------+------
diff | -18344.44 8806.464 -36924.46 235.5702
------
diff = mean(Female) - mean(Male) t = -2.0831
Ho: diff = 0 degrees of freedom = 17
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.0263 Pr(|T| > |t|) = 0.0527 Pr(T > t) = 0.9737
The by() construct is used to specify the variable that divides the dataset into two groups. If the sex variable had three categories, say Male, Female, and Unknown, the ttest command would not work. However, you could give the command
. ttest income if sex != "Unknown", by(sex)
which would cut the number of categories back down to two.
regress (abbreviated reg) is the command that runs simple linear regression. A binary regression of income on age would look like this:
. reg income age
Source | SS df MS Number of obs = 19
------+------F( 1, 17) = 0.06
Model | 26697204 1 26697204 Prob > F = 0.8124
Residual | 7.8125e+09 17 459556511 R-squared = 0.0034
------+------Adj R-squared = -0.0552
Total | 7.8392e+09 18 435508772 Root MSE = 21437
------
income | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------+------
age | -78.67952 326.4362 -0.24 0.812 -767.3997 610.0406
_cons | 38891.1 13776.23 2.82 0.012 9825.808 67956.4
------
A multivariate regression might include sex as a control variable, as men and women have different mean incomes. However, running
. reg income age sex
produces the error message
no observations
This is because sex is a string variable, with values of Male and Female. To include sex in this regression, we need to modify the variable to make it useable.
Creating and modifying variables
The basic commands for creating new variables and modifying old ones in Stata are generate (abbreviated gen), egen and replace.
The command genvariablename = something creates a new variable named variablename and sets it equal to something. Something can be a simple number, a string, a mathematical expression, or a function of other variables.
. gen one = 1
. gen two = 1+1
. gen three = one+two
. gen bmi = (weight/(height*height)) * 703
. gen approx_health_score = round(health_score,1)
There is another command, egen, that is also used to create new variables. egen works with a different set of functions than gen. There is no particular logic as to why there are two variable creation commands - it's just an oddity that has to do with the way Stata was written. Most egen functions work across all observations to produce variables that summarize other variables. For example:
. egen max_weight = max(weight)
. egen sum_of_all_weights=sum(weight)
For information on what gen functions there are, search for "functions" in Stata's online help. For information on egen functions, search for "egen".
The if qualifier
The if qualifier is used to isolate a set of observations with variables meeting some particular criteria. Values on variables in a dataset are compared to values on other variables or to numbers or strings using logical comparison operators. This is very often used to create "dummy variables", 0-1 indicators used to indicate whether something is true or false.
Operator / Meaning== / equal to
greater than
>= / greater than or equal to
less than
<= / less than or equal to
!= or ~= / not equal to
Pay special attention to that double equals sign! If you are testing for equality, use a double equals sign (==). A single equals sign (=) is used to set something equal to something else.
For example, if you want to list all information for a person in your dataset whose first name is Sarah, you would type:
. list if name=="Sarah"
String values always need to be put in quotes.
To display the names of people with an income less than or equal to $40,000:
. list name if income<=40000
Combining tests: and and or
if on its own is useful if you are interested in testing for only one thing at once, such as the condition "Female". But let's say you want to find out the mean income for women in your dataset between the ages of 25 and 34. What you need to do is take this series of tests and combine them with the and operator, .
. sum income if sex=="Female" & age >=25 & age <= 34
Note that the if statement is included only once, and then the tests are simply stated one after another. Also note that you need to write out the entire test statement each time:
. sum income if age >=25 & <= 34
is not allowed.
If you want to look at cases where at least one of two or more conditions is met, the or operator | is needed.
. gen child_of_immigrant=1 if birthplace_mother!="USA" | birthplace_father !="USA"
It is possible to combine the & and | operators. It's good practice to group the statements using parentheses. Since we didn't like the way we generated "child of immigrant" the first time, we need to drop it using the drop command before recreating it.
. drop child_of_immigrant
. gen child_of_immigrant=1 if (birthplace_mother!="USA" | birthplace_father !="USA") & (birthplace=="USA" & birthplace_mother!="" & birthplace_father !="")
Notice that this time a test for missing birthplaces was included.
The replace command is used to make changes to existing variables:
. gen heavy=1 if bmi>=30
. replace heavy = 0 if bmi<30
This pair of commands looks correct, but is actually wrong, for a subtle, Stata-specific reason. A peculiarity of Stata is that numerical missing, represented as a period (.), is internally treated as an infinitely large number, the highest number possible. So if you are testing for values greater than some number, missing values will always be included. This can produce very strange results -- for example, people who show up as being both senior citizens and in high school, because their age was missing. The correct sequence of commands is:
. gen heavy=1 if bmi>=30 & bmi!=.
. replace heavy = 0 if bmi<30
The moral is, always check your variable creation statements and then ask yourself, "What is happening to the missings?"
Missing string values are represented by the empty string, "".
Replace can be used with all gen functions to change the values of existing variables, but not with egen functions. However, you can use replace to modify variables created by egen as well as those created by gen.
You normally want to use replace for second and later steps in multi-step variable creations, just as we used it here. It is bad practice to "write over" existing variables, because if you make a mistake there's no way to get the original data back. For example, even if you decided that you only cared about health scores rounded to the nearest integer,
. replace health_score = round(health_score,1)
is not recommended. It's always better to create a new variable. Stata doesn't have an "undo" command, so once something is gone, it's gone!
Finally, we're ready to return to the sex and regression problem. Variables like sex, where the values have no numeric meaning, are called categorical. Variables like this are included in regressions by creating dummy variables, which are variables with the values of 0 and 1. For sex, we create a variable called female with a value of 1 meaning female and a value of 0 meaning not female (or male).
. tab sex
sex | Freq. Percent Cum.
------+------
Female | 10 50.00 50.00
Male | 10 50.00 100.00
------+------
Total | 20 100.00
. gen female = 1 if sex == "Female"
. replace female = 0 if sex == "Male"
. regress income age female
A similar procedure would be followed for race:
. tab race
race | Freq. Percent Cum.
------+------
Black | 7 36.84 36.84
Other | 2 10.53 47.37
White | 10 52.63 100.00
------+------
Total | 19 100.00
. gen black = 1 if race == "Black"
. replace black = 0 if race != "Black" & race != ""
. gen other = 1 if race == "Other"
. replace other = 0 if race != "Other" & race != ""
. regress income age female black other
The save and use Commands
You can save your data in Stata format using the save command.
. save sample
This saves all the new variables you've created.
You can then open the Stata data set with the use command.
. use sample
Graphs
While Stata's graphs are often not the most visually appealing, they can be quite useful for diagnosing data problems and for illustrating the results of analyses. For discovering the different kinds of graphs available, it is often helpful to explore the menus, one of the cases where Stata's menu interface is really helpful.
The basic command for drawing a bivariate graph in Stata is twoway. The command twoway is followed by a keyword indicating the type of graph. To do a scatter plot showing the relationship between age and income, type
. twoway scatter income age
To use the menus, click on graphics, twoway graph (scatterplot, line etc), then choose scatter from the list on the right and select age and income as the x and y variables.
Twoway graphs can be overlaid; that is, you can draw two twoway graphs on the same set of axes. A common use of this is to draw a scatterplot with a linear fit line laid overtop of it to show how the regression line fits the data.
. twoway (scatter income age) (lfit income age)
To use the menus, click on graphics, overlaid twoway graphs, then choose a scatterplot as above for plot 1 and a linear prediction with age and income as the x and y variables for plot 2.