CLRES 2020, Lab 1, Created by Fiona Callaghan
Tuesday 1.00pm-5pm July 12, 2004
GSCC 126
Instructors:
Joyce Chang, PhD
Doris Rubio, PhD
Maria Mor, PhD
Teaching Assistants:
Fiona Callaghan MS
David Corcoran
Vinay Mehta
Goals for Lab 1
1. Descriptive Statistics
2. Making Tables
3. Graphs
4. Conditional Probability
5. Screening (Diagnostic) Tests
6. Binomial Probability
Instructions – How to follow this lab sheet.
Whenever you see a check-mark, you are required to perform some action. Whenever some words are in this font, they are a command that you should type in the command window of STATA. And whenever you see a “>”, it refers to going through a series of drop-down menus, as in “All Programs>Mathematics>STATA”. There are generally two ways to do most things in STATA: typing commands in the command window, or using drop-down menus, as in SPSS. Whenever possible, we will give you both ways of doing things in STATA, but you only need to use the one you feel most comfortable with. On the back of this handout are some questions that you are required to answer.
The questions that you have to answer to get credit for this lab are enclosed in a box like this.
You will answer these questions as you go through the lab and hand them in at the end for credit, so remember to write your name on them! If you experience trouble at any time, just raise your hand to let a TA or an instructor know that you need help. Let’s get started!
Getting Started
First we will log on to the computer. To do this you will need your University of Pittsburgh user id and your password.
✓ You should see a space on the screen to enter your user id. Type it in and press return.
✓ Now enter your password and press return. You should now be logged on to the computer.
We will open a folder in which to save our work, and then we will open STATA and enter a data set into STATA.
✓ Right-click somewhere on the desktop and select “New Directory”. Name your folder “Lab1”, or some other name that makes sense to you. We will save all our work in this folder.
✓ Go to the web page: http://www.pitt.edu/~changj/CLRES2020/main.html
✓ Scroll down to find the data sets, right-click on “calcium.dta”, select “Save Link As” and save the file in “/scratch/username/Desktop/Lab1”. To do this, double-click on “Desktop” and then “Lab1” in the main window (you should only have to do this once; the computer will remember where you are saving your files later on). Click “Save”. The “username” is your University of Pittsburgh email id: the part of your University of Pittsburgh email address that comes before the “@” (for example, “fmc2”).
✓ Now open STATA. Go to the programs icon on the bottom left of your screen (this is the “Start Applications” menu) and click. Go to the menu Mathematics>STATA. Click on STATA and STATA should start up.
✓ We wish to tell STATA to save anything we do from now on in our “Lab1” folder. To do this, in the command window type: cd "/scratch/username/Desktop/Lab1" (use plain straight quotes around the path, and replace “username” with your own id).
✓ Open a log file to save your computer session. To start a log file called “loglab1”, type log using loglab1.log and press return, or use the drop-down menus.
✓ Type use calcium in the command window, and press return. You can also load your data using the drop-down menus. Go to “File>Open…”, select the calcium.dta data set and click “Open”. Your data set should now be in STATA.
✓ You should see some words in the “Variables” window: “treatment”, “begin”, “end” and “decrease”. Click on the Data Editor button (or type edit in the command window). You should see 4 columns of numbers and some labels at the top of those columns. Click on the red button with the white cross at the top right of the screen to close the Data Editor window. If your data does not look right, ask a TA for help.
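The typed commands from the steps above can also be collected in a do-file and run in one go. The following is only a minimal sketch: it assumes that calcium.dta has already been saved in the Lab1 folder, and “username” is a placeholder that you must replace with your own id. The capture log close line and the replace and clear options are additions that are not part of the steps above; they simply make the file safe to run more than once.

```stata
* Sketch of the set-up steps as a do-file.
* Assumption: calcium.dta is already saved in this folder, and
* "username" has been replaced with your own Pitt id.
cd "/scratch/username/Desktop/Lab1"

capture log close                 // close any log left open earlier (ignored if none)
log using loglab1.log, replace    // the .log extension gives a plain-text log

use calcium, clear                // load the data, discarding anything in memory
describe                          // list variable names, types and labels
```

The describe command is an alternative to opening the Data Editor when you only want to check that the four variables arrived.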
About the Data
Does increasing calcium intake reduce blood pressure? Observational studies suggest that there is a link, and that it is strongest in African-American men. Twenty-one African-American men participated in an experiment to test this hypothesis. Ten of the men took a calcium supplement for 12 weeks while the remaining 11 men received a placebo. Researchers measured the blood pressure of each subject before and after the 12-week period. The experiment was double-blind.
Datafile Name: Calcium
Reference: Moore, David S., and George P. McCabe (1989). Introduction to the Practice of Statistics. Original source: Lyle, Roseann M., et al., "Blood pressure and metabolic effects of calcium supplementation in normotensive white and black men," JAMA, 257(1987), pp. 1772-1776.
Authorization: contact authors
Description: Results of a randomized comparative experiment to investigate the effect of calcium on blood pressure in African-American men. A treatment group of 10 men received a calcium supplement for 12 weeks, and a control group of 11 men received a placebo during the same period. All subjects had their blood pressure tested before and after the 12-week period.
Number of cases: 21
Variable Names:
1. Treatment: Whether subject received calcium or placebo
2. Begin: seated systolic blood pressure before treatment
3. End: seated systolic blood pressure after treatment
4. Decrease: Decrease in blood pressure (Begin - End)
Descriptive Statistics
Once we have successfully entered our data into a computer, our first step should be to calculate some simple statistics, for example the mean, the median, the standard deviation, the number of observations and the number of missing values. This helps us to get to know our data and check that we have entered the data correctly. This is called calculating the descriptive statistics.
Summary Statistics
✓ Type summarize in the command window and press return. For each of the numerical variables (“begin”, “end” and “decrease”) there should be the number of observations, the mean, standard deviation and minimum and maximum values.
✓ We can do the same thing by clicking on Data>Describe Data>Summary Statistics. This will lead us to a window with “Main” in the top left corner. Click on the white box below the word “Variables” in order to activate it, then go to the “Variables” window to your left and click on “treatment”, “begin”, “end” and “decrease”. Select “Standard display” and click “Submit”. The same result that you got before should show up on the screen.
✓ Type summarize begin, detail and press return. This gives you the summary statistics that you got before (for the variable “begin”) plus some new ones. The important ones that were missing before are the 25th, 50th and 75th percentiles. These are the lower quartile, median and upper quartile of your data.
✓ You can also do this with STATA windows. Go to Data>Describe Data>Summary Statistics and this time put “begin” in the “Variables” box and choose “Display Additional Statistics” and click “Submit”.
Question 1: What are the mean, standard deviation, minimum and maximum values for the variable “begin”?
Question 2: What is the 25th percentile of “begin”?
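After summarize runs, STATA keeps the numbers it has just printed as stored results, so they can be reused without retyping them. The sketch below assumes calcium.dta is in the current folder and is meant to be run from a do-file; the interquartile range line is only an illustration of reusing the results and is not required for the lab.

```stata
* Sketch: reuse the results that summarize leaves behind.
use calcium, clear

summarize begin, detail

* r(mean), r(sd), r(min), r(max) and the percentiles r(p25), r(p50), r(p75)
* are filled in by "summarize, detail".
display "mean = " r(mean) "   sd = " r(sd)
display "quartiles: " r(p25) ", " r(p50) ", " r(p75)
display "interquartile range = " r(p75) - r(p25)

return list                       // show every stored result by name
```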
Listing the Data
✓ Type list in the command window. This gives us a list of all the data. Be careful not to use this if you have hundreds of observations!
✓ If you only want to look at the first few observations, type list in 1/6 and press return. This will list the first 6 observations.
✓ Using the windows, go to Data>Describe Data>List data. Leave the “Variables” box blank this time and click on the “by/if/in” window at the top. Click on “Obs. in range” and type “1” in the first box and “6” in the second box. Click “Submit”. You should get the same result as before.
Question 3: What are the first 6 values of the variable “Decrease”?
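The list command also accepts a variable list and an if condition, which keeps the output short on larger data sets. This is a small sketch, again assuming calcium.dta is in the current folder and the commands are run from a do-file; the condition decrease < 0 is chosen only as an example of a filter.

```stata
* Sketch: restricting what list prints.
use calcium, clear

list treatment decrease in 1/6    // two variables, first six observations
list in -3/l                      // the last three observations (l means "last")
list if decrease < 0, noobs       // only subjects whose blood pressure went up
count if decrease < 0             // how many such subjects there are
```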
Making Tables
Currently, our data is in the form of a collection of columns. Oftentimes it is useful to present our data in a table format. We will get STATA to present our data in the form of a table. Type the following commands:
✓ generate begingrp = 1 if begin <=112. Press return.
✓ replace begingrp = 2 if (begin >112 & begin ~=.). Press return.
✓ list. Press return. You should see that now we have a new variable called “begingrp” that ‘groups’ the beginning scores into two categories: when “begin” is less than or equal to 112, “begingrp” is equal to 1; when “begin” is greater than 112, “begingrp” is equal to 2.
✓ Type tabulate treatment begingrp. We now have a table with the “begingrp” variable forming the columns and the “treatment” variable forming the rows.
Question 4: How many subjects started out with ‘low’ blood pressure (“begingrp” = 1)? What proportion of subjects started out with low blood pressure?
Question 5: Sketch the table in the answer sheet. How many people who took the calcium treatment also had low blood pressure at the beginning of the treatment (“begingrp” = 1)? What is the proportion of people who took the calcium treatment and had low blood pressure at the beginning of the study?
Question 6: How many people who took the calcium treatment also had high blood pressure at the beginning of the treatment (“begingrp” = 2)? What is the proportion of people who took the calcium treatment and had high blood pressure at the beginning of the study?
Question 7: Do you think that there are about equal proportions of ‘high’ and ‘low’ blood pressure subjects in each of the treatment groups? Why might this be important to the study results?
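The table commands above can be extended so that STATA prints the proportions as well as the counts. The following is a sketch under the same assumptions as before (calcium.dta in the current folder, run from a do-file). The label name bpgrp and its wording are made up for this illustration, and !missing(begin) is just another way of writing begin ~=.

```stata
* Sketch: grouped variable with value labels, and tables with percentages.
use calcium, clear

generate begingrp = 1 if begin <= 112
replace  begingrp = 2 if begin > 112 & !missing(begin)

* Hypothetical label name "bpgrp": makes the table show words, not 1 and 2.
label define bpgrp 1 "Low (<=112)" 2 "High (>112)"
label values begingrp bpgrp

tabulate begingrp                     // counts and percentages for all subjects
tabulate treatment begingrp, row      // percentages within each treatment group
tabulate treatment begingrp, cell     // each cell as a percentage of all subjects
```

The row option answers “what share of each treatment group is low or high?”, while cell answers “what share of all subjects falls in this combination?”, so check which of the two a question is asking for.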
Making Graphs
Graphs and plots enable us to view large chunks of the data set all at once and often allow us to see patterns in our data that we cannot see with tests and summary statistics. For this reason, plots and graphs should always be produced before getting into any detailed analysis.
Stem-and-leaf Plot
The stem-and-leaf plot gives us a quick way to visualize the variables in our data and tells us how they are distributed.
✓ Type stem begin. The stem-and-leaf plot groups the variable by the values listed down the left-hand side of the graph, and then lists the values that fall into that category on the right-hand side. For instance, this plot tells us that the values for “begin” from lowest to highest are: 98, 102, 102, 107, 107, 109, …etc. STATA has automatically grouped the variable in units of 5 (the first group is 95-99, the second 100-104, the third 105-109, …etc.). The mode of this variable is 112: this value is repeated four times.
✓ Type stem decrease and press return.
Question 8: Copy the stem and leaf plot for “decrease” onto the answer sheet. What is the mode of this variable?
Question 9: Do you think that this variable looks symmetrical?
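If the automatic grouping is not the one you want, stem has a lines() option that sets how many stems are used per interval of ten. A frequency table sorted by count is another quick way to check a mode. The sketch below uses “begin”, whose mode is already given above, and assumes calcium.dta is in the current folder and the commands are run from a do-file.

```stata
* Sketch: controlling the grouping of a stem-and-leaf plot.
use calcium, clear

stem begin                  // STATA chooses the grouping
stem begin, lines(1)        // one stem per ten units (90-99, 100-109, ...)
stem begin, lines(2)        // two stems per ten units, that is, groups of 5

tabulate begin, sort        // values ordered by frequency; the mode comes first
count if begin == 112       // how often the most frequent value occurs
```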
Histograms
A histogram is similar to a stem-and-leaf plot in that they both group a variable into a number of categories and tell us how many subjects are in each group. The histogram presents this information vertically and so is like a stem-and-leaf plot turned on its side.
✓ Type hist decrease and press return. This gives us the histogram of decrease in blood pressure of the subjects. By default, STATA decides that grouping the variable “decrease” into 4 categories is best.
✓ Type hist decrease, bin(6). This will give you a histogram with the data grouped into 6 categories, rather than the 4 that you had before. Notice how the shape of the histogram changes. Changing the number of “bins” can give us more information about the variable.
Question 10: Sketch the shape of the two histograms that you just calculated. Which do you think is a clearer representation of the variable?
✓ Type hist decrease, by(treatment) and press return. This should give you two histograms, side-by-side. The one on the left is entitled “Calcium” and the one on the right is “Placebo”.
Question 11: Sketch the graphs of the histograms for Calcium and Placebo. Do you think that there is a difference between the treatment groups for the decrease in blood pressure?
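To compare the 4-bin and 6-bin histograms without flipping between windows, each graph can be given a name and the two can then be shown together. This is a sketch only: the graph names h4 and h6 are invented for the example, the frequency option (counts on the vertical axis instead of density) is an optional extra, and calcium.dta is assumed to be in the current folder with the commands run from a do-file.

```stata
* Sketch: keep two histograms in memory and display them side by side.
use calcium, clear

histogram decrease, bin(4) frequency name(h4, replace)   // h4: hypothetical name
histogram decrease, bin(6) frequency name(h6, replace)   // h6: hypothetical name
graph combine h4 h6                                      // both in one window

* One panel per treatment group, with counts on the vertical axis.
histogram decrease, by(treatment) frequency
```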
Boxplots
A boxplot is a graph made up of a collection of summary statistics – minimum, maximum, median, lower quartile, upper quartile. It may also plot any outlying (extreme) values.
✓ Type summarize decrease, detail and press return.
Question 12: Write down the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile) and maximum for the variable “decrease”.
✓ Type graph box decrease and press return to get a boxplot of the variable decrease.
Question 13: Sketch the boxplot and label your sketch with the summary statistics you wrote down for the previous question.
✓ Now we wish to compare the decrease in blood pressure for the two groups using boxplots (as we did for the histograms). Type graph box decrease, by(treatment) and press return.
Question 14: Sketch the two boxplots. Do you think that there is a difference now between the two groups? Why?
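The five numbers behind each boxplot can be printed for both groups in a single table with tabstat, which makes it easier to label a sketch. The over() option is an alternative to by() that draws both boxes in one plot on a shared axis. This sketch assumes calcium.dta is in the current folder and that the commands are run from a do-file.

```stata
* Sketch: five-number summary and boxplots by treatment group.
use calcium, clear

* n, minimum, quartiles and maximum of "decrease" for each group.
tabstat decrease, statistics(n min p25 p50 p75 max) by(treatment)

* Both groups in a single plot, so the boxes share one vertical axis.
graph box decrease, over(treatment)
```

Note that when a group contains outlying values, STATA plots them as separate points and the whiskers stop short of the minimum or maximum, so the table and the plot may not line up exactly at the extremes.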
Conditional Probability
Now we will explore conditional probability. We will continue using our calcium data set and we will try to explore whether the calcium treatment has had an effect on the blood pressure of the subjects using conditional probability.