Maths 0146 Statistical Modelling in Biology
SPSS Practical
Logistic regression
Objectives
At the end of this practical you should be able to:
· Perform logistic regression
· From the output, calculate odds ratios for the explanatory variables
· From the output, calculate 95% confidence intervals for the variables of interest
· From the output, calculate the ED50, the value of the explanatory variable that corresponds to the response variable being 50%, i.e. the does at which death rate is 50%
Before you start
· Login to the PC and goto my homepage www.bath.ac.uk/~masgs, into Teaching, this courses page, and then download the following files to your own workspace on the computer:
vite.sav
· Start up SPSS from the ‘Programs’ menu on the ‘Start’ button
Exercise 1: Exploring the difference between cases and controls
High levels of vitamin E are thought by some to be protective of cancer. This hypothesis was investigated by measuring vitamin E in stored blood samples from 271 men in a large cohort study who subsequently developed cancer. These values were compared with those from control men who at the time had not developed cancer (i.e. a ‘nested case-control design’). Two controls for each case were selected randomly from within the cohort study, subject to matching for age (within 5 years), duration of storage of the blood sample (within 3 months) and smoking status. Nine controls’ blood samples were spoilt in handling before analysis, and so there were a total of 533 controls.
The purpose of the analysis was to investigate the relationship between vitamin E and the risk of cancer. In what follows we do not explicitly take account of the matching; a more sophisticated analysis would do so. The data file (vite.sav) contains one line of data per subject as follows:
CASE – Case/control indicator, 1 = case, 0= control
AGE – Age at time of blood sample (years)
SMOKE – Smoking status 1 = non-smoker
2 = ex-smoker
3 = cigarette smoker
4 = other smoker
TIME - Time from blood collection to diagnosis of cancer
1 = up to 1 year
2 = 1-2 years
3 = 3 or more years
VIT - Vitamin E (mg/l)
VITQ - Vitamin E quintile
1.1 Comparing means
Load the dataset vite.sav
The age of each individual is recorded.
· Why are there 804 men in total ?:
If you want a quick summary of the variables, you can do a summary by columns, which is found in Analyze, Reports, Reports summaries by column. The default is to just give the sums, but by pressing the summary button you can obtain means, standard deviations etc.
· What proportion are cases (remember than cases=1, controls=0, which makes it easy, hint: just add up the column) ?
· What is the age range and mean age of the men ?
A good way to do this is to use Analyze, Compare means, which the allows you to compare means, perform t-tests and more.
Is there a difference in the mean age in the cases and controls ?
What is the mean vitamin E intake for cases, and controls ?
Is there a significant difference ?
The best way to see the distribution of a variable is to look at the histogram.
· From the ‘Graphs’ pull-down menu select ‘histogram
· Pick the variable you want to look at
· You can also an a normal curve to the data
Is the distribution of vitamin E approximately normal ?
1.2 Assess the effect of smoking
· What proportion of the men are smokers, or have smoked in the past (i.e. ex-smokers)?
In order to produce tables, with percentages in each group (row and column percentages), use Analyze, Custom Tables, General Tables. This enables you to pick the variables to use in the rows and columns (and to stratify even further) and to choose the statistics you want to calculate, for example row and column percentages (using Edit Statistics).
· Are the proportions of smokers different for cases and controls ?
· How many cancer cases developed 1-2 years after blood collection ?
The vitamin E levels have already been put into quintiles, but in SPSS you can easily categorise variables using, Transform, categorize variables and then pick the variable you want and how many levels
· Have the quintiles for vitamin E level given in VITQ been correctly defined ? Check by calculating them again from the original data, VIT.
1.3 Logistic Regression
We want to assess the effect of vitamin E intake on the probability of being a case.
We thus wish to fit a logistic model:
Log(p/1-p) = a + bvitamin E
where a and b are two numbers which we can estimate from the data. The value a is referred to as the constant term or the intercept term.
· In the ‘Statistics’ menu select ‘Regression’ and then ‘Binary Logistic’
The ‘Binary Logistic Regression’ form will appear:
· Specify case as the dependent variable, and vit as the independent variable
· Click ‘Option’ and tick CI for exp(b) Then click ‘Continue’
· Click ‘OK’ to run the logistic regression
A lot of tables are written to the output window. We are primarily interested in the model which includes vitamin E and its deviance (-2 log-likelihood) which we will use later to compare models.
Model Summary
Step / -2 Log likelihood / Cox & Snell R Square / Nagelkerke R Square1 / 937.442 / .105 / .146
· What is the deviance of the model with an intercept and the vitamin E term ?
Possibly the most interesting table is the one with the estimated regression coefficients:
Variables in the Equation
B / S.E. / Wald / df / Sig. / Exp(B) / 95.0% C.I.for EXP(B)Lower / Upper
Step 1 / VIT / -.028 / .003 / 75.854 / 1 / .000 / .973 / .967 / .979
Constant / .642 / .161 / 15.848 / 1 / .000 / 1.900
a Variable(s) entered on step 1: VIT.
Results are also given for the model with just the intercept, i.e. no vitamin E term yet
Variables in the Equation
B / S.E. / Wald / df / Sig. / Exp(B)Step 0 / Constant / -.675 / .075 / 81.689 / 1 / .000 / .509
· How do you interpret the parameter estimate in this model (i.e. fitting just an overall mean) ?
- Variables Entered/Removed. This table gives details about the independent variables used in the regression. It is used when there are many explanatory variables being used in different combinations. We can ignore this table.
- Model Summary. This table reports the Pearson correlation coefficient which you calculated earlier in exercise 2.2: it is called R in this table.
- ANOVA. The Analysis of Variance (which is commonly referred to as ANOVA) is a statistical framework which divides up the observed variation of the dependent variable (which in this case is FEV1). In the scatter plot you can see that the values of FEV1 scatter across a range of values. Part of this scatter is explained by differing height, but part it remains in unexplained variation between individuals of the same height. The purpose of ANOVA is to test whether a significant amount of variation is explained by differences in height. The p-value of this test is given in the Sig. column, and is the same as the p-value from the correlation test in exercise 2.2.
- Coefficients. This table gives the results for the values a and b in the linear model, and gives their standard errors and 95% confidence intervals. [Note that the value and standard error are in the pair of columns headed ‘Unstandardized Coefficients’, and that the coefficient column is unaccountably headed ‘B’]
· Fill in the values of a and b, their standard errors and 95% confidence intervals:
Value / Standard Error / 95% Confidence Intervala / Constant Term
(litres)
b / Height term
(litres/cm)
Given values of a and b it is possible to make a prediction of FEV1 for any specified height using the formula
FEV1 = a + bHeight
· Using the fitted values of a and b, predict FEV1 for a male student of height 180cm
FEV1 = ...... + ...... 180 = litres.[a] [b]
5. Residuals Statistics. We can use the model to predict a value of FEV1 at every observed height. The difference between this predicted FEV1 and the observed value is called the residual. These residuals scatter about zero, and contain all of the scatter in the data which is not explained by the regression line. The residuals statistics table summarises some characteristics of the residuals.
After fitting a regression line, it is a good idea to plot the line on top of the original scatter plot. In SPSS this is not quite as easy as it should be in an ideal world: it involves editing the scatterplot and overlaying another regression fit.
· Find the scatter plot in the Output window, and double click anywhere on it
This opens the ‘SPSS Chart Editor’ window, which will contain a copy of the scatterplot. In the chart editor it is possible to change the annotation on the graph and alter the range and labelling of the axes.
· On the ‘Chart’ pull-down menu select ‘Options’
The ‘Scatterplot Options’ form will appear:
· Click the ‘Total’ tick box, and the ‘Fit Options’ button will become clickable
· Click ‘Fit Options’ and the ‘Scatterplot: Fit Options’ form appears
· Click on the linear regression icon, and then click ‘Continue’
· Back on the ‘Scatterplot Options’ form click ‘OK’
This will add a line onto the scatterplot in the Chart Editor.
· In the ‘File’ menu of the Chart Editor click ‘Close’
This will return you to the normal output window, and you should see your scatter plot with the line superimposed.
· Check to see that your prediction of FEV1 at 180cm matches that of the line
The residual of each datapoint is the vertical distance between the datapoint and the regression line. Some of these are positive and some are negative. In the SPSS Data Editor window you should be able to see that the regression operation has created a new column headed res_1 containing all the residuals.
· Create a simple scatterplot of res_1 and height, with res_1 as the dependent variable
If the linear model is a good fit to the data, then there should be no trends with height in this plot. A residual plot is an important step in model checking: making sure that a regression model is a good fit to the data.
Exercise 2: Linear Regression
The file lifeexpec.sav contains birth rates per 1000 (brate) and female life expectancies (lifeexpf) for 15 countries.
· Create a scatterplot of births per 1000 (independent variable) and female life expectancy (dependent variable)
· Fit a linear regression model to the data:
lifeexpf = a + bbrate
· Report the value of the Pearson correlation coefficient R and the p-value for the relationship between brate and lifeexpf
Pearson CorrelationCoefficient
p-value
· Is there evidence for a significant relationship between lifeexpf and brate?
· Report the values of a and b, their standard errors and 95% confidence intervals
Value / Standard Error / 95% Confidence Intervala / Constant Term
(years)
b / Birth rate term
(years/1000 births)
· Add a regression line to the scatter plot