Statistics for AKT

Everything I knew for the AKT (must admit some of the more obscure equations like pre and post test probability were not well remembered!). Based mainly on Nick Collier’s teaching and passmedicine answers.

Data

Data can be continuous, discrete, or categorical (which may in turn be ordinal or nominal).

Continuous data concerns a continuous variable (e.g. height or weight)

Discrete data must be a whole number (e.g. number of children)

Categorical data concerns categories (e.g. blue or brown eyes)

Categorical data may be ordinal data where it is possible to rank the categories (e.g. small, medium and large) or nominal when ordering is not possible (e.g. eye colour)

Continuous data from a biological population will usually form a bell-shaped curve when plotted. If the curve is symmetrical it is called a Normal distribution (a type of parametric distribution, so parametric statistical techniques are required). If it is not symmetrical it is a skewed distribution (a type of non-parametric distribution, so non-parametric statistical techniques are required).

Normal Distribution (a type of parametric distribution)

Mean = mode (most frequently occurring number) = median (middle number when the numbers are ranked)

Distribution is described using the mean and the standard deviation or variance

Standard Deviation = the average distance each data point is from the mean (i.e. how spread out the data is). 68% of the data points will be within 1SD of the mean, 95% within 1.96SD, 99.7% within 3SD

Variance = SD²

Standard error of the mean – this concept is quite confusing, so imagine the population of the UK. The heights of everyone will be normally distributed and will have a mean. If you want to find out the mean height of the UK population you could measure everyone, but that would take too long. An alternative would be to take a sample of the population, e.g. the population of Northallerton, and assume that it is representative of the whole of the UK. You can then measure everyone in Northallerton and calculate the mean and standard deviation for Northallerton.

However, how likely is it that the mean for Northallerton matches the mean for the UK? The way to work this out is the Standard Error of the Mean – basically this gives a range of figures around the sample mean (i.e. the Northallerton mean) that the population mean is likely to be within. There is a 95% chance that the population mean will lie within 1.96SEM of the sample mean (i.e. take the Northallerton mean, add and subtract 1.96SEM from it, and this gives a range within which there is a 95% chance that the UK mean will be). This is how confidence intervals are arrived at.

The bigger the sample size, the smaller the SEM becomes. This makes sense: if you measure the heights of half the population you are likely to be more accurate than if you measure the heights of only 10 people.

SEM = SD/√n (where n is the sample size)
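The definitions above can be checked with a few lines of Python using only the standard library (the height figures here are made up purely for illustration):

```python
# Mean, SD, variance, SEM and a 95% confidence interval for a small sample.
# Heights (cm) below are invented numbers, not real data.
import math
import statistics

heights = [162, 170, 158, 175, 168, 180, 165, 172, 160, 177]

mean = statistics.mean(heights)
sd = statistics.stdev(heights)        # sample standard deviation
variance = sd ** 2                    # variance = SD squared
sem = sd / math.sqrt(len(heights))    # SEM = SD / sqrt(n)

# 95% CI: sample mean +/- 1.96 * SEM
ci_low = mean - 1.96 * sem
ci_high = mean + 1.96 * sem

print(f"mean={mean:.1f}, SD={sd:.1f}, SEM={sem:.2f}")
print(f"95% CI: {ci_low:.1f} to {ci_high:.1f}")
```

Note how the SEM shrinks as the sample grows: doubling n divides the SEM by √2, which is why bigger samples give narrower confidence intervals.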

Skewed Distributions (a type of non parametric distribution)

Described using the median and range (distance between the smallest and largest values)

Positive is skewed to the right (i.e. the longer tail sticks out to the right). Mean>median>mode.

Negative is skewed to the left (i.e. the longer tail sticks out to the left). Mean<median<mode.

(To remember the order of mean, median and mode write them out in that order, which is alphabetical, then put in the arrows pointing to the direction of the skew)
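A quick way to convince yourself of the mean>median>mode ordering for a positive skew (the income figures below are made up to have a long right tail):

```python
# Mean > median > mode for a positively (right-) skewed sample.
# The data are invented so that a few large values drag the tail right.
import statistics

incomes = [20, 21, 21, 22, 23, 25, 28, 35, 60, 120]  # long right tail

mean = statistics.mean(incomes)
median = statistics.median(incomes)
mode = statistics.mode(incomes)

print(mean, median, mode)   # the mean is pulled up by the tail
assert mean > median > mode
```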

Statistical Tests

When doing a study, what is being looked at is whether there is a difference between two sets of data. For example, does treatment with an ACE inhibitor change the blood pressure in the treatment group compared to the non-treatment group?

Before starting a study several values have to be decided on:

The null hypothesis (i.e. the assumption which needs proving or disproving) is that there is no difference between the two groups of data (i.e. that the two samples were drawn from the same population).

p value (level of statistical significance) = the probability of obtaining our result, or something more extreme, if the null hypothesis is true (loosely, the chance that the observed difference would arise by chance even though the two samples came from the same population).

p<0.05 = the probability of obtaining this result by chance, if the two samples are from the same population, is <1/20

Once the level of statistical significance to be shown has been chosen then the power of the study is used to calculate the sample size needed.

Power = probability of the study rejecting the null hypothesis when it is false (conventionally at least 80%). This is affected by the sample size, the treatment effect size and the p value to be demonstrated. You want to make sure you have a large enough sample to give a meaningful result that wouldn’t be achieved by chance alone.

Even if the study has a power of 95% there is still a 1/20 chance that it will not reject the null hypothesis when it is false (i.e. will say there is no difference between the two sample groups even if there is). If this happens it is a Type 2 error (i.e. a false negative – the null hypothesis is accepted when it is actually false). The rate of a type 2 error occurring is signified as β, and 1−β = power. The other type of error that can occur is a type 1 error, where the null hypothesis is falsely rejected when it is true (i.e. a false positive). If p = 0.05 then this type of error will happen with a 1/20 chance. A type 1 error is signified as α.

Therefore if there is no difference between the two samples then if the study has a p value of 0.05 then there is a 1/20 chance of it showing a difference between the two samples ( a type 1 error). If there is actually a difference between the two samples then a study with a power of 95% has a 1/20 chance of showing no difference (a type 2 error).
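You can watch the type 1 error rate appear by simulation. The sketch below (standard-library Python, made-up parameters) runs 2000 "studies" in which both samples really do come from the same population and tests each at α = 0.05 – about 1 in 20 wrongly reject the null hypothesis:

```python
# Simulating the type 1 (false positive) error rate.
# Each "study" draws n=30 values from the SAME population (so the null
# hypothesis is true) and runs a two-sided z-test on the sample mean.
# With alpha = 0.05 we expect to reject the null about 1 time in 20.
import math
import random
from statistics import NormalDist

random.seed(1)
n, trials, alpha = 30, 2000, 0.05
false_positives = 0

for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) / (1 / math.sqrt(n))   # sample mean / SEM
    p = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided p value
    if p < alpha:
        false_positives += 1

print(false_positives / trials)   # close to 0.05
```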

There are various ways of deciding if there is a difference between the two groups of data. The significance of this difference is determined by the p value. Depending on the distribution of the data different tests need to be used. We don’t really need to know how these tests are done so it is just a case of learning which should be used when.

Numerical data:

Normal/parametric distribution – Student’s t-test (‘paired’ if it is the same people in each sample, i.e. a before-and-after study; ‘unpaired’ if the two groups contain different people)

Skewed/non-parametric distribution – paired: Wilcoxon; unpaired: Mann-Whitney U (these rank the data before analysis)

Categorical data (binomial = two possible outcomes, expressed as a percentage or proportion) – Fisher’s exact test for small samples or Chi-squared for large samples
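We don't need to know the mechanics for the AKT, but the Chi-squared statistic for a 2x2 table is simple enough to compute by hand, which makes it less mysterious. A sketch with invented counts (rows: treated/control, columns: disease/no disease):

```python
# Chi-squared statistic for a 2x2 table, computed from first principles.
# Counts are made up for illustration.
table = [[10, 40],   # treated:  disease, no disease  (a, b)
         [20, 30]]   # control:  disease, no disease  (c, d)

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        # Expected count if there were no association between the groups:
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (table[i][j] - expected) ** 2 / expected

# For 1 degree of freedom the 5% critical value is 3.84:
print(f"chi2 = {chi2:.2f}, significant at p<0.05: {chi2 > 3.84}")
```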

Correlation and Regression

As well as looking for differences between two sets of data you might want to see if two sets of data correlate (e.g. does blood pressure reduce as ACE inhibitor dose is increased). You again use different tests depending on the data distribution.

Numerical data:

Normal/parametric – Pearson’s correlation coefficient (N.B. Parametric = Pearson’s)

Skewed/non-parametric – Spearman’s rank correlation coefficient (N.B. Spearman’s = Skewed)

Correlation exists when there is a linear relationship between two variables (N.B. there may be a non-linear relationship between two variables but this would give a low correlation coefficient)

Correlation coefficient = r; this is the strength of the relationship, i.e. how closely points lie to a line drawn through the plotted data. If the points are very scattered then the two variables are not very well correlated; if all the points lie on the line of best fit then the two variables are perfectly correlated. r can vary from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 meaning no correlation.

Correlation does not give any information on how much a variable will change based on the other variable. It also does not show cause and effect.

Linear regression is used to predict how one variable changes when a second variable is altered i.e. it quantifies the relationship between two variables. It describes the line of best fit in mathematical terms. The regression coefficient gives the slope of the line (i.e. the change in one value per unit change in the other). Regression is measured using the method of least squares. Regression is useful if we want to use one measure as a proxy for the other e.g. fasting glucose for HbA1c.
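Both r and the least-squares line come from the same sums of products of deviations. A from-first-principles sketch (the x/y values are invented, e.g. imagine dose vs. blood pressure drop):

```python
# Pearson's r and least-squares linear regression, computed by hand.
# x/y values are made up for illustration.
import math

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)        # correlation coefficient
slope = sxy / sxx                     # regression coefficient (change in y per unit x)
intercept = mean_y - slope * mean_x   # line of best fit: y = intercept + slope*x

print(f"r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Note the division of labour: r tells you how tightly the points hug the line, while the slope tells you how much y changes per unit of x – which is exactly the distinction the paragraph above draws between correlation and regression.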

Survival Time

Cumulative survival over time in a cohort of patients is shown in a Kaplan-Meier survival curve, and it is possible to compare survival in different groups of patients using the curves. Regression models can be used to calculate the hazard ratio for events occurring in each group (i.e. you can find out the impact on survival that individual factors have and give them a figure = the hazard ratio).

(I had never heard of a Kaplan Meier survival curve when I did AKT but I’m sure most of us can work out what this graph shows even if we do not know the name of it!)
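If it helps to see what the curve actually plots, here is a minimal sketch of the Kaplan-Meier (product-limit) calculation with invented follow-up times; "censored" means the patient left the study alive, so they drop out of the at-risk pool without counting as a death:

```python
# Minimal Kaplan-Meier estimator sketch. Times are made up.
# event=True means a death was observed; event=False means censored.
data = [(2, True), (3, True), (4, False), (5, True), (8, False), (9, True)]

at_risk = len(data)
survival = 1.0
curve = []   # (time, cumulative survival) at each event time

for time, event in sorted(data):
    if event:
        survival *= (at_risk - 1) / at_risk   # proportion surviving this step
        curve.append((time, round(survival, 3)))
    at_risk -= 1   # censored patients still leave the at-risk pool

print(curve)   # the step heights of the survival curve
```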

Odds and Rates

Basic definition of odds and rates:

If you have 6 red balls and 4 blue balls in a bag, the odds (ratio) of picking out a red ball is 6:4 or 3:2 or 1.5. The probability (rate) of picking a red ball is 6/10 or 0.6. Probability can be expressed either as a proportion as above (i.e. 0.6) or as a percentage by multiplying by 100 (i.e. 60%). Odds are never expressed as a percentage. When doing calculations with rates make sure you know if it is a proportion or a percentage that you are dealing with. Often it is best to only change to a percentage at the end of the calculation to avoid confusion.
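The ball example as code, including the conversions between odds and probability (worth knowing since exam questions sometimes mix the two):

```python
# Odds vs. probability for the bag of 6 red and 4 blue balls.
red, blue = 6, 4

probability = red / (red + blue)   # 6/10 = 0.6 (x100 -> 60%)
odds = red / blue                  # 6:4 = 1.5 (never a percentage)

# The two are interconvertible:
odds_from_prob = probability / (1 - probability)   # -> 1.5
prob_from_odds = odds / (1 + odds)                 # -> 0.6

print(probability, odds)
```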

Consider a study where treatment or exposure is given to one group and placebo or no exposure to the other group. Disease incidence is then measured in each group. It is possible to express the difference between the two groups in various ways. Different ways of expressing the difference are appropriate for different types of studies.

                                          Disease   No disease
Treatment or exposure (experimental)         a          b
Placebo or no exposure (control)             c          d

N.B. the control group may not be taking a placebo but an established drug as it would be unethical for example, to compare a new chemotherapy agent to placebo if there is already an established treatment.

Basic expression of data:

Control event rate = rate of disease in the placebo group = c/(c+d) (x100 if you want to express it as a percentage) = those in the control group who have the disease / all those in the control group

Experimental event rate = rate of disease in the treated/exposed group = a/(a+b) (x100 if you want to express it as a percentage) = those in the experimental group who have the disease / all those in the experimental group

Ideally you would hope that if the factor being studied is a treatment then the EER should be lower than the CER. If exposure to a risk factor is being studied then the EER should be higher than the CER. (This may not always be the case if the treatment is actually worse than placebo or the possible risk factor is protective!)

Control event odds = c/d = those who have the disease who are in the control group/those who don’t have the disease who are in the control group (never expressed as a percentage)

Experimental event odds = a/b = those who have the disease who are in the experimental group/those who don’t have the disease who are in the experimental group (never expressed as a percentage)

These basic expressions of the data can be manipulated to show differences between the two groups:

Absolute risk reduction (attributable risk) is the most basic method where you simply subtract one rate from the other to show the proportion (or percentage) of cases that can be considered to be due to the exposure or reduction that can be considered to be due to the treatment. (For studying a treatment where EER is likely to be less than CER then = CER-EER, for studying an exposure where EER is likely to be greater than CER then = EER-CER)

Relative risk reduction = ARR/CER (x100 to express as a percentage) (make sure ARR and CER are both expressed either as a proportion or a percentage before calculating this)

Risk ratio (relative risk) = EER/CER (this is not expressed as a percentage as it is a ratio of the rates!) (1 = no difference between the two groups)

Odds ratio = EEO/CEO (never expressed as a percentage) (1=no difference between the two groups)

Another number that can be calculated from this basic data is the Number Needed to Treat (the number of patients who must be treated for one extra patient to benefit). One way to see the equation: the ARR is the extra proportion of patients who benefit per patient treated, so on average 1/ARR patients need treating for one to benefit:

NNT = 1/ARR (x100 if ARR was expressed as a percentage)
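Putting all the 2x2-table expressions together in one worked example (the counts are invented purely for illustration):

```python
# Worked example: CER, EER, ARR, RRR, risk ratio, odds ratio and NNT
# from a 2x2 table. Counts are made up.
#                  disease   no disease
a, b = 10, 90   # treatment (experimental) group
c, d = 20, 80   # placebo (control) group

cer = c / (c + d)        # control event rate      = 0.20
eer = a / (a + b)        # experimental event rate = 0.10

arr = cer - eer          # absolute risk reduction = 0.10
rrr = arr / cer          # relative risk reduction = 0.50 (x100 -> 50%)
rr = eer / cer           # risk ratio              = 0.50
odds_ratio = (a / b) / (c / d)   # EEO/CEO
nnt = 1 / arr            # number needed to treat  = 10

print(cer, eer, arr, rrr, rr, round(odds_ratio, 2), nnt)
```

So with these made-up numbers, 10 patients would need to be treated to prevent one extra case of disease.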

Types of Study

Cohort – looks at outcomes in groups which are divided by their risk factors; used to assess harm or risk (e.g. future risk of diabetes in breast-fed and non-breast-fed babies). Use relative risk.

Case control – looks back at risk factors in patients with and without a condition; used to assess the aetiology of rare conditions (e.g. were teenagers with leukaemia more likely to have lived near an electricity pylon as children?). Use odds ratio.