Steps in quantitative data analysis

This is a broad outline of the key steps in quantitative data analysis likely to be performed in M&E activities. It aims to describe the line of reasoning pursued rather than specific analysis tools. For each step in quantitative data analysis, the following is presented:

Step
? / Description
What the step is and why you are doing it.
: / Actions and tools
What you will do and what you will use to do it.
3 / Reporting
How you will say what you found. Clarity is very important: results should be written clearly, in plain language, not in statistical jargon! In general, reporting should record and illustrate:
§  How did we implement the step?
§  What considerations guided us (e.g. assumptions)?
§  What results did we obtain?
Formulate a hypothesis and select variables
? / A hypothesis is a statement about an expected relationship between two or more variables that permits empirical testing. Formulating the hypothesis and then choosing the variables is the key conceptual stage of the research, since it defines the direction of the study. If you play with a set of data for long enough you will find some sort of relationship, but the relationships that are meaningful to you should be those defined in the hypothesis.
: / You must clarify:
§  Why are we doing a study?
§  What are the key variables?
§  What relationship can we expect among them?
3 / The first part of a study report should declare and clarify the hypothesis.
It is important to show how the hypothesis and the variables relate to the data gathering and analysis. (What data were collected? What analysis tools will be employed?)
Determine sample (randomly selected!)
Collect the data
Prepare the data
? / Data must be cleaned and organised for analysis. (Note that the coding and nature of the data should be thought through before the data gathering process starts and should be pre-tested.)
: / Actions:
§  Code/input the data into the analysis software
§  Check the data for errors and accuracy (Are all the responses reasonable? Are all relevant questions answered? Are the responses complete?)
§  Transform the data (e.g. collapse data into categories, handle missing values)
Tools:
Levels of measurement (i.e. nominal, ordinal, interval, ratio) determine what analysis tools can be used for a variable.
3 / Briefly describe the dataset on which you are operating, focusing only on unique aspects of the study (e.g. what categories did you employ?).
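The preparation actions above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed procedure: the field names (`age`, `sex`), the plausible age range and the age bands are all assumptions made up for the example.

```python
# Hypothetical raw survey records; None marks a missing answer.
raw_records = [
    {"age": 7, "sex": "F"},
    {"age": 12, "sex": "M"},
    {"age": None, "sex": "F"},   # missing value
    {"age": 140, "sex": "M"},    # unreasonable response
    {"age": 16, "sex": "F"},
]

def is_reasonable(record, min_age=0, max_age=18):
    """Check that age is present and inside an assumed plausible range."""
    age = record["age"]
    return age is not None and min_age <= age <= max_age

def age_band(age):
    """Collapse exact age into assumed categories."""
    if age < 10:
        return "under 10"
    if age < 15:
        return "10-14"
    return "15-18"

# Keep complete, plausible records and flag the rest for follow-up.
clean = [r for r in raw_records if is_reasonable(r)]
flagged = [r for r in raw_records if not is_reasonable(r)]

# Add the collapsed category to each clean record.
for r in clean:
    r["age_band"] = age_band(r["age"])

print(len(clean), "usable records,", len(flagged), "flagged for checking")
```

Flagging doubtful records instead of silently deleting them keeps a trace that can be described in the report.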
Describing the sample
? / To put the data in context, describe it in terms of averages (e.g. average height) and variation (e.g. the range of heights).
: / Employ descriptive statistics:
§  Measures of central tendency, which indicate a typical or central figure for a group of members (e.g. mean, mode, median)
§  Measures of variation, which indicate the dispersion of the data, i.e. how scattered the data are (e.g. standard deviation for continuous data, range for ordered categorical data)
3 / This stage is likely to produce extensive information. A report at this stage could read:
“There were n children in the study. The average [variable 1] was x, ranging from x1 to x2. The average [variable 2] was z, with a standard deviation of s…”
Use tables and graphs to summarise and clarify the most important information.
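As an illustration of this step, the sketch below computes the measures named above with Python's standard `statistics` module and turns them into a plain-language sentence. The heights are invented for the example.

```python
import statistics

# Hypothetical heights (cm) of seven children.
heights = [110, 112, 115, 115, 118, 120, 122]

summary = {
    "n": len(heights),
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "mode": statistics.mode(heights),
    "min": min(heights),
    "max": max(heights),
    "stdev": statistics.stdev(heights),  # sample standard deviation
}

print(
    f"There were {summary['n']} children in the study. "
    f"The average height was {summary['mean']:.1f} cm, "
    f"ranging from {summary['min']} to {summary['max']} cm "
    f"(standard deviation {summary['stdev']:.1f} cm)."
)
```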

Comparing groups within the sample:

When proceeding from simply describing the data to making inferences from them, we enter the realm of probability. We have to accept that we are working on a sample and that we can never be certain that it perfectly represents reality. Could the results have arisen by chance? To what extent are they really typical? Can we really generalise our conclusions?

§  Our ability to pick up a real difference between groups (if such a difference exists) is determined by the number of observations within each group (the sample size). The larger the sample size, the more likely we are to pick up differences between groups if they actually exist.

§  The amount of variation (e.g. the range of heights of boys and girls) is a factor. The less variation within each group, the more likely we are to pick up differences between groups if they actually exist.
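Both factors can be seen in the standard error of the difference between two group means: the smaller it is, the easier a real difference is to detect. The sketch below is an illustration only; the standard deviations and group sizes are invented.

```python
import math

def se_of_difference(sd_a, n_a, sd_b, n_b):
    """Standard error of the difference between two group means."""
    return math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)

# Hypothetical groups: standard deviation 10, 25 observations each.
baseline = se_of_difference(10, 25, 10, 25)

# Same variation, four times as many observations per group.
larger_sample = se_of_difference(10, 100, 10, 100)

# Same sample size, half the variation within each group.
less_variation = se_of_difference(5, 25, 5, 25)

print(f"baseline: {baseline:.2f}")
print(f"larger sample: {larger_sample:.2f}")
print(f"less variation: {less_variation:.2f}")
```

In this example, quadrupling the sample size and halving the variation each cut the standard error by the same amount.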

As an example, if we work at the 95% confidence level, we accept a 1 in 20 chance that an apparent relationship is due to chance alone. Another way of looking at it: if you run 20 independent tests on data in which no real relationship exists, about one of them can be expected to look significant purely by chance.
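The 1 in 20 figure can be made concrete with simple arithmetic. The sketch assumes 20 independent tests on data in which no real relationship exists; that setting is an assumption added here for illustration.

```python
alpha = 0.05   # 1 in 20: the risk accepted at the 95% confidence level
n_tests = 20   # number of independent tests (an assumption of this sketch)

# Average number of tests that look significant by chance alone.
expected_false_positives = alpha * n_tests

# Probability that at least one of the tests looks significant by chance.
prob_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"Expected chance findings: {expected_false_positives:.1f}")
print(f"Chance of at least one: {prob_at_least_one:.2f}")
```

This is why a relationship that turns up among many tested should be treated with more caution than one stated in the hypothesis beforehand.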

Explore the differences between data
? / This means assessing whether the difference in the same variable between two groups is statistically significant. For example, you found that the average of a given variable differs between two groups (e.g. study group and control group). Can you say that there is really a difference between the two groups, or could this difference have arisen by chance?
: / §  Measures of significance, e.g. the t-test, are used to find out whether a difference is significant.
3 / N/A
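A t-test for this step can be sketched as follows. The text does not say which form of the t-test to use; this example uses Welch's version, which does not assume equal variances in the two groups, and that choice is an assumption. The data are invented, and the p-value itself would come from a t table or a statistics package.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    # Squared standard error of each group mean.
    va = statistics.variance(group_a) / len(group_a)
    vb = statistics.variance(group_b) / len(group_b)
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (
        va**2 / (len(group_a) - 1) + vb**2 / (len(group_b) - 1)
    )
    return t, df

# Hypothetical scores for a study group and a control group.
study = [10, 12, 11, 13, 14]
control = [8, 9, 7, 10, 6]

t_stat, dof = welch_t(study, control)
print(f"t = {t_stat:.2f} with about {dof:.0f} degrees of freedom")
```

The t statistic is then compared with the tabulated critical value for those degrees of freedom (about 2.31 for 8 degrees of freedom at the 5% level, two-sided). A larger absolute value suggests the difference is unlikely to be due to chance alone.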
Explore the relationships within data (among pairs of variables)
? / You must understand what relationships exist among the different variables in your dataset and establish whether they are statistically significant, particularly those between measures of programme operations and measures of expected effects.
: / §  Formulate hypotheses on what relationships are likely to exist between your MAIN variable and the others by using a “null hypothesis”: assume that there is NO relationship between variable x and variable z, then run tests to see whether the data allow you to reject this assumption.
§  Measure the strength of the relationship.
§  Understand the likelihood that this relationship appeared by chance.
Whether variables are nominal, ordinal or interval determines which analytical techniques are appropriate for studying relationships:
§  For nominal variables, cross tabulations of the data are used (e.g. chi-square test).
§  For interval variables, correlation tests are used.
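The two techniques named above can be sketched with plain Python. The cross tabulation and the paired values are invented, and the Pearson coefficient is used here as one common correlation measure, which is an assumption, since the text does not name a specific one.

```python
import math

def chi_square(table):
    """Chi-square statistic for a cross tabulation given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two interval variables."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical 2x2 cross tabulation (rows: group, columns: outcome yes/no).
crosstab = [[30, 10], [20, 40]]
chi2 = chi_square(crosstab)

# Hypothetical paired measurements of two interval variables.
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

print(f"chi-square = {chi2:.2f}, r = {r:.2f}")
```

For a 2x2 table (1 degree of freedom), a chi-square value above about 3.84 is significant at the 5% level. The correlation coefficient ranges from -1 to 1 and expresses the strength and direction of the relationship.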
3 / You will have to document the relationships relevant for your study, i.e.:
§  Those that involve the main variable being analysed
§  Those that are unlikely to have appeared by chance and are strong enough to be meaningful.
When writing reports on correlation, don’t limit yourself to statistical jargon, e.g. “we did a chi-square test and it revealed a p of x.” You are not simply looking for the results of a test! Instead, clarify the hypothesis, the nature of the relationship in the data, the strength of the relationship, and the likelihood that it did not appear by chance. For example, while exploring the relationship between BMI and number of siblings you could say: “Children with more than one sibling had a lower BMI. Children with no siblings had a BMI of x. The difference was statistically significant at the 95% confidence level.”

Finding a statistical relationship among variables does not necessarily imply that one caused the other. The relationship could be real, or it could simply have appeared by accident. Causality cannot be revealed by statistical tests alone: establishing it is a logical process that builds on the statistical findings (i.e. the existence of a relationship).

Explore models built on relevant variables
? / Devise and test explanatory models.
: / Actions:
Considering all the relationships you discovered, choose the set of key variables that you think are most likely to influence the main variable. Test them to understand if and to what degree they can explain the variance of the main variable of your study. The models could be applied to different subgroups of the main variables (e.g. male and female population).
In building up your case you could, for example:
§  Choose variables because you discovered that they are strongly correlated with the main variable
§  Choose variables that are not so strongly correlated, but that you still deem important for your model
§  Discard variables that are strongly correlated to your main variable, but are on the same causal pathway, therefore would not add relevant information
§  Discard variables that are apparently related but could lead you to wrong conclusions.
Those models are based on your judgement and should be justified.
Note that multivariate techniques can be very powerful analytical tools, but they must be used with great care. They are all based on numerous assumptions, some of which will not be met. As a result, apparent findings often are not valid. A plan for data analysis should not include any multivariate techniques unless the evaluation team -- manager and consultants -- are already well-acquainted with them or can call on the assistance of someone who knows how to use them.
Tools:
Regression analysis/Multivariate analysis
3 / Explain:
§  why you did/did not choose variables
§  the rationale for the model
§  the finding from statistical tests
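The idea of testing how much of the variance of the main variable a model explains can be illustrated with the simplest possible case: one explanatory variable fitted by least squares. This is a deliberately reduced sketch with invented data. A real multivariate model involves several explanatory variables and is normally fitted with a statistics package, with the cautions noted above.

```python
def fit_line(xs, ys):
    """Least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, r_squared), where r_squared is the share
    of the variance of y explained by x.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, 1 - ss_resid / ss_total

# Hypothetical explanatory variable and main variable.
explanatory = [1, 2, 3, 4, 5]
main_variable = [2, 4, 5, 4, 5]

slope, intercept, r_squared = fit_line(explanatory, main_variable)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.2f}")
```

In this example the single explanatory variable accounts for about 60% of the variance of the main variable; the rest remains unexplained by the model.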
Organise and present the data (see Overview – managing data analysis)
Validate/discuss with key stakeholders (see Overview – managing data analysis)
