Chapter 13
Analysing quantitative data
Suggested solutions to questions and exercises
- Describe what is meant by the following:
(a) a case
(b) a variable
(c) a value.
(a) A case is a unit of analysis. A completed questionnaire containing answers from an individual is a case. If you have a final sample size of 300, you have 300 cases.
(b) A variable is an individual bit of information (a question or part of a question).
(c) The answer the respondent gives to the question is the value.
- Describe the process of transferring responses from a questionnaire to a data entry package.
Data collected on a paper questionnaire is transferred or entered into an analysis package either manually (responses keyed in to the data entry program) or electronically (responses read by a scanner or optical mark reader). On most questionnaires opposite each response, or at the end of a question, is a number. These numbers are codes that represent the responses (or values) to the question (variable). This code, representing the answer given by the respondent, is recorded or entered as a number at a fixed place in a data record sheet or grid. In order for the program to receive and understand data from the questionnaire, the data must be in a regular, predictable format such as a grid. The grid is made up of rows of cases and columns of variables. Each case makes up a line or row of data and the variables appear as columns of number codes. The purpose of data entry is to convert the answers on the questionnaire into a ‘line of data’ that the analysis program will accept and recognise.
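As a rough illustration of the grid described above, the following Python sketch turns the coded answers from each questionnaire into a 'line of data'. The variable names (q1, q2, q3) and the code values are hypothetical and are not taken from any particular questionnaire.

```python
# Hypothetical variables (columns) in a fixed order; names are illustrative only.
VARIABLES = ["case_id", "q1", "q2", "q3"]


def to_data_line(case_id, answers):
    """Return one row of the grid: the case id followed by one code per variable."""
    row = [case_id]
    for name in VARIABLES[1:]:
        row.append(answers[name])
    return row


# Two completed questionnaires (cases), each holding number codes for its answers.
questionnaires = [
    {"q1": 1, "q2": 3, "q3": 2},
    {"q1": 2, "q2": 1, "q3": 5},
]

# The grid: each row is a case, each column is a variable.
grid = [to_data_line(i + 1, q) for i, q in enumerate(questionnaires)]

for row in grid:
    print(",".join(str(code) for code in row))
```

Because every row lists the variables in the same order, the analysis program can rely on a given column always holding the same variable.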
- What is involved in data editing?
As the data are being entered on a case by case basis they can be edited or cleaned to ensure that they are free of errors and inconsistencies. Data editing can be carried out by interviewers and field supervisors during fieldwork and by editors when the questionnaires are returned from the field. With computer-aided data capture this process is incorporated into the data capture program. During the editing process missing values, out of range values and errors due to misrouting of questions are sorted out and the data are checked for other inconsistencies.
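The checks described above can be sketched in Python. This is only an illustration: the variables (q7, q8a), their valid code ranges and the routing rule are assumptions made for the example, and real data capture programs build in their own checks.

```python
# Hypothetical valid code ranges for each variable (assumptions for illustration).
VALID_CODES = {"q7": {1, 2}, "q8a": {1, 2, 3, 4, 5}}


def find_problems(case):
    """Return descriptions of the problems found in one case (a dict of codes)."""
    problems = []
    # Out-of-range check: every recorded code must be one of the valid codes.
    for variable, valid in VALID_CODES.items():
        value = case.get(variable)
        if value is not None and value not in valid:
            problems.append(f"{variable}: out-of-range value {value}")
    # Routing check: assume q8a is asked only of those coded 1 ('Yes') at q7.
    if case.get("q7") == 2 and case.get("q8a") is not None:
        problems.append("q8a: answered despite 'No' at q7 (misrouting)")
    # Missing value check: those coded 1 at q7 should have an answer at q8a.
    if case.get("q7") == 1 and case.get("q8a") is None:
        problems.append("q8a: missing value")
    return problems


print(find_problems({"q7": 1, "q8a": 7}))
print(find_problems({"q7": 2, "q8a": 3}))
```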
- How are missing data handled?
If a response has been left blank it is known as a ‘missing value’. It is important to deal with missing values so that they do not contaminate the dataset and mislead the researcher or client. One way of dealing with the possibility of missing values is at the questionnaire design stage and at interviewer training and briefing sessions. In a well-designed questionnaire there will be codes for ‘Don’t know’ and ‘No answer’ or ‘Refused’. Interviewers should be briefed about how to handle such responses and how to code them on the questionnaire. It is also possible to avoid missing values by checking answers with respondents at the end of the interview or during quality control call-backs.
If missing values remain, a code (or codes) can be added to the data entry program that will allow a missing value to be recorded. Typically a code is chosen with a value that is out of range of the possible values for that variable. Imagine that for some reason a respondent to the Life and Times survey did not answer, or the interviewer did not ask for or record a response to Q3 ‘Would you describe the place where you live as …?’ The values or response codes for this question range from 1 = ‘big city’ to 5 = ‘farm or home in the country’; you could assign a missing value code of 9 for ‘No response’. If you know in more detail why the information is missing – for instance, ‘doesn't apply’, ‘refused to answer’, or ‘don't know’ – and this is not already allowed for on the questionnaire, you can give each of these a different missing value code: ‘doesn't apply’ could be 96; ‘refused to answer’ could be 97; ‘don't know’ could be 98; and ‘missing for some other reason’ could be 99.
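A minimal Python sketch of how the missing value codes 96 to 99 might be separated from real answers before analysis. The list of responses is made up for illustration.

```python
from collections import Counter

# Missing value codes from the example above; all lie outside the valid range 1 to 5.
MISSING_CODES = {
    96: "doesn't apply",
    97: "refused to answer",
    98: "don't know",
    99: "missing for some other reason",
}


def split_valid_and_missing(values):
    """Separate real answers from missing value codes."""
    valid = [v for v in values if v not in MISSING_CODES]
    missing = [v for v in values if v in MISSING_CODES]
    return valid, missing


# Made-up responses to one question, with some missing value codes mixed in.
q3_responses = [1, 5, 98, 3, 99, 2, 97, 98]
valid, missing = split_valid_and_missing(q3_responses)

# Count why the information is missing, using the labels attached to the codes.
reasons = Counter(MISSING_CODES[code] for code in missing)
print(valid)
print(dict(reasons))
```

Keeping the codes out of range means they cannot be mistaken for a real answer, and excluding them before any calculation stops them distorting the results.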
There are other ways of dealing with missing values. One extreme approach, known as casewise deletion, is to remove from the dataset any case or questionnaire that contains missing values. This approach, however, results in a reduction in sample size and may lead to bias, as cases with missing values may differ from those with none. A less drastic approach is pairwise deletion, in which each table or calculation uses only those cases that have no missing values on the variables involved. This too will affect the quality of the data, especially if the sample size is relatively small, or if there is a large number of cases with missing values. An alternative is to replace the missing value with a real value. There are two ways of approaching this. You could calculate the mean value for the variable and use that; or you could calculate an imputed value based on either the pattern of response to other questions in the case (on that questionnaire) or the response of respondents with similar profiles to the respondent with the missing value. Substituting a mean value means that the mean of the values for the sample does not change, although the spread of the values is reduced. We are assuming, however, that the respondent gave such a response when of course he or she may have given a more extreme answer. If we substitute an imputed value, we are making assumptions and risk introducing bias.
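The three approaches can be compared in a short Python sketch. The dataset is made up, `None` is used here to mark a missing value, and the function names are illustrative only.

```python
from statistics import mean

# Made-up dataset: each case is a dict of answers; None marks a missing value.
cases = [
    {"q1": 4, "q2": 2},
    {"q1": None, "q2": 5},
    {"q1": 3, "q2": None},
    {"q1": 5, "q2": 3},
]


def casewise_delete(cases):
    """Keep only the cases that have no missing values on any variable."""
    return [c for c in cases if all(v is not None for v in c.values())]


def pairwise_values(cases, variable):
    """Keep every case that has a value on the one variable being analysed."""
    return [c[variable] for c in cases if c[variable] is not None]


def mean_substitute(cases, variable):
    """Replace missing values on one variable with the mean of the observed values."""
    observed_mean = mean(pairwise_values(cases, variable))
    return [c[variable] if c[variable] is not None else observed_mean for c in cases]


print(len(casewise_delete(cases)))   # cases left after casewise deletion
print(pairwise_values(cases, "q1"))  # values available for an analysis of q1
print(mean_substitute(cases, "q1"))  # q1 with the gap filled by the observed mean
```

In this small example casewise deletion discards half of the cases, while pairwise deletion loses only one case from each analysis.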
- Why is it necessary to weight data? Describe how weighting is carried out.
Weighting is used to adjust sample data in order to make them more representative of the target population on particular characteristics, including, for example, demographics and product or service usage. The procedure involves adjusting the profile of the sample data in order to bring it into line with the population profile, to ensure that the relative importance of the characteristics within the dataset reflects that within the target population. For example, say that in the usage and attitude survey, the final sample comprises 60% women and 40% men. Census data tell us that the proportion should be 52% women and 48% men. To bring the sample data in line with the population profile indicated by the Census data, we apply weights to the gender profile. The over-represented group – the women – is down-weighted, and the under-represented group – the men – is up-weighted. Multiplying the sample percentage by the weighting factor will achieve the target population proportion. To calculate the weighting factor, divide the population percentage by the sample percentage. Any weighting procedure used should be clearly indicated and data tables should show unweighted and weighted data.
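The arithmetic for the gender example above can be sketched in Python. The percentages come from the example; the variable names are illustrative.

```python
# Sample and population profiles (percentages) from the example above.
sample = {"women": 60.0, "men": 40.0}
population = {"women": 52.0, "men": 48.0}

# Weighting factor = population percentage / sample percentage.
weights = {group: population[group] / sample[group] for group in sample}

# Multiplying the sample percentage by its weight gives the target percentage.
weighted = {group: sample[group] * weights[group] for group in sample}

for group in sample:
    print(f"{group}: weight {weights[group]:.3f}, weighted share {weighted[group]:.1f}%")
```

The weight for women is below 1 (they are down-weighted) and the weight for men is above 1 (they are up-weighted).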
- What is a holecount and why is it useful?
A frequency count is a count of the number of times a value occurs in the dataset, that is, the number of respondents who gave a particular answer. For example, we want to know how many people in the sample are very satisfied with the level of service provided by Bank S. A frequency count – a count of the number of people who said they are very satisfied with Bank S – tells us this. The ‘holecount’ is a set of these frequency counts for each of the values of each variable; it may be the first data you see. It is useful to run a holecount before preparing a detailed analysis or table specification as it gives an overview of the dataset, allowing you to see the size of particular sub-samples, what categories of responses might be grouped together, and what weighting might be required. For example, say we have asked if respondents are users of a particular Internet banking service. The holecount or frequency count will tell us how many users we have. We can decide if it is feasible to isolate this group – to look at how the attitudes or behaviour or opinion of Internet customers compare to those of non-Internet customers, for example.
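A holecount can be sketched in Python as one frequency count per variable. The cases and codes below are made up, and the variable names are hypothetical.

```python
from collections import Counter


def holecount(cases):
    """Return, for each variable, a count of how often each value occurs."""
    counts = {}
    for case in cases:
        for variable, value in case.items():
            counts.setdefault(variable, Counter())[value] += 1
    return counts


# Made-up cases: 'internet_user' coded 1 = yes, 2 = no; 'satisfaction' coded 1 to 5.
cases = [
    {"internet_user": 1, "satisfaction": 5},
    {"internet_user": 2, "satisfaction": 4},
    {"internet_user": 1, "satisfaction": 5},
    {"internet_user": 2, "satisfaction": 2},
    {"internet_user": 2, "satisfaction": 4},
]

counts = holecount(cases)
for variable, counter in counts.items():
    print(variable, dict(sorted(counter.items())))
```

The count for code 1 on `internet_user` shows at once how many users are in the sample, and so whether that sub-sample is large enough to analyse separately.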
- What do descriptive statistics tell you? Why are they useful? Give examples.
Descriptive statistics are statistics that summarise a set of data or a distribution. Under the heading of descriptive statistics come frequencies, proportions, percentages, measures of central tendency (averages – the mean, the mode and the median) and measures of variation or spread (the range, the interquartile range, the mean deviation, the standard deviation).
These statistics are useful because they allow us to summarise a mass of numbers in a small number of figures and yet still know a lot about the data they describe (and from which they are derived).
For example, if we have collected data on annual household income from a sample of 1,000 people, we can use descriptive statistics to tell us the mean or average income across the group, what the variation in income is within the group, what percentage of the sample have a particular income and so on.
Or, for example, in comparing service A and service B, we know that the mean price paid for A and B was the same at €79 but by calculating the standard deviation in the price paid for service A we find that it is greater – €22 compared to €14. This tells us that while the average prices are the same, the price of A is more variable than the price of B. The next step might then be to check why this variation exists (what might explain it) – is it due to a sub-group of service A providers charging more, or to one or two providers charging a lot more?
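The comparison of two services with the same mean but different spread can be sketched with Python's `statistics` module. The prices below are made up to give equal means; they do not reproduce the €22 and €14 figures in the example.

```python
from statistics import mean, median, stdev

# Made-up prices paid (in euros) for two services; both have a mean of 79.
prices_a = [50, 79, 108]
prices_b = [70, 79, 88]

for name, prices in (("A", prices_a), ("B", prices_b)):
    print(
        f"Service {name}: mean {mean(prices)}, median {median(prices)}, "
        f"range {max(prices) - min(prices)}, standard deviation {stdev(prices):.1f}"
    )
```

Here `stdev` is the sample standard deviation (dividing by n - 1). The means are identical, but the larger standard deviation for service A shows that its prices vary more.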
- What are cross-tabulations and why are they useful?
Most quantitative data analysis involves inspecting data laid out in a grid or table format known as a cross-tabulation. This is the most convenient way of reading the responses of the sample and relevant groups of respondents within it. For example, we may need to know what the total sample’s view of a product is, as well as what a particular type or group of consumers thinks – whether the product appeals more to men or women, or to different age groups or different geodemographic groups, for instance. Each table or cross-tab sets out the answers to a question by the total sample and by particular groups or sub-sets within the sample that are relevant to the aims of the research.
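A simple cross-tabulation can be sketched in Python as counts of each answer for the total sample and for each group, with column percentages alongside. The responses are made up and the function names are illustrative.

```python
from collections import Counter

# Made-up responses: (group, answer) pairs for one question.
responses = [
    ("men", "like"), ("men", "dislike"), ("men", "like"),
    ("women", "like"), ("women", "like"), ("women", "dislike"), ("women", "like"),
]


def crosstab(responses):
    """Count answers for the total sample and for each group within it."""
    table = {"total": Counter()}
    for group, answer in responses:
        table["total"][answer] += 1
        table.setdefault(group, Counter())[answer] += 1
    return table


def column_percentages(table):
    """Express each count as a percentage of its own column base."""
    result = {}
    for column, counts in table.items():
        base = sum(counts.values())
        result[column] = {answer: 100 * n / base for answer, n in counts.items()}
    return result


table = crosstab(responses)
percentages = column_percentages(table)
for column in table:
    print(column, dict(table[column]), percentages[column])
```

Percentaging on each column's own base is what allows the groups to be compared even though they differ in size.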
- What is meant by ‘filtering’ the data?
A data table is usually based on those in the sample eligible to answer the question to which it relates. Not all questions are asked of the total sample, however, and analysis based on the total sample is not always relevant. Tables for questions asked of everyone will be based on the total sample; tables for questions asked only of some respondents will be based on the relevant sub-sample. For example, in a survey of the use of e-commerce, we might ask all respondents whether or not their organisation uses automated voice technology (Q7, say). Those who say ‘Yes’ are asked a bank of questions (Q8a to Q8f) related to this; those who say ‘No’ are filtered out and routed to the next relevant question (Q9). When the data tables are run it would be misleading to base the tables that relate to these questions on the total sample if the purpose of the table is to show the responses of users of the service. The tables should be based on those who were eligible to answer the questions, in other words, those saying ‘Yes’ at Q7. The tables for Q8a to Q8f that relate to automated voice technology are said to be based on those using automated voice technology (those saying ‘Yes’ at Q7). The table that relates to Q7 is said to be based on the total sample. In designing tables, it is important to think about what base is relevant to the aims of your analysis.
If you have a particularly large or unwieldy dataset and you do not need to look at responses from the total sample, ‘filtering’ the data, excluding some types of respondents or basing tables on the relevant sub-sample can make analysis more efficient and safer. For example, your preliminary analysis of data from a usage and attitude survey in the deodorants market involved an overview of the total sample. Your next objective is to examine the women’s deodorant market. In the interests of efficiency and safety, it may be worthwhile to have the tables re-run based on the sub-set of women only.
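The effect of the base on a table can be sketched in Python using the Q7 and Q8a routing from the example above. The answer categories for Q8a and the four cases are made up for illustration.

```python
# Made-up cases: q8a is asked only of those saying 'Yes' at q7; None = not asked.
cases = [
    {"q7": "Yes", "q8a": "Satisfied"},
    {"q7": "Yes", "q8a": "Dissatisfied"},
    {"q7": "No", "q8a": None},
    {"q7": "No", "q8a": None},
]


def percent_satisfied(base):
    """Percentage of the given base answering 'Satisfied' at q8a."""
    satisfied = sum(1 for case in base if case["q8a"] == "Satisfied")
    return 100 * satisfied / len(base)


# Filter: keep only those eligible to answer q8a (those saying 'Yes' at q7).
users = [case for case in cases if case["q7"] == "Yes"]

print(f"Based on total sample ({len(cases)}): {percent_satisfied(cases):.0f}%")
print(f"Based on users ({len(users)}): {percent_satisfied(users):.0f}%")
```

Percentaging on the total sample understates satisfaction among users because it counts people who were never asked the question.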
- What do inferential statistics tell you? Describe how you select an appropriate test.
There is a battery of inferential statistical tests and procedures that allow us to determine if the relationships between variables or the differences between means or proportions or percentages are real or whether they are more likely to have occurred by chance. For example, is the mean score among women on an attitude to the environment scale significantly different from the mean score among men? Is the proportion of those who buy product A significantly greater than the proportion who buy product B? The choice of test will depend on the type of data and on the level of measurement.
These tests are necessary because most research uses samples rather than populations. When we talk about our findings, we want to generalise – we want to talk about our findings in terms of the population and not just the sample. We can do this with some conviction if we know that our sample is truly representative of its population. With any sample, however, there is a chance that it is not truly representative. As a result we cannot be certain that the findings apply to the population. For example, we conduct a series of opinion polls among a nationally-representative sample of voters of each European Union member state. In our findings we may want to talk about how the opinions of German voters compare to those of French voters. We want to know if the two groups of voters really differ. We compare opinions on a range of issues. There are some big differences and some small differences. Are these differences due to chance or do they represent real differences in opinions? We use inferential statistical tests to tell us if the differences are real rather than due to chance. But we cannot say this for certain. The tests tell us what the probability is that the differences could have arisen by chance. If there is a relatively low probability that the differences have arisen by chance, then we can say that the differences we see between the samples of German voters and French voters are statistically significant – real differences that are likely to exist in the population and not just in the sample we have studied.
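As one illustration of how such a test attaches a probability to a difference, the following Python sketch runs a two-sample z test for proportions. The choice of test, the sample sizes and the counts are assumptions made for this example, which also assumes simple random samples large enough for the normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under the hypothesis that the two populations do not differ.
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Probability of a difference at least this large arising by chance alone.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up poll results: 550 of 1,000 voters in one sample agree, 500 of 1,000 in the other.
z, p_value = two_proportion_z_test(550, 1000, 500, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A p-value below a conventional threshold such as 0.05 would lead us to call the difference statistically significant, although it never makes the conclusion certain.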
It is important to choose the correct test for the data; otherwise we risk either ending up with a test result and a finding that are meaningless or missing an interesting and useful finding. First of all, we have to determine what it is we are testing for: a difference or a relationship. Next we check the level of measurement of the data involved; and finally we check whether the data are derived from one sample or two, and if two, whether the samples are related or unrelated:
- Type of analysis: testing for difference or association/relationship?
  - If testing for difference: data categorical/non-metric or continuous/metric?
    - If non-metric: from one sample or two or more samples?
      - If one sample: chi-square and binomial
      - If two related samples: Sign test, Wilcoxon test and chi-square
      - If two unrelated samples: Mann-Whitney U test, chi-square, Kruskal-Wallis, ANOVA
    - If metric: from one sample or two or more samples?
      - If one sample: z test and t test
      - If two or more unrelated samples: z test, t test, ANOVA
      - If two or more related samples: paired t test
  - If testing for association: level of measurement of dependent variable (DV) and independent variable (IV)?
    - Both DV and IV categorical: measures of association chi-square, tau B (and others).
    - DV continuous and IV categorical: ANOVA
    - IV and DV continuous: regression and correlation
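The decision steps listed above can be written as a simple lookup in Python. The sketch only restates the list; the function and key names are hypothetical, and it is no substitute for checking the assumptions of each test.

```python
# Tests for difference, keyed by (type of data, sample design), as listed above.
DIFFERENCE_TESTS = {
    ("non-metric", "one sample"): ["chi-square", "binomial"],
    ("non-metric", "related samples"): ["Sign test", "Wilcoxon test", "chi-square"],
    ("non-metric", "unrelated samples"): [
        "Mann-Whitney U test", "chi-square", "Kruskal-Wallis", "ANOVA",
    ],
    ("metric", "one sample"): ["z test", "t test"],
    ("metric", "unrelated samples"): ["z test", "t test", "ANOVA"],
    ("metric", "related samples"): ["paired t test"],
}

# Tests for association, keyed by (level of DV, level of IV), as listed above.
ASSOCIATION_TESTS = {
    ("categorical", "categorical"): ["chi-square", "tau B"],
    ("continuous", "categorical"): ["ANOVA"],
    ("continuous", "continuous"): ["regression", "correlation"],
}


def suggest_tests(purpose, first, second):
    """Look up candidate tests.

    For purpose 'difference', pass the data type and the sample design.
    For purpose 'association', pass the level of the DV and of the IV.
    """
    if purpose == "difference":
        return DIFFERENCE_TESTS[(first, second)]
    if purpose == "association":
        return ASSOCIATION_TESTS[(first, second)]
    raise ValueError("purpose must be 'difference' or 'association'")


print(suggest_tests("difference", "metric", "related samples"))
print(suggest_tests("association", "continuous", "continuous"))
```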