Questionnaire Scales: Part 1

Slide 1

The goal of this lecture is to inform you about the different types of question formats that you could use in questionnaires.

Slide 2

As the note to the right indicates, researchers can develop a broad range of scales for measuring things. I included this Beaufort wind force scale as one example of a type of scale you might not have thought existed.

Slide 3

Although I will try in this lecture to structure the different types of scales as much as possible, the point of showing you this cartoon, perhaps my favorite marketing research cartoon, is that it never hurts to be creative when trying to design a scale. Being creative differs from confusing respondents with bizarre question formats, but there can be a fine line between a confusing question and an effective one.

Slide 4

One key point of this lecture is that the precise language you use in your questions is important. Different words will elicit different answers. You really have to think about each word that you include in your questions. The cartoon illustrates the euphemisms that we sometimes use to describe things: “security division of an automobile wreckage site,” a.k.a. junkyard dog. The first descriptor sounds very different from the second descriptor. Similarly, using some words in questions will inspire very different answers than using other words.

Slide 5

There are many different ways to ask the same question. Asking those questions in different ways can yield very different responses.

Slide 6

Here are six different response formats for the same question: “How likely would you be to buy Grandma’s peach cobbler?” In Case #1, there are five dashes, with anchors very likely and very unlikely. In Case #2, there are the same anchors, but the numbers 1 through 5 have replaced the dashes. In Case #3, there are five check boxes, with the first and last one labeled. In Case #4, dashes have returned, but instead of only labeling the endpoints, the intermediate points are labeled as well. By the way, it’s strongly recommended that you not merely provide respondents with anchoring endpoints; you also should describe the intermediate points in meaningful language. Case #5 looks like half boxes in which the person can check off their answers. Case #6 shows a scale from +2 to -2 with the same anchors, very likely to very unlikely.

Slide 7

This slide is similar to the previous one, except that seven-point rather than five-point scales are shown. In part, I included this slide for example #4 about ‘Cheer detergent is.’ Notice the scale point descriptors: very harsh, harsh, somewhat harsh, neither harsh nor gentle, somewhat gentle, gentle, and very gentle. You’ll need to describe each scale point for any question you design.

Slide 8

This slide summarizes research showing that the format of a question as simple as “What is your age?” makes a meaningful difference. The top of the slide indicates that there are three basic ways to ask about a respondent’s age: (1) What is your age? (2) In what year were you born? and (3) In which category does your age fall? The table at the bottom of the slide shows that answers to the age question differed meaningfully by question format. Because each set of 800 respondents that received a given question format should have a similar age profile (thanks to the law of large numbers and random sampling), the age profiles for each group should differ only by sampling error. Instead, people asked their age directly tended to answer somewhat younger than people asked their year of birth, and meaningfully younger than people asked the category into which their age fell. The bottom of the slide shows that a meaningfully larger percentage of respondents refused to answer the direct age and year-of-birth questions than refused to answer the age category question. Although it seems counterintuitive that categorical responses would be more accurate despite their rounding error, their higher accuracy and lower non-response rate make the age category measure superior to the other age measures.

Slide 9

This figure depicts the remainder of this lecture. When talking about different types of scales, researchers tend to divide scales into comparative and non-comparative forms. Then, within those categories, there are sub-categories of scales. Because you’re probably most familiar with non-comparative scales, I’ll start with them.

Slide 10

Non-comparative (or monadic) rating scales ask about a single concept. Here’s an example: “Now that you’ve had an automobile for about one year, please tell us how satisfied you are with its engine power and pickup.” The range of responses runs from completely satisfied to very dissatisfied.

Slide 11

In contrast, a comparative rating scale asks respondents to rate something by comparing it to a benchmark or series of benchmarks. Here’s an example of a comparative rating scale: “Please indicate how the amount of authority in your present job compares with the amount of authority that would be ideal for this job.” This question asks for a comparison between the current job and a benchmark ideal job. The responses are too much, about right, and too little.

Slide 12 (No Audio)

Slide 13

The most popular non-comparative scale is the Likert scale. I’ll first show you several examples of Likert items. In this first example, the Likert item concerns tennis. The statement “It is more fun to play a tough competitive tennis match than to play an easy one” is matched with response alternatives that run from strongly agree to strongly disagree.

Slide 14

Here’s an example of several Likert-type items for assessing consumer beliefs about a department store. The format here has the statements on the left-hand side: Duncan’s Department Store has lower prices than competitors; Merchandise displays at Duncan’s Department Store are messy; Clerks at Duncan’s Department Store are not very friendly; and The downtown Duncan’s Department Store is a convenient location. Notice that some of these items are positive and some of these items are negative. The responses range from strongly agree to strongly disagree, and the response categories are summarized by letters instead of numbers: ‘SA’ instead of 1, ‘A’ instead of 2, and so on. One also could use blanks or boxes that require a checkmark. Many formats will work for Likert-type items. I recommend a number format because it’s easier to enter such data into a computer. Clearly, it’s easier to type a string of numbers than to look at boxes, blanks, or letters, convert them into a number mentally, and then type that number into a computer.
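To make the data-entry point concrete, here is a minimal Python sketch that converts lettered responses into the numeric codes a keyer would type. The function name and the letter codes beyond ‘SA’ and ‘A’ are assumptions for illustration, not taken from the slide.

```python
# Illustrative mapping from lettered Likert responses to numeric codes,
# following the convention above ('SA' = 1, 'A' = 2); the remaining
# letter codes ('N', 'D', 'SD') are assumptions for this sketch.
LETTER_TO_NUMBER = {"SA": 1, "A": 2, "N": 3, "D": 4, "SD": 5}

def code_response(letter: str) -> int:
    """Convert a lettered response such as 'SA' to its numeric code."""
    return LETTER_TO_NUMBER[letter.strip().upper()]

# One respondent's answers to four items, entered as a string of numbers:
answers = ["SA", "D", "N", "SD"]
print("".join(str(code_response(a)) for a in answers))  # prints 1435
```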

Slide 15

Here’s an example of several Likert-type items that use an importance scale rather than an agreement scale. Here, respondents indicate whether something is not important, slightly important, important, very important, or extremely important.

Slide 16

Here’s an example of several Likert-type items in which blanks are used. Again, all of these are Likert scales, but I recommend you use numbers instead of blanks because it’s easier to enter the data from that type of question format.

Slide 17

To this point, I’ve been careful to say that the previous slides showed Likert-type items. The Likert scale is actually a set or series of such items. A single item is not a Likert scale, although many researchers misuse the term and refer to single Likert-type items as Likert scales. Nonetheless, a Likert scale is a multiple-item scale; hence the notion of summing responses to the multiple items to achieve a total score. Likert scales are very popular for many reasons, including that they are relatively easy to write and respondents are familiar with such questions. Even if respondents ignore your instructions, as most do, they’ll still be able to answer your questions properly.

Slide 18

This is probably the most popular format for Likert scale items. The scaling 5 to 1 could easily be reversed as 1 to 5. Typically, scale numbers run from strongly agree to strongly disagree, and it’s probably best to make ‘agree’ a larger number than ‘disagree’. The 10 items shown here, such as “the commercial was soothing,” “the commercial was not entertaining,” and “the commercial was insulting,” could be related to people’s impressions of how enjoyable it was to view the commercial or the quality of the commercial. If all 10 items relate to the same basic underlying notion, then we can sum people’s scores on these items to derive an overall score of the commercial’s likeability. The assumption is, of course, that all the questions are phrased in the same direction. Notice the items (1) “the commercial was soothing,” (2) “the commercial was NOT entertaining,” (3) “the commercial was insulting,” (4) “the commercial was silly,” (5) “the commercial was too ‘hard sell’,” and (6) “the characters in the commercial were realistic.” Items #1 and #6 are positive items; people who strongly agreed with those items must have liked the commercial. To strongly agree with items #2, #3, and #4 is to dislike the commercial. To take a sum that would be meaningful, either the negative items or the positive items would need to be reverse scored. Reverse scoring puts all the answers in the same direction. If I were scoring all these items in a positive direction and someone answered ‘4’ for item #2 (the commercial was NOT entertaining), then I would enter it into the computer as a 2. If someone answered 5 to “the commercial was insulting,” then I would score it as a 1. Reverse scoring allows a meaningful sum of the scores across all these items to derive an overall likeability score for the commercial.
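The reverse-scoring arithmetic is simple: on a 5-point scale, the reversed value is 6 minus the original answer, which is why a 4 becomes a 2 and a 5 becomes a 1. Here is a minimal Python sketch of that logic; the function names and the dictionary-based answer format are my own assumptions.

```python
def reverse_score(score: int, scale_max: int = 5) -> int:
    """Reverse a response on a 1..scale_max agreement scale (5->1, 4->2, ...)."""
    return (scale_max + 1) - score

# Per the lecture, items #2, #3, and #4 are negatively worded; in practice,
# check each item's wording before deciding which items to reverse.
NEGATIVE_ITEMS = {2, 3, 4}

def likeability(answers: dict[int, int]) -> int:
    """Sum the answers, reverse-scoring the negative items, into one score."""
    return sum(
        reverse_score(score) if item in NEGATIVE_ITEMS else score
        for item, score in answers.items()
    )

print(reverse_score(4))  # 2, the 'NOT entertaining' example above
print(reverse_score(5))  # 1, the 'insulting' example above
```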

Slide 19

This slide illustrates the kind of data that Likert-type items yield. You can see that 10 different people were asked to respond to 10 different items. Let’s assume these are their answers to the 10 items in the previous slide. We’ll assume that the numbers in the matrix have already been reverse scored, so that for the negative items a ‘5’ was entered as a ‘1’ and a ‘4’ was entered as a ‘2’. The sums in the last column run from 25 for person #4 to 40 for person #9. Later in the semester, when I talk about reliability, I’ll try to make sense of the last row, which is labeled item-to-total correlations. Let me preface that subsequent explanation by indicating that part of what researchers do, in deciding whether they should ask all those different questions and whether summing all those scores makes sense, is to determine whether people’s responses to each question are related to their responses to the other questions. Researchers assume that if all the questions address the same underlying construct, then the answers to those questions should be somewhat consistent. If the answers to one question are unrelated to the answers to the other questions, then that’s a problem. If the answers to one question are strongly but negatively related to the answers to the other questions, then reverse scoring is needed. Looking at the numbers in the last row, notice that the answers for item #4 don’t seem especially related to the rest; a correlation close to +1 indicates that an item is highly related to the others. It appears that the answers to question #4 are not especially related to the answers to the other items, whereas the answers to the other items are highly related to one another.
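The slide’s actual numbers aren’t reproduced here, but the following Python sketch, using made-up data, shows one common way to compute an item-to-total correlation: correlate each item with the total of the remaining items, so the item isn’t correlated with itself. All names and values are illustrative.

```python
from math import sqrt

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Rows are respondents, columns are (already reverse-scored) items;
# these numbers are made up for illustration.
data = [
    [5, 4, 4, 2, 5],
    [4, 4, 3, 5, 4],
    [2, 1, 2, 4, 1],
    [3, 2, 3, 1, 2],
]

totals = [sum(row) for row in data]
for j in range(len(data[0])):
    item = [row[j] for row in data]
    rest = [t - i for t, i in zip(totals, item)]  # total of the other items
    print(f"item {j + 1}: r = {pearson(item, rest):+.2f}")
```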

Slide 20

This slide illustrates the types of statements one might develop to measure people’s attitudes toward product quality and warranty responsibility. Some of these items are fairly wordy. The goal is to keep items as concise as possible yet fully convey the underlying notions. In “Products today are built to last a long time before needing service or repair,” the language is relatively simple, and it captures the issues of obsolescence and, ultimately, warranty responsibility. “Too many products available to customers are unnecessarily complex” is relatively simple language. “Customers rarely get stuck with products that don’t work, since most products today have good guarantees” is simple as well. Although these items are imperfect, they are the type of items that people can normally read, understand, and respond to meaningfully using the sorts of 5- or 7-point scales with which you’re familiar.

Slide 21

I include this figure because parts of it are informative, yet I also disagree with parts of it. For the scale categories and the labels we would provide for them, the ones for quality are fine: well above average, above average, average, below average, and well below average. The importance categories, interest categories, satisfaction categories, and even the uniqueness categories make sense. However, I disagree with using Likert-type scales to assess frequency or truth. I strongly discourage using Likert-type scales for frequency because the frequency descriptors mean vastly different things to different people. For example, I might say that I sometimes drink coffee. What that means is that I might brew myself a couple of pots a week and drink 6 to 8 cups each time. Somebody else might give the same answer but mean they drink one cup a month. Somebody else might give that same answer but mean they drink two cups a day. What I mean by ‘sometimes’ and what someone else means by ‘sometimes’, as it relates to coffee consumption, may differ markedly, and the same answer shouldn’t reflect vastly different behaviors. As for Likert-type scales and truth, I’m a logician at heart, so something is either true or false. For a longer statement or report, certain aspects may be true and other aspects may be false, but to say that something is somewhat true makes little sense from a logical standpoint. Thus, I also discourage using Likert-type items to assess the degree to which people believe something is true.

Slide 22

Thurstone scales, which are to some extent related to Likert-type scales, often are ignored by undergraduate marketing research textbooks. Such scales are valuable and provide excellent data, but they are far harder to construct than Likert-type scales. I’ll show you why in the next several slides.

Slide 23

When constructing a Thurstone scale, researchers create a series of items to which people can respond either yes or no. Those items are designed and sequenced such that respondents are increasingly likely to respond ‘yes’ as they progress from item to item. The ideal Thurstone scale would yield a series of one answer (‘no’) followed by a series of the opposite answer (‘yes’). The point at which responses change from ‘no’ to ‘yes’ is the point of measurement interest.
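As a small illustrative sketch, assuming responses are recorded as the strings ‘no’ and ‘yes’, the switch point can be located like this (the function name and encoding are my own):

```python
def switch_point(answers: list[str]) -> int:
    """Return the 1-based index of the first 'yes' in an ideal
    no...no yes...yes Thurstone response pattern (0 if all 'no')."""
    for i, answer in enumerate(answers, start=1):
        if answer == "yes":
            return i
    return 0

# A respondent who switches from 'no' to 'yes' at item 5:
print(switch_point(["no", "no", "no", "no", "yes", "yes", "yes"]))  # 5
```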

Slide 24

To some extent, you can think of a Thurstone scale as similar in concept to a standardized exam like the ACT or SAT. Those exams contain a range of questions, in terms of difficulty; examiners assume that they can identify the abilities of students taking the exam by examining the pattern of responses to those questions. They assume everyone can answer the easiest questions correctly and very few people can answer the hardest questions correctly, but there won’t be a random pattern of responses in which easy questions are missed while difficult ones are answered correctly. Think about identifying a series of questions that relate to some topic, with each question progressively more positive or more negative. Researchers would expect people’s responses to shift at some point as the statements become more positive. Here’s an example of trying to form a Thurstone scale by asking a series of judges to rate the items by the likelihood that someone would agree or disagree with them.

Slide 25

We can take those experts’ responses and sum them so that we can derive the scale values shown in the second column. Seemingly, the item that people would be least likely to agree with is item #3, with a scale value of 9.9. Item #2, with a scale value of 2.0, would be the item with which they’d be most likely to agree.
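As a sketch of how such scale values might be computed: a mean is just the judges’ summed ratings divided by the number of judges, so averaging ratings like the made-up ones below yields values on the same metric as the slide’s 2.0 and 9.9. The ratings and item labels here are invented for illustration.

```python
from statistics import mean

# Made-up ratings: each item was rated by five judges on an 11-point
# favorableness continuum (1 = easiest to agree with, 11 = hardest).
judge_ratings = {
    "item 1": [6, 5, 6, 7, 6],
    "item 2": [2, 2, 1, 3, 2],      # lowest scale value: easiest to agree with
    "item 3": [10, 10, 9, 11, 10],  # highest scale value: hardest to agree with
}

for item, ratings in judge_ratings.items():
    print(f"{item}: scale value = {mean(ratings):.1f}")
```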

Slide 26

This is an example of the response curve that a series of Thurstone items might yield. It becomes progressively more likely that someone will switch from ‘no’ to ‘yes’, or ‘disagree’ to ‘agree’, as the items progress from #1 to #11. No one would agree with item #1, but everyone would agree with item #11. Seemingly, around items #6 and #7, roughly 50% of people begin to agree.

Slide 27

Another popular type of non-comparative scale is the Semantic Differential (SD) scale. With SD scales, there’s a series of bipolar rating items. The bipolar adjectives that anchor the endpoints of the scale could be pairs such as good and bad. In entering the response data into the computer, researchers assign a number to each scale point. Although many people use ‘SD scale’ to mean any scale with bipolar ratings, the true SD scale assumes three underlying attitudinal dimensions that everyone, regardless of culture or language, uses to evaluate things in their social environment. These three dimensions are evaluation, power, and activity. For a properly constructed SD scale, all the items will relate to one of these three dimensions. However, over time people have adapted SD-type scales, so the items may not relate to one of these three underlying dimensions.