About the Prompt: Reliability vs. Validity

(Comments on your responses to Forum 1 follow my response)

My response to the prompt
Reliability refers to “consistency.” Do the measures yield consistent results? Three types of reliability are generally considered legitimate: test-retest reliability, alternate-form reliability, and internal-consistency reliability.
Test-retest reliability (also called “stability” reliability) refers to the correlation between scores obtained from one administration and those obtained from a later administration. This would be useful, for instance, when the trait, attribute, or characteristic measured is assumed to change minimally over time (e.g., career interest, IQ), and not useful when it is reasonable to assume that what is being measured will change over time. That statement, however, is not precisely correct. What is assumed to change, or not change, over time is the ordinality of the scores. If the rank ordering of the scores is assumed to be relatively invariant over time, then test-retest reliability is appropriate.
Alternate-form reliability (also called “equivalent-form” reliability) refers to the correlation between scores obtained from two forms of an instrument. Again, what is important here is that the ordinality of the scores from the two forms is preserved. Hence, two forms of an achievement test, say, could differ in level of difficulty, but if the rank ordering of the scores from the two forms is relatively the same, the two forms are considered equivalent and will generally yield high alternate-form reliability.
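To make the ordinality point concrete, here is a minimal sketch, using made-up scores for ten examinees (not data from Hart and Willower), of how a test-retest or alternate-form coefficient might be computed; the Spearman coefficient is included because it directly reflects how well the rank ordering is preserved across the two administrations (or forms).

# Minimal sketch with made-up scores for ten examinees (hypothetical data).
import numpy as np
from scipy.stats import pearsonr, spearmanr

form_a = np.array([12, 18, 25, 30, 22, 15, 28, 20, 17, 26])  # first administration (or Form A)
form_b = np.array([14, 17, 27, 29, 21, 16, 30, 19, 18, 24])  # later administration (or Form B)

r, _ = pearsonr(form_a, form_b)      # conventional reliability coefficient
rho, _ = spearmanr(form_a, form_b)   # reflects preservation of the rank ordering
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")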
Internal-consistency reliability is concerned with the extent to which the items (or components) in an assessment share a common source of variance. For instance, if we have an instrument to measure motivation, say, then to have high internal-consistency reliability, all the items in the assessment need to relate to (and measure) motivation. An instrument that, instead, contained items related to achievement, efficacy, and so on, would have low internal-consistency reliability. Formally, internal-consistency reliability is a function of the average inter-item correlation.
There are several ways to assess internal-consistency reliability. One way is to examine the correlation between split halves of an instrument. The split halves could be the odd items vs. the even items, or the first half of the items vs. the second half. Another way is to compute the KR-20 (or KR-21) coefficient, or coefficient alpha; these formulas give the average of all possible split-half correlations. A third way of examining the internal consistency of an instrument is to perform a principal-components factor analysis. If the first principal component accounts for the lion’s share of the variance in the items, then it can be assumed that the items are, essentially, all measuring the same thing (component).
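As a minimal illustration (with fabricated item responses, not data from the OCQ or RSD), coefficient alpha and an odd/even split-half correlation could be computed along the following lines; the Spearman-Brown step-up, a standard adjustment not mentioned above, projects the half-length correlation to the full-length instrument.

# Minimal sketch: coefficient alpha and an odd/even split-half correlation,
# using fabricated responses in which all items share one source of variance.
import numpy as np

rng = np.random.default_rng(0)
trait = rng.normal(size=(50, 1))            # common source of variance (e.g., "motivation")
scores = trait + rng.normal(size=(50, 8))   # 50 examinees, 8 items, each item = trait + error

def coefficient_alpha(x):
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    k = x.shape[1]
    return (k / (k - 1)) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

odd_half = scores[:, 0::2].sum(axis=1)      # items 1, 3, 5, 7
even_half = scores[:, 1::2].sum(axis=1)     # items 2, 4, 6, 8
split_half_r = np.corrcoef(odd_half, even_half)[0, 1]
stepped_up = 2 * split_half_r / (1 + split_half_r)   # Spearman-Brown step-up to full length

print(f"alpha = {coefficient_alpha(scores):.2f}")
print(f"split-half r = {split_half_r:.2f}, stepped up = {stepped_up:.2f}")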
Whereas reliability refers to a characteristic of an instrument (i.e., an instrument can be evaluated in terms of its reliability), validity does not directly refer to a characteristic of an instrument. Rather, validity, ever since Messick’s seminal chapter in the 3rd edition of Educational Measurement (Messick, 1989), refers to the inferences that are drawn from scores generated by an assessment. Here the important question is, “Is the inference we make about an individual, based on the score he or she obtains, valid?” When we talk about the validity of an instrument we are actually talking about the type of evidence we have that the instrument yields scores that warrant the inferences we make. There are three general classes of evidence: content-related evidence, criterion-related evidence, and construct-related evidence.
Content-related evidence relates to whether or not the content of the instrument articulates well with the entity assessed. One could argue, for instance, that an achievement test provides content-related evidence of validity if the items in the test are all related to the targeted area of achievement, say, algebra. When describing the Organizational Commitment Questionnaire, Hart and Willower provided examples of some of the items, e.g., “I am willing to put in a great deal of effort beyond what is normally expected in order to help the school be successful.” Seeing that the instrument contained items like this helps support the argument for validity.
Criterion-related evidence is established when a given instrument correlates well with an independent criterion. There are two general types of criterion-related evidence of validity, predictive evidence and concurrent evidence[1].
Predictive evidence of validity is obtained when scores from an instrument at one point in time adequately predict performances at a later point in time. There is evidence that the SAT, for instance, yields (somewhat) valid predictions of future performance in college. Concurrent evidence of validity, on the other hand, is established when scores on one instrument correlate well with scores on another instrument whose validity has already been established. Hart and Willower could have provided (but did not provide) this kind of evidence for the validity of the short-form OCQ by giving us the correlation between scores on the short form and the full-form OCQ. Apparently, they did not have this information.
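Had both sets of scores been available, concurrent evidence would amount to a simple correlation. The following sketch uses fabricated short-form and full-form totals purely for illustration; nothing here comes from Hart and Willower’s data.

# Minimal sketch of concurrent evidence: correlate scores on a short form
# with scores on an already-validated full-length form (fabricated totals).
import numpy as np

rng = np.random.default_rng(2)
full_form = rng.normal(loc=60, scale=10, size=100)            # hypothetical full-form totals
short_form = 0.4 * full_form + rng.normal(scale=4, size=100)  # hypothetical short-form totals

r = np.corrcoef(short_form, full_form)[0, 1]
print(f"short-form vs. full-form r = {r:.2f}")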
Obtaining construct-related evidence for validity is considerably more involved. In a general sense, it requires gathering evidence to support an assertion that inferences based on the test are valid. Typically this entails testing hypotheses involving the construct (e.g., learning, anxiety, attitude, etc.) being assessed and performance on the assessment. Several approaches are applicable here:
Intervention studies: Test the hypothesis that students will perform higher on the test following an instructional intervention.
Differential-population studies: Test the hypothesis that different populations will perform differently on the assessment (a minimal sketch follows this list).
Related measures studies: Test the hypothesis that students will perform similarly on similar tasks.
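For the differential-population approach, a minimal sketch with hypothetical group scores (echoing, but not reproducing, the job-stayer/job-leaver contrast cited below) is an independent-samples t test of whether two populations that theory says should differ on the construct actually obtain different scores.

# Minimal sketch of a differential-population check, using hypothetical scores
# for two groups that theory says should differ on the construct.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
stayers = rng.normal(loc=5.5, scale=1.0, size=40)   # hypothetical "job stayers"
leavers = rng.normal(loc=4.5, scale=1.0, size=40)   # hypothetical "job leavers"

t, p = ttest_ind(stayers, leavers)
print(f"t = {t:.2f}, p = {p:.4f}")   # a reliable difference supports the construct interpretation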
Hart and Willower provided evidence of construct validity by (1) citing the Porter et al. (1974) study, which showed that the OCQ discriminated between job-stayers and job-leavers, and (2) showing that the adjective pairs in the RSD were discriminatory. Their principal-components factor analysis of the RSD provides additional construct-related evidence.
It should be obvious from the foregoing that validity is primarily supported by argument: evidence-supported argument.
Reference
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.

General Comments on Your Responses to Forum 1

(Note: your responses were due by noon today, 27 January 2010. In order to have time to post my reactions, I had to stop looking for postings at that time.)

I know that I have cautioned you to first write your response in Word, and then edit it before pasting it into the Ning. It is readily apparent to me that many of you did not do this (or, if you did, you did not pay attention to what Word tells you); otherwise, the writing that I’ve seen would not be so sloppy and immature! For the most part, it was pretty bad. There are run-on sentences, rambling sentences, convoluted sentences, and incomplete sentences throughout your writing. I would be willing to bet that several of you really do not understand what you have written (I’ve included some examples later). If I were grading your responses, which I am not, but may do next time, I would give most of you a “C-” or lower.

Just about all of you failed to read the prompt carefully. The prompt asked you to “[d]escribe how [Hart and Willower] argue for the reliability and validity of [their] instruments for their purposes.” You went outside, often way outside, the boundaries of the prompt by addressing the internal and external validity of the study itself. The validity of the instrumentation and the validity of the study, although related, are different concerns. We can talk about issues concerning internal and external validity when we meet next week.

Finally, I have encouraged you to consult additional resources for use in this class. Finding material on reliability and validity is easy. The Internet is rife with it. If you are a serious doctoral student then you should be going beyond the material assigned in class. You need to get with it!

Some more specific comments.

Some of you talked about proving the validity of the instruments.

“Even the short version of OCQ is proved valid because coefficient alpha have ranged from .82 to .93 for both versions of the instrument and shows consistency.”

“They cite later studies confirming that the shorter version of the OCQ used in their own research also proved to be valid.”

“The validity of the instrument was proven for discriminating between job leavers and job stayers…”

Aside from the poor grammar in the foregoing statements, the validity of an instrument is NEVER proven. The best we can ever do is provide evidence of validity.

------

Differentiating between reliability and validity was difficult for several of you.

“Hart and Willower argue for the reliability of the Robustness Semantic Differential Scale by stating that, ‘the RSD was constructed using t tests to determine if adjective pairs discriminated between the concepts dramatic and not dramatic, and a principal components factor analysis was carried out.’” This addresses validity.

“Their internal validity for both measures was strengthened by relatively large randomly chosen samples (which reduced chances of random error), and the method of sampling.” This does not address the validity of the measures.

“The high coefficient alpha demonstrates these two tests to be almost equally valid in making this identification.” Coefficient Alpha does not (much) help establish validity. It is an index of reliability.

“The validity for the OCQ tool was measured against the longer original version of the test and was determined to have a coefficient alpha ranging from .82 to .93.”

“Each adjective pairing was constructed using t-tests to determine test-retest reliability.”

“The authors provided some evidence to support their choice of these instruments in terms of reliability and validity.”

You need to be able to differentiate between reliability and validity! Most of you were unable to do this (at least in your response to Forum 1). They are not the same thing. Here is a question to ponder: What is the relationship between reliability and validity? Can you have one without the other? Does one imply the other?

More general comments on your writing.

DO NOT use the term “utilize.” For an explanation, read your APA manual.

“Therefore, the data collected from teacher’s was considered to be organizational in form; whereas, the data collected from principals, while treated organizational, was actually individual in form.” In technical writing, data is always a plural noun. Also, I have no idea what is being said here.

“The OCQ (Organizational Commitment Quesionnaire [sic]) is an example of discriminant (sic) validity, since…” Instruments, themselves, cannot be examples of validity. This is loose writing!

“On a side note, I was struck by the fact that…” Gosh! I hope you were not hurt too badly.

Statements where it is not clear whether the author understood what he or she was writing.

“…the authors established reliability by carrying out test-retest reliability tests of each item and overall scores (p. 176). With a coefficient of .77 and .78, the likelihood is fairly high that these tests provide consistent measurements.”

“Another key point of the study which I think strengthens the research is the steps that were taken to eliminate same respondent bias by only subjecting one half of the teachers at a given school to one of the tools rather than to both tools. This step strengthens the correlations found in the study by avoiding the "high but spurious correlations" as cited in the article”

“The RSD is a bit tautological in the study as the construct of school environmental robustness is operationally defined by responses to the RSD. It’s hard to know for sure what we’re measuring there.”

“Similarly, reliability and validity information was reported….” Two things was reported!??

“Because the sample was self-selective, and as a result, educators in urban areas declined to participate, it cannot be said that the results of this study are fully nomothetic….” I’m not sure what the writer means here. If a reader does not understand something it is NOT the reader’s fault.

“1,431 teachers and principals provided usable….” When you begin a sentence with a number you need to spell the number out.

[1] I know, Trochim listed two more, convergent and divergent, but most authors include these under construct validity.