
Forum 3: Reliability and Validity
Rudner (2001), Brualdi (1999), and Popham (in his text) discuss notions of reliability and validity. In your own words, differentiate between the two concepts. In differentiating the two, be sure to discuss what it is that each concept concerns. Then, discuss how reliability and validity are important (or not important) to classroom assessment. Finally, as a school leader, what would you tell your teachers about reliability and validity? Why?

Reply by Kathy Courtemanche on February 4, 2010 at 6:51pm

To me, reliability refers to the consistency of the results, whereas validity is the degree to which one can be confident that the inferences made from the results are accurate and, therefore, meaningful. I see validity as more important in CLASSROOM assessment than reliability. (Both are important in standardized tests.) When teachers give assessments, they should ensure that the assessment items are relevant to the domain being tested and that the items adequately sample the entire domain (Popham, 2006). It is unfair and inappropriate to have a test item that “comes out of the blue.”

I couldn’t agree with you more. You have identified the main points of the article and hit the prompt squarely. Well done!

As a school leader, I would suggest that my teachers start with their curricular aims, develop their assessments from those curricular aims (ensuring the test items are valid), and finally design their instruction to facilitate the desired learning outcomes. I would also caution my teachers to be careful in their interpretations of assessment data and the weighty decisions made from those interpretations. For example, several scores or other sources of academic data should be used to determine student placement in advanced or remedial classes, rather than basing the decision on one score. Teachers also need to be careful when comparing students by their EOG percentile ranks. Those percentile ranks are only for that specific test and are not equated the way the developmental scale score is. (I guess we’ll see more about scale scores later in this course.)

I can see the reliability of an assessment instrument being called into question if a specific form of the EOG seems to yield more below-level scores than other forms, signaling a possible flaw in alternate-forms reliability. A couple of years ago, 7th grade Math teachers in our county were meeting to develop a curriculum pacing guide and, as an aside, discovered that the majority of our students who scored below grade level on the recent EOG had all taken the same form of the test. Of course, we shared our concern with DPI and never heard back from them.

This should not happen. All forms are equated back to a single scale.

Last year DPI decided it would no longer consider the Standard Error of Measurement (SE) in determining whether a student had passed the EOG on grade level or not. Previously, if a student was within 2 SE of the cut score, they were given the “benefit of the doubt” and marked as passing the EOG on grade level. I think DPI should consider the SE since it allows for the slight performance variability of an individual test-taker. This past year they made all level 2 students retest even if they were within the 2 standard errors. On the retest they then allowed students within 1 standard error to be labeled as “passing.” Given the gravity of some decisions based on the EOG score and the cost of retesting, I think they should err on the side of leniency.

What about using the SEM the other way around? Seeing if kids who pass are still within the SEM. What should we do with those kids?
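To make the arithmetic of that decision rule concrete, here is a minimal sketch in Python (the cut score and SEM values are invented for illustration, not actual EOG figures) of the old "within 2 SE of the cut score" benefit-of-the-doubt rule, and of what happens when it is applied in the other direction:

# Hypothetical illustration of the "benefit of the doubt" rule described above.
def passes_with_benefit_of_doubt(score, cut_score, sem, band=2):
    """A student counts as passing if at or above the cut, or within `band` SEs below it."""
    return score >= cut_score - band * sem

cut, sem = 350, 3  # invented values, not real EOG numbers

print(passes_with_benefit_of_doubt(347, cut, sem))  # True: within 2 * 3 points below the cut
print(passes_with_benefit_of_doubt(343, cut, sem))  # False: more than 2 SEs below the cut

# The "other way around": a score of 352 passes outright, yet 352 - 2 * 3 = 346,
# so the same error band reaches below the cut for that student as well.
print(passes_with_benefit_of_doubt(352, cut, sem))  # True, but with the same uncertainty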

Reply by Charity Nightingale Horton on February 5, 2010 at 9:44am

Kathy,

I agree with the inappropriateness of creating test items "out of the blue"! It is really unfair to the students and truly doesn't yield any relevant info for the teacher, not to mention it almost sets students up for failure.

Reply by Sara Gilliam Crater on February 5, 2010 at 10:15am

Teaching in a high school, I do not hear about these issues concerning standard error of measurement. I agree with you about giving students the two-SE benefit of the doubt, because of the cost of the test and because they are so close. Two points...is that really such a big deal?

Reply by Tammy Essic 1 day ago

I agree with the idea of developing curricular aims, then assessment, and then instruction. It's not the pattern typically followed, but doesn't it make more sense to have those clear targets in mind, and to know how you're going to determine whether you met them, when you're planning the instruction? I also agree about using caution when making interpretations of assessment data. Important decisions about a child's future often rest on one score, which can be so unfair to that student.

Reply by Brenda Little 1 day ago

Kathy: You discuss having several sources of academic data for students when determining placement into special programs. This is exactly what is required for children being placed in special education. I have to have 2 observations, individual test scores from various domains, teacher input, parent input, and interventions. This process sounds like it might work for all students. I think that is why more schools are moving to RtI.

Reply by Cynthia Barber 18 hours ago

I know last year when we had to conduct retests on all the students within 2 SE, it dramatically increased the number of students retested at my school. The decisions made based on the EOG seem unfair, given the fact that it is how the student performed on a given day. Also, when working with an at-risk population, many other factors go into the testing situation. You never know what some students have dealt with at home before arriving at school to take the test.

Reply by Rebecca Free 12 hours ago

I wonder, though, how much of the improvement in the score was due to replicating the test experience rather than the idea that the students just arbitrarily performed poorly that day. It seems most students would improve when asked to perform the same skill repeatedly.

------

Reply by Cynthia Barber on February 5, 2010 at 9:02am

Unless you are someone who often works with tests and measurement, the concepts of validity and reliability are not given much thought. Like most people, I did some brief work with test validity and reliability in my graduate studies, but they were not concepts to which I gave much consideration in the classroom as a fifth grade teacher. In my current position as curriculum coordinator, I have become somewhat more familiar with validity and reliability. When reading the text I often thought about some of the current tests administered to our elementary students (Iowa Test of Basic Skills, Cognitive Abilities Test, and EOG, to name a few).

When thinking about test validity, it is important to understand that validity relates to the inferences drawn about students' responses on specific tests. Our main concern is making accurate inferences about the students. There are several ways to view tests in order to verify your inferences. The first way is to consider the content of the tests. Does this test measure the content it is supposed to measure? In talking to teachers, I would encourage them to think about their own teacher-made or ClassScape assessments of class material. As a grade level, I might have them craft an assessment based on the learning targets they are currently teaching (in one subject area). I would then ask them to review the learning targets to decide whether the questions on the test truly measure the targeted skills. It might also be helpful for teachers to review previously used tests and measure them against the learning targets to identify whether they were valid assessments.

I would also begin to have conversations with teachers about criterion-related validity in relation to End of Grade tests for 3-5 teachers and the ITBS & CogAT for K-2 teachers. With 3-5 teachers I would discuss the End of Grade tests. It is important to stress to teachers that the EOG is only one measure of a student's achievement. As 3-5 teachers, we compare student performance on EOG tests with class grades and perceived class performance. At my school, one question we discuss is whether our class grades are indicative of students' performance on the EOG. We talk about the fact that, barring extenuating circumstances, a student who is making good progress in reading and math should be scoring a Level 3 or 4 on EOG tests. If this is not the case, we need to look at our own classroom assessment. When we administer the ITBS and CogAT to second grade students, we are making inferences about a student's aptitude and achievement. We use these tests to predict whether the student would be successful in an academically gifted program. However, as K-2 teachers we might compare a student's results on these two tests with their grades on current and previous report cards and teachers' perceptions of their class performance.

This is good stuff! Your comments are very insightful. Perhaps many teachers should look at the reliability and validity of their own classroom assessments.

Construct-related validity allows the test writer to have confidence in the test created. I would ask teachers to think about whether their tests measure the learning targets they want them to measure. I would not come into a meeting and announce that we were going to discuss “construct-related validity”, but I would encourage teachers to think about an intervention study to determine whether the students in their classes respond differently after an intervention, such as after-school tutoring or attending ESL class. An optimal time to do this would be at the end of the quarter, when we are having data conferences with teachers. I could also relate this to the School Assistance Team process. In the SAT process, the team determines interventions for students having learning difficulties. After several weeks of using the interventions, the team looks at before-and-after test measurements to determine whether further testing or additional interventions are needed.

Having teachers think about how they would show that their tests are functioning the way they should is an excellent idea. I wonder what you would learn from such an exercise? (Might make a good action research project.)

Discussing assessment reliability with teachers is a bit trickier. It would be a waste of time to discuss reliability coefficients, etc., with teachers; most classroom teachers want to know how the information presented relates to their classroom practice. I would explain that a test must be reliable in order to make valid inferences; therefore, teachers must make every effort to create assessments that are indicative of the learning taking place in the classroom, create optimal conditions for testing in the classroom, and consistently score teacher-created assessments. In my school we always discuss consistent scoring practices in relation to constructed-response assessments such as writing. We spend a great deal of time discussing the rubrics used to score these assessments. In the future, I will pointedly discuss that the reason we do this is so our assessments will be reliable measures of student achievement and so that we can make valid inferences about student needs.

Reply by Charity Nightingale Horton on February 5, 2010 at 9:49am

Cynthia,

Reliability definitely is a harder concept to pull out! I like how you would relate validity to reliability....a natural integration that seems easier to understand, especially for teachers who are not confident with one or both concepts.

Reply by Brenda Little 1 day ago

Cynthia: I think teachers need to understand reliability and validity as they apply to the classroom. You made some good points: school leaders need to spend their time helping teachers develop tests and grading rubrics that are valid and reliable. I am not sure that many teachers really know how to use a grading rubric to its fullest potential. I think more training is needed in this area.

Reply by Rebecca Mankins 20 hours ago

I agree that not many teachers are familiar with using rubrics. My friends who were education majors talked all the time about how they were graded on rubrics. That was not the case for me as a psychology and sociology major. The only time I have been graded on a rubric is when I took an education course. So…what do you think about being graded according to rubrics?

------

Reply by Sara Gilliam Crater on February 5, 2010 at 12:59pm

I liked the reference Popham gave when he said, “A test is merely a tool that we use in order to make inferences.” Validity and reliability are two evaluative factors that we as educational leaders can use in our respective schools.

Does the test measure what it is supposed to measure? That would be my definition of validity. (But this is NOT the current way we think about validity. Now, we recognize that validity is concerned with the inferences we draw from test results.) The book mentioned three varieties of validity evidence: content-related (the test seems to be based on appropriate content), criterion-related (predictions of how students will achieve later on another test), and construct-related (an accumulation of evidence).

Reliability means the stability or consistency of an assessment. Three approaches to reliability discussed in the book were stability (consistency of a test’s measurement over time), alternate-form (two forms of the same test), and internal consistency (a test’s internal elements). Rudner & Schafer also discussed rubrics and scoring papers. I have a hard time grading papers when I see a student's name at the top. Even though I follow the rubric, if I know this person tries really hard, I might give them a slightly higher grade. I know this is horrible (Not horrible, just problematical.) and not reliable at all, so I started having the students put their school code on the top so I couldn't see their names. This helped with the reliability of the scoring. (Good idea!)
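For anyone who wants to see what one of those reliability estimates looks like concretely, here is a minimal sketch in Python with invented scores (not real student data); alternate-form reliability is typically estimated as the correlation between the same students' scores on the two forms:

import numpy as np

# Hypothetical scores for the same ten students on two forms of a test.
form_a = np.array([78, 85, 62, 90, 71, 88, 95, 67, 74, 81])
form_b = np.array([75, 88, 65, 87, 70, 91, 93, 70, 72, 84])

# Alternate-form reliability is commonly estimated as the correlation between
# the two sets of scores; a value near 1.0 means the forms rank students similarly.
reliability_estimate = np.corrcoef(form_a, form_b)[0, 1]
print(f"alternate-form reliability estimate: {reliability_estimate:.2f}")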

What does this mean to me as an educational leader? If a test is unreliable, it will not yield trustworthy information concerning student achievement. Validity and reliability are both important to classroom assessment because without them, our picture of student achievement will be flawed.

I would inform my teachers to look at every test question they give and keep it only if the information was part of the curriculum, had been gone over previously, and is displayed in a concise manner. Brualdi described the traditional and modern concepts of validity and explained that each test question needs to be meaningful, useful, and appropriate. Another thing I would tell my teachers would be Rudner’s advice in his article about how lack of sleep and negative attitudes affect students’ scores. Teachers need to be engaging students in a dialogue concerning their personal habits and explaining the importance of a positive attitude going into the test.

Reply by Brenda Little 1 day ago

Sara: Great tip about adding the school code instead of the student’s name. I can understand where you are coming from. It's hard not to let other factors bleed into your attempt to be reliable in grading tests and assignments!

------

Reply by Kristi Gaddis on February 5, 2010 at 3:01pm

Validity and reliability are ways in which we measure the bias and distortions of tests. Validity is reached when a test measures what it is supposed to measure. Popham notes, “Validity refers to the accuracy of the inferences” (2006). Brualdi more specifically defines it as “the degree with which the inferences based on test scores are meaningful, useful, and appropriate” (1999). Inferences will, in essence, be accurate when a test correctly measures what it is intended to measure. Traditional validity can be broken into three separate categories: content, criterion, and construct. Brualdi makes a point to mention that Messick’s modern approach to validity encompasses six interdependent aspects of validity. Popham also notes Messick’s (1996) new approach to validity, yet he states that “consequential validity” should be kept separate from a test’s validity (1999). I agree with Popham’s approach to separate the two concepts to maintain clarity.

You are not alone on this. Many assessment specialists disagree with Messick over consequential validity.

When a test is valid it is almost always reliable; reliable tests, on the other hand, are not always valid. Reliability refers to the consistency of a test. My metaphor for reliability: if you weigh an apple in the morning and weigh it again in the afternoon, receiving the same result, the scale you are using is reliable. A test will never reach a perfect (1.0) reliability measure; therefore the standard error of measurement (SEM) is used to account for any inconsistency a test might possess. Popham states that the SEM is more useful information for educational leaders and test users than the reliability coefficient (1999). If a standard error of measurement is over 8 points, I question the validity of the test. I think you mean the reliability of the test.
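As a quick illustration of how the two quantities relate (using made-up numbers and assuming a score standard deviation of 10), the standard formula SEM = SD * sqrt(1 - reliability) shows why a large SEM points to low reliability:

import math

# Standard formula: SEM = SD * sqrt(1 - reliability coefficient). Numbers below are illustrative only.
def standard_error_of_measurement(sd, reliability):
    return sd * math.sqrt(1 - reliability)

print(standard_error_of_measurement(10, 0.90))  # about 3.16 points
print(standard_error_of_measurement(10, 0.36))  # 8.0 points; an SEM this large implies quite low reliability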

Sandra Lindsay said, “The key for school administrators is to remember both the power and limitations of assessments” (Popham, 1999). She went on to say that the best evaluation is derived from talking with the student’s teacher. AfL is only as good as the validity of the assessments. Content validity is imperative for assessment for learning. If a teacher is assessing to see whether their students have mastered slope-intercept form, then the questions on the test need to assess all the aspects of slope-intercept. Rudner states, “Classroom assessments seldom need to have exceptionally high reliability coefficients” (2000). This statement makes sense: if a teacher is re-testing after teaching the material in a different way, they should expect to have different results. Exceptionally high reliability is not necessary in classroom assessments. I think Popham agrees with you, here.