ASSESSMENT MODEL

CONSTRUCTING VALID AND RELIABLE TESTS

Source: Jim Haf, COOR ISD

Pre-Planning Tests and Analyzing Results

Using Tests to Guide Instruction and Improve Student Achievement

“X” Indicates “Mostly Contributes to” Validity or Reliability

PROCESS
PRE-PLANNING / Validity / Reliability / Discussion
Deciding on Content
Examples: expectations from 1st quarter on pacing guides; 10 essential understandings from a building’s enacted curriculum / X / In Michigan, it is essential for schools to use the Grade Level (GLCE) and High School (HSCE) Content Expectations as the basis of their curricular tests. In a perfect situation, the expected curriculum is the same as the enacted curriculum – enacted meaning that which is actually implemented with students. Many schools use “Pacing Guides” to organize the delivery of content expectations within their curriculum.
Checking Alignment:
Between expected content, enacted curriculum, and actual content covered for the intended test / X / There should be an alignment between the stated content in a local curriculum, the content covered in classrooms, the content covered on a test, and items measuring each content area. Items should be agreed upon as clear measures of the content. This means that a group of teacher experts should pass judgment on the items and the test and on the match between the content and the test.
Deciding Learning Levels to be Tested
Examples: Knowledge, Comprehension, Deep Understanding / X / The GLCE’s and HSCE’s are written at several cognitive levels, including knowledge, comprehension, and deeper understanding. Items on a test need to be aligned with the content expectation at the level at which it is written. Thus the test needs to be aligned and balanced not only in content but also in cognitive levels.
Creating a Test Blueprint that includes:
Content and Alignment and
Required number of items for the expectations and learning levels tested / X / This is very important to both face validity and actual test validity. Face validity means the process is systematic and precise and will most likely result in a test measuring what it was intended to measure. In this way it can serve as evidence of apparent validity to others. Actual test validity means the process maps the content, items, and levels in a way that is acceptable to experts in measurement. A test blueprint demonstrates the structure of the test and the match of items to content and cognitive levels.
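The sketch below shows one way a test blueprint could be kept as a simple data structure so that the planned test length and the balance of cognitive levels can be checked before items are written. The expectation labels, cognitive levels, and item counts are hypothetical placeholders, not taken from any actual pacing guide.

```python
# A minimal blueprint: each row maps a content expectation (labels are
# made-up placeholders) to a cognitive level and a planned item count.
blueprint = [
    ("Expectation 1.01", "knowledge",          4),
    ("Expectation 1.02", "comprehension",      3),
    ("Expectation 1.03", "deep understanding", 3),
]

def total_items(blueprint):
    """Total planned items, so the overall test length can be checked."""
    return sum(count for _, _, count in blueprint)

def items_by_level(blueprint):
    """Balance of planned items across cognitive levels."""
    levels = {}
    for _, level, count in blueprint:
        levels[level] = levels.get(level, 0) + count
    return levels

print(total_items(blueprint))     # 10 planned items
print(items_by_level(blueprint))  # e.g. {'knowledge': 4, 'comprehension': 3, ...}
```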
PROCESS
ADMINISTRATION AND SCORING / Validity / Reliability / Discussion
Administering Test
Including
Clear directions, standardized conditions, appropriate time for the purpose of the test; guidelines about amount of assistance given during the test / X / Performance on the test is affected by both internal and external factors. Internal factors include the mental and physical state of the test-taker, preparedness, motivation, etc. External factors include the room, time of day or week, directions, amount of time, noise, etc. Standardization helps control these internal and external factors so that the test results are more reliable (consistent) estimates of actual ability.
Test Scoring Including:
Scanning into scoring and analysis software, error correction and data checking, creating manageable data files / X / X / Using mechanical scoring (scanning) into test analysis software is very helpful in further analysis of a test. This is essential for test improvement and for using the test to identify priorities for further instruction. This includes creating data files of test results that are correct and in a format that can be analyzed. See the post-test analysis that follows next.
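As a rough illustration of what a "manageable data file" might look like, the sketch below turns a few made-up scanned answer strings into a checked 0/1 score matrix. The student IDs, the answer key, and the use of "*" for an unreadable mark are assumptions made for the example only.

```python
# Made-up answer key and scanned responses; '*' marks a blank or
# unreadable bubble (scored 0 here; in practice it would be flagged
# during error correction and data checking).
answer_key = "BDACB"
scanned = {
    "STU001": "BDACB",
    "STU002": "BDBCA",
    "STU003": "BD*CB",
}

def score_responses(responses, key):
    """Convert letter responses to 1 (correct) / 0 (incorrect or blank)."""
    return [1 if r == k else 0 for r, k in zip(responses, key)]

def build_data_file(scanned, key):
    """One row per student: ID followed by 0/1 item scores."""
    rows = []
    for student_id, responses in scanned.items():
        if len(responses) != len(key):
            raise ValueError(f"{student_id}: wrong number of responses")
        rows.append([student_id] + score_responses(responses, key))
    return rows

for row in build_data_file(scanned, answer_key):
    print(row)
```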
PROCESS
POST TEST ANALYSIS / Validity / Reliability / Discussion
Test Analysis - Including:
  • Subgroup and total group information
  • Descriptive data such as mean and score distribution
  • Test analysis for test error and reliability
  • Item statistics for item difficulty and item discrimination
  • Item analysis for understanding item results (for both correct and incorrect responses)
  • Reporting results to all stakeholders
/ X / X / A series of steps can be taken to understand the results of a given test and to analyze it for test and item quality. These include 1) Subgroup analysis, 2) Descriptive information (mean and variance), 3) Statistical test analysis (reliability and error analysis), 4) Item statistics (difficulty and discrimination), and 5) Item analysis of correct and incorrect responses. Along with pre-planning and use of a test blueprint, this post-test analysis can ensure better, more reliable, and more valid tests.
Reporting the results of testing gives information to test-takers about performance and progress, to test writers for continued development of the tests, and to stakeholders who need the information for accountability, program improvement, and overall efforts at improving student achievement.
* This process assumes that the test is objective (multiple choice; true/false etc.) and can be scored with numbers. However, test analysis software also allows inclusion of number scores for rubrics and scales that can be entered into the data file.
Test Analysis: Detail / Validity / Reliability / Discussion
Sub Group Analysis
Includes:
Special codes / X / Answer sheets need to be coded to allow for subgroup analysis. This allows for disaggregating data and reporting scores by groups such as gender, at-risk status, or special program.
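A minimal sketch of what disaggregation by a special code might look like once the data file carries a group field; the group labels and scores below are invented for illustration.

```python
# Made-up records with a subgroup code carried over from the answer sheet.
records = [
    {"id": "STU001", "group": "general", "score": 18},
    {"id": "STU002", "group": "at_risk", "score": 12},
    {"id": "STU003", "group": "general", "score": 20},
    {"id": "STU004", "group": "at_risk", "score": 15},
]

def disaggregate(records, field="group"):
    """Report each subgroup's size and mean score."""
    groups = {}
    for r in records:
        groups.setdefault(r[field], []).append(r["score"])
    return {g: (len(s), sum(s) / len(s)) for g, s in groups.items()}

print(disaggregate(records))
# {'general': (2, 19.0), 'at_risk': (2, 13.5)}
```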
Descriptive Analysis
Including:
Mean
Range
Variance
Overall Difficulty / X / Gives the average, range of scores, variance, standard deviation, etc. for the total test and sub-tests to aid in the interpretation of the test and results. This also gives some idea about the overall difficulty of the test, the spread of scores, and performance by various groups on subparts of the test. For example, students may do well on vocabulary and not as well on comprehension and deep understanding, and this gap may be larger for at-risk students than for the rest of the group.
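A brief sketch of the descriptive statistics named above, computed on a made-up set of total scores from a hypothetical 25-item test; test analysis software typically reports these automatically.

```python
scores = [21, 18, 15, 23, 12, 19, 17, 20, 14, 22]  # made-up total scores
num_items = 25                                      # assumed test length

mean = sum(scores) / len(scores)
score_range = max(scores) - min(scores)
variance = sum((s - mean) ** 2 for s in scores) / len(scores)
std_dev = variance ** 0.5
overall_difficulty = mean / num_items  # average proportion of items correct

print(f"mean={mean:.1f}, range={score_range}, variance={variance:.2f}, "
      f"sd={std_dev:.2f}, overall difficulty={overall_difficulty:.2f}")
```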
Test Quality Analysis
Including:
Data to make judgments about the way the test is functioning with the particular group and the reliability of the test / X / Statistics to look at are:
Mean and Standard Deviation
Reliability: Cronbach Alpha; Kuder-Richardson
Overall Difficulty and Discrimination
Standard Error and Error Bands
All tests have some error. A test constructor is usually looking for: 1) a reasonable mean score for the group, and a standard deviation that shows some range in scores; 2) acceptable reliability (.75 to .95) for the total group and subgroups – subgroups may be lower depending on the number of items; 3) a moderate difficulty and discrimination index (roughly 35-40% to 65-70%); and 4) a fairly small standard error and error band (the error band is plus and minus 1 standard error).
This information is usually considered along with item statistics to make judgments about the quality of the test. (See item statistics that follows next).
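As a worked sketch of the statistics listed above, the code below computes the Kuder-Richardson 20 reliability estimate (which equals Cronbach's Alpha for right/wrong items), the standard error of measurement, and a one-standard-error band around one student's score. The 0/1 item matrix is invented for illustration.

```python
# rows = students, columns = items (1 = correct, 0 = incorrect); made-up data
data = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 1],
]

k = len(data[0])                      # number of items
totals = [sum(row) for row in data]   # total score per student
mean = sum(totals) / len(totals)
var_total = sum((t - mean) ** 2 for t in totals) / len(totals)
sd_total = var_total ** 0.5

# item difficulties p and the sum of p*q used by KR-20
p = [sum(row[i] for row in data) / len(data) for i in range(k)]
sum_pq = sum(pi * (1 - pi) for pi in p)

kr20 = (k / (k - 1)) * (1 - sum_pq / var_total)  # reliability estimate
sem = sd_total * (1 - kr20) ** 0.5               # standard error of measurement

score = totals[0]
print(f"KR-20 = {kr20:.2f}, SEM = {sem:.2f}")
print(f"error band for a score of {score}: {score - sem:.1f} to {score + sem:.1f}")
```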
Item Statistics
Include:
Item Difficulty
Item Discrimination
Item correlations with the Total Score / X / These include the Item Discrimination Index for each item and the Point Biserial correlation. Both measure an item’s ability to separate students who do well on the test as a whole from those who do not. In short, there should be a fairly high correlation between getting an item right and scoring high on the total test. Low correlations might indicate a problem with the item.
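The sketch below computes the item statistics named above on a small invented data set: item difficulty (the p-value), an upper/lower-group discrimination index, and the point-biserial correlation between each item and the total score.

```python
data = [  # rows = students, columns = items (1 = correct, 0 = incorrect); made-up
    [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1],
    [1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0],
]

totals = [sum(row) for row in data]

def difficulty(item):
    """Proportion of students answering the item correctly."""
    col = [row[item] for row in data]
    return sum(col) / len(col)

def discrimination(item):
    """Difficulty in the top half of scorers minus difficulty in the bottom half."""
    order = sorted(range(len(data)), key=lambda s: totals[s])
    half = len(data) // 2
    lower = [data[s][item] for s in order[:half]]
    upper = [data[s][item] for s in order[-half:]]
    return sum(upper) / len(upper) - sum(lower) / len(lower)

def point_biserial(item):
    """Pearson correlation between the 0/1 item score and the total score."""
    col = [row[item] for row in data]
    n = len(col)
    mean_x, mean_y = sum(col) / n, sum(totals) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(col, totals)) / n
    sd_x = (sum((x - mean_x) ** 2 for x in col) / n) ** 0.5
    sd_y = (sum((y - mean_y) ** 2 for y in totals) / n) ** 0.5
    return cov / (sd_x * sd_y)

for i in range(len(data[0])):
    print(f"item {i + 1}: p={difficulty(i):.2f}, "
          f"D={discrimination(i):.2f}, r_pb={point_biserial(i):.2f}")
```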
Item Analysis
Including:
Answers to all options / X / X / Gives the responses for each option chosen. The major possibilities are: 1) most students get an item right; 2) most students get an item wrong; 3) there is a mix of responses, which can follow a pattern or be scattered. In the case of most students getting an item wrong, look at the option most often chosen and analyze why it was chosen. A scattered mixture of wrong choices probably indicates guessing.
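A minimal sketch of an option-level tally for a single item, using invented responses; this is the kind of output used to spot an over-attractive distractor or scattered guessing.

```python
responses = ["B", "B", "C", "B", "A", "D", "B", "C", "B", "A"]  # made-up choices
correct_option = "B"

counts = {}
for choice in responses:
    counts[choice] = counts.get(choice, 0) + 1

for option in sorted(counts):
    flag = " (key)" if option == correct_option else ""
    print(f"option {option}: {counts[option]} students{flag}")

# A heavily chosen wrong option may point to a misconception or an
# ambiguous distractor; a scattered mix of wrong options suggests guessing.
```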
PROCESS
AFTER ANALYSIS / Validity / Reliability / Discussion
Editing Test Items / X / X / If an item is too difficult or too easy, sometimes the wording of the item is not clear and a simple rewriting makes the item more useful. Sometimes the options all seem correct and the writer needs to make the choices clearly different. Sometimes the item should be discarded and replaced. If answers seem random, distractors can be changed to make the answers less ambiguous.
In addition, item analysis gives the test constructor valuable information about what students are getting wrong. It may identify common misconceptions or seemingly correct options. Item analysis also gives information about how to edit the item to make it more useful. For example, a distractor that is too attractive can be changed to be more obviously wrong or one that is not attractive enough can be closer to correct.
All of this implies that tests are given more than once and there is a desire to “fix” the items to make them more useful.
Reporting Results
To the test-takers and to school groups (teachers, administrators, school improvement teams, parents, other stakeholders) / X / X / If a test is public and reported to others, the writer is more motivated to pre-plan the test and to post-analyze its results. If test-takers can scrutinize the test, it becomes more important for both writer and taker.
As a rule of thumb, the more important the decisions made with test results, the more valid and reliable it should be. High stakes tests, for example, should be carefully constructed, piloted, systematically analyzed, and subject to objective review by experts and stakeholders.
Reporting results should focus on the purpose of the test, provide clear, unambiguous information, be reliable and valid, be aligned to the important standards of the building, and be acceptable as evidence of achievement. Lastly, and perhaps most important, any test given should be useful for instruction. Even if the design and purpose are mostly for determining status, an effort should be made to use the results to guide instruction.
Using Results for Instruction / X / Again, even if the design and purpose of a test are mostly to determine status, an effort should be made to use the results to guide instruction. Results should be reported to show areas of strength and weakness, and information should be provided for relevant subgroups. Sub-test and item analysis results should be used to show items and areas that are priorities for instruction. A good rule of thumb is a model that always provides a corrective for students who have not learned the standards and expectations tested.