Psychological Testing

Fields of Study: Algebra, Data Analysis and Probability, Representations

Testing is used for many different purposes within psychology, among them evaluating intelligence, diagnosing psychiatric illness, and identifying aptitudes and interests. The results of testing are rarely the sole criterion for a diagnosis or other decision about an individual; instead, they are typically used in conjunction with information gained from other sources, such as interviews and observations of behavior. There are many types of psychological tests, but most share the goal of expressing an essentially unobservable quality, such as intelligence or anxiety, in terms of numbers. The numbers themselves are not meant to be taken literally: no one seriously believes that a person’s intelligence is equivalent to their IQ score, for instance. Instead, the numbers are useful tools that help evaluate a person’s situation: for instance, how does the intellectual development of one particular child compare with that of other children of the same age? Of course, the results of psychological testing should be evaluated with the social context of the individual in mind and with full respect for human diversity.

Psychometrics

Psychometrics is a field of study which applies mathematical and statistical principles to devise new psychological tests and evaluate the properties of current tests. The two most common approaches to psychometrics today are classical test theory and item response theory (IRT).

Classical test theory is the older of the two approaches, and the calculations it requires can be performed with pencil and paper, although computer software is now commonly used. Classical test theory assumes that all measurements are imperfect and thus contain error: the goal is to evaluate the amount of error in a measurement and develop ways to minimize it. Any observed measurement (for instance, a child’s score on an intelligence test) is made up of two components, true score and error. This may be written as an equation:

X = T + E

where X is the observed score, T is the true score (the score representing the child’s true intelligence), and E is the error component (resulting from imperfect testing). Classical test theory assumes that the error is random and thus will sometimes be positive (resulting in an observed score higher than the true score) and sometimes negative (resulting in an observed score lower than the true score), so that over an infinite number of testing occasions the mean of the observed scores would equal the true score. Although a test is normally administered only once to a given individual, this is a useful model that facilitates evaluation of the reliability and validity of different tests.
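The role of random error in this model can be illustrated with a brief simulation. The following sketch (in Python, using an arbitrary true score of 100 and an arbitrary error standard deviation of 5, neither of which comes from the model itself) draws many hypothetical observed scores and shows that their mean approaches the true score.

import numpy as np

rng = np.random.default_rng(seed=0)

true_score = 100.0   # hypothetical true score T
error_sd = 5.0       # hypothetical spread of the random error E

# Simulate many hypothetical administrations of the same test:
# each observed score X is the true score plus random error.
observed_scores = true_score + rng.normal(0.0, error_sd, size=100_000)

# Because the error has mean zero, the average observed score
# converges toward the true score as the number of simulated
# administrations grows.
print(round(observed_scores.mean(), 2))  # close to 100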

Item response theory (IRT) is a different approach to psychological testing. It assumes that observed performance on any given test item can be explained by a latent (unobservable) trait or ability, so that individuals may be evaluated in terms of how much of that trait they possess and items may be evaluated in terms of how much of the trait is required to answer them positively. For an item on an intelligence test (intelligence being the latent trait), persons with higher intelligence should be more likely to answer the question correctly. The same principle applies to IRT-based tests evaluating other psychological characteristics: for instance, if an item in a psychological screening test is meant to diagnose depression, a person with more depressive symptoms should be more likely to answer it positively. IRT is a mathematically complex method of analysis that depends on specialized computer software, but it has become a popular means of evaluating psychological tests as computers have become more affordable. Although the mathematical models of IRT differ from the model of classical test theory, the goals are the same: to devise tests that measure characteristics of individuals with a minimum of error.
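The text does not specify a particular IRT model, but a common instance is the two-parameter logistic model, in which the probability of a positive response rises with the respondent’s level of the latent trait and falls with the item’s difficulty. The Python sketch below illustrates this relationship with invented parameter values.

import math

def probability_positive(theta: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Two-parameter logistic item response function: the probability that a
    person at latent trait level theta responds positively to an item with
    the given difficulty and discrimination."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# A person whose trait level exceeds the item's difficulty is more likely
# than not to respond positively, and vice versa.
print(round(probability_positive(theta=1.0, difficulty=0.0), 2))   # about 0.73
print(round(probability_positive(theta=-1.0, difficulty=0.0), 2))  # about 0.27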

Reliability and Validity

Reliability refers to the consistency of a test score: if a test is reliable, it will yield consistent results over time and without regard to irrelevant conditions, such as who administers the test. Internal consistency is considered an aspect of reliability: it means that all the items in a test measure the same thing. Temporal reliability is also called test-retest reliability because it is typically evaluated by having groups of individuals take the same test on several occasions and seeing how their scores compare: some differences are expected due to the random nature of the error component, but there should be a strong relationship between individuals’ observed scores across occasions.

Inter-rater reliability refers to the consistency of a test or scale regardless of who administers it. Psychiatric conditions, for instance, are often evaluated by having an observer rate an individual’s behavior using a scale, and the results for different observers evaluating the same individual at the same time should be similar: three psychologists using a scale to evaluate the same child for hyperactivity should reach similar conclusions. Both types of reliability are typically evaluated by correlating test results on different occasions (temporal) or the scores returned by different raters (inter-rater).
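Both kinds of reliability coefficient are, at bottom, correlations. As a minimal sketch (with invented scores for eight hypothetical examinees), the following Python code computes a Pearson correlation between scores from two occasions; the same calculation applies to ratings from two different raters.

import numpy as np

# Hypothetical scores for eight examinees tested on two occasions
# (the same approach applies to ratings from two different raters).
time_1 = np.array([12, 15, 9, 20, 17, 11, 14, 18])
time_2 = np.array([13, 14, 10, 19, 18, 12, 15, 17])

# The Pearson correlation between the two sets of scores serves as a
# simple test-retest (or inter-rater) reliability estimate.
reliability = np.corrcoef(time_1, time_2)[0, 1]
print(round(reliability, 3))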

Internal consistency can be measured in several ways. The split-half method involves having a group of individuals take a test, splitting the items into two groups (for instance, odd-numbered items in one group and even-numbered items in the other), and calculating the correlation between the total scores of the two groups. Cronbach’s alpha (coefficient alpha) is a refinement of the split-half method: it is the mean of all possible split-half coefficients.
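Both calculations can be carried out directly from a matrix of item scores. The sketch below (Python, with invented responses of six examinees to six dichotomous items) computes a split-half correlation from odd- and even-numbered items and Cronbach’s alpha from the standard formula based on item variances and the total-score variance.

import numpy as np

# Hypothetical item scores: rows are examinees, columns are test items.
items = np.array([
    [1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 0],
])

# Split-half estimate: correlate total scores on the odd- and
# even-numbered items.
odd_total = items[:, ::2].sum(axis=1)
even_total = items[:, 1::2].sum(axis=1)
split_half_r = np.corrcoef(odd_total, even_total)[0, 1]

# Cronbach's alpha from item variances and the total-score variance.
k = items.shape[1]
item_variances = items.var(axis=0, ddof=1)
total_variance = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(round(split_half_r, 3), round(alpha, 3))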

Validity refers to whether a test measures what it claims to measure. Three types of validity are typically discussed: content, predictive, and construct. Content validity refers to whether the test includes a reasonable sample of the subject or quality it is intended to measure (for instance, mathematical aptitude or quality of life) and is usually established by having a panel of experts evaluate the test in relation to its purpose. Predictive validity means that test scores correlate highly with measures of similar outcomes in the future: for instance, a test of mechanical aptitude should correlate with a new hire’s later success as an auto mechanic. Construct validity refers to a pattern of correlations predicted by the theory behind the quantity being measured: scores on the test should correlate highly with scores on other tests that measure similar qualities and less highly with scores on tests that measure different qualities.
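As a simple illustration of a construct validity check, the Python sketch below (with entirely invented scores for ten hypothetical examinees) compares a new anxiety measure’s correlation with an established anxiety measure against its correlation with a vocabulary test; the theory predicts the first should be substantially higher.

import numpy as np

# Hypothetical scores on a new anxiety test, an established anxiety
# measure (similar construct), and a vocabulary test (different construct).
new_anxiety = np.array([10, 14, 8, 20, 16, 12, 18, 9, 15, 11])
old_anxiety = np.array([11, 13, 9, 19, 17, 11, 17, 10, 14, 12])
vocabulary  = np.array([30, 22, 28, 25, 24, 31, 23, 29, 26, 27])

# Convergent correlation (expected to be high) versus discriminant
# correlation (expected to be lower) under the construct's theory.
convergent = np.corrcoef(new_anxiety, old_anxiety)[0, 1]
discriminant = np.corrcoef(new_anxiety, vocabulary)[0, 1]
print(round(convergent, 2), round(discriminant, 2))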

SEE ALSO: Diagnostic Testing; Educational Testing; Intelligence Quotients.

FURTHER READINGS:

Embretson, Susan E. and Steven P. Reise. Item Response Theory for Psychologists. Mahwah, NJ: Erlbaum, 2000.

Furr, R. Michael and Verne R. Bacharach. Psychometrics: An Introduction. Thousand Oaks, CA: Sage Publications, 2007.

Gopaul-McNicol, Sharon-Ann and Eleanor Armour-Thomas. Assessment and Culture: Psychological Tests with Minority Populations. Burlington, MA: Elsevier, 2001.

Kline, Paul. The Handbook of Psychological Testing. New York: Routledge, 2000.

Wood, James M., Howard N. Garb and M. Teresa Nezworski. “Psychometrics: Better Measurement Makes Better Clinicians,” in The Great Ideas of Clinical Science: 17 Principles That Every Mental Health Professional Should Understand, eds. Scott O. Lilienfeld and William T. O’Donohue. New York: Routledge, 2007.

Sarah Boslaugh, Ph.D.

Washington University School of Medicine