
Unidentifiability and Lack of Monotonicity in the Multidimensional Three-Parameter Logistic Model

Matthew Finkelman, Giles Hooker and Jane Wang

Abstract

The multidimensional three-parameter logistic (M3PL) model is well known in the psychometric literature as a means of formalizing the relationship between test items and examinee abilities. According to this model, an examinee with low ability along one dimension may maintain a high probability of correct response as long as he/she has a high ability level along another dimension. This article investigates two irregularities that can occur in the M3PL model: the unidentifiability of the ability parameter and the lack of a monotonic relationship between item responses and ability estimates. A new computer-adaptive testing item selection technique is introduced to curb the occurrence of problems associated with unidentifiability. An analysis of real data indicates the prevalence of these issues in an operational setting.

Key words: Multidimensional Item Response Theory, statistical inference, test fairness.


1. Introduction

Multidimensional Item Response Theory (MIRT) is a popular method for modeling examinee behavior along multiple latent dimensions of ability (Reckase, 1985; Ackerman, 1996). MIRT models are often considered more realistic than unidimensional models when examinee responses are a function of more than one latent trait. Additionally, since MIRT models give estimates of an examinee’s ability along multiple dimensions, they can be used to provide diagnostic information about several subscales simultaneously.

Although MIRT is a relatively new method, it has received considerable attention in the psychometric literature. Reckase (1997) provided a historical account of the model, describing its relation to other tools (in particular, factor analysis and unidimensional IRT), giving examples of its applicability in practical problems, and suggesting topics for future work. There have been a number of recent articles on MIRT, including the study of its applications to longitudinal data (te Marvelde, Glas, Van Landeghem, & Van Damme, 2006) and subscale proficiency estimation (Yao & Boughton, 2007).

The multidimensional 3-parameter logistic (M3PL) model (Reckase, 1985; Ackerman, 1996) is a compensatory MIRT model: decreasing an examinee’s ability along one dimension can be offset by increasing ability along another dimension. Let $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_D)'$ denote the vector of abilities in a $D$-dimensional model. Let $U_i$ be an indicator variable such that $U_i = 1$ if a given examinee responds correctly to item $i$ and $U_i = 0$ otherwise. Under the M3PL, the probability of a correct response to item $i$ given an examinee of ability $\boldsymbol{\theta}$ is

$$P_i(\boldsymbol{\theta}) \equiv P(U_i = 1 \mid \boldsymbol{\theta}) = c_i + (1 - c_i)\,\frac{\exp(\mathbf{a}_i'\boldsymbol{\theta} + d_i)}{1 + \exp(\mathbf{a}_i'\boldsymbol{\theta} + d_i)}, \qquad (1)$$

where $\mathbf{a}_i = (a_{i1}, \ldots, a_{iD})'$ is a vector of discrimination parameters ($a_{ik}$ is the discrimination along dimension $k$), $d_i$ is a single difficulty parameter, and $c_i$ is the guessing parameter. A similar compensatory model is the multidimensional Normal ogive model (Bock, Gibbons, & Muraki, 1988):

$$P(U_i = 1 \mid \boldsymbol{\theta}) = c_i + (1 - c_i)\,\Phi(\mathbf{a}_i'\boldsymbol{\theta} + d_i), \qquad (2)$$

where $\Phi$ is the cumulative distribution function of the standard Normal (Gaussian) distribution.
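To make the compensatory property concrete, the following sketch evaluates Equations (1) and (2) for a hypothetical two-dimensional item; the parameter values are illustrative and are not drawn from this article.

```python
import numpy as np
from scipy.stats import norm

def m3pl_prob(theta, a, d, c):
    """M3PL probability of a correct response, Equation (1)."""
    z = np.dot(a, theta) + d
    return c + (1.0 - c) / (1.0 + np.exp(-z))

def normal_ogive_prob(theta, a, d, c):
    """Multidimensional Normal ogive probability, Equation (2)."""
    return c + (1.0 - c) * norm.cdf(np.dot(a, theta) + d)

# Hypothetical two-dimensional item: discriminations, difficulty, guessing.
a = np.array([1.2, 0.8])
d, c = -0.5, 0.20

# Compensation: a deficit along dimension 1 is offset by strength along
# dimension 2, so both examinees have a high probability of success.
print(m3pl_prob(np.array([2.0, 0.0]), a, d, c))   # strong on dimension 1
print(m3pl_prob(np.array([-1.0, 4.5]), a, d, c))  # weak on 1, strong on 2
```

Both calls print the same probability (about 0.90) because the linear combination $\mathbf{a}_i'\boldsymbol{\theta} + d_i$ equals 1.9 for both ability vectors; this is the compensatory property in action.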

This article will explore two estimation irregularities that can occur within the M3PL model: (a) unidentifiability of extreme ability parameters, and its ramifications for frequentist statistical inference; and (b) lack of monotonicity between item responses and ability estimates. As will be seen, these properties of the M3PL model can result in ability estimates that may be difficult to justify from the perspective of test fairness. While the M3PL is used as an example throughout, analogous results hold for the Normal ogive model.

The remainder of the article is organized as follows. Section 2 defines unidentifiability, offers motivation for new theoretical results presented in the appendix, and discusses why these results are problematic. Section 3 introduces a new item-selection method in computerized adaptive testing that is designed to avoid undesirable outcomes associated with unidentifiability. Section 4 provides motivation for the lack of monotonicity in the M3PL model and describes related issues concerning test fairness. Section 5 evaluates the prevalence of these estimation irregularities in an operational dataset. Finally, Section 6 offers concluding remarks.

2. Unidentifiability

Before proceeding to a statistical definition of unidentifiability, we first consider a motivating example. Suppose a test is designed to measure mathematical acumen, but it also requires reading to solve the problems; thus, the M3PL is employed as the model that relates observed performance to latent ability. Suppose further that an examinee taking the test is reasonably proficient in mathematics, but cannot read English at all. Despite the student’s proficiency in mathematics, his/her reading ability is effectively -∞; hence, the compensatory nature of the M3PL (Equation 1) results in a prediction that he/she will perform at chance level: no better than random guessing. Now suppose there is another student who is moderately proficient in reading, but who is completely lacking in mathematical ability. As before, the M3PL model predicts a chance-level performance. If, as modeled, both students do in fact generate a string of responses at chance level, the test is incapable of distinguishing the first from the second (or from a third student who is completely lacking in both dimensions). In this case, it is statistically unjustifiable to apply a rule that would simply assign the lowest possible score, along both dimensions, to all chance-level students. If such a rule were applied, the first student’s mathematics ability would be grossly underestimated, as would the second student’s reading ability, with no information for the model to discern between the two.

Turning from the motivating example to a more formal analysis, suppose that the goal of a test is to provide an estimate of examinee ability along each dimension being measured. That is, the result of the test is an estimate, denoted $\hat{\boldsymbol{\theta}}$, of the true ability $\boldsymbol{\theta}$, and it is desired for the estimate to be as close as possible to the true ability. An ideal psychometric examination would be able to discern any ability vector from any other ability vector using statistical inference. Although it is generally impossible to distinguish two candidate values with certainty in a short test, it would be hoped that statistical information would favor one of these values over the other, and that the two could be discerned with a high degree of confidence in a long test. Two such values can never be distinguished from one another, however, if the associated statistical model is unidentifiable. Suppose that $n$ items have been administered to an examinee, yielding a response pattern $\mathbf{U} = (U_1, \ldots, U_n)'$. A statistical model, utilizing such a set of items, is said to be unidentifiable if there are two values, $\boldsymbol{\theta}^* \neq \boldsymbol{\theta}^{**}$, that induce the same probability distribution on $\mathbf{U}$ (Lehmann & Casella, 1998). That is, the model is unidentifiable if, for all possible $\mathbf{u}$,

$$P(\mathbf{U} = \mathbf{u} \mid \boldsymbol{\theta}^*) = P(\mathbf{U} = \mathbf{u} \mid \boldsymbol{\theta}^{**}). \qquad (3)$$

In this case, there is no hope for the data to discern between $\boldsymbol{\theta}^*$ and $\boldsymbol{\theta}^{**}$, because the likelihoods of these two vectors are always identical to each other. Under the usual psychometric assumption of local independence, and assuming dichotomous outcome variables such as in (1), the likelihood of $\boldsymbol{\theta}$ based on $\mathbf{U} = \mathbf{u}$ is defined as

$$L(\boldsymbol{\theta}; \mathbf{u}) = \prod_{i=1}^{n} P_i(\boldsymbol{\theta})^{u_i}\,[1 - P_i(\boldsymbol{\theta})]^{1 - u_i}. \qquad (4)$$

The value of $\boldsymbol{\theta}$ maximizing (4) is said to be its maximum likelihood estimate (MLE); by definition, no other value of $\boldsymbol{\theta}$ is more consistent with the data than is the MLE. The usefulness of an MLE is fundamentally based on the data’s inducement of different likelihoods for different values of $\boldsymbol{\theta}$. If (3) holds, however, we obtain $L(\boldsymbol{\theta}^*; \mathbf{u}) = L(\boldsymbol{\theta}^{**}; \mathbf{u})$ for all $\mathbf{u}$, so neither $\boldsymbol{\theta}^*$ nor $\boldsymbol{\theta}^{**}$ can ever be favored over the other: the two values are statistically indistinguishable. We note that this notion of unidentifiability is distinct from having multiple roots of the likelihood equations, as studied in Samejima (1972) and Yen, Burket, and Sykes (1991). The existence of multiple roots represents a numerical problem in optimizing the likelihood; our situation is one of statistical unidentifiability in the sense of lacking information to judge between candidate parameter values.
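As a minimal sketch under the local independence assumption (the item parameter arrays are hypothetical placeholders, not values from this article), Equation (4) can be computed as follows:

```python
import numpy as np

def m3pl_prob(theta, a, d, c):
    """M3PL probability of a correct response, Equation (1)."""
    z = np.dot(a, theta) + d
    return c + (1.0 - c) / (1.0 + np.exp(-z))

def likelihood(theta, u, A, d, c):
    """Likelihood of theta given a response pattern u, Equation (4).

    u: length-n array of 0/1 responses; A: (n x D) matrix of discriminations;
    d, c: length-n arrays of difficulty and guessing parameters.
    """
    u = np.asarray(u)
    p = np.array([m3pl_prob(theta, A[i], d[i], c[i]) for i in range(len(u))])
    return np.prod(p**u * (1.0 - p)**(1 - u))
```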

Unidentifiability is a concern for the M3PL model due to complications that arise with extreme values of the ability parameter. Consider a two-dimensional M3PL model where all items “load onto” both dimensions, that is, $a_{ik} > 0$ for each dimension $k$ and item $i$. Equation (3) is satisfied because, for any $\boldsymbol{\theta}^*$ and $\boldsymbol{\theta}^{**}$ each having an element approaching infinity, the probability of a perfect response pattern goes to one, and the probability of any other response pattern goes to zero. In the limit, the same probability distribution is obtained no matter which element is infinite, and no matter what the other ability values are.

Although discerning between vectors with an infinite element is generally not of direct import, the M3PL’s unidentifiability carries problematic consequences for inference about the individual abilities being assessed. Suppose the test practitioner seeks to report a subscore along a specific dimension $k$, and therefore a precise estimate of $\theta_k$ is needed. If a given examinee answers all items correctly, however, there exist infinitely many MLEs of $\boldsymbol{\theta}$ for that examinee; moreover, every finite value of $\theta_k$ is represented in a vector that is an MLE (loosely, a vector that has an infinite value along some other dimension, thereby achieving a likelihood of one). Thus, all finite values of $\theta_k$ are considered equally plausible candidates for the examinee’s ability along dimension $k$; none can be favored over the others based on the likelihood. Additionally, a finite lower confidence bound does not exist: the only confidence interval for $\theta_k$ is of the form (-∞,∞). The above statements are true for every dimension $k$ (equivalently, for every ability $\theta_k$), so frequentist inference about the individual traits breaks down. See Theorem 1 in the Appendix for a formal statement and proof of these irregularities.
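The resulting likelihood ridge is easy to verify numerically. In the sketch below, using a randomly generated and purely hypothetical two-dimensional item bank, the likelihood of an all-correct pattern climbs toward one as $\theta_2$ grows, for any fixed value of $\theta_1$, so no finite $\theta_1$ is favored:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
A = rng.uniform(0.5, 1.5, size=(n, 2))  # positive loadings on both dimensions
d = rng.uniform(-1.0, 1.0, size=n)
c = np.full(n, 0.2)

def likelihood_all_correct(theta):
    """Likelihood of an all-correct pattern: the product of P_i(theta)."""
    z = A @ theta + d
    p = c + (1.0 - c) / (1.0 + np.exp(-z))
    return np.prod(p)

# For every fixed theta_1, the likelihood climbs toward 1 as theta_2 grows.
for theta1 in (-3.0, 0.0, 3.0):
    row = [likelihood_all_correct(np.array([theta1, t2])) for t2 in (0, 5, 20)]
    print(theta1, np.round(row, 4))
```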

From the preceding discussion, it is statistically unjustifiable to simply assign the highest possible score along all dimensions to examinees with perfect response patterns—just as in the motivating example, it was unjustifiable to assign uniformly lowest possible scores to chance-level examinees. We want to make clear that this situation is distinct from that of the unidimensional 3PL model, where the MLE is uniquely equal to ∞ (-∞) for perfect (completely incorrect) response patterns. Finite one-sided confidence bounds exist for these extreme response patterns in the 3PL model, and hence frequentist inference can be justifiably made when this model fits the data. See Lehmann (1986) for an overview of single-parameter confidence bounds.

Readers may note that the above problems can be avoided when scoring examinees via Bayesian methods, such as the expected a posteriori (EAP) estimate. The EAP is mathematically defined as follows:

$$\hat{\boldsymbol{\theta}}_{\mathrm{EAP}} = \frac{\int \boldsymbol{\theta}\, L(\boldsymbol{\theta}; \mathbf{u})\, f(\boldsymbol{\theta})\, d\boldsymbol{\theta}}{\int L(\boldsymbol{\theta}; \mathbf{u})\, f(\boldsymbol{\theta})\, d\boldsymbol{\theta}} \qquad (5)$$

(Veldkamp & van der Linden, 2002), where the function $f(\boldsymbol{\theta})$ is a prior distribution on $\boldsymbol{\theta}$. This estimate is finite and unique even with an all-correct or all-incorrect response pattern (Luecht, 1996; see also Segall, 1996, 2000), so irregularities associated with such patterns do not occur. Although the use of the EAP stabilizes the estimate of $\boldsymbol{\theta}$ in this way, it may be undesirable for final inferences about an examinee’s ability to be reliant upon an informative prior. This is because the prior distribution is inherently subjective, and some practitioners seek to report ability estimates, such as the MLE, that are based only on the empirical results of the examinee. As stated by Veldkamp and van der Linden (2002, p. 578) in the context of adaptive testing:

If, for example, in a high-stakes testing program, the effect [of an informative prior] is unwanted, a pragmatic approach would be to use prior information during item selection in the adaptive test, but to report final scores based on an estimator that relies only on the response vector and does not assume prior information.

If indeed the effect of the prior is unwanted as part of the final estimate of $\boldsymbol{\theta}$, or the MLE is preferred to the EAP for any other reason, then the statistical issues of Theorem 1 arise. For an example of an application of the MLE to a compensatory MIRT model, see van der Linden (1999).
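For concreteness, Equation (5) can be approximated by simple grid quadrature. The sketch below handles two dimensions and assumes, purely for illustration, independent standard Normal priors; none of these choices are prescribed by the article.

```python
import numpy as np

def eap_estimate(u, A, d, c, grid=np.linspace(-4.0, 4.0, 81)):
    """EAP estimate of a two-dimensional ability via grid quadrature, Equation (5).

    u: 0/1 responses for the administered items; A: (n x 2) discriminations;
    d, c: length-n difficulty and guessing parameters. Independent standard
    Normal priors are assumed here purely for illustration.
    """
    t1, t2 = np.meshgrid(grid, grid, indexing="ij")
    post = np.exp(-(t1**2 + t2**2) / 2.0)  # prior f(theta), up to a constant
    for i, ui in enumerate(u):
        z = A[i, 0] * t1 + A[i, 1] * t2 + d[i]
        p = c[i] + (1.0 - c[i]) / (1.0 + np.exp(-z))  # Equation (1)
        post = post * (p if ui == 1 else 1.0 - p)     # likelihood factor, (4)
    post /= post.sum()                                # normalize the posterior
    return np.array([(t1 * post).sum(), (t2 * post).sum()])
```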

3. Selection of CAT Items to Minimize the Probability of Irregularity

In the previous section, it was shown that an irregularity is present in the M3PL model when a response pattern consists of entirely correct responses or responses below chance level. Because of the difficulties in performing frequentist inference when such response patterns occur, it may be desirable to select items so that these patterns are avoided. In the current section, a computerized adaptive testing (CAT) item selection method is introduced to address this concern. CAT is of particular use here because it can identify potentially irregular examinees as they are being tested, and then alter item selection accordingly.

Previous item selection methods for the M3PL model include maximizing the determinant of the posterior information matrix (Segall, 1996, 2000) and maximizing the posterior expected Kullback-Leibler information at an appropriate estimate of (Veldkamp & van der Linden, 2002). To avoid patterns with all correct responses or responses below chance level, the following “hybrid” item selection method divides the test into two parts:

a)  Select the first item based on the method of Segall (1996, 2000), Veldkamp and van der Linden (2002), or some other psychometric criterion.

b)  At each subsequent stage of the test, determine whether there exists an MLE with an element of ∞ (or -∞). If so, select the next item to minimize (maximize) the chance of a correct response at the next step. If not, select the next item via the criterion of part a). (A sketch of this logic appears immediately below.)
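A minimal sketch of the hybrid rule follows. The infinite-MLE check uses the characterization from Section 2 (an all-correct pattern yields an MLE element of +∞; a pattern at or below chance level is treated as the -∞ case), and the callables passed in are hypothetical stand-ins, not functions defined in this article:

```python
def select_next_item(responses, remaining, prob_correct, below_chance, criterion):
    """Hybrid CAT item selection, parts a) and b) of Section 3 (a sketch).

    Hypothetical caller-supplied callables:
      prob_correct(item)      -- predicted probability of a correct response
                                 to `item`, e.g. Equation (6) or P_i at the EAP;
      below_chance(responses) -- True if the pattern is at or below chance
                                 level (the -infinity MLE case);
      criterion(items)        -- the part a) rule, e.g. Segall (1996, 2000) or
                                 Veldkamp and van der Linden (2002).
    """
    if responses and all(r == 1 for r in responses):
        # Part b), MLE element of +infinity: minimize P(correct) next.
        return min(remaining, key=prob_correct)
    if responses and below_chance(responses):
        # Part b), MLE element of -infinity: maximize P(correct) next.
        return max(remaining, key=prob_correct)
    # Part a): the first item, and all regular cases, use the standard criterion.
    return criterion(remaining)
```

Keeping the part a) criterion as a pluggable callable makes it straightforward to swap between the determinant-based and Kullback-Leibler rules without touching the irregularity check.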

The minimization/maximization component of part b) can be approached in a number of ways. One method would be to minimize or maximize the posterior probability of a correct response to the next item, assuming a Bayesian prior. Let $f(\boldsymbol{\theta} \mid u_1, \ldots, u_k)$ denote the posterior distribution of $\boldsymbol{\theta}$ after $k$ items have been administered. The marginal posterior probability of a correct response to item $i$ is then given by

$$P(U_i = 1 \mid u_1, \ldots, u_k) = \int P_i(\boldsymbol{\theta})\, f(\boldsymbol{\theta} \mid u_1, \ldots, u_k)\, d\boldsymbol{\theta}. \qquad (6)$$

This quantity could be minimized or maximized, as appropriate, to select item $k+1$. A simpler method would be to evaluate the probability of a correct response at the EAP estimate of $\boldsymbol{\theta}$. In other words, choose the next item to maximize or minimize $P_i(\hat{\boldsymbol{\theta}}_{\mathrm{EAP}})$.
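As a sketch, Equation (6) can be approximated with the same grid quadrature used for the EAP above; the two-dimensional setting and standard Normal prior are, again, assumptions made purely for illustration.

```python
import numpy as np

def posterior_grid(u, A, d, c, grid=np.linspace(-4.0, 4.0, 81)):
    """Normalized posterior of a two-dimensional theta on a grid, assuming
    (for illustration only) independent standard Normal priors."""
    t1, t2 = np.meshgrid(grid, grid, indexing="ij")
    post = np.exp(-(t1**2 + t2**2) / 2.0)  # prior density, up to a constant
    for i, ui in enumerate(u):
        z = A[i, 0] * t1 + A[i, 1] * t2 + d[i]
        p = c[i] + (1.0 - c[i]) / (1.0 + np.exp(-z))  # Equation (1)
        post = post * (p if ui == 1 else 1.0 - p)     # likelihood factor, (4)
    return t1, t2, post / post.sum()

def marginal_prob_correct(a, d_i, c_i, t1, t2, post):
    """Marginal posterior probability of a correct response, Equation (6)."""
    z = a[0] * t1 + a[1] * t2 + d_i
    p = c_i + (1.0 - c_i) / (1.0 + np.exp(-z))
    return float((p * post).sum())
```

Administering the candidate item that minimizes (or maximizes) `marginal_prob_correct`, as appropriate, implements the part b) rule; evaluating $P_i$ at the EAP instead amounts to replacing the integral in (6) with a point evaluation.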