Psychometrics and Test Theory

A genuinely interdisciplinary field:

{mathematics & statistics} ∩ {psychology & behavioral sciences}

„Test“ = a formalized, abstract modeling variable substituted for an original psychological variable, typically used for a diagnostic purpose

General aim of „test theory“: improving the quality of diagnostic methods

(a loose analogy is the theory of measurement error in technical measurement)

Practical uses:

- assessment of scientific “provability” of data

- new test construction

- old test verification

- prediction

- person selection

- correction due to selection

etc.

Object of study: formalized properties of „tests“

Method: mathematics, statistics, probability,

partly also: operations research, optimization (linear programming), ...

Specific purpose: to formulate mathematical relationships between test properties

so that some of them can be manipulated to optimize the desired target

property/ies, usually the diagnostically important ones, e.g. validity etc.

Test properties are conceptually formalized and expressed in terms of statistical indexes:

illustration: the „difficulty“ of a test as the percentage of persons fulfilling/not fulfilling a „norm“
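A minimal sketch of this index in code (the scores and the norm of 14 are invented for the illustration):

```python
# "Difficulty" of a test as the proportion of persons fulfilling a norm
# (here: score >= norm; all numbers are illustrative).

def difficulty(scores, norm):
    passed = sum(1 for s in scores if s >= norm)
    return passed / len(scores)

scores = [12, 15, 9, 20, 18, 7, 14, 16, 11, 19]
print(f"difficulty index: {difficulty(scores, norm=14):.2f}")  # 0.60
```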

Test properties:

validities - there are about 20 types of validity, all concerning

„how the test measures the attribute it is supposed to measure“

(a nice operationalistic tautology!)

classical test theory: validity defined as the absolute magnitude of a correlation

reliability - an analogy of measurement error and accuracy in technical measurement

(a mistaken conflation with similarity of results under repetition)

objectivity

difficulty

length

test time

test speed

dimensionality (vs. non-statistical „content homogeneity“)

consistency

generalizability

equivalency

specificity

and several other

Illustration of some nonformalized “farmer’s logic” relationships between the properties:

- difficulty ↔ validity

- length ↔ reliability

- reliability ↔ validity

Some types of validity: ... type of criterion = D.V. , test = I.V.

- simple \ composed (in classics: simple \ multiple correlation)

in the composed case too: incremental, “pure”

- internal \ external

- manifest \ latent ... empirical \ “construct” (usually “factor”)

- convergent \ discriminant (= divergent)

- concurrent \ predictive (test=predictor, criterion=predictand)

- absolute \ differential

some nonformalized: - “content” validity by expert’s opinion

- “face“ validity and motivation

Some types of test equivalency - „homogeneous“ tests measuring the same construct can be

- unidimensional

- congeneric

- tau-equivalent

- parallel

Scientific concept formation \ weak associative measurement of “constructs”

(mathematical dimensionality of latent variables vs. psychological homogeneity)

Three traditional models

(in the modern approach they are subsumed in one generalized model)

1. Classical model of test score

2. Factor analysis

3. IRT models for binary tests (Item Response Theory; the obsolete term is “Item Analysis”)

Practical use:

- assessment of scientific “provability” of data - critical difference of scores

- new test construction

- old test verification

- prediction

- person selection

- correcting predictive validity with respect to selection and selection ratio

Examples of formulas derived on the theoretical-axiomatic basis:

maximum validity of test x to an errorless criterion y : max rxy = √Relx , the square root of reliability

ditto to a criterion with reliability Rely - the correction for attenuation: rxy(corrected) = rxy / √(Relx · Rely)

suppose rxy = .48 and reliabilities .64, .81 ; then .48 / √(.64 · .81) = .48 / .72 ≈ .67

Spearman-Brown, influence of length L : RelLx = L · Relx / [1 + (L − 1) Relx]

standard error of test score: se = sx √(1 − Relx)

critical difference of two scores: D95 = 2.8 se

correction of predictive validity : Fajfer’s example
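These formulas can be checked numerically; a small Python sketch reproducing the worked example above (rxy = .48, Relx = .64, Rely = .81; the observed SD of 10 used for the standard error is an invented illustration):

```python
from math import sqrt

rel_x, rel_y, r_xy = 0.64, 0.81, 0.48

# Maximum validity against an errorless criterion: square root of reliability.
print(f"max validity: {sqrt(rel_x):.2f}")                    # 0.80

# Correction for attenuation with an unreliable criterion.
print(f"corrected r:  {r_xy / sqrt(rel_x * rel_y):.2f}")     # 0.67

# Spearman-Brown: reliability of the test lengthened L times.
L = 2
print(f"Rel(L=2): {L * rel_x / (1 + (L - 1) * rel_x):.2f}")  # 0.78

# Standard error of test score and the 95% critical difference.
s_x = 10.0                        # illustrative observed SD
se = s_x * sqrt(1 - rel_x)        # 6.0
print(f"se = {se:.1f}, D95 = {2.8 * se:.1f}")                # 6.0, 16.8
```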

Further:

- multivariate simultaneous optimization of the number of subtests, their lengths, reliabilities

and mutual weights

- constituting test batteries of items from item banks:

- test difficulty equating

- linear programming to obtain desired reliability

- dimensionality of test batteries - simplest approach:

confirmatory Factor Analysis - as a part of the methods for test construction

- constitutive and reductive processes of test construction

- incremental validity and suppressors

An illustration of formal building of classical test theory

Testing as a formalized procedure

A general problem to be formalized:

a “test” is a “diagnostic device” which should allow us to “approximate” the level of a psychological attribute that is not accessible to our direct observation; the “approximation” should fulfill certain conditions to ensure its minimum diagnostic quality

Formalized formulation (i) - deterministic:

pi ... an object possessing attribute A

P = { pi } ... a nonempty set, possibly infinite ( i = 1, 2, ...)

A ... an assumed attribute with levels conceptually representable on

a scale of interval type

τA(pi) ... a real point function representing the level of A

(in the sense of the representational theory of measurement)

fA(pi) ... a function approximating the unknown function τA(pi)

{A1, A2, ...} ... a set of axioms and conditions setting the way f approximates τ ,

say, among other things, dealing also with the discrepancy between f and τ ,

e.g. the difference f − τ , or so, which indicates that the formalized definition

must include the pair {f , τ}

In the ideal case the axioms should be consistent (non-contradictory), independent and complete.

The problem is thus formally set.

Interpretation: P ... a population of persons

A ... psychological attribute to be “tested”

τ ... the ideal diagnostic procedure’s score representing the level of A

f ... the practically accessible “test score” approximating the ideal τ

Re-formulation (ii) - stochastic:

τ or τA(pi) ... random variable substituting the deterministic function τA(pi)

x or xA(pi) ... random variable substituting fA(pi)

Statistical assumptions:

(The sample space, resp. variation range, of τ and x corresponds to the domain, resp. range, of τA and fA.)

S1 : the 1st and 2nd moments of τ and x are finite

S2 : their 2nd moments are nonzero

Definition - Tests

Assume P is a set of elements pi , (i = 1, 2, ...), and for each of the n elements j, k, ... from a non-empty index set I1n assume an ordered pair { xj , τj }, { xk , τk } ... of random variables, each variable in any pair with finite first moments and finite and nonzero second moments over P, where the first variable is directly observable*) but the second one is not. Then the set of the pairs,

T = [{ x1 , τ1 }, ..., { xj , τj }, { xk , τk }, ... { xn , τn }] is a family of tests,

and each of the pairs is a test, if the following system of three axioms A1, A2, A3 holds.

Axioms, for simplicity formulated using the auxiliary definition xj − τj ≡ εj :

( Remark: the definition is related to the “quality of approximation”. )

A1 Expectation E(εj) = 0 for all j

A2 Correlation R(εj , τk) = 0 for all pairs j, k

A3 Correlation R(εj , εk) = 0 for all mutually different j, k

The “auxiliary definition” xj − τj ≡ εj then turns into

xj = τj + εj , the model of classical test theory,

with the interpretations:

xj is the observed score of a test

τj is the true score

εj is the “error” of the observed score

------

*) Observability vs. unobservability from a statistical point of view is another problem; it depends on randomness in the measurement/observation conditions, and the true score can then be defined as the expectation of the observed score of a test of infinite length. It is also related to the problem of decomposing the “error” or “unique” part of the score into a “specific” and an “unreliable” component. For the moment let us rely on a merely intuitive meaning of “observability”.

------

The true score can be of two kinds:

- specific true score - specific to the single given test, e.g. the “true” number of words in a verbal test, i.e. uninfluenced by unreliability conditions such as environmental factors disturbing optimum concentration, the motivation of the tested person, climatic conditions etc., so the attribute A is concrete and empirically or operationally defined by the test-specific task, i.e. uniquely and completely measured by the test

- generic true score - common to a content-homogeneous battery of several tests measuring the same abstract psychological concept - a theoretically assumed “construct” A (say “verbal ability”), the tests thus being of the same “generic” character

For the generic true score the model then can be further decomposed:

xj = j + j = j + j + j

where, in the generic case, the term δj is to be recognized not as just an error due to unreliability of measurement of the abstract concept τj but rather as follows:

a) from the point of view of modeling, δj is the discrepancy of the model

b) from the point of view of testing, δj is the unique part of the test score, i.e. the part unique to the separate test, while the rest is the part common with the other tests that jointly measure their common generic attribute A

In either of the two viewpoints the decomposition

δj = sj + εj

is into sj ... the specific part, which is not due to unreliability, and

εj ... the error due to unreliability

An illustration of the derivation of several consequences of the formal formulation:

Decomposition of the observed variance (specific true score case):

S2(xj) = S2(j +j) = S2(j ) + S2(j) + 2 S2(j , j) by the known theorem,

but due to A2 the covariance as the last member is 0, and therefore

S2(xj) = S2(j ) + S2(j).

True variance is equal to observed-true score covariance:

- by the general definition this covariance is

S²(xj , τj) = E(xj τj) − E(xj) E(τj)

- substitution yields S²(xj , τj) = E[(τj + εj) τj] − E(τj + εj) E(τj)

= E(τj²) − [E(τj)]² + E(εj τj) − E(εj) E(τj)

= S²(τj) + S²(εj , τj),

and by A2 : S²(xj , τj) = S²(τj) . (*)

Coefficient of determination:

R 2 (xj ,j) = S2(j ) / S2(xj) ,

as folows from correlation coefficient after substituting the last finding S2(xj ,j) = S2(j ) into its definition: R (xj ,j) = S2(xj ,j) / S2(xj) S2(j ) = S 4(j ) / S2(xj) S2(j ) .

This coefficient of determination has two interpretations:

1. just as the ratio of

true to observed variance, or the complement of the error-to-observed variance ratio,

S²(τj) / S²(xj) = 1 − S²(εj) / S²(xj) ≡ Rel ,

it is called the reliability of the test.

2. Its square root, the correlation of observed and true score,

R(xj , τj) ,

is one of the cases of validity.
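Both interpretations, and the variance decomposition above, can be verified with a quick Monte Carlo sketch; the normal distributions and the SDs 15 and 9 are arbitrary illustrative assumptions, giving a theoretical reliability of 15² / (15² + 9²) ≈ .735:

```python
import random

random.seed(1)
N = 200_000
tau = [random.gauss(100, 15) for _ in range(N)]   # true scores
x = [t + random.gauss(0, 9) for t in tau]         # x = tau + epsilon

def var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# S2(x) ~ S2(tau) + S2(epsilon), since cov(tau, epsilon) ~ 0
print(f"{var(x):.0f} ~ {var(tau) + 81:.0f}")      # 81 = 9**2 = S2(epsilon)

# the reliability ratio equals the squared observed-true correlation
rel = var(tau) / var(x)
r2 = cov(x, tau) ** 2 / (var(x) * var(tau))
print(f"Rel = {rel:.3f}, R2(x, tau) = {r2:.3f}")  # both ~ 0.735
```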

Since, according to the definition of a test, the variances S²(τj) > 0 , S²(xj) > 0 , it follows

from (*) that the covariance is also positive, and therefore the correlation

R(xj , τj) > 0 ,

so by the definition the pair (xj , τj) is a test only if it has nonzero reliability and only if the observed score has nonzero validity toward the true score.

The square root of the error variance, i.e. the standard deviation of the errors,

S(εj) , is called the standard error of the test score

Axiom of parallelism - this is the strongest type of test equivalency:

parallel tests, or “classically equivalent tests” -

{ xj , τ }, { xk , τ }, ...

are defined as tests of the same τ (their common true score) that have the same standard errors

S(εj) = S(εk) = ...

Consequences of the classic-model parallelism:

- parallel tests have the same means of observed scores

(follows from applying E to the model xj = τ + εj , axiom A1, and the equality (*) )

- parallel tests have the same observed variances

(follows from the definition of parallelism and from the variance decomposition)

- the true variance of parallel tests equals their observed covariance:

S²(xj , xk) = S²(τ) (**)

follows from substituting the model into the definition of covariance and axioms A2, A3, A1:

S²(xj , xk) = E(xj xk) − E(xj) E(xk)

= E[(τ + εj)(τ + εk)] − E(τ + εj) E(τ + εk)

= E[τ² + τεj + τεk + εj εk] − { [E(τ) + E(εj)] [E(τ) + E(εk)] }

= E(τ²) + 0 + 0 + 0 − [E(τ)]² − 0·E(τ) − 0·E(τ) − 0·0

= E(τ²) − [E(τ)]² ≡ S²(τ)

This important theorem makes it possible to estimate reliability if two parallel tests or

parallel forms of a test are available.
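A sketch of this parallel-forms estimate, continuing the simulated example above (the theorem needs only the shared τ and equal error SDs; the distributions remain arbitrary illustrative choices):

```python
import random

random.seed(2)
N = 200_000
tau = [random.gauss(100, 15) for _ in range(N)]  # common true score
xj = [t + random.gauss(0, 9) for t in tau]       # parallel form j
xk = [t + random.gauss(0, 9) for t in tau]       # parallel form k, same S(eps)

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

# by (**) the correlation of the two forms estimates Rel = 225/306
print(f"r(xj, xk) = {corr(xj, xk):.3f}")         # ~ 0.735
```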

BUT: the frequent use of two repeated measurements assumed to be parallel forms

leads to the tragically mistaken pseudo-explanation of reliability as an

“ability of the test to yield similar or stable results on repetitions”

- parallel tests have the same reliability Relj = Relk = Rel .

(follows from the definitions of parallelism and of the reliability ratio)

- the mutual observed validity of parallel tests is equal to their reliability :

R(xj , xk) = S²(τ) / S²(x) = Rel

(substitute formula (**) into the numerator of the reliability ratio; further, due to the equal variances,

its denominator can be substituted by the product S(xj) S(xk) , which transforms the ratio into the

correlation coefficient as by definition)

- the correlation coefficients of the mutual observed validities between all pairs of parallel tests are

equal - and the correlation matrix of a battery of parallel tests contains this constant in its off-diagonal cells

(the correlations are equal since each of them equals the common reliability)

- the correlation matrix of the parallel tests is therefore unidimensional, or, speaking in factor-analytic terms, it is “explained” perfectly by one common and general factor, with the factor validity coefficients (loadings) mutually equal (and with equal uniquenesses/communalities)

- it is also possible to show that parallel tests have the same validity for predicting any external criterion variable (as the dependent variable)

So, in this sense the parallel tests - as the strongest case among the types of equivalent tests - are by intuition really “fully equivalent” and mutually completely interchangeable.
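To make the one-factor structure concrete, a final sketch (Rel = .735 continues the illustrative example; the equal loadings √Rel and equal uniquenesses 1 − Rel are the consequences stated above):

```python
from math import sqrt

rel, n = 0.735, 4
# Correlation matrix of a battery of n parallel tests: 1 on the diagonal,
# the constant Rel everywhere off-diagonal.
R = [[1.0 if i == j else rel for j in range(n)] for i in range(n)]

loading = sqrt(rel)        # equal factor validity coefficients
uniqueness = 1.0 - rel     # equal uniquenesses
# A single general factor reproduces R exactly:
# off-diagonal: loading * loading = Rel; diagonal: loading**2 + uniqueness = 1
R_hat = [[loading * loading + (uniqueness if i == j else 0.0)
          for j in range(n)] for i in range(n)]
print(all(abs(R[i][j] - R_hat[i][j]) < 1e-12
          for i in range(n) for j in range(n)))   # True
```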