combining multiple indicators of teacher performance 1

Combining Multiple Measures of Teacher Practice and Performance:

Technical and Conceptual Considerations for Teacher Evaluation

Evaluación Docente con Medidas Múltiples de Práctica y Desempeño:

Consideraciones Técnicas y Conceptuales

José Felipe Martínez1

1 University of California, Los Angeles, 2019B Moore Hall, Box 951521, Los Angeles, CA 90095-1521.

Abstract

Reform efforts in education systems in the United States and other countries place increasing emphasis on the promise of teacher evaluation systems for helping complement traditional school-level accountability systems in improving classroom instruction and student learning.

This paper examines the key conceptual and methodological issues faced in measuring a construct as complex and multidimensional as teacher quality or effectiveness in the context of high stakes accountability policies. The common methods of assessing teacher practice and performance are reviewed, along with the appropriate methods for using these different indicators in combination for formative and summative evaluation. The focus here is on the validity of the inferences about teacher effectiveness that can be derived from these sources of information, as well as the implications and potential consequences for education policy and practice of implementing these systems in the field.

Keywords: Teacher Evaluation, Validity, Multiple Measures, Classroom Observation, Teacher Portfolios

Teacher Evaluation and Validity: Conceptual and Methodological Considerations in Multiple Indicator Systems

We fully understand that standardized tests don't capture all of the subtle qualities of successful teaching. That's why we call for multiple measures in evaluating teachers. In an ideal world, that data should also drive instruction and useful professional development.

Arne Duncan

U.S. Secretary of Education

Introduction

Teacher evaluation is an area of growing interest and relevance for local, state, and national educational systems around the world, reflecting mounting pressures to assess the performance of teachers, and hold them more concretely accountable for the levels of academic achievement attained by their students. A variety of methods have been proposed and are being tested to assess key constructs and aspects of teacher performance. In addition to technical issues specific to each of these methods, the critical questions at the center of policy debates around teacher evaluation increasingly refer to the appropriate ways of combining a variety of measures available for purposes of teacher performance appraisal.

Unfortunately, growing attention from policymakers and the public is not necessarily being accompanied by a corresponding increase in awareness and understanding of the many complex technical issues involved in evaluating the work of teachers. Little specific guidance is available to researchers and policymakers to aid in addressing the variety of conceptual, technical, and policy issues that emerge in trying to answer this question. Much systematic work and discussion is therefore needed to address questions about the indicators to be collected, and the appropriate sources and uses of that information; the extent of uniqueness and overlap in the information provided by these indicators and sources; and the appropriate ways to use these indicators in combination to improve teacher performance.

In this paper I examine some of the critical conceptual and methodological issues faced by researchers and policymakers who try to measure teacher quality or effectiveness in the context of recent trends in teacher evaluation internationally. I briefly discuss the most common methods used to assess teacher practice and performance, and focus mostly on the validity of the inferences about teacher effectiveness that can be derived from these sources of information, either alone or in combination, as well as the implications and potential consequences for education policy and practice.

Policy Context

Reform efforts in education systems in the United States and other countries increasingly emphasize the notion of performance review and accountability for individual teachers, as opposed or in addition to the traditional focus on school-level accountability. Like prior waves of accountability-based reform, this one has its roots in perceptions about the (inadequate) performance of students in national or international assessments (e.g., NAEP, PISA, TIMSS). Feuer (2012) describes an interesting kind of reverse-Lake Wobegon effect taking hold worldwide whereby states and countries consistently find reasons in the data to be pessimistic about the performance of their students, relative to peer or competitor systems.

Unlike reform targeted at the school level, however, these initiatives are also heavily informed by assumptions about and evidence of the importance of quality teaching and, more generally, effective teachers (or lack thereof) in explaining and potentially improving the outcomes observed. Additionally, they reflect a critique of a majority of existing teacher evaluation systems which, as many reports and authors have noted, have become perfunctory rituals with no consequences or enforcement, and little formative or informative value for the teacher or the district (Loeb, Raudenbush, et al., 2010). Finally, they reflect a series of optimistic assumptions about the ability of the proposed revamped system and procedures to reliably identify effective and ineffective teachers. These rationales can be seen clearly underlying current policy efforts targeting teacher evaluation in large and medium size districts across the U.S. (e.g., New York, Los Angeles, Chicago, Tennessee). The same discourse and assumptions are increasingly common to reform efforts targeting teacher evaluation internationally, including for example national systems in Singapore, Chile, and Mexico, and nascent efforts in the United Kingdom and Australia.

Teacher Evaluation: Why, What, How

Why Evaluate. Educational theory, mounting empirical evidence, and common sense suggest that effective teachers can exert an important influence on the levels of achievement of their students (Baker et al., 2010; Rowe, 2003). Teacher evaluation systems can have different origins, motivations, and goals depending on local policy context, but typically seek information on aspects of teacher practice believed to be aligned with improved system outcomes, crucially student achievement; this information is used for both formative and summative purposes. These include, among others, identifying struggling teachers for intervention to help them improve their teaching and classroom practices; identifying teachers who are persistent underperformers for remedial action, sanction, or dismissal; providing incentives to the best teachers; informing school practice and district policies on teacher preparation and professional development; and identifying effective teacher practices in order to develop models of effective instruction to scale up for implementation in classrooms across the system.

Typically, teacher evaluation systems are conceived to support a combination of these intended uses and goals, in turn using the information collected formatively to guide teacher education and professional development, and summatively for decisions related to career advancement, remuneration, and retention. Unfortunately, policy and political priorities can often be at odds with careful design and thorough consideration of long-term goals, uses, and consequences. The opening quote from Secretary Duncan, for example, reflects a decision to move forward with high stakes teacher evaluation, irrespective of whether we live in an ideal world where data from the evaluation is used to drive instruction and professional development. This effectively turns what should be the core formative component of any such system into an optional luxury that may or may not be available in the real world, at least initially.

What to Evaluate.

While researchers, educators, and policymakers increasingly agree about the importance of developing sensible approaches to evaluation to support teacher development and accountability, reaching a consensus around the specific aspects of teachers’ professional practice that should be included in the evaluation is more challenging. Teaching and teacher practice are inherently complex, multidimensional constructs. Teaching involves a variety of processes and interactions that take place in the classroom and outside; some substantive in nature, others related to practical aspects of classroom work (daily routines, classroom management), yet others pertaining to psychological aspects of teacher-student interactions (e.g., motivation, respect, feedback). Teacher practice more broadly defined further includes a multitude of aspects of the work of a teacher outside the classroom, including among others communication with parents, administrators, and other teachers at the school, school citizenship, and contributions to the broader community. Thus, although the notion of assessing teacher effectiveness has simple intuitive appeal, in practice it involves selecting, defining, collecting information about, and making inferences involving dozens of complex component constructs (Peterson, 1987). Terms like Teaching Quality, Teacher Practice, Teacher Effectiveness, or Teacher Performance are often treated as interchangeable, but they are more appropriately seen as closely related, with important areas of overlap, but also uniqueness.

Evaluating teacher quality or effectiveness entails first developing an agreeable definition for each of these components. The notion of teacher competences (Reynolds, 1999) provides a particularly useful global heuristic for teacher quality that identifies four main components of competence: teacher knowledge (e.g., subject and pedagogical), skill (e.g., applied knowledge), disposition (e.g., attitudes, perceptions, beliefs), and practice (e.g., instruction, assessment, management). Notably, this definition excludes indicators like teacher education, credentials, experience (seniority), or contributions to student achievement or other non-cognitive outcomes. While indicators like these can be important and informative, and are often considered for formal teacher evaluation, they are seen here as correlates more than components of teacher competence.

The discussion above suggests first that the most appropriate definition of teacher quality or competence depends on the intended uses and context. It also suggests that, other things being equal, the richer the definition of the construct the better. A simple answer to the question of what constructs to evaluate when considering how best to appraise teacher performance could thus be “all of the above” or at least “as many of the above as is affordable”. To the extent some of them are excluded, the definition of teacher quality or competence is narrower, and so by extension is the evaluation.

How to Evaluate.

Having decided that teacher evaluation ought to encompass as many of the relevant constructs as is feasible, attention turns to the methodological questions that arise in measuring each of them in practice. A variety of methods can be used to capture each of the constructs that compose teacher quality or competence: Teacher knowledge and skills, either subject matter or pedagogical, are often assessed by means of standardized tests (e.g., the Praxis I, or MKT tests) or through alternative open-ended performance assessments (e.g., the Praxis II assessment), or vignettes of classroom practice (Stecher, Le, et al., 2006). Information about dispositions and beliefs is typically collected directly from teachers by means of surveys or interviews (Mayer, 1999). For constructs related to school citizenship and contributions to the broader community, surveys can also be used to collect information from teachers and other sources of information (e.g., parents, administrators).

Classroom Observation. Finally, a variety of methods can be used to develop indicators of classroom practice. Traditionally, direct observation has been the default method for studying and monitoring the work of teachers in the classroom. Observation (either live or through videotape) has considerable face validity as a method for seeing teaching as it happens in classrooms, and providing direct evidence for identifying areas in need of improvement to inform professional development (Pianta and Hamre, 2009). It remains a staple of teacher evaluation systems old and new as the central explanatory and formative complement to the summative evidence offered by measures based on summaries of student achievement. On the other hand, observation in classrooms presents significant challenges, starting with identifying and defining the many distinct but closely interconnected notions that collectively compose the broader construct of classroom practice. Table 1 shows the constructs targeted for observation in Singapore side by side with a typical list of constructs from an observation system based on the Danielson framework in the United States. While the table shows some areas of overlap across countries, it also highlights the subjectivity and culture-specificity inherent to all definitions of good teaching. Thus, classroom observation in Singapore, a high performing country in international assessments whose educational system has been the subject of much study and praise in recent years, emphasizes aspects of classroom life (e.g., nurturing the whole child, winning hearts and minds) that get little attention in American systems, which typically focus on more technical aspects of classroom practice (e.g., instruction, lesson planning).

------

Insert Table 1

about here

------

Moreover, as a method for large-scale standardized data collection, classroom observation faces challenges in understanding, quantifying, and monitoring the extent of human rating error in the resulting measures. It requires specialized measurement techniques, and a large investment of resources for training, deploying, and monitoring the work of groups of professionals who serve as standardized observers across a state or district. Even with available resources, however, developing and maintaining reliable measures of classroom practice on a large scale remains a significant challenge. The disappointing results of a recent widely publicized study that tested some of the best known observation rubrics on a large scale are a sobering reminder of the extent of the challenge if these measures are meant to support inferences and decisions involving individual teachers (Kane et al., 2011).
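One common first step in quantifying the rating error described above is to have two observers score the same lessons and compute a chance-corrected agreement index. The sketch below computes Cohen's kappa in Python. It is an illustration only: the function name, the 1-4 rubric scale, and the scores are hypothetical and are not drawn from any of the studies cited here, and operational systems typically rely on richer designs (e.g., multiple raters, occasions, and lessons).

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters scoring the same lessons."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of lessons")
    n = len(ratings_a)
    # Proportion of lessons on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal score distribution.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical rubric scores (1-4) from two observers on the same ten lessons.
observer_1 = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
observer_2 = [3, 2, 3, 3, 1, 2, 2, 4, 2, 3]
kappa = cohens_kappa(observer_1, observer_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the observers agree on 8 of 10 lessons, but part of that agreement would be expected by chance alone, so kappa is lower than the raw agreement rate.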

Surveys. A variety of alternative methods have been used to collect information about classroom practice, which may offer substantive, logistic, or cost advantages compared to direct observation in classrooms. In particular, teacher surveys offer a cost efficient alternative that makes it possible to collect information about large numbers of practices and constructs with little added cost or burden to teachers. As with other surveys, researchers can construct teacher surveys so that the resulting aggregate indicators reflect key constructs of classroom practice of interest, and show adequate levels of reliability by common psychometric standards. On the other hand, teacher surveys have significant limitations in their ability to produce consistent and valid information about classroom processes. Surveys are prone to errors from loss of or inaccurate memory, and from inconsistency in teacher interpretations of the content and focus of items. Moreover, they are subject to strong social desirability effects, particularly in a situation of relatively high stakes and subjective areas of practice. Thus, three teachers who report that they always emphasize higher order cognitive skills in their instruction could be over or underestimating the true frequencies; alternatively they could each be reporting truthfully and accurately but mean different things by always, by emphasizing, or by higher order. Finally, teachers could knowingly under or over report the true frequency based on their perception of the desirability of the practice in question (Mayer, 1999). As a result, teacher surveys have been notoriously inconsistent as predictors of student outcomes (Bill and Melinda Gates Foundation, 2010).
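The reliability of a survey scale "by common psychometric standards" is often summarized with an internal consistency coefficient such as Cronbach's alpha. The following Python sketch shows the computation for a small, entirely hypothetical set of responses (five teachers, three items on a 1-5 scale); the function name and data are assumptions made for illustration. As the discussion above implies, a high alpha speaks only to the consistency of responses across items, not to their accuracy or freedom from social desirability.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    if any(len(r) != n_items for r in responses):
        raise ValueError("every respondent must answer every item")
    # Sample variance of each item across respondents.
    item_variances = [variance([r[i] for r in responses]) for i in range(n_items)]
    # Sample variance of the total (summed) scale score.
    total_variance = variance([sum(r) for r in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: five teachers, three items about one practice.
survey = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 3],
]
alpha = cronbach_alpha(survey)
print(f"alpha = {alpha:.3f}")
```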

Student surveys have gained in popularity recently as an alternative that can address some of the limitations faced with teacher surveys. In particular, student surveys can be used to create classroom aggregates that are as reliable as those obtained from teacher surveys, and more strongly predictive of student achievement (Kane et al., 2011; Martinez, 2012). Because they are not as susceptible to social desirability, student surveys are also often seen as more valid for teacher evaluation purposes (Ferguson, 2010). Finally, student surveys provide additional useful information typically not available from teachers; specifically, within-classroom variance in student reports can be used to monitor differentiated or individualized instruction (Martinez, 2012; Muthen, 1995). At the same time, collecting information from students has potential drawbacks in accuracy and consistency of the reports, particularly with young children, whereas with older children there may be concerns about potential bias and engagement. Student surveys also face methodological challenges in designing and constructing indicators of the right constructs at the right level (Schweig, 2012). Notably, a question that asks how often “My teacher asks me to read books in the classroom” may behave differently psychometrically than one that asks how often “Our teacher asks us to read books”. Finally, the use of student surveys for high stakes teacher evaluation has not been thoroughly tested and could create issues of validity and bias.
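The two ideas in the paragraph above, the reliability of classroom aggregates and the within-classroom variance in student reports, can be illustrated with a one-way analysis of variance. The Python sketch below assumes a balanced design (the same number of students per classroom) and uses made-up ratings; the function name and data are hypothetical, and the cited studies may well use different, more general multilevel models. ICC(1) estimates the share of variance in individual reports that lies between classrooms, and ICC(2) estimates the reliability of the classroom mean.

```python
from statistics import mean, variance


def classroom_icc(ratings_by_class):
    """Return (ICC1, ICC2) from a balanced one-way ANOVA of student ratings."""
    sizes = {len(r) for r in ratings_by_class.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes equal numbers of students per class")
    k = sizes.pop()             # students per classroom
    j = len(ratings_by_class)   # number of classrooms
    if j < 2 or k < 2:
        raise ValueError("need at least two classrooms and two students each")
    class_means = [mean(r) for r in ratings_by_class.values()]
    grand_mean = mean(class_means)  # equals the overall mean in a balanced design
    ms_between = k * sum((m - grand_mean) ** 2 for m in class_means) / (j - 1)
    # With equal class sizes the pooled within-class mean square is the
    # average of the within-class sample variances.
    ms_within = mean([variance(r) for r in ratings_by_class.values()])
    if ms_between == 0:
        raise ValueError("classroom means are identical; ICCs are undefined")
    icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc2 = (ms_between - ms_within) / ms_between
    return icc1, icc2


# Hypothetical student ratings of one practice (1-5) in three classrooms.
ratings = {
    "class_a": [4, 4, 3, 5],
    "class_b": [2, 3, 2, 1],
    "class_c": [1, 5, 3, 3],
}
icc1, icc2 = classroom_icc(ratings)
within_variance = {c: variance(r) for c, r in ratings.items()}
print(f"ICC(1) = {icc1:.2f}, ICC(2) = {icc2:.2f}")
print(within_variance)
```

In this invented example, class_c shows a larger within-classroom variance than the other two classes. Whether such spread reflects differentiated instruction or simply noisier reporting cannot be determined from the variance alone.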

Portfolios. Teacher portfolios are another method gaining prominence as an alternative for collecting information about classroom practice. Teachers use portfolios to compile, annotate, and reflect on artifacts of classroom practice like lesson plans, assignments, and quizzes, over a period of time. Research suggests that teacher portfolios can be used to collect rich information about classroom practice with reliability comparable to classroom observations (Martinez, Borko, and Stecher, 2011). Moreover, because they require sustained effort and mental engagement from teachers, portfolios are seen as promising mechanisms for professional development for monitoring and improving classroom practice (Shulman, 1997). This is exemplified recently by the rapidly expanding use of EdTPA, a portfolio-based assessment system for pre-service teachers developed at Stanford and endorsed by 25 states in the United States (Teacher Performance Assessment Consortium, 2012). On the downside, portfolios require considerable resources for development, collection, and review, and are very burdensome for teachers if not embedded in the formal professional development cycle. Portfolios are also limited in capturing interactive or verbal aspects of instruction like on the fly questioning.

Value Added Models. While not part of the competence framework outlined above, where it is at most a key correlate or byproduct, student achievement is central to the notion of effectiveness at the center of recent policy reforms involving teachers in school districts in the United States. This has led to the growing popularity of Value Added models (VAMs) for estimating the contribution of individual teachers to student achievement, and to much debate in research and policy circles. A variety of critical concerns have been raised about VAM estimates, including their limited scope (Baker et al., 2010) and lack of explanatory or diagnostic value (Goe, Bell, and Little, 2011) for formative uses, and their instability (Schochet & Chiang, 2010) and ultimately non-causal nature on the summative side (Rubin, Stuart, and Zanutto, 2004). Taken together, these concerns have led to a widespread view that VAMs cannot be used alone to assess teachers, but only alongside other measures within a broader approach to evaluation (National Research Council, 2010).
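To make the basic logic of a value-added estimate concrete, the Python sketch below regresses current scores on prior scores for all students and then averages the residuals for each teacher. This is a deliberately simplified illustration under stated assumptions, not the specification used by any district or by the studies cited above: the function name and scores are hypothetical, there is a single prior-score covariate, and none of the refinements common in operational models (additional covariates, multiple years of data, shrinkage of estimates) are included. The concerns about instability and causal interpretation raised above apply with even greater force to a sketch this simple.

```python
from collections import defaultdict


def residual_gain_by_teacher(records):
    """Mean covariate-adjusted residual per teacher.

    records: list of (teacher_id, prior_score, current_score) tuples.
    """
    n = len(records)
    if n < 2:
        raise ValueError("need at least two student records")
    mean_prior = sum(r[1] for r in records) / n
    mean_current = sum(r[2] for r in records) / n
    sxx = sum((r[1] - mean_prior) ** 2 for r in records)
    if sxx == 0:
        raise ValueError("prior scores must vary across students")
    sxy = sum((r[1] - mean_prior) * (r[2] - mean_current) for r in records)
    # Ordinary least squares with one predictor: current = intercept + slope * prior.
    slope = sxy / sxx
    intercept = mean_current - slope * mean_prior
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        residuals[teacher].append(current - (intercept + slope * prior))
    # A teacher's estimate is the average amount by which that teacher's
    # students scored above or below the prediction from their prior scores.
    return {t: sum(v) / len(v) for t, v in residuals.items()}


# Hypothetical (teacher, prior score, current score) records.
students = [
    ("teacher_1", 40, 50), ("teacher_1", 50, 58), ("teacher_1", 60, 69),
    ("teacher_2", 40, 44), ("teacher_2", 50, 54), ("teacher_2", 60, 61),
]
estimates = residual_gain_by_teacher(students)
print(estimates)
```

With only three students per teacher, as in this toy example, such estimates would be far too noisy to support any decision about an individual teacher.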