Jonathan Arnett
ENGL 5365 – Quantitative Methods
Final Paper
Investigations toward a Method of Analyzing Instructor Grading Trends in First Year Composition
Introduction
In every first-year composition program that employs graduate students as teachers, Writing Program Administrators (WPAs) face the annual challenge of orienting new instructors to the composition program's goals and procedures, not the least part of which involves grading. Aside from providing pedagogical theory and course development information (and especially the classroom management tips and techniques that new instructors crave), a WPA has to train new instructors in how to grade student papers, norm their grading standards to the program's standards, and provide refresher training on grading for experienced instructors (Qualley, 2002). This cycle of training and retraining helps both new and experienced graduate instructors provide consistent, reliable grades, but it obscures rather than addresses the interrelated questions of how graduate instructors develop as graders over time, and at what rate.
These general issues are fundamental to an understanding of how educators develop professionally, and the educational literature addresses them in part. For example, according to Coulter (2000), novice graders in first-year composition tend to follow scoring rubrics very closely, while experienced graders tend to use more internalized guides to grading. Saunders & Davis (1998) state that instructors' understanding of grading criteria depends on their discussions about the criteria, and that this understanding (and hence their use of the criteria) changes with staff turnover. Qualley (2002) discusses the “perpetual cycle of teacher preparation” (p. 279) faced by a first-year composition program that depends entirely on Master's degree students as instructors. Hipple & Bartholomew (1982) discuss instructor-orientation strategies that problematize the process of grading student papers. Wyatt-Smith & Castleton (2005) discuss the relative impacts of official educational materials and practices versus contextually based, personal criteria on the grading practices of Australian fifth-grade teachers. Flores (2003) examines elementary school teachers and reports that the second year of teaching is when new teachers turn their attention from teacher/student relationships and classroom management to the assessment of student essays. Campbell & Evans (2000) review student teachers' lesson plans and determine “that during student teaching, preservice teachers do not follow many of the assessment practices recommended in their coursework” (p. 350).
Although articles of this kind address the state and development of instructors' grading practices, they deal in generalities, do not trace the evolution of any particular teacher's grading patterns, and often address elementary education, which bears little resemblance to first-year composition instruction. A search of the literature revealed just one department-wide study of English instructors' grading practices in first-year composition courses: a comparison of the final grades assigned by Rhetoric 100, 101, and 102 instructors during the 1951-52 academic year at the University of Illinois' Chicago campus (Thompson, 1955). In it, the author demonstrates that a) several instructors in each course assign abnormally high or low final grades and b) several instructors in each course disperse their classes' final grades over unusually broad or narrow ranges. These conclusions are unlikely to surprise an experienced first-year composition instructor, but they make clear the need for a more in-depth examination of grading practices across the different sections of first-year composition programs.
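To illustrate the kind of department-wide comparison Thompson reports, the sketch below (written in Python, using entirely hypothetical final grades on a 4.0 scale) computes each instructor's grade average and spread and standardizes both against the departmental pattern. It is meant only as a rough analogue of that approach, not a reconstruction of Thompson's actual analysis; the instructor names and grades are invented.

```python
# Illustrative sketch only: hypothetical final grades on a 4.0 scale,
# not data from Thompson (1955) or any actual composition program.
from statistics import mean, stdev

final_grades = {
    "Instructor A": [3.0, 3.3, 2.7, 3.7, 3.0, 2.3, 3.3, 2.7],
    "Instructor B": [4.0, 4.0, 3.7, 4.0, 3.7, 4.0, 3.7, 4.0],  # lenient, narrow spread
    "Instructor C": [2.0, 3.7, 1.3, 4.0, 0.7, 3.3, 2.7, 1.7],  # very broad spread
    "Instructor D": [2.7, 3.0, 2.3, 3.3, 2.7, 3.0, 2.3, 3.0],
    "Instructor E": [1.7, 2.0, 2.3, 1.3, 2.0, 2.3, 1.7, 2.0],  # severe
}

# Per-instructor average and spread of final grades.
means = {name: mean(g) for name, g in final_grades.items()}
spreads = {name: stdev(g) for name, g in final_grades.items()}

# Standardize each instructor against the distribution of instructor-level statistics.
m_center, m_scale = mean(means.values()), stdev(means.values())
s_center, s_scale = mean(spreads.values()), stdev(spreads.values())

for name in final_grades:
    mean_z = (means[name] - m_center) / m_scale
    spread_z = (spreads[name] - s_center) / s_scale
    # Large |mean_z| suggests an unusually lenient or severe grader;
    # large |spread_z| suggests an unusually broad or narrow grade distribution.
    print(f"{name}: mean={means[name]:.2f} (z={mean_z:+.2f}), "
          f"sd={spreads[name]:.2f} (z={spread_z:+.2f})")
```

A comparable summary could be run on any program's registrar data, though, as discussed below, final grades alone conceal much of what instructors actually do when grading.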
Challenges
An examination of grading practices in first-year composition, however, is complicated by the difficulties inherent in evaluating any writing course's student products. As Branthwaite, Trueman, & Berrisford (1981) point out, “As it is normally carried out, marking essays is a very private and intuitive procedure” (p. 42). Some first-year composition programs, including those at SUNY-Stony Brook and the University of Cincinnati, have attempted to counter this problem with portfolios. Many authors, such as Belanoff & Elbow (1991) and Durst, Roemer, & Schultz (1994), have described the positive effects of the portfolio system; however, portfolio grading carries its own problems, such as creating conflict between members of grading groups, fostering a sense of disempowerment and lowered morale among less experienced graders, and generating deep, program-wide resistance on the part of the graders (Coulter, 2000). In addition, many first-year writing programs do not use portfolios at all; they employ a traditional model, in which instructors grade their own students' work, and this model has not been widely addressed in the literature.
The majority of the literature regarding individual graders' marking of individual student essays concerns high-stakes testing of the sort conducted by the National Center for Education Statistics (NCES) (White, Smith, & Vanneman, 2000), colleges and universities that use testing to place students in courses (Barritt, Stock, & Clark, 1986; Weigle, 1998), professional organizations that use tests for accreditation purposes (O'Neill & Lunz, 1997), and, more recently, Educational Testing Service (ETS), the entity that develops and administers the SAT. These organizations have developed strict protocols for accurately and efficiently grading large numbers of essay tests, but their high-stress, time-limited procedures (Engelhard, 1992; Gyagenda & Engelhard, 1998; MacMillan, 2000; McQueen & Congdon, 1997) are far removed from the (typically) steady, semester-long effort of grading student papers. For example, NCES has conducted the National Assessment of Educational Progress (NAEP) tests regularly since 1969. The 2000 iteration was expected to generate “close to 10 million constructed responses” (White, Smith, & Vanneman, 2000, p. 3) involving some degree of writing, which were to be graded by approximately 150 scorers for the mathematics portion, 175 for the science portion, and 50 for the writing portion. Clearly, although the methods developed by NCES and ETS are valuable for their intended purpose of evaluating large numbers of documents accurately, analyses of their graders' marking patterns do not shed light on the typical situation faced by a first-year composition instructor.
Another problem, one not addressed specifically in the literature, is the dearth of grades to analyze. College registrars maintain records of students' final grades, from which an investigator could conceivably assemble sets of final grades given by individual instructors, but final grades, as the Thompson article makes clear, are not deeply illuminating, for two reasons. First, final grades are issued only once or twice a year, and in the case of graduate students who teach first-year composition courses for only a handful of semesters before leaving, the data set left behind is not sufficient for in-depth analysis. Second, final grades are cumulative figures that often incorporate extraneous factors, such as extra credit and class participation, and thus may mask significant variations in instructors' everyday grading habits. Alternatively, a researcher might obtain detailed grade records from individual instructors, but it is questionable whether the effort of gathering the data would be cost-effective: data sets would likely be incomplete (teachers throw away old grade books; computer disks are erased, damaged, or discarded; spreadsheet data becomes corrupted or simply unreadable by modern programs), and it would be extremely difficult to find a relatively homogeneous sample of instructors who graded the same assignments in the same sequence.
As it stands, no longitudinal studies track individual teachers' reliability and severity under normal grading conditions, that is, conditions in which teachers grade students' writing as an integral part of their everyday work, over the course of multiple semesters. Hence, in order to examine accurately the ways in which first-year composition instructors change their grading practices over time, it would be worthwhile to examine a complete set of grades, comprising every grade on every assignment, given by actual first-year composition instructors in a college setting. Ideally, the object of study would be a complete set of individual instructors' assigned grades over a semester-long time frame that requires grading more than one type of student essay, as is the case in first-year composition courses but not in one-time tests like the SAT.
Severity and Reliability
The extant literature on individual teachers' grading patterns is somewhat limited, but it discusses two basic issues at length: rater severity, the degree to which graders tend to give high or low grades, and intra-rater reliability, the degree to which the same grader provides consistent marks over time.
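As a concrete point of reference, the brief Python sketch below shows one simple way severity might be operationalized: each rater's average signed deviation from the consensus score the group gives each essay. The scores are hypothetical, and this index is merely one plausible choice; it is not a measure drawn from the studies discussed below.

```python
# Illustrative sketch: a simple severity index for each rater, computed from
# hypothetical scores. Each row of `scores` is one rater's marks on the same
# five essays (higher = better). This is one plausible operationalization,
# not a measure taken from the studies cited in this section.
from statistics import mean

scores = {
    "Rater 1": [78, 85, 70, 90, 82],
    "Rater 2": [72, 80, 65, 84, 75],   # tends to score low (more severe)
    "Rater 3": [82, 88, 76, 93, 86],   # tends to score high (more lenient)
}

n_essays = len(next(iter(scores.values())))

# Consensus score for each essay: the mean across all raters.
consensus = [mean(r[i] for r in scores.values()) for i in range(n_essays)]

# Severity index: average signed deviation from the consensus.
# Negative values indicate a severe rater; positive values, a lenient one.
for name, marks in scores.items():
    severity = mean(m - c for m, c in zip(marks, consensus))
    print(f"{name}: severity index = {severity:+.2f}")
```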
Severity
A great deal of the literature on rater severity concerns studies of high-stakes testing, where teams of judges rate large numbers of essay tests, often in order to make pass/fail judgments or place students into classes. For example, Barritt, Stock, & Clark (1986) describe the tensions associated with evaluating student essays on a 1-4 scale for the purpose of placing students in composition courses at the University of Michigan. In particular, the authors focus on the confusion they encountered when essays were rated 1 (superior) by one judge and 4 (poor) by another. McQueen & Congdon (2000) examine rater severity over the course of a nine-day test-grading session (seven days of active grading, not including a weekend) in which 16 raters examined 8285 one- to two-page student-written papers. Ten of the raters display significant differences between the severity of their first and ninth days' ratings, with nine becoming more severe and one becoming less severe.
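A simplified analogue of this sort of comparison, not the authors' own analysis, might set a single rater's scores from an early grading day against the same rater's scores from a late grading day using a two-sample t test, as in the hypothetical Python sketch below (which assumes SciPy is available).

```python
# Simplified analogue (not McQueen & Congdon's own analysis): compare one
# rater's scores from an early grading day with the same rater's scores from
# a late grading day using a two-sample t test. All scores are hypothetical.
from scipy import stats

day_1_scores = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 4, 5]
day_9_scores = [3, 4, 3, 3, 4, 2, 3, 4, 3, 3, 4, 2, 3, 3, 4]  # drifting more severe

t_stat, p_value = stats.ttest_ind(day_1_scores, day_9_scores, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The rater's average score differs significantly between the two days.")
else:
    print("No significant difference in severity between the two days.")
```

One design caveat: because the essays graded on the two days differ, an apparent shift in severity could also reflect differences in essay quality, a confound that any real analysis would need to address.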
Other studies of rater severity examine long-term judging sessions. Lumley & McNamara (1995) examine the results of three judging sessions of an ESL speaking test, administered over a 20-month time frame, with each judging session lasting about a week. The results indicate that significant intra-rater differences in severity exist over time, even with training. Myford (1991) describes judges with three different levels of expertise (“buffs,” experts, and novices) rating student dramatic performances over a one-month period. The results show significant differences in the judges' severity over time. O'Neill & Lunz (1997) pool the results of 17 examinations, held at intervals over the course of 10 years, from a practical examination of scientific laboratory competence and statistically analyze the severity of the nine raters who participated in at least 10 examinations. They conclude that raters' severity is usually consistent and predictable, but that some raters are more consistent than others and even the most stable raters occasionally vary without warning.
Reliability
Such variability in intra-rater stability is no surprise to an experienced first-year composition instructor; educators have recognized intra-rater reliability as a problem for many years. For example, Coffman's 1971 article, “On the Reliability of Ratings of Essay Examinations in English,” cites five sources dated 1936 (Hartog & Rhodes), 1951 (Findlayson), 1953 (Pearson), 1954 (Vernon & Millican), and 1963 (Noyes) in a footnote specifically regarding intra-rater reliability.
Intra-rater reliability is not a problem in purely objective testing, where only one answer can be correct; it is, however, of paramount importance when writing samples are to be graded, a task that demands a degree of subjectivity. One popular way to measure intra-rater reliability is the repeated test: graders evaluate the same written materials at least twice. Eells (1930, as cited in Branthwaite, et al., 1981) describes a study in which 61 teachers rated two history and two geography essays twice, the second time after an 11-week interval. The mean correlation between the grades was only 0.37, which is surprisingly low. (A correlation of 1.0 indicates perfect agreement, a correlation of 0.0 indicates no relationship, and a correlation of -1.0 indicates perfect inverse agreement; a correlation of 0.37 indicates a “definite but small relationship” [Guilford, 1956, p. 145].)
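The repeated-test correlations reported in these studies are straightforward to compute. The Python sketch below, using hypothetical marks, calculates the Pearson correlation between a grader's first and second passes over the same ten essays; a value near 1.0 would indicate a highly consistent grader, while a value like Eells's 0.37 would indicate only a weak relationship between the two passes.

```python
# Illustrative test-retest check with hypothetical marks: one grader scores
# the same ten essays twice, several weeks apart, and the Pearson correlation
# between the two passes serves as an intra-rater reliability estimate.
from math import sqrt
from statistics import mean

first_pass  = [72, 85, 64, 90, 78, 55, 81, 69, 74, 88]
second_pass = [70, 80, 70, 92, 73, 62, 84, 65, 78, 85]

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

r = pearson_r(first_pass, second_pass)
print(f"Test-retest correlation: r = {r:.2f}")
```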
Blok (1985) compares 16 schoolteachers' independent, holistic ratings of 105 elementary school students' essays, rated on a 1-10 scale and repeated once after a 3-month interval. Intra-rater reliability correlations were inconsistent, ranging from a rather low 0.415 to a very respectable 0.910 (p. 51).
In contrast, other researchers report findings suggesting that high intra-rater reliability is achievable on repeated tests. Marsh & Ireland (1987) compare grades on essays composed by 139 seventh-grade students. Three experienced teachers graded each essay twice, with a 10-month break between rating sessions; in addition, three student teachers graded the essays during the second grading session. During the first grading session, the experienced teachers gave each paper a holistic score on a 1-100 scale. During the second grading session, all teachers graded the papers on six components using a nine-point scale and provided a holistic score on a 1-100 scale. The grades the experienced teachers assigned to each paper display correlations of 0.80 between the first and second holistic evaluations and 0.82 between the first holistic evaluation and the second evaluation's totaled component scores (p. 362).
Similarly, Shohamy, Gordon, & Kraemer (1992) use a 2x2 design with a repeated test to examine the effects of training versus no training on both professional English teachers and ordinary English speakers. Four groups of raters, each comprising five members, rated 50 essays. As a measure of intra-rater reliability, the trained professional English teachers re-graded ten randomly selected essays three weeks after the first rating session. The raters' correlation coefficients ranged from 0.76 to 0.97; the researchers described the ratings as “relatively stable, although they did vary from rater to rater” (p. 30) and recommended repeated training sessions as a method of maintaining high rater reliability.
Anecdotal evidence also suggests that graders tend to be internally inconsistent. Branthwaite, Trueman, & Berrisford (1981) describe the grades assigned in two cases of plagiarized academic assignments, in which the submitted papers “differ[ed] only in handwriting and minor changes of wording” (p. 42). In the first case, one student wrote and submitted 11 science lab reports that were copied and resubmitted by another student. The same instructor graded all 22 reports, but no pair of marks was identical, and the plagiarizing student received a higher grade on seven of the 11 assignments. None of the paired marks were more than one whole letter grade apart, but there was no statistically significant correlation between the paired grades (p. 43). In the second case, five students handed in plagiarized essays during a social science course, and three pairs of these essays were graded by a single instructor. Of those three pairs, none received the same grade, and the differences in grades were less than one full letter mark (p. 43).