Essay-Grading Software Seen As Time-Saving Tool

Teachers are turning to essay-grading software to critique student writing, but critics point to serious flaws in the technology

By Caralee J. Adams

Jeff Pence knows the best way for his 7th grade English students to improve their writing is to do more of it. But with 140 students, it would take him at least two weeks to grade a batch of their essays.

So the Canton, Ga., middle school teacher uses an online, automated essay-scoring program that allows students to get feedback on their writing before handing in their work.

"It doesn't tell them what to do, but it points out where issues may exist," said Mr. Pence, who says the Pearson WriteToLearn program engages the students almost like a game.

With the technology, he has been able to assign an essay a week and individualize instruction efficiently. "I feel it's pretty accurate," Mr. Pence said. "Is it perfect? No. But when I reach that 67th essay, I'm not real accurate, either. As a team, we are pretty good."

With the push for students to become better writers and meet the new Common Core State Standards, teachers are eager for new tools to help out. Pearson, which is based in London and New York City, is one of several companies upgrading its technology in this space, also known as artificial intelligence, AI, or machine-reading. New assessments to test deeper learning and move beyond multiple-choice answers are also fueling the demand for software to help automate the scoring of open-ended questions.

Critics contend the software doesn't do much more than count words and therefore can't replace human readers, so researchers are working hard to improve the software algorithms and counter the naysayers.
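The critics' objection can be illustrated with a deliberately naive scorer. This is a hypothetical sketch, not any vendor's algorithm: the function name `length_only_score` and the one-point-per-100-words rule are assumptions made up for the example. It shows only why a grader driven by word count alone would reward length regardless of meaning.

```python
def length_only_score(essay: str) -> int:
    """Map word count to a 1-6 score, ignoring meaning entirely (hypothetical)."""
    words = len(essay.split())
    # Assumed rule: one extra point per 100 words, clamped to the 1-6 scale.
    return max(1, min(6, 1 + words // 100))


# A short, coherent sentence versus 600 repetitions of a nonsense word.
short_but_clear = "Practice improves writing because feedback exposes weak arguments."
gibberish = " ".join(["blorf"] * 600)

print(length_only_score(short_but_clear))
print(length_only_score(gibberish))
```

Under this assumed rule, the nonsense essay gets the top score and the coherent sentence gets the lowest. Commercial engines use many more features than length, so the sketch marks the extreme case the critics describe rather than the state of the technology.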

While the technology has been developed primarily by companies in proprietary settings, there has been a new focus on improving it through open-source platforms. New players in the market, such as the startup venture LightSide and edX, the nonprofit enterprise started by Harvard University and the Massachusetts Institute of Technology, are openly sharing their research. Last year, the William and Flora Hewlett Foundation sponsored an open-source competition to spur innovation in automated writing assessments that attracted commercial vendors and teams of scientists from around the world. (The Hewlett Foundation supports coverage of "deeper learning" issues in Education Week.)

"We are seeing a lot of collaboration among competitors and individuals," said Michelle Barrett, the director of research systems and analysis for CTB/McGraw-Hill, which produces the Writing Roadmap for use in grades 3-12. "This unprecedented collaboration is encouraging a lot of discussion and transparency."

Mark D. Shermis, an education professor at the University of Akron, in Ohio, who supervised the Hewlett contest, said the meeting of top public and commercial researchers, along with input from a variety of fields, could help boost performance of the technology. The recommendation from the Hewlett trials is that the automated software be used as a "second reader" to monitor the human readers' performance or provide additional information about writing, Mr. Shermis said.
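One way to picture the "second reader" arrangement Mr. Shermis describes is a simple check that compares each human score with the machine score and sets aside large disagreements for another look. The sketch below is illustrative only. The 1-to-6 scale, the one-point tolerance, and the names `ScoredEssay`, `flag_for_review`, and `exact_agreement_rate` are assumptions, not details from the Hewlett trials.

```python
from dataclasses import dataclass


@dataclass
class ScoredEssay:
    essay_id: str
    human_score: int    # score from the human reader (assumed 1-6 scale)
    machine_score: int  # score from the automated engine (assumed 1-6 scale)


def flag_for_review(essays, max_gap=1):
    """Return IDs of essays whose two scores differ by more than max_gap."""
    return [e.essay_id for e in essays
            if abs(e.human_score - e.machine_score) > max_gap]


def exact_agreement_rate(essays):
    """Fraction of essays on which both readers gave the same score."""
    if not essays:
        return 0.0
    matches = sum(1 for e in essays if e.human_score == e.machine_score)
    return matches / len(essays)


# Made-up scores for four essays, for illustration only.
batch = [
    ScoredEssay("a1", 4, 4),
    ScoredEssay("a2", 5, 3),
    ScoredEssay("a3", 2, 3),
    ScoredEssay("a4", 6, 6),
]

print(flag_for_review(batch))
print(exact_agreement_rate(batch))
```

In this made-up batch only essay "a2" is flagged, because its two scores are two points apart. The machine score never replaces the human one here. It only signals where a second look may be worthwhile.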

"The technology can't do everything, and nobody is claiming it can," he said. "But it is a technology that has a promising future."

'Hot Topic'

The first automated essay-scoring systems go back to the early 1970s, but there wasn't much progress made until the 1990s with the advent of the Internet and the ability to store data on hard-disk drives, Mr. Shermis said. More recently, improvements have been made in the technology's ability to evaluate language, grammar, mechanics, and style; detect plagiarism; and provide quantitative and qualitative feedback.

The computer programs assign grades to writing samples, sometimes on a scale of 1 to 6, in a variety of areas, from word choice to organization. Some products give feedback to help students improve their writing. Others can grade short answers for content. To save time and money, the technology can be used in various ways on formative exercises or summative tests.

The Educational Testing Service first used its e-rater automated-scoring engine for a high-stakes exam in 1999 for the Graduate Management Admission Test, or GMAT, according to David Williamson, a senior research director for assessment innovation for the Princeton, N.J.-based company. It also uses the technology in its Criterion Online Writing Evaluation Service for grades 4-12.

Over the years, the capabilities changed substantially, evolving from simple rule-based coding to more sophisticated software systems. And statistical techniques from computational linguistics, natural language processing, and machine learning have helped develop better ways of identifying certain patterns in writing.

But challenges remain in coming up with a universal definition of good writing, and in training a computer to understand nuances such as "voice."

In time, with larger sets of data, more experts can identify nuanced aspects of writing and improve the technology, said Mr. Williamson, who is encouraged by the new era of openness about the research.

"It's a hot topic," he said. "There are a lot of researchers and academia and industry looking into this, and that's a good thing."

High-Stakes Testing

In addition to using the technology to improve writing in the classroom, West Virginia employs automated software for its statewide annual reading language arts assessments for grades 3-11. The state has worked with CTB/McGraw-Hill to customize its product and train the engine, using thousands of papers it has collected, to score the students' writing based on a specific prompt.

"We are confident the scoring is very accurate," said Sandra Foster, the lead coordinator of assessment and accountability in the West Virginia education office, who acknowledged facing skepticism initially from teachers. But many were won over, she said, after a comparability study showed that a trained teacher paired with the scoring engine scored more accurately than two trained teachers. Training involved a few hours in how to assess the writing rubric. Plus, writing scores have gone up since the technology was implemented.

Automated essay scoring is also used on the ACT Compass exams for community college placement, the new Pearson General Educational Development tests for a high school equivalency diploma, and other summative tests. But it has not yet been embraced by the College Board for the SAT or the rival ACT college-entrance exams.

The two consortia delivering the new assessments under the Common Core State Standards are reviewing machine-grading but have not committed to it.

Jeffrey Nellhaus, the director of policy, research, and design for the Partnership for Assessment of Readiness for College and Careers, or PARCC, wants to know if the technology will be a good fit with its assessment, and the consortium will be conducting a study based on writing from its first field test to see how the scoring engine performs.

Likewise, Tony Alpert, the chief operating officer for the Smarter Balanced Assessment Consortium, said his consortium will evaluate the technology carefully.

Open-Source Options

With his new company LightSide, in Pittsburgh, owner Elijah Mayfield said his data-driven approach to automated writing assessment sets itself apart from other products on the market.

"What we are trying to do is build a system that instead of correcting errors, finds the strongest and weakest sections of the writing and where to improve," he said. "It is acting more as a revisionist than a textbook."

The new software, which is available on an open-source platform, is being piloted this spring in districts in Pennsylvania and New York.

In higher education, edX has just introduced automated software to grade open-response questions for use by teachers and professors through its free online courses. "One of the challenges in the past was that the code and algorithms were not public. They were seen as black magic," said company President Anant Agarwal, noting the technology is in an experimental stage. "With edX, we put the code into open source where you can see how it is done to help us improve it."

Still, critics of essay-grading software, such as Les Perelman, want academic researchers to have broader access to vendors' products to evaluate their merit. Now retired, the former director of the MIT Writing Across the Curriculum program has studied some of the devices and was able to get a high score from one with an essay of gibberish.

"My main concern is that it doesn't work," he said. While the technology has some limited use with grading short answers for content, it relies too much on counting words, and reading an essay requires a deeper level of analysis best done by a human, contended Mr. Perelman.

"The real danger of this is that it can really dumb down education," he said. "It will make teachers teach students to write long, meaningless sentences and not care that much about actual content."