Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
Recent research published on arXiv demonstrates notable progress in the automated scoring of long essays through summarization by generative language models. Encoder-based scoring models such as BERT are constrained by a fixed input length (typically 512 tokens), so long essays must be truncated, which limits scoring accuracy; the study reports a baseline Quadratic Weighted Kappa (QWK) of about 0.822 under this approach. Using generative language models to summarize long essays before scoring raises QWK to approximately 0.8878. This improvement suggests that summarization preserves the substance of lengthy written responses better than truncation, allowing the scoring model to capture their nuances more fully. Such advances hold promise for more reliable and efficient educational assessment, and the study positions generative language models as a meaningful step forward in automated scoring technology.
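
For reference, QWK is Cohen's kappa with quadratic penalty weights, so near-miss disagreements between raters are penalized less than large ones. Below is a minimal sketch of how such a score might be computed with scikit-learn; the human and model scores are illustrative placeholders, not data from the paper.

from sklearn.metrics import cohen_kappa_score

# Hypothetical integer essay scores from a human rater and an automated model.
human_scores = [1, 2, 2, 3, 4, 4, 5, 3]
model_scores = [1, 2, 3, 3, 4, 5, 5, 3]

# Quadratic weights penalize a 4-vs-5 disagreement far less than a 1-vs-5 one,
# which is why QWK is the standard agreement metric for ordinal essay scores.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK: {qwk:.4f}")

A QWK of 1.0 indicates perfect agreement with the human rater, 0.0 indicates chance-level agreement, and the 0.822 to 0.8878 gain reported in the study reflects substantially closer alignment with human judgments.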

