Dissertation
Sampling Issues and Error Control in Calibrating Automated Scoring Models for Essays
Doctor of Philosophy (PhD), Washington State University
01/2012
Handle: https://hdl.handle.net/2376/4322
Abstract
Several testing programs incorporate an essay component. For efficiency as well as for measurement reasons, many programs use automated scoring systems. The use and interpretation of automated scores merit examination. From a validity perspective, the impact of sample selection on automated scoring has been under-investigated. The primary purpose of this study was to explore sampling approaches that facilitate valid use and interpretation of automated scores. Because the fairness of automated scoring partly depends on scores having the same meaning across population groups, the impact of sampling on population invariance was also examined.
Two research questions were addressed. The first concerned the impact of sampling on model calibration and the resulting automated scores. The second concerned the impact of sampling on the population invariance of those scores. The study analyzed the e-rater® automated essay scoring system using data collected in the GRE® program.
For Research Question 1, three general classes of sampling approach, crossed with four stratification schemes and five sample sizes, were compared. Results suggested that proportional stratification by "country/territory" and by "country/territory × language" were the most effective sampling approaches. For Research Question 2, several sampling approaches were selected based on their performance in Research Question 1 and compared for their impact on population invariance. Results showed that equal-proportional allocation by "language" performed best. Across both research questions, proportional stratification by "country/territory" was preferable.
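To make the allocation scheme concrete, the sketch below illustrates proportional stratification in Python: each stratum (here, a hypothetical "country" variable standing in for country/territory) contributes essays to the calibration sample in proportion to its share of the essay pool. The data frame, column names, and pool sizes are illustrative assumptions, not the study's data.

```python
import pandas as pd

def proportional_stratified_sample(pool, stratum_col, n, seed=0):
    """Draw a calibration sample of size ~n in which each stratum's share
    matches its share of the pool (proportional allocation)."""
    shares = pool[stratum_col].value_counts(normalize=True)
    parts = []
    for stratum, share in shares.items():
        k = max(1, round(share * n))  # guarantee every stratum is represented
        group = pool[pool[stratum_col] == stratum]
        parts.append(group.sample(n=min(k, len(group)), random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Hypothetical essay pool: one row per human-scored essay.
pool = pd.DataFrame({
    "essay_id": range(10_000),
    "country": ["A"] * 6_000 + ["B"] * 3_000 + ["C"] * 1_000,
})
calibration = proportional_stratified_sample(pool, "country", n=500)
print(calibration["country"].value_counts())  # roughly 300 A, 150 B, 50 C
```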
Eight guidelines were given to assist practitioners in sampling design for model calibration. Those guidelines included stratification strategies for dealing with heterogeneous populations and minimum sample size requirements. Following these guidelines may lead to better automated scoring models and scores that have similar meaning across population groups.
Three other factors were discussed that could interact with sampling and potentially cause a lack of population invariance in automated scores: (a) lack of invariance in human ratings across population groups, (b) large differences in writing proficiency across population groups, and (c) linear modeling effects.
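A minimal sketch of one way to probe population invariance follows, assuming essays scored by both a human rater and the automated model. It reports the standardized mean machine-minus-human difference per group, which should sit near zero for every group if scores carry the same meaning across groups. The column names, group labels, data, and the particular statistic are hypothetical illustrations, not the study's criteria.

```python
import pandas as pd

def invariance_check(scores, group_col, machine_col, human_col):
    """Standardized mean difference (machine minus human) per subgroup.
    Values near zero for every group are consistent with invariance;
    a group that stands apart signals differential score meaning."""
    diff = scores[machine_col] - scores[human_col]
    return diff.groupby(scores[group_col]).mean() / diff.std()

# Hypothetical essays with both a human rating and an automated score.
scores = pd.DataFrame({
    "language": ["en", "en", "es", "es", "zh", "zh"] * 50,
    "human":    [4.0, 3.5, 3.0, 4.5, 2.5, 3.5] * 50,
    "machine":  [4.1, 3.4, 3.3, 4.6, 2.2, 3.2] * 50,
})
print(invariance_check(scores, "language", "machine", "human"))
```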
Limitations of this research were described, including the lack of a second, randomly assigned human rater and the choice of external criteria for the invariance analysis. Finally, additional research derived from this study's findings was suggested.
Details
- Title
- Sampling Issues and Error Control in Calibrating Automated Scoring Models for Essays
- Creators
- Mo Zhang
- Contributors
- Brian F. French (Advisor)
- Michael S. Trevisan (Committee Member)
- Nairanjana Dasgupta (Committee Member)
- David M. Williamson (Committee Member)
- Awarding Institution
- Washington State University
- Academic Unit
- Department of Kinesiology and Educational Psychology
- Theses and Dissertations
- Doctor of Philosophy (PhD), Washington State University
- Number of pages
- 319
- Identifiers
- 99900581452201842
- Language
- English
- Resource Type
- Dissertation