IMPACT OF RATER BIAS ON ESSAY GRADE POPULATION INVARIANCE: A SIMULATION STUDY
Doctor of Philosophy (PhD), Washington State University
01/2017
https://hdl.handle.net/2376/112103
The use of human raters is a commonly accepted aspect of any testing environment that incorporates constructed responses. This acceptance, however, overlooks substantial evidence that human raters exhibit individual biases and variability that can influence the resulting scores. One technology that appears to have the potential to remove this source of bias is automated essay scoring (AES). Yet this tool is developed using a sample of human-rated responses, and is therefore potentially conditioned on the same human rater biases. The purpose of this simulation study was to explore the factors through which rater bias can enter a sample, and the sampling designs through which this bias can be ameliorated.
Three research questions were evaluated. The first focused on factors in the rating process—including the number of raters available, the number of raters selected to produce a single score, and the bias each rater brings to the final score—and how they influence the recovery of ratee ability parameters after sampling. The second evaluated which of three sampling designs—two equal allocation methods and simple random sampling—best reduced the sample bias in parameter recovery. The final question investigated whether the model, and the population parameters, used to simulate rater bias in scores influenced the results of the previous two questions.
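The contrast between simple random sampling and equal allocation of raters can be illustrated with a small simulation. The sketch below is a simplified, hypothetical setup (essay counts, rater counts, bias variances, and the additive score model are all illustrative assumptions, not the study's actual conditions or its rater bias models): each essay's observed score is its true ability plus a rater's systematic bias plus noise, and the two designs differ in how rater panels are assigned.

```python
import random
import statistics

random.seed(0)

# Hypothetical sizes, not the study's simulation conditions
N_ESSAYS, N_RATERS, K = 200, 10, 2

abilities = [random.gauss(0, 1) for _ in range(N_ESSAYS)]   # true ratee ability
biases = [random.gauss(0, 0.5) for _ in range(N_RATERS)]    # each rater's systematic bias

def score(essay, rater):
    # Illustrative additive model: ability + rater bias + random error
    return abilities[essay] + biases[rater] + random.gauss(0, 0.3)

def srs_panels():
    # Simple random sampling: each essay gets K raters chosen independently
    return [random.sample(range(N_RATERS), K) for _ in range(N_ESSAYS)]

def equal_allocation_panels():
    # Equal allocation: cycle through the rater pool so workloads stay balanced
    panels, i = [], 0
    for _ in range(N_ESSAYS):
        panels.append([(i + j) % N_RATERS for j in range(K)])
        i = (i + K) % N_RATERS
    return panels

def recovery_error(panels):
    # Mean absolute gap between an essay's mean observed score and its true ability
    errs = []
    for e, panel in enumerate(panels):
        mean_score = statistics.mean(score(e, r) for r in panel)
        errs.append(abs(mean_score - abilities[e]))
    return statistics.mean(errs)

print(f"SRS recovery error:              {recovery_error(srs_panels()):.3f}")
print(f"Equal allocation recovery error: {recovery_error(equal_allocation_panels()):.3f}")
```

Under equal allocation every rater scores the same number of essays, so no single rater's bias is over-represented in the sample; under simple random sampling, rater workloads (and hence bias contributions) vary by chance.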
Results for the first research question indicate that while some rater-centric factors influence sample-level estimates of ratee ability, the amount of bias simulated did not directly influence recovery. The results also indicate that the equal allocation methods improve ratee ability recovery over the simple random sampling design. Finally, the rater bias model employed did influence the results, as the two models differed drastically in the amount of error they added to the scores.
These results support recommendations—from the number of raters to use in developing a sample to the number of raters used per score—for those making sample-based decisions (or using samples to develop AES systems) on scores involving raters. Limitations of this study are described, and directions for additional research are given.
- Andrew E. Iverson
- Brian F. French (Advisor); Olusola Adesope (Committee Member); Chad Gotch (Committee Member); Mark D. Shermis (Committee Member)
- Washington State University
- Department of Kinesiology and Educational Psychology
- English
- Dissertation