Evaluating the psychometric properties of ChatGPT-generated questions

Not much is known about how LLM-generated questions compare to gold-standard, traditional formative assessments concerning their difficulty and discrimination parameters, which are valued properties in the psychometric measurement field. We follow a rigorous measurement methodology to compare a set of ChatGPT-generated questions, produced from one lesson summary in a textbook, to existing questions from a published Creative Commons textbook. To do this, we collected and analyzed responses from 207 test respondents who answered questions from both item pools and used a linking methodology to compare IRT properties between the two pools. We find that neither the difficulty nor the discrimination parameters of the 15 items in each pool differ to a statistically significant degree, with some evidence that the ChatGPT items were marginally better at distinguishing among respondents of different ability. The response time also does not differ significantly between the two sources of items. The ChatGPT-generated items showed evidence of unidimensionality and did not affect the unidimensionality of the original set of items when tested together.
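To make the two item parameters concrete, here is a minimal sketch of an item response function, assuming a two-parameter logistic (2PL) model. This summary does not state which IRT model the authors fitted, so the model choice, the function name `p_correct`, and the parameter values below are illustrative assumptions, not the paper's method or data. Difficulty `b` is the ability level at which a respondent has a 50% chance of answering correctly; discrimination `a` controls how sharply that probability rises around `b`.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function (assumed model, for illustration only).

    theta: respondent ability
    a:     item discrimination (slope at the difficulty point)
    b:     item difficulty (ability at which P(correct) = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Two hypothetical items with equal difficulty but different discrimination.
# These values are made up and are not estimates from the study.
weak_item = {"a": 0.5, "b": 0.0}
sharp_item = {"a": 2.0, "b": 0.0}

for theta in (-1.0, 0.0, 1.0):
    print(
        f"theta={theta:+.1f}  "
        f"weak={p_correct(theta, **weak_item):.3f}  "
        f"sharp={p_correct(theta, **sharp_item):.3f}"
    )
```

Both hypothetical items give a probability of 0.5 at `theta = b`, but the item with the larger `a` separates respondents just below and just above that point more strongly, which is the sense in which a more discriminating item is better at telling ability levels apart.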

Introduction. Much research has been conducted on different processes of question generation (Sayin & Gierl, 2024), question difficulty control (Lu & Wang, 2024), and psychometric validation (Colvin et al., 2014; Liu et al., 2024). However, the current body of literature lacks the use of psychometric methods to evaluate the properties of AI-generated items (Tan et al., 2024). If high-quality question generation is possible with AI, we will be able to develop intelligent tutoring systems capable of both unlimited question generation and effective control of question difficulty. The potential problems this system would solve are multifaceted. Firstly, the lack of control over question difficulty poses issues such as generating questions of inappropriate difficulty that hinder student learning and requiring tedious manual question searches from exam designers. With an intelligent tutoring system that can efficiently control difficulty, these issues would be mitigated.

Discussion / Conclusion. Item generation using LLMs is a method that can be used to scale the item development process and produce the large numbers of items needed to support Intelligent Tutoring Systems and other formative assessment systems. In this study, we aimed to compare the quality of ChatGPT-generated and human-generated items in terms of the psychometric properties of each set of items. The results showed that ChatGPT-generated questions have psychometric properties comparable to those of gold-standard, human-authored textbook questions in College Algebra. Specifically, we found no statistically reliable difference in the difficulty and discrimination parameters between the two. Based on our preliminary analysis, ChatGPT, when given the appropriate prompt, can generate items with locations equally spaced within the ability distribution of respondents and even generate items with higher discrimination power. While item parameters did not differ, we wanted to further investigate whether the subject matter of the generated questions differed.

Lines of inquiry this paper opens

Research framings built by reading the notes related to this paper — the questions it feeds into.

Why do benchmark improvements fail to reflect actual reasoning quality?

Could AI assessment quality differ across subjects or question formats?
