
No. 0236
Section: Contributed Talk
Session: Mathematical Education (ME)
Time: 19th-B-13:50 -- 14:10
Title (Eng.): Can we use GPT-4 as a mathematics evaluator in education? Exploring the efficacy and limitations of an LLM-based automatic assessment system for open-ended mathematics questions
Author(s): Unggi Lee², Youngin Kim¹, Sangyun Lee¹, Jaehyeon Park¹, Jin Mun¹, Eunseo Lee¹, Yunjoo Yoo¹
¹Seoul National University, ²Korea University
Abstract: This paper presents an exploratory investigation into the potential of Large Language Models (LLMs), with a particular focus on GPT-4, to improve the precision and effectiveness of Automated Assessment Systems (AAS) for open-ended mathematics problems. Despite the transformative potential of LLMs across a vast array of disciplines, their incorporation into AAS remains largely unexplored, particularly for mathematical reasoning and open-ended problem solving. Our research aims to close this gap by building a GPT-4-based AAS and critically evaluating its efficacy.
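The abstract does not detail the grading pipeline; as a minimal sketch, a rubric-conditioned GPT-4 grading call through the OpenAI Chat Completions API might look like the following. The rubric text, prompt wording, and the helper score_response are hypothetical illustrations, not the authors' actual implementation.

```python
# Hypothetical sketch of rubric-based scoring with GPT-4 via the
# OpenAI Chat Completions API (openai>=1.0). The rubric and prompt
# wording are illustrative, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Score the student's response from 0 to 3:
3 - correct answer with a complete, valid justification
2 - correct answer with a partially justified argument
1 - relevant attempt with a flawed or incomplete argument
0 - no relevant mathematical work"""

def score_response(question: str, student_answer: str,
                   temperature: float = 0.0) -> str:
    """Ask GPT-4 to grade one open-ended response against the rubric."""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=temperature,  # 0 for near-deterministic grading
        messages=[
            {"role": "system",
             "content": "You are a mathematics grader. Apply the rubric strictly."},
            {"role": "user",
             "content": (f"Rubric:\n{RUBRIC}\n\nQuestion:\n{question}\n\n"
                         f"Student response:\n{student_answer}\n\n"
                         "Reply with the numeric score only.")},
        ],
    )
    return completion.choices[0].message.content.strip()
```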
The study was conducted on 4,180 responses to open-ended mathematics questions solved by 380 students. Three human experts independently assessed these responses against a pre-established rubric, while the GPT-4 model was concurrently tasked with assessing the same responses against the same rubric. In most test instances, our findings indicate high consistency between the human and GPT-4 assessments, demonstrating the promising potential of integrating GPT-4 into AAS. We qualitatively categorized the scoring errors that nonetheless emerged, from both GPT-4 and the human raters, according to error type, and we identified limitations of automated assessment for specific kinds of mathematical content.
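The abstract does not name the statistic behind "high consistency"; quadratically weighted Cohen's kappa is one common choice for agreement on ordinal rubric scores, sketched here on toy data.

```python
# Assumed (not stated in the abstract): human-model agreement measured
# with quadratically weighted Cohen's kappa, a standard statistic for
# ordinal rubric scores. The scores below are toy data.
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 2, 2, 0, 1, 3, 2, 1]  # expert ratings (toy data)
gpt4_scores  = [3, 2, 1, 0, 1, 3, 2, 2]  # model ratings (toy data)

kappa = cohen_kappa_score(human_scores, gpt4_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {kappa:.3f}")
```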
During this investigation, we also assessed two methods for enhancing GPT-4's assessment capabilities: (1) the use of elaborate prompts and (2) the application of advanced prompt-engineering techniques, namely Chain-of-Thought, Self-Consistency, and Tree-of-Thought. We found that comprehensive prompts significantly improved assessment quality. In contrast, the straightforward application of the advanced techniques yielded suboptimal results, indicating that their implementation needs further refinement.
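To make one of these techniques concrete: Self-Consistency samples several independent answers at non-zero temperature and takes a majority vote. A minimal sketch applied to grading, reusing the hypothetical score_response above, might look as follows (n and the temperature are illustrative).

```python
# Hypothetical sketch of Self-Consistency applied to grading: sample
# several gradings at non-zero temperature and keep the majority score.
from collections import Counter

def score_with_self_consistency(question: str, answer: str, n: int = 5) -> str:
    """Sample n independent gradings and return the most frequent score."""
    samples = [score_response(question, answer, temperature=0.7)
               for _ in range(n)]
    return Counter(samples).most_common(1)[0][0]
```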
This study is the first to evaluate GPT-4 in the context of AAS for open-ended mathematics problems, illuminating its strengths and weaknesses. Our findings lay the groundwork for future research to refine the deployment of LLMs in AAS, especially in mathematics education.
MSC number(s): 97U10
Keyword(s): Automatic assessment system (AAS), open-ended mathematics question, large language model (LLM), GPT-4
Language of Session (Talk): Korean