22–25 Jul 2025
EAM2025
Atlantic/Canary timezone

Automated Scoring of Open-Ended Responses: Evaluating LLMs and Prompting Strategies

23 Jul 2025, 18:00
15m
Faculty of Social Sciences and Communication (The Pyramid), Room 13

Speaker

Daniil Talov (HSE University)

Description

Before the rapid development of artificial intelligence, standardized tests relied mainly on multiple-choice questions, because evaluating open-ended tasks required significant resources. Modern large language models (LLMs), such as ChatGPT, Gemini, and Llama, now make automated assessment of open-ended tasks feasible. Unlike traditional machine learning or deep learning methods, foundation LLMs do not require labeled datasets or extensive expertise in data science and programming: users only need to write a well-structured prompt and verify the model's agreement with human raters on a small sample.
This study assesses the feasibility of using LLMs to score open-ended tasks through prompting and explores which prompting strategies are effective. Several LLMs were compared, with GPT-4o serving as the baseline model. It was evaluated against o3-mini, which performs extended internal reasoning before answering. YandexGPT 4 was also included, as it is trained specifically for Russian, the language of the respondents in this study. To explore alternatives to cloud-based solutions, the DeepSeek-R1 8B and Llama 3.1 8B models were tested as well: they can run locally on an ordinary computer, reducing evaluation costs and keeping responses confidential.
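As one illustration of such a local setup (an assumption for this sketch; the abstract does not specify the tooling used), a locally hosted Llama 3.1 8B model could be queried through the ollama Python client:

    # Hedged sketch: scoring with a locally hosted model via the ollama Python
    # client. The model tag, prompt text, and rubric are placeholders, not the
    # study's actual setup.
    import ollama

    prompt = (
        "You are a rater. Score the following response from 0 to 2 "
        "using the rubric below.\n[rubric and student response go here]"
    )
    reply = ollama.chat(
        model="llama3.1:8b",  # an 8B model pulled locally beforehand
        messages=[{"role": "user", "content": prompt}],
    )
    print(reply["message"]["content"])  # the model's scoring output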
To evaluate the effectiveness of different prompting strategies, both analytical and holistic rubrics were used. Techniques included zero-shot (no examples), one-shot (one example), and few-shot (multiple examples). Additional strategies included chain-of-thought prompting (reasoning before making a decision), tree-of-thought reasoning (deliberation among "multiple evaluators" before deciding), and enhancing LLM "motivation" using specific phrases.
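For illustration only, the sketch below shows how such prompt variants might be assembled in Python; the rubric wording, score scale, and example responses are invented placeholders rather than the prompts used in the study.

    # Hypothetical sketch of building prompt variants for the strategies above.
    # Rubric, scale, and example responses are placeholders.
    RUBRIC = ("Score the response from 0 to 2: "
              "2 = fully correct, 1 = partially correct, 0 = incorrect.")

    EXAMPLES = [  # scored sample responses for one-/few-shot prompting (invented)
        ("Sample response A ...", 2),
        ("Sample response B ...", 1),
        ("Sample response C ...", 0),
    ]

    def build_prompt(response: str, strategy: str = "zero_shot") -> str:
        parts = [RUBRIC]
        if strategy in ("one_shot", "few_shot"):
            shots = EXAMPLES[:1] if strategy == "one_shot" else EXAMPLES
            parts += [f"Response: {text}\nScore: {score}" for text, score in shots]
        if strategy == "chain_of_thought":
            parts.append("Reason step by step before giving the final score.")
        if strategy == "motivation":
            parts.append("Focus and take a deep breath before answering.")
        parts.append(f"Response: {response}\nScore:")
        return "\n\n".join(parts)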
The models were tested on three types of tasks: (1) short-answer reading comprehension tasks for children aged 10–11 (responses of 1–2 sentences); (2) a prompt engineering task for undergraduate students (up to 150 words); and (3) an economics essay task for economics students (up to 144 words). LLM performance was measured with quadratically weighted Cohen's kappa (QWK) and mean absolute error (MAE), computed against human raters who evaluated 50 randomly selected responses for each task type.
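Both metrics are available in scikit-learn; a minimal sketch with illustrative (not actual) score vectors:

    # Minimal sketch of the two agreement metrics using scikit-learn.
    # The score vectors below are illustrative, not the study's data.
    from sklearn.metrics import cohen_kappa_score, mean_absolute_error

    human_scores = [2, 1, 0, 2, 1, 2, 0, 1]  # human ratings
    llm_scores   = [2, 1, 1, 2, 1, 2, 0, 0]  # LLM ratings of the same responses

    qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
    mae = mean_absolute_error(human_scores, llm_scores)
    print(f"Quadratic weighted kappa: {qwk:.2f}, MAE: {mae:.2f}")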
The presentation will compare the five LLMs across the three types of open-ended tasks. Results show that LLMs can achieve sufficient accuracy for use in low-stakes assessment scenarios. Findings on effective prompting strategies will also be shared, identifying best practices: analytical rubrics combined with multiple examples produced the most accurate assessments. Interestingly, "motivational" phrases such as "focus and take a deep breath before answering" or "I will pay you for good work" yielded accuracy comparable to example-based prompting. These results highlight the potential of LLMs for scoring open-ended responses: at this stage they can be used in low-stakes assessments or as assistants to human raters, and they can also support data annotation and the creation of training datasets for other machine learning models.

Oral presentation Automated Scoring of Open-Ended Responses: Evaluating LLMs and Prompting Strategies
Author Daniil Talov
Affiliation HSE University
Keywords LLM, Automated Scoring, Open-Ended Tasks

Primary author

Daniil Talov (HSE University)
