22–25 Jul 2025
EAM2025
Atlantic/Canary timezone

Measurement and Machine Learning

Not scheduled
1h 30m

Av. César Manrique, 38320 La Laguna, Santa Cruz de Tenerife

Speakers

David Goretzko (Utrecht University), Melanie Viola Partsch (Utrecht University), Philipp Sterner (Ruhr University Bochum), Dr Qixiang Fang (Utrecht University)

Abstract

Despite the popularity of structural equation modeling (SEM), evaluating the fit of SEM models remains challenging, especially when the global model fit evaluation indicates non-negligible misfit and researchers need to further investigate the type and severity of the misspecification in their model. Overwhelmed by poorly fitting models, researchers sometimes strain the interpretation of their global model test (e.g., the χ²-test, or fit indices such as the CFI and the RMSEA in combination with cutoff values) and attest acceptable model fit, even though they would be well advised to reject or revise their model. To counteract this questionable research practice, we developed a method that guides researchers through a more thorough process of model fit evaluation and, where necessary, revision.

In a proof-of-concept study, we previously showed that a pre-trained machine learning (ML) model can detect misfit in multifactorial measurement models with high accuracy. Building on this, we developed an automated ML-based workflow for SEM evaluation and revision. The workflow involves several ML models trained on up to 173 model and data features extracted from more than 1 million simulated data sets and multifactorial models fitted by means of confirmatory factor analysis. In the first step of the workflow, the researcher’s model is classified as either (a) correctly specified or misspecified by neglecting (b) a factor, (c) factor correlations, (d) cross-loadings, or (e) residual correlations. For classes a–c, we recommend, in summary: (a) accept the model; (b) reject the model and revise the underlying theory or operationalization; (c) free the factor correlations if willing to lift orthogonality constraints, or revise the model by including method factor(s). Classes d and e trigger the second step of the workflow, which determines the number of cross-loadings or residual correlations. Depending on the severity of the misspecification, we recommend, in summary: in case of a mild misspecification, researchers might freely estimate the concerned parameter(s), scrutinize their operationalization to understand the misspecification, and cross-validate the revised model on new data; in case of a moderate misspecification, researchers might revise their operationalization; in case of a severe misspecification, researchers might reject the model and revise the underlying theory.
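
The first step of such a workflow can be sketched as a multi-class classifier over features extracted from fitted models. The following is a minimal, hypothetical illustration with synthetic stand-in features and a random forest; the class labels mirror categories (a)–(e) above, but the feature set, learner, and data are placeholders, not the authors' actual pipeline:

```python
# Sketch of step 1: classify a fitted model's feature vector into one of
# five (mis)specification classes. All data here are synthetic stand-ins.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

CLASSES = [
    "correctly_specified",       # (a)
    "neglected_factor",          # (b)
    "neglected_factor_corrs",    # (c)
    "neglected_cross_loadings",  # (d)
    "neglected_residual_corrs",  # (e)
]

# Placeholder for features extracted from fitted CFA models and their data
# (the actual workflow uses up to 173 such features).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    n_classes=5, random_state=0,
)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Classify a "new" researcher's model from its extracted feature vector.
new_model_features = X[:1]
predicted = CLASSES[clf.predict(new_model_features)[0]]
print(predicted)
```

For classes (d) and (e), a second-stage model (e.g., a regressor or ordinal classifier over the same features) would then estimate the number of neglected parameters.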

While this ML-based workflow for SEM evaluation and revision is not without limitations (e.g., it cannot identify a mix of misspecifications and is so far only applicable to multifactorial measurement models), it provides applied researchers with unprecedented guidance in the complex, often iterative process of measurement and theory development, thereby hopefully encouraging them to face up to model misfit instead of neglecting it.

Keywords: Structural Equation Modeling (SEM), Latent Measurement Models, Model Misspecifications, Model Fit Evaluation, Model Revision, Machine Learning

Abstract

In recent years, psychological research has increasingly utilized novel (often digital) data sources. Sensing data, such as those collected from smartphones, enable researchers to monitor human behavior across diverse, ecologically valid contexts and over extended periods with relative ease. These rich datasets offer great potential for predicting psychological traits, such as personality facets, through approaches like personality computing and machine learning. While previous research shows promising results, the quality and comparability of sensing data, both within and across studies, remain challenging. Smartphone sensing data, for example, are influenced not only by different preprocessing steps but also by the hardware used, the operating system, and to some degree even the version of a specific app. Consequently, measurements derived from sensing data may contain systematic biases unrelated to the intended behavioral constructs. To address these issues, this project adopts a measurement invariance perspective for analyzing sensing data. We adapt and apply methods from latent variable modeling to ensure comparability between data from different devices. Additionally, we explore potential biases introduced by “non-invariant” sensing variables and discuss their implications for subsequent statistical modeling.
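
As a toy illustration of the problem (not the project's actual methods or data), the following simulation injects a device-specific offset into an otherwise identical behavioral measure; a group comparison on the measured scores then reflects the device, not the behavior:

```python
# Toy illustration of device-induced non-invariance in sensing data:
# two device types record the same underlying behavior, but device B
# adds a systematic offset (e.g., a different logging algorithm).
# All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_usage = rng.normal(loc=180, scale=40, size=1000)  # minutes/day
device = rng.integers(0, 2, size=1000)                 # 0 = A, 1 = B

bias_b = 25.0  # device B systematically over-records by ~25 minutes
measured = true_usage + bias_b * device + rng.normal(0, 5, size=1000)

# True behavior is identically distributed across device groups, so a
# mean difference in measured scores signals a device effect, not a
# behavioral difference.
diff = measured[device == 1].mean() - measured[device == 0].mean()
print(round(diff, 1))  # roughly the injected device bias
```

Measurement-invariance-inspired methods aim to detect and adjust exactly this kind of device-level shift before the measured scores enter downstream models.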

Keywords: Sensing Data, Personality Computing, Machine Learning, Measurement Invariance, Device-Induced Non-invariance

Abstract

Psychology is increasingly interested in the prediction of psychological constructs via machine learning (ML) models, for example, predicting a person’s personality or intelligence. To measure these psychological constructs, psychologists often draw on questionnaire data. In supervised ML, these measurements are then used as target variables (i.e., the “ground truth”) for model training. Recently, Tay et al. (2022) introduced a conceptual framework that outlines various sources of bias throughout the ML modeling process. One potential bias is non-invariance across groups of the questionnaire data used as target values for supervised learning. As Tay and colleagues note, if the questionnaire used to collect the target data produces different expected scores for two groups with the same true score, this can bias the predictions of the final ML model: two groups with the same underlying true score on the construct of interest may receive different predicted scores from the ML model. The goal of this work is to assess the actual impact of a lack of measurement invariance in target variables on the predictive performance of ML models. We investigate this impact in three ways: empirically, semi-empirically, and through simulation. We also discuss possible solutions to counter the impact of non-invariance in target variables.
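
The core mechanism can be made concrete with a small simulation (a hypothetical sketch, not the study's actual design): if the observed target carries a group-specific intercept bias at equal true scores, a model trained on that target reproduces the bias in its predictions.

```python
# Sketch: an ML model trained on a non-invariant target inherits its bias.
# All quantities are synthetic and for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 5000
group = rng.integers(0, 2, size=n)
true_score = rng.normal(0, 1, size=n)  # same distribution in both groups

# Features: a noisy behavioral proxy of the true score, plus group
# membership (which sensing features often proxy in practice).
X = np.column_stack([true_score + rng.normal(0, 0.5, size=n), group])

# Non-invariant measurement: the questionnaire yields higher observed
# scores in group 1 at the same true score (intercept bias of 0.5).
observed_target = true_score + 0.5 * group + rng.normal(0, 0.3, size=n)

model = LinearRegression().fit(X, observed_target)
pred = model.predict(X)

# Despite identical true-score distributions, predictions differ by group.
gap = pred[group == 1].mean() - pred[group == 0].mean()
print(round(gap, 2))  # roughly the injected intercept bias
```

The empirical and semi-empirical analyses then ask how large such prediction gaps become under realistic degrees of non-invariance.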

Keywords: Machine Learning, Predictive Modeling, Measurement Invariance, Bias in Machine Learning

Abstract

Machine learning (ML) has become very popular in the social sciences, for example, for predicting psychological constructs from digital behavioral and sensing data or for developing advanced psychometric methods. It has thereby introduced new opportunities as well as challenges for measurement and modeling in the social sciences. Our symposium covers a broad variety of these by discussing issues of, and solutions to, measurement non-invariance in both the features (i.e., the predictors) and target variables (i.e., the respective outcomes) of predictive ML models, addressing reliability and validity issues in measurements based on large language models (LLMs), and introducing an ML-based psychometric method to detect, classify, and resolve model misspecifications in structural equation modeling (SEM).

In the first talk, David Goretzko (UU) focuses on device-induced non-invariance of measurements derived from sensing data, which are often used as features or input for ML models predicting psychological traits. He discusses methods, inspired by approaches to measurement invariance in latent variable modeling, that ensure comparability between data from different devices.

The second talk by Philipp Sterner (RUB) centers on the non-invariance of target variables in ML models, for example, non-invariant survey measurements of psychological constructs. He examines how measurement non-invariance of target variables impacts the predictive performance of ML models and discusses possible solutions to this problem.

In the third talk, Qixiang Fang (UU) presents a study that synthesised findings on approaches to questionnaire-based measurement error, on the one hand, and ML-/LLM-based measurement error, on the other hand. He then introduces a framework aiming to enhance the reliability and validity of LLM-based social science outcomes.

The last talk by Melanie Partsch (UU) focuses on model fit evaluation and model revision in SEM. She introduces an ML-based workflow that determines whether a researcher’s measurement model is correctly specified or misspecified, classifies the type and severity of the misspecification, and makes recommendations on how to revise the model, if applicable.

Keywords: Machine Learning, Large Language Models, Personality Computing, Measurement (Non-)Invariance, Measurement Error, Latent Variable Modeling, Model Fit Evaluation

Abstract

With the advent of machine learning tools and large language models (LLMs), the collection of measurements related to social science constructs (e.g., personality traits, political attitudes, human values) has become easier, faster and more affordable. These measurements are subsequently used in the modelling of societal and group processes that social scientists typically engage in, in which inferences from samples to populations are also made. Valid modelling and inference, however, require high-quality measurements or, at the very least, methods to deal with the presence of measurement error. Just like traditional questionnaire-based measurements, machine learning- and LLM-based measurements have been shown to suffer from validity and reliability issues.

While there is an abundant research literature on dealing with measurement error, it focuses on questionnaire-based measurement error. It is not yet clear how measurement issues arising from machine learning tools and LLMs should be handled in social science modelling research.

This study has two primary objectives. First, we review existing literature to identify practices for addressing machine learning- and LLM-related measurement error, both in computer science and in social sciences. Second, we synthesise these findings with existing measurement modelling literature to propose a practical framework for making valid inferences using machine learning- and LLM-based measurements in social sciences. By bridging the gap between modern machine prediction capabilities and social science inference requirements, our framework aims to enhance the reliability and validity of social science research outcomes in the era of machine learning and LLMs.
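
One classic reason measurement error threatens downstream inference is errors-in-variables attenuation: noise in a predictor shrinks estimated regression coefficients toward zero. The following toy simulation (not part of the study's framework; all quantities are synthetic) shows the shrinkage for a noisy, LLM-style measurement of a construct, and how the coefficient could be corrected if the measurement's reliability were known:

```python
# Toy errors-in-variables demonstration: a noisy measurement of x
# attenuates the estimated slope of y on x by the measurement's
# reliability. Synthetic data; reliability is known only because we
# simulated it.
import numpy as np

rng = np.random.default_rng(2)
n = 20000
x_true = rng.normal(0, 1, n)           # the construct itself
y = 0.8 * x_true + rng.normal(0, 0.5, n)

# Noisy (e.g., LLM-derived) measurement of x; error variance 1.0
# implies a reliability of var(x) / (var(x) + 1.0) = 0.5 here.
x_noisy = x_true + rng.normal(0, 1.0, n)

slope_true = np.cov(x_true, y)[0, 1] / np.var(x_true, ddof=1)
slope_noisy = np.cov(x_noisy, y)[0, 1] / np.var(x_noisy, ddof=1)

# Disattenuation: dividing by the (here, known) reliability recovers
# approximately the true slope.
reliability = np.var(x_true, ddof=1) / np.var(x_noisy, ddof=1)
corrected = slope_noisy / reliability
```

In practice the reliability of an ML- or LLM-based measurement must itself be estimated, which is one reason a dedicated framework for such measurements is needed.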

Keywords: Machine Learning, Prediction-Based Inference, Large Language Models, Measurement Error, Validity, Reliability, Computational Social Sciences

Symposium title Measurement and Machine Learning
Coordinator Melanie Viola Partsch
Affiliation Utrecht University
Keywords Machine Learning, Personality Computing, SEM
Number of communications 4
Communication 1 About the (Non-)Invariance of Sensing Data
Authors David Goretzko & Clemens Stachl
Affiliation Utrecht University
Keywords Sensing Data, Measurement Invariance
Communication 2 The Impact of Measurement Non-invariance in Target Variables on Machine Learning Predictions
Authors Philipp Sterner, Eunsook Kim, & David Goretzko
Affiliation Ruhr University Bochum
Keywords Machine Learning, Non-invariant Targets
Communication 3 Addressing Measurement Error in Machine Learning-Assisted Social Science Modeling
Authors Qixiang Fang, Javier Garcia Bernardo, & Erik-Jan van Kesteren
Affiliation Utrecht University
Keywords Large Language Models, Measurement Error
Communication 4 A Machine Learning-Based Workflow for Model Evaluation and Revision in SEM
Authors Melanie Viola Partsch & David Goretzko
Affiliation Utrecht University
Keywords SEM, Model Fit, Machine Learning

Primary authors

David Goretzko (Utrecht University), Melanie Viola Partsch (Utrecht University), Philipp Sterner (Ruhr University Bochum), Dr Qixiang Fang (Utrecht University)
