What is it about?
Many experiments in psychology are excellent at showing that an effect exists at the group level, but much less reliable when researchers want to compare people. For example, two tasks may be designed to measure the same cognitive process, such as inhibition or attention, but the individual differences obtained from those tasks often show weak or inconsistent correlations. This paper introduces a new statistical framework, the Standardized Generalized Hierarchical Factor Model, to study these individual differences more accurately. The model combines two traditions that are often treated separately: experimental psychology, which works with trial-by-trial data, and psychometrics, which studies how well measures capture psychological constructs. The framework allows researchers to estimate the true associations between experimental effects while accounting for trial-level noise. It also handles the asymmetric shape of response time data, which is common in psychological experiments. Finally, it can estimate factor loadings, a key feature of psychometric models such as Confirmatory and Exploratory Factor Analysis, allowing researchers to determine which tasks best capture a shared underlying process rather than assuming that all tasks measure it equally well.
Why is it important?
This work is important because weak correlations between experimental tasks are often interpreted substantively, as evidence that different cognitive processes are independent of one another. However, these weak correlations may partly reflect measurement error, especially given that many experimental measures are already known to show low reliability at the individual level (the "reliability paradox"; Hedge et al., 2018). When measurement error is high, low correlations are exactly what we should expect to observe, even if the underlying psychological processes are genuinely related. For this reason, low correlations provide no basis for substantive conclusions unless reliability has been properly taken into account. Otherwise, methodological artifacts can be mistaken for theoretical findings, allowing unreliable measures to shape psychological theories.
Beyond the problem of measurement error, the statistical model used to analyse the data can also matter. Our results show that standard Gaussian models can substantially underestimate true correlations when response times are skewed. This means that researchers may conclude that there is no shared cognitive process when, in fact, the statistical model itself is obscuring that underlying structure.
The proposed framework offers a more rigorous way to study individual differences in experimental psychology because it explicitly accounts for measurement error and is better suited to the asymmetric distributions that are typical of response time data. Beyond improving the estimation of associations between experimental effects, it also provides a way to validate experimental tasks that is conceptually similar to item validation in classical psychometrics. This allows researchers to determine which task variants best capture the psychological process that is shared across tasks, rather than assuming that all variants measure that process equally well. In this sense, the paper provides a bridge between experimental research and construct validation in psychometrics.
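The attenuation argument made in this section (that high measurement error alone produces low observed correlations, even when the underlying processes are strongly related) can be illustrated with a small simulation. The sketch below is not the model proposed in the paper: it uses plain Gaussian noise and the classical attenuation logic only. The function names and every number in it (a true correlation of 0.8, a 20 ms spread of true effects, a 150 ms trial-level standard deviation, 50 trials per condition) are assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_people=5000, rho_true=0.8, sd_true=20.0,
             sd_trial=150.0, n_trials=50, seed=1):
    """Simulate two tasks whose true effects correlate at rho_true.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    # Each observed effect is a difference between two condition means,
    # each computed from n_trials trials, so its sampling SD is:
    se = sd_trial * math.sqrt(2.0 / n_trials)

    true_a, true_b, obs_a, obs_b = [], [], [], []
    for _ in range(n_people):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        a = sd_true * z1
        b = sd_true * (rho_true * z1 + math.sqrt(1 - rho_true ** 2) * z2)
        true_a.append(a)
        true_b.append(b)
        # Observed effect = true effect + trial-level sampling noise
        obs_a.append(a + rng.gauss(0, se))
        obs_b.append(b + rng.gauss(0, se))

    # Share of observed variance that is true between-person variance
    reliability = sd_true ** 2 / (sd_true ** 2 + se ** 2)
    r_observed = pearson(obs_a, obs_b)
    return {
        "r_true": pearson(true_a, true_b),
        "r_observed": r_observed,
        "reliability": reliability,
        # Classical expectation when both tasks are equally reliable
        "r_expected": rho_true * reliability,
        # Classical correction for attenuation
        "r_disattenuated": r_observed / reliability,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumed values, the observed correlation should fall well below the true one simply because each person's effect is estimated with noise. A hierarchical model works with the trial-level data directly rather than correcting a noisy summary score after the fact, which is the general idea behind the framework described here.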
Perspectives
For me, this paper is about validity. In experimental psychology, we often use tasks as if they were interchangeable measures of latent cognitive processes. We talk about "the Stroop effect" or "the Flanker effect", and these labels are informative, but they can also hide an important problem: there is not just one Stroop task or one Flanker task. There are many variants, and each variant may differ in stimuli, timing, response format, scoring procedure, and many other design decisions. Assuming that all these variants measure the same cognitive process equally well, without testing that assumption empirically, has contributed to a confusing situation in the study of individual differences.
At first, this problem was often framed as a problem of reliability. Experimental effects were simply too noisy to correlate with other variables. But I think the issue has become broader than that. If different tasks show weak or inconsistent associations, we need to ask what this means. Are the cognitive processes truly distinct? Are the measures too unreliable? Are we using the wrong statistical model? Or are some task variants simply better indicators of the construct than others? These are not only technical questions. They are validity questions.
What I find most valuable about this paper is that it provides a framework for asking these questions inside the model itself. It allows researchers to estimate true associations between experimental effects, but also to evaluate how strongly each task reflects a shared latent process. In other words, it helps us move from assuming that a task measures a construct to empirically evaluating whether, and how well, it does.
From this perspective, the paper offers more than a methodological extension. It provides a psychometric lens for looking at experimental psychology. If experimental tasks are going to be used as measures of individual differences, then they need to be validated as measures. Everything comes back to validity.
Ricardo Rey-Sáez
Universidad Autónoma de Madrid
Read the Original
This page is a summary of: A unified framework for psychometrics in experimental psychology: The standardized generalized hierarchical factor model. Psychological Methods, June 2026, American Psychological Association (APA).
DOI: 10.1037/met0000846