What is it about?

Researchers often compute an agreement coefficient such as Cohen's kappa on two occasions, using the same subjects or two overlapping groups of subjects. The first kappa may be calculated before the raters receive any training, while the second quantifies the extent of agreement among raters after they have received formal training. The fundamental research question is whether the observed difference between the two coefficients is statistically significant. This article provides a blueprint for carrying out this test.

Why is it important?

Before the publication of this article, there was no known formal procedure with broad applicability for testing the difference between two agreement coefficients for statistical significance. Our procedure handles every structure of dependency that may exist in the design of the inter-rater reliability experiment.


I wrote this article primarily to address the many requests I received from researchers across the world. My solution is based on a very simple principle that Jordan Ellenberg explained eloquently: "If the universe hands you a hard problem, try to solve an easier one instead, and hope the simple version is close enough to the original problem that the universe doesn't object." The difficulty of testing the difference between two agreement coefficients for statistical significance stems from the non-linear form of most agreement coefficients. So I decided to use linear approximations of these coefficients, which are generally valid when the number of subjects is large. This turned the task into an easier problem. My simulations showed that the solution works well even when the number of subjects is limited.
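To make the linearization idea concrete, here is a minimal sketch for the simplest case only: two raters, Cohen's kappa, and the same subjects rated on both occasions. It is an illustration under stated assumptions, not the article's exact procedure. The function names `linearized_kappa` and `paired_kappa_test` are hypothetical, the per-subject values come from a standard delta-method expansion of kappa, and the p-value uses a normal approximation chosen here for simplicity.

```python
import math
from statistics import mean, stdev


def linearized_kappa(ratings_a, ratings_b):
    """Return Cohen's kappa and one linearized value per subject.

    The per-subject values average exactly to kappa, so kappa can be
    treated approximately as a sample mean (delta-method assumption).
    """
    n = len(ratings_a)
    cats = set(ratings_a) | set(ratings_b)
    # Marginal classification proportions for each rater.
    p_a = {k: ratings_a.count(k) / n for k in cats}
    p_b = {k: ratings_b.count(k) / n for k in cats}
    agree = [1.0 if a == b else 0.0 for a, b in zip(ratings_a, ratings_b)]
    p_obs = mean(agree)
    p_exp = sum(p_a[k] * p_b[k] for k in cats)
    kappa = (p_obs - p_exp) / (1 - p_exp)

    values = []
    for a, b, agr in zip(ratings_a, ratings_b, agree):
        # Subject-level contribution to the chance-agreement term.
        p_exp_i = (p_b[a] + p_a[b]) / 2
        kappa_i = (agr - p_exp) / (1 - p_exp)
        values.append(kappa_i - 2 * (1 - kappa) * (p_exp_i - p_exp) / (1 - p_exp))
    return kappa, values


def paired_kappa_test(before, after):
    """Test kappa(before) - kappa(after) on the same subjects.

    Each argument is a (ratings_a, ratings_b) pair, subjects in the same order.
    """
    k1, v1 = linearized_kappa(*before)
    k2, v2 = linearized_kappa(*after)
    # Differencing per subject keeps the correlation between occasions.
    diffs = [x - y for x, y in zip(v1, v2)]
    se = stdev(diffs) / math.sqrt(len(diffs))
    z = (k1 - k2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return k1 - k2, se, z, p_value


# Made-up ratings for eight subjects, before and after rater training.
before = ([1, 1, 2, 2, 1, 2, 1, 2], [1, 2, 2, 1, 1, 2, 2, 1])
after = ([1, 1, 2, 2, 1, 2, 1, 2], [1, 1, 2, 2, 1, 2, 2, 2])
diff, se, z, p_value = paired_kappa_test(before, after)
print(f"difference={diff:.3f} se={se:.3f} z={z:.3f} p={p_value:.4f}")
```

Because each coefficient is replaced by an average of subject-level values, the comparison reduces to a familiar paired test on per-subject differences. Designs with only partially overlapping groups of subjects need the fuller treatment described in the article.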


Read the Original

This page is a summary of: Testing the Difference of Correlated Agreement Coefficients for Statistical Significance, Educational and Psychological Measurement, July 2015, SAGE Publications, DOI: 10.1177/0013164415596420.