What is it about?
Many analyses have been performed on Information Retrieval (IR) evaluation benchmarks, and benchmarking also plays a central role in evaluating the capabilities of Large Language Models (LLMs). In this paper, we apply an IR approach to LLM evaluation. Adapting a method originally developed for TREC test collections, we analyze LLM benchmark results through the lens of network science. We construct a bipartite graph between models and benchmark questions and apply Kleinberg's HITS algorithm to uncover latent structure in the evaluation data. In this framework, model hubness quantifies a model's tendency to perform well on easy questions, while question hubness captures how well a question discriminates between more and less effective models. We conduct experiments on seven multiple-choice QA benchmarks with a pool of 34 LLMs. Through this IR-inspired approach, we show that the ranking of models on leaderboards is strongly influenced by subsets of easy questions.
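The idea of running HITS on a model–question graph can be illustrated with a minimal sketch. This is not the paper's implementation: the toy correctness matrix, the edge direction (model → question when the answer is correct), and the iteration count are all assumptions made for illustration. Under this orientation, models play the role of hubs and questions the role of authorities, so an "easy" question answered by many models accrues a high authority score, and a model that answers many such questions accrues a high hub score.

```python
import numpy as np

# Hypothetical toy data: binary correctness matrix A (models x questions),
# where A[m, q] = 1 if model m answered question q correctly.
A = np.array([
    [1, 1, 1, 0],   # strong model: answers three questions
    [1, 1, 0, 0],   # middling model
    [1, 0, 0, 0],   # weak model: only answers the easiest question
], dtype=float)

def hits(adj, iters=100):
    """Kleinberg's HITS via power iteration on a bipartite adjacency matrix.
    Rows (models) receive hub scores; columns (questions) receive
    authority scores."""
    hubs = np.ones(adj.shape[0])
    auth = np.ones(adj.shape[1])
    for _ in range(iters):
        auth = adj.T @ hubs              # authority = sum of incoming hub scores
        auth /= np.linalg.norm(auth)     # normalize to keep scores bounded
        hubs = adj @ auth                # hub = sum of authority scores reached
        hubs /= np.linalg.norm(hubs)
    return hubs, auth

model_hubs, question_auths = hits(A)
```

In this toy run, question 0 (answered by every model) ends up with the highest authority score, matching the intuition that easy questions dominate, while question 3 (answered by no model) scores zero. Scoring "question hubness" as described in the paper would use the reverse edge orientation; this sketch shows only one direction of the computation.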
Read the Original
This page is a summary of: Analyzing AI Evaluation Benchmarks Through Information Retrieval and Network Science, January 2026, Springer Science + Business Media,
DOI: 10.1007/978-3-032-21300-6_25.