This is a nonstandard evaluation of the Waterloo spam rankings for the ClueWeb datasets. A standard evaluation measures the improvement in retrieval effectiveness gained by spam filtering. This one instead uses qrels (manually judged document-query pairs) as the ground truth: binary classification accuracy (spam vs. non-spam) is measured and reported, and the results are discussed from a different perspective.

It is argued that the ClueWeb09 spam scores actually act as document-quality metrics (e.g. PageRank). Eliminating 70% of the corpus therefore means working with the highest-quality 30% subset of the full dataset.
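
The kind of intrinsic evaluation described above can be illustrated with a minimal sketch. It assumes that each document carries a percentile spam score in which lower values mean "more spam-like", that documents scoring below a cutoff of 70 are treated as spam (matching the 70% elimination mentioned above), and that a boolean spam/non-spam label has already been derived from the qrels. The function names, document identifiers, and toy scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, Tuple


def confusion_counts(
    judged: Iterable[Tuple[str, bool]],
    percentile: Dict[str, int],
    threshold: int = 70,
) -> Tuple[int, int, int, int]:
    """Count (tp, fp, tn, fn) with 'spam' as the positive class.

    judged:     (doc_id, is_spam) pairs taken from manual judgments (assumed format).
    percentile: doc_id -> percentile spam score; lower is assumed to be spammier.
    threshold:  documents scoring below this value are predicted to be spam.
    """
    tp = fp = tn = fn = 0
    for doc_id, is_spam in judged:
        if doc_id not in percentile:
            continue  # skip judged documents that have no spam score
        predicted_spam = percentile[doc_id] < threshold
        if predicted_spam and is_spam:
            tp += 1
        elif predicted_spam and not is_spam:
            fp += 1
        elif not predicted_spam and not is_spam:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of judged documents whose predicted class matches the label."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Toy data, invented for illustration only.
judged = [("d1", True), ("d2", False), ("d3", True), ("d4", False)]
scores = {"d1": 10, "d2": 85, "d3": 90, "d4": 40}

counts = confusion_counts(judged, scores)
print(counts, accuracy(*counts))
```

In this sketch a judged non-spam document that falls below the cutoff counts as a false positive; under the paper's argument, such documents would be low-quality pages rather than spam in the strict sense.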


In contrast to ClueWeb09, the spam scores of ClueWeb12 are not useful: spam filtering actually degrades IR effectiveness. This finding calls attention to the need for a new spam classifier for ClueWeb12.

Assoc. Prof. Dr. Ahmet Arslan
Eskisehir Technical University

This page is a summary of: An intrinsic evaluation of the Waterloo spam rankings of the ClueWeb09 and ClueWeb12 datasets, Journal of Information Science, August 2019, SAGE Publications, DOI: 10.1177/0165551519866551.
