An intrinsic evaluation of the Waterloo spam rankings of the ClueWeb09 and ClueWeb12 datasets

İbrahim Barış Yılmazel; Ahmet Arslan

doi:10.1177/0165551519866551

What is it about?

This is a nonstandard evaluation of the Waterloo spam rankings for the ClueWeb datasets. A standard evaluation measures the improvement in retrieval effectiveness gained by spam filtering. This one instead uses qrels (manually judged query-document pairs) as the ground truth: binary classification accuracy (spam vs. non-spam) is measured and reported, and the results are discussed from a different perspective.
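To make the idea concrete, here is a minimal sketch of measuring binary classification accuracy against judged documents. It is not the authors' code. The document ids, the toy scores, and the `judged_spam` mapping are hypothetical, and the threshold of 70 (with lower percentiles treated as spammier) is an assumption for illustration; how the qrels encode a spam judgment is not stated in this summary, so the sketch takes plain booleans.

```python
from collections import Counter

# Assumption: a document whose spam percentile is below this value is
# predicted to be spam (lower percentile = spammier).
SPAM_THRESHOLD = 70


def confusion_counts(percentiles, judged_spam, threshold=SPAM_THRESHOLD):
    """Count tp/fp/tn/fn over judged documents that also have a spam score."""
    counts = Counter()
    for doc_id, is_spam in judged_spam.items():
        if doc_id not in percentiles:
            continue  # unscored documents are skipped in this sketch
        predicted_spam = percentiles[doc_id] < threshold
        if predicted_spam and is_spam:
            counts["tp"] += 1
        elif predicted_spam and not is_spam:
            counts["fp"] += 1
        elif not predicted_spam and is_spam:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(counts):
    """Fraction of judged documents whose predicted label matches the judgment."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


# Hypothetical toy data: spam percentiles and human spam judgments.
percentiles = {"d1": 5, "d2": 85, "d3": 40, "d4": 92}
judged_spam = {"d1": True, "d2": False, "d3": False, "d4": True}

counts = confusion_counts(percentiles, judged_spam)
print(dict(counts), accuracy(counts))
```

In this toy data, two of the four judged documents are classified correctly. Keeping the four confusion counts separate also makes it possible to derive precision and recall for the spam class, not only accuracy.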

Why is it important?

It is argued that the ClueWeb09 spam scores actually behave like document-quality metrics (e.g. PageRank). Eliminating 70% of the corpus therefore means working with the highest-quality 30% subset of the full dataset.
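The filtering step described above can be sketched as follows. This is an illustrative assumption, not the authors' implementation: `filter_ranking`, the document ids, and the toy scores are hypothetical, the cut-off of 70 mirrors the "eliminate 70%" setting, and documents without a score are dropped purely as a simplification.

```python
def filter_ranking(ranking, percentiles, threshold=70):
    """Keep documents scoring at or above the threshold, preserving rank order.

    Assumption: documents missing from `percentiles` are dropped.
    """
    return [doc_id for doc_id in ranking
            if percentiles.get(doc_id, -1) >= threshold]


# Hypothetical ranked list for one query, best first.
ranking = ["d3", "d1", "d4", "d2", "d9"]
percentiles = {"d1": 5, "d2": 85, "d3": 40, "d4": 92}

filtered = filter_ranking(ranking, percentiles)
print(filtered)
```

Only the documents in the top 30% of the score range survive, in their original order. A retrieval metric computed on `filtered` versus `ranking` is what a standard effectiveness-based evaluation would compare.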

Perspectives

In contrast to the ClueWeb09 dataset, the spam scores of ClueWeb12 are not useful: spam filtering actually degrades IR effectiveness. This finding calls attention to the need for a new spam classifier for ClueWeb12.
Assoc. Prof. Dr. Ahmet Arslan
Eskisehir Technical University

This page is a summary of: An intrinsic evaluation of the Waterloo spam rankings of the ClueWeb09 and ClueWeb12 datasets, Journal of Information Science, August 2019, SAGE Publications, DOI: 10.1177/0165551519866551.

Evaluating the spam-classifier power of the Waterloo spam rankings for the ClueWeb datasets.
