What is it about?

Duplicates are multiple database entries that represent the same real-world object. A typical example is a person who appears several times in a customer database and thus, for instance, receives multiple copies of the same advertisement. Duplicate detection systems try to clean data by automatically finding and removing such duplicates. Because duplicate detection is an active research area, it is important to be able to evaluate such systems: do they succeed in finding all true duplicates? To measure this, you normally need to know the actual duplicates in advance. With our method, you can avoid this and instead make use of the wisdom of many such systems: if they all agree on some data, it is probably no longer necessary to check that data manually.
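To make the idea concrete, here is a minimal sketch in Python of such a consensus step. It is not the paper's actual algorithm: the function names, the unanimity rule, and the representation of each system as a pairwise classifier are illustrative assumptions.

```python
from itertools import combinations

def triage_pairs(records, systems):
    """Partition record pairs by how many detection systems call them duplicates.

    records: list of record ids
    systems: list of functions, each mapping a pair (a, b) to True if it
             judges the two records to be duplicates (illustrative interface)
    Returns (accepted, rejected, disputed): pairs every system marks as
    duplicates, pairs no system marks, and pairs that need manual review.
    """
    accepted, rejected, disputed = [], [], []
    for a, b in combinations(records, 2):
        votes = sum(system(a, b) for system in systems)
        if votes == len(systems):      # unanimous "duplicate"
            accepted.append((a, b))
        elif votes == 0:               # unanimous "not a duplicate"
            rejected.append((a, b))
        else:                          # systems disagree: ask a human
            disputed.append((a, b))
    return accepted, rejected, disputed
```

Only the disputed pairs then require manual annotation; the more the systems agree, the smaller that remainder becomes.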

Why is it important?

Manually annotating duplicates is a difficult and tedious task. In principle, we must compare every record with every other record and decide whether the two might represent the same real-world object, e.g., the same customer. For n records this means roughly n²/2 comparisons, so the effort grows quadratically with the size of the database. Performing this many comparisons costs time and money, and any savings here are welcome.
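As a back-of-the-envelope check (a hypothetical helper, not code from the paper), the exact pair count is n·(n−1)/2:

```python
def num_comparisons(n: int) -> int:
    """Number of record pairs in a naive all-pairs duplicate scan."""
    return n * (n - 1) // 2

print(num_comparisons(1_000))      # 499500
print(num_comparisons(1_000_000))  # 499999500000, roughly 500 billion pairs
```

A database of one million customers thus already implies about 500 billion comparisons, which is why avoiding manual annotation matters so much.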

Perspectives

A particularly fun part of the project was working with 35 student teams in three-day workshops, all tackling the same dataset and each coming up with a different duplicate detection technique. Their work was the basis for our own evaluation.

Felix Naumann

Read the Original

This page is a summary of: Reach for gold, Journal of Data and Information Quality, September 2014, ACM (Association for Computing Machinery). DOI: 10.1145/2629687.
