What is it about?
Duplicates are multiple database entries that represent the same real-world object. Typical examples are persons who appear several times in a customer database and thus, for instance, receive multiple copies of the same advertisement. Duplicate detection systems try to clean such data by automatically finding and removing duplicates. Because this is an active research area, it is important to be able to evaluate such systems: do they succeed in finding all true duplicates? To measure this, you normally need to know the actual duplicates in advance. With our method, you can avoid this and instead exploit the wisdom of many such systems: if they all agree on some data, it is probably no longer necessary to check that data manually.
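The "wisdom of many systems" idea can be sketched roughly as follows: pairs that every system flags as duplicates, and pairs that no system flags, need no manual review; only the disputed pairs remain for human annotators. The function and data below are illustrative assumptions, not the paper's actual method.

```python
from itertools import combinations

def triage_pairs(records, system_results):
    """Split all record pairs into three groups based on how a set of
    duplicate detection systems voted.

    records: list of record identifiers.
    system_results: dict mapping a system name to the set of pairs
    it flags as duplicates (pairs in the same order as combinations()).
    Returns (agreed_duplicates, agreed_non_duplicates, disputed).
    """
    all_pairs = set(combinations(records, 2))
    flagged = list(system_results.values())
    # Pairs every system flags: confidently duplicates.
    agreed_dup = set.intersection(*flagged)
    # Pairs no system flags: confidently non-duplicates.
    agreed_non = all_pairs - set.union(*flagged)
    # Everything else still needs a manual check.
    disputed = all_pairs - agreed_dup - agreed_non
    return agreed_dup, agreed_non, disputed

# Hypothetical example: two systems, three records.
dup, non, open_pairs = triage_pairs(
    ["a", "b", "c"],
    {"system1": {("a", "b"), ("a", "c")},
     "system2": {("a", "b")}},
)
```

In this toy run, only the pair the systems disagree on (`("a", "c")`) would be sent to a human annotator.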
Why is it important?
Manually annotating duplicates is a difficult and tedious task. In principle, we must compare every record with every other record and decide whether they represent the same real-world object, e.g., the same customer. Performing this many comparisons costs time and money, and any savings here are welcome.
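A quick calculation shows why comparing every record with every other record is expensive: the number of pairs grows quadratically, as n choose 2 = n·(n−1)/2. This is a small illustrative sketch, not part of the original article.

```python
from math import comb

def pairwise_comparisons(n):
    """Number of record pairs to check when every record is compared
    with every other record: n choose 2 = n * (n - 1) / 2."""
    return comb(n, 2)

# Even a modest customer database implies a large annotation effort:
# 1,000 records already yield 499,500 pairs to inspect.
n_pairs = pairwise_comparisons(1000)
```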
Read the Original
This page is a summary of: Reach for gold, Journal of Data and Information Quality, September 2014, ACM (Association for Computing Machinery), DOI: 10.1145/2629687.