What is it about?

Duplicates are multiple database entries that represent the same real-world object. A typical example is a person who appears several times in a customer database and thus, for instance, receives multiple copies of the same advertisement. Duplicate detection systems try to clean data by automatically finding and removing such duplicates. Because duplicate detection is an active research area, it is important to be able to evaluate such systems: do they succeed in finding all true duplicates? To measure this, you normally need to know the actual duplicates in advance. With our method, you can avoid this and instead make use of the wisdom of many such systems: if they all agree on some data, it is probably no longer necessary to check that data manually.
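To make the idea concrete, here is a minimal sketch in Python of such a consensus step. It is not the paper's actual algorithm: the function names, the unanimity rule, and the representation of each system as a pairwise classifier are illustrative assumptions.

```python
from itertools import combinations

def triage_pairs(records, systems):
    """Partition record pairs by how many detection systems call them duplicates.

    records: list of record ids
    systems: list of functions, each mapping a pair (a, b) to True if it
             judges the two records to be duplicates (illustrative interface)
    Returns (accepted, rejected, disputed): pairs every system marks as
    duplicates, pairs no system marks, and pairs that need manual review.
    """
    accepted, rejected, disputed = [], [], []
    for a, b in combinations(records, 2):
        votes = sum(system(a, b) for system in systems)
        if votes == len(systems):      # unanimous "duplicate"
            accepted.append((a, b))
        elif votes == 0:               # unanimous "not a duplicate"
            rejected.append((a, b))
        else:                          # systems disagree: ask a human
            disputed.append((a, b))
    return accepted, rejected, disputed
```

Only the disputed pairs then require manual annotation; the more the systems agree, the smaller that remainder becomes.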

Why is it important?

Manually annotating duplicates is a difficult and tedious task. In principle, we must compare every record with every other record and decide whether the two might represent the same real-world object, e.g., the same customer. For n records this means roughly n²/2 comparisons, so the effort grows quadratically with the size of the database. Performing this many comparisons costs time and money, and any savings here are welcome.
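As a back-of-the-envelope check (a hypothetical helper, not code from the paper), the exact pair count is n·(n−1)/2:

```python
def num_comparisons(n: int) -> int:
    """Number of record pairs in a naive all-pairs duplicate scan."""
    return n * (n - 1) // 2

print(num_comparisons(1_000))      # 499500
print(num_comparisons(1_000_000))  # 499999500000, roughly 500 billion pairs
```

A database of one million customers thus already implies about 500 billion comparisons, which is why avoiding manual annotation matters so much.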

Perspectives

A particularly fun part of the project was working with 35 student teams in three-day workshops, all tackling the same dataset and each coming up with a different duplicate detection technique. Their work was the basis for our own evaluation.

Felix Naumann

Read the Original

This page is a summary of: Reach for gold, Journal of Data and Information Quality, September 2014, ACM (Association for Computing Machinery). DOI: 10.1145/2629687.
