What is it about?
We introduce DataVinci, a fully unsupervised system for detecting and repairing errors in string data. DataVinci learns regular-expression-based patterns that cover a majority of the values in a column and reports values that do not satisfy these majority patterns as data errors. It then automatically derives edits that repair each detected error, guided by the majority patterns and using the row tuples associated with majority values as examples.
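The detection idea described above can be illustrated with a minimal sketch. This is not DataVinci's actual pattern learner: the `coarse_pattern` and `detect_errors` helpers, the `min_coverage` threshold, and the sample column are all hypothetical, and the generalization used here (digit runs, letter runs, literal punctuation) is far simpler than what a real system would learn. Repair is not sketched.

```python
import re
from collections import Counter

# Hypothetical tokenizer: ASCII digit runs, ASCII letter runs, or any single other character.
TOKEN = re.compile(r"(?P<digits>[0-9]+)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.DOTALL)

def coarse_pattern(value):
    """Generalize a string into a coarse regular expression."""
    parts = []
    for m in TOKEN.finditer(value):
        if m.lastgroup == "digits":
            parts.append("[0-9]+")
        elif m.lastgroup == "letters":
            parts.append("[A-Za-z]+")
        else:
            parts.append(re.escape(m.group()))  # keep punctuation as a literal
    return "".join(parts)

def detect_errors(column, min_coverage=0.5):
    """Return (majority_pattern, values that do not match it).

    If no single pattern covers more than min_coverage of the column,
    nothing is flagged (assumption: no majority means no evidence of error).
    """
    if not column:
        return None, []
    counts = Counter(coarse_pattern(v) for v in column)
    pattern, support = counts.most_common(1)[0]
    if support / len(column) <= min_coverage:
        return pattern, []
    regex = re.compile(pattern)
    return pattern, [v for v in column if not regex.fullmatch(v)]

# Made-up example column: four values share one shape, one deviates.
column = ["US-123", "UK-456", "DE-789", "FR_012", "IT-345"]
pattern, errors = detect_errors(column)
print(errors)
```

In this toy column, the value that uses an underscore instead of a hyphen is the only one that fails the majority pattern, so it is the only value reported.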
Why is it important?
String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text. Automatically cleaning such string data can have a significant impact on users. Previous approaches are limited to error detection, require the user to provide annotations, examples, or constraints to fix the errors, and address syntactic or semantic errors in isolation, ignoring that strings often contain both syntactic and semantic substrings.
Perspectives
Writing this article was very fulfilling as the problem of cleaning data in spreadsheets and analysis engines is one I personally resonate with. I hope this article makes automating data cleaning easier for people and motivates new architectures based on AI models for automated data tasks.
Mukul Singh
Microsoft Corp
Read the Original
This page is a summary of: DataVinci: Learning Syntactic and Semantic String Repairs, Proceedings of the ACM on Management of Data, February 2025, ACM (Association for Computing Machinery), DOI: 10.1145/3709677.