What is it about?

The volume of data continues to grow explosively, creating an urgent need to use storage resources more efficiently in big data management. Distributed data deduplication addresses this data deluge efficiently: a data routing scheme first assigns identical or similar data from clients to the same node in a distributed storage system, and each storage node then performs independent intra-node redundancy suppression. Distributed data deduplication also involves several key techniques, including data partitioning, chunk fingerprinting, data routing, index lookup, data restoring, garbage collection, and the security and reliability of deduplicated data.
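To make the pipeline concrete, below is a minimal sketch in Python of the general approach (an illustration only, not any specific system from the survey): a file is partitioned into fixed-size chunks, each chunk is fingerprinted with SHA-256, and a stateless routing function maps each fingerprint to a node, so identical chunks from any client always reach the same node. The 4 KB chunk size and the modulo-based routing rule are simplifying assumptions; real systems often use content-defined chunking and similarity-aware routing.

    import hashlib

    def chunk_stream(data: bytes, chunk_size: int = 4096):
        # Data partitioning: split the stream into fixed-size chunks.
        # (Real systems often use content-defined chunking instead.)
        for off in range(0, len(data), chunk_size):
            yield data[off:off + chunk_size]

    def fingerprint(chunk: bytes) -> str:
        # Chunk fingerprinting: a collision-resistant hash of the content.
        return hashlib.sha256(chunk).hexdigest()

    def route(fp: str, num_nodes: int) -> int:
        # Stateless data routing: identical chunks always map to the same
        # node, so each node can deduplicate independently afterwards.
        return int(fp, 16) % num_nodes

    def route_file(data: bytes, num_nodes: int) -> dict:
        # Group chunk fingerprints by destination node.
        assignment = {}
        for chunk in chunk_stream(data):
            fp = fingerprint(chunk)
            assignment.setdefault(route(fp, num_nodes), []).append(fp)
        return assignment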


Why is it important?

Unlike single-node centralized deduplication, which is limited in throughput and capacity, distributed data deduplication has become a popular technology in big data management for saving storage space, enhancing I/O performance, and improving system scalability. It not only identifies redundancy across different files in a distributed storage system, significantly improving storage efficiency with a high data reduction ratio, but also achieves high I/O throughput for massive datasets by performing fingerprint-based index lookups in parallel at the chunk level. In this paper, we first describe the background of big data deduplication. We then summarize and classify the state of the art in the key techniques of distributed data deduplication, including data partitioning, chunk fingerprinting, data routing, index lookup, data restoring, garbage collection, and the security and reliability of deduplicated data. This classification helps readers identify and understand how existing distributed data deduplication methods are implemented. We also present representative industrial products that have successfully applied distributed data deduplication technologies. Finally, we discuss the main challenges and industry trends of distributed data deduplication, and outline open problems and future research directions.
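As a toy illustration of the intra-node side (again a sketch under simplifying assumptions, not the survey's design), each node can keep an index from chunk fingerprint to stored location: a chunk whose fingerprint is already indexed is replaced by a reference to the existing copy, and the sequence of references forms the file "recipe" used later for data restoring. A production index would be a disk-backed key-value store with a fingerprint cache rather than an in-memory dictionary.

    import hashlib

    class DedupIndex:
        # Minimal intra-node fingerprint index for one storage node.
        def __init__(self):
            self.index = {}   # fingerprint -> position of the stored chunk
            self.store = []   # unique chunks only, in arrival order

        def put(self, chunk: bytes) -> int:
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.index:      # index lookup misses: store once
                self.store.append(chunk)
                self.index[fp] = len(self.store) - 1
            return self.index[fp]         # duplicates reuse the reference

    node = DedupIndex()
    # The returned references form a file "recipe" used for restoring.
    recipe = [node.put(c) for c in (b"aaa", b"bbb", b"aaa")]
    print(recipe, len(node.store))        # [0, 1, 0] 2 -- b"aaa" stored once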

Perspectives

Distributed data deduplication enables users to handle the data deluge efficiently because it can satisfy the scalable capacity and performance requirements of big data management, whereas traditional centralized data deduplication suffers from limited system scalability due to the resource-intensive nature of deduplication tasks.

Yinjin Fu
Sun Yat-Sen University

Read the Original

This page is a summary of: Distributed Data Deduplication for Big Data: A Survey, ACM Computing Surveys, May 2025, ACM (Association for Computing Machinery), DOI: 10.1145/3735508.
You can read the full text via the DOI above.
