What is it about?

Gzip is a tool that compresses large files into smaller ones in order to save space. The algorithm used by gzip, deflate, is also used to create zip files. This compression has to be reversed whenever the original data is needed. Both compression and decompression are slow compared to the speeds of currently available solid-state drives. We made decompression faster than with any other existing program. Benchmarks of our program, "rapidgzip", on a processor with 128 cores show that it is 55 times faster than GNU Gzip, reaching a decompression bandwidth of 8.7 GB/s in this case.

Compression can easily be made faster in three steps:
1. Break up the file into smaller parts.
2. Compress the parts independently and in parallel on a modern multi-core processor.
3. Join all compressed parts into one result file.
Modern processors have dozens of cores, which can make this process a dozen times faster.

Decompression can be made faster in a similar way, but it comes with difficulties of its own. For compression, the original file can be broken up at any point without issue. For decompression, the breakpoints cannot be chosen at will: starting to decompress from an unsuitable point leads to an error.

The gzip decompression program "pugz" searches for valid breakpoints by assuming that the original data contains DNA sequence data. It returns an error for any other data. We show that our program, rapidgzip, can find valid breakpoints without making any assumptions about the original data. Rapidgzip works with any gzip file and is even faster than pugz. This is achieved with a self-stabilizing architecture: the small parts are decompressed in parallel, but if anything goes wrong, rapidgzip can fall back to normal non-parallel decompression to recover.
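
The three-step parallel compression scheme described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not how rapidgzip or any particular compressor is implemented; the function name parallel_gzip and the chunk_size and workers parameters are hypothetical. It relies on the fact that several gzip streams joined back to back still form a valid gzip file.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def parallel_gzip(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Compress data as a sequence of independent gzip streams.

    Hypothetical helper for illustration only.
    """
    # Step 1: break the input into smaller parts (any split point works).
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]  # still emit one valid (empty) gzip stream

    # Step 2: compress the parts independently and in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        members = list(pool.map(gzip.compress, chunks))

    # Step 3: join all compressed parts into one result.
    return b"".join(members)


if __name__ == "__main__":
    original = b"ACGT" * 100_000
    compressed = parallel_gzip(original, chunk_size=50_000)
    # A standard gzip decompressor reads the joined streams as one file.
    assert gzip.decompress(compressed) == original
    print(len(original), "->", len(compressed), "bytes")
```

Splitting the input this way usually costs a little compression ratio, because each part is compressed without knowledge of the others. The reverse direction is the hard part discussed above: an arbitrary gzip file gives no such convenient breakpoints, so the decompressor has to find them itself.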

Why is it important?

Gzip and the deflate algorithm behind it are used everywhere: for HTTP compression, for zip archives, for .tar.gz archives, and much more. Data is often stored in compressed form and needs to be decompressed before it can be analyzed. This decompression step can become the bottleneck if the analysis itself is fast. Our program, rapidgzip, can remove this bottleneck by speeding up the decompression step.

Read the Original

This page is a summary of: Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching, August 2023, ACM (Association for Computing Machinery),
DOI: 10.1145/3588195.3592992.