Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching

Maximilian Knespel; Holger Brunst

doi:10.1145/3588195.3592992

What is it about?

Gzip is a tool that compresses large files into smaller ones to save space. The algorithm used by gzip is also used to create zip files. Compression has to be reversed whenever the original data is needed. Both compression and decompression are slow compared to the speeds of currently available solid-state drives. We made decompression faster than with any other existing program. Benchmarks of our program, "rapidgzip", on a processor with 128 cores show that it is 55 times faster than GNU Gzip, reaching a decompression bandwidth of 8.7 GB/s in this case.

Compression can easily be made faster in three steps:
1. Break up the file into smaller parts.
2. Compress the parts independently and in parallel using modern multi-core processors.
3. Join all compressed parts into one result file.
Modern processors have dozens of cores, which can make this process a dozen times faster.

Decompression can be sped up in a similar way, but it comes with difficulties of its own. For compression, the original file can be broken up at any point without issue. For decompression, the breakpoints cannot be chosen at will: starting to decompress from an unsuitable point leads to an error. The gzip decompression program "pugz" searches for valid points by assuming that the original data contains DNA sequence data, and it returns an error for any other data.

We show that our program "rapidgzip" can find valid breakpoints without making any assumptions about the original data. Rapidgzip works with any gzip file and is even faster than pugz. This is achieved with a self-stabilizing architecture: the small parts are decompressed in parallel, but if anything goes wrong, rapidgzip can fall back to normal non-parallel decompression to recover.
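The three compression steps described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not rapidgzip's code and not the method of any particular tool: the function name `compress_parallel`, the chunk size, and the worker count are assumptions chosen for the example. It relies on the fact that several gzip members written one after another still form a valid gzip file, so a standard decompressor can read the joined result.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def compress_parallel(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Compress `data` as a sequence of independent gzip members (hypothetical helper)."""
    # Step 1: break the input into smaller parts. Any split point is fine for compression.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]

    # Step 2: compress the parts independently. Threads are used here on the
    # assumption that the underlying zlib calls release the interpreter lock.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        members = list(pool.map(gzip.compress, chunks))

    # Step 3: join the compressed parts into one result file.
    return b"".join(members)


if __name__ == "__main__":
    original = b"ACGT" * 1_000_000
    compressed = compress_parallel(original)
    # A standard, non-parallel decompressor reads all members in sequence.
    assert gzip.decompress(compressed) == original
```

Because each part is compressed without knowledge of the others, the result is usually slightly larger than a single-stream gzip file. The reverse direction is the hard part the summary describes: an ordinary gzip file carries no markers telling a decompressor where it may safely start in the middle.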


Why is it important?

Gzip and the underlying deflate algorithm are used everywhere: for HTTP compression, zip archives, .tar.gz archives, and much more. Data is often stored in compressed form and has to be decompressed before it can be analyzed. If the analysis itself is fast, this decompression step can become the bottleneck. Our program, rapidgzip, removes this bottleneck by speeding up decompression.

This page is a summary of: Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching, August 2023, ACM (Association for Computing Machinery), DOI: 10.1145/3588195.3592992.

Resources

Project
rapidgzip on GitHub
The source code for rapidgzip. It can also be installed with "pip install rapidgzip".

Contributors

The following have contributed to this page

Maximilian Knespel
Technische Universität Dresden

Speeding up gzip decompression by using all processor cores
