What is it about?
Deep Neural Networks (DNNs) have achieved remarkable success in many real-world applications. However, running a DNN typically requires a memory footprint of hundreds of megabytes, which makes deployment on resource-constrained platforms such as mobile and IoT devices challenging. Mainstream DNN compression techniques such as pruning, distillation, and quantization can reduce the memory overhead of model parameters during inference, but they suffer from three limitations: (i) a low compression ratio on lightweight DNN architectures that contain little redundancy; (ii) potential degradation of inference accuracy; and (iii) an inadequate memory compression ratio, because they ignore the layer-by-layer nature of DNN inference.

To address these issues, we propose Smart-DNN+, a lightweight, memory-efficient DNN inference framework that significantly reduces the memory cost of inference without degrading model quality. Specifically, (1) Smart-DNN+ applies a layer-wise binary quantizer with a remapping mechanism that greatly shrinks the model size by quantizing the standard 32-bit floating-point weights to 1-bit signs, layer by layer. (2) To maintain model quality, Smart-DNN+ employs a bucket encoder that keeps the quantization error in compressed form by mapping similar floating-point residuals to the same integer bucket ID. (3) When running the compressed DNN on the user's device, Smart-DNN+ uses a partial-decompression strategy that greatly reduces the required memory: it first loads the compressed model into memory and then dynamically decompresses only the materials needed for inference, layer by layer.
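To make the three steps above concrete, here is a minimal, illustrative sketch in Python/NumPy. It is not the authors' implementation: the function names (compress_layer, decompress_layer), the mean-absolute-value remapping scale, and the bucket_width parameter are all assumptions introduced here to show the general shape of sign quantization, bucketed residual encoding, and layer-by-layer decompression.

import numpy as np

def compress_layer(weights, bucket_width=1e-3):
    # Step 1: layer-wise 1-bit sign quantization with a simple remapping scale
    # (the mean absolute weight); the paper's actual remapping mechanism may differ.
    signs = np.signbit(weights)                       # True where the weight is negative
    scale = float(np.mean(np.abs(weights)))           # assumed per-layer scale
    approx = np.where(signs, -scale, scale).astype(np.float32)
    # Step 2: bucket-encode the quantization error so that similar floating-point
    # residuals share one small integer bucket ID.
    residual = weights - approx
    bucket_ids = np.round(residual / bucket_width).astype(np.int16)
    return {"signs": np.packbits(signs), "scale": scale,
            "bucket_ids": bucket_ids, "shape": weights.shape}

def decompress_layer(comp, bucket_width=1e-3):
    # Step 3: rebuild only this layer's float32 weights on demand, so the fully
    # uncompressed model never has to sit in memory at once.
    n = int(np.prod(comp["shape"]))
    signs = np.unpackbits(comp["signs"])[:n].astype(bool)
    approx = np.where(signs, -comp["scale"], comp["scale"]).astype(np.float32)
    residual = comp["bucket_ids"].reshape(-1).astype(np.float32) * bucket_width
    return (approx + residual).reshape(comp["shape"])

# Toy usage: compress two random "layers", then decompress them one at a time,
# as a layer-by-layer inference loop would.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 64)).astype(np.float32) * 0.05 for _ in range(2)]
compressed = [compress_layer(w) for w in layers]
for w, c in zip(layers, compressed):
    print("max reconstruction error:", float(np.max(np.abs(w - decompress_layer(c)))))

In this sketch the reconstruction error per weight is bounded by half the bucket width, which is the sense in which the bucket encoder preserves model quality while keeping the residuals in compact integer form.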
Why is it important?
Experimental results on popular DNNs and datasets demonstrate that Smart-DNN+ achieves 0.17%-0.92% lower memory costs at lower runtime overheads than state-of-the-art approaches, without degrading inference accuracy. Moreover, Smart-DNN+ can reduce inference runtime by up to 2.04x compared with the conventional DNN inference workflow.
Read the Original
This page is a summary of: Smart-DNN+: a Memory-Efficient Neural Networks Compression Framework for the Model Inference, ACM Transactions on Architecture and Code Optimization, August 2023, ACM (Association for Computing Machinery).
DOI: 10.1145/3617688.