What is it about?

Modern HPC systems are designed around a few important and demanding applications, but most of the time the applications that actually run do not use all, or even most, of the available resources. In addition, resources are allocated in units of statically configured nodes. The result is vast underutilization of expensive resources. In this paper, we study and quantify this effect. We then use several approaches to estimate how many fewer resources an HPC system similar to Cori would need to deploy if resources within a rack could be allocated in a fine-grained manner.

Why is it important?

In the future, HPC systems will become larger and more heterogeneous. This will intensify the resource underutilization problem, and unless we take action, it may result in overly expensive systems that deliver only a fraction of their capacity. Allocating resources in a fine-grained manner (resource disaggregation) is a potential solution, but until now there has been no concrete, data-driven study showing what range of disaggregation is appropriate.

Perspectives

This study does not rely on models or simulation; instead, it measures the actual usage of a top production, open-science HPC system over a period of three weeks. This lets us examine real behavior rather than predictions. The study's conclusions are powerful because they quantify the expected benefit of intra-rack resource disaggregation, and we therefore think it makes a convincing case for an open-science system. Finally, the study also presents some new or uncommon metrics that we used to extract insight from real system data.

George Michelogiannakis
Lawrence Berkeley National Laboratory

Read the Original

This page is a summary of: A Case For Intra-rack Resource Disaggregation in HPC, ACM Transactions on Architecture and Code Optimization, June 2022, ACM (Association for Computing Machinery), DOI: 10.1145/3514245.