What is it about?

Modern data centers use tiered memory (combining fast, expensive memory with slower, cheaper memory) to manage escalating costs and handle memory-intensive workloads. The system attempts to free up the expensive primary memory by finding infrequently used ("cold") data pages and moving them to the secondary, slower tier. Underlying all these efforts are benchmarks, which are essential for evaluating and reliably comparing tiered memory systems. However, our experience deploying tiered memory systems at Google revealed a critical problem: existing benchmarks often fail to capture the complexity, variability, and long-term memory access patterns observed in real production workloads, leading to design gaps. In this article, we systematically explore this divergence and propose a data-driven framework capable of generating representative benchmarks with high fidelity (accurate reproduction of tiering behavior) and wide coverage (representing the full range of production behaviors). Our end-to-end pipeline for synthesizing benchmarks relies on continuous, lightweight, workload-intrinsic instrumentation collected fleet-wide. We use statistical modeling to condense this telemetry into compact "embeddings" that serve as memory behavior fingerprints for each workload. Leveraging these embeddings, we developed an application-agnostic synthesis algorithm that automatically constructs benchmarks replicating the complex memory access patterns observed in production. Preliminary results demonstrate the success of this approach: our synthesized benchmarks achieved up to 5x higher fidelity than traditional internal and open-source benchmarks.
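
To make the "embedding" idea concrete, the sketch below shows one way per-page idle-age telemetry could be condensed into a compact fingerprint. It is a minimal, hypothetical illustration rather than the production pipeline: the function name, the quantile choices, the 120-second "cold" threshold, and the synthetic workloads are all assumptions made for the example.

```python
import numpy as np

def memory_fingerprint(page_idle_ages_s, quantiles=(0.1, 0.25, 0.5, 0.75, 0.9, 0.99)):
    """Condense per-page idle ages (seconds since last access) into a
    fixed-length embedding that acts as a memory-behavior fingerprint."""
    ages = np.asarray(page_idle_ages_s, dtype=float)
    # Log-transform so pages idle for hours or days do not dominate the scale.
    log_ages = np.log1p(ages)
    embedding = np.quantile(log_ages, quantiles)
    # Track the share of "cold" pages, here defined as idle for over 2 minutes.
    cold_share = float(np.mean(ages > 120.0))
    return np.append(embedding, cold_share)

# Two synthetic workloads: one that stays hot, one with a long cold tail.
rng = np.random.default_rng(0)
mostly_hot = rng.exponential(scale=5.0, size=100_000)
hot_plus_cold = np.concatenate([
    rng.exponential(scale=5.0, size=50_000),      # hot working set
    rng.exponential(scale=3_600.0, size=50_000),  # pages idle for roughly hours
])

print("hot-only fingerprint:", np.round(memory_fingerprint(mostly_hot), 2))
print("hot+cold fingerprint:", np.round(memory_fingerprint(hot_plus_cold), 2))
```

The intuition is that two workloads with similar fingerprints should exercise a tiering policy in similar ways, which is what makes such an embedding useful as a target for benchmark synthesis.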

Why is it important?

This work is crucial and timely because the increasing adoption of tiered memory systems in modern data centers requires reliable benchmarks for evaluation. However, existing benchmarks fail to reflect the complexity and variability of real-world production workloads, leading to design gaps. The uniqueness of this paper lies in its novel data-driven framework, which systematically addresses this structural issue using statistical models and fleet-wide telemetry. It is the first work to formalize memory behavior representativeness for warehouse-scale workloads, defining quantifiable metrics such as Fidelity (reproducing the behavior that triggers tier transitions) and Coverage (representing the range of production behaviors). A key innovation is the application-agnostic synthesis algorithm (DB-synth), which constructs benchmarks that precisely mimic real-world memory access patterns while remaining simple to run and easy to configure, enabling rapid and safe evaluation of new memory tiering policies. This will help the broader research community test and iterate quickly, driving faster innovation in tiered memory system design.
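
The paper defines Fidelity and Coverage precisely; purely as an illustration of how two such metrics could be scored, the sketch below compares benchmark and production fingerprints with a distance-based fidelity score and counts how many production workloads are matched by at least one benchmark. The scoring function, the 0.8 threshold, and the helper names are assumptions for the example, not the paper's definitions.

```python
import numpy as np

def fidelity(bench_fp, prod_fp):
    """Illustrative fidelity score in (0, 1]: closer fingerprints score higher."""
    bench = np.asarray(bench_fp, dtype=float)
    prod = np.asarray(prod_fp, dtype=float)
    # Distance between fingerprints, normalized by the production fingerprint.
    dist = np.linalg.norm(bench - prod) / (np.linalg.norm(prod) + 1e-9)
    return 1.0 / (1.0 + dist)

def coverage(benchmark_fps, production_fps, threshold=0.8):
    """Illustrative coverage: fraction of production workloads reproduced by
    at least one benchmark with fidelity at or above the threshold."""
    covered = sum(
        any(fidelity(bench, prod) >= threshold for bench in benchmark_fps)
        for prod in production_fps
    )
    return covered / len(production_fps)
```

A benchmark suite that scores well on both axes reproduces the tiering-relevant behavior of individual workloads faithfully while also spanning the spread of behaviors seen across the fleet.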

Perspectives

As an author affiliated with both the University of Washington and Google, I bring a twofold perspective to this work. The hypothesis that current benchmarks for tiered memory systems are inadequate and fail to capture the complexity and variability of production workloads was already well recognized among our colleagues at Google working on memory system design. Practitioners understood this gap through real-world deployment experience. From an academic standpoint, it was particularly rewarding to systematically study and rigorously demonstrate this inadequacy. Our data-driven pipeline, built on fleet-wide telemetry and statistical modeling, confirmed that existing benchmarks "run hot," missing the long-term and cold-access characteristics of real workloads. The most gratifying aspect of this research is the shift in benchmarking methodology it represents. By introducing a systematic pipeline that combines lightweight telemetry, statistical modeling, and compact embeddings, we provide a rigorous way to address long-standing questions: What does it mean for a benchmark to be representative, and how can we quantify it? Seeing our synthesized benchmarks consistently outperform traditional load tests reinforces that this data-driven approach is a viable and forward-looking path for benchmarking.

Rajath Shashidhara
University of Washington

Read the Original

This page is a summary of: Closing the Benchmark Gap for Tiered Memory, October 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3764862.3768177.
You can read the full text via the DOI above.
