What is it about?

The volume of data generated by new scientific experiments and simulations is growing exponentially, and access to that data drives up both network bandwidth demand and the need for time-constrained data delivery, especially for geographically distributed collaborations. As the research community builds large, one-of-a-kind instruments, the data these instruments collect are converted into scientifically relevant datasets, which collaborations of scientists across the world then use to generate discoveries. During this process, popular datasets are delivered multiple times to different users who are all focused on the same problem. In many cases, the same dataset is delivered multiple times to the same user for various reasons.

While many solutions have been created to move data efficiently over long distances, scientists still spend significant time ensuring that the appropriate data is at the right location before the actual analysis can start. When co-located researchers use the same popular dataset, a separate copy of that data is moved to each researcher's private storage, even though all of that storage is allocated at the same compute location. Data sharing among geographically distributed users can instead be accommodated with some type of content-delivery network. In-network caching also gives a network provider the unique capability to design data hotspots into the network topology.

For this study, we collected data access measurements from the Southern California Petabyte Scale Cache, where client jobs requested data files for High-Luminosity Large Hadron Collider (HL-LHC) analysis. We studied how much data is shared, how much network traffic volume is consequently saved, and how much the in-network data cache improves application performance. Additionally, we analyzed the data access patterns of applications and the impact of individual data caching nodes on the regional data repository.
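As a rough illustration of the kind of measurement described above, the following Python sketch estimates how much wide-area traffic a cache could save from a log of file accesses. It is a minimal sketch under stated assumptions, not the method used in the study: the `Access` record, the `summarize_cache_savings` function, the sample file names, and the simplification that the cache never evicts a file are all hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class Access(NamedTuple):
    """One file request seen by the cache (hypothetical record layout)."""
    file_id: str
    size_bytes: int


def summarize_cache_savings(accesses: Iterable[Access]) -> Dict[str, float]:
    """Replay an access log against an idealized cache that never evicts.

    The first request for a file is a miss and is fetched over the
    wide-area network; every later request for it is a hit served locally.
    """
    cached = set()
    hit_bytes = 0
    miss_bytes = 0
    for access in accesses:
        if access.file_id in cached:
            hit_bytes += access.size_bytes   # served from the cache
        else:
            cached.add(access.file_id)
            miss_bytes += access.size_bytes  # fetched from the remote source
    total = hit_bytes + miss_bytes
    return {
        "requested_bytes": total,
        "wan_bytes": miss_bytes,
        "saved_bytes": hit_bytes,
        "saved_fraction": hit_bytes / total if total else 0.0,
    }


# Made-up sample log: one popular file requested three times, one file once.
log = [
    Access("a.root", 100),
    Access("b.root", 50),
    Access("a.root", 100),
    Access("a.root", 100),
]
summary = summarize_cache_savings(log)
print(summary)
```

In this made-up log, 350 bytes are requested but only 150 bytes cross the wide-area network, so the repeated requests for the popular file account for all of the savings. A real cache has finite capacity and an eviction policy, so measured savings would be lower than this idealized estimate.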

Why is it important?

Understanding data access patterns and characteristics gives us insight into how datasets can be delivered and shared, and into how the required compute, storage, and network resources can be allocated.

Perspectives

I hope this article helps people think of network bandwidth as a shared, limited resource, and shows how understanding data access patterns and characteristics enables more efficient policies for data delivery and resource sharing, especially for science use cases.

Alex Sim
Lawrence Berkeley National Laboratory

Read the Original

This page is a summary of: Analyzing Scientific Data Sharing Patterns for In-network Data Caching, June 2020, ACM (Association for Computing Machinery), DOI: 10.1145/3452411.3464441.