What is it about?

Understanding the inner workings of cloud-native applications is crucial, especially when things start failing. Meet MicroView, a clever system that helps us better understand what's happening inside these applications without putting too much stress on the cloud servers that run them. Observing microservice applications with traditional tools often strains the servers, which hurts how well the services run and degrades the user experience. This is increasingly true as we ask traditional tools to collect real-time monitoring data at fine temporal resolution: the cost of shipping telemetry data from many distributed machines to a centralized server for later analysis can quickly blow up. MicroView takes a different approach. The paper describes a new three-tier architecture in which a mini-observer on each machine processes data locally instead of sending all of it across the network. The key innovation is to team up with new hardware accelerators, such as the NVIDIA BlueField-3 SmartNIC, which can handle all of the local processing in place of the machine's CPU. This means less interference with what the applications are trying to do, while still delivering accurate insights. Using MicroView, we can follow every pulse of the system in real time and promptly detect potential application issues, with much less overhead than before. Think of it as a sharp radar that helps us navigate the complexity of cloud-native applications.
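
To make the trade-off concrete, here is a minimal sketch of the general idea behind a per-machine observer: instead of shipping every fine-grained sample to a central server, each machine reduces a window of samples to a compact summary locally. The names (WindowSummary, summarize_locally), the summary fields, and the window size are hypothetical illustrations; this is not MicroView's actual design, and it does not model the SmartNIC offload.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class WindowSummary:
    """Compact per-window record a local observer could forward (hypothetical format)."""
    start_index: int
    count: int
    mean: float
    maximum: float


def summarize_locally(samples: List[float], window: int) -> List[WindowSummary]:
    """Reduce raw fine-grained samples to one summary per fixed-size window."""
    summaries = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        summaries.append(WindowSummary(start, len(chunk), mean(chunk), max(chunk)))
    return summaries


# Example: 1,000 latency samples (ms) collected at fine granularity on one machine.
raw = [1.0 + (i % 10) * 0.1 for i in range(1000)]
summaries = summarize_locally(raw, window=100)

# Centralized approach: every raw sample crosses the network.
# Local approach: only one summary record per window does.
print(len(raw), "raw samples vs", len(summaries), "summaries")
# prints: 1000 raw samples vs 10 summaries
```

In this toy setup the reduction factor is simply the window size; a real system has to pick summaries that still preserve enough temporal detail to be useful.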


Why is it important?

Traditional tools tend to slow things down and sometimes miss important details. MicroView focuses on the key metrics that matter, making sure we get the right information at the right time. It's like upgrading from a blurry snapshot to a clear, high-definition picture. For enterprises that have migrated to the cloud, this means lower storage costs charged by the provider for monitoring the infrastructure, because MicroView can filter which pieces of monitoring information are relevant enough to collect and thereby reduce storage needs. For developers, who spend significant effort triaging and resolving bottlenecks in distributed applications, it means more precise and insightful information to start from. Pinpointing precisely when anomalous events occur also helps correlate data from different monitoring sources, such as metrics and distributed traces.
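
As an illustration of how local filtering and time-precise anomaly detection fit together, the sketch below keeps only the timestamps of samples that deviate strongly from the batch mean (a simple z-score filter) and then matches those timestamps against trace spans by time. The z-score rule, the threshold of 3, and the names Span, flag_anomalies, and correlate are hypothetical stand-ins chosen for clarity; they are not the detection method described in the paper.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Tuple


@dataclass
class Span:
    """Minimal stand-in for a distributed-tracing span (hypothetical fields)."""
    trace_id: str
    start_ms: int
    end_ms: int


def flag_anomalies(samples: List[Tuple[int, float]], threshold: float = 3.0) -> List[int]:
    """Return timestamps whose value lies more than `threshold` std devs from the mean.

    Only these timestamps would be kept; all other samples are dropped locally.
    """
    values = [v for _, v in samples]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [t for t, v in samples if abs(v - mu) / sigma > threshold]


def correlate(timestamps: List[int], spans: List[Span]) -> List[str]:
    """Return IDs of traces whose spans cover at least one flagged timestamp."""
    return sorted({s.trace_id for s in spans
                   for t in timestamps if s.start_ms <= t <= s.end_ms})


# One latency metric sampled every millisecond, with a single spike at t = 42.
samples = [(t, 100.0 if t == 42 else 10.0) for t in range(100)]
spans = [Span("a", 0, 20), Span("b", 40, 50), Span("c", 41, 43)]

flagged = flag_anomalies(samples)
related = correlate(flagged, spans)
print(flagged, related)  # prints: [42] ['b', 'c']
```

The finer the timestamp attached to a flagged event, the fewer unrelated spans overlap it, which is why temporal precision matters when correlating metrics with traces.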

Read the Original

This page is a summary of: MicroView: Cloud-Native Observability with Temporal Precision, December 2023, ACM (Association for Computing Machinery).
DOI: 10.1145/3630202.3630233.
