Fine-grained and Non-intrusive LLM Training Monitoring via Microsecond-level Traffic Measurement

Yibo Xiao; Hao Zheng; Haifeng Sun; Qingkai Meng; Jiong Duan; Xiaohe Hu; Rong Gu; Guihai Chen; Chen Tian

doi:10.1145/3779212.3790163

What is it about?

Training large AI models requires hundreds of computers to work in tight synchronization. When even one machine slows down, it can create a network “traffic jam” that delays the entire training process. Traditional monitoring tools, however, operate at the millisecond scale, which is too coarse to reveal the real cause. Pulse acts like an ultra-high-speed camera for data center networks. By measuring traffic directly on smart network cards at microsecond resolution, Pulse exposes fine-grained communication behavior and pinpoints problematic machines, all without requiring any changes to complex AI training software.
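To make the resolution argument concrete, here is a minimal sketch of why a per-millisecond counter can hide a stall that microsecond timestamps reveal. It is only an illustration and not Pulse's actual method: the function names (`bytes_per_ms`, `find_gaps`), the two synthetic traces, the 4096-byte packet size, and the 50-microsecond threshold are all hypothetical choices made for this example.

```python
from collections import Counter

def bytes_per_ms(timestamps_us, packet_bytes):
    """Coarse view: total bytes sent in each 1 ms bucket."""
    totals = Counter()
    for t in timestamps_us:
        totals[t // 1000] += packet_bytes
    return dict(totals)

def find_gaps(timestamps_us, threshold_us):
    """Fine view: (start, end) pairs where consecutive packets
    are more than threshold_us microseconds apart."""
    return [
        (a, b)
        for a, b in zip(timestamps_us, timestamps_us[1:])
        if b - a > threshold_us
    ]

# Hypothetical packet send times, in microseconds.
# steady: one packet every 10 us for a full millisecond.
steady = list(range(0, 1000, 10))
# stalled: same pace for 400 us, then a stall, then a faster
# catch-up burst (one packet every 5 us) from 700 us onward.
stalled = list(range(0, 400, 10)) + list(range(700, 1000, 5))

# Both traces send the same number of bytes in the same millisecond,
# so a millisecond-scale counter cannot tell them apart.
coarse_steady = bytes_per_ms(steady, 4096)
coarse_stalled = bytes_per_ms(stalled, 4096)

# Microsecond timestamps expose the stall in the second trace.
gaps_steady = find_gaps(steady, 50)
gaps_stalled = find_gaps(stalled, 50)
```

In this toy setup the two traces are indistinguishable in the coarse view, while the fine view flags a single gap only in the stalled trace. A real deployment would also need to decide which gaps are expected (for example, compute phases between communication rounds) and which indicate a problem; that logic is outside this sketch.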


Why is it important?

As AI models grow larger, training failures or “straggler” machines can waste enormous amounts of expensive GPU resources. Identifying the faulty machine is difficult because even a tiny delay can quickly cascade across the tightly synchronized training network. Pulse addresses this challenge by providing unprecedented fine-grained visibility while remaining non-intrusive. By operating directly on network hardware, it can be deployed by cloud providers without requiring users to modify their proprietary training code. By capturing microsecond-level transmission gaps that traditional tools miss, Pulse enables faster diagnosis of failures, reduces wasted computation, and improves the reliability of large-scale AI infrastructure.

This page is a summary of: Fine-grained and Non-intrusive LLM Training Monitoring via Microsecond-level Traffic Measurement, March 2026, ACM (Association for Computing Machinery), DOI: 10.1145/3779212.3790163.

Contributors

The following have contributed to this page

Yibo Xiao
Nanjing University

