What is it about?

In distributed LLM training, model parallelization is inherently error-prone. TrainVerify provides formal correctness guarantees for parallelization by verifying that each parallelized model is algebraically equivalent to the semantics of its single-device model definition.

Why is it important?

TrainVerify eliminates silent, hard-to-diagnose parallelization bugs that degrade model quality. It verifies parallelization equivalence symbolically—using symbolic tensors and operators free from the numerical noise that hinders differential testing—and compares graph-level representations between original and parallelized models. Its stage decomposition and shape-reduction techniques enable scalable verification for the world’s largest models across thousands of GPUs.
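The core idea of equivalence checking can be illustrated with a toy example. The sketch below (not TrainVerify's actual implementation; it uses the third-party `sympy` library and made-up toy shapes) shows why symbolic tensors avoid numerical noise: a column-wise tensor-parallel split of a matrix multiply is proven exactly equal to its single-device counterpart by algebra, with no floating-point comparison involved.

```python
import sympy as sp

# Toy symbolic input X (2x2) and weight W (2x4). Every entry is a
# free symbol, so equality below is exact algebra, not arithmetic.
X = sp.Matrix(2, 2, lambda i, j: sp.Symbol(f"x{i}{j}"))
W = sp.Matrix(2, 4, lambda i, j: sp.Symbol(f"w{i}{j}"))

# Single-device reference computation: Y = X @ W.
Y_ref = X * W

# Tensor parallelism: split W column-wise across two "devices",
# compute partial products, then concatenate the results
# (standing in for an all-gather).
W0, W1 = W[:, :2], W[:, 2:]
Y_par = (X * W0).row_join(X * W1)

# Symbolic equivalence check: the difference simplifies to the
# zero matrix, so the parallelization preserves semantics exactly.
assert sp.simplify(Y_ref - Y_par) == sp.zeros(2, 4)
print("equivalent")
```

A differential test with concrete floating-point tensors could flag spurious mismatches from rounding or reduction order; the symbolic check sidesteps that entirely, which is the property the article highlights.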

Perspectives

We hope this article conveys that the correctness and reliability of ML training are both important and achievable. TrainVerify takes the first step towards introducing formal correctness for parallelization, and we hope it will inspire broader efforts to bring similar rigor to the entire ML infrastructure stack.

Yunchi Lu
University of Michigan

Read the Original

This page is a summary of: TrainVerify: Equivalence-Based Verification for Distributed LLM Training, October 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3731569.3764850.
You can read the full text via the DOI above.
