What is it about?

Tracking players in a full-match soccer video sounds simple, but it is not. Players are tiny in wide-angle views, they often block each other, and teammates wear almost identical uniforms. Our study presents a practical way to follow every player, frame by frame, using a single fixed camera. The focus is reliability in real match conditions without close-ups or extra sensors. By keeping player identities stable over time, the system produces clean movement trails for all 22 players. These trails enable richer analysis: how teams press and defend, who creates space, where transitions begin, which matchups matter, and how tactics change minute by minute. In short, we make long and complex matches easier to read so coaches, analysts, broadcasters, and fans can see the game’s structure as clearly as its highlights.


Why is it important?

First, the study tests two complementary appearance embeddings at scale for soccer, language-aligned CLIP and part-aware PRT, and shows when each helps. A key finding is that CLIP can outperform part-based methods when bounding boxes are very small, which challenges common assumptions in person re-identification (ReID). Second, it introduces a practical global tracklet association (GTA) step that reconnects track fragments across long gaps using appearance and motion continuity, without requiring pitch geometry. This improves long-term ID consistency, which any full-match analysis of movement depends on. Third, it quantifies the impact of detector resolution and shows that high-resolution detection is decisive when players appear very small in the frame. Together, these results offer concrete design guidance for real-world deployments: keep detector resolution high, prefer CLIP for small scales, fuse part-based features only when visibility is reliable, and add global reconnection to stabilize identities over a full match.
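To make the idea of global reconnection concrete, here is a minimal Python sketch of linking track fragments by appearance and motion continuity. It is not the paper's GTA algorithm: the greedy linking strategy, the `Tracklet` fields, and every threshold (`max_gap`, `max_app_dist`, `max_speed`) are assumptions chosen for illustration. Motion is checked in image pixels only, so no pitch geometry is needed.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class Tracklet:
    """Hypothetical fragment summary; field names are illustrative."""
    tid: int
    start: int            # first frame index
    end: int              # last frame index
    first_xy: tuple       # image position at the first frame
    last_xy: tuple        # image position at the last frame
    emb: np.ndarray       # mean appearance embedding of the fragment


def cosine_distance(a, b):
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)


def associate(tracklets, max_gap=300, max_app_dist=0.3, max_speed=15.0):
    """Greedily chain fragments; returns {tracklet id: global id}.

    All thresholds are assumed values, not taken from the paper.
    """
    chains = []  # each chain is a time-ordered list of tracklets
    for t in sorted(tracklets, key=lambda x: x.start):
        best, best_cost = None, max_app_dist
        for chain in chains:
            tail = chain[-1]
            gap = t.start - tail.end
            # Fragments that overlap in time cannot be the same player.
            if gap <= 0 or gap > max_gap:
                continue
            # Motion continuity: implied speed in pixels per frame.
            dx = t.first_xy[0] - tail.last_xy[0]
            dy = t.first_xy[1] - tail.last_xy[1]
            if np.hypot(dx, dy) / gap > max_speed:
                continue
            # Appearance continuity: keep the closest admissible chain.
            cost = cosine_distance(tail.emb, t.emb)
            if cost <= best_cost:
                best, best_cost = chain, cost
        if best is None:
            chains.append([t])
        else:
            best.append(t)
    return {t.tid: gid for gid, chain in enumerate(chains) for t in chain}


a = Tracklet(1, 0, 100, (480, 290), (500, 300), np.array([1.0, 0.0]))
b = Tracklet(2, 0, 120, (90, 95), (100, 100), np.array([0.0, 1.0]))
c = Tracklet(3, 150, 200, (520, 310), (560, 330), np.array([0.9, 0.1]))
ids = associate([a, b, c])
```

In this toy input, fragment `c` starts 50 frames after `a` ends, close to where `a` stopped and with a similar embedding, so the two share a global ID, while `b` stays separate. A production system would more likely solve the linking jointly (for example with an assignment solver) instead of greedily.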

Perspectives

I set out to build a tracker that remains reliable even in fixed panoramic soccer videos, which are particularly challenging. Conventional multi-target tracking leans on appearance cues such as clothing, but in soccer, teammates wear nearly identical uniforms, so naive methods tend to misassociate players. That’s why I combined two appearance extractors, CLIP-ReID and PRT-ReID. Surprisingly, my experiments showed that CLIP-Only performed best: at very small scales, CLIP’s global semantics were more effective than part-based cues. Practically, this suggests using high-resolution detection with CLIP as the foundation, and activating PRT only when players are large enough and parts are stable. Next, I plan to inject pitch-geometry constraints and team priors into GTA and push toward real-time operation. Ultimately, I want the system to support coaching questions directly, such as who created space and what movement patterns characterize each player.
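As a rough illustration of the gating idea above (CLIP as the foundation, part-based cues only for large, clearly visible players), the following Python sketch switches between a CLIP-only distance and a weighted fusion. The fusion rule itself is not described in this summary, so the function name, the `part_visibility` score, and the values of `min_height`, `min_visibility`, and `prt_weight` are all assumptions.

```python
import numpy as np


def cosine_distance(a, b):
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)


def appearance_distance(clip_a, clip_b, prt_a, prt_b,
                        box_height, part_visibility,
                        min_height=64, min_visibility=0.7, prt_weight=0.5):
    """Scale-gated appearance distance (illustrative, assumed thresholds).

    box_height      : height of the smaller of the two boxes, in pixels
    part_visibility : hypothetical score in [0, 1] for how reliably body
                      parts are visible in both crops
    """
    d_clip = cosine_distance(clip_a, clip_b)
    # Small or occluded players: rely on the global CLIP embedding only.
    if box_height < min_height or part_visibility < min_visibility:
        return d_clip
    # Large, clearly visible players: blend in the part-based distance.
    d_prt = cosine_distance(prt_a, prt_b)
    return (1.0 - prt_weight) * d_clip + prt_weight * d_prt


clip_a, clip_b = np.array([1.0, 0.0]), np.array([1.0, 0.0])
prt_a, prt_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

small = appearance_distance(clip_a, clip_b, prt_a, prt_b,
                            box_height=20, part_visibility=0.9)
large = appearance_distance(clip_a, clip_b, prt_a, prt_b,
                            box_height=100, part_visibility=0.9)
```

With identical CLIP embeddings and orthogonal part embeddings, the small-box case ignores the part disagreement entirely, while the large-box case lets it raise the distance. Suitable thresholds would have to be tuned on real footage.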

Yuki Nakamura
University of Tsukuba

Read the Original

This page is a summary of: Enhancing Soccer Player Tracking and Re-Identification with Dual Visual Embeddings and Tracklet Association, October 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3728423.3759415.
