What is it about?

With the increasing adoption of UAV platforms in areas such as public safety and smart cities, Aerial-Ground Person Re-Identification (AGPReID) has emerged as a crucial yet highly challenging task, attracting growing interest from the research community. While existing approaches leverage identity attributes and viewpoint-disentanglement strategies to improve cross-view matching, their heavy reliance on prior knowledge often limits model generalization. We propose a CLIP-based View-Consistent Alignment Framework (CVAF) trained in two stages.


Why is it important?

In the first stage, learnable text tokens are employed to represent identity-aware textual descriptions. To promote consistent alignment across varying viewpoints, we introduce a Text Consistency Loss (TCL) that regularizes the stability of text-token interactions with multi-view images. In the second stage, we present a Semantic Filtering Module (SFM) that jointly modulates image patch tokens along the spatial and channel dimensions.
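The idea behind the TCL can be illustrated with a minimal numerical sketch. This is a hypothetical reconstruction, not the paper's exact loss: here we assume the TCL penalizes how much the similarity between a learnable identity text embedding and the image features of the same person varies across viewpoints, so the text tokens interact stably with every view. All names (`text_consistency_loss`, `view_feats`) are illustrative.

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two feature vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def text_consistency_loss(text_emb, view_feats):
    # Hypothetical TCL sketch: compute the text-to-image similarity
    # for each viewpoint of the same identity, then penalize the
    # variance of those similarities. A perfectly view-consistent
    # text embedding yields zero loss.
    sims = np.array([cosine_sim(text_emb, f) for f in view_feats])
    return float(np.var(sims))

rng = np.random.default_rng(0)
text_emb = rng.standard_normal(512)            # learnable identity text embedding (toy)
views = [rng.standard_normal(512) for _ in range(3)]  # aerial + ground features (toy)
loss = text_consistency_loss(text_emb, views)
```

If all views produced identical features, the similarities would coincide and the loss would vanish, which is the consistency behavior the TCL is meant to encourage.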

Perspectives

A text-guided cross-attention mechanism generates spatial attention maps to explicitly emphasize identity-relevant regions, while semantic matching between textual features and visual tokens enables adaptive reweighting of image representations, effectively suppressing background clutter and view-specific noise.
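The two SFM operations described above can be sketched as follows. This is a simplified, assumed implementation (function and variable names are illustrative, and the channel gate here is a plain sigmoid over text-visual agreement rather than the paper's exact semantic-matching design): a text-guided attention map reweights patch tokens spatially, and a text-derived gate reweights their channels.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_filter(text_feat, patch_tokens):
    # Hypothetical SFM sketch.
    # 1) Spatial: cross-attention of the text feature over the N patch
    #    tokens gives an attention map that emphasizes identity regions.
    attn = softmax(patch_tokens @ text_feat / np.sqrt(text_feat.size))  # (N,)
    spatially_filtered = attn[:, None] * patch_tokens                   # (N, D)
    # 2) Channel: a sigmoid gate from elementwise text-visual agreement
    #    reweights channels, suppressing background and view-specific noise.
    gate = 1.0 / (1.0 + np.exp(-(spatially_filtered.mean(axis=0) * text_feat)))  # (D,)
    return spatially_filtered * gate[None, :], attn

rng = np.random.default_rng(1)
text_feat = rng.standard_normal(64)       # text feature (toy)
patches = rng.standard_normal((16, 64))   # 16 image patch tokens (toy)
filtered, attn = semantic_filter(text_feat, patches)
```

The attention map sums to one over the patches, so the spatial step redistributes emphasis rather than rescaling the whole feature, while the channel gate acts independently per feature dimension.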

Shangzhi Teng
Beijing Information Science and Technology University

Read the Original

This page is a summary of: CVAF: A CLIP-Based View-Consistent Alignment Framework for Aerial-Ground Person Re-Identification, ACM Transactions on Multimedia Computing, Communications, and Applications, December 2025, ACM (Association for Computing Machinery).
DOI: 10.1145/3785482.
