What is it about?
With the increasing adoption of UAV platforms in areas such as public safety and smart cities, Aerial-Ground Person Re-Identification (AGPReID) has emerged as a crucial yet highly challenging task, garnering growing interest from the research community. While existing approaches have leveraged identity attributes and viewpoint disentanglement strategies to improve cross-view matching, their heavy reliance on prior knowledge often compromises model generalization. We propose a CLIP-based View-Consistent Alignment Framework (CVAF) with two training stages.
Featured Image
Photo by Ryoji Iwata on Unsplash
Why is it important?
In the first stage, learnable text tokens are employed to represent identity-aware textual descriptions. To promote consistent alignment across varying viewpoints, we introduce a Text Consistency Loss (TCL) that regularizes the stability of text-token interactions with multi-view images. In the second stage, we present a Semantic Filtering Module (SFM) that jointly modulates image patch tokens along spatial and channel dimensions.
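Since this summary only outlines the method, the following PyTorch sketch shows one plausible reading of the Text Consistency Loss, assuming CLIP-style encoders and CoOp-style learnable prompts; the function name, shapes, and the variance-based penalty are illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch of a Text Consistency Loss (TCL); the paper's exact
# formulation may differ. Assumes a CLIP image encoder has already produced
# features for several views (aerial and ground) of one identity, and a
# learnable text prompt has produced that identity's text feature.
import torch
import torch.nn.functional as F

def text_consistency_loss(image_feats: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """image_feats: (V, D) features of V views of one identity.
    text_feat: (D,) feature of that identity's learnable text prompt.
    Penalizes the variance of text-image similarity across views, so the
    prompt aligns equally well with every viewpoint."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sims = image_feats @ text_feat           # (V,) cosine similarity per view
    return sims.var(unbiased=False)          # low variance = view-consistent
```

Under this reading, a low loss means the learnable prompt is similarly close to aerial and ground shots of the same person, which is what view-consistent alignment asks for.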
Perspectives
A text-guided cross-attention mechanism generates spatial attention maps to explicitly emphasize identity-relevant regions, while semantic matching between textual features and visual tokens enables adaptive reweighting of image representations, effectively suppressing background clutter and view-specific noise.
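To make the two modulation paths concrete, here is a minimal PyTorch sketch of a module matching this description; the class name, projection layers, and gating choices are assumptions for illustration, not the authors' released code.

```python
# Hypothetical sketch of the Semantic Filtering Module (SFM) described above.
# Patch tokens are modulated along the spatial axis (text-guided cross-
# attention) and the channel axis (a gate from text-visual semantic matching).
import torch
import torch.nn as nn

class SemanticFilteringModule(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # projects the text feature to a query
        self.k = nn.Linear(dim, dim)  # projects patch tokens to keys
        self.channel_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, patches: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) image patch tokens; text: (B, D) text feature.
        q = self.q(text).unsqueeze(1)                      # (B, 1, D)
        k = self.k(patches)                                # (B, N, D)
        attn = (q @ k.transpose(1, 2)) / k.shape[-1] ** 0.5  # (B, 1, N)
        spatial = attn.softmax(dim=-1).transpose(1, 2)     # (B, N, 1)
        # Spatial modulation: emphasize identity-relevant patches.
        patches = patches * (1.0 + spatial)
        # Channel modulation: gate channels by how well they match the text.
        gate = self.channel_gate(text).unsqueeze(1)        # (B, 1, D)
        return patches * gate
```

Where exactly the SFM sits inside the CLIP image encoder is not stated in this summary; a natural placement would be on the final layer's patch tokens, before pooling.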
Shangzhi Teng
Beijing Information Science and Technology University
Read the Original
This page is a summary of: CVAF: A CLIP-Based View-Consistent Alignment Framework for Aerial-Ground Person Re-Identification, ACM Transactions on Multimedia Computing, Communications, and Applications, December 2025, ACM (Association for Computing Machinery).
DOI: 10.1145/3785482.