What is it about?

3D Scene Graph Generation (3DSGG) aims to model spatial and semantic relationships among objects for comprehensive scene understanding and reasoning. Existing methods, however, face two core challenges: (i) semantic–geometric misalignment across heterogeneous modalities (textual descriptions, for instance, often overlook curvature cues in point clouds), and (ii) long-tail distribution bias in relation prediction, which conflates distinct predicates that have only sparse training samples. To address these issues, we propose a novel 3DSGG framework that integrates textual, visual, and point-cloud data through three dedicated modules: (1) the Cross-Modal Consistency Enhancement (CMCE) module, which aligns RGB-D and point-cloud embeddings via cosine similarity and non-linear mappings; (2) the Relation Enhancement Generation Module (REGM), which rebalances tail relations using dynamic weighting and relation embeddings; and (3) the Generation Quality Optimization Module (GQOM), which refines graph precision and robustness with a quality discriminator and a structural-consistency loss. Extensive quantitative experiments and systematic ablations demonstrate the framework's superiority and robustness.
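To make the first of these ideas concrete, below is a minimal, illustrative sketch (not the authors' released code) of how a cross-modal consistency term in the spirit of CMCE could be set up in PyTorch: two small non-linear heads map RGB-D and point-cloud features into a shared space, and matched pairs are pulled together via cosine similarity. All names, layer sizes, and the exact loss form are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAlignmentLoss(nn.Module):
    """Illustrative cross-modal consistency term (hypothetical stand-in for CMCE):
    project each modality through a small non-linear head, then maximize the
    cosine similarity of matched RGB-D / point-cloud embedding pairs."""

    def __init__(self, rgbd_dim=512, point_dim=256, shared_dim=128):
        super().__init__()
        # Non-linear mappings into a shared embedding space (dimensions assumed).
        self.rgbd_head = nn.Sequential(
            nn.Linear(rgbd_dim, shared_dim), nn.ReLU(), nn.Linear(shared_dim, shared_dim)
        )
        self.point_head = nn.Sequential(
            nn.Linear(point_dim, shared_dim), nn.ReLU(), nn.Linear(shared_dim, shared_dim)
        )

    def forward(self, rgbd_feats, point_feats):
        # rgbd_feats: (N, rgbd_dim), point_feats: (N, point_dim); row i of each
        # tensor describes the same object in the scene.
        z_rgbd = F.normalize(self.rgbd_head(rgbd_feats), dim=-1)
        z_point = F.normalize(self.point_head(point_feats), dim=-1)
        cos_sim = (z_rgbd * z_point).sum(dim=-1)  # per-object cosine similarity
        return (1.0 - cos_sim).mean()             # lower loss = better-aligned modalities


if __name__ == "__main__":
    loss_fn = CrossModalAlignmentLoss()
    rgbd = torch.randn(8, 512)    # dummy per-object RGB-D features
    points = torch.randn(8, 256)  # dummy per-object point-cloud features
    print(loss_fn(rgbd, points).item())
```

In a full pipeline, a term like this would be combined with the relation-prediction objective, the dynamic tail-relation weighting of REGM, and the discriminator and structural-consistency losses of GQOM; the sketch only covers the alignment step.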


Read the Original

This page is a summary of: 3D Scene Graph Generation with Cross-Modal Alignment and Adversarial Learning, June 2025, ACM (Association for Computing Machinery). DOI: 10.1145/3731715.3733257.
