Hybrid Reciprocal Transformer with Triplet Feature Alignment for Scene Graph Generation

Jiawei Fu, Tiantian Zhang, Kai Chen, Qi Dou
The Chinese University of Hong Kong
CVPR 2025
Project Teaser

Complex scenarios often involve overlapping visual relationships, where a single object can assume multiple roles across different triplets. Identifying these multi-role objects is challenging without knowing their aligned triplet features. Moreover, the aligned triplet (top) and component (bottom) representations correspond to the same visual relation tuple, offering an opportunity to leverage their inherent relevance for reciprocal refinement.

Abstract

Scene graph generation is a pivotal task in computer vision that aims to identify all visual relation tuples within an image. Triplet-based methods have sought to improve performance by integrating triplets as contextual features for more precise predicate identification at the component level. However, challenges remain due to interference from multi-role objects in overlapping tuples within complex scenes, which impairs a model's ability to distinguish and align the specific triplet features needed to reason about the diverse semantics of multi-role objects. To address these issues, we introduce a novel framework that incorporates a triplet alignment model into a hybrid reciprocal transformer architecture, starting with triplet mask features that guide the learning of component-level relation graphs. To effectively distinguish multi-role objects characterized by overlapping visual relation tuples, we introduce a triplet alignment loss, which provides multi-role objects with aligned triplet features and helps customize their representations. Additionally, we exploit the inherent connectivity between hybrid aligned triplet and component features through a bidirectional refinement module, which enhances feature interaction and reciprocal reinforcement. Experimental results demonstrate that our model achieves state-of-the-art performance on the Visual Genome and Action Genome datasets, underscoring its effectiveness and adaptability.


Model overview of our method. The hybrid reciprocal transformer deploys four multi-layer decoders to jointly learn triplet masks and visual relation tuple components using hybrid representations. We illustrate the detailed feature flow of each decoder to visualize the two interaction paradigms within this transformer. The first paradigm provides iterative triplet guidance and enhances contextual feature integration for predicate recognition at the component level. The second paradigm uses aligned triplet and component features to enable bidirectional refinement of the hybrid triplet-level and component-level representations.
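For readers who want a concrete picture of the second paradigm, below is a minimal PyTorch sketch of one bidirectional refinement step, assuming each direction is implemented with standard cross-attention. The module name, residual structure, and dimensions are illustrative assumptions, not the released implementation.

    # Minimal sketch: triplet features attend to component features and vice
    # versa, so the two levels reinforce each other (assumed cross-attention
    # design; the paper's decoder internals may differ).
    import torch
    import torch.nn as nn

    class BidirectionalRefinement(nn.Module):
        def __init__(self, dim=256, heads=8):
            super().__init__()
            # One cross-attention block per direction, plus normalization.
            self.t_from_c = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.c_from_t = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm_t = nn.LayerNorm(dim)
            self.norm_c = nn.LayerNorm(dim)

        def forward(self, triplet_feats, comp_feats):
            # triplet_feats: (B, Nt, D), comp_feats: (B, Nc, D)
            t_upd, _ = self.t_from_c(triplet_feats, comp_feats, comp_feats)
            c_upd, _ = self.c_from_t(comp_feats, triplet_feats, triplet_feats)
            # Residual updates keep the original features in the loop.
            return (self.norm_t(triplet_feats + t_upd),
                    self.norm_c(comp_feats + c_upd))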


Triplet feature alignment loss measures the consistency between a triplet feature and its corresponding component-level features. It encourages similarity between triplet and component features from the same visual relation tuple, distinguishing multi-role objects by aligning their distinct triplet features at the component level.
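As a rough illustration of how such an alignment objective can be written, here is a minimal PyTorch sketch assuming an InfoNCE-style cosine-similarity loss between each triplet feature and a fused embedding of its own components. The fusion operator (simple averaging), function names, and temperature are assumptions rather than the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def triplet_alignment_loss(triplet_feats, subj_feats, pred_feats,
                               obj_feats, temperature=0.07):
        """Encourage each triplet feature to match the component features of
        the same visual relation tuple and differ from those of other tuples.

        triplet_feats: (N, D) triplet-level features
        subj/pred/obj_feats: (N, D) component features of the same N tuples
        """
        # Fuse the three component features into one component-level
        # embedding (averaging is an assumption about the fusion operator).
        comp_feats = (subj_feats + pred_feats + obj_feats) / 3.0

        t = F.normalize(triplet_feats, dim=-1)
        c = F.normalize(comp_feats, dim=-1)

        # Pairwise cosine similarities; diagonal entries are matching tuples.
        logits = t @ c.t() / temperature
        targets = torch.arange(t.size(0), device=t.device)

        # Symmetric objective: triplet->component and component->triplet.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))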


Scene graph generation performance on Visual Genome.


Qualitative comparison of our method with the baseline models SpeaQ and ISG shows that our method yields significantly more accurate predicate prediction and object detection.


Visualization of the triplet masks produced by our method and their corresponding visual relations in the Visual Genome dataset.


Qualitative results of our method on the Visual Genome dataset. The left part of each example shows the original image, and the right part shows the scene graph constructed by our method. The red dashed rectangle marks a correctly detected visual relation that is not annotated in the ground truth.

Poster

Coming Soon!

BibTeX

@inproceedings{fu2025hybrid,
    title={Hybrid Reciprocal Transformer with Triplet Feature Alignment for Scene Graph Generation},
    author={Fu, Jiawei and Zhang, Tiantian and Chen, Kai and Dou, Qi},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2025}
}