Hybrid Reciprocal Transformer with Triplet Feature Alignment for Scene Graph Generation

Jiawei Fu, Tiantian Zhang, Kai Chen, Qi Dou
The Chinese University of Hong Kong
CVPR 2025
Project Teaser

Complex scenarios often involve overlapping visual relationships, where a single object can assume multiple roles across different triplets. Identifying these multi-role objects is challenging without knowing their aligned triplet features. Moreover, the aligned triplet (top) and component (bottom) representations correspond to the same visual relation tuple, offering an opportunity to leverage their inherent relevance for reciprocal refinement.

Abstract

Scene graph generation is a pivotal task in computer vision that aims to identify all visual relation tuples within an image. Triplet-based methods have sought to improve performance by integrating triplets as contextual features for more precise predicate identification at the component level. However, challenges remain due to interference from multi-role objects in overlapping tuples within complex scenes, which impairs a model's ability to distinguish and align the specific triplet features needed to reason about the diverse semantics of multi-role objects. To address these issues, we introduce a novel framework that incorporates a triplet alignment model into a hybrid reciprocal transformer architecture, starting with triplet mask features that guide the learning of component-level relation graphs. To effectively distinguish multi-role objects characterized by overlapping visual relation tuples, we introduce a triplet alignment loss, which provides multi-role objects with aligned triplet features and helps customize their representations. Additionally, we exploit the inherent connectivity between hybrid aligned triplet and component features through a bidirectional refinement module, which enhances feature interaction and reciprocal reinforcement. Experimental results demonstrate that our model achieves state-of-the-art performance on the Visual Genome and Action Genome datasets, underscoring its effectiveness and adaptability.


Model overview of our method. The hybrid reciprocal transformer deploys four multi-layer decoders to jointly learn triplet masks and visual relation tuple components using hybrid representations. We illustrate the detailed feature flow of each decoder to visualize the two interaction paradigms within this transformer. The first paradigm provides iterative triplet guidance and enhances contextual feature integration for predicate recognition at the component level. The second paradigm uses aligned triplet and component features to enable bidirectional refinement of the hybrid triplet-level and component-level representations.
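For readers who want a concrete picture of the second paradigm, below is a minimal PyTorch sketch of one bidirectional refinement step, assuming each direction is implemented with standard cross-attention. The module name, residual structure, and dimensions are illustrative assumptions, not the released implementation.

    # Minimal sketch: triplet features attend to component features and vice
    # versa, so the two levels reinforce each other (assumed cross-attention
    # design; the paper's decoder internals may differ).
    import torch
    import torch.nn as nn

    class BidirectionalRefinement(nn.Module):
        def __init__(self, dim=256, heads=8):
            super().__init__()
            # One cross-attention block per direction, plus normalization.
            self.t_from_c = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.c_from_t = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm_t = nn.LayerNorm(dim)
            self.norm_c = nn.LayerNorm(dim)

        def forward(self, triplet_feats, comp_feats):
            # triplet_feats: (B, Nt, D), comp_feats: (B, Nc, D)
            t_upd, _ = self.t_from_c(triplet_feats, comp_feats, comp_feats)
            c_upd, _ = self.c_from_t(comp_feats, triplet_feats, triplet_feats)
            # Residual updates keep the original features in the loop.
            return (self.norm_t(triplet_feats + t_upd),
                    self.norm_c(comp_feats + c_upd))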


Triplet feature alignment loss measures the consistency between a triplet feature and its corresponding component-level features. It encourages similarity between triplet and component features from the same visual relation tuple, distinguishing multi-role objects by aligning their distinct triplet features at the component level.
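As a rough illustration of how such an alignment objective can be written, here is a minimal PyTorch sketch assuming an InfoNCE-style cosine-similarity loss between each triplet feature and a fused embedding of its own components. The fusion operator (simple averaging), function names, and temperature are assumptions rather than the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def triplet_alignment_loss(triplet_feats, subj_feats, pred_feats,
                               obj_feats, temperature=0.07):
        """Encourage each triplet feature to match the component features of
        the same visual relation tuple and differ from those of other tuples.

        triplet_feats: (N, D) triplet-level features
        subj/pred/obj_feats: (N, D) component features of the same N tuples
        """
        # Fuse the three component features into one component-level
        # embedding (averaging is an assumption about the fusion operator).
        comp_feats = (subj_feats + pred_feats + obj_feats) / 3.0

        t = F.normalize(triplet_feats, dim=-1)
        c = F.normalize(comp_feats, dim=-1)

        # Pairwise cosine similarities; diagonal entries are matching tuples.
        logits = t @ c.t() / temperature
        targets = torch.arange(t.size(0), device=t.device)

        # Symmetric objective: triplet->component and component->triplet.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))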


Scene graph generation performance on Visual Genome.


Qualitative comparison of our method with the baseline models SpeaQ and ISG shows that our method yields significantly more accurate predicate prediction and object detection.


Visualization of the triplet masks produced by our method and their corresponding visual relations in the Visual Genome dataset.


Qualitative results of our method on the Visual Genome dataset. The left part of each example shows the original image, and the right part shows the scene graph constructed by our method. The red dashed rectangle marks a correctly detected visual relation that is not annotated in the ground truth.

Poster

Coming Soon!

BibTeX

@inproceedings{fu2025hybrid,
    title={Hybrid Reciprocal Transformer with Triplet Feature Alignment for Scene Graph Generation},
    author={Fu, Jiawei and Zhang, Tiantian and Chen, Kai and Dou, Qi},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2025}
}