Instance-Aware Generalized Referring Expression Segmentation

arXiv

¹Stony Brook University, New York, USA  ²EPFL


Previous GRES methods typically output a single foreground mask in an end-to-end manner, struggling with complex cases involving multiple referred object instances. In contrast, our proposed method automatically localizes relevant object instances associated with different parts of the input prompt before aggregating them to produce the final mask.


TL;DR: InstAlign is a novel instance-aware approach for Generalized Referring Expression Segmentation that incorporates instance-level reasoning into the segmentation process. By explicitly identifying relevant objects and linking them to specific semantic entities in the input text, the model can effectively handle complex multi-object scenarios, leading to significant performance gains.

Abstract

Recent works on Generalized Referring Expression Segmentation (GRES) struggle to handle complex expressions that refer to multiple distinct objects. This is because these methods typically employ end-to-end foreground-background segmentation and lack a mechanism to explicitly differentiate object instances and associate them with the text query.

To this end, we propose InstAlign, a method that incorporates object-level reasoning into the segmentation process.

Our model leverages both text and image inputs to extract a set of object-level tokens that capture both the semantic information in the input prompt and the objects within the image. By modeling the text-object alignment via instance-level supervision, each token uniquely represents an object segment in the image, while also aligning with relevant semantic information from the text.

✨ Results: Extensive experiments on the gRefCOCO and Ref-ZOM benchmarks demonstrate that our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.

🚀 Code Released: Our code and pre-trained weights are available at https://github.com/nero1342/InstAlign.


InstAlign identifies relevant instances and merges them to produce the final accurate segmentation mask.
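
The merging step described above can be illustrated with a minimal sketch. This is not the released InstAlign code. It assumes the model has already produced per-instance mask logits and a per-instance relevance score in [0, 1], and it uses a plain thresholded union as the aggregation rule. The function name `aggregate_instances`, its parameters, and both thresholds are hypothetical and chosen only for illustration.

```python
import numpy as np


def aggregate_instances(mask_logits, relevance, mask_thresh=0.5, rel_thresh=0.5):
    """Merge per-instance masks into one foreground mask (illustrative only).

    mask_logits: float array of shape (N, H, W), one logit map per instance.
    relevance:   float array of shape (N,), assumed probability that each
                 instance is referred to by the text.
    Returns (final_mask, keep), where final_mask has shape (H, W) and keep
    marks the instances that were merged.
    """
    mask_logits = np.asarray(mask_logits, dtype=float)
    relevance = np.asarray(relevance, dtype=float)

    # Convert logits to probabilities, then binarize each instance mask.
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))
    instance_masks = mask_probs > mask_thresh

    # Keep only the instances judged relevant to the expression.
    keep = relevance > rel_thresh

    # Union of the kept instance masks; all False when nothing is kept.
    final_mask = instance_masks[keep].any(axis=0)
    return final_mask, keep


# Toy example: three candidate instances on a 2x2 image.
logits = np.full((3, 2, 2), -10.0)
logits[0, 0, 0] = 10.0  # instance 0 covers the top-left pixel
logits[1, 1, 1] = 10.0  # instance 1 covers the bottom-right pixel
logits[2, 0, 1] = 10.0  # instance 2 covers the top-right pixel
scores = np.array([0.9, 0.8, 0.1])  # instance 2 is judged irrelevant

final_mask, keep = aggregate_instances(logits, scores)
print(final_mask.astype(int))
print(keep)
```

In this sketch, an expression that matches no instance yields an empty mask because no relevance score passes the threshold. How the actual model scores relevance and combines masks may differ; see the released code for the real implementation.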