Abstract: Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on natural language descriptions, a task that requires both accurate spatial localization and temporal consistency. In ...