Abstract: Audio-visual segmentation (AVS) is a challenging multimodal task that requires fusing spatio-temporal audio and visual features to achieve pixel-wise segmentation of sounding objects. This ...