Plug-and-Steer:
Decoupling Separation and Selection in
Audio-Visual Target Speaker Extraction
Abstract
The goal of this paper is to provide a new perspective on audio-visual target speaker extraction (AV-TSE) by decoupling separation from target selection. Conventional AV-TSE systems typically fuse audio and visual features deeply and re-learn the entire separation process, which can impose a fidelity ceiling due to the noisy nature of in-the-wild audio-visual datasets. To address this, we propose Plug-and-Steer, which assigns high-fidelity separation to a frozen audio-only backbone and restricts the role of the visual modality strictly to target selection. We introduce the Latent Steering Matrix (LSM), a minimalist linear transformation that re-routes latent features within the backbone to anchor the target speaker to a designated channel. Experiments across four architectures (ConvTasNet, DPRNN, TF-GridNet, and MossFormer2) show that our method effectively preserves the acoustic priors of diverse backbones, achieving perceptual quality comparable to that of the original models.
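To make the LSM idea concrete, below is a minimal PyTorch sketch of a visually conditioned linear transformation applied to the latent features of a frozen audio-only backbone. The tensor shapes, the `latent_dim`/`visual_dim` names, and the choice to predict the steering matrix from a single pooled visual embedding are illustrative assumptions for this sketch, not the paper's exact design.

```python
import torch
import torch.nn as nn


class LatentSteeringMatrix(nn.Module):
    """Sketch of a Latent Steering Matrix (LSM): a learned linear map,
    conditioned on the target speaker's visual cue, that re-routes the
    latent features of a frozen separation backbone so the target
    speaker lands on a designated output channel.
    """

    def __init__(self, latent_dim: int, visual_dim: int):
        super().__init__()
        self.latent_dim = latent_dim
        # Predict a (latent_dim x latent_dim) steering matrix from the
        # target speaker's visual embedding (assumed already pooled
        # over time); this is the only trainable component here.
        self.to_matrix = nn.Linear(visual_dim, latent_dim * latent_dim)

    def forward(self, z: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # z: backbone latents, shape (batch, latent_dim, frames)
        # v: visual embedding of the target speaker, shape (batch, visual_dim)
        m = self.to_matrix(v).view(-1, self.latent_dim, self.latent_dim)
        # Apply the steering matrix at every frame:
        # (B, D, D) @ (B, D, T) -> (B, D, T), same shape as the input.
        return torch.bmm(m, z)


# Usage: steer the frozen backbone's latents with a visual cue.
backbone_latents = torch.randn(2, 256, 100)  # (batch, latent_dim, frames)
visual_cue = torch.randn(2, 512)             # (batch, visual_dim)
lsm = LatentSteeringMatrix(latent_dim=256, visual_dim=512)
steered = lsm(backbone_latents, visual_cue)  # (2, 256, 100)
```

Because the backbone stays frozen and only this linear re-routing is learned, its acoustic priors are untouched; the visual stream influences only which speaker ends up on the designated channel.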
Audio Samples
Experimental results on the LRS2-2Mix dataset, corresponding to Table 2 of our paper. This section compares our approach using the Latent Steering Matrix (LSM) with the baseline (AV-MossFormer2) from ClearerVoice-Studio and with the residual-feature variants.