Abstract: We present an approach to Audio-Visual Speech Recognition that builds on a pre-trained Whisper model. To infuse visual information into this audio-only model, we extend it with an AV fusion ...