What is it about?
In this paper, we investigate the viability of leveraging the embeddings of a large-scale pre-trained Conformer encoder for downstream speech emotion recognition (SER). Empirically, the embeddings from the middle blocks of a well pre-trained ASR encoder may contain adequate acoustic and linguistic information, making them well suited for SER. Additionally, improving the ASR performance of the pre-trained models can benefit downstream SER. We further propose a CTA-RNN architecture to effectively fuse the ASR embeddings from different blocks. Experimental results show that our method achieves state-of-the-art performance on IEMOCAP and MSP-IMPROV without explicit text as input. Moreover, cross-corpus evaluations demonstrate the robustness of large-scale pre-trained ASR embeddings across different speakers and domains.
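To give a flavour of the fusion idea, here is a minimal NumPy sketch (not the authors' implementation) of channel- and temporal-wise attention: embeddings from several encoder blocks are treated as channels, weighted by a channel attention, and then pooled over time with a temporal attention. All shapes, weight vectors, and the random initialisation below are illustrative placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical embeddings from C encoder blocks, each (T frames, D dims): (C, T, D)
C, T, D = 4, 50, 256
emb = rng.standard_normal((C, T, D))

# Channel-wise attention: score each block from its mean-pooled embedding
w_ch = rng.standard_normal(D) / np.sqrt(D)          # placeholder projection
ch_scores = emb.mean(axis=1) @ w_ch                  # (C,)
ch_weights = softmax(ch_scores)                      # (C,), sums to 1
fused = np.tensordot(ch_weights, emb, axes=(0, 0))   # (T, D) weighted sum of blocks

# Temporal-wise attention pooling over frames
w_t = rng.standard_normal(D) / np.sqrt(D)            # placeholder projection
t_weights = softmax(fused @ w_t)                     # (T,), sums to 1
utt_vec = t_weights @ fused                          # (D,) utterance-level vector

print(utt_vec.shape)
```

In the paper the attention weights are learned jointly with an RNN classifier; this sketch only shows how channel and temporal weighting reduce a stack of block embeddings to a single utterance-level vector for emotion classification.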
This page is a summary of: CTA-RNN: Channel and Temporal-wise Attention RNN leveraging Pre-trained ASR Embeddings for Speech Emotion Recognition, September 2022, International Speech Communication Association, DOI: 10.21437/interspeech.2022-10403.