What is it about?

Teaching AI to understand emotions in speech can be tricky. Often, AI models learn very specific details from their training data and then don't work well when they hear speech from new speakers or in environments with different sound quality (this is called overfitting). We propose a new two-stage approach, the 'verify-to-classify' framework, to make Speech Emotion Recognition (SER) more reliable and adaptable. First, in the 'speech emotion learning' stage, the AI learns to create unique 'emotional voiceprints', or vectors, from speech, much like verifying a speaker's identity. We use a deep network (a ResNet with Squeeze-Excitation) for this. This stage produces a model that is good at capturing emotional characteristics within the training data. Second, in the 'speech emotion recognition' stage, the vectors from this validated emotion model are used to train a simpler, classical machine-learning classifier (such as an SVM or MLP) that labels the actual emotions. We also developed new loss functions that help the AI learn these emotional vectors more effectively.
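
To make the two-stage idea concrete, here is a minimal sketch of the 'verify-to-classify' pipeline in Python. The small embedding network stands in for the paper's SE-ResNet emotion encoder; names such as EmotionEmbedder, the 128-dimensional embedding size, and the random placeholder data are illustrative assumptions rather than the authors' exact setup.

```python
# Minimal sketch of the two-stage 'verify-to-classify' pipeline described above.
# Stage 1: a frozen embedding network turns speech into emotion vectors.
# Stage 2: a classical classifier (here an SVM) is fit on those vectors.
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

class EmotionEmbedder(nn.Module):
    """Stand-in for the SE-ResNet encoder: maps a spectrogram to an emotion vector."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),            # pool away time and frequency
        )
        self.fc = nn.Linear(32, emb_dim)

    def forward(self, spec):                         # spec: (batch, n_mels, frames)
        x = self.conv(spec.unsqueeze(1)).flatten(1)  # (batch, 32)
        return nn.functional.normalize(self.fc(x), dim=-1)  # unit-length 'voiceprint'

# Stage 1 would be trained with a verification-style loss (see later sketch), then frozen.
embedder = EmotionEmbedder().eval()

def embed(specs):
    """Extract fixed-size emotion vectors for a batch of spectrograms."""
    with torch.no_grad():
        return embedder(torch.as_tensor(specs, dtype=torch.float32)).numpy()

# Stage 2: fit a classical classifier on the frozen vectors.
train_specs = np.random.randn(64, 40, 100)           # placeholder training spectrograms
train_labels = np.random.randint(0, 7, size=64)      # e.g. 7 Emo-DB emotion classes
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(embed(train_specs), train_labels)

test_specs = np.random.randn(8, 40, 100)
print(clf.predict(embed(test_specs)))
```

The key design choice is that the embedding network is trained and validated as a verifier first and then frozen, so the classical classifier only has to separate vectors that are already emotionally discriminative.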


Why is it important?

Many current AI systems for recognizing speech emotions work well in ideal conditions but struggle with real-world variations such as different audio qualities or new speakers. Our 'verify-to-classify' framework is important because:

- Better generalization: It helps the AI overcome overfitting, so it performs well and consistently even when exposed to speech from new environments or of different quality than it was trained on. For example, models trained on low-quality audio still performed well when tested on high-quality audio, with a smaller drop in accuracy and F1-score than the baselines.
- Improved accuracy: The proposed method, particularly with our "softmax with angular prototypical loss" (Lo5, sketched after this list), significantly improves emotion recognition accuracy over previous approaches on standard datasets such as Emo-DB and RAVDESS. For instance, the PerformResSE-Lo5 with ASP model achieved 92.76% accuracy and a 90.14% F1-score on Emo-DB in a low-quality environment.
- Works for both verification and recognition: The framework is designed to be effective both at verifying emotional characteristics (in-domain) and at recognizing distinct emotion classes (out-domain).
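
As a rough illustration of the 'softmax with angular prototypical loss' idea, the sketch below combines a plain softmax classification loss with a generic angular prototypical term, following the common formulation from the speaker-verification literature. The class names, layer sizes, and the simple unweighted sum of the two terms are assumptions; the paper's exact Lo5 definition may differ.

```python
# Rough sketch of a 'softmax + angular prototypical' training objective in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxAngularProto(nn.Module):
    def __init__(self, emb_dim=128, n_classes=7):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(10.0))         # learnable cosine scale
        self.b = nn.Parameter(torch.tensor(-5.0))         # learnable bias
        self.classifier = nn.Linear(emb_dim, n_classes)   # plain softmax head

    def forward(self, emb, labels):
        """emb: (classes_in_batch, utterances_per_class, emb_dim); labels: (classes_in_batch,)"""
        # Angular prototypical part: one query per class vs. prototypes of the rest.
        query = emb[:, 0]                                  # (C, D)
        proto = emb[:, 1:].mean(dim=1)                     # (C, D) class prototypes
        cos = F.cosine_similarity(query.unsqueeze(1), proto.unsqueeze(0), dim=-1)
        logits_ap = self.w.clamp(min=1e-6) * cos + self.b  # scaled cosine similarities
        target = torch.arange(emb.size(0), device=emb.device)
        loss_ap = F.cross_entropy(logits_ap, target)

        # Softmax (classification) part over every utterance in the batch.
        flat = emb.reshape(-1, emb.size(-1))
        flat_labels = labels.repeat_interleave(emb.size(1))
        loss_sm = F.cross_entropy(self.classifier(flat), flat_labels)
        return loss_sm + loss_ap

# Toy usage: a batch with 4 emotion classes, 2 utterances each, 128-dim embeddings.
loss_fn = SoftmaxAngularProto()
emb = F.normalize(torch.randn(4, 2, 128), dim=-1)
labels = torch.tensor([0, 1, 2, 3])
print(loss_fn(emb, labels))
```

The prototypical term pulls each query vector toward its own class prototype and away from the others, which is what makes the learned emotion vectors discriminative enough for the simple stage-two classifier.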

Perspectives

The 'verify-to-classify' concept was born out of the challenges we often see with end-to-end deep learning models in speech emotion recognition. While powerful, they can be like black boxes that sometimes overfit to the training data and don't generalize well to new, unseen conditions. I wanted to explore a more structured approach that combines the feature extraction power of deep learning with the robustness of classical machine learning for the final classification. The idea was to first teach the model to really understand and 'verify' the core emotional content by learning distinctive vector representations – like creating a unique signature for each emotion. Then, once we have these strong emotional 'voiceprints', we can use a more straightforward classifier. It was particularly rewarding to see this approach yield better generalization, especially in cross-environment settings. Developing and testing the new loss functions, like Lo5, and seeing how they helped in creating more discriminative emotional vectors was also a key part of this journey. I believe this framework offers a more robust and explainable way to tackle speech emotion recognition.

Sattaya Singkul
True Digital Group

Read the Original

This page is a summary of: Vector learning representation for generalized speech emotion recognition, Heliyon, March 2022, Elsevier. DOI: 10.1016/j.heliyon.2022.e09196.
