What is it about?

Computers are getting better at understanding emotions in our speech, which is important for things like call center services. Often, the 'deeper' an AI model is (meaning more layers), the better it learns. But making models too deep can cause problems, like the AI struggling to learn or taking too much time. We've created a new AI building block called DeepResLFLB (Deep Residual Local Feature Learning Block) to tackle this. It's inspired by how humans learn by 're-reading' material to understand it better. Our system has parts for initial learning (like a first read), further in-depth learning using special 'residual' connections to avoid losing information (like re-reading for detail), and then a part to make sense of it all and predict the emotion. We also tested which sound features work best, comparing standard ones (LMS) with a richer set (LMSDDC) that includes more details about human emotional expression.
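
To make the 're-reading' idea concrete, here is a minimal PyTorch-style sketch of a residual block and the three-stage layout described above (initial learning, residual 're-reading', then classification). All layer types, sizes, activations, and block counts are illustrative assumptions, not the authors' exact DeepResLFLB configuration.

```python
import torch
import torch.nn as nn

class ResidualLFLB(nn.Module):
    """Illustrative residual local feature learning block: two conv
    layers whose output is added back to the block's input (the skip
    connection), mimicking 're-reading' the same features. Channel
    counts and kernel sizes here are assumptions."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ELU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = F(x) + x: the identity path preserves the original
        # information, so gradients can flow through deeper stacks.
        return self.act(self.body(x) + x)

class DeepResLFLBSketch(nn.Module):
    """Three stages as described in the summary: initial feature
    learning ('first read'), stacked residual blocks ('re-reading'),
    then a classifier head. All sizes are illustrative."""
    def __init__(self, n_classes: int = 7):
        super().__init__()
        self.first_read = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ELU(),
            nn.MaxPool2d(2),
        )
        self.re_read = nn.Sequential(
            ResidualLFLB(32),
            ResidualLFLB(32),
        )
        self.classify = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classify(self.re_read(self.first_read(x)))
```

The key line is `self.body(x) + x`: because each block's output is added back to its input, the network never has to throw away what it already learned, which is what lets a deeper stack keep training effectively.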


Why is it important?

Building AI that accurately understands speech emotions is complex, and deeper AI models, while potentially more effective, often run into learning issues. Our DeepResLFLB approach is important because:

- Better Performance & Efficiency: It significantly outperforms older methods at recognizing emotions from speech on standard datasets (EMODB and RAVDESS). Crucially, it does this while using about 40% fewer parameters than a comparable baseline model (2D-LFLB), making it more efficient.
- Solves Deep Learning Problems: The 'residual learning' in DeepResLFLB helps prevent the 'vanishing gradient' problem, allowing the AI to learn effectively even in deeper layers without losing information. This leads to more stable training, as shown by better validation loss.
- Improved Feature Use: We showed that using a richer set of sound features (LMSDDC, capturing aspects of glottal flow, prosody, and human hearing) can further enhance emotion recognition, especially in datasets with diverse speech variations like EMODB. A sketch of such stacked features follows this list.
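
To make the feature comparison concrete, here is a minimal sketch of input stacking using librosa. 'LMS' is read here as a log-mel spectrogram, and the extra channels are generic delta (velocity) and delta-delta (acceleration) coefficients. The exact LMSDDC recipe, including its glottal-flow and prosody components, is defined in the paper, so everything below, from function names to parameters, is an illustrative assumption rather than the authors' pipeline.

```python
import librosa
import numpy as np

def lms_features(path: str, n_mels: int = 128) -> np.ndarray:
    """Standard log-mel spectrogram (LMS), shape (n_mels, frames)."""
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)

def stacked_features(path: str, n_mels: int = 128) -> np.ndarray:
    """A richer input: LMS plus its first and second temporal
    derivatives stacked as channels, shape (3, n_mels, frames).
    This illustrates 'adding more detail about how speech changes
    over time'; it is NOT the paper's exact LMSDDC definition."""
    lms = lms_features(path, n_mels)
    delta = librosa.feature.delta(lms)           # rate of change
    delta2 = librosa.feature.delta(lms, order=2) # acceleration
    return np.stack([lms, delta, delta2], axis=0)
```

The point of a richer input like this is that the network sees how the spectrum evolves over time, not just its instantaneous shape, which is where much emotional expression lives.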

Perspectives

The idea behind DeepResLFLB came from thinking about how we, as humans, learn complex information. We don't just glance at something once; we often re-read or revisit details to get a deeper understanding. I wanted to see if we could build this 'repeated learning' concept into an AI model for speech emotion recognition. The challenge with many deep learning models is that as they get deeper, they can sometimes forget earlier information or struggle to update themselves – the vanishing gradient problem. It was really exciting to see DeepResLFLB not only improve accuracy but also do it more efficiently, using fewer resources than existing LFLB models. The design, with its distinct stages for initial feature learning, deeper residual learning, and then classification, seemed to effectively manage the flow of information. Also, exploring different acoustic features like LMSDDC reinforced how important the input representation is. This work, for me, highlighted how insights from human learning can inspire more effective and efficient AI architectures.

Sattaya Singkul
True Digital Group

Read the Original

This page is a summary of: Deep Residual Local Feature Learning for Speech Emotion Recognition, January 2020, Springer Science + Business Media,
DOI: 10.1007/978-3-030-63830-6_21.
