What is it about?
Recognizing emotions in spoken Thai is tough for computers, especially when there's background noise or limited training data. Our research introduces a new system called TH-SERSE (Thai Speech Emotion Recognition with Speech Enhancement) to tackle these issues. First, we 'clean up' the speech using an AI tool called Conv-TasNet to reduce noise and improve clarity. Then, the cleaned speech goes into another AI block that learns to understand general speech patterns without needing emotion labels – it does this by trying to predict upcoming parts of the speech from what it has already heard (using a technique called Contrastive Predictive Coding). Finally, this pre-trained system is fine-tuned specifically to identify emotions. We tested different setups, including adding more variety to the training data using a technique called VTLP augmentation.
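The "predict upcoming parts of the speech" idea can be made concrete with the contrastive (InfoNCE) loss that underlies Contrastive Predictive Coding: the model scores the true future frame encoding against randomly drawn distractor frames. The NumPy sketch below is only an illustration of that loss under toy assumptions (random vectors standing in for encoder outputs); it is not the authors' implementation.

```python
import numpy as np

def infonce_loss(context, future, negatives):
    """InfoNCE loss used in Contrastive Predictive Coding (minimal sketch).

    context:   (d,) context vector c_t from the autoregressive model
    future:    (d,) encoding z_{t+k} of the true upcoming speech frame
    negatives: (n, d) encodings of n distractor frames from other positions
    """
    # Similarity scores between the context and each candidate frame
    pos_score = context @ future            # scalar: score of the true future
    neg_scores = negatives @ context        # (n,): scores of the distractors
    scores = np.concatenate(([pos_score], neg_scores))
    # Softmax cross-entropy with the true future frame as target (index 0)
    scores = scores - scores.max()          # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[0]

# Toy example: the true future is correlated with the context,
# distractors are not, so the loss should be small but positive.
rng = np.random.default_rng(0)
d, n = 16, 8
c = rng.normal(size=d)
z_pos = c + 0.1 * rng.normal(size=d)       # true future frame encoding
z_neg = rng.normal(size=(n, d))            # distractor encodings
loss = infonce_loss(c, z_pos, z_neg)
```

Minimizing this loss pushes the model to tell the real continuation of an utterance apart from distractors, which is what lets it learn useful speech representations without any emotion labels.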
Featured Image
Photo by Kelly Sikkema on Unsplash
Why is it important?
Effectively recognizing emotions in Thai speech, a tonal and low-resource language, can greatly improve how we interact with computers, but background noise and data scarcity are major hurdles. Our TH-SERSE framework is important because it:
- Handles noisy speech better: by first enhancing the speech with Conv-TasNet, our system significantly improves emotion recognition accuracy in challenging acoustic environments like those found in the EMOLA dataset. On EMOLA, speech enhancement boosted unweighted accuracy from about 34.44% to 41.22%.
- Works for low-resource Thai: self-supervised pre-training (Contrastive Predictive Coding) helps the model learn robust speech representations even with limited labeled Thai emotion data, which is crucial for low-resource languages.
- Outperforms baselines: our complete system, combining speech enhancement and self-supervised learning, outperformed other recent methods on both the EMOLA and ThaiSER datasets, achieving up to 81.53% unweighted accuracy on ThaiSER with the speech enhancement and augmentation pipeline.
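The VTLP augmentation mentioned above (Vocal Tract Length Perturbation) adds variety to training data by slightly warping the frequency axis of each spectrogram, simulating speakers with different vocal tract lengths. The sketch below shows the standard piecewise-linear warping idea in NumPy; the function name, the `f_hi_ratio` parameter, and the interpolation details are illustrative assumptions, not the paper's exact code.

```python
import numpy as np

def vtlp_warp(spec, alpha, f_hi_ratio=0.9):
    """Vocal Tract Length Perturbation, minimal sketch.

    Warps the frequency axis of a magnitude spectrogram by factor
    alpha (typically drawn from roughly U(0.9, 1.1) per utterance).

    spec: (n_freq, n_frames) magnitude spectrogram
    """
    n_freq = spec.shape[0]
    freqs = np.arange(n_freq, dtype=float)
    nyquist = n_freq - 1
    # Frequencies below this boundary are scaled by alpha; above it,
    # a second linear segment maps the remainder back up to Nyquist.
    boundary = f_hi_ratio * n_freq * min(alpha, 1.0) / alpha
    warped = np.where(
        freqs <= boundary,
        freqs * alpha,
        nyquist - (nyquist - boundary * alpha)
        * (nyquist - freqs) / (nyquist - boundary),
    )
    warped = np.clip(warped, 0, nyquist)
    # Resample each frame's spectrum at the warped bin positions
    out = np.empty_like(spec)
    for t in range(spec.shape[1]):
        out[:, t] = np.interp(freqs, warped, spec[:, t])
    return out

# Toy usage: warp a random spectrogram as if from a different speaker
rng = np.random.default_rng(1)
spec = rng.random((64, 10))            # toy magnitude spectrogram
augmented = vtlp_warp(spec, alpha=1.1)
```

With alpha = 1.0 the warp is the identity, so the augmentation smoothly interpolates between "no change" and mild spectral stretching or compression.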
Perspectives
Working on TH-SERSE was a really interesting dive into tackling the practical challenges of speech emotion recognition, especially for Thai. We know that real-world audio is rarely perfect – it's often noisy, which can really confuse AI models. So, integrating a strong speech enhancement component like Conv-TasNet right at the beginning of the pipeline felt like a critical step. Beyond just cleaning the audio, the self-supervised pre-training with Contrastive Predictive Coding was exciting. For a low-resource language like Thai, where getting large amounts of labeled emotion data is hard, being able to first let the model learn general speech characteristics from unlabeled data is a huge advantage. Seeing how these components, speech enhancement and self-supervised learning, came together to significantly boost emotion recognition performance on both standard and more challenging datasets was very rewarding. It reinforces that a multi-stage, thoughtful approach to data processing and model training can make a real difference in building more robust and practical AI systems for understanding human emotion.
Sattaya Singkul
True Digital Group
Read the Original
This page is a summary of: Real-Time Thai Speech Emotion Recognition With Speech Enhancement Using Time-Domain Contrastive Predictive Coding and Conv-Tasnet, May 2022, Institute of Electrical & Electronics Engineers (IEEE),
DOI: 10.1109/icbir54589.2022.9786444.
You can read the full text:
Contributors
The following have contributed to this page