What is it about?

Teaching computers to speak Thai naturally is tricky because Thai has complex tones and sound structures. Our research looks at how to make computer-generated Thai speech (Text-to-Speech, or TTS) sound better. We developed a complete system for Thai TTS and explored two main things:

- Language Units: We studied whether breaking down Thai text into different 'units' – individual sounds (phonemes), syllables, or whole words – affects how clearly and accurately the AI speaks.
- Silence Trimming: We also checked whether removing silent stretches from the beginning and end of the recordings used to train the AI improves the final voice quality.

We tested these ideas using advanced AI voice models (VITS, Tacotron2-DDC, and YourTTS) to see which combination produces the most natural-sounding Thai speech, judging both the speaker's tone and the pronunciation accuracy.
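To make the silence-trimming idea concrete, here is a minimal sketch of trimming leading and trailing silence from a waveform with a simple amplitude gate. This is an illustration only: the `trim_silence` function, the `threshold` value, and the toy waveform are assumptions for this example, not the exact method used in the paper.

```python
import numpy as np

def trim_silence(audio: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Drop leading/trailing samples whose absolute amplitude is below
    `threshold`. A simple energy gate; the paper's trimming may differ."""
    voiced = np.flatnonzero(np.abs(audio) >= threshold)
    if voiced.size == 0:
        return audio[:0]  # the clip is entirely silence
    return audio[voiced[0]:voiced[-1] + 1]

# Toy waveform: 100 silent samples, a 50-sample "speech" burst, 100 more silent samples.
wave = np.concatenate([np.zeros(100), 0.5 * np.ones(50), np.zeros(100)])
trimmed = trim_silence(wave)
print(len(wave), len(trimmed))  # 250 50
```

In practice a library routine such as `librosa.effects.trim` (which gates on decibels rather than raw amplitude) would typically be used instead of a hand-rolled threshold.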


Why is it important?

Creating high-quality AI voices for tonal languages like Thai is a significant challenge because of the language's intricate phonology and tonal complexity. Our research is important because:

- Clearer Path for Thai TTS: We show that the choice of 'linguistic unit' (how text is broken down – word, syllable, or phoneme) significantly affects the quality of Thai TTS. Specifically, using 'words' as the unit with the VITS model produced the best overall results, excelling in both correct tone and clear pronunciation. This provides a clear direction for developing better Thai TTS systems.
- Beyond Tone – Pronunciation Matters: We highlight that sounding like the original speaker (tonal accuracy) isn't enough; the AI also needs to pronounce words correctly, which is crucial for meaning in Thai. Our study evaluated both of these critical aspects of speech quality.
- Practical Improvements: We also found that trimming silences from training audio generally improves the quality of the synthesized speech, leading to clearer tonal and prosodic output.

These findings can directly help create more natural and understandable Thai AI voices for applications such as virtual assistants, accessibility tools for the visually impaired, and better human-computer interaction.

Perspectives

Working on this Thai Text-to-Speech project was incredibly insightful. Thai is such a nuanced tonal language, and getting an AI to capture not just the right tones but also accurate pronunciation is a complex puzzle. We wanted to systematically explore how fundamental choices, like the linguistic units (phonemes, syllables, or words) we feed to the AI, and practical steps like trimming silence from audio data, really impact the final speech quality.

It was particularly interesting to see the VITS model, especially when using 'words' as the linguistic unit, emerge as a strong performer in our evaluations. This reinforces that for a language like Thai, providing more contextual information through larger units can be very beneficial for the AI. One key takeaway for me was also the critical need to evaluate TTS systems comprehensively – looking beyond just tonal similarity to the speaker and really digging into pronunciation accuracy, because that's where meaning often lies in Thai.

I believe this research helps lay a stronger foundation for building more natural and human-like Thai TTS systems, which is vital for improving human-computer interaction in Thailand.

Sattaya Singkul
True Digital Group

Read the Original

This page is a summary of: End-to-End Thai Text-to-Speech with Linguistic Unit, May 2024, ACM (Association for Computing Machinery),
DOI: 10.1145/3652583.3658029.
You can read the full text via the DOI above.
