What is it about?

Dependency Parsing (DP) is the task of having a computer work out the grammatical structure of a sentence: which words depend on which. While this is well studied for English, Thai poses unique challenges with its flexible word order, omitted words, slang, and lack of clear sentence boundaries, especially in long or complex sentences. Our research proposes seven new AI algorithms specifically designed for parsing Thai sentences. A key feature of these algorithms is 'character embedding': instead of looking only at whole words (which can be problematic if a word is new or slang), our models also learn from the individual characters within each word. This helps them better handle unfamiliar words and the nuances of the Thai language. We tested different approaches, some that build the parse by processing words in sequence (transition-based) and others that score the whole sentence as a network of possible links (graph-based).
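To make the character-embedding idea concrete, below is a minimal PyTorch sketch of an input layer that combines a word embedding with a CNN character embedding and feeds the result through a BiLSTM, the kind of representation a dependency parser can score head-dependent links from. This is an illustrative sketch only, not the paper's implementation: the class name, vocabulary sizes, and dimensions are made-up assumptions.

```python
# Illustrative sketch (not the authors' code): word embedding + CNN character
# embedding, combined and contextualized by a BiLSTM over the sentence.
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_words, n_chars, word_dim=100, char_dim=30,
                 char_channels=50, kernel_size=3):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # 1-D convolution over the character sequence of each word
        self.char_cnn = nn.Conv1d(char_dim, char_channels, kernel_size,
                                  padding=kernel_size // 2)
        # BiLSTM reads the sentence left-to-right and right-to-left
        self.bilstm = nn.LSTM(word_dim + char_channels, 128,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, s, c = char_ids.shape
        chars = self.char_emb(char_ids.view(b * s, c))       # (b*s, c, char_dim)
        chars = self.char_cnn(chars.transpose(1, 2))          # (b*s, channels, c)
        char_repr = chars.max(dim=2).values.view(b, s, -1)    # max-pool over characters
        words = self.word_emb(word_ids)                       # (b, s, word_dim)
        combined = torch.cat([words, char_repr], dim=-1)
        context, _ = self.bilstm(combined)                    # (b, s, 256)
        return context  # contextual word features for scoring dependency links

# Toy usage: 2 sentences of 5 word tokens, each word padded to 8 characters.
word_ids = torch.randint(1, 1000, (2, 5))
char_ids = torch.randint(1, 80, (2, 5, 8))
features = CharCNNWordEncoder(n_words=1000, n_chars=80)(word_ids, char_ids)
print(features.shape)  # torch.Size([2, 5, 256])
```

Because the character CNN builds a word's representation from its characters, a new or creatively spelled Thai word still gets a meaningful vector even when it is missing from the word vocabulary.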


Why is it important?

Accurately parsing Thai sentences is essential for developing advanced Thai language AI applications such as better search engines, question-answering systems, and tools that extract meaning from text. However, current methods often fall short on Thai. Our work is important because:

- Improved Thai Parsing Accuracy: Our seven newly developed Thai DP algorithms, which combine traditional parsing techniques with character-level understanding, significantly outperform existing baseline methods on the Thai-PUD dataset. For instance, our best transition-based model (m-BiLSTM with CNN character embedding) achieved a UAS of 78.48% on Thai-PUD (see the short UAS sketch after this list).

- Handles Thai Language Challenges: By using character embeddings, our models are better at dealing with out-of-vocabulary words and the slang common in Thai, while the BiLSTM helps tackle issues related to Thai's flexible word order.

- Effective for Long & Complex Sentences: The proposed algorithms show improved performance even on long and complex Thai sentences, a known weak point for many parsing systems. For example, one of our algorithms (m-BiLSTM with LSTM character embedding) achieved 88.17% UAS on long Thai sentences.
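For readers unfamiliar with the metric quoted above, UAS (Unlabeled Attachment Score) is simply the fraction of words whose predicted head matches the gold-standard head. The tiny sketch below illustrates the calculation; the example head indices are invented for illustration and are not from the Thai-PUD data.

```python
# Illustrative UAS calculation: share of words attached to the correct head.
def uas(gold_heads, pred_heads):
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

# Head index of each word in one sentence (0 = sentence root); made-up example.
gold = [2, 0, 2, 5, 3]
pred = [2, 0, 2, 3, 3]
print(f"UAS = {uas(gold, pred):.2%}")  # UAS = 80.00%
```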

Perspectives

Working on Thai dependency parsing has been a fascinating challenge, largely because Thai has so many interesting linguistic features that aren't always straightforward for computers. Most existing parsing tools are built for languages like English, so adapting these or creating new ones for Thai, especially to handle its word order flexibility and the common issue of out-of-vocabulary words or slang, was a key motivation. The idea to incorporate character embeddings was particularly exciting. By allowing the model to learn from sub-word information (characters), we could give it a better chance to understand words it hadn't seen before or that were creatively spelled. It was very rewarding to see that this approach, combined with careful model design (like using BiLSTMs to capture context from both directions), led to a significant improvement in parsing accuracy for Thai. This work, I hope, provides a stronger foundation for many other Thai NLP applications that rely on a good understanding of sentence structure.

Sattaya Singkul
True Digital Group

Read the Original

This page is a summary of: Thai Dependency Parsing with Character Embedding, October 2019, Institute of Electrical & Electronics Engineers (IEEE),
DOI: 10.1109/iciteed.2019.8930002.
