What is it about?
A parallel corpus is the primary ingredient of machine translation: it is required to train both statistical machine translation (SMT) and neural machine translation (NMT) systems. Good-quality parallel corpora for Hindi to English are lacking. Comparable corpora for a given language pair are comparatively easy to find, but they cannot be used directly in SMT or NMT systems. We therefore generate a parallel corpus from a comparable corpus by mining the sentences that are translations of each other. The proposed algorithm uses sentence length and a word translation model to align these sentence pairs.
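As a rough illustration of how sentence length and a word translation model can be combined to mine sentence pairs, here is a minimal Python sketch. It is not the algorithm from the paper: the function names, the toy Hindi-English lexicon, the whitespace tokenisation, and the `weight` and `threshold` values are all assumptions chosen only for the example.

```python
from typing import Dict, List, Set, Tuple


def length_score(src_tokens: List[str], tgt_tokens: List[str]) -> float:
    """Ratio of the shorter to the longer sentence length, in [0, 1]."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    a, b = len(src_tokens), len(tgt_tokens)
    return min(a, b) / max(a, b)


def lexical_score(src_tokens: List[str], tgt_tokens: List[str],
                  lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens with a lexicon translation in the target."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    covered = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt_set)
    return covered / len(src_tokens)


def mine_pairs(src_sents: List[str], tgt_sents: List[str],
               lexicon: Dict[str, Set[str]],
               weight: float = 0.3,      # assumed weight of the length cue
               threshold: float = 0.6    # assumed acceptance threshold
               ) -> List[Tuple[str, str, float]]:
    """For each source sentence keep its best-scoring target, if good enough."""
    pairs = []
    for s in src_sents:
        s_tok = s.lower().split()
        best, best_score = None, 0.0
        for t in tgt_sents:
            t_tok = t.lower().split()
            score = (weight * length_score(s_tok, t_tok)
                     + (1 - weight) * lexical_score(s_tok, t_tok, lexicon))
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            pairs.append((s, best, best_score))
    return pairs


# Toy word translation model (hypothetical entries, for illustration only).
toy_lexicon = {
    "मैं": {"i"}, "स्कूल": {"school"}, "जाता": {"go", "goes"}, "हूँ": {"am"},
    "आज": {"today"}, "मौसम": {"weather"}, "अच्छा": {"good", "nice"},
    "है": {"is"}, "यह": {"this"}, "किताब": {"book"},
}
hindi = ["मैं स्कूल जाता हूँ", "आज मौसम अच्छा है", "यह किताब है"]
english = ["I go to school", "The weather is nice today"]

for src, tgt, score in mine_pairs(hindi, english, toy_lexicon):
    print(f"{score:.3f}\t{src}\t{tgt}")
```

In this sketch the third Hindi sentence has no translation among the English sentences, so its best score stays below the threshold and it is left out. A real system would need a learned word translation model and a more careful search over candidate pairs.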
Why is it important?
Parallel corpora are essential for training machine translation systems, but high-quality Hindi-English parallel data is scarce. Mining and aligning sentence pairs from comparable corpora, using sentence length and a word translation model, is a way to create such resources.
Perspectives
To mitigate the scarcity of high-quality Hindi-English parallel corpora, the algorithm uses sentence length and a word translation model to mine and align sentence pairs from comparable texts, thereby generating valuable translation resources. This approach helps bridge the data gap and improve the capabilities of machine translation systems.
Dr. Debajyoty Banik
Read the Original
This page is a summary of: Automatic Resource Augmentation for Machine Translation in Low Resource Language: EnIndic Corpus, ACM Transactions on Asian and Low-Resource Language Information Processing, August 2023, ACM (Association for Computing Machinery), DOI: 10.1145/3617371.