Automatic Resource Augmentation for Machine Translation in Low Resource Language: EnIndic Corpus

Anasua Banerjee; Vinay Kumar; Achyut Shankar; Rutvij H. Jhaveri; Debajyoty Banik

doi:10.1145/3617371

What is it about?

A parallel corpus is the primary ingredient of machine translation: it is required to train both statistical machine translation (SMT) and neural machine translation (NMT) systems. Good-quality parallel corpora for Hindi to English are scarce. Comparable corpora for a given language pair are easier to find, but they cannot be used directly in SMT or NMT systems. We therefore generate a parallel corpus from a comparable corpus by mining the sentences that are translations of each other. The proposed algorithm uses sentence length and a word translation model to align these sentence pairs.
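As a rough illustration of the idea, the sketch below mines sentence pairs using a length score and a word-level lexicon. It is not the paper's algorithm: the scoring formula (length ratio multiplied by lexicon coverage), the thresholds, the function names, and the toy lexicon and sentences are all assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


def length_score(src_tokens: List[str], tgt_tokens: List[str]) -> float:
    """Ratio of the shorter sentence length to the longer one, in [0, 1]."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    shorter, longer = sorted((len(src_tokens), len(tgt_tokens)))
    return shorter / longer


def lexical_score(
    src_tokens: List[str], tgt_tokens: List[str], lexicon: Dict[str, Set[str]]
) -> float:
    """Fraction of source tokens with a lexicon translation in the target."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_set)
    return hits / len(src_tokens)


def mine_pairs(
    src_sents: List[str],
    tgt_sents: List[str],
    lexicon: Dict[str, Set[str]],
    min_length: float = 0.5,  # assumed threshold, not from the paper
    min_score: float = 0.5,   # assumed threshold, not from the paper
) -> List[Tuple[str, str, float]]:
    """For each source sentence, keep the best-scoring target candidate."""
    pairs = []
    for src in src_sents:
        src_tokens = src.lower().split()
        best_score, best_tgt = 0.0, None
        for tgt in tgt_sents:
            tgt_tokens = tgt.lower().split()
            ls = length_score(src_tokens, tgt_tokens)
            if ls < min_length:
                continue  # lengths too different to be translations
            score = ls * lexical_score(src_tokens, tgt_tokens, lexicon)
            if score > best_score:
                best_score, best_tgt = score, tgt
        if best_tgt is not None and best_score >= min_score:
            pairs.append((src, best_tgt, best_score))
    return pairs


# Toy comparable corpus and hypothetical Hindi-English lexicon.
hindi = ["यह एक किताब है", "मौसम आज अच्छा है"]
english = [
    "the weather is good today",
    "this is a book",
    "he plays cricket every sunday in the park with friends",
]
lexicon = {
    "यह": {"this"}, "एक": {"a", "one"}, "किताब": {"book"}, "है": {"is"},
    "मौसम": {"weather"}, "आज": {"today"}, "अच्छा": {"good"},
}

pairs = mine_pairs(hindi, english, lexicon)
for src, tgt, score in pairs:
    print(f"{score:.2f}\t{src}\t{tgt}")
```

In practice the lexicon would come from a trained word translation model with probabilities rather than a hand-written dictionary, and the exhaustive pairwise comparison would need pruning for corpora of realistic size.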

Why is it important?

Parallel corpora are essential for training machine translation systems, but high-quality Hindi-English parallel data is scarce. Algorithms that use sentence length and word translation models can mine and align sentence pairs from comparable corpora to create such resources.

Perspectives

To mitigate the scarcity of high-quality Hindi-English parallel corpora, algorithms use sentence length and word translation models to mine and align sentence pairs from comparable texts, thereby generating valuable translation resources. This innovative approach bridges the data gap, enhancing machine translation systems' capabilities.
Dr. Debajyoty Banik

This page is a summary of: Automatic Resource Augmentation for Machine Translation in Low Resource Language: EnIndic Corpus, ACM Transactions on Asian and Low-Resource Language Information Processing, August 2023, ACM (Association for Computing Machinery), DOI: 10.1145/3617371.

Contributors

The following have contributed to this page

Dr. Debajyoty Banik
