What is it about?

Data-driven supervised approaches to machine translation rely on parallel corpora. When data and resources are scarce, accurate output becomes harder to achieve. In addition, the efficiency of a machine translation system depends on the quality of the corpora used. Hindi still lacks good-quality parallel corpora and needs more resources for accurate machine translation. Comparable corpora are more readily available than parallel corpora, but they cannot be used directly in machine translation. In this research, we propose an algorithm that mines comparable corpora from the web and generates parallel corpora from them automatically. Machine translation systems, a system combination approach, and an information retrieval (IR)-based technique work together to choose the set of sentence pairs. The sentence pairs with the best scores are then selected to prepare the final parallel corpora. The primary modules of this architecture are a fuzzy logic-based evaluation metric, an information retrieval module, a statistical machine translation system, the Google neural machine translation system, the Microsoft machine translation system, and a system combination module for machine translation. As a case study, we prepare a Hindi-English parallel corpus of 30,825 + 51,235 = 82,060 sentence pairs. Evaluation results show that the F-score ranges from 95.73 to 96.98 across data sets. The source code and prepared dataset (comparable and parallel corpus) can be found at https://github.com/debajyoty/Comparable-partallel-Algo2.git.
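The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `token_f1` and `best_pair`, the threshold value, and the sample sentences are all hypothetical, and a plain token-overlap F1 stands in for the fuzzy logic-based evaluation metric, whose details are not given in this summary. The translations are assumed to have already been produced by several machine translation systems.

```python
from collections import Counter


def token_f1(hypothesis: str, candidate: str) -> float:
    """Token-overlap F1 between two sentences.

    Assumption: a simple stand-in for the paper's fuzzy logic-based metric.
    """
    hyp = Counter(hypothesis.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((hyp & cand).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(hyp.values())
    return 2 * precision * recall / (precision + recall)


def best_pair(source, translations, target_pool, threshold=0.5):
    """Pair a source sentence with its best-scoring target sentence.

    translations: outputs of several MT systems for `source` (hypothetical input).
    target_pool:  target-language sentences from the comparable corpus.
    Returns (source, target, score), or None if no candidate reaches `threshold`.
    """
    best_score, best_target = 0.0, None
    for target in target_pool:
        # Score a candidate by its best agreement with any system's translation.
        score = max((token_f1(t, target) for t in translations), default=0.0)
        if score > best_score:
            best_score, best_target = score, target
    if best_target is None or best_score < threshold:
        return None
    return (source, best_target, best_score)


if __name__ == "__main__":
    translations = ["the weather is nice today", "today weather is good"]
    pool = ["The weather is nice today", "Stock markets fell sharply"]
    print(best_pair("आज मौसम अच्छा है", translations, pool))
```

Taking the maximum over the systems' translations is only one possible way to combine them; the paper uses a dedicated system combination module for this purpose.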

Why is it important?

The proposed algorithm addresses the critical need for high-quality parallel corpora in Hindi-English machine translation by leveraging comparable corpora mined from the web, thereby automating the generation process and enhancing translation accuracy through collaborative evaluation and selection mechanisms.

Perspectives

High-quality parallel corpora for Hindi-English machine translation remain difficult to obtain. Leveraging comparable corpora together with collaborative techniques can effectively address this limitation, leading to significant improvements in translation accuracy and resource availability.

Dr. Debajyoty Banik

Read the Original

This page is a summary of: Fuzzy Influenced Process to Generate Comparable to Parallel Corpora, ACM Transactions on Asian and Low-Resource Language Information Processing, June 2023, ACM (Association for Computing Machinery),
DOI: 10.1145/3599235.