What is it about?

This paper deals with language identification in the domain of web documents. The proposed system is built on hidden Markov models (HMMs) that enable the modeling of character sequences. Furthermore, the use of HMMs provides the means for language tracking, that is, language identification across the segments of a multilingual document.

Why is it important?

Automatic language identification in written text deserves significant attention given the ever-growing volume of web documents. The proposed system is built on hidden Markov models (HMMs), whose flexible stochastic properties enable the modeling of character sequences. To our knowledge, the use of HMMs had not previously been examined for this task. A parallel structure of discrete HMMs is used in the training phase. During testing, a previously unseen document is divided into sentences, and each sentence is independently classified by the language it is written in, using appropriate HMM features. Several HMM parameters are examined and adjusted to improve the results. Experiments on short, sentence-long documents written in five European languages demonstrated very high identification rates: for documents of about 140 characters, the average identification rate was 99%. Furthermore, HMMs allow for language tracking, that is, language identification across the segments of a multilingual document, which is a promising application of the proposed method.
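To make the idea of a parallel bank of discrete HMMs concrete, here is a minimal sketch of sentence-level scoring. It is an illustration only, not the paper's implementation: the two-symbol alphabet, the two toy models `lang_A` and `lang_B`, their hand-set probabilities, and all function names are assumptions made for this example. In a real system each model would be trained on text in one language (for instance with Baum-Welch) and the alphabet would cover the character set of the languages involved.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def make_model(init, trans, emit):
    """Hypothetical helper: convert HMM probability tables to log space."""
    return (
        [safe_log(p) for p in init],
        [[safe_log(p) for p in row] for row in trans],
        [[safe_log(p) for p in row] for row in emit],
    )

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(obs, log_init, log_trans, log_emit):
    """log P(obs | model) for a discrete HMM, via the forward algorithm."""
    if not obs:
        raise ValueError("need at least one observed symbol")
    states = range(len(log_init))
    alpha = [log_init[s] + log_emit[s][obs[0]] for s in states]
    for symbol in obs[1:]:
        alpha = [
            log_sum_exp([alpha[p] + log_trans[p][s] for p in states])
            + log_emit[s][symbol]
            for s in states
        ]
    return log_sum_exp(alpha)

def identify_language(sentence, models, alphabet):
    """Score the sentence under every language model and return the best one."""
    obs = [alphabet[c] for c in sentence if c in alphabet]
    scores = {
        lang: forward_log_likelihood(obs, *params)
        for lang, params in models.items()
    }
    return max(scores, key=scores.get)

def track_languages(sentences, models, alphabet):
    """Language tracking: label each sentence of a document independently."""
    return [identify_language(s, models, alphabet) for s in sentences]

# Toy data (assumed for illustration): two symbols, two states per model.
ALPHABET = {"a": 0, "b": 1}
UNIFORM = [[0.5, 0.5], [0.5, 0.5]]
MODELS = {
    "lang_A": make_model([0.5, 0.5], UNIFORM, [[0.9, 0.1], [0.8, 0.2]]),
    "lang_B": make_model([0.5, 0.5], UNIFORM, [[0.1, 0.9], [0.2, 0.8]]),
}
```

With these toy models, `track_languages(["aaab", "bbba"], MODELS, ALPHABET)` labels the first sentence `lang_A` and the second `lang_B`. Working in log space keeps the forward recursion numerically stable on longer character sequences.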

Perspectives

This is an innovative and influential paper. It is cited in the literature reviews of several peer-reviewed articles as a new method of text-based language identification, and it has received more than 45 Google Scholar citations.

Alexandros Xafopoulos
University College London

Read the Original

This page is a summary of: Language identification in web documents using discrete HMMs, Pattern Recognition, March 2004, Elsevier, DOI: 10.1016/j.patcog.2003.05.001.
