What is it about?
The speech signal of a word uttered by a speaker at two different times varies in energy, duration, and other characteristics, so locating occurrences of a query by matching similar speech signals is difficult. The task becomes even harder when the speech database is large and contains utterances from many speakers, with variations due to differences in gender, age, speaking style, etc. Spoken Term Detection (STD) refers to the process of locating the occurrences of spoken queries in a large speech database. For the STD task, all that is available is the speech signal of the query and the search database. This work describes how these speech signals are processed to find the query locations.
Why is it important?
Generally, two approaches have been adopted for STD: Automatic Speech Recognition (ASR) based label-sequence matching, and feature-based template matching. ASR-based techniques rely on phoneme models of a language, which require a considerable amount of labelled training data in that language. Such techniques are therefore language-dependent, and it is not feasible to develop an ASR system for every language. Feature-based template matching addresses the task in a language-independent manner, but it is computationally expensive. This work combines the strengths of both approaches by introducing a multistage architecture to address the task of STD for low-resourced languages.
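Feature-based template matching is commonly built on dynamic time warping (DTW), which aligns two feature sequences even when their durations differ, at a cost quadratic in the sequence lengths. The following is a minimal sketch of that idea, not the paper's actual matching stage; the function name and the toy signals are illustrative:

```python
import numpy as np

def dtw_distance(query, segment):
    """DTW distance between two feature sequences (e.g. MFCC frames).

    query, segment: arrays of shape (n_frames, n_dims).
    Returns the length-normalised cumulative distance along the best
    warping path, which tolerates differences in speaking rate/duration.
    """
    n, m = len(query), len(segment)
    # Pairwise Euclidean distances between all frame pairs
    cost = np.linalg.norm(query[:, None, :] - segment[None, :, :], axis=2)
    # Accumulated-cost matrix; the nested loop is what makes DTW
    # computationally heavy on large databases.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # stretch the query
                acc[i, j - 1],      # stretch the segment
                acc[i - 1, j - 1],  # advance both
            )
    return acc[n, m] / (n + m)

# Toy demo: the "same word" spoken more slowly (a stretched sinusoid)
# matches the reference better than an unrelated signal does.
word = np.sin(2 * np.pi * 3 * np.linspace(0, 1, 50))[:, None]
word_slow = np.sin(2 * np.pi * 3 * np.linspace(0, 1, 80))[:, None]
other = np.random.default_rng(0).normal(size=(60, 1))
print(dtw_distance(word, word_slow) < dtw_distance(word, other))
```

Running such a comparison for every candidate position in a large database is what makes pure template matching expensive, which motivates a fast first stage that prunes candidates before a finer second-stage match.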
Read the Original
This page is a summary of: Two-stage spoken term detection system for under-resourced languages, IET Signal Processing, July 2020, the Institution of Engineering and Technology (the IET), DOI: 10.1049/iet-spr.2019.0131.