What is it about?

The paper documents the long-term research that led to the current unrestricted text-to-speech system for spoken English, based on an acoustic "tube model" of the human vocal tract. It describes how various problems were overcome, links the work to the research of others, and explains the methods and tools used, as well as their potential for speech research in general. Because the tube model is a reasonably accurate representation of the acoustic behaviour of the real vocal tract, and can be controlled in real time, voices equivalent to different speakers can be imitated, producing speech on demand from normally punctuated text. The speech is also more natural, because the model directly includes the energy balance between the oral and nasal passages, the losses through the cheeks and throat, the radiation characteristics of the mouth and nose, and the like. The existing system ("Gnuspeech") uses the databases we generated for spoken English. The tools used to produce these databases are described and illustrated, and links are provided to the on-line databases, to samples of the synthesised speech, to the manuals for the tools, and to the source code for all the components involved. The same approaches and tools could be used to develop databases for other languages. The material is available under the Free Software Foundation's General Public Licence (GPL), which allows the work to be used by all comers. The system runs under Apple's OS X; the synthesiser is also available under Linux, though the tools are still being ported.
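To make the "tube model" idea concrete, here is a minimal sketch of the classic Kelly-Lochbaum formulation of such a model: the vocal tract is approximated as a chain of short cylindrical sections, and at each junction a travelling pressure wave is partly transmitted and partly reflected according to the change in cross-sectional area. This is an illustration of the general technique, not Gnuspeech's actual code; the function names are illustrative.

```python
def reflection_coefficients(areas):
    """Reflection coefficient k_i at the junction between tube
    sections i and i+1: k_i = (A_i - A_{i+1}) / (A_i + A_{i+1}).
    For positive areas |k_i| <= 1; k_i = 0 means no discontinuity."""
    return [(a0 - a1) / (a0 + a1) for a0, a1 in zip(areas, areas[1:])]

def scatter(f_in, b_in, k):
    """One Kelly-Lochbaum scattering junction. f_in is the forward
    (rightward) pressure wave arriving from the left section, b_in
    the backward wave arriving from the right. Returns (f_out, b_out),
    the transmitted and reflected waves; total pressure and volume
    velocity remain continuous across the junction."""
    f_out = (1.0 + k) * f_in - k * b_in
    b_out = k * f_in + (1.0 - k) * b_in
    return f_out, b_out
```

Moving the articulators changes the area profile over time, and hence the reflection coefficients, which is what makes real-time control of a voice with this kind of model possible.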

Why is it important?

The work is important because it is the first complete real-time text-to-speech system based on low-level articulatory synthesis using a reasonably accurate model of the vocal tract; because it can be used in psychophysical experiments on speech; and because it can be used to produce databases for other languages and then speak them. Extensions would be needed for languages involving implosives, clicks, and the like, but the terms of the GPL allow this. The system offers a new starting point for research on speech, and for speech synthesis by computer.

Perspectives

I was really pleased that the reviewers and editors were able to see the real merit in this paper, and I thank them for their improvements and careful work. What is described represents an important segment of what I spent my career doing: human-computer interaction in general, and computer speech recognition and synthesis in particular. The work is original and has yet to be duplicated. The involvement of people who are interested and would like to develop the system and tools further would be most welcome.

David Hill
University of Calgary

Read the Original

This page is a summary of: Low-level articulatory synthesis: A working text-to-speech solution and a linguistic tool, The Canadian Journal of Linguistics / La revue canadienne de linguistique, June 2017, Cambridge University Press,
DOI: 10.1017/cnj.2017.15.
