Speech Processing topics
A list of potential topics for PhD students in the area of Speech Processing.
Topics in unsupervised speech processing and/or modelling infant speech perception
Supervisor: Sharon Goldwater
Work in unsupervised (or 'zero-resource') speech processing (see Aren Jansen et al., ICASSP 2013) has begun to investigate methods for extracting repeated units (phones, words) from raw acoustic data as a possible way to index and summarize speech files without the need for transcription. This could be especially useful in languages where there is little data to develop supervised speech recognition systems. In addition, it raises the question of whether similar methods could be used to model the way that human infants begin to identify words in the speech stream of their native language. Unsupervised speech processing is a growing area of research with many interesting open questions, so a number of projects are possible. Projects could focus mainly on ASR technology or mainly on modeling language acquisition; specific research questions will depend on this choice. Here are just two possibilities: (1) Unsupervised learners are more sensitive to input representation than are supervised learners, and preliminary work suggests that MFCCs are not necessarily the best option. Investigate how to learn better input representations (e.g., using neural networks) that are robust to speaker differences but encode linguistically meaningful differences. (2) Existing work in both speech processing and cognitive modeling suggests that trying to learn either words or phones alone may be too difficult, and that we in fact need to develop *joint learners* that simultaneously learn at both levels. Investigate models that can be used to do this and evaluate how joint learning can improve performance.
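Finding repeated units in raw audio needs some way to compare stretches of speech that differ in length and speaking rate. The following is a minimal sketch, assuming dynamic time warping (DTW) as the comparison; the topic description does not prescribe a method. The function name `dtw_distance` and the toy two-dimensional frames are illustrative stand-ins for real acoustic features such as MFCCs.

```python
import math


def dtw_distance(a, b):
    """Return the DTW alignment cost between two sequences of feature vectors.

    a, b: non-empty lists of equal-dimension numeric vectors (one per frame).
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i frames of a
    # with the first j frames of b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of a repeated against b[j-1]
                cost[i][j - 1],      # frame of b repeated against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Toy example: the second sequence is a slowed-down copy of the first.
fast = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
slow = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [2.0, 2.0]]
other = [[5.0, 5.0], [6.0, 6.0], [7.0, 7.0]]

same_cost = dtw_distance(fast, slow)
diff_cost = dtw_distance(fast, other)
```

In this toy case the slowed-down copy aligns at zero cost while the unrelated sequence does not, which is the property a repeated-unit search would exploit. A practical system would also need to search over segment boundaries, which this sketch leaves out.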
Deep neural network-based speech synthesis
DNNs may offer new ways to model the complex relationship between text and speech, compared to HMMs. In particular, they may enable more effective use of the highly-factorial nature of the linguistic representation derived from text, which in the HMM approach is a 'flat' sequence of phonetically- and prosodically-context-dependent labels.
We are interested in any topic concerning DNN-based speech synthesis, for example: novel text processing methods to extract new forms of linguistic representation, how to represent linguistic information at the input to the DNN, how to represent the speech signal (or vocoder parameters) at the output of the DNN, methods for control over the speaker/gender/accent/style of the output speech, and combinations of supervised, semi-supervised and unsupervised learning.
Hidden Markov model-based speech synthesis
The HMM, which is a statistical model that can be used to both classify and generate speech, offers an exciting alternative to concatenative methods for synthesising speech. As a consequence, most research effort around the world in speech synthesis is now focussed on HMMs because of the flexibility that they offer.
There are a number of topics we are interested in within HMM-based speech synthesis, including: speaker, language and accent adaptation; cross-language speaker adaptation; improving the signal processing and vocoding aspects of the model; unsupervised and semi-supervised learning.
Personification using affective speech synthesis
New approaches to capture, share and manipulate information in sectors such as health care and the creative industries require computers to enter the arena of human social interaction. Users readily adopt a social view of computers and previous research has shown how this can be harnessed in applications such as giving health advice, tutoring, or helping children overcome bullying.
However, whilst current speech synthesis technology is highly intelligible, it has not been able to deliver voices which aid this 'personification'. The lack of naturalness makes some synthetic voices sound robotic, while the lack of expressiveness makes others sound dull and lifeless.
In many of the above applications, it is less important to be able to render arbitrary text than it is to convey a sense of personality within a more limited domain. This project would therefore investigate two key problems: (1) merging expressive pre-recorded prompts with expressive unit selection speech synthesis; (2) dynamically altering voicing in speech to convey underlying levels of stress and excitement using source-filter decomposition techniques.
Cross-lingual acoustic models
Adapting speech recognition acoustic models from one language to another, with a focus on limited resources and unsupervised training.
Current speech technology is based on machine learning and trainable statistical models. These approaches are very powerful, but before a system can be developed for a new language considerable resources are required: transcribed speech recordings for acoustic model training; large amounts of text for language model training; and a pronunciation dictionary. Such resources are available for languages such as English, French, Arabic, and Chinese, but there are many less well-resourced languages. There is thus a need for models that can be adapted from one language to another with limited effort and resources. To address this we are interested in two (complementary) approaches. First, the development of lightly supervised and unsupervised training algorithms: speech recordings are much easier to obtain than transcriptions. Second, the development of models which can factor language-dependent and language-independent aspects of the speech signal, perhaps exploiting invariances derived from speech production. We have a particular interest in approaches (1) building on the subspace GMM framework, or (2) using deep neural networks.
Factorised acoustic models
Acoustic models which factor specific causes of variability, thus allowing more powerful adaptation for speech recognition, and greater control for speech synthesis.
Adaptation algorithms, such as MLLR, MAP, and VTLN, have been highly effective in acoustic modelling for speech recognition. However, current approaches only weakly factor the underlying information - for instance "speaker" adaptation will typically adapt for the acoustic environment and the task, as well as for different aspects of the speaker. It is of great interest to investigate speech recognition models which are able to factor the different sources of variability. PhD projects in this area will explore the development of factored models that enable specific aspects of a system to be adapted. For example, it is of great interest - for both speech recognition and speech synthesis - to be able to model accent in a specific way. We are interested in two modelling approaches which hold great promise for this challenge: subspace Gaussian mixture models, and deep neural networks.
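To make one of the adaptation methods named above concrete, here is a minimal sketch of MAP adaptation for the mean of a single Gaussian, using the standard interpolation between a prior mean and the adaptation data. The function name `map_adapt_mean` and the default relevance factor `tau` are illustrative assumptions, and every frame is assumed to belong fully to this one Gaussian; a real system would weight frames by their component occupancies.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP-adapt a Gaussian mean towards a set of adaptation frames.

    prior_mean: list of floats (e.g. a speaker-independent mean).
    frames:     list of feature vectors from the new speaker or condition.
    tau:        relevance factor; larger values trust the prior more.
    """
    n = len(frames)
    dim = len(prior_mean)
    # Per-dimension sum of the adaptation data.
    sums = [sum(frame[d] for frame in frames) for d in range(dim)]
    # Weighted combination: with little data the prior dominates,
    # with much data the estimate approaches the sample mean.
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


prior = [0.0, 0.0]
data = [[2.0, 4.0]] * 10          # ten identical toy frames
no_data = map_adapt_mean(prior, [])
adapted = map_adapt_mean(prior, data, tau=10.0)
```

With no adaptation data the prior mean is returned unchanged; with ten frames and `tau=10.0` the result lies halfway between the prior and the data mean. Note that this update moves the mean towards whatever varies in the adaptation data, whether speaker, environment or task, which is the weak factorisation described above.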
Robust broadcast speech recognition
Supervisor: Steve Renals
Current speech recognition technology has shown great promise in subtitling material such as news, but is brittle when faced with the full range of broadcast genres such as sport, game shows, and drama. Our industry partners have identified the transcription of noisy, reverberant speech, such as sports commentaries, as a particular challenge. We are interested in developing speech recognition models that can factorise different components of the audio signal, separating the target speech from interfering acoustic sources (e.g. crowd noise) and echo. References:
P Swietojanski and S Renals (2015). Differentiable pooling for unsupervised speaker adaptation.
In Proc IEEE ICASSP-2015.
P Swietojanski and S Renals (2014). Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models.
In Proc IEEE SLT-2014.
Distant speech recognition and overlapping speech
Supervisor: Steve Renals
Distant speech recognition, in which speech is captured using one or more distant microphones, is a major challenge in speech recognition. Specific problems include compensating for reverberation and dealing with multiple acoustic sources (including overlapping talkers). Research in this area will explore deep neural network and recurrent neural network acoustic models to handle reverberation and overlapping talkers, building on our recent work using convolutional neural networks and RNN encoder-decoder approaches. References:
S Renals, T Hain, and H Bourlard (2007). Recognition and interpretation of meetings: The AMI and AMIDA projects.
In Proc IEEE ASRU-2007.
S Renals and P Swietojanski (2014). Neural networks for distant speech recognition.
In Proc HSCMA-2014.
Multi-genre broadcast speech recognition
Supervisor: Steve Renals
Broadcast speech has been a target domain for speech recognition since the 1990s; however, most work has focused on specific genres such as news and weather forecasts. Multi-genre broadcast speech recognition, covering all types of broadcast material (e.g. sport, films, and reality TV), is a significant challenge due to much greater variability in speaking style, music and other sound effects, and overlapping talkers. In collaboration with the BBC, we have begun a programme of work in this area. Specific research topics in this area will include fundamental work in acoustic models and language models, using broadcast speech recognition as a testbed, and rapid adaptation to changes in acoustic environment, genre/topic, and speaker by exploiting available metadata. One topic of particular interest is recognition of broadcast speech with additive noise and reverberation (e.g. sports commentary). Our current approaches to acoustic and language modelling include recurrent and convolutional networks. References:
P Bell, P Swietojanski, and S Renals (2013). Multi-level adaptive networks in tandem and hybrid ASR systems.
In Proc IEEE ICASSP-2013.
P Bell and S Renals (2015). Complementary tasks for context-dependent deep neural network acoustic models.
In Proc Interspeech-2015.
S Renals et al (2015). The MGB Challenge.
Submitted to Proc IEEE ASRU-2015.
Multilingual and cross-lingual speech recognition
Supervisor: Steve Renals
We are interested in the development of new approaches to building speech recognition systems quickly and cheaply for new languages, which may be poorly resourced. We are concerned in particular with cross-lingual techniques which are able to exploit and transfer information across languages. References:
P Swietojanski, A Ghoshal, and S Renals (2012). Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR.
In Proc IEEE SLT-2012.
A Ghoshal, P Swietojanski, and S Renals (2013). Multilingual training of deep neural networks.
In Proc IEEE ICASSP-2013.
L Lu, A Ghoshal, and S Renals (2014). Cross-lingual subspace Gaussian mixture models for low-resource speech recognition.
IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(1):17–27.
P Bell, J Driesen, and S Renals (2014). Cross-lingual adaptation with multi-task adaptive networks.
In Proc Interspeech-2014.
Audio scene understanding
Supervisor: Steve Renals
The problem of audio scene understanding is to annotate an acoustic scene recorded using one or more microphones. This involves locating the acoustic sources, identifying them, and extracting semantic information from them. This is a new area for the group and we are interested in exploring recurrent neural networks and attention-based approaches to this task.
Natural interactive virtual agents
Supervisor: Hiroshi Shimodaira
Development of a lifelike animated character that is capable of establishing natural interaction with humans in terms of non-verbal signals.
Embodied conversational agents (ECAs) aim to foster natural communication between machines and humans. State-of-the-art technology in computer graphics has made it possible to create photo-realistic animation of human faces. However, the same cannot be said of the interactions between ECAs and humans, which are not as natural as those between humans. Although there are many reasons for this, the present project focuses on the non-verbal aspects of communication, such as gestures and gaze, and seeks to develop an ECA system that is capable of recognising the user's non-verbal signals and synthesising appropriate signals for the agent.
Gesture synthesis for lifelike conversational agents
Supervisor: Hiroshi Shimodaira
Development of a mechanism for controlling the gestures of photo-realistic lifelike agents when the agents are in various modes: idling, listening, speaking and singing.
Lifelike conversational agents, which behave like humans with facial animation and gesture and hold spoken conversations with humans, are one of the next-generation human interfaces. Much effort has been made so far to make the agents natural, especially in controlling mouth/lip movement and eye movement. On the other hand, controlling the non-verbal movements of the head, facial expressions, and shoulders has not been studied as much, even though those motions sometimes play a crucial role in naturalness and intelligibility. The purpose of the project is to develop a mechanism for creating such motions of photo-realistic lifelike agents when the agents are in various modes: idling, listening, speaking and singing. One of the outstanding features of the project is that it aims to give the agent a virtual personality by imitating the manner of movements/gestures of an existing person with the help of machine learning techniques used for text-to-speech synthesis.
Evaluating the impact of expressive speech synthesis on embodied conversational agents
Evaluation of embodied conversational agents (ECAs) has tended to concentrate on usability: do users like the system, and does the system achieve its objectives? We are not aware of any studies of this type that have controlled for the speech synthesis used and in which that synthesis was close to the state of the art. This project will develop evaluation based on measured interaction where expressive speech synthesis is the topic of study. It will explore carefully controlled interactive environments, and measure a subject's performance as well as the physiological effects of the experiment on the subject. It will explore how involved (emotionally or otherwise) the subject is with the ECA. The usability approach described above is also important for these experiments, but we also wish to determine how speech affects involvement over time. For example, we would expect to increase a subject's arousal by adding emotional elements to the speech, and we might expect to destroy a subject's involvement by intentionally producing an error which undermines the ECA's believability.