Spoken Language Processing

A list of potential topics for PhD students in the area of Speech Processing.

We welcome applications from talented and committed PhD candidates to join us in working on these or any other theme related to spoken language processing.  We are glad to consider proposals for ideas you would like to work on!

Multi-lingual speech recognition

Supervisor: Peter Bell

Adapting speech recognition acoustic models from one language to another, with a focus on limited resources and the use of self-supervised and unsupervised training.

Current speech technology is based on machine learning and trainable statistical models.  These approaches are very powerful, but before a system can be developed for a new language, considerable resources are required: transcribed speech recordings for acoustic model training; large amounts of text for language model training; and in many cases, a pronunciation dictionary.  Such resources are available for languages such as English, French, Arabic, and Chinese, but this topic focuses on the many less well-resourced or economically marginalised languages.  There is thus a need for models that can be adapted from one language to another with limited effort and resources.  We are particularly interested in “zero resource” scenarios, in which no transcribed speech data is available for a language of interest.
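
As a purely illustrative example of limited-resource adaptation, the sketch below reuses a pretrained acoustic encoder, freezes it, and trains only a small output layer over the target language's phone inventory with a CTC objective.  The encoder, dimensions, and data here are hypothetical placeholders, not a description of any specific system used in this project.

    import torch
    import torch.nn as nn

    # Illustrative sketch only: adapt a pretrained acoustic encoder to a new
    # language by freezing it and training a lightweight output layer with CTC.
    class AdaptedAcousticModel(nn.Module):
        def __init__(self, pretrained_encoder, encoder_dim, n_target_phones):
            super().__init__()
            self.encoder = pretrained_encoder            # trained on well-resourced data
            for p in self.encoder.parameters():          # freeze: target-language data is scarce
                p.requires_grad = False
            self.output = nn.Linear(encoder_dim, n_target_phones + 1)  # +1 for the CTC blank

        def forward(self, features):                     # features: (batch, time, feat_dim)
            hidden = self.encoder(features)              # (batch, time, encoder_dim)
            return self.output(hidden).log_softmax(dim=-1)

    ctc_loss = nn.CTCLoss(blank=0)

    def adaptation_step(model, optimiser, features, targets, feat_lens, target_lens):
        """One training step on a small transcribed batch from the target language."""
        log_probs = model(features).transpose(0, 1)      # CTCLoss expects (time, batch, classes)
        loss = ctc_loss(log_probs, targets, feat_lens, target_lens)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
        return loss.item()

    # A tiny stand-in encoder; in practice this would be a large (self-supervised) model.
    model = AdaptedAcousticModel(nn.Sequential(nn.Linear(40, 256), nn.ReLU()),
                                 encoder_dim=256, n_target_phones=42)
    optimiser = torch.optim.Adam(model.output.parameters(), lr=1e-4)
    features = torch.randn(2, 120, 40)                   # two utterances of 120 frames
    targets = torch.randint(1, 43, (2, 20))              # phone labels (0 is the blank)
    adaptation_step(model, optimiser, features, targets,
                    feat_lens=torch.tensor([120, 120]), target_lens=torch.tensor([20, 20]))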

General topics in automatic speech recognition

Supervisor: Peter Bell

We are interested in many different areas of automatic speech recognition, including:

  • Adaptation of neural network models

  • Connections between end-to-end models and hybrid-HMM models

  • Algorithms for efficient alignment, search and decoding of audio data (a minimal decoding sketch follows this list)

  • Audio visual speech recognition and enhancement

  • Speaker identification and diarization

  • Fairness and bias in speech recognition algorithm development  
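
To make the decoding item above concrete, here is a minimal, illustrative sketch of the simplest decoding strategy for a CTC-trained acoustic model: greedy (best-path) decoding, which picks the most likely class per frame, merges repeats and removes blanks.  It is a toy example under stated assumptions, not a description of the group's decoders.

    import numpy as np

    def ctc_greedy_decode(log_probs, blank=0):
        """Collapse a (time, classes) matrix of per-frame log-probabilities into
        a label sequence: best class per frame, merge repeats, drop blanks."""
        best_path = log_probs.argmax(axis=1)
        decoded, previous = [], None
        for symbol in best_path:
            if symbol != previous and symbol != blank:
                decoded.append(int(symbol))
            previous = symbol
        return decoded

    # Toy example: 6 frames over 4 classes, where class 0 is the CTC blank.
    frames = np.log(np.array([
        [0.10, 0.80, 0.05, 0.05],   # -> 1
        [0.10, 0.80, 0.05, 0.05],   # -> 1 (repeat, merged)
        [0.90, 0.04, 0.03, 0.03],   # -> blank
        [0.10, 0.05, 0.80, 0.05],   # -> 2
        [0.10, 0.05, 0.05, 0.80],   # -> 3
        [0.90, 0.04, 0.03, 0.03],   # -> blank
    ]))
    print(ctc_greedy_decode(frames))    # [1, 2, 3]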

Next frontiers in speech generation

Supervisors: Simon King, Catherine Lai, Korin Richmond, Hao Tang

There is a long history of research into speech generation at the University of Edinburgh.  From as far back as the 1950s (with the "Parametric Artificial Talker"), Edinburgh has been continuously active and at the forefront of the field.  Today, we are as active as ever, with a strong group of academics, PhD candidates and postdoctoral researchers working on a broad range of topics, currently including:

  • state-of-the-art neural acoustic modelling and waveform generation

  • controllable and context-aware speech synthesis, including prosody control, "long-form" synthesis using wide textual context, human-in-the-loop approaches, and synthetic speech for interactive applications such as dialogue (a minimal conditioning sketch follows this list)

  • under-resourced languages

  • synthetic speech evaluation 

  • text normalisation and frontend processing

  • pronunciation modelling (e.g. multi-accent, accent-independent representations)

  • self-supervised learning representations of speech

  • human perception of synthetic speech (e.g. in terms of emotion, style)
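
As a toy illustration of the controllability theme above, the sketch below conditions a very small acoustic model on a learned style embedding so the same text can be rendered with different prosodic settings.  All names and dimensions are hypothetical assumptions; duration modelling and a vocoder are deliberately omitted.

    import torch
    import torch.nn as nn

    # Illustrative sketch only: predict mel-spectrogram frames from phone IDs,
    # conditioned on a style/prosody embedding chosen at synthesis time.
    class ControllableAcousticModel(nn.Module):
        def __init__(self, n_phones=60, n_styles=8, embed_dim=128, n_mels=80):
            super().__init__()
            self.phone_embedding = nn.Embedding(n_phones, embed_dim)
            self.style_embedding = nn.Embedding(n_styles, embed_dim)
            self.encoder = nn.GRU(embed_dim, embed_dim, batch_first=True,
                                  bidirectional=True)
            self.to_mel = nn.Linear(2 * embed_dim, n_mels)

        def forward(self, phone_ids, style_id):
            # phone_ids: (batch, time), style_id: (batch,)
            x = self.phone_embedding(phone_ids)
            style = self.style_embedding(style_id).unsqueeze(1)   # (batch, 1, dim)
            hidden, _ = self.encoder(x + style)                   # style broadcast over time
            return self.to_mel(hidden)                            # (batch, time, n_mels)

    model = ControllableAcousticModel()
    phones = torch.randint(0, 60, (2, 25))                 # the same phone sequence twice
    mels_neutral = model(phones, torch.tensor([0, 0]))     # one style setting
    mels_expressive = model(phones, torch.tensor([3, 3]))  # same text, a different style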

Pronunciation modelling

Supervisor: Korin Richmond

Many areas of speech technology rely upon having a representation of "how something is said"; pronunciation is a key bridge between the discrete symbolic words that exist in written form, or in the human mind, and the speech sounds that constitute their corresponding spoken realisation.  The mainstays of pronunciation modelling have hitherto been expert-devised discrete phone sets, hand-crafted pronunciation lexica, grapheme-to-phoneme mappings, and post-lexical processing.  Machine learning can, however, be employed in a variety of interesting and increasingly appealing ways, and there remains much to be done on topics such as:

  • automatic lexicon creation (pronunciation “harvesting”)

  • machine learning grapheme-to-phoneme (G2P) models (see the sketch after this list)

  • morphological decomposition 

  • text normalisation (token classification, grammar induction)

  • derivational morphology for pronunciation prediction

  • multi-accent approaches

  • automatic accent recognition

  • self-supervised learning representations versus traditional phonetics

  • discrete versus continuous feature pronunciations

  • pronunciation continua
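
As a concrete illustration of the G2P item above, here is a deliberately small sequence-to-sequence sketch that maps a word (as character IDs) to a phone sequence.  Vocabulary sizes, dimensions and the data are placeholder assumptions; a realistic G2P model would also need attention, beam search and handling of pronunciation variants.

    import torch
    import torch.nn as nn

    # Illustrative sketch only: an encoder reads graphemes, a decoder emits phones.
    class Seq2SeqG2P(nn.Module):
        def __init__(self, n_graphemes=30, n_phones=45, hidden=256):
            super().__init__()
            self.char_embed = nn.Embedding(n_graphemes, hidden)
            self.phone_embed = nn.Embedding(n_phones, hidden)
            self.encoder = nn.GRU(hidden, hidden, batch_first=True)
            self.decoder = nn.GRU(hidden, hidden, batch_first=True)
            self.output = nn.Linear(hidden, n_phones)

        def forward(self, graphemes, phones_in):
            # graphemes: (batch, n_chars); phones_in: gold phones shifted right
            _, state = self.encoder(self.char_embed(graphemes))
            decoded, _ = self.decoder(self.phone_embed(phones_in), state)
            return self.output(decoded)          # per-step logits over the phone set

    model = Seq2SeqG2P()
    loss_fn = nn.CrossEntropyLoss()
    graphemes = torch.randint(0, 30, (8, 12))    # a batch of 8 words as character IDs
    phones_in = torch.randint(0, 45, (8, 10))    # teacher-forcing inputs
    phones_out = torch.randint(0, 45, (8, 10))   # targets (inputs shifted left)
    logits = model(graphemes, phones_in)
    loss = loss_fn(logits.reshape(-1, 45), phones_out.reshape(-1))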

Articulatory data and modelling

Supervisor: Korin Richmond

Any speech uttered by a human can be viewed as having two representations: the audio signal we hear, and the articulator movements which created that sound.

There has been much work on developing ways to measure and record speech articulation, and then exploiting that data in a variety of speech technology applications.  Example applications of interest include:

  • Silent Speech Interfaces, where we seek to use measurements of a user's mouth movements alone either to recognise what they are saying (i.e. ASR) or to generate an audible speech signal for others to hear (i.e. "voice reconstruction", such as https://jxzhanggg.github.io/TaLNet_demos/)

  • Inversion mapping, where we seek to automatically infer articulator movements from a given audio signal (a minimal inversion sketch follows this list).  This has applications in: providing visual biofeedback for speech therapy or computer-assisted pronunciation training (CAPT); lip-synching and facial animation for characters in animated films or games; and providing articulatory data for any other speech technology application which needs it but where only an audio signal is available.

  • Combined audio-visual synthesis, where we want to animate an avatar from text for example, generating the speech audio and mouth movements simultaneously so they match perfectly.
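
As a toy illustration of the inversion-mapping item above, the sketch below treats inversion as frame-level regression from acoustic features to articulator positions (for example, the x/y coordinates of a handful of EMA coils).  Feature dimensions and the random data are assumptions for illustration, not properties of any particular corpus.

    import torch
    import torch.nn as nn

    # Illustrative sketch only: map acoustic frames to articulatory trajectories.
    class InversionModel(nn.Module):
        def __init__(self, n_acoustic=13, n_articulatory=12, hidden=128):
            super().__init__()
            self.rnn = nn.LSTM(n_acoustic, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_articulatory)

        def forward(self, acoustics):            # (batch, frames, n_acoustic)
            hidden, _ = self.rnn(acoustics)
            return self.out(hidden)              # (batch, frames, n_articulatory)

    model = InversionModel()
    mse = nn.MSELoss()
    acoustics = torch.randn(4, 200, 13)          # e.g. 13 MFCCs per frame
    articulation = torch.randn(4, 200, 12)       # e.g. x/y positions of 6 EMA coils
    loss = mse(model(acoustics), articulation)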

Numerous articulographic data sources are available (e.g. EMA, EPG, EGG, MRI, mocap, etc.), but ultrasound is a particular current focus. CSTR's multi-speaker TaL corpus (https://doi.org/10.1109/SLT48900.2021.9383619 or https://arxiv.org/abs/2011.09804) is a resource that is ready to be exploited for a number of novel projects.  The fundamental questions we encounter when using articulation in speech technology include:

  • what constraints and inductive biases can articulatory representations bring to speech modelling?

  • how can we use articulatory data in speaker-independent ways?

  • how can we integrate multiple sources/modalities of articulatory data?

  • how does articulation vary between different speech types and speaking styles?

  • what forms of articulatory data are most amenable to modelling (e.g. point-tracking versus the various types of imaging data)?

  • what is the most effective way of using imaging data (e.g. ultrasound) in speech technology applications? 

Natural interactive virtual agents

Supervisor: Hiroshi Shimodaira

Development of a lifelike animated character that is capable of establishing natural interaction with humans in terms of non-verbal signals.

Embodied conversational agents (ECAs) aim to foster natural communication between machines and humans. State-of-the-art technology in computer graphics has made it possible to create photo-realistic animation of human faces. However, the same cannot yet be said of interaction: interactions between ECAs and humans are not as natural as those between humans. Although there are many reasons for this, the present project focuses on the non-verbal aspects of communication, such as gesture and gaze, and seeks to develop an ECA system that is capable of recognising a user's non-verbal signals and synthesising appropriate signals for the agent.

Gesture synthesis for lifelike conversational agents

Supervisor: Hiroshi Shimodaira

Development of a mechanism for controlling the gestures of photo-realistic lifelike agents when the agents are in various modes: idling, listening, speaking and singing.

Lifelike conversational agents, which behave like humans with facial animation and gesture and hold spoken conversations with humans, are one of the next generation of human interfaces. Much effort has been made so far to make such agents natural, especially in controlling mouth/lip and eye movement. On the other hand, control of the non-verbal movements of the head, facial expressions and shoulders has not been studied as much, even though these motions sometimes play a crucial role in naturalness and intelligibility. The purpose of the project is to develop a mechanism for creating such motions of photo-realistic lifelike agents when the agents are in various modes: idling, listening, speaking and singing. One of the outstanding features of the project is that it aims to give the agent a virtual personality by imitating the manner of movement/gesture of an existing person, with the help of machine learning techniques used for text-to-speech synthesis.

Evaluating the impact of expressive speech synthesis on embodied conversational agents

Supervisor: Hiroshi Shimodaira

Evaluation of embodied conversational agents (ECAs) has tended to concentrate on usability: do users like the system, and does the system achieve its objectives?  We are not aware of any studies of this type which have controlled for the speech synthesis used, where that synthesis was close to the state of the art. This project will develop evaluation based on measured interaction, where expressive speech synthesis is the topic of study. It will explore carefully controlled interactive environments, and measure a subject's performance as well as the physiological effects of the experiment on the subject. It will explore how involved (emotionally or otherwise) our subject is with the ECA. The usability approach described above is also important for these experiments, but we also wish to determine how speech affects involvement over time. For example, we would expect to increase a subject's arousal by adding emotional elements to the speech, and we might expect to destroy a subject's involvement by intentionally producing an error which undermines the ECA's believability.

Self-supervised learning for speech representation learning

Supervisors: Hao Tang, Peter Bell, Sharon Goldwater

Self-supervised learning has made it possible to improve a wide range of tasks using unlabeled data. However, the connection between the training objective, e.g., masked reconstruction, and the improvement on tasks is generally loose. A low masked reconstruction loss does not always translate to improvements on tasks, leaving practitioners with few means of debugging when one self-supervised model performs worse than another. One approach to strengthening the connection between the training objective and the improvement on tasks is to include modeling assumptions on the learned representations, such as discreteness, slowness, segmental constraints, hyperbolic space assumptions, and space alignment assumptions, to name a few. In this topic, we study the following questions (a minimal sketch of a masked-reconstruction objective follows the list).

  • Do these modeling assumptions permit efficient implementations?

  • How do individual modeling assumptions change the learned representations?

  • How do measures of individual modeling assumptions correlate with improvements on tasks?

  • How can individual modeling assumptions be used to include side information, e.g., to achieve grounding in self-supervised representations?

  • What information is contained in self-supervised representations and how is this affected by the model structure and choice of pretext tasks?
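
The sketch below, purely for illustration, shows the kind of masked-reconstruction pretext task referred to above: random frames are zeroed out and the model is trained to reconstruct them, with the loss measured only on the masked positions.  The architecture, masking scheme and data are hypothetical stand-ins onto which assumptions such as discreteness or slowness could then be added.

    import torch
    import torch.nn as nn

    # Illustrative sketch only: a masked-reconstruction objective on unlabeled frames.
    class MaskedReconstructionModel(nn.Module):
        def __init__(self, n_features=80, hidden=256):
            super().__init__()
            self.encoder = nn.GRU(n_features, hidden, num_layers=3, batch_first=True)
            self.reconstruct = nn.Linear(hidden, n_features)

        def forward(self, frames):                      # (batch, time, n_features)
            hidden, _ = self.encoder(frames)
            return self.reconstruct(hidden)

    def masked_reconstruction_loss(model, frames, mask_prob=0.15):
        mask = torch.rand(frames.shape[:2]) < mask_prob       # (batch, time)
        corrupted = frames.masked_fill(mask.unsqueeze(-1), 0.0)
        predicted = model(corrupted)
        # The loss is measured only on the masked positions.
        return ((predicted - frames) ** 2)[mask].mean()

    model = MaskedReconstructionModel()
    frames = torch.randn(8, 300, 80)         # e.g. unlabeled log-mel spectrograms
    loss = masked_reconstruction_loss(model, frames)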

Model compression for speech models

Supervisor: Hao Tang

To make use of the growing amount of speech data, the size of speech models, particularly for self-supervised learning and automatic speech recognition, has also increased rapidly in the past few years. One can use model compression techniques, such as pruning, to obtain small models without much loss of performance. However, prior work has been content with finding a small model that performs well, largely ignoring the computational and memory cost of compression itself and ignoring other differences between the small and large models. The goal of this topic is to develop computation- and memory-efficient techniques for model compression, while combining analytical tools to understand the interplay between model compression and representation learning. In particular, we aim to answer the following research questions (a minimal pruning sketch follows the list).

  • What are the key properties of large neural networks that allow us to search for small models given the large ones?

  • What is lost when compressing a large model into a small one?

  • How do the learned representations change when a model is compressed?

  • What does compression tell us about the roles of individual components in the model?
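
As a minimal illustration of one compression technique mentioned above, the sketch below applies unstructured magnitude (L1) pruning to a small stand-in model using PyTorch's pruning utilities, then counts the surviving weights.  The model and the 70% pruning amount are arbitrary choices made for illustration only.

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Illustrative sketch only: magnitude-prune each linear layer of a toy model.
    model = nn.Sequential(                   # stand-in for a large speech model
        nn.Linear(80, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, 40),
    )

    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.7)  # drop 70% of weights
            prune.remove(module, "weight")   # make the pruning permanent

    total = sum(p.numel() for p in model.parameters())
    nonzero = sum((p != 0).sum().item() for p in model.parameters())
    print(f"remaining non-zero parameters: {nonzero}/{total}")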

Understanding the non-lexical aspects of spoken language

Supervisors: Catherine Lai, Peter Bell

To build models that really understand the richness of spoken communication, we need to consider not just what was said, but also how it was said.  That is, we can consider speech to be a multichannel communication system where both the lexical content (i.e. the transcript) and speech acoustics (e.g. the prosody) contribute to meaning.  Both channels also create expectations on the discourse and dialogue to come.  This project would build on our ongoing work which investigates what characteristics of communicative context affect listeners’ expectations of speech and how we can model this computationally for both spoken language understanding and speech generation.  Recently, we’ve been looking at dialogue continuations and emotion recognition, but we’re also more generally interested in what lexical and non-lexical aspects of speech tell us about spoken dialogue structure (e.g. topics of conversation) and expression/perception of affect (e.g. emotion and stance).
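
To make the "how it was said" channel concrete, here is a small, illustrative sketch of extracting basic prosodic features (a pitch contour and an energy contour) with librosa.  The file name is a placeholder, and the summary statistics are simply examples of the kind of non-lexical features a model might use.

    import librosa
    import numpy as np

    # Illustrative sketch only; "utterance.wav" is a placeholder path.
    audio, sr = librosa.load("utterance.wav", sr=16000)

    # Fundamental frequency (F0) contour; unvoiced frames come back as NaN.
    f0, voiced_flag, voiced_prob = librosa.pyin(audio, sr=sr,
                                                fmin=librosa.note_to_hz("C2"),
                                                fmax=librosa.note_to_hz("C7"))
    # Frame-level energy contour.
    energy = librosa.feature.rms(y=audio)[0]

    # Crude utterance-level summaries of "how it was said".
    print("mean F0 (voiced frames):", np.nanmean(f0))
    print("F0 range:", np.nanmax(f0) - np.nanmin(f0))
    print("mean energy:", float(energy.mean()))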

Topics in unsupervised speech processing and/or modelling infant speech perception

Supervisor: Sharon Goldwater

Work in unsupervised (or 'zero-resource') speech processing encompasses areas such as learning sub-word representations that better capture phonetic or other linguistic information, and identifying repeated units such as phones or words in a target language without any transcribed training data. Systems that solve these tasks could be useful as a way to index and summarize speech files in under-resourced languages. In some cases, they may also be useful as cognitive models, allowing us to investigate how human infants might begin to identify words in the speech stream of their native language.  There are many open questions, so various projects are possible.  Projects could focus mainly on speech technology or mainly on modeling language acquisition; specific research questions will depend on this choice.  Here are just two possibilities:

  • Recent work on self-supervised learning (a type of unsupervised learning) has shown that SSL models can learn representations for speech that factor out some of the speaker and contextual variability. However, the nature of these representations, and whether they handle such variability in similar ways to humans, is not well understood. Investigate these questions by developing new ways to analyze and compare SSL representations to each other and to infant or adult speech perception results.

  • Existing work in both speech processing and cognitive modeling suggests that trying to learn either words or phones alone may be too difficult, and that we in fact need to develop *joint learners* that simultaneously learn at both levels.  Investigate models that can be used to do this and evaluate how joint learning can improve performance.
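
As one illustrative take on the first possibility (comparing SSL representations to each other), the sketch below computes linear centered kernel alignment (CKA) between two matrices of frame-level features for the same audio.  The matrices here are random placeholders standing in for the outputs of two hypothetical models; this is one simple comparison method, not a prescribed approach for the project.

    import numpy as np

    # Illustrative sketch only: linear CKA between two representation matrices
    # (rows = frames of the same audio, columns = feature dimensions).
    def linear_cka(x, y):
        x = x - x.mean(axis=0)                      # centre each feature dimension
        y = y - y.mean(axis=0)
        cross = np.linalg.norm(y.T @ x, "fro") ** 2
        norm_x = np.linalg.norm(x.T @ x, "fro")
        norm_y = np.linalg.norm(y.T @ y, "fro")
        return float(cross / (norm_x * norm_y))

    # The same 500 frames as represented by two (hypothetical) models of different widths.
    model_a = np.random.randn(500, 768)
    model_b = np.random.randn(500, 512)
    print("CKA(model_a, model_b):", linear_cka(model_a, model_b))   # small for unrelated random features
    print("CKA(model_a, model_a):", linear_cka(model_a, model_a))   # 1.0 (up to rounding) with itself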