Natural Language Processing and Computational Linguistics
A list of potential topics for PhD students in the area of Language Processing.
Concurrency in (computational) linguistics
Improving understanding of synchronic and diachronic aspects of phonology.
Supervisor: Julian Bradfield
In several aspects of linguistic analysis, it is natural to think of some form of concurrent processing, not least because the brain is a massively concurrent system. This is particularly true in phonology and phonetics, and descriptions such as feature analyses, and especially autosegmental phonology, go some way to recognizing this. Although there has been some work on rigorous formal models of such descriptions, there has been little if any application of the extensive body of research in theoretical computer science on concurrent processes. Such a project has the potential to give better linguistic understanding of synchronic and diachronic aspects of phonology and perhaps syntax, and even to improve speech generation and recognition, by adding formal underpinning and improvement to the existing agent-based approaches.
Spectral learning for natural language processing
Supervisors: Shay Cohen, Mirella Lapata
Latent variable modeling is a common technique for improving the expressive power of natural language processing models. The values of these latent variables are not observed in the data, yet we must still predict them and estimate the model parameters under the assumption that the variables exist. This project seeks to improve the expressive power of NLP models at various levels (morphological, syntactic and semantic) using latent variable modeling, and to identify key techniques, based on spectral algorithms, for learning these models. Latent-variable spectral learning algorithms are an exciting recent development that originated in the machine learning community: they offer a principled, well-motivated approach to estimating the parameters of latent variable models using tools from linear algebra.
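As a flavour of the approach (a toy sketch, not a full spectral estimator), the following Python fragment illustrates the central idea that latent structure can be recovered from a low-rank decomposition of observable moment statistics using only linear algebra; the vocabulary size, number of latent states, and simulated moment matrix are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observable statistic: the joint probability of two observations
# generated from the same latent state, simulated here from a rank-k model.
k, V = 3, 50                                   # latent states, vocabulary size
O = rng.dirichlet(np.ones(V), size=k).T        # P(word | state), shape (V, k)
pi = rng.dirichlet(np.ones(k))                 # latent state distribution
P21 = O @ np.diag(pi) @ O.T                    # observable second-order moments

# Spectral step: a truncated SVD of the observable moment matrix reveals the
# dimensionality of the latent space and a subspace in which model parameters
# can be estimated in closed form, with no iterative EM.
U, s, Vt = np.linalg.svd(P21)
print("leading singular values:", np.round(s[:6], 4))   # only k are non-negligible
U_k = U[:, :k]                                           # basis used by spectral estimators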
Natural language semantics and question answering
Supervisor: Shay Cohen
How can we make computers understand language? This question is at the core of semantics in natural language processing. Question answering, an NLP application in which a computer is expected to respond to natural language questions, provides a lens through which to examine this challenge. Most modern search engines offer some functionality for factoid question answering. These systems have high precision, but their recall could be improved significantly. From a technical perspective, question answering sits in a sweet spot between challenging semantics problems that are unlikely to be solved in the near future and problems that will be solved in the foreseeable future. As such, it is an excellent test-bed for semantic representation theories and other attempts to describe the meaning of text. A recent development in question answering is the retrieval of answers from open knowledge bases such as Freebase (a database of factoid knowledge with no specific domain tying the facts together). The goal of this project is to explore methods for improving semantic representations of language, with open question answering as a potentially important application for testing them. These semantic representations can be symbolic (enhanced with a probabilistic interpretation) or projections into a continuous geometric space; both ideas have recently been explored in the literature.
Topics in morphology (NLP or cognitive modelling)
Supervisor: Sharon Goldwater
Many NLP systems developed for English ignore the morphological structure of words and (mostly) get away with it. Yet morphology is far more important in many other languages. Handling morphology appropriately can reduce sparse data problems in NLP, and understanding human knowledge of morphology is a long-standing scientific question in cognitive science. New methods in both probabilistic modeling and neural networks have the potential to improve word representations for downstream NLP tasks and perhaps to shed light on human morphological acquisition and processing. Projects in this area could involve working to combine distributional syntactic/semantic information with morphological information to improve word representations for low-resource languages or sparse datasets, evaluating new or existing models of morphology against human behavioral benchmarks, or related topics.
Neural Network Models of Human Language and Visual Processing
Supervisor: Frank Keller
Recent neural models have used attention mechanisms as a way of focusing the processing of a neural network on certain parts of the input. This has proved successful for diverse applications such as image description, question answering, and machine translation. Attention is also a natural way of understanding human cognitive processing: during language processing, humans attend to words in a certain order; during visual processing, they view image regions in a certain sequence. Crucially, human attention can be captured precisely using an eye-tracker, a device that measures which parts of the input the eye fixates, and for how long. Projects within this area will leverage neural attention mechanisms to model aspects of human attention. One example is reading: when reading text, humans systematically skip words, spend more time on difficult words, and sometimes re-read passages. Another example is visual search: when looking for a target, humans make a sequence of fixations that depends on a diverse range of factors, such as visual salience, scene type, and object context. Neural attention models that capture such behaviors need to combine different types of knowledge, while also offering a cognitively plausible account of how such knowledge is acquired, often from only small amounts of training data.
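As background (a minimal sketch rather than any specific project design), the following Python fragment shows the scaled dot-product attention computation that such neural models build on; the items, dimensions, and random features are placeholders, and relating the resulting attention weights to human fixation data would be a modelling choice made within a project.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    """query: (d,), keys/values: (n, d) -> attended summary and attention weights."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # one score per input item
    weights = softmax(scores)                           # distribution over input items
    return weights @ values, weights                    # weighted summary, attention map

# Toy example: 5 input items (e.g. words or image regions) with 8-d features.
rng = np.random.default_rng(1)
keys = values = rng.normal(size=(5, 8))
summary, weights = attention(rng.normal(size=8), keys, values)
print(np.round(weights, 3))   # such weights could be compared to human fixation patterns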
Neural Network Models of Long-form, Multimodal Narratives
Supervisor: Frank Keller
Deep learning approaches are very successful in classical NLP tasks. However, they often assume a limited input context, and are designed to work on short texts. Standard architectures such as LSTMs or transformers therefore struggle to process long-form narratives such as books or screenplays. Prior work shows that such narratives have a particular structure, which can be analyzed in terms of events and characters. This structure can be used for applications such as question answering or summarization of long-form texts. Projects within this area will leverage recent advances in language modeling, such as retrieval-based or memory-based models, to analyze narrative structure. The analysis can take the form of sequences or graphs linking events or characters. Based on such structures, higher-level concepts (e.g., schemas or tropes) can be identified, and user reactions such as suspense, surprise, or sentiment can be predicted. Multimodal narratives (illustrated stories, comics, or movies) pose a particular challenge, as narrative elements need to be grounded in both the linguistic and the visual modality to infer structure.
Multi-Sentence Questions
Supervisor: Bonnie Webber
A multi-sentence question (MSQ) is a short text specifying a question or set of related questions. Evidence suggests that the sentences in an MSQ relate to each other in different ways and that recognizing these relations can enable a system to produce a better response. We have been gathering a corpus of MSQs and beginning to characterize relations within them. Research will involve collecting human responses to MSQs and using them to design and implement a system that produces similar responses.
Concurrent Discourse Relations
Supervisors: Bonnie Webber, Hannah Rohde (LEL)
Evidence from crowd-sourced conjunction-completion experiments shows that people systematically infer implicit discourse relations that hold in addition to discourse relations signalled explicitly. Research on shallow discourse parsing has not yet reflected these findings. It is possible that enabling a shallow discourse parser to recognize implicit relations that hold concurrently with explicitly signalled relations may also help in the recognition of implicit relations without additional signals. Work in this area could also involve crowd-sourcing additional human judgments on discourse relations.
Low-resource language and speech processing
Supervisors: Sharon Goldwater, Edoardo Ponti
The most effective language and speech processing systems are based on statistical models learned from many annotated examples, a classic application of machine learning on input/output pairs. But for many languages and domains we have little data. Even when data is available, it is often government or news text; for the vast majority of languages and domains, there is hardly anything. However, in many cases there is side information that we can exploit: dictionaries or other knowledge sources, or text paired with weak signals such as images, speech, or timestamps. How can we exploit such heterogeneous information in statistical language processing? The goal of projects in this area is to develop statistical models and inference techniques that exploit such data, and to apply them to real problems.
Incremental interpretation for robust NLP using CCG and dependency parsing
Supervisor: Mark Steedman
Combinatory Categorial Grammar (CCG) is a computational grammar formalism that has recently been used widely in NLP applications including wide-coverage parsing, generation, and semantic parser induction. The present project seeks to apply insights from these and other sources including dependency parsing to the problem of incremental word-by-word parsing and interpretation using statistical models. Possible evaluation tasks include language modeling for automatic speech recognition, as well as standard parsing benchmarks.
Data-driven learning of temporal semantics for NLP
Supervisor: Mark Steedman
Mike Lewis' Edinburgh thesis (2015) shows how to derive a natural language semantics for wide-coverage parsers that directly captures relations of paraphrase and entailment, using machine learning and parser-based machine reading of large amounts of text. The present project seeks to extend this semantics to temporal and causal relations between events, such as the fact that being somewhere is the consequent state of arriving there, using large amounts of timestamped text.
Constructing large knowledge graphs from text using machine reading
Supervisor: Mark Steedman
Knowledge graphs like Freebase are constructed by hand using relation labels that are not easy to map onto natural language semantics, especially for languages other than English. An obvious alternative is to build the knowledge graph in terms of language-independent natural language semantic relations, which can be mined by machine reading from multilingual text (Lewis and Steedman, 2013b). The project will investigate the extension of this language-independent semantics and its application to the construction of large knowledge resources using parser-based machine reading.
Semantic Parsing for Sequential Question Answering
Supervisor: Mirella Lapata
Semantic parsing maps natural language queries into machine-interpretable meaning representations (e.g., logical forms or computer programs). These representations can be executed in a task-specific environment to help users navigate a database, compare products, or reach a decision. Semantic parsers to date can handle queries of varying complexity and a wealth of representations including lambda calculus, dependency-based compositional semantics, variable-free logic, and SQL.
However, the bulk of existing work has focused on isolated queries, ignoring the fact that most natural language interfaces receive inputs in streams. Users typically ask questions or perform tasks in multiple steps, and they often decompose a complex query into a sequence of inter-related sub-queries. The aim of this project is to develop novel neural architectures for training semantic parsers in a context-dependent setting. The task involves simultaneously parsing individual queries correctly and resolving co-reference links between them. An additional challenge involves eliciting datasets which simulate the task of answering sequences of simple but inter-related questions.
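To make the setting concrete, here is a small, entirely hypothetical example of the kind of inter-related query sequence such a parser would face; the table and column names and the SQL forms are invented for illustration.

# Hypothetical example of a sequence of inter-related questions and target SQL;
# schema names ("rivers", "river_cities", etc.) are invented for illustration.
dialogue = [
    ("Which rivers run through France?",
     "SELECT name FROM rivers WHERE country = 'France'"),
    ("Which of them is the longest?",            # "them" refers to the previous answer set
     "SELECT name FROM rivers WHERE country = 'France' ORDER BY length DESC LIMIT 1"),
    ("What cities does it pass through?",        # "it" refers to that river
     "SELECT city FROM river_cities WHERE river = "
     "(SELECT name FROM rivers WHERE country = 'France' ORDER BY length DESC LIMIT 1)"),
]

# A context-dependent parser must map each question to its meaning representation
# while resolving references to entities introduced by earlier queries.
for question, sql in dialogue:
    print(f"Q: {question}\nSQL: {sql}\n")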
Knowledge Graph Completion with Rich Semantics
Supervisor: Jeff Pan
Knowledge Graphs have been shown to be useful for improving the performance and explainability of machine learning methods (such as transfer learning and zero-shot learning) and their downstream tasks, such as NLP tasks. The present project seeks to investigate how to integrate rich semantics into knowledge graph completion models and methods. Projects in this area could involve integrating knowledge graph schemas, temporal and spatial constraints, knowledge graph updates, or information from language models.
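For orientation, the sketch below shows a minimal embedding-based completion scorer in the style of DistMult, using placeholder embeddings; how rich semantics such as schema, temporal, or spatial constraints would be injected into such a scorer is precisely what a project in this area would investigate.

import numpy as np

rng = np.random.default_rng(0)
dim, n_entities, n_relations = 16, 100, 10
E = rng.normal(scale=0.1, size=(n_entities, dim))    # entity embeddings (placeholders)
R = rng.normal(scale=0.1, size=(n_relations, dim))   # relation embeddings (placeholders)

def score(head, relation, tail):
    """DistMult-style plausibility of the triple (head, relation, tail); higher = more likely true."""
    return float(np.sum(E[head] * R[relation] * E[tail]))

# Completion = ranking candidate tail entities for a query (head, relation, ?).
candidates = np.array([score(0, 3, t) for t in range(n_entities)])
print("top predicted tail entities:", np.argsort(-candidates)[:5])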
Complex Query Answering over Knowledge Graphs
Supervisor: Jeff Pan
Answering complex queries on large-scale incomplete knowledge graphs is a fundamental yet challenging task. Current research occupies two extremes: one relies entirely on logical reasoning over both schema and data sub-graphs but suffers from knowledge incompleteness; the other relies mainly on embeddings but pays little attention to schema information. The present project will investigate alternative approaches, including those that can make effective use of both schemas and embeddings.
Open-Domain Complex Question Answering at Scale
Supervisor: Pasquale Minervini
Open-Domain Question Answering (ODQA) is a task where a system needs to generate the answer to a general-domain question without being given the supporting evidence as input. A core limitation of modern ODQA models is that they remain limited to answering simple, factoid questions, where the answer is explicit in a single piece of evidence. In contrast, complex questions involve aggregating information from multiple documents, requiring some form of logical reasoning and sequential, multi-hop processing to generate the answer. Projects in this area involve proposing new ODQA models for answering complex questions, for example by taking inspiration from models for answering complex queries over Knowledge Graphs (Arakelyan et al., 2021; Minervini et al., 2022) and Neural Theorem Provers (Minervini et al., 2020a; Minervini et al., 2020b), and proposing methods by which neural ODQA models can learn to search over very large text corpora, such as the entire Web.
Neuro-Symbolic and Hybrid Discrete-Continuous Natural Language Processing Models
Supervisor: Pasquale Minervini
Incorporating discrete components, such as discrete decision steps and symbolic reasoning algorithms, in neural models can significantly improve their interpretability, data efficiency, and predictive properties; see, for example, Niepert et al. (2021), Minervini et al. (2022), and Minervini et al. (2020a,b). However, approaches in this space rely either on ad-hoc continuous relaxations (e.g. Minervini et al., 2020a,b) or on gradient estimation techniques that require assumptions about the distributions of the discrete variables (Niepert et al., 2021; Minervini et al., 2022). Projects in this area involve devising neuro-symbolic approaches for solving NLP tasks that require some degree of reasoning and compositionality, and identifying gradient estimation techniques (for back-propagating through discrete decision steps) that are data-efficient, hyperparameter-free, and accurate, while requiring fewer assumptions about the distribution of the discrete variables.
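As one concrete illustration of the gradient-estimation issue (a minimal sketch, not a proposed solution), the fragment below implements a straight-through Gumbel-Softmax estimator in PyTorch: the forward pass makes a hard discrete choice while the backward pass uses a continuous relaxation; the logits and the toy reward are placeholders.

import torch
import torch.nn.functional as F

def st_gumbel_softmax(logits, tau=1.0):
    """Sample a one-hot decision while allowing gradients to flow back to `logits`."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    soft = F.softmax((logits + gumbel) / tau, dim=-1)               # relaxed sample
    hard = F.one_hot(soft.argmax(dim=-1), logits.shape[-1]).float() # discrete choice
    return hard + (soft - soft.detach())                            # straight-through trick

logits = torch.nn.Parameter(torch.zeros(4))                    # 4 possible discrete choices
decision = st_gumbel_softmax(logits)
loss = -(decision * torch.tensor([0.0, 1.0, 0.0, 0.0])).sum()  # toy reward for option 1
loss.backward()
print(decision, logits.grad)                                   # gradients despite the discrete step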
Learning from Graph-Structured Data
Supervisor: Pasquale Minervini
Graph-structured data is everywhere: consider Knowledge Graphs, social networks, protein and drug interaction networks, and molecular profiles. In this project, we aim to improve models for learning from graph-structured data and their evaluation protocols. Projects in this area involve incorporating invariances into graph machine learning models (e.g. Minervini et al., 2017), proposing methods for transferring knowledge between graph representations, automatically identifying functional inductive biases for learning from graphs in a given domain (such as Knowledge Graphs), and proposing techniques for explaining the output of black-box graph machine learning methods (such as graph embeddings).
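For orientation, the fragment below sketches a single GCN-style message-passing step, the basic operation that most graph machine learning models build on; the adjacency matrix, node features, and weight matrix are toy placeholders.

import numpy as np

# Adjacency matrix of a small undirected graph with 4 nodes (toy example).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.eye(4)                                       # initial node features (one-hot identities)
W = np.random.default_rng(0).normal(size=(4, 8))    # learnable projection (placeholder)

A_hat = A + np.eye(4)                               # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
H = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0)   # normalised propagation + ReLU
print(H.shape)   # each node now has an 8-d representation mixing its neighbourhood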
Modular Transfer Learning
Current neural models often fail to generalise systematically and suffer from negative transfer or catastrophic forgetting across different tasks or languages. A promising solution is endowing neural models with modularity. This property allows for i) disentangling knowledge and recombining it in new, original ways, and ii) updating modules locally and asynchronously. Specifically, modules representing different languages, tasks, or modalities (such as perception or action) can be implemented as parameter-efficient adapters on top of pre-trained general-purpose language models. The goal of this project is to design modular architectures capable of adapting to new tasks from only a few examples.
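As an illustration of what such a module might look like (a minimal sketch, with illustrative dimensions rather than any particular pre-trained model), the fragment below implements a bottleneck adapter that adds a small, trainable residual correction to the hidden states of a frozen language model layer.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # project down to a small bottleneck
        self.up = nn.Linear(bottleneck, hidden_size)     # project back up

    def forward(self, hidden_states):
        # Residual connection: the frozen model's representation passes through
        # unchanged except for a small, task- or language-specific correction.
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

adapter = Adapter()
h = torch.randn(2, 10, 768)          # (batch, sequence, hidden) from a frozen LM layer
print(adapter(h).shape)              # torch.Size([2, 10, 768])
# Each language or task gets its own adapter; swapping or composing adapters is
# one way to recombine knowledge modularly while updating modules independently.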
Next-Generation Tool Synthesis for Complex Information Tasks
Supervisor: Jeff Dalton
The aim of this project is to develop new methods for tool use and plugin-based approaches that allow LLMs to perform complex tasks. Next-generation virtual assistants based on LLMs interact with external systems to assist users with information tasks, including interacting with search systems and making structured API calls. The project will develop new models and evaluation methods for "Auto-AppChain", with human-in-the-loop interaction and evolving scenarios for complex tasks.
Knowledge Distillation for Adaptive Large Language Models
Supervisor: Jeff Dalton
The aim of this project is to improve the effectiveness of language models as knowledge bases and at performing complex reasoning tasks in environments where information is specialized and rapidly evolving. It will study new methods for encoding structured, specialized knowledge in language models, for accessing that information, and for editing models to keep them up to date. It will also study how to adapt models to specialized topics and domains effectively while preserving key capabilities (instruction following, in-context learning, and chat).
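As background on the distillation side (a minimal sketch under standard assumptions, not the project's method), the fragment below shows the usual knowledge distillation objective, in which a student model is trained to match a teacher's temperature-softened output distribution alongside the ordinary task loss; the models and data are placeholders.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled output distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)    # ordinary supervised loss
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(4, 10, requires_grad=True)   # placeholder student outputs
teacher_logits = torch.randn(4, 10)                       # placeholder teacher outputs
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))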
Personalized Content Moderation Technologies
Supervisor: Zee Talat
Content moderation is one of the vital technologies of our time. Without algorithms for sorting, ranking, and removing unwanted data, current global social structures become untenable. Machine learning has been widely deployed for content moderation; however, such systems rely on a one-size-fits-all approach, even though what content one would and would not like to engage with is a deeply personal question. This project seeks to go beyond the current state of content moderation technology by examining different technological methods for personalising machine learning and natural language processing technologies.
Multimodal Hate Speech Detection
Supervisor: Zee Talat
Work in hate speech detection has primarily focused on the text modality, with less attention to speech and audio, yet online communication often combines modalities. Taking the breadth and nature of communication into account is vital for correct, accurate, and fair recognition of sanctionable content. This project will seek to address this gap between technology and communicative practices. In particular, it will aim to develop methods for identifying hateful and discriminatory messages across text, audio, and video.