Natural Language Processing and Computational Linguistics
A list of potential topics for PhD students in the area of Language Processing.
Concurrency in (computational) linguistics
Improving understanding of synchronic and diachronic aspects of phonology.
Supervisor: Julian Bradfield
In several aspects of linguistic analysis, it is natural to think of some form of concurrent processing, not least because the brain is a massively concurrent system. This is particularly true in phonology and phonetics, and descriptions such as feature analyses, and especially autosegmental phonology, go some way to recognizing this. Although there has been some work on rigorous formal models of such descriptions, there has been little if any application of the extensive body of research in theoretical computer science on concurrent processes. Such a project has the potential to give better linguistic understanding of synchronic and diachronic aspects of phonology and perhaps syntax, and even to improve speech generation and recognition, by adding formal underpinning and improvement to the existing agent-based approaches.
Spectral learning for natural language processing
Latent variable modeling is a common technique for improving the expressive power of natural language processing models. The values of these latent variables are unobserved in the data, yet we must still predict them and estimate the model parameters under the assumption that they exist in the model. This project seeks to improve the expressive power of NLP models at various levels (such as morphological, syntactic and semantic) using latent variable modeling, and also to identify key techniques, based on spectral algorithms, for learning these models. The family of latent-variable spectral learning algorithms is an exciting recent development seeded in the machine learning community. It presents a principled, well-motivated approach for estimating the parameters of latent variable models using tools from linear algebra.
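As a minimal illustration of the spectral idea (not the specific algorithms this project would develop), a truncated SVD can recover low-dimensional latent structure from an empirical co-occurrence matrix without EM-style local search. The vocabulary and counts below are invented toy data:

```python
import numpy as np

# Toy word-bigram co-occurrence counts (rows/columns index a tiny vocabulary).
# In spectral methods, low-rank structure in such moment matrices reflects
# the latent variables; SVD exposes it directly, with no local optima.
vocab = ["the", "cat", "dog", "runs", "sleeps"]
counts = np.array([
    [0, 10, 9, 0, 0],
    [0, 0, 0, 6, 4],
    [0, 0, 0, 5, 5],
    [8, 0, 0, 0, 0],
    [7, 0, 0, 0, 0],
], dtype=float)

# Normalise to joint probabilities (an empirical second-order moment).
P = counts / counts.sum()

# Truncated SVD: keep the top-k singular directions as latent dimensions.
k = 2
U, s, Vt = np.linalg.svd(P)
embeddings = U[:, :k] * s[:k]  # one k-dimensional latent vector per word

for w, e in zip(vocab, embeddings):
    print(w, np.round(e, 3))
```

In a full spectral algorithm (e.g. for HMMs or latent-variable PCFGs) such decompositions of moment matrices yield consistent parameter estimates, in contrast to EM's local optima.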
Natural language semantics and question answering
Supervisor: Shay Cohen
How can we make computers understand language? This is a question at the core of the area of semantics in natural language processing. Question answering, an NLP application in which a computer is expected to respond to natural language questions, provides a lens to look into this challenge. Most modern search engines offer some level of functionality for factoid question answering. These systems have high precision, but their recall could be significantly improved. From the technical perspective, question answering offers a sweet spot between challenging semantics problems that are not expected to be solved in the near future, and problems that will be solved in the foreseeable future. As such, it is an excellent test-bed for semantic representation theories and for other attempts at describing the meaning of text. The most recent development in question answering is the retrieval of answers from open knowledge bases such as Freebase (a factoid database of various facts without a specific domain tying them all together). The goal of this project is to explore various methods to improve semantic representations in language, with open question answering being potentially an important application for testing them. These semantic representations can either be symbolic (enhanced with a probabilistic interpretation) or they can be projections in a continuous geometric space. Both of these ideas have been recently explored in the literature.
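To make the "projections in a continuous geometric space" idea concrete, here is a minimal sketch in which a question and candidate answers are compared by cosine similarity. The vectors are hand-set stand-ins; a real system would learn such embeddings from data:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: how closely two vectors point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for a question and two candidate answers.
question = np.array([0.9, 0.1, 0.3])
candidates = {
    "Paris":  np.array([0.8, 0.2, 0.4]),
    "banana": np.array([0.1, 0.9, 0.0]),
}

# Pick the candidate whose vector is geometrically closest to the question.
best = max(candidates, key=lambda c: cosine(question, candidates[c]))
print(best)
```

Symbolic approaches would instead match the question against logical forms or knowledge-base relations; the geometric view trades that interpretability for robustness to paraphrase.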
Topics in morphology (NLP or cognitive modelling)
Supervisor: Sharon Goldwater
Many NLP systems developed for English ignore the morphological structure of words and (mostly) get away with it. Yet morphology is far more important in many other languages. Handling morphology appropriately can reduce sparse data problems in NLP, and understanding human knowledge of morphology is a long-standing scientific question in cognitive science. New methods in both probabilistic modeling and neural networks have the potential to improve word representations for downstream NLP tasks and perhaps to shed light on human morphological acquisition and processing. Projects in this area could involve working to combine distributional syntactic/semantic information with morphological information to improve word representations for low-resource languages or sparse datasets, evaluating new or existing models of morphology against human behavioral benchmarks, or related topics.
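One simple way morphology can reduce sparse-data problems is to compose a word's representation from the representations of its morphemes, so rare inflected forms inherit information from their stems. The segmentations and vectors below are invented for illustration; a real model would induce both from data:

```python
import numpy as np

# Assumed morpheme segmentations and toy morpheme vectors.
morph_vecs = {
    "walk": np.array([1.0, 0.0]),
    "talk": np.array([0.9, 0.1]),
    "-ed":  np.array([0.0, 1.0]),
    "-ing": np.array([0.0, -1.0]),
}
segmentations = {"walked": ["walk", "-ed"], "talking": ["talk", "-ing"]}

def word_vector(word):
    # Additive composition: a word's vector is the sum of its morpheme
    # vectors, so "walked" shares its stem dimension with "walk".
    return sum(morph_vecs[m] for m in segmentations[word])

print(word_vector("walked"))
```

Neural variants replace the sum with learned composition functions over characters or morphemes, which is one route to the improved word representations mentioned above.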
Incremental Processing with Neural Networks
Design neural network models that can process input incrementally (word by word) and apply them to NLP tasks such as part of speech tagging, parsing, or semantic role labeling
Supervisor: Frank Keller
Incremental sentence processing, i.e., the construction of representations on a word-by-word basis as the input unfolds, is not only a central property of human language processing, it is also crucial for NLP systems that need to work in real time, e.g., for sentence completion, speech translation, or dialogue. However, the deep learning techniques used in most state-of-the-art NLP make it hard to achieve incrementality. Even architectures that can in principle be incremental (such as recurrent neural networks) are in practice used in a bidirectional fashion, as they require right context for good performance.
The aim of this project is to develop novel neural architectures that can perform tasks such as part of speech tagging, parsing, or semantic role labeling incrementally. The problem decomposes into two parts: (1) design features that are maximally informative, even though they don't have access to right context; (2) develop learning algorithms that work well with the incomplete input available during incremental processing. For both tasks, ideas from existing generative models for incremental processing can be leveraged. In particular, the idea of computing the expected completions of a sentence prefix could be integrated into the cost function used for neural network training.
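As a minimal sketch of the incremental setting, a unidirectional recurrent network updates its hidden state one word at a time, using only left context; the parameters here are randomly initialised stand-ins for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3  # toy embedding and hidden-state sizes

# Randomly initialised parameters stand in for trained ones.
W_x = rng.normal(size=(d_h, d_in))
W_h = rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)

def step(h, x):
    # One incremental update: consumes a single word embedding x and the
    # previous hidden state h; no right context is ever consulted.
    return np.tanh(W_x @ x + W_h @ h + b)

h = np.zeros(d_h)
sentence = [rng.normal(size=d_in) for _ in range(5)]
for x in sentence:
    h = step(h, x)
    # At each word, h could immediately feed a tagging or parsing decision.

print(h.shape)
```

A bidirectional model would additionally run a second RNN right-to-left and concatenate both states, which is exactly what incremental applications cannot afford.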
Structured representations of images and text
Develop models for learning structured representations of images and exploit them for language/vision tasks such as image description or visual question answering
Supervisor: Frank Keller
The web is awash with image data: on Facebook alone, 350 million new images are uploaded every day. These images typically co-occur with textual data such as comments, captions, or tags. Well-established techniques exist for extracting structure from text, but image structure is a much less explored area. Prior work has shown that images can be represented as graphs expressing the relations between the objects in the image. One example is Visual Dependency Representations (VDRs), which can be aligned with textual data and used for image description.
The aim of this project is to explore the use of structured image representations such as VDRs. Ideas for topics include: (1) Developing new ways of learning structured representations from images with various forms of prior annotation (using structure-aware deep learning techniques, e.g., recursive neural networks). (2) Augmenting existing structured representations to be more expressive (e.g., by representing events, attributes, scene type, background knowledge). (3) Developing models that exploit VDRs for new language/vision tasks, e.g., dense image description, visual question answering, story illustration, induction of commonsense knowledge.
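To illustrate the general shape of such structured representations (in the spirit of VDRs, though not their exact formalism), an image can be encoded as a graph whose nodes are detected objects and whose labelled edges are relations between them. The objects and relations below are invented:

```python
# A minimal graph encoding of an image scene: nodes are detected objects,
# labelled edges are spatial or semantic relations between them.
objects = ["man", "bicycle", "road"]
relations = [
    ("man", "riding", "bicycle"),
    ("bicycle", "on", "road"),
]

def describe(relations):
    # Linearise the graph into a crude caption, one clause per edge;
    # real image-description models would generate fluent text instead.
    return "; ".join(f"{s} {r} {o}" for s, r, o in relations)

print(describe(relations))
```

Because the representation is a graph rather than a bag of labels, it can be aligned with the syntactic or semantic structure of accompanying text, which is what makes it useful for description and question answering.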
Multi-sentence questions
Supervisor: Bonnie Webber
A multi-sentence question (MSQ) is a short text specifying a question or set of related questions. Evidence suggests that the sentences in an MSQ relate to each other in different ways and that recognizing these relations can enable a system to produce a better response. We have been gathering a corpus of MSQs and beginning to characterize relations within them. Research will involve collecting human responses to MSQs and using them to design and implement a system that produces similar responses.
Concurrent Discourse Relations
Evidence from crowd-sourced conjunction-completion experiments shows that people systematically infer implicit discourse relations that hold in addition to discourse relations signalled explicitly. Research on shallow discourse parsing has not yet reflected these findings. It is possible that enabling a shallow discourse parser to recognize implicit relations that hold concurrently with explicitly signalled relations may also help in the recognition of implicit relations without additional signals. Work in this area could also involve crowd-sourcing additional human judgments on discourse relations.
Low-resource language and speech processing
Supervisor: Sharon Goldwater
The most effective language and speech processing systems are based on statistical models learned from many annotated examples, a classic application of machine learning on input/output pairs. But for many languages and domains we have little data. Even where data does exist, it is often limited to government or news text; for the vast majority of languages and domains, there is hardly anything. However, in many cases there is side information that we can exploit: dictionaries or other knowledge sources, or text paired with weak signals, such as images, speech, or timestamps. How can we exploit such heterogeneous information in statistical language processing? The goal of projects in this area is to develop statistical models and inference techniques that exploit such data, and apply them to real problems.
Incremental interpretation for robust NLP using CCG and dependency parsing
Supervisor: Mark Steedman
Combinatory Categorial Grammar (CCG) is a computational grammar formalism that has recently been used widely in NLP applications including wide-coverage parsing, generation, and semantic parser induction. The present project seeks to apply insights from these and other sources including dependency parsing to the problem of incremental word-by-word parsing and interpretation using statistical models. Possible evaluation tasks include language modeling for automatic speech recognition, as well as standard parsing benchmarks.
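As a toy illustration of the CCG machinery involved (a deliberately simplistic string encoding, not a real parser), forward and backward application combine categories such as "(S\NP)/NP" with their arguments:

```python
# Toy CCG categories as strings: "X/Y" seeks a Y to its right, "X\\Y"
# seeks a Y to its left. This string handling is simplistic and would
# not cope with deeply nested categories.
def forward_apply(left, right):
    # Forward application: X/Y combined with Y yields X.
    if left.endswith("/" + right):
        return left[: -(len(right) + 1)].strip("()")
    return None

def backward_apply(left, right):
    # Backward application: Y combined with X\Y yields X.
    if right.endswith("\\" + left):
        return right[: -(len(left) + 1)].strip("()")
    return None

# "likes" := (S\NP)/NP; derive "likes Mary", then "John likes Mary".
vp = forward_apply("(S\\NP)/NP", "NP")  # verb phrase: S\NP
s = backward_apply("NP", vp)            # full sentence: S
print(vp, s)
```

The interest of CCG for this project is that additional combinators (composition, type-raising) permit left-branching derivations, which is what makes strictly word-by-word incremental interpretation possible.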
Data-driven learning of temporal semantics for NLP
Supervisor: Mark Steedman
Mike Lewis' Edinburgh thesis (2015) shows how to derive a natural language semantics for wide coverage parsers that directly captures relations of paraphrase and entailment using machine learning and parser-based machine reading of large amounts of text. The present project seeks to extend the semantics to temporal and causal relations between events, such as that being somewhere is the consequent state of arriving there, using large amounts of timestamped text.
Constructing large knowledge graphs from text using machine reading
Supervisor: Mark Steedman
Knowledge graphs like Freebase are constructed by hand using relation labels that are not easy to map onto natural language semantics, especially for languages other than English. An obvious alternative is to build the knowledge graph in terms of language-independent natural language semantic relations, which, as Lewis and Steedman (2013b) show, can be mined by machine reading from multi-lingual text. The project will investigate the extension of the language independent semantics and its application to the construction of large knowledge resources using parser-based machine reading.
Semantic Parsing for Sequential Question Answering
Supervisor: Mirella Lapata
Semantic parsing maps natural language queries into machine interpretable meaning representations (e.g., logical forms or computer programs). These representations can be executed in a task-specific environment to help users navigate a database, compare products, or reach a decision. Semantic parsers to date can handle queries of varying complexity and a wealth of representations including lambda calculus, dependency-based compositional semantics, variable-free logic, and SQL.
However, the bulk of existing work has focused on isolated queries, ignoring the fact that most natural language interfaces receive inputs in streams. Users typically ask questions or perform tasks in multiple steps, and they often decompose a complex query into a sequence of inter-related sub-queries. The aim of this project is to develop novel neural architectures for training semantic parsers in a context-dependent setting. The task involves simultaneously parsing individual queries correctly and resolving co-reference links between them. An additional challenge involves eliciting datasets which simulate the task of answering sequences of simple but inter-related questions.
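To make the sequential setting concrete, here is a toy rule-based stand-in for a context-dependent semantic parser (the neural architectures this project targets would learn such behaviour): each question is mapped to a small logical-form-like string, and a pronoun is resolved against the entity from the previous query. The predicates and patterns are invented:

```python
import re

# A toy, pattern-based "parser": real context-dependent semantic parsers
# would learn this mapping and the coreference resolution jointly.
def parse(question, context_entity=None):
    m = re.match(r"who directed (.+)\?", question)
    if m:
        # Return a logical form and remember the entity for later queries.
        return f"director_of({m.group(1)})", m.group(1)
    m = re.match(r"when was (it|.+) released\?", question)
    if m:
        # Resolve "it" against the entity carried over from context.
        ent = context_entity if m.group(1) == "it" else m.group(1)
        return f"release_year({ent})", ent
    return None, context_entity

ctx = None
for q in ["who directed Vertigo?", "when was it released?"]:
    lf, ctx = parse(q, ctx)
    print(q, "->", lf)
```

The hard parts the project addresses are exactly what this sketch hard-codes: parsing each query correctly while resolving links between them, and doing both from data rather than hand-written patterns.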