Natural Language Processing and Computational Linguistics

A list of potential topics for PhD students in the area of Language Processing.

Concurrency in (computational) linguistics

Improving understanding of synchronic and diachronic aspects of phonology.

Supervisor: Julian Bradfield

In several aspects of linguistic analysis, it is natural to think of some form of concurrent processing, not least because the brain is a massively concurrent system. This is particularly true in phonology and phonetics, and descriptions such as feature analyses, and especially autosegmental phonology, go some way to recognizing this. Although there has been some work on rigorous formal models of such descriptions, there has been little if any application of the extensive body of research in theoretical computer science on concurrent processes. Such a project has the potential to give better linguistic understanding of synchronic and diachronic aspects of phonology and perhaps syntax, and even to improve speech generation and recognition, by adding formal underpinning and improvement to the existing agent-based approaches.

Spectral learning for natural language processing

Supervisors: Shay Cohen, Mirella Lapata

Latent variable modeling is a common technique for improving the expressive power of natural language processing models. The values of the latent variables are not observed in the data, yet we must still predict them and estimate the model parameters under the assumption that they exist. This project seeks to improve the expressive power of NLP models at various levels (such as morphological, syntactic and semantic) using latent variable modeling, and also to identify key techniques, based on spectral algorithms, for learning these models. The family of latent-variable spectral learning algorithms is a recent and exciting development that originated in the machine learning community. It offers a principled, well-motivated approach to estimating the parameters of latent variable models using tools from linear algebra.
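
To give a feel for the flavour of these methods, the sketch below (a toy illustration, not any particular published algorithm) estimates a second-order moment matrix from a tiny corpus and factorises it with an SVD, the first step shared by many spectral learning algorithms; the corpus, the assumed number of latent states, and all variable names are made up for illustration.

```python
# A minimal sketch of the moment-estimation + SVD step common to spectral methods.
import numpy as np

corpus = [
    "the dog barked", "the cat meowed", "a dog barked loudly",
    "a cat slept", "the dog slept",
]
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Second-order moment: co-occurrence counts of adjacent word pairs.
P21 = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for w1, w2 in zip(words, words[1:]):
        P21[idx[w2], idx[w1]] += 1.0
P21 /= P21.sum()  # empirical joint probability of (previous word, next word)

k = 2  # assumed number of latent states
U, S, Vt = np.linalg.svd(P21)
U_k = U[:, :k]  # spectral projection of words onto the latent space

# A full spectral algorithm would use U_k to build observable operators from
# third-order moments; here we simply inspect the word projections.
for w in vocab:
    print(w, U_k[idx[w]].round(3))
```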

Natural language semantics and question answering

Supervisor: Shay Cohen

How can we make computers understand language? This question is at the core of semantics in natural language processing. Question answering, an NLP application in which a computer is expected to respond to natural language questions, provides a lens through which to examine this challenge. Most modern search engines offer some functionality for factoid question answering. These systems have high precision, but their recall could be significantly improved. From a technical perspective, question answering offers a sweet spot between challenging semantic problems that are not expected to be solved in the near future and problems that will be solved in the foreseeable future. As such, it is an excellent test-bed for semantic representation theories and for other attempts at describing the meaning of text. A recent development in question answering is the retrieval of answers from open knowledge bases such as Freebase (a factoid database of facts from many domains, with no single domain tying them together). The goal of this project is to explore various methods for improving semantic representations of language, with open question answering as a potentially important application for testing them. These semantic representations can either be symbolic (enhanced with a probabilistic interpretation) or projections into a continuous geometric space. Both ideas have recently been explored in the literature.

Topics in morphology (NLP or cognitive modelling)

Supervisor: Sharon Goldwater

Many NLP systems developed for English ignore the morphological structure of words and (mostly) get away with it. Yet morphology is far more important in many other languages. Handling morphology appropriately can reduce sparse data problems in NLP, and understanding human knowledge of morphology is a long-standing scientific question in cognitive science. New methods in both probabilistic modeling and neural networks have the potential to improve word representations for downstream NLP tasks and perhaps to shed light on human morphological acquisition and processing. Projects in this area could involve working to combine distributional syntactic/semantic information with morphological information to improve word representations for low-resource languages or sparse datasets, evaluating new or existing models of morphology against human behavioral benchmarks, or related topics.
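
One concrete direction such a project could take, sketched below under entirely toy assumptions (hand-coded segmentations and random morpheme vectors), is to compose a word's representation from the representations of its morphemes, so that rare inflected forms share parameters with their stems.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy morpheme inventory; a real system would induce segmentations
# (e.g., with an unsupervised morphological analyser) rather than hard-code them.
morpheme_vecs = {m: rng.normal(size=dim) for m in ["walk", "talk", "-ed", "-ing", "-s"]}
segmentations = {
    "walked": ["walk", "-ed"],
    "walking": ["walk", "-ing"],
    "talks": ["talk", "-s"],
}

def word_vector(word):
    # Represent a word as the sum of its morpheme vectors, so that
    # "walked" and "walking" share the "walk" parameters even if one is rare.
    return sum(morpheme_vecs[m] for m in segmentations[word])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(word_vector("walked"), word_vector("walking")))  # high: shared stem
print(cosine(word_vector("walked"), word_vector("talks")))    # lower: no shared morphemes
```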

Directly learning phrases in machine translation

Supervisor: Kenneth Heafield

Machine translation systems memorize phrasal translations so that they can translate chunks of text at a time.  For example, the system memorizes that the phrase "Chambre des représentants" translates as "House of Representatives" with some probability.  These phrases are currently identified using heuristics on top of word translations. However, word-level heuristics inadequately capture non-compositional phrases like "hot dog".
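
For concreteness, the sketch below shows a simplified version of the standard alignment-consistency heuristic alluded to above: a phrase pair is extracted whenever a source span and a target span align only to each other. The alignment points and sentence lengths are hypothetical, and details such as extending phrases over unaligned words are ignored.

```python
def extract_phrases(src_len, tgt_len, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment (simplified)."""
    phrases = []
    for i1 in range(src_len):
        for i2 in range(i1, min(src_len, i1 + max_len)):
            # Target positions aligned to the source span [i1, i2].
            tgt_points = [t for (s, t) in alignment if i1 <= s <= i2]
            if not tgt_points:
                continue
            j1, j2 = min(tgt_points), max(tgt_points)
            # Consistency: no word in the target span aligns outside [i1, i2].
            if all(i1 <= s <= i2 for (s, t) in alignment if j1 <= t <= j2):
                phrases.append(((i1, i2), (j1, j2)))
    return phrases

# "Chambre des représentants" / "House of Representatives", aligned word by word
# (the alignment here is hypothetical; real alignments come from word-level models).
alignment = {(0, 0), (1, 1), (2, 2)}
print(extract_phrases(3, 3, alignment))
```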

This project will look at ways to replace word-level heuristics by learning phrases directly from translated text.  Prior work (DeNero et al., 2006) has sought to segment translated sentences into translated phrases, but failed to address overlapping phrases.  Potential topics include models for overlapping phrases, the large number of parameters that results from considering all pairings of phrases, discriminatively optimizing phrases directly for translation quality, and modeling compositionality.

Tera-scale language models

Supervisor: Kenneth Heafield

Given a copy of the web, how well can we predict the next word you will type or figure out what sentence you said?  We have built language models on trillions of words of text and seen that they do improve quality.  Is this the best model?  Can we use 10x more data?  If we need to query the models on machines with less RAM than the terabytes the models occupy, where should we approximate and where should the remaining data live?  This project is about both the systems aspect of dealing with large amounts of data and the modeling questions of quality and approximation.  By looking at systems and modeling jointly, we hope to arrive at the best language models at every size, rather than limiting the data we use.
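
To make the modelling side concrete, here is a minimal count-based n-gram model with "stupid backoff" scoring, the kind of simple scheme that scales to web-sized corpora. This is only a sketch: real tera-scale systems store counts in compressed, disk- or cluster-backed structures rather than Python dictionaries, and the training data here is a toy stand-in.

```python
from collections import defaultdict

class StupidBackoffLM:
    """Minimal count-based n-gram model with stupid-backoff scoring."""

    def __init__(self, order=3, alpha=0.4):
        self.order, self.alpha = order, alpha
        self.counts = defaultdict(int)

    def train(self, sentences):
        for sent in sentences:
            tokens = ["<s>"] * (self.order - 1) + sent.split() + ["</s>"]
            for i in range(len(tokens)):
                for n in range(1, self.order + 1):
                    if i + n <= len(tokens):
                        self.counts[tuple(tokens[i:i + n])] += 1

    def score(self, context, word):
        # Back off to shorter contexts, multiplying by alpha each time.
        for k in range(len(context), -1, -1):
            ngram = tuple(context[-k:]) + (word,) if k else (word,)
            if self.counts[ngram] > 0:
                denom = self.counts[tuple(context[-k:])] if k else sum(
                    c for ng, c in self.counts.items() if len(ng) == 1)
                return (self.alpha ** (len(context) - k)) * self.counts[ngram] / denom
        return 0.0

lm = StupidBackoffLM()
lm.train(["the cat sat on the mat", "the dog sat on the rug"])
print(lm.score(("sat", "on"), "the"))
```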

Optimizing structured objectives

Supervisor: Kenneth Heafield

Natural language systems often boil down to search problems: find high-scoring options in a structured but exponential space.  Our ability to efficiently search these spaces limits what types of features can be used and, ultimately, the quality of the system.  This project is about growing the class of efficient features by making better search algorithms.  We would like to support long-distance continuous features like neural networks and discrete features such as parsing.  While most people treat features as a black box that returns final scores, we will open the box and improve search in ways that exploit the internal structure of natural language features.  Applications include machine translation, speech recognition, and parsing.
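
As a concrete baseline for the kind of search involved, the sketch below implements a generic beam search over an exponential space of sequences; the expansion function, scorer, and toy bigram weights are placeholders, and the point of the project would be to go beyond treating such scorers as black boxes.

```python
import heapq

def beam_search(start, expand, score, beam_width=5, max_steps=20):
    """Keep only the beam_width highest-scoring partial hypotheses at each step.
    `expand` maps a hypothesis to its successors and `score` assigns a
    higher-is-better score; both are task-specific."""
    beam = [start]
    for _ in range(max_steps):
        candidates = [succ for hyp in beam for succ in expand(hyp)]
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=score)
    return max(beam, key=score)

# Toy usage: find a high-scoring word sequence under a hand-written bigram scorer.
vocab = ["the", "cat", "sat"]
bigram = {("<s>", "the"): 1.0, ("the", "cat"): 2.0, ("cat", "sat"): 3.0}

def expand(seq):
    return [seq + [w] for w in vocab] if len(seq) < 4 else []

def score(seq):
    return sum(bigram.get(pair, -1.0) for pair in zip(seq, seq[1:]))

print(beam_search(["<s>"], expand, score, beam_width=3))
```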

Multi-task neural networks

Supervisor: Kenneth Heafield

Neural networks are advertised as a way to directly optimize an objective function instead of doing feature engineering.  However, current systems feature separately-trained word embeddings, language models, and translation models.  Part of the problem is that there are different data sets for each task.  This work will look at multi-task learning as a way to incorporate multiple data sets.  Possible uses include direct speech-to-speech translation or joint language and translation modeling.
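
A minimal sketch of what multi-task learning means in practice is shown below: one shared encoder with separate task heads (hard parameter sharing), trained by alternating batches from different datasets. The architecture, sizes, and task names are illustrative assumptions, not a proposed design.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """One shared encoder feeding two task heads, e.g. language modelling
    and translation; gradients from both tasks update the shared parameters."""

    def __init__(self, vocab_size=1000, dim=64, tgt_vocab_size=1200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # shared parameters
        self.lm_head = nn.Linear(dim, vocab_size)           # task 1: next-word prediction
        self.mt_head = nn.Linear(dim, tgt_vocab_size)        # task 2: translation scores

    def forward(self, tokens, task):
        hidden, _ = self.encoder(self.embed(tokens))
        return self.lm_head(hidden) if task == "lm" else self.mt_head(hidden)

model = MultiTaskModel()
loss_fn = nn.CrossEntropyLoss()
batch = torch.randint(0, 1000, (2, 7))     # two sentences of seven token ids
targets = torch.randint(0, 1000, (2, 7))
# In training, batches from the different datasets would be alternated,
# each updating the shared encoder through its own head.
logits = model(batch, task="lm")
print(loss_fn(logits.reshape(-1, 1000), targets.reshape(-1)).item())
```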

Incremental Processing with Neural Networks

Design neural network models that can process input incrementally (word by word) and apply them to NLP tasks such as part of speech tagging, parsing, or semantic role labeling.

Supervisor: Frank Keller

Incremental sentence processing, i.e., the construction of representations on a word-by-word basis as the input unfolds, is not only a central property of human language processing, it is also crucial for NLP systems that need to work in real time, e.g., for sentence completion, speech translation, or dialogue. However, the deep learning techniques used in most state-of-the-art NLP make it hard to achieve incrementality. Even architectures that can in principle work incrementally (such as recurrent neural networks) are in practice used in a bidirectional fashion, as they require right context for good performance.

The aim of this project is to develop novel neural architectures that can perform tasks such as part of speech tagging, parsing, or semantic role labeling incrementally. The problem decomposes into two parts: (1) design features that are maximally informative, even though they don't have access to right context; (2) develop learning algorithms that work well with the incomplete input available during incremental processing. For both parts, ideas from existing generative models for incremental processing can be leveraged. In particular, the idea of computing expected completions of a sentence prefix could be integrated into the cost function used for neural network training.
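
As a minimal illustration of the incrementality constraint, the sketch below tags a sentence with a unidirectional recurrent network that emits a tag as soon as each word arrives; the vocabulary size, tag set, and token ids are placeholders, and the model is untrained.

```python
import torch
import torch.nn as nn

class IncrementalTagger(nn.Module):
    """Unidirectional GRU tagger: the prediction for word t depends only on
    words 1..t, so a tag can be emitted as each word arrives. A bidirectional
    encoder, by contrast, would need the whole sentence first."""

    def __init__(self, vocab_size=5000, num_tags=45, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, num_tags)

    def forward(self, token_ids):
        h = torch.zeros(1, self.rnn.hidden_size)
        tags = []
        for tok in token_ids:              # process the sentence word by word
            h = self.rnn(self.embed(tok).unsqueeze(0), h)
            tags.append(self.out(h).argmax(dim=-1).item())
        return tags

tagger = IncrementalTagger()
sentence = torch.tensor([12, 7, 103, 4])   # hypothetical token ids
print(tagger(sentence))                    # one (untrained) tag per prefix
```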

Structured representations of images and text

Develop models for learning structured representations of images and exploit them for language/vision tasks such as image description or visual question answering.

Supervisor: Frank Keller

The web is awash with image data: on Facebook alone, 350 million new images are uploaded every day. These images typically co-occur with textual data such as comments, captions, or tags. Well-established techniques exist for extracting structure from text, but image structure is a much less explored area. Prior work has shown that images can be represented as graphs expressing the relations between the objects in the image. One example is Visual Dependency Representations (VDRs), which can be aligned with textual data and used for image description.

The aim of this project is to explore the use of structured image representations such as VDRs. Ideas for topics include: (1) Developing new ways of learning structured representations from images with various forms of prior annotation (using structure-aware deep learning techniques, e.g., recursive neural networks). (2) Augmenting existing structured representations to be more expressive (e.g., by representing events, attributes, scene type, background knowledge). (3) Developing models that exploit VDRs for new language/vision tasks, e.g., dense image description, visual question answering, story illustration, induction of commonsense knowledge.

Maintaining negative polarity in statistical machine translation

Supervisor: Bonnie Webber

Negative assertions, negative commands, and many negative questions all convey the opposite of their corresponding positives. Statistical machine translation (SMT), for all its other success, cannot be trusted to get this right: it may incorrectly render a negative clause in the source language as positive in the target; render negation of an embedded source clause as negation of the target matrix clause; render negative quantification in the source text as positive quantification in the target; or render negative quantification in the source text as verbal negation in English, thereby significantly changing the meaning conveyed.

The goal of this research is a robust, language-independent method for improving accuracy in the translation of negative sentences in SMT. To assess negation-specific improvements, a bespoke evaluation metric must also be developed, to complement the standard SMT BLEU score.

Using discourse relations to inform sentence-level statistical MT

Supervisor: Bonnie Webber

Gold standard annotation of the Penn Discourse TreeBank has enabled the development of methods for disambiguating the intended sense of an ambiguous discourse connective such as since or while, as well as for suggesting discourse relation(s) likely to hold between adjacent sentences that are not marked with a discourse connective.

Since a discourse connective and its two arguments can be viewed in terms of constraints that hold pairwise between the connective and each argument, or between the two arguments, we should be able to use these constraints in Statistical MT, either in decoding or in re-ranking, preferring translations that are compatible with the constraints.  One might start this work either by looking at rather abstract, high-frequency discourse relations such as contrast, which impose rather weak pairwise constraints, or by looking at rather specific, low-frequency relations such as chosen alternative, which impose very strong constraints between the arguments.

Entity-coherence and statistical MT

Supervisor: Bonnie Webber 

Entity-based coherence has been used in both Natural Language Generation (Barzilay and Lapata, 2008; Elsner and Charniak, 2011) and essay scoring (Miltsakaki and Kukich, 2004), reflecting the observation that texts displaying the entity-coherence patterns of well-written texts from a given genre are judged to be better written than texts that do not display these patterns. Recently, Guinaudeau and Strube (2013) have shown that matrix operations can be used to compute entity-based coherence very efficiently.
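
A minimal sketch of the matrix view is given below: sentences are rows of an entity incidence matrix, projecting the bipartite sentence-entity graph onto sentences counts the entities shared by sentence pairs, and a simple coherence score averages the overlap of adjacent sentences. The entities and matrix are hand-filled toy data, and this is a simplification of the published model rather than a reimplementation of it.

```python
import numpy as np

# Sentences represented by the entities they mention (after coreference).
entities = ["Obama", "speech", "Congress"]
M = np.array([  # rows: sentences, cols: entities (1 = entity mentioned)
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
])

# Projecting the bipartite sentence-entity graph onto sentences:
# entry (i, j) counts entities shared by sentences i and j.
S = M @ M.T
np.fill_diagonal(S, 0)

# One simple coherence score: average number of shared entities between
# adjacent sentences (higher = more entity-coherent ordering).
coherence = np.mean([S[i, i + 1] for i in range(len(M) - 1)])
print(coherence)
```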

This project aims to apply the same insights to Statistical MT, and to assess whether sentence-level SMT can be improved either by promoting translations that better match natural patterns of entity-based coherence, or by using those patterns to produce better translations in the first place.

Improving the translation of modality in SMT

Supervisor: Bonnie Webber   

Modal utterances are common in both argumentative (rhetorical) text and instructions. This project considers the translation of such texts (for example, TED talks as representative of argumentative text) and whether the translation of modality can be improved by considering discourse-level features of such texts. For example, there may be useful constraints between adjacent sentences or clauses, such that the appearance of a modal marker in one increases or decreases the likelihood of some kind of modal marker in the other.

Low-resource language and speech processing

Supervisors: Adam Lopez, Sharon Goldwater

The most effective language and speech processing systems are based on statistical models learned from many annotated examples, a classic application of machine learning on input/output pairs. But for many languages and domains we have little data, and even where we do have data, it is often government or news text. For the vast majority of languages and domains, there is hardly anything. However, in many cases there is side information that we can exploit: dictionaries or other knowledge sources, or text paired with weak signals, such as images, speech, or timestamps. How can we exploit such heterogeneous information in statistical language processing? The goal of projects in this area is to develop statistical models and inference techniques that exploit such data, and to apply them to real problems.

What do deep neural models really learn about language?

Supervisor: Adam Lopez

Deep learning researchers claim that their models learn to represent linguistic properties of language without any explicit guidance. But recent results hint that they are simply very good at memorizing local correlations. What do these models really learn about language? We will use advanced techniques to probe their representations, and invent new techniques where we need to.
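
One standard probing technique is a diagnostic classifier: freeze the model's hidden vectors and test whether a simple linear model can predict a linguistic property from them. The sketch below uses random vectors and labels as stand-ins for real representations and annotations, purely to show the shape of the experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2000, 128))   # one frozen vector per token
pos_tags = rng.integers(0, 12, size=2000)      # property to probe, e.g. POS tag

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, pos_tags, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# If the probe beats a sensible baseline, the property is linearly decodable
# from the representations; careful controls are needed to rule out memorisation.
print("probe accuracy:", probe.score(X_test, y_test))
```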

Deep probabilistic models of graphs

Supervisor:  Adam Lopez

For a system to understand a text and answer questions about it, the system must distill the meaning of the text into a set of facts (semantic parsing). We can represent these facts as a graph: entities and events become nodes, and relationships between them become edges. We now have datasets that pair text with such graphs, and we'd like to learn a semantic parser from this data, so we need to model graphs. How do we design and use deep probabilistic models of graphs?
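
For concreteness, the sketch below encodes one such graph for a single sentence, with the event and its participants as nodes and semantic roles as edge labels; the sentence, role names, and use of networkx are illustrative choices rather than a commitment to any specific formalism such as AMR.

```python
import networkx as nx

# Toy semantic graph for "Mary sold the book to John".
g = nx.DiGraph()
g.add_edge("sell-event", "Mary", role="agent")
g.add_edge("sell-event", "book", role="theme")
g.add_edge("sell-event", "John", role="recipient")

# A question-answering system can traverse the graph to answer
# "Who received the book?" by following the recipient edge.
recipients = [v for u, v, d in g.edges(data=True) if d["role"] == "recipient"]
print(recipients)  # ['John']
```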

Incremental interpretation for robust NLP using CCG and dependency parsing

Supervisor: Mark Steedman

Combinatory Categorial Grammar (CCG) is a computational grammar formalism that has recently been used widely in NLP applications including wide-coverage parsing, generation, and semantic parser induction.  The present project seeks to apply insights from these and other sources including dependency parsing to the problem of incremental word-by-word parsing and interpretation using statistical models.  Possible evaluation tasks include language modeling for automatic speech recognition, as well as standard parsing benchmarks.

Data-driven learning of temporal semantics for NLP

Supervisor: Mark Steedman

Mike Lewis' Edinburgh thesis (2015) shows how to derive a natural language semantics for wide coverage parsers that directly captures relations of paraphrase and entailment, using machine learning and parser-based machine reading of large amounts of text.  The present project seeks to extend the semantics to temporal and causal relations between events, such as that being somewhere is the consequent state of arriving there, using large amounts of timestamped text.

Constructing large knowledge graphs from text using machine reading

Supervisor: Mark Steedman

Knowledge graphs like Freebase are constructed by hand using relation labels that are not easy to map onto natural language semantics, especially for languages other than English. An obvious alternative is to build the knowledge graph in terms of language-independent natural language semantic relations, which, following Lewis and Steedman (2013b), can be mined by machine reading from multi-lingual text.  The project will investigate the extension of the language-independent semantics and its application to the construction of large knowledge resources using parser-based machine reading.

Semantic Parsing for Sequential Question Answering

Supervisor: Mirella Lapata

Semantic parsing maps natural language queries into machine interpretable meaning representations (e.g., logical forms or computer programs).  These representations can be executed in a task-specific environment to help users navigate a database, compare products, or reach a decision. Semantic parsers to date can handle queries of varying complexity and a wealth of representations including lambda calculus, dependency-based compositional semantics, variable-free logic, and SQL.
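
For concreteness, the sketch below pairs a question with one possible meaning representation (SQL) and executes it against a tiny in-memory database; the question, schema, and parse are made-up examples rather than drawn from any dataset.

```python
import sqlite3

question = "Which rivers are longer than 2000 km?"
meaning_representation = "SELECT name FROM rivers WHERE length_km > 2000"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rivers (name TEXT, length_km REAL)")
conn.executemany("INSERT INTO rivers VALUES (?, ?)",
                 [("Danube", 2850), ("Thames", 346), ("Volga", 3531)])

# Executing the meaning representation in its environment yields the answer,
# which is how semantic parsers are commonly evaluated (denotation match).
print(conn.execute(meaning_representation).fetchall())  # [('Danube',), ('Volga',)]
```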

However, the bulk of existing work has focused on isolated queries, ignoring the fact that most natural language interfaces receive inputs in streams. Users typically ask questions or perform tasks in multiple steps, and they often decompose a complex query into a sequence of inter-related sub-queries. The aim of this project is to develop novel neural architectures for training semantic parsers in a context-dependent setting. The task involves simultaneously parsing individual queries correctly and resolving co-reference links between them. An additional challenge involves eliciting datasets which simulate the task of answering sequences of simple but inter-related questions.