Language Processing topics

A list of potential topics for PhD students in the area of Language Processing.

Concurrency in (computational) linguistics

Improving understanding of synchronic and diachronic aspects of phonology.

Supervisor: Julian Bradfield

In several aspects of linguistic analysis, it is natural to think of some form of concurrent processing, not least because the brain is a massively concurrent system. This is particularly true in phonology and phonetics, and descriptions such as feature analyses, and especially autosegmental phonology, go some way to recognizing this. Although there has been some work on rigorous formal models of such descriptions, there has been little if any application of the extensive body of research in theoretical computer science on concurrent processes. Such a project has the potential to give better linguistic understanding of synchronic and diachronic aspects of phonology and perhaps syntax, and even to improve speech generation and recognition, by adding formal underpinning and improvement to the existing agent-based approaches.

Spectral learning for natural language processing

Supervisors: Shay Cohen, Mirella Lapata

Latent variable modeling is a common technique for improving the expressive power of natural language processing models. The values of these latent variables are not observed in the data, yet we still need to predict them and to estimate the model parameters under the assumption that these variables exist in the model. This project seeks to improve the expressive power of NLP models at various levels (such as morphological, syntactic and semantic) using latent variable modeling, and also to identify key techniques, based on spectral algorithms, for learning these models. The family of latent-variable spectral learning algorithms is an exciting recent development seeded in the machine learning community. It presents a principled, well-motivated approach for estimating the parameters of latent variable models using tools from linear algebra.
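
As a minimal illustration of the linear-algebraic core of the spectral approach (a made-up two-state, six-word toy model using NumPy, not any specific algorithm from the literature): a second-order moment matrix of observations is low-rank, and its SVD exposes the latent dimensionality and the subspace in which spectral algorithms recover the parameters.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: 2 hidden states, 6 observable words.
pi = np.array([0.6, 0.4])              # P(hidden state)
O = rng.dirichlet(np.ones(6), size=2)  # P(word | state), shape (2, 6)

# Second-order moment P(word1, word2) when both words are emitted from the same state.
P12 = O.T @ np.diag(pi) @ O            # shape (6, 6), rank at most 2

# The SVD of the (empirical) moment matrix reveals the number of latent states
# and gives the subspace used by spectral parameter-recovery algorithms.
U, s, Vt = np.linalg.svd(P12)
print(np.round(s, 4))                  # only two singular values are non-zero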

Natural language semantics and question answering

Supervisor: Shay Cohen

How can we make computers understand language? This is a question at the core of the area of semantics in natural language processing. Question answering, an NLP application in which a computer is expected to respond to natural language questions, provides a lens to look into this challenge. Most modern search engines offer some level of functionality for factoid question answering. These systems have high precision, but their recall could be significantly improved. From a technical perspective, question answering offers a sweet spot between challenging semantics problems that are not expected to be solved in the near future and problems that will be solved in the foreseeable future. As such, it is an excellent test-bed for semantic representation theories and for other attempts at describing the meaning of text. The most recent development in question answering is the retrieval of answers from open knowledge bases such as Freebase (a factoid database of various facts without a specific domain tying them all together). The goal of this project is to explore various methods for improving semantic representations of language, with open question answering being potentially an important application for testing them. These semantic representations can either be symbolic (enhanced with a probabilistic interpretation) or projections into a continuous geometric space. Both of these ideas have recently been explored in the literature.

Topics in morphology (NLP or cognitive modelling)

Supervisor: Sharon Goldwater

Many NLP systems developed for English ignore the morphological structure of words and (mostly) get away with it. Yet morphology is far more important in many other languages. Handling morphology appropriately can reduce sparse data problems in NLP, and understanding human knowledge of morphology is a long-standing scientific question in cognitive science. New methods in both probabilistic modeling and neural networks have the potential to improve word representations for downstream NLP tasks and perhaps to shed light on human morphological acquisition and processing. Projects in this area could involve working to combine distributional syntactic/semantic information with morphological information to improve word representations for low-resource languages or sparse datasets, evaluating new or existing models of morphology against human behavioral benchmarks, or related topics.

Directly learning phrases in machine translation

Supervisor: Kenneth Heafield

Machine translation systems memorize phrasal translations so that they can translate chunks of text at a time.  For example, the system memorizes that the phrase "Chambre des représentants" translates as "House of Representatives" with some probability.  These phrases are currently identified using heuristics on top of word translations. However, word-level heuristics inadequately capture non-compositional phrases like "hot dog".
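
A minimal sketch of the kind of word-level heuristic referred to above, assuming the standard consistency criterion over a word alignment (toy example only; real extractors such as the one in Moses add length limits and handle unaligned words):

def extract_phrases(src_len, alignment, max_len=4):
    """alignment: set of (src_idx, tgt_idx) word-alignment points."""
    pairs = []
    for i1 in range(src_len):
        for i2 in range(i1, min(src_len, i1 + max_len)):
            # target positions aligned to the source span [i1, i2]
            tgt = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt:
                continue
            j1, j2 = min(tgt), max(tgt)
            # consistency: nothing inside [j1, j2] may align outside [i1, i2]
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append(((i1, i2), (j1, j2)))
    return pairs

# "Chambre des representants" / "House of Representatives": aligned 0-0, 1-1, 2-2
print(extract_phrases(3, {(0, 0), (1, 1), (2, 2)}))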

This project will look at ways to replace word-level heuristics by learning phrases directly from translated text.  Prior work (DeNero et al., 2006) has sought to segment translated sentences into translated phrases, but failed to address overlapping phrases.  Potential topics include models for overlapping phrases, the large number of parameters that results from considering all pairings of phrases, discriminatively optimizing phrases directly for translation quality, and modeling compositionality.

Tera-scale language models

Supervisor: Kenneth Heafield

Given a copy of the web, how well can we predict the next word you will type or figure out what sentence you said?  We have built language models on trillions of words of text and seen that they do improve quality.  Is this the best model?  Can we use 10x more data?  If we need to query the models on machines without terabytes of RAM, where should we approximate and where should we put the remaining data?  This project is about both the systems aspect of dealing with large amounts of data and the modeling questions of quality and approximation.  By jointly looking at systems and modeling we hope to arrive at the best language models of all sizes, rather than limiting the data we use.
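
For concreteness, here is a minimal sketch of what querying a backoff n-gram model involves (made-up log-probabilities stored in Python dicts; a model built from trillions of words would hold the same information in compact tries or hash tables, which is where the approximation and data-placement questions arise):

# log10 probabilities and backoff weights for a tiny, made-up model
logprob = {("the",): -1.0, ("cat",): -2.0, ("the", "cat"): -0.5}
backoff = {("the",): -0.3}

def score(history, word):
    ngram = tuple(history) + (word,)
    if ngram in logprob:
        return logprob[ngram]              # the full n-gram was seen in training
    if not history:
        return logprob.get((word,), -7.0)  # unknown-word penalty
    # back off: pay the backoff weight and retry with a shorter history
    return backoff.get(tuple(history), 0.0) + score(history[1:], word)

print(score(("the",), "cat"))  # seen bigram: -0.5
print(score(("a",), "cat"))    # backs off to the unigram: -2.0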

Optimizing structured objectives

Supervisor: Kenneth Heafield

Natural language systems often boil down to search problems: find high-scoring options in a structured but exponential space.  Our ability to efficiently search these spaces limits what types of features can be used and, ultimately, the quality of the system.  This project is about growing the class of efficient features by making better search algorithms.  We would like to support long-distance continuous features like neural networks and discrete features such as parsing.  While most people treat features as a black box that returns final scores, we will open the box and improve search in ways that exploit the internal structure of natural language features.  Applications include machine translation, speech recognition, and parsing.
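
As a minimal sketch of the search problem (generic beam search, with the scoring function treated as the usual black box; the expand and score functions here are toy stand-ins for real feature models over translations, transcriptions, or parses):

import heapq

def beam_search(start, expand, score, beam_size=5, steps=10):
    """expand(state) -> candidate next states; score(state) -> float."""
    beam = [start]
    for _ in range(steps):
        candidates = [s for state in beam for s in expand(state)]
        if not candidates:
            break
        # keep only the beam_size highest-scoring partial structures
        beam = heapq.nlargest(beam_size, candidates, key=score)
    return max(beam, key=score)

# Toy example: build a string letter by letter, rewarding occurrences of "ab".
print(beam_search("", lambda s: [s + c for c in "ab"],
                  lambda s: s.count("ab"), beam_size=3, steps=6))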

Multi-task neural networks

Supervisor: Kenneth Heafield

Neural networks are advertised as a way to directly optimize an objective function instead of doing feature engineering.  However, current systems feature separately-trained word embeddings, language models, and translation models.  Part of the problem is that there are different data sets for each task.  This work will look at multi-task learning as a way to incorporate multiple data sets.  Possible uses include direct speech-to-speech translation or joint language and translation modeling.
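
A minimal sketch of hard parameter sharing, the most common form of multi-task learning (hypothetical layer sizes, written with PyTorch purely for illustration): one shared encoder is trained on several data sets, each with its own output head.

import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, vocab=1000, dim=64, n_tags=17, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)                # shared word embeddings
        self.encoder = nn.LSTM(dim, dim, batch_first=True)   # shared encoder
        self.tag_head = nn.Linear(dim, n_tags)               # e.g. a tagging/LM task
        self.cls_head = nn.Linear(dim, n_classes)            # e.g. a sentence-level task

    def forward(self, tokens, task):
        states, _ = self.encoder(self.embed(tokens))
        if task == "tag":
            return self.tag_head(states)                     # one prediction per token
        return self.cls_head(states[:, -1])                  # one prediction per sentence

model = MultiTaskModel()
x = torch.randint(0, 1000, (8, 12))                          # a batch of token ids
print(model(x, "tag").shape, model(x, "cls").shape)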

Cognitive Modeling with Neural Attention

Leverage attention-based deep learning approaches to build cognitive models of human language processing or human visual processing

Supervisor: Frank Keller

Recent advances in deep learning have used attention mechanisms as a way of focusing the processing of a neural network on certain parts of the input. This has proved successful for diverse applications such as image description, question answering, and machine translation. Attention is also a natural way of understanding human cognitive processing: during language processing, humans attend to words in a certain order; during visual processing, they view image regions in a certain sequence. Crucially, human attention can be captured precisely using an eye-tracker, a device that measures which parts of the input the eye fixates, and for how long.
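
For reference, the core of such an attention mechanism is simply a distribution over input positions; the sketch below uses plain dot-product attention with random toy vectors (in a real model the query and key representations are learned):

import numpy as np

def attention_weights(query, keys):
    scores = keys @ query                 # one score per word / image region
    exp = np.exp(scores - scores.max())   # softmax over positions
    return exp / exp.sum()

keys = np.random.default_rng(1).standard_normal((5, 8))  # 5 positions, dim 8
query = keys[2] + 0.1                                     # a query similar to position 2
print(np.round(attention_weights(query, keys), 3))        # mass concentrates on position 2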

The aim of this project is to leverage neural attention mechanisms to model aspects of human attention. Examples include reading: when reading text, humans systematically skip words, spend more time on difficult words, and sometimes re-read passages. Another example is visual search: when looking for a target, humans make a sequence of fixations which depend on a diverse range of factors, such as visual salience, scene type, and object context. Neural attention models that capture such behaviors need to combine different types of knowledge, while also offering a cognitively plausible story of how such knowledge is acquired, often based on only small amounts of training data.

Incremental Processing with Neural Networks

Design neural network models that can process input incrementally (word by word) and apply them to NLP tasks such as part of speech tagging, parsing, or semantic role labeling

Supervisor: Frank Keller

Incremental sentence processing, i.e., the construction of representations on a word-by-word basis as the input unfolds, is not only a central property of human language processing, it is also crucial for NLP systems that need to work in real time, e.g., for sentence completion, speech translation, or dialogue. However, the deep learning techniques used in most state-of-the-art NLP make it hard to achieve incrementality. Even architectures that can in principle operate incrementally (such as recurrent neural networks) are in practice used in a bidirectional fashion, as they require right context for good performance.

The aim of this project is to develop novel neural architectures that can perform tasks such as part of speech tagging, parsing, or semantic role labeling incrementally. The problem decomposes into two parts: (1) design features that are maximally informative, even though they don't have access to right context; (2) develop learning algorithms that work well with the incomplete input available during incremental processing. For both tasks, ideas from existing generative models for incremental processing can be leveraged. In particular, the idea of computing the expected completions of a sentence prefix could be integrated into the cost function used for neural network training.
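
A minimal sketch of what an incremental tagger looks like (hypothetical sizes, PyTorch for illustration): a unidirectional recurrent cell consumes one word at a time and emits a label before seeing any right context, which is exactly the constraint that bidirectional encoders violate.

import torch
import torch.nn as nn

class IncrementalTagger(nn.Module):
    def __init__(self, vocab=1000, dim=64, n_tags=17):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, n_tags)

    def step(self, word_id, hidden):
        """Consume one word, return (predicted tag, new hidden state)."""
        h = self.rnn(self.embed(word_id), hidden)
        return self.out(h).argmax(dim=-1), h

tagger = IncrementalTagger()
hidden = torch.zeros(1, 64)
for w in [12, 7, 431, 5]:                 # token ids arriving one at a time
    tag, hidden = tagger.step(torch.tensor([w]), hidden)
    print(int(tag))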

Structured representations of images and text

Develop models for learning structured representations of images and exploit them for language/vision tasks such as image description or visual question answering 

Supervisor: Frank Keller

The web is awash with image data: on Facebook alone, 350 million new images are uploaded every day. These images typically co-occur with textual data such as comments, captions, or tags. Well-established techniques exist for extracting structure from text, but image structure is a much less explored area. Prior work has shown that images can be represented as graphs expressing the relations between the objects in the image. One example is Visual Dependency Representations (VDRs), which can be aligned with textual data and used for image description.

The aim of this project is to explore the use of structured image representations such as VDRs. Ideas for topics include: (1) Developing new ways of learning structured representations from images with various forms of prior annotation (using structure-aware deep learning techniques, e.g., recursive neural networks). (2) Augmenting existing structured representations to be more expressive (e.g., by representing events, attributes, scene type, background knowledge). (3) Developing models that exploit VDRs for new language/vision tasks, e.g., dense image description, visual question answering, story illustration, induction of commonsense knowledge.

Language Resources Extraction from Social Media

Supervisor: Walid Magdy

Social media data contains large amounts of information, knowledge, and resources that are generated by users every second. Although it might look noisy at first glance, applying novel data mining techniques allows us to extract valuable language resources that can be used as training data for different NLP tasks. The typical research question in this field is how to automatically extract data from social media that can be used, in a distant-supervision manner, to train machine learning models for different NLP applications such as machine translation, classification, paraphrasing, NER, multilingual word/sentence embeddings, and sentiment analysis.
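
As a minimal sketch of the distant-supervision idea (toy posts and a hypothetical emoticon heuristic; a real pipeline would filter, deduplicate, and balance the data): signals the users themselves provide, such as emoticons, serve as noisy labels for a sentiment classifier, so no manual annotation is needed.

POS, NEG = {":)", ":D"}, {":(", ":'("}

def distant_label(post):
    tokens = post.split()
    has_pos = any(t in POS for t in tokens)
    has_neg = any(t in NEG for t in tokens)
    if has_pos == has_neg:
        return None                        # ambiguous or unlabelled: discard
    text = " ".join(t for t in tokens if t not in POS | NEG)
    return (text, "positive" if has_pos else "negative")

posts = ["great match today :)", "missed the bus again :(", "just woke up"]
training_data = [ex for ex in map(distant_label, posts) if ex is not None]
print(training_data)   # ready to feed to any supervised sentiment model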

User Behaviour Analysis on Social Media

Supervisor: Walid Magdy

Current trends in computational social science show that data science, in the form of machine learning, NLP, and data mining, can be used to learn about human behaviour from people's activities on social media networks. Examples of research questions include: Why do some people use hate speech on social media? Why do voters decide to vote for a given party or candidate? How can we predict the voting decisions of individuals from their social media posts? How can we detect people on social media suffering from depression or other mental health issues? How can we detect fake accounts or bots? How can we detect child grooming before it happens from social media communications? Some current techniques in NLP, ML, and network analysis can be applied directly to such research questions; however, new techniques sometimes need to be developed for an accurate and representative analysis. Working in this area requires good knowledge of both data science and social/political science.

Maintaining negative polarity in statistical machine translation

Supervisor: Bonnie Webber

Negative assertions, negative commands, and many negative questions all convey the opposite of their corresponding positives. Statistical machine translation (SMT), for all its other successes, cannot be trusted to get this right: it may incorrectly render a negative clause in the source language as positive in the target; render negation of an embedded source clause as negation of the target matrix clause; render negative quantification in the source text as positive quantification in the target; or render negative quantification in the source text as verbal negation in English, thereby significantly changing the meaning conveyed.

The goal of this research is a robust, language-independent method for improving accuracy in the translation of negative sentences in SMT. To assess negation-specific improvements, a bespoke evaluation metric must also be developed, to complement the standard SMT BLEU score.

Using discourse relations to inform sentence-level statistical MT

Supervisor: Bonnie Webber

Gold standard annotation of the Penn Discourse TreeBank has enabled the development of methods for disambiguating the intended sense of an ambiguous discourse connective such as since or while, as well as for suggesting discourse relation(s) likely to hold between adjacent sentences that are not marked with a discourse connective.

Since a discourse connective and its two arguments can be viewed in terms of constraints that hold pairwise between the connective and each argument, or between the arguments, we should be able to use these constraints in Statistical MT, either in decoding or in re-ranking, preferring translations that are compatible with the constraints.  One might start this work either by looking at rather abstract, high-frequency discourse relations such as contrast, which have rather weak pairwise constraints, or by looking at rather specific, low-frequency relations such as chosen alternative, which have very strong constraints between the arguments.

Entity-coherence and statistical MT

Supervisor: Bonnie Webber 

Entity-based coherence has been used in both Natural Language Generation (Barzilay and Lapata, 2008; Elsner and Charniak, 2011) and essay scoring (Miltsakaki and Kukich, 2004), reflecting the observation that texts that display the entity-coherence patterns of well-written texts from the given genre are seen as being better written than texts that don't display these patterns. Recently, Guinaudeau and Strube (2013) have shown that matrix operations can be used to compute entity-based coherence very efficiently.
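
A minimal sketch of the matrix view of entity-based coherence, loosely in the spirit of Guinaudeau and Strube (2013) (a toy sentence-by-entity grid; their model additionally weights syntactic roles and normalises by sentence distance):

import numpy as np

entities = ["Obama", "bill", "Senate"]
grid = np.array([[1, 1, 0],     # sentence 1 mentions Obama, bill
                 [0, 1, 1],     # sentence 2 mentions bill, Senate
                 [1, 0, 1]])    # sentence 3 mentions Obama, Senate

# Sentence-by-sentence projection: entry (i, j) counts entities shared by sentences i and j.
shared = grid @ grid.T
# Simple local coherence score: average entity overlap between adjacent sentences.
n = len(grid)
score = sum(shared[i, i + 1] for i in range(n - 1)) / (n - 1)
print(shared)
print("coherence:", score)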

This project aims to apply the same insights to Statistical MT, and assess whether sentence-level SMT can be improved by promoting translations that better match natural patterns of entity-based coherence, or by getting better translations in the first place.

Improving the translation of modality in SMT

Supervisor: Bonnie Webber   

Modal utterances are common in both argumentative (rhetorical) text and instructions. This project considers the translation of such texts (for example, TED talks as representative of argumentative text) and whether the translation of modality can be improved by considering discourse-level features of such texts. For example, there may be useful constraints between adjacent sentences or clauses, such that the appearance of a modal marker in one increases/decreases the likelihood of some kind of modal marker in the other.

Modelling non-cooperative conversation

Supervisor: Alex Lascarides

Develop and implement a model of conversation that can handle cases where the agents' goals conflict.

Work on adversarial strategies from game theory and signalling theory lacks sophisticated models of linguistic meaning. Conversely, current models of natural language discourse typically lack models of human action and decision making that deal with situations where the agents' goals conflict.  The aim of this project is to fill this gap and in doing so provide a model of implicature in non-cooperative contexts.

This project involves analysing a corpus of human dialogues of users playing the game Settlers of Catan: a well-known adversarial negotiating game.  The corpus will be used to extend an existing state-of-the-art dynamic semantic model of dialogue content with a logically precise model of the agents' mental states and strategies.  The project will also involve implementing these ideas in a working dialogue system that extends an existing open source agent that plays Settlers, but that has no linguistic capabilities.

Interpreting hand gestures in face to face conversation

Supervisor: Alex Lascarides

Map hand shapes and movements into a representation of their form and meaning.

The technology for mapping an acoustic signal into a sequence of words and for estimating the position of pitch accents is very well established. But estimating which hand movements are communicative and which aren't, and which part of a communicative hand movement is the stroke or post-stroke hold (i.e., the parts of the move that convey meaning), is much less well understood. Furthermore, to build a semantic representation of the multimodal action, one must, for depicting gestures at least (that is, gestures whose form resembles their meaning), capture qualitative properties of their shape, position and movement (e.g., that the trajectory of the hand was a circle, or a straight line moving vertically upwards).  On the other hand, deictic gestures must be represented using quantitative values in 4D Euclidean space. Mapping hand movement to these symbolic and quantitative representations of form is also an unsolved problem.

The aim of this project is to create and exploit a corpus to learn mappings from communicative multimodal signals to the representation of their form, as required by an existing online grammar of multimodal action, which in turn is designed to yield (underspecified) representations of the meaning of the multimodal action.  We plan to use state-of-the-art models of visual processing with Kinect cameras to estimate hand positions and hand shapes, and to design Hidden Markov Models that exploit the visual signal, language models and gesture models to estimate the qualitative (and quantitative) properties of the gesture.
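
As a minimal sketch of the HMM component mentioned above (made-up transition probabilities and toy per-frame scores standing in for a real visual model): Viterbi decoding turns frame-level evidence into a segmentation of the movement into phases such as preparation, stroke, hold and retraction.

import numpy as np

states = ["prep", "stroke", "hold", "retract"]
trans = np.array([[0.7, 0.3, 0.0, 0.0],    # made-up phase-transition probabilities
                  [0.0, 0.7, 0.2, 0.1],
                  [0.0, 0.0, 0.8, 0.2],
                  [0.1, 0.0, 0.0, 0.9]])
with np.errstate(divide="ignore"):         # zero-probability transitions become -inf
    log_trans = np.log(trans)

def viterbi(log_emit):
    """log_emit: (frames, states) scores from a (hypothetical) visual model."""
    T, S = log_emit.shape
    best = np.zeros((T, S)); back = np.zeros((T, S), dtype=int)
    best[0] = log_emit[0]
    for t in range(1, T):
        cand = best[t - 1][:, None] + log_trans + log_emit[t][None, :]
        back[t] = cand.argmax(axis=0)
        best[t] = cand.max(axis=0)
    path = [int(best[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

frame_scores = np.log(np.array([[.7, .1, .1, .1], [.2, .6, .1, .1],
                                [.1, .6, .2, .1], [.1, .1, .7, .1],
                                [.1, .1, .1, .7]]))
print(viterbi(frame_scores))   # e.g. ['prep', 'stroke', 'stroke', 'hold', 'retract']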

The content of multimodal interaction

Supervisor: Alex Lascarides

To design, implement and evaluate a semantic model of conversation that takes place in a dynamic environment.

It is widely attested in descriptive linguistics that non-linguistic events dramatically affect the interpretation of linguistic moves and, conversely, linguistic moves affect how people perceive or conceptualise their environment.  For instance, suppose I look upset and so you ask me "What's wrong?"  I look over my shoulder towards a scribble on the living room wall, and then utter "Charlotte's been sent to her room".  An adequate interpretation of my response can be paraphrased as: Charlotte has drawn on the wall, and as a consequence she has been sent to her room.  In other words, you need to conceptualise the scribble on the wall as the result of Charlotte's actions; moreover, this non-linguistic event, with this description, is a part of my response to your question.  Traditional semantic models of dialogue don't allow for this type of interaction between linguistic and non-linguistic contexts.  The aim of this project is to fix this, by extending and refining an existing formal model of discourse structure to support the semantic role that non-linguistic events in context play in the messages that speakers convey. The project will draw on data from an existing corpus of people playing Settlers of Catan, where there are many examples of complex semantic relationships among the players' utterances and the non-linguistic moves in the board game.  The project involves formally defining a model of discourse structure that supports the interpretation of these multimodal moves, and developing a discourse parser through machine learning on the Settlers corpus.

Low-resource language and speech processing

Supervisors: Adam Lopez, Sharon Goldwater

The most effective language and speech processing systems are based on statistical models learned from many annotated examples, a classic application of machine learning on input/output pairs. But for many languages and domains we have little data, and even where we do have data, it is typically government or news text. For the vast majority of languages and domains, there is hardly anything. However, in many cases there is side information that we can exploit: dictionaries or other knowledge sources, or text paired with weak signals, such as images, speech, or timestamps. How can we exploit such heterogeneous information in statistical language processing? The goal of projects in this area is to develop statistical models and inference techniques that exploit such data, and to apply them to real problems.

What do deep neural models really learn about language?

Supervisor: Adam Lopez

Deep learning researchers claim that their models learn to represent linguistic properties of language without any explicit guidance. But recent results hint that they are simply very good at memorizing local correlations. What do these models really learn about language? We will use advanced techniques to probe their representations, and invent new techniques where we need to.
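
One standard probing technique is the diagnostic classifier; the sketch below (synthetic vectors standing in for real hidden states, scikit-learn for the classifier) shows the basic recipe: train a simple classifier to predict a linguistic property from frozen representations, and take its accuracy as evidence of whether the property is encoded.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, dim = 2000, 64
reps = rng.standard_normal((n, dim))             # stand-in for a model's hidden states
# A property such as "is this token a noun?", here (by construction) encoded in dimension 3.
labels = (reps[:, 3] + 0.1 * rng.standard_normal(n) > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(reps[:1000], labels[:1000])
print("probe accuracy:", probe.score(reps[1000:], labels[1000:]))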

Deep probabilistic models of graphs

Supervisor:  Adam Lopez

For a system to understand a text and answer questions about it, the system must distill the meaning of the text into a set of facts (semantic parsing). We can represent these facts as a graph: entities and events become nodes, and relationships between them become edges. We now have datasets that pair text with such graphs, and we'd like to learn a semantic parser from this data, so we need to model graphs. How do we design and use deep probabilistic models of graphs?

Incremental interpretation for robust NLP using CCG and dependency parsing

Supervisor: Mark Steedman

Combinatory Categorial Grammar (CCG) is a computational grammar formalism that has recently been used widely in NLP applications including wide-coverage parsing, generation, and semantic parser induction.  The present project seeks to apply insights from these and other sources including dependency parsing to the problem of incremental word-by-word parsing and interpretation using statistical models.  Possible evaluation tasks include language modeling for automatic speech recognition, as well as standard parsing benchmarks.

Data-driven learning of temporal semantics for NLP

Supervisor: Mark Steedman

Mike Lewis' Edinburgh thesis (2015) shows how to derive a natural language semantics for wide coverage parsers that directly captures relations of paraphrase and entailment using machine learning and parser-based machine reading of large amounts of text.  The present project seeks to extend the semantics to temporal and causal relations between events, such as that being somewhere is the consequent state of arriving there, using large amounts of timestamped text.

Constructing large knowledge graphs from text using machine reading

Supervisor: Mark Steedman

Knowledge graphs like Freebase are constructed by hand using relation labels that are not easy to map onto natural language semantics, especially for languages other than English. An obvious alternative is to build the knowledge graph in terms of language-independent natural language semantic relations, which, as Lewis and Steedman (2013b) show, can be mined by machine reading from multilingual text.  The project will investigate the extension of the language-independent semantics and its application to the construction of large knowledge resources using parser-based machine reading.

Statistical NLP for programming languages

Supervisor: Charles Sutton

Find syntactic patterns in corpora of programming language text.

The goal of this project is to apply advanced statistical techniques from natural language processing to a completely different and new textual domain: programming language text.  Think about how you program when you are using a new library or new environment for the first time. You "program by search engine", i.e., you search for examples of people who have used the same library, and you copy chunks of code from them. I want to systematize this process, and apply it at a large scale.  We have collected a corpus of 1.5 billion lines of source code from 8000 software projects, and we want to find syntactic patterns that recur across projects. These can then be presented to a programmer as she is writing code, providing autocomplete functionality that can suggest entire function bodies.  Statistical techniques involved include language modeling, data mining, and Bayesian nonparametrics.  This also raises some deep and interesting questions in software engineering, e.g., why do syntactic patterns occur in professionally written software when they could be refactored away?
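
As a minimal sketch of the language-modelling ingredient (a toy two-line corpus, whitespace tokenisation and raw trigram counts; a real system would use a proper lexer, smoothing and a corpus of billions of lines): given the code typed so far, the model suggests the most likely next tokens.

from collections import Counter, defaultdict

corpus = [
    "for ( int i = 0 ; i < n ; i ++ )",
    "for ( int j = 0 ; j < m ; j ++ )",
]

counts = defaultdict(Counter)
for line in corpus:
    tokens = ["<s>", "<s>"] + line.split()
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1             # trigram counts

def suggest(context):                       # context: the last two tokens typed
    dist = counts.get(tuple(context))
    return dist.most_common(3) if dist else []

print(suggest(["(", "int"]))                # e.g. [('i', 1), ('j', 1)]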