TITLE: Living with Dataset Biases in Automatic Summarization



Progress on NLP benchmarks over the past several years has been nothing short of astounding, thanks to pre-trained language models. Nevertheless, current systems often learn dataset-specific cues that correlate with performance, rather than the underlying task dynamics, which is a form of dataset bias. In this talk, I discuss how this phenomenon manifests in automatic summarization, both in content selection and in summary generation. First, I show that current systems rely heavily on the stereotypical discourse structure of a genre (e.g., news or scientific articles) to determine importance, rather than directly modelling the importance of the contents of the text. By exploiting this signal, we can develop an unsupervised summarization system for scientific articles that rivals the performance of supervised systems trained with hundreds of thousands of samples. Then, I discuss how abstractive summarization systems are limited by a lack of semantic understanding, which leads them, for example, to generate factually incorrect outputs. I present our work on addressing such issues, including methods that directly aim to correct the factuality problem, as well as new evaluations and modelling techniques that focus on semantic understanding and abstraction.


Jackie Chi Kit Cheung is an assistant professor at McGill University's School of Computer Science, where he co-directs the Reasoning and Learning Lab, and a Canada CIFAR AI Chair at the Mila Quebec AI Institute. His research focuses on natural language generation tasks such as automatic summarization, and on integrating diverse sources of knowledge into NLP systems for pragmatic and common-sense reasoning. He is motivated in particular by how the structure of the world can be reflected in the structure of language processing systems. Dr. Cheung was a Program Co-Chair of Canadian AI 2018, and received a best paper award at ACL 2018. He is a consulting researcher at Microsoft Research.



This event is co-organised by ILCC and by the UKRI Centre for Doctoral Training in Natural Language Processing (https://nlp-cdt.ac.uk).

