Thursday, 4th April - 11am - Piotr Nawrot: Seminar

Title: Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

 

Abstract:

Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression rates in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to a ~3.7x throughput increase in auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data, without adding any extra parameters. We find that DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA). GQA and DMC can even be combined to obtain compounded gains. As a result, DMC fits longer contexts and larger batches within any given memory budget.
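To make the scaling concrete, below is a minimal back-of-the-envelope sketch (not the speaker's code) of how KV-cache memory grows linearly with sequence length and batch size, and how an average 4x compression rate shrinks that budget. The model dimensions are assumptions based on Llama-2-7B-like settings (32 layers, 32 KV heads, head dimension 128, fp16), and the function and parameter names are illustrative only.

```python
# Illustrative sketch: rough KV-cache memory estimate for a Llama-2-7B-like
# model, showing the linear scaling with sequence length and batch size that
# DMC targets. All dimensions below are assumptions, not figures from the talk.

def kv_cache_bytes(seq_len, batch_size, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_elem=2, compression=1.0):
    """Bytes needed to cache keys and values (the factor of 2) for every
    token and layer, divided by an assumed average compression rate."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return seq_len * batch_size * per_token / compression

for compression in (1.0, 4.0):  # uncompressed cache vs. an average 4x compression
    gib = kv_cache_bytes(seq_len=4096, batch_size=16, compression=compression) / 2**30
    print(f"compression {compression:.0f}x -> {gib:.1f} GiB of KV cache")
# compression 1x -> 32.0 GiB of KV cache
# compression 4x -> 8.0 GiB of KV cache
```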

 

Bio:

Piotr Nawrot is a second-year PhD student with the UKRI CDT in NLP at the University of Edinburgh, advised by Edoardo Maria Ponti and Ivan Titov. Previously, he obtained a Bachelor's degree in Computer Science from the University of Warsaw and completed multiple internships at tech companies such as Nvidia and Meta AI. His research focuses broadly on improving the efficiency of neural models. More specifically, he is interested in learnable ways to compress a sequence of tokens, which could pave the way for tokenisation-free and more compute-optimal models.

This event is co-organised by ILCC and by the UKRI Centre for Doctoral Training in Natural Language Processing, https://nlp-cdt.ac.uk.

IF G.03