ICSA Colloquium Talk - 18/02/2021
A Study of Container Image Storage and How To Improve It
Containers have rapidly risen in popularity in recent years. Starting out as a lightweight alternative to virtual machines, containers have revolutionized how software is developed, deployed, and managed in production by enabling concepts such as micro-service architectures, continuous deployment, and serverless computing. At the core of containers is the container image, which includes all the runtime data necessary to start the containerized application. Large-scale image registries host images for sharing and distribution, and individual container clients pull images into local storage and provision them as root file systems for running containers. Registries today store millions of images, and containerized environments, e.g., for serverless computing, start and stop new containers within seconds. This places high pressure on the storage and networking infrastructure of both registries and individual clients.
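To make the pull-and-verify interaction between client and registry concrete (an illustrative sketch, not material from the talk): registries following the OCI/Docker conventions name image blobs by the SHA-256 digest of their content, and a client verifies each pulled blob by recomputing that digest and comparing it against the image manifest. A minimal Python sketch, where `layer` is a made-up stand-in for a real layer tarball:

```python
import hashlib

def digest(blob: bytes) -> str:
    # OCI/Docker registries identify blobs as "sha256:<hex>" of their content.
    return "sha256:" + hashlib.sha256(blob).hexdigest()

def verify_pull(blob: bytes, expected: str) -> bool:
    # After pulling a layer, the client recomputes the digest and checks it
    # against the digest listed in the image manifest.
    return digest(blob) == expected

layer = b"example layer tarball bytes"  # stand-in for a real compressed layer
ref = digest(layer)                     # what the manifest would record
assert verify_pull(layer, ref)          # an intact pull verifies
assert not verify_pull(b"corrupted", ref)  # a corrupted blob does not
```

Because blobs are content-addressed, identical layers shared by different images resolve to the same digest, which is what makes registry-side caching and deduplication effective in the first place.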
In this talk, we will give an overview of container image storage and present ways of dealing with the increasing load at both the registry and the client side. The talk consists of three main parts. First, we present an analysis of request traces from an IBM production registry. We find that image accesses are heavily skewed and that some accesses can be predicted accurately from previous requests. Based on these observations, we propose a two-tier caching mechanism and a prefetching scheme that reduce response latencies by an order of magnitude. Second, we conduct an analysis of over 350,000 container images, looking at aspects such as compressibility, size, and duplicate content. We find that only 3% of files in our image data set are unique, revealing a large potential for deduplication. Based on this finding, we present a deduplication-enabled container registry that reduces storage consumption by up to 6.9x while simultaneously providing up to 2.8x lower request latencies compared to other deduplication approaches. Third, we look at how image retrievals can be sped up by hosting local images in a distributed file system and making individual clients collaborate when retrieving images. Our system, Wharf, improves pull latencies by up to 12x while significantly reducing network traffic.
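To illustrate why a 3% unique-file rate translates into large storage savings (a toy sketch of file-level deduplication in general, not the registry design presented in the talk): each unique file is stored exactly once, keyed by its content hash, and every image only records references to those stored files.

```python
import hashlib

class DedupStore:
    """Toy file-level deduplicated store: identical file contents stored once."""

    def __init__(self):
        self.blobs = {}   # content hash -> file bytes (stored only once)
        self.images = {}  # image name -> {path: content hash}

    def add_image(self, name, files):
        # files maps a path inside the image to its content bytes.
        manifest = {}
        for path, content in files.items():
            h = hashlib.sha256(content).hexdigest()
            self.blobs.setdefault(h, content)  # skip storage if already present
            manifest[path] = h
        self.images[name] = manifest

    def stored_bytes(self):
        return sum(len(b) for b in self.blobs.values())

# Two image versions sharing a base library: the shared file is stored once.
store = DedupStore()
store.add_image("app:v1", {"/bin/app": b"v1", "/lib/libc": b"libc-bytes"})
store.add_image("app:v2", {"/bin/app": b"v2", "/lib/libc": b"libc-bytes"})
```

Here the two images together reference four files but the store keeps only three blobs, since `/lib/libc` is shared; with real image data sets, where the vast majority of files recur across images, the same mechanism yields the kind of multi-x storage reduction described above.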
Lukas Rupprecht is a Research Staff Member in the Storage Systems Department at IBM Research - Almaden. His research focuses on creating novel storage solutions to meet the demands of modern, cloud-native platforms and applications. He is also interested in provenance and in using it to provide reproducibility for complex machine learning pipelines. Before joining IBM Research, he received his PhD in computer science from Imperial College London, where he worked on approaches to improve the interaction of big data processing systems with the network.