PPar Lunch Series 2017/18

PPar lunch series schedule for 2017/18

Semester 1

Date Speaker                     Title/Abstract
27-Sep-17 Murray Cole Introducing Cohort 4
4-Oct-17 Chris Cummins  
11-Oct-17 No Speaker. Meet to chat.
18-Oct-17 Hugh Leather Introducing Compucast
25-Oct-17 Martin Rüfenacht Playing with the Bandwidth of Recursive Multiplying
1-Nov-17 Arpit Joshi
Architectural Support for Persistent Memory

Emerging non-volatile memory technologies (like 3D XPoint) enable fast, fine-grained persistence compared to slow block-based devices (like disks). However, ensuring consistency of data structures in non-volatile (persistent) memory is a challenge. Ordering and atomic durability are two primitives that can be used to ensure that updates to persistent memory happen in a consistent manner. In this talk, we will see that current support for ordering (persist barriers) and for atomic durability (software logging) adds cache line flushes to the critical path. As a solution to this problem, we first propose an efficient persist barrier that reduces the number of cache line flushes in the critical path. We then present ATOM, a hardware log manager based on undo logging that performs the logging operation out of the critical path.
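
As a rough software analogy of the undo-logging idea (all names below are hypothetical; ATOM itself implements this in hardware, off the critical path), a crash-consistent update can be sketched as:

```python
# Toy software analogue of undo logging for crash-consistent updates.
# In real persistent memory, each log append and each in-place update
# needs a cache line flush plus a fence -- exactly the critical-path
# cost the talk's hardware log manager removes.

class UndoLogStore:
    def __init__(self):
        self.data = {}      # stands in for persistent memory
        self.log = []       # undo log: (key, old_value) entries

    def begin(self):
        self.log = []

    def write(self, key, value):
        # Persist the old value to the log *before* updating in place,
        # so an interrupted transaction can always be rolled back.
        self.log.append((key, self.data.get(key)))
        self.data[key] = value

    def commit(self):
        self.log = []       # log is discarded once updates are durable

    def recover(self):
        # Roll back an interrupted transaction, newest entry first.
        for key, old in reversed(self.log):
            if old is None:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self.log = []

store = UndoLogStore()
store.begin(); store.write("balance", 100); store.commit()
store.begin(); store.write("balance", 40)   # "crash" before commit...
store.recover()                             # ...restores balance = 100
```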

8-Nov-17 Bjoern Franke Generalized Profile-Guided Iterator Recognition
15-Nov-17 n/a No lunch (PG Open day)
22-Nov-17 No speaker PPar social lunch.
29-Nov-17 Artemiy Margaritov
Stretch: Balancing QoS and Throughput for Colocated Server Workloads on SMT Cores

In a drive to maximize resource utilization, today’s datacenters are moving to colocation of latency-sensitive and batch workloads on the same server. State-of-the-art deployments colocate such diverse workloads even on a single SMT core. This aggressive form of colocation is afforded by the fact that a latency-sensitive service operating below its peak load has significant performance slack in its response latency with respect to the QoS target. In this talk, I will show that many batch applications can greatly benefit from a large instruction window to uncover ILP and MLP. I will then argue that the performance slack inherent in latency-sensitive workloads at low to moderate load makes it safe to shift microarchitectural resources to a co-running batch thread without compromising QoS targets. Finally, I will introduce Stretch, a simple ROB partitioning scheme invoked by system software to provide one hardware thread with a much larger ROB partition at the expense of another thread. When Stretch is enabled for latency-sensitive workloads operating below their peak load on an SMT core, co-running batch applications gain 13% performance on average (up to 30%) over baseline SMT colocation, without compromising QoS constraints.

06-Dec-17 PPar Lunch Deluxe  End of semester social.

Semester 2

Date Student Organizer             Speaker                Title / Abstract
17-Jan-18 Mattia Bradascio No speaker PPar social lunch.
24-Jan-18 Mattia Bradascio Philip Ginsbach
Formalising Computational Idioms for Compilers

Different communities in informatics have identified computational idioms as an important concept for understanding and exploiting parallelism in software. The Berkeley Parallel Dwarfs classify scientific workloads into categories, Algorithmic Skeletons allow reasoning about parallelism from a software engineering perspective, and higher-order functions in functional languages help reveal the compositionality of algorithms. This work, however, has so far had little impact on compilers.

With the Idiom Description Language (IDL), we are able to formalize computational idioms for use in mainstream compilers, enabling idiom-specific optimization and parallelisation of C/C++ programs. Using domain-specific knowledge, this allows us to exploit parallel and heterogeneous hardware for programs that are beyond the scope of established static analysis methods.

31-Jan-18 Mattia Bradascio Floyd Chitalu
Real-time CPU-GPU streaming of lightfield video

Lightfield (volumetric) video, as a high-dimensional function, is very demanding in terms of storage. As such, lightfield video data, even in compressed form, do not typically fit in GPU or main memory unless the capture area, resolution or duration is sufficiently small. Additionally, latency minimization, critical for viewer comfort in use cases such as virtual reality, places further constraints on many schemes. In this talk, I’ll present a method we developed at Disney Research for streaming lightfield video, parameterized on viewer location and time, that efficiently handles RAM-to-GPU memory transfers of lightfield video in compressed form. I’ll also briefly share my experience of doing an internship.

7-Feb-18 Pablo Andres-Martinez Justs Zarins
Progressive load balancing of asynchronous algorithms

Synchronisation in the presence of noise and hardware performance variability is a key challenge that prevents applications from scaling to large problems and machines. Using asynchronous or semi-synchronous algorithms can help overcome this issue, but at the cost of reduced stability or convergence rate. In this paper we propose progressive load balancing to manage progress imbalance in asynchronous algorithms dynamically. In our technique the balancing is done over time, not instantaneously.

Using Jacobi iterations as a test case, we show that, with CPU performance variability present, this approach leads to higher iteration rate and lower progress imbalance between parts of the solution space. We also show that under these conditions the balanced asynchronous method outperforms synchronous, semi-synchronous and totally asynchronous implementations in terms of time to solution.
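
As a concrete picture of the test case (the iteration scheme itself, not the balancing technique), a plain synchronous Jacobi solver for the 1D Laplace equation can be sketched as:

```python
import numpy as np

def jacobi_step(u):
    """One synchronous Jacobi sweep: each interior point becomes the
    average of its neighbours from the *previous* iterate; boundary
    values stay fixed."""
    new = u.copy()
    new[1:-1] = 0.5 * (u[:-2] + u[2:])
    return new

def jacobi(u0, tol=1e-6, max_iters=100_000):
    """Iterate until successive sweeps differ by less than tol.
    In the asynchronous variants discussed in the talk, workers do not
    wait for each other at this per-sweep boundary."""
    u = u0.astype(float)
    for i in range(max_iters):
        new = jacobi_step(u)
        if np.max(np.abs(new - u)) < tol:
            return new, i + 1
        u = new
    return u, max_iters

# Boundary values 0 and 1; the converged solution is a straight line.
u0 = np.zeros(11)
u0[-1] = 1.0
u, iters = jacobi(u0)
```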

14-Feb-18 Pablo Andres-Martinez Daniel Hillerström
Blimey JavaScript

When compiling to the Web there is little choice. JavaScript is the dominant programming language on the Web, and as a consequence it has been labelled as the “assembly language on the Web”*. Its surprisingly incoherent design admits a deep rabbit hole which one must dive into in order to effectively use JavaScript as a compilation target.

In this talk we will venture down this rabbit hole; and if we can find our way out again, then I will report on an on-going effort to compile effect handlers (a novel control abstraction) to JavaScript.

* Not to be confused with WebAssembly, an emerging alternative compilation target for the Web.

21-Feb-18 Pablo Andres-Martinez Simon Fowler
Unlocking Functional Web Programming

“Callback hell.” “The Pyramid of Doom.” Frontend web developers are often all too familiar with these terms, which describe the readability, reliability, and maintainability problems arising from the imperative, event-driven style of programming fostered by vanilla JavaScript.

The Elm programming language addresses these problems neatly: a functional model describes the state of the page, and a rendering function displays the model as HTML. Each component on the page produces messages, which update the model and therefore the rendered HTML.
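
The model/update/view loop described above can be sketched in Python (purely for illustration; Elm itself is a typed functional language, and the names below are hypothetical):

```python
from dataclasses import dataclass

# Model: the entire state of the page.
@dataclass
class Model:
    count: int = 0

# Messages: the events that components on the page can produce.
INCREMENT, DECREMENT = "Increment", "Decrement"

def update(msg, model):
    """Pure function: a message plus the old model yields a new model."""
    if msg == INCREMENT:
        return Model(model.count + 1)
    if msg == DECREMENT:
        return Model(model.count - 1)
    return model

def view(model):
    """Render the model as HTML; the runtime re-renders after each update."""
    return f"<div><button>-</button> {model.count} <button>+</button></div>"

# The runtime's event loop, driven here by a fixed message sequence.
model = Model()
for msg in [INCREMENT, INCREMENT, DECREMENT]:
    model = update(msg, model)
```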

In the first part of this talk, I will describe the Elm architecture, and our work porting the Elm architecture to the Links programming language developed at Edinburgh.

In the second part of the talk, I will give a quick introduction to hobbyist lockpicking.

28-Feb-18 Aleksandr Maramzin Paul Piho
Study of collective dynamics with formal modelling methods

The study of collective dynamics has a wealth of interesting applications in collective adaptive systems (CAS), where examples range from the swarming behaviour of insects to patterns of epidemic spread in humans. Such systems are highly distributed and robust in nature and for that reason have become a paradigm for the design of highly distributed computer-based systems. These systems exhibit non-linear behaviour, making it difficult to verify or predict their emergent behaviour. Moreover, it is hard to know a priori how to design the behaviours and capabilities of individual agents in order to achieve a goal at the system level. In this short talk, I consider building a framework based on formal modelling methods that allows us to gain insight into the workings of such systems.

7-Mar-18 Aleksandr Maramzin Amna Shahab
Scale-Out Caches

The slowdown in technology scaling mandates rethinking of conventional CPU architectures to achieve higher performance and new capabilities. Our work takes a step in this direction by questioning the value of on-chip shared last-level caches (LLC) for server processors and argues for a better alternative. Shared LLCs have a number of limitations, including on-chip area constraints that limit storage capacity, long planar interconnect spans that increase access latency, and contention for the shared cache capacity that hurts performance under workload colocation. To overcome these limitations, we propose Scale-Out Caches (SO$), an all-private cache hierarchy that combines conventional on-chip private L1 (and optionally, L2) caches with a private die-stacked DRAM-based LLC per core. In this talk, I will discuss how SO$ addresses the shortcomings of prevalent shared LLC architectures and provides a design better suited to scale-out server workloads.

14-Mar-18 Aleksandr Maramzin Rajkarn Singh
On Choosing Between Privacy Preservation Mechanisms for Mobile Trajectory Data Sharing

Various notions of privacy preservation have been proposed for mobile trajectory data sharing and publication. The privacy guarantees provided by these approaches are theoretically very different and cannot be directly compared against each other. They are motivated by different adversary models, making varying assumptions about the adversary’s background knowledge and intentions. A clear comparison between existing mechanisms is missing, making it difficult for a data aggregator/owner to pick a mechanism for a given application scenario. In this presentation, I will first briefly describe existing privacy mechanisms, then discuss a measure we developed, called STRAP, that allows comparison of different trajectory privacy mechanisms on a common scale.

21-Mar-18 Nicolai Oswald Larisa Stoltzfus
LIFTing the Abstraction Layer of 3D Wave Simulations

As the HPC hardware landscape continues to grow more complex and less portable, scientists should not require extensive expertise in parallelisation techniques in order to take advantage of more performant architectures. One popular approach to this problem is to raise the level of abstraction of simulation codes to hide away low-level details. However, this can result in niche solutions that cannot be reused across domains or platforms. In this talk I will discuss my modular approach to this problem for 3D wave models using the LIFT framework, which is built to be adaptable to any domain or backend.

28-Mar-18 Martin Kristien No Speaker. Meet to chat.  
4-Apr-18 Martin Kristien Vanya Yaneva
Accelerating Finite State Machine Testing Using GPUs

In software engineering, system development starts with a collection of requirements, arising from a mixture of customer needs, design decisions and constraints. Typically, when developing complex systems, these requirements are used to construct a formal model in order to specify, study and generate tests for the implemented system. This is known as model-based development.

Finite state machines (FSMs) are a widely used model for a variety of computer systems. Examples include control circuits, signal processing, communications protocols and pattern matching.

There is a huge body of research dedicated to generating test cases for the implemented system when a model is present, which assumes that an FSM model provides a full and correct specification of the system. However, models are often built by hand, using informal customer requirements and the domain knowledge of the system developer, so it is an important task to ensure that the model itself conforms to the specifications. This is the role of functional testing for finite state machines.

In this talk I will introduce functional testing for FSM-based models and will present how I approach the problem of accelerating it using GPUs.
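
As a minimal illustration of FSM conformance checking (not the speaker's GPU approach; all names below are hypothetical), one can exhaustively compare a model and an implementation on short input sequences. Each sequence is checked independently, which is what makes the problem a natural fit for one-thread-per-sequence GPU parallelism:

```python
from itertools import product

# A Mealy-style FSM: (state, input) -> (next_state, output).
# Toy example: a coin-operated turnstile.
spec = {
    ("locked", "coin"): ("unlocked", "unlock"),
    ("locked", "push"): ("locked", "alarm"),
    ("unlocked", "push"): ("locked", "lock"),
    ("unlocked", "coin"): ("unlocked", "thanks"),
}

def run(fsm, start, inputs):
    """Feed an input sequence to an FSM and collect its outputs."""
    state, outputs = start, []
    for symbol in inputs:
        state, out = fsm[(state, symbol)]
        outputs.append(out)
    return outputs

def conforms(impl, model, start, alphabet, depth):
    """Compare impl against model on all input sequences up to `depth`.
    Every sequence is independent, so each could run in its own thread."""
    for n in range(1, depth + 1):
        for seq in product(alphabet, repeat=n):
            if run(impl, start, seq) != run(model, start, seq):
                return False
    return True
```

A faulty implementation that differs on even one transition is caught as soon as some sequence exercises it.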

25-Apr-18 Bruce Collie Rodrigo Caetano de Oliveira Rocha
Loop-Balance Analysis

Loops are the single largest source of parallelism in many programs, and loop imbalance is one of the main threats to the performance of a parallel loop. In this talk, I will present loop-balance analysis, a novel static analysis for estimating whether or not a parallel loop may exhibit imbalance at runtime. Although loop imbalance depends on many run-time variables, our proposed static analysis is able to estimate accurately, entirely at compile time, whether a loop is likely to be unbalanced. To demonstrate how this analysis can be leveraged by optimizations, we use its results to automatically select a scheduling strategy for parallel loops. Our optimization reaches about 98% of the oracle’s performance. This corresponds to an average speedup of 1.12 over always using straightforward static scheduling and an average speedup of 2.65 over always using an adaptive scheduling based on work stealing.

2-May-18 Bruce Collie Dan Mills
5 years away: A history of quantum computing and the prospect of a near term superiority demonstration

Quantum mechanics was one of the most shocking discoveries of the 20th century, and it shook up a discipline which thought it had it all figured out. Towards the end of the 20th century it was realised that quantum mechanics, if tamed, could give us access to applications thought to be beyond our reach until then. In this talk I'll give a summary of how quantum computation got to where it is today, what the state of the art in the field is now, and what the chances are of quantum computers overtaking classical devices.

9-May-18 Jack Turner Lewis Crawford
Compiler Optimization for Graphics Shaders

In real-time graphics applications such as games and VR, performance is crucial to providing a smooth user experience, especially the performance of the shader programs which render images on the GPU. In this talk, I will introduce the typical software stack found in real-time graphical applications like computer games, and explain the role compilers play at various levels of this stack. I will also show common features of graphics shaders, and examine the performance impact and applicability of common compiler optimizations across a wide set of shaders running on various mobile and desktop platforms.

16-May-18 Maxi Behnke Vasilis Gavrielatos
Scale-Out ccNUMA: Exploiting Skew with Strongly Consistent Caching

Today’s cloud-based online services are underpinned by distributed key-value stores (KVS). Such KVS typically use a scale-out architecture, whereby the dataset is partitioned across a pool of servers, each holding a chunk of the dataset in memory and being responsible for serving queries against that chunk. One important performance bottleneck that a KVS design must address is the load imbalance caused by skewed popularity distributions. Despite recent work on skew mitigation, existing approaches offer only limited benefit for high-throughput in-memory KVS deployments. In this paper, we embrace popularity skew as a performance opportunity. Our insight is that aggressively caching popular items at all nodes of the KVS enables both load balance and high throughput, a combination that has eluded previous approaches.

We introduce symmetric caching, wherein every server node is provisioned with a small cache that maintains the most popular objects in the dataset. To ensure consistency across the caches, we use high-throughput fully-distributed consistency protocols. A key result of this work is that strong consistency guarantees (per-key linearizability) need not compromise on performance. In a 9-node RDMA-based rack and with modest write ratios, our prototype design, dubbed ccKVS, achieves 2.2× the throughput of the state-of-the-art KVS while guaranteeing strong consistency.
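
A toy sketch of the symmetric-caching idea, with hypothetical names and with the consistency machinery omitted (which is where the real contribution of ccKVS lies):

```python
# Every node serves its own partition of the keyspace and additionally
# caches the globally hottest keys, so skewed read load spreads evenly.
# In ccKVS, a distributed protocol keeps the caches strongly consistent;
# here, hot writes are simply applied at every node by the caller.

class Node:
    def __init__(self, node_id, num_nodes, hot_keys):
        self.node_id, self.num_nodes = node_id, num_nodes
        self.store = {}                 # this node's partition
        self.cache = {}                 # local replica of popular items
        self.hot_keys = set(hot_keys)

    def owns(self, key):
        return hash(key) % self.num_nodes == self.node_id

    def read(self, key):
        if key in self.cache:           # hot item: served locally
            return self.cache[key]
        if self.owns(key):
            return self.store.get(key)
        return None                     # cold remote key: forward to owner

    def write(self, key, value):
        if self.owns(key):
            self.store[key] = value
        if key in self.hot_keys:        # hot writes must reach every
            self.cache[key] = value     # replica (the protocol's job)

nodes = [Node(i, num_nodes=3, hot_keys=["celebrity"]) for i in range(3)]
for n in nodes:
    n.write("celebrity", "profile-v1")  # hot write applied at all replicas
```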

23-May-18 Jack Turner Rudi Horn
Incremental Relational Lenses

Lenses are a popular approach to bidirectional transformations, a generalization of the view update problem in databases, in which we wish to make changes to source tables to effect a desired change on a view. However, perhaps surprisingly, lenses have seldom actually been used to implement updatable views in databases. Bohannon, Pierce and Vaughan proposed an approach to updatable views called relational lenses over 10 years ago, which offered a number of advantages over existing approaches to view update, but to the best of our knowledge this proposal has not been implemented or evaluated to date. We propose incremental relational lenses, which improve on prior work by equipping relational lenses with change-propagating semantics that map small changes to the view to small changes to the source tables. We also present a language-integrated implementation of relational lenses and a detailed experimental evaluation, showing orders of magnitude improvement over the non-incremental approach. Our work shows that relational lenses can be used to support expressive and efficient view updates at the language level, without relying on updatable view support from the underlying database. Following Simon’s example I will be dividing the talk into two halves, with the first introducing incremental relational lenses with a focus on the incrementalization process, while devoting the second half of my talk to something completely different: making beer brewing equipment.

30-May-18 Nicolai Oswald Viktor Ivanov
Hardware-based Machine Learning for Real-time Mitigation of Interference in 5G Systems

Interference in the scientific / unlicensed spectrum is a huge problem for wireless communications today and in the foreseeable future. As demand for capacity and data rate grows, these systems become increasingly sensitive to sources of interference. Such is the case with the next-generation 5G cloud radio access network (C-RAN) for mobile, which is expected to support up to 16x multiple input/multiple output (MIMO) streams and cumulative data rates of up to 1.467 Gbps per stream. However, the majority of interference comes from human-made devices that utilise specific radio protocols (e.g. Bluetooth, ZigBee, WiFi 802.11) or from certain environmental changes (e.g. reflection and passive inter-modulation). They thus exhibit behaviour which leaves a unique "fingerprint" in the radio spectrum. Our hypothesis is that specially optimised hardware for deep learning could be used at future 5G base stations to identify such sources of interference, predict their interference patterns, and mitigate their effects, thereby improving the overall quality of service in real time.

06-Jun-18 Volunteers PPar Lunch Deluxe End of semester lunch series wrap-up/social.