PPar Lunch Series 2015/16

PPar lunch series schedule for 2015/16

Semester 1




23 September 2015 Don Sannella
ThreadSafe: Static Analysis for Java Concurrency

ThreadSafe is a commercial static analysis tool that focuses on detection of Java concurrency defects. By restricting attention to concurrency bugs, ThreadSafe can find bugs that other static analysis tools, both commercial and freely available, miss or are not designed to look for. Its findings are generated from information produced by a class-by-class flow-sensitive, path-sensitive, context-sensitive points-to and lock analysis, with heuristics to tune the analysis to avoid false positives. It has been successfully applied to codebases of more than a million lines.

30 September 2015 Artemy Margaritov
Sharing Branch Direction Predictors for Data Center Processors

In data centre workloads, large instruction working sets and complex control flow require the tracking of many branches and their output histories for high prediction accuracy. Ideally, the branch direction predictor should be sufficiently large to capture all branches along the control flow. However, a large branch direction predictor consumes considerable on-chip resources. This prohibits its use in emerging lean-core data centre processors, where it would significantly reduce efficiency, since the area footprint of a common branch direction predictor may reach that of the core.

My project attacks the problem of designing an efficient branch direction predictor for lean-core data centre processors. A recent characterisation study on data centre workloads has shown that these workloads are homogeneous, which means that each core executes the same types of requests as all other cores. Consequently, the various cores of a processor executing a common homogeneous data centre workload tend to generate similar control flow histories. Based on this observation, commonality and recurrence in control flow paths across cores can be exploited to generate one common control flow history, which can then be shared by all the cores running the same workload. Sharing the branch direction predictor and its associated storage among multiple cores should be an effective approach for mitigating the severe area overhead of existing branch direction predictor designs, while preserving their performance benefits.

7 October 2015 Adrian Jackson
HPC, I/O, NVRAM, and Big Data

The talk will mainly be me ranting about why I don’t think big data is any different from HPC. Possibly of more interest to the audience, I will also introduce the NextGENIO project and discuss its plans to use NVRAM for a new level of data storage between main memory and disk, which could enable a different way of undertaking computational simulation and data analysis.

28 October 2015 Justs Zariņš
The method in madness

I will talk about asynchronous algorithms – what they are and why they are interesting for exascale computing.

25 November 2015 Mike O’Boyle
Achieving a successful research career: A highly biased perspective

This talk will try to give advice on how to navigate the various stages of an academic career, and what to avoid.

2 December 2015 Murray Cole
Finding Patterns in Parallel Legacy Code

Pattern-oriented parallel programming frameworks offer a promising approach to portably abstracting away from the variations in underlying parallel models, while maintaining enough information to assist optimization. However, turning ad-hoc parallel code into patternized form is a non-trivial “legacy” challenge. I will discuss some ideas on how we might tackle this.

9 December 2015 Paul Jackson
Formal verification of cyber-physical systems

Cyber-physical systems are systems in which computers interact with the physical world in real time via sensors and actuators. Their use is widespread in modern society and ever-more frequently our safety depends on their correctness. Examples include control systems for cars, trains and planes, medical devices such as insulin pumps, heart pacemakers and life support systems, automation for chemical plants, and robots.

Formal verification of cyber-physical systems involves the computer-aided mathematical analysis of abstract models of such systems. It can offer higher degrees of assurance of safety and correctness when compared to simulation-based or testing-based approaches. This is because the mathematical techniques used are logically rigorous and exhaustively take into consideration continuous ranges of input conditions and system parameters.

In this talk I’ll introduce at a very high level some of the techniques currently used in the formal verification of cyber-physical systems and will touch on the current research directions in this area.

16 December 2015 PPar Lunch Deluxe (no speaker)
PPar Lunch Deluxe: Networking Event

Semester 2




13 January 2016 Simon Fowler 
An Introduction to Concurrent λ-Calculi 

As programming languages adopt concurrency as a more central point of their design, it is useful to be precise about their concurrent behaviour. In this talk, I’ll discuss the technique of using configurations of expressions in the λ-calculus to model the concurrent fragment of functional languages. I’ll demonstrate this technique by showing two minimal functional languages I’ve been working on: one to model channel-based communication (as in Concurrent ML and Hopac), and one to model actor-based communication (as in Erlang and Elixir).

20 January 2016 Sethu Vijayakumar
Shared autonomy for interactive robotics: Closing the loop

The next generation of robots is going to work much more closely with humans and other robots, and to interact significantly with the environment around them. As a result, the key paradigms are shifting from isolated decision-making systems to ones that involve sharing control, with significant autonomy devolved to the RAS systems and end-users in the loop making only high-level decisions. This talk will look at technologies ranging from robust multi-modal sensing, shared representations and compliant actuation to real-time learning and adaptation, which are enabling us to reap the benefits of increased autonomy while still feeling securely in control. I will attempt to touch upon scenarios where new ways of distributing computation may have added value, from a robustness, security or scalability perspective. Domains where this debate is relevant NOW include self-driving cars, mining, shared manufacturing, exoskeletons for rehabilitation, active prosthetics, large-scale scheduling (e.g. transport) systems and oil and gas exploration, to list a few.

27 January 2016 Adam Harries
Compositional Compilation for Sparse, Irregular Data Parallelism

Contemporary GPU architectures are heavily biased towards the execution of predictably regular data parallelism, while many real world application domains are based around data structures which are naturally sparse and irregular. This makes efficiently parallelising and executing sparse applications on GPUs difficult, as programmers must resort to low level, domain specific optimisations which are often non-portable, and difficult to express in current higher level languages.

In this talk, I will demonstrate that high level programming and high performance GPU execution for sparse, irregular problems are not necessarily mutually exclusive. Our insight is that this can be achieved by capturing sparsity- and irregularity-friendly implementations within the target space of a regular pattern-oriented, high-level compilation and transformation system. Specifically, I will discuss how we worked to embed a sparse data structure and algorithm within a regular data parallel programming language, show that there are correlations between good implementation choices and simple measurable properties of the irregularity present in problem instances, and finally give some promising preliminary performance results.

3 February 2016 Perdita Stevens
Parallelism and Model-driven Development 

Model-driven development is a term for software development in which models are important artefacts. What has this to do with parallelism? Are they compatible, complementary, in conflict? In this at-most-quarter-baked talk I will part tell you, part ask you.

10 February 2016 Chris Cummins
Autotuning and Algorithmic Skeletons

Multi-cores and heterogeneous systems are now the norm, yet we lack suitable high level programming models to cope with them. I will talk briefly about how algorithmic skeletons will solve this problem, and the progress we need to make for their performance to become a viable alternative to hand written low level code.

17 February 2016 Leonid Libkin
Can you trust answers to database queries?

I’ll review a well known phenomenon that in the presence of incomplete information, answers to SQL queries become unpredictable. A simple argument based on parallel complexity tells us that finding correct answers is incompatible with efficiency, but SQL’s design goes further than that and produces just about every possible type of error. This is due to completely ad hoc choices made by its designers 30+ years ago. I’ll explain how to evaluate database queries efficiently in a way that restores at least some trust in the answers.
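
As a small illustration of the phenomenon (a sketch that is not taken from the talk; the `customers` and `orders` tables and the helper name are hypothetical), the following Python snippet uses the standard `sqlite3` module to show how a single NULL changes the answer to a `NOT IN` query:

```python
import sqlite3

def customers_without_orders(order_customer_ids):
    """Return the customer ids that SQL reports as having placed no order.

    The schema (customers, orders) is a hypothetical example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER)")
    conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany(
        "INSERT INTO orders VALUES (?)", [(c,) for c in order_customer_ids]
    )
    rows = conn.execute(
        "SELECT id FROM customers "
        "WHERE id NOT IN (SELECT customer_id FROM orders) ORDER BY id"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

# Complete information: customers 2 and 3 have no orders.
complete = customers_without_orders([1])

# One order with an unknown (NULL) customer: the comparison with NULL is
# "unknown" under three-valued logic, so no customer passes the filter.
incomplete = customers_without_orders([1, None])

print(complete, incomplete)
```

With the NULL present the query returns no rows at all, even though at least one of customers 2 and 3 certainly has no order.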

24 February 2016 Galini Tsoukaneri
On the Inference of User Paths from Anonymized Mobility Data

Using the plethora of apps on smartphones and tablets entails giving them access to different types of privacy sensitive information, including the device’s location. This can potentially compromise user privacy when app providers share user data with third parties (e.g., advertisers) for monetization purposes. In this work, we focus on the interface for data sharing between app providers and third parties, and devise an attack that can break the strongest form of the commonly used anonymization method for protecting the privacy of users.

2 March 2016 Ian Stark
Continuous pi-Calculus: A Process Language for Computational Modelling of Biochemical Systems  

The Continuous pi-Calculus (c-pi) is a language for describing concurrent interacting processes.  It’s based on similar languages from computer science, but has as its target the behaviour of biochemical systems.  These are the chemical reactions and networks constantly acting and adapting within every living cell: all happening in fluid solution, all happening at the same time — pervasive and as parallel as it gets.

I’ll talk about c-pi itself, how we compile high-level process terms into differential equations, and the Logic of Behaviour in Context for model-checking biochemistry.

This is joint work with Marek Kwiatkowski and Chris Banks.

9 March 2016 Martin Rüfenacht
AllReducing with the best of them!

I will be talking about large scale AllReduce operations (global all-to-all summations), what they are, how they have been done and about my latest work contributing to ever larger machines, hopefully with nice data!
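
For readers unfamiliar with the operation, here is a minimal single-process sketch of one classic scheme, recursive doubling, in Python. It is an illustration only and not the speaker's algorithm; the function name is hypothetical, and the sketch assumes the number of ranks is a power of two.

```python
def allreduce_sum(values):
    """Simulate a recursive-doubling AllReduce (sum) over len(values) ranks.

    values[r] is the local contribution of rank r. Assumes the number of
    ranks is a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of ranks must be a power of two")
    current = list(values)
    step = 1
    while step < n:
        # In each round, rank r exchanges its partial sum with rank r XOR step
        # and both add what they received to what they already hold.
        current = [current[r] + current[r ^ step] for r in range(n)]
        step *= 2
    return current

result = allreduce_sum([1, 2, 3, 4])
print(result)
```

After log2(n) rounds every rank holds the global sum, which is what distinguishes an AllReduce from a reduction to a single root.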

16 March 2016 Barry O’Rouke (CriticalBlue)
Realizing Performance on Mass Market Multicore Platforms

Many of the computer based products shipping today contain multiple processors. Often this is driven by the silicon providers who only offer multi core devices but it is also common to see heterogeneous systems with specialist processing engines next to application processors. These are complex systems running even more complex software and design teams often find themselves struggling to realize the performance promised by the data sheets.

In this talk we will take an in depth look at a real life case study which will provide an insight into the technical challenges facing the embedded software industry as it tries to take advantage of ever increasing parallelism.

The talk will be given by Barry O’Rouke, Technical Lead & Project Management at CriticalBlue.

23 March 2016 Davide Pinato
Improving TCP’s Performance for Intra-Data Center Flows

TCP’s burstiness and its dependence on RTT to increase its throughput have always presented challenges in high RTT/high bandwidth networks. With this project, we aim to design a TCP congestion-avoidance scheme to improve the utilisation of expensive private WAN links. I will mainly cover problems with current approaches and present our strategies for the design of such an algorithm.

30 March 2016 Adam Lopez
Parallelism and natural language processing

Most of human knowledge is expressed in language, so natural language processing is very important. Modern approaches to NLP are based on machine learning, and since the domain consists of structured objects like strings and trees, most problems boil down to continuous or combinatorial optimization problems (or both). In short, this means NLP is computation-bound. Parallel systems are obviously an intriguing prospect for any computation-bound problem, but they also break the assumptions of classical algorithms in NLP, and so far attempts to exploit parallelism for NLP have been limited. I’ll teach you some NLP basics, and I hope you’ll teach me some new tricks to make NLP faster.

4 May 2016 Arpit Joshi
Architectural Optimizations for Non-volatile Memories

The emerging non-volatile memory (NVM) technologies (like 3D XPoint) provide fast fine grained access to persistent data. However, managing persistent data in NVM requires a careful redesign of applications to ensure consistency in the presence of failures. This in turn opens up the scope for architectural optimizations to improve the performance of these applications. In this talk we will discuss architectural optimizations to improve performance in systems with NVM while ensuring consistency of data in the presence of failures.

11 May 2016 Ruymán Reyes, Codeplay
SYCL: Programming accelerators the C++ way

Abstract: Current technology trends show a major shift in computer system architectures towards heterogeneous systems that combine multiple different processors (CPUs, GPUs, FPGA…) that all work together, performing many different kinds of tasks in parallel. Developers want to take advantage of parallel programming features in modern languages in a simpler and more accessible performance portable way.

One solution to this is provided by SYCL. SYCL is a Khronos standard that offers a layer on top of OpenCL that enables programming heterogeneous platforms using a “single-source” style with C++. SYCL has no extensions over standard C++, and supports any host compiler coupled with a SYCL-enabled device compiler to generate the target binary code. In this talk we introduce the basics of SYCL showing various code samples, and present how SYCL enables the implementation of the C++17 Parallel STL for GPUs.

Bio: Ruymán Reyes (Staff Software Developer at Codeplay Software) is the Team Lead of the ComputeCpp team, Codeplay’s SYCL implementation, and the coordinator of the SYCL Parallel STL open-source project. He got his Ph.D. in Programming Models at the University of La Laguna in 2012, creating the first open-source OpenACC implementation (accULL) in the process. In the meantime, he participated in the TEXT project, porting ScaLAPACK to SMPSs/OmpSs in collaboration with BSC (Barcelona Supercomputing Centre). He then moved to sunny Edinburgh to work at EPCC on various research projects, until he joined the research side of Codeplay in 2013 to work on the SYCL specification and prototype implementation. He is interested in programming new fancy architectures, but also enjoys cycling and playing strategy games on his laptop.

18 May 2016 Nikolay Bogoychev
Massively parallel Language models

Abstract: For many applications, the query speed of N-gram language models is a computational bottleneck. Although massively parallel hardware like GPUs offer a potential solution to this bottleneck, exploiting this hardware requires a careful rethinking of basic algorithms and data structures. We present the first language model designed for such hardware, using B-trees to maximize data parallelism and minimize memory latency. When we compare with a single-threaded instance of KenLM (Heafield, 2011), a highly optimized CPU-based language model, our GPU implementation produces identical results with a smaller memory footprint and a sixfold increase in throughput on a batch query task. When we compare a fully saturated CPU and a GPU, our results show that the GPU delivers twice the throughput per hardware dollar.

1 June 2016 Stan Manilov
Beyond dependence analysis for parallelisation: Commutativity!

Throughout the long history of automatic parallelisation, researchers have been focusing mainly on loops and have without exception been hinging on dependence analysis as their theoretical foundation. We claim that, similarly to the way static analysis is undecidable and is provably insufficient to tackle every program thrown at it, dependence analysis and may-dependencies in particular are overly conservative and produce disappointing results when it comes to detecting parallelism. We propose to use commutativity as the building block of parallelism detection, by defining the theory for it and presenting our experimental results.
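
A small Python sketch may help to show the intuition (it is an illustration under our own assumptions, not the theory or the tool presented in the talk; the `histogram` helper is hypothetical). Every iteration of the loop below reads and writes the same shared counter table, so a dependence-based analysis sees a loop-carried dependence, yet the updates commute and any iteration order gives the same result:

```python
import random
from collections import Counter

def histogram(words, order):
    """Count words, visiting the loop iterations in the given order."""
    counts = Counter()
    for i in order:
        # Read-modify-write on shared state: a loop-carried dependence
        # from the point of view of dependence analysis.
        counts[words[i]] += 1
    return counts

words = ["a", "b", "a", "c", "b", "a"]

# Original sequential order.
sequential = histogram(words, range(len(words)))

# An arbitrary permutation of the iterations (fixed seed for repeatability).
shuffled_order = list(range(len(words)))
random.Random(0).shuffle(shuffled_order)
reordered = histogram(words, shuffled_order)

print(sequential == reordered)
```

Because the increments commute, the iterations could in principle be distributed across threads (with atomic or privatised updates) without changing the outcome.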

8 June 2016 Valentin Radu
Squeezing deep learning to resource constrained devices

Deep learning is becoming a popular solution for many common tasks (computer vision, speech recognition, translation, etc.) due to its outstanding performance. The vast amount of data required to train deep neural network architectures has made this technology more common on servers and computer clusters, where computation resources are abundant. But a vast amount of personal data is generated on our wearable devices (e.g. sensor data on smartphones), and this is best utilised locally, because of the privacy concerns raised by data leaving the device. However, employing deep neural networks on these devices requires particular resource-fitting considerations to make them run more efficiently. In this talk I am going to present my experience using deep neural networks for computer vision tasks on wearable devices.

15 June 2016 Michele Weiland
Energy Efficiency in HPC

The focus of the HPC community is firmly on reaching its next big goal, the Exascale. There are many technical hurdles that need to be overcome to reach this goal in the next five to ten years, but there is an underlying theme of efficiency. HPC applications often only use a few percent of the peak performance a system can offer, wasting vast amounts of resources that could be exploited to achieve increased science throughput. The efficiency of Exascale systems in terms of power is also of great concern – we will need to achieve 50 GFlop/s per Watt if we want to stay within a 20MW power envelope. In this talk, I will present the work of the Adept project, which investigates and models power/energy expenditure in parallel systems.
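
The efficiency target quoted above follows directly from the Exascale goal. As a quick check of the arithmetic, taking Exascale to mean 10^18 floating-point operations per second:

```latex
\[
50\ \text{GFlop/s per W} \times 20\ \text{MW}
  = (50 \times 10^{9}) \times (20 \times 10^{6})\ \text{Flop/s}
  = 10^{18}\ \text{Flop/s}
  = 1\ \text{EFlop/s}.
\]
```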

22 June 2016 Kenneth Heafield
Systems and Scaling Problems in Machine Translation

Most of the progress in machine translation is due to faster systems. Thanks to GPUs, newer models are based on a weird functional programming language called neural networks. Custom neural network hardware is a hot topic, but even for widespread GPUs, we’re missing compiler features like fusing with matrix multiplies. Training takes about two weeks and we are using only about a thousandth of the available data. This talk is about the problems we have, what people have done about them, and a call for jointly optimizing across the model through hardware levels rather than separately.

29 June 2016 Kenny Mitchell, Bochang Moon, Babis Koniaris, Disney Research
Accelerating Film and Game Technology Convergence

Our talk will present our latest results on computer graphics rendering for games and film convergence. The IRIDiuM (Interactive Rendered Immersive Deep Media) virtual reality installation presents a new form of rendered 360 stereo movies allowing for motion parallax and interactivity. Our sparse sensor body tracking provides user immersion and embodiment with freedom of movement within an interactive 3D panoramic movie. Further, in this talk, we present our recent filtering methods that reduce noise in rendered images while preserving high-frequency edge details. Our methods approximate rendered images with polynomial functions locally, and optimize filtering parameters used in the functions so that filtering errors are minimized. Our polynomial function allows for reconstructing an image block instead of each pixel, and thus we can reduce our optimization based filtering time by parallelizing our reconstruction at only a sparse number of image pixels.