PPar Seminar Series 2018/19

PPar Seminar Series schedule for 2018/19

IF = Informatics Forum, Central campus

AT = Appleton Tower, Central campus

BC = Bayes Centre, Central campus

Each entry below gives the speaker, title, abstract, date, time, venue, and organiser.
Peter Hsu
Ideas for a European Chiplet Microprocessor

Barcelona Supercomputing Center is working on a RISC-V vector accelerator as part of the European Processor Initiative project. This talk is about my individual research towards a potential future European microprocessor that incorporates vector acceleration for HPC and AI applications. I will discuss the advantages of chiplet architecture, and present, and solicit feedback on, a proposed memory programming model that can accommodate aggressive computation acceleration.

Tue, 12 March 2019 2pm IF 4.31/4.33 ICSA
Boris Grot
Better Hardware in the Post-Moore Era? Look to the Software!

As Moore’s Law grinds to a halt, the days of easy performance improvements enabled by rapid growth in transistor budgets are over. While custom accelerators can improve performance and energy efficiency in narrow domains, there is no silver bullet for broad application domains such as mobile or servers. The server domain, in particular, poses a significant challenge because explosive growth in data volumes and an increasing diversity of applications powered by this data (often in an online setting) demand ever-higher performance from stagnating hardware.

In this talk, I will argue that the path to future improvements in hardware efficiency is paved with insights about software behaviour. I will further argue that with the end of the road for transistor scaling looming ahead, software can and should be used to ease the hardware’s burden – an idea that harks back to the early (transistor-constrained!) days of computer architecture. To demonstrate the potential of software-centric hardware design for servers, I will overview several recent projects from my research group.

Thur, 28 February 2019 2pm IF 4.31/4.33 ICSA
Henrik Barthels
Linnea: Automatic Generation of Efficient Linear Algebra Programs

The evaluation of linear algebra expressions is a central part of both languages for scientific computing, such as Julia and Matlab, and libraries such as Eigen, Blaze, and NumPy. However, the existing strategies are still rather primitive. At present, the only way to achieve high performance is by hand-coding algorithms using libraries such as BLAS and LAPACK, a task that requires extensive knowledge of linear algebra, numerical linear algebra, and high-performance computing. We present Linnea, a synthesis tool that automates the translation of the mathematical description of a linear algebra problem to an efficient sequence of calls to BLAS and LAPACK kernels. The main idea of Linnea is to construct a search graph that represents a large number of programs, taking into account knowledge about linear algebra. The algebraic nature of the domain is used to reduce the size of the search graph, without reducing the size of the search space that is explored. Experiments show that 1) the code generated by Linnea outperforms standard linear algebra languages and libraries, and 2) in contrast to the development time of human experts, the generation takes only a few minutes.
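To make the gap concrete, here is a minimal hand-coded sketch (ours, not from the talk) of the kind of kernel-call choice Linnea automates: for y = A*B*x, exploiting associativity replaces one O(n^3) matrix-matrix product with two O(n^2) matrix-vector products. The CBLAS calls are standard; the function name and row-major layout are illustrative assumptions.

/* Evaluating y = A*B*x with BLAS: two dgemv calls instead of a dgemm.
 * Requires a CBLAS implementation (e.g. OpenBLAS); link with -lcblas. */
#include <cblas.h>

void abx(int n, const double *A, const double *B,
         const double *x, double *t, double *y)
{
    /* t = B * x  (O(n^2) matrix-vector product) */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, B, n, x, 1, 0.0, t, 1);
    /* y = A * t  (again O(n^2); naive (A*B)*x would cost O(n^3)) */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, A, n, t, 1, 0.0, y, 1);
}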

Thur, 14 February 2019 3pm IF 4.31/4.33 CArD
Subhankar Pal
Closing the Programmability-Efficiency Gap with a Software-Defined Reconfigurable Accelerator

With the end of Dennard scaling and Moore’s law, both industry and research efforts to develop accelerator-centric solutions are growing rapidly. While existing General-Purpose Processors (GPPs) are designed for high programmability, ASIC accelerators maximize performance and energy efficiency, albeit for a single algorithm or at best a narrow set of kernels. Due to high non-recurring expenses, building ASIC accelerators is typically infeasible except for a small number of application-critical workloads. Furthermore, fixed-function ASICs for fast-moving application domains, such as machine learning, carry high risk for near-term obsolescence.

GPPs, ASICs, and reconfigurable accelerators have historically been bound by three conflicting constraints: (1) software programmability, (2) algorithm-specificity, and (3) overhead from low-level configurability. The proposed work aims to expand the range of performance and programmability achievable with reconfigurable hardware by unifying crucial properties of both GPPs and ASIC accelerators, targeting web search, graph processing, and linear algebra applications. On the hardware side, we design a many-core architecture with GPP-level programmability, but with new reconfigurable hardware elements to efficiently support ASIC-like dataflows. In addition to support for high-level languages, the hardware and compiler framework are heavily co-designed to exploit data-dependent optimizations via fast dynamic reconfiguration.
Tue, 12 February 2019 2pm IF 4.31/4.33 CArD
Christophe Dubach
Tackling the Performance Portability Challenge with High-level abstractions and Rewrite Rules

Abstract: Parallel accelerators exhibit tremendous computational power but are notoriously hard to program. High-level and domain-specific languages have been proposed to address this issue. However, compilers and software implementations have to be re-written and re-tuned continuously as new hardware emerges. This is known as the performance portability challenge.

I will present recent techniques developed in my group to achieve performance portability. I will present Lift, a novel high-level data-parallel programming model. Lift is based on a surprisingly small set of functional primitives which are combined to define higher-level hardware-agnostic algorithmic patterns. A system of provably correct rewrite rules encodes both algorithmic choices and low-level optimisations, letting the compiler explore the optimisation space automatically. Lift's goal is to achieve performance on par with highly tuned code across a wide range of parallel architectures (e.g. GPUs, FPGAs, neural network accelerators) and application domains (e.g. linear algebra, stencils, machine learning).
Thur, 31 January 2019 4pm IF 4.31/4.33 ICSA
Pramod Bhatotia
System design principles for intelligent applications

Abstract: In this digital age, we increasingly rely on modern online services and cyber-physical systems that are based on "data-driven intelligence". These intelligent applications require a high degree of reliability, real-time performance, scalability, and security. The state of the art for designing, developing, and deploying such applications follows ad hoc practices, where the application programmers explicitly manage computational resources and application state on a per-application basis. However, such ad hoc practices easily become unmanageable because the underlying computing infrastructure, composed of cloud and edge/IoT computing resources, is highly heterogeneous and comes with varying degrees of performance, cost, reliability, and security guarantees. Our work aims to build an end-to-end generic system that supports the design, development, and deployment of a wide range of data-driven intelligent applications, where the application programmers, such as machine learning experts or data scientists, can focus on their core business logic/algorithms, and our system transparently provides all the aforementioned desired functional properties.

More specifically, I will present four system design principles targeting hardware/software co-design for intelligent applications: (1) Scalability: how to seamlessly support ever-growing application workloads with an increasing number of cores while embracing the heterogeneity of the underlying computing platform; (2) Reliability: how to leverage new ISA extensions to build reliable software systems; (3) Security: how to build secure systems on top of an untrusted computing infrastructure using a combination of trusted execution environments (TEEs) and a small trusted computing base (TCB); and (4) Performance: how to achieve real-time performance using incremental and approximate computing paradigms.

As I will show in the talk, we follow these design principles at all levels of the software stack covering operating system, storage/file-system, compiler and run-time libraries, and all the way to building distributed middleware. More importantly, our approach transparently supports existing applications -- we neither require a radical departure from the current models of programming nor complex, error-prone application-specific modifications.
Tue, 18 December 2018 4pm IF 4.31/4.33 ICSA
Dan Holmes
Does persistence pay off?
MPI has defined persistent operations for point-to-point communication since forever (well, since MPI-1 in 1994). They are not commonly used in real HPC applications, and some HPC developers do not even know of their existence. At the Barcelona meeting in September this year, the MPI Forum voted in some new persistent operations - this time for collective communication. These new operations are included in the Draft MPI-4.0 Standard (released at SC18, last week), they are available to use now (as an extension to Open MPI), and they will be one of the major feature additions in the official MPI-4.0 Standard (whenever that is actually finalised). This talk will attempt to motivate why “persistence” is (now) a good thing in MPI (despite being a bad thing for 25 years) and why HPC programmers should start to move towards using the new persistent collective operations (whilst continuing to avoid the persistent point-to-point operations). If there is time and enthusiasm, the talk may include a look at ideas for future changes to MPI that would fix the known problems with persistent point-to-point operations.
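For readers who have not met them, here is a minimal sketch (ours, not the speaker's) of the MPI-1 persistent point-to-point pattern; the persistent collectives in the MPI-4.0 draft follow the same init/start/wait shape (e.g. MPI_Allreduce_init). The setup cost is paid once, and each iteration merely restarts the prepared communication.

/* Persistent point-to-point: create the request once, restart it per
 * iteration. All calls below are standard MPI-1. */
#include <mpi.h>

void repeated_send(double *buf, int count, int peer, int iters)
{
    MPI_Request req;
    MPI_Send_init(buf, count, MPI_DOUBLE, peer, /*tag=*/0,
                  MPI_COMM_WORLD, &req);     /* set up once           */
    for (int i = 0; i < iters; i++) {
        /* ... refill buf ... */
        MPI_Start(&req);                     /* restart the same send */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Request_free(&req);                  /* release the request   */
}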
Wed, 28 November 2018 BC G.03 EPCC

Ted Dunning (MapR)
Real-world Machine Learning and AI

Machine learning and AI don't work in the real world the way they appear to in the classroom. There are lots of reasons why, but two of the largest are logistics and the value of cheap learning (as opposed to deep learning). I will describe what you need to know about both topics, illustrating with stories of real systems that have made real billions.

Wed, 21 November 2018 12:30 AT 2.14 EPCC
Aaron Smith
Is it Time for RISC and CISC to Die?

Specialization, accelerators, and machine learning are all the rage. But most of the world's computing today still uses conventional RISC and CISC CPUs, which expend significant energy to achieve high single-thread performance. Von Neumann ISAs have been so successful because they provide a clean conceptual target for software while running a wide range of algorithms reasonably well. We badly need clean new abstractions that utilize fine-grain parallelism and run energy efficiently.

Prior work (such as the UT-Austin TRIPS EDGE ISA and others) showed how to form blocks of computation containing limited-scope dataflow graphs, which can be thought of as small structures (DAGs) mapped to silicon. In this talk I will describe work that addresses the limitations of early EDGE ISAs and provide results for two specific microarchitectures developed in collaboration with Qualcomm Research. These early results are based on placed and routed RTL in 10nm FinFET.

Bio: Aaron is a part-time Reader in the School of Informatics at the University of Edinburgh and a Principal Researcher at Microsoft. In Edinburgh he co-teaches UG3 Compiling Techniques and is working on binary translation and machine learning related projects with ICSA colleagues. At Microsoft he leads a research team investigating hardware accelerators for datacenter workloads. He is active in the LLVM developer community and a number of IEEE and ACM conferences, and has over 50 patents pending.

Thur, 8 November 2018 4pm IF 4.31/4.33 ICSA
Dr Michele Weiland
NEXTGenIO - the quest to improve I/O performance

One of the major challenges to achieving sustained performance in large-scale (HPC) applications is the I/O bottleneck. Processing performance has increased significantly over the years, but with I/O performance not improving at the same rate, reading and writing large amounts of data (in particular from/to a parallel file system) can be the limiting factor to an application's time to solution and scalability.

NEXTGenIO is an EC-funded project that is addressing this performance bottleneck by introducing byte-addressable persistent memory (Intel's Optane DC Persistent Memory) into the compute nodes, increasing both I/O performance and node-local memory and storage capacity. The project is developing (and will soon deploy) a prototype system, together with a system software stack, that will greatly speed up data-intensive applications.
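As a flavour of what "byte-addressable" means in practice, here is a minimal sketch using PMDK's libpmem, one common programming interface for this class of memory (the NEXTGenIO software stack itself is broader; the file path below is a made-up example).

/* Persistent memory accessed with ordinary loads and stores: map a file
 * from a DAX filesystem, write to it directly, then flush CPU caches so
 * the data survives power loss. Link with -lpmem (PMDK). */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;
    char *addr = pmem_map_file("/mnt/pmem/example", 4096,
                               PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (addr == NULL) { perror("pmem_map_file"); return 1; }

    strcpy(addr, "stored with a memory write, not a write() syscall");

    if (is_pmem)
        pmem_persist(addr, mapped_len);   /* cache-line flush, no msync */
    else
        pmem_msync(addr, mapped_len);     /* fall back on regular msync */

    pmem_unmap(addr, mapped_len);
    return 0;
}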

In this talk, I will present the system architecture and the use cases we are going to support. I will also talk about how you may get access to the system once it has arrived in Edinburgh.

Wed, 7 November 2018 2pm BC G.03 EPCC

Álvaro Fernández Millara (University of Oviedo)

Detecting and correcting for measurement quality degradation in a laser triangulation system

Laser triangulation is a commonly used technique for 3D scanning. Many of the applications of 3D scanning, and especially quality control, require narrow error margins. Therefore, some thought has to be given to how the capabilities of a laser triangulation system may degrade with time, and what can be done about that.

In this talk, I will describe one specific laser triangulation system (namely, a quality control system for rail manufacturing) and some of the factors that may affect the quality of its measurements. I will then show some approaches that can be used to quantify these factors, and discuss how I am using high-performance computing to develop techniques to compensate for them.

Tue, 30 October 2018 3pm EPCC
Dr John Goodenough (ARM)
Scaling to a Trillion IoT Devices

ARM microprocessor technologies in embedded compute, together with IoT service management through the Pelion Platform, are central to many existing and emerging distributed IoT services. As an increasing number of value-added application services are deployed, there are significant challenges to be overcome in scaling to a future with a trillion managed devices. This talk will give an overview of Arm's research activities and emerging collaborations within the broad IoT space, ranging from energy-harvesting sensor nodes to full data-lifecycle privacy and attestation services. With a focus on developing technologies that address barriers to successful proliferation and deployment, including secure service operations and data lifecycle management, the presentation is intended to set a perspective on the challenges ahead and to prompt some lively debate and dialogue on emerging research questions and trajectories.

Wed, 24 October 2018 2pm IF 4.31/4.33 ICSA

Pablo Cerro Cañizares (Complutense University of Madrid)
Large Scale Mutation Testing

Large-scale systems have been widely adopted due to their cost-effectiveness and the evolution of networks. In general, large-scale systems can be used to reduce the long execution time of applications that require a vast amount of computational resources; in particular, techniques that are usually deployed in centralized environments - like testing - can be deployed in these systems. Currently, one of the main challenges in testing is to obtain an appropriate test suite. In essence, the main difficulty lies in the elevated number of potential test cases. Mutation testing is a valuable technique for measuring the quality of test suites that can be used to overcome this difficulty. However, one of its main drawbacks is the high computational cost of the process.
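As a reminder of what the technique measures, a mutant is a small syntactic change to the program under test; the fraction of mutants "killed" (detected) by the test suite scores the suite. A tiny illustrative example of ours, unrelated to the OUTRIDER benchmarks:

/* Original function and one mutant (relational operator replacement). */
int max(int a, int b)     { return a > b ? a : b; }  /* original            */
int max_mut(int a, int b) { return a < b ? a : b; }  /* mutant: > becomes < */
/* A test such as assert(max(2, 1) == 2) kills this mutant; a suite that
 * only checks assert(max(3, 3) == 3) leaves it alive, exposing a gap. */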

In this work, we propose two improvements based on our previous approach, called OUTRIDER, a set of strategies to optimize the mutation testing process in distributed systems. Although OUTRIDER efficiently exploits the computational resources in distributed systems, several bottlenecks have been detected while applying these strategies on large-scale systems. For this reason, this proposal is threefold: i) providing a hybrid algorithm designed to reduce the communication between the master and the worker processes while maintaining a high level of resource usage; ii) improving the compilation phase; and iii) comparing the proposal with other distribution solutions, such as Spark or cloud systems.

Wed, 17 October 2018 2pm Dugald Stewart 1.20 EPCC
Amna Shahab
Farewell My Shared LLC! A Case for Private Die-Stacked DRAM Caches for Servers

The slowdown in technology scaling mandates a rethinking of conventional CPU architectures in the quest for higher performance and new capabilities. This work takes a step in this direction by questioning the value of on-chip shared last-level caches (LLCs) in server processors and argues for a better alternative. Shared LLCs have a number of limitations, including on-chip area constraints that limit storage capacity, long planar interconnect spans that increase access latency, and contention for the shared cache capacity that hurts performance under workload colocation. To overcome these limitations, we propose a Die-Stacked Private LLC Organization (SILO), which combines conventional on-chip private L1 (and optionally, L2) caches with a per-core private LLC in die-stacked DRAM. By stacking LLC slices directly above each core, SILO avoids long planar wire spans. The use of private caches inherently avoids inter-core cache contention. Last but not least, engineering the DRAM for latency affords low access delays while still providing over 100MB of capacity per core in today's technology. Evaluation results show that SILO outperforms state-of-the-art conventional cache architectures on a range of scale-out and traditional workloads while delivering strong performance isolation under colocation.

Tue, 16 October 2018 2pm IF 2.33 CArD
Vijayanand Nagarajan
A Peek into Concurrency and Distributed Systems through the Lens of Shared Memory Multiprocessors

The shared memory consistency model is the critical hardware-software interface in multiprocessors that specifies what value a read must return. However, this seemingly simple interface has been facing a number of notorious challenges, including: (1) unclear specification; (2) implementations riddled with bugs; and (3) tension between programmability and efficiency. In this talk, Vijay will first overview the work his group has been doing for the past decade in addressing these challenges. Midway into the research, they realised a couple of key insights: (1) hardware design for shared memory should not be guided by ad hoc informal specifications. Instead, they advocate for implementations to be derived from the specification through a series of refinements. This not only leads to designs that are correct but also (somewhat surprisingly) to designs that are efficient. (2) Although today's distributed systems (e.g. a distributed key-value store in a datacentre) don't look anything like "shared memory", they fundamentally share state, and hence there is an opportunity for the exchange of ideas between the distributed systems and computer architecture communities. Based on these realisations, Vijay will outline a vision for synthesizing efficient and correct-by-construction distributed systems - a problem that has had a rich history, including at Edinburgh, but one whose time has come.

Thur, 11 October 2018 4pm IF 4.31/4.33 ICSA
Jörg Thalheim
Lightweight OS Containers

Container-based virtualization has become the de facto standard for deploying applications in data centers. However, deployed containers frequently include a wide range of tools (e.g., debuggers) that are not required by applications in the common case, but are included for rare occasions such as in-production debugging. As a consequence, containers are significantly larger than necessary for the common case, increasing build and deployment time. Cntr provides the performance benefits of lightweight containers and the functionality of large containers by splitting the traditional container image into two parts: the “fat” image containing the tools, and the “slim” image containing the main application. At run-time, Cntr allows the user to efficiently deploy the “slim” image and then expand it with additional tools, when and if necessary, by dynamically attaching the “fat” image. To achieve this, Cntr transparently combines the two container images using a new nested namespace, without any modification to the application, the container manager, or the operating system. We have implemented Cntr in Rust, using FUSE, and incorporated a range of optimizations. Cntr supports the full Linux filesystem API, and it is compatible with all container implementations (i.e., Docker, rkt, LXC, systemd-nspawn). Through extensive evaluation, we show that Cntr incurs reasonable performance overhead while reducing the size of the Top-50 images available on Docker Hub by, on average, 66.6%.
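To see the namespace mechanics such tools build on, here is a minimal sketch of ours (Cntr itself is written in Rust and additionally uses FUSE to keep the “fat” image's tools visible): joining the mount namespace of a running container process with the standard Linux setns(2) call, then spawning a shell inside it.

/* Join a container's mount namespace and run a shell there.
 * Error handling kept short; typically needs root. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <container-pid>\n", argv[0]);
        return 1;
    }
    char path[64];
    snprintf(path, sizeof path, "/proc/%s/ns/mnt", argv[1]);

    int fd = open(path, O_RDONLY);          /* handle to the target namespace  */
    if (fd < 0) { perror("open"); return 1; }
    if (setns(fd, CLONE_NEWNS) < 0) {       /* switch into its mount namespace */
        perror("setns"); return 1;
    }
    close(fd);
    execlp("/bin/sh", "sh", (char *)NULL);  /* shell now sees the container fs */
    perror("execlp");
    return 1;
}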

Thur, 4 October 2018 4pm IF 4.31/4.33 CArD
Gordon Brebner (Xilinx Labs)
Programmable networking: Extending the range of the P4 language

In four years, P4 has evolved from a paper proposal to a packet processing programming language with increasing adoption worldwide, overseen by the P4 Language Consortium (P4.org). The talk will first overview developments over this period, which have brought the community to the current P4_16 language specification, the PSA (Portable Switch Architecture) specification, and the P4Runtime API specification.

It will then discuss some current community efforts to extend the reach of P4. One of the key developments in 2017 was language-architecture separation, leading to the P4_16 (language) and PSA (architecture) threads. In practice, NICs (Network Interface Cards, notably Smart NICs) are a common target, so one community goal is to define a PNA (Portable NIC Architecture) specification, to complement the existing PSA specification. A bigger picture is to extend P4 to allow the description of architectures, which is the goal of the Programmable Target Architecture (PTA) research project of Stanford and Xilinx Labs. The talk will describe this project, and a current prototype that compiles extended P4 descriptions to FPGA-based hardware implementations. An important test case for the new approach will be the expression of both PSA and PNA (when ultimately defined) in the extended "P4+" rather than in English plus diagrams.

Currently, P4 is focused on packet processing - through parsing, match-action pipelines, and deparsing. Another ongoing research project, involving MIT, NYU, Stanford, and Xilinx Labs, concerns extending P4 (language and architecture) to cover traffic management - providing programmable packet scheduling, shaping, policing, queueing, etc. The talk will overview this project, and a current prototype based on the PIFO scheduling model that was presented at SIGCOMM 2016. Finally, the talk will consider the future evolution of the open-source community around P4, including the development of comprehensive reference examples for both switch and NIC architectures, for both software and programmable hardware implementations.

Wed, 26 September 2018 2pm IF 4.31/4.33 ICSA
Björn Franke
Software Transformation Driven By Dynamic Information

I will talk about my research on compiler-driven program transformation, where static analyses have been enhanced with, or replaced by, dynamic information. I will give examples of research compilers used for program optimisation, parallelisation, and just-in-time compilation, and show how these different application contexts determine how dynamic information is gathered and incorporated. I will conclude with an overview of my research vision for compilers operating in an increasingly fast-paced environment, where the only constant is change.

Tue, 25 September 2018 4pm IF G.07 ICSA

Adam Valles Mari (Institute of Photonic Sciences, Barcelona)
New ways of detecting spatial modes of light using machine learning

Photons can be described in terms of their spatial modes – the “patterns” of light. As there are an infinite number of spatial modes, entanglement in this degree of freedom offers the opportunity to realize high-dimensional quantum states. In this seminar, we will review some applications where the patterns of light can be used, studying the advantages and disadvantages of using such entangled states in ghost imaging experiments, or as a means to encode information for secure quantum communication channels, considering the preservation of entanglement through noisy channels, e.g., a turbulent atmosphere. We will explain how to create such states in the laboratory and how to improve their detection by using machine learning techniques.

Wed, 19 September 2018 2pm BC G.03 EPCC

Francisco Alarcón Oseguera (Universidad Complutense de Madrid)
Mesoscopic simulations for tailoring active particles

Active matter is concerned with the study of systems composed of self-driven units: active particles capable of converting energy into systematic movement, in contrast to Brownian particles, which are merely subject to thermal fluctuations. A remarkable feature of active matter is its natural tendency to self-organize. One striking instance of this ability is the generation of so-called living clusters, where clusters broadly distributed in size constantly move and evolve through particle exchange, breaking, or merging. Experiments in this field are developing at a rapid pace, and a new theoretical framework is needed to establish “universal” behaviour among these internally driven systems.

In this talk, I will show results of numerical simulations of active particles forming living clusters; such structures are similar to the clusters observed in experiments with both Janus chemotactic and dipolar active particles. I will present the influence of both hydrodynamic and anisotropic interactions on the formation of clusters by measuring morphological and dynamical features of the system.

Tue, 18 September 2018 2pm BC G.03 EPCC