PPar Lunch Series 2018/19
PPar lunch series schedule for 2018/19
|Introducing Cohort 2018
|Pablo Andres Martinez
Towards Distributed Quantum Algorithms
Quantum computers are capable of efficiently executing computations that are intractable on classical computers. This difference in computational power is achieved by exploiting some properties of quantum mechanics. However, these same properties impose tough challenges on the development of large-scale quantum computers. Small quantum computers have already been built, and some experts have proposed a distributed grid of small quantum computers as an alternative to large-scale monolithic architectures. Currently, the literature offers no support for the design of distributed quantum programs. In this talk we address this gap, presenting an algorithm that automatically distributes any input quantum circuit. The algorithm proceeds by reducing the problem to hypergraph partitioning.
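As a toy illustration of the reduction, the sketch below treats each two-qubit gate as a plain edge between qubit vertices and exhaustively searches for the balanced bipartition that cuts the fewest gates (each cut gate would need inter-device communication). This is a simplification made for illustration only: the talk's actual algorithm groups gates into hyperedges and uses hypergraph partitioning, and the names `cut_cost` and `best_bipartition` are hypothetical.

```python
from itertools import combinations

# Toy circuit: each two-qubit gate is a pair of qubit indices.
circuit = [(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)]

def cut_cost(circuit, partition):
    """Gates whose qubits sit on different devices, i.e. gates that
    would require inter-device communication."""
    return sum(1 for a, b in circuit if partition[a] != partition[b])

def best_bipartition(circuit, n_qubits):
    """Exhaustively search balanced bipartitions (fine at toy sizes;
    real partitioners use heuristics)."""
    qubits = range(n_qubits)
    best_cost, best_part = None, None
    for left in combinations(qubits, n_qubits // 2):
        part = {q: 0 if q in left else 1 for q in qubits}
        cost = cut_cost(circuit, part)
        if best_cost is None or cost < best_cost:
            best_cost, best_part = cost, part
    return best_cost, best_part
```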
iPregel: A Combiner-Based In-Memory Shared Memory Vertex-Centric Framework
The expressiveness of the vertex-centric programming model introduced by Pregel attracted great attention. Over the years, numerous frameworks emerged, abiding by the same programming model while relying on widely different architectural designs. The vast majority of existing vertex-centric frameworks exploit distributed memory parallelism or out-of-core computations. To our knowledge, only one vertex-centric framework is designed upon in-memory storage and shared memory parallelism. Unfortunately, while built on a faster architecture than that of other vertex-centric frameworks, it did not prove to significantly outperform existing solutions.
In this talk I will present iPregel, another in-memory shared-memory vertex-centric framework. The optimisations presented target three hotspots of vertex-centric calculations: selecting active vertices, routing messages to their recipients and updating recipient inboxes. We will see how iPregel performs against the state-of-the-art in-memory distributed-memory framework Pregel+ on three of the most common vertex-centric applications: PageRank, Hashmin and Single-Source Shortest Path. Experiments demonstrate that the single-node framework iPregel is faster than its distributed memory counterpart until at least 11 nodes are used. Further experiments show that iPregel completes a PageRank application with an order of magnitude less memory than popular vertex-centric frameworks.
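As background on the vertex-centric model, here is a minimal synchronous PageRank in the Pregel style, with incoming messages pre-summed as a combiner would produce them. This is an illustrative sketch, not iPregel's implementation (for instance, it ignores dangling vertices).

```python
def pagerank(edges, n, supersteps=20, d=0.85):
    """Synchronous vertex-centric PageRank. Each superstep, every
    vertex sends its rank share along its outgoing edges; messages
    to a vertex are summed on arrival, as a combiner would do."""
    out = {v: [] for v in range(n)}
    for u, v in edges:
        out[u].append(v)
    rank = [1.0 / n] * n
    for _ in range(supersteps):
        inbox = [0.0] * n  # combined (pre-summed) messages
        for u in range(n):
            for v in out[u]:
                inbox[v] += rank[u] / len(out[u])
        rank = [(1 - d) / n + d * inbox[v] for v in range(n)]
    return rank
```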
Learning Models of Accelerator Behaviour
An increasing number of computational workloads now have accelerated implementations available to improve performance (for example, linear algebra, machine learning or bioinformatics). These accelerators are very varied - specialised libraries, platform-specific optimisation and dedicated hardware are all common examples. Unfortunately, performance typically comes at the expense of portability when using accelerators. Code written for one implementation cannot easily be modified to run on another, or updated in the future for new devices.
To solve this problem I propose the use of program synthesis to automatically learn a model of an accelerator's behaviour. These models can then be integrated with existing compiler-based tools to discover compatible code, and to modify this code to run on the accelerator. In this talk I describe my approach and implementation, as well as applications to real-world codebases.
|No speaker. Meet to chat
Quantum Supremacy in Quantum Machine Learning
In recent years, many companies and groups have been developing small quantum devices, and as such we are nearing the dawn of the NISQ (Noisy Intermediate Scale Quantum) era of quantum technologies.
The most immediate milestone in the field is therefore the demonstration of 'Quantum Computational Supremacy': solving, on these noisy devices, a particular problem which, according to results in complexity theory, cannot be solved efficiently by any classical means.
The problem in question is the generation of random samples by the quantum device, which has no inherent application by itself. We therefore propose the Ising Born Machine, a generative quantum machine learning algorithm, as a potential use for these random samples. Given that these samples cannot be generated classically, we claim that the algorithm cannot be simulated efficiently by any classical means, and we present numerical results demonstrating successful training of the model at a small scale.
Compilation Scheduling Policy for JIT-Based Systems
In cross-platform virtualisation, parts of the target application are compiled Just-In-Time (JIT) to enable fast native execution, as opposed to interpretation. As the compilation is performed while the application is being executed, it contributes to the end-to-end runtime. To offset the compilation cost, only frequently executed (i.e. hot) code regions are compiled. Further speedups can be achieved by decoupling the application execution from the compilation sub-system. This can be done using a compilation queue, which facilitates the transfer of code regions selected for compilation between the runtime and the compilation engine.
My work explores new compilation scheduling policies for ordering code regions in the compilation queue. The most significant innovation is the Dynamic Heat policy, which schedules code regions based on their execution frequency (i.e. heat). This heat is derived dynamically by profiling both before and after code regions are dispatched to the compilation queue.
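A minimal sketch of the idea, under a simplified model I am assuming for illustration (not the talk's implementation): heat is an execution count that keeps accumulating after a region is enqueued, and the compiler thread always dispatches the currently hottest pending region.

```python
class CompilationQueue:
    """Toy dynamic-heat compilation queue: profiling continues after
    enqueue, and dequeue picks the region that is hottest *now*."""

    def __init__(self):
        self.heat = {}        # region -> execution count so far
        self.pending = set()  # regions awaiting compilation

    def record_execution(self, region, count=1):
        # Profiling keeps running even for already-enqueued regions.
        self.heat[region] = self.heat.get(region, 0) + count

    def enqueue(self, region):
        self.pending.add(region)

    def dequeue(self):
        """Dispatch the currently hottest pending region, or None."""
        if not self.pending:
            return None
        hottest = max(self.pending, key=lambda r: self.heat.get(r, 0))
        self.pending.discard(hottest)
        return hottest
```

Because the priority is read at dispatch time rather than frozen at enqueue time, a region that becomes hot while waiting in the queue is compiled earlier.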
|No lunch (PG Open day)
|No speaker. Meet to chat
Where the linear lambdas go
Have you ever wondered where the lambdas live, and whether you could just grab a random few to see if your compiler works, only to realise "WHOA, that's kinda hard!"? It turns out that linear lambdas live in an even more elusive land, where none of the existing solutions – as far as I know – really apply. This is a common problem with support for those pesky linear types. This talk will drag you through some rather naive solutions, some existing work, and some promises of future work.
|PPar Lunch Deluxe
|End of semester social.
Hermes: Coherence for the Datacenter
Today’s online services are underpinned by datastores that need to provide high throughput to keep up with ever-increasing demand. To guarantee high availability in the presence of failures, these datastores replicate the data across several nodes. This is accomplished with the help of a replication protocol that is responsible for keeping the replicas consistent by ensuring that each write is propagated to all of them. Strong consistency is preferred to weaker consistency models that cannot guarantee intuitive behaviour for clients. In this work, we observe that existing replication protocols offering high availability and strong consistency are fundamentally unable to deliver high performance. We identify two endemic sources of performance loss: lack of concurrency on writes and/or the inability to offer local reads. Meanwhile, it is well known that shared-memory multiprocessors offer both strong consistency and high performance by using invalidating coherence protocols that support local reads (cache hits) and offer high concurrency on reads and writes to different memory locations across threads. However, invalidating coherence protocols are not resilient. Based on these observations, we propose Hermes, a fault-tolerant replication protocol that guarantees strong consistency and high performance. Similarly to multiprocessor coherence protocols, Hermes uses an invalidation-based mechanism to achieve linearizability. In contrast to the multiprocessor, Hermes achieves fault tolerance by avoiding single points of failure and guaranteeing that any uncommitted write can be replayed. Using an RDMA-based datastore with 5 replicas, we show that Hermes outperforms two state-of-the-art replication protocols by more than 4× under a workload with 5% writes, while offering strong consistency in the presence of faults.
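To give a flavour of an invalidation-based write, here is a toy two-phase sketch: a write first invalidates the key at every replica (so stale local reads block), then validates it once all replicas have acknowledged. This is a deliberately simplified model of the general technique, not the Hermes protocol itself, which additionally handles failures, concurrent writers and write replays; all names here are mine.

```python
VALID, INVALID = "valid", "invalid"

class Replica:
    """One replica's key-value store; entries carry a coherence state."""

    def __init__(self):
        self.store = {}  # key -> [state, timestamp, value]

    def invalidate(self, key, ts, value):
        entry = self.store.get(key)
        if entry is None or ts > entry[1]:
            self.store[key] = [INVALID, ts, value]

    def validate(self, key, ts):
        entry = self.store.get(key)
        if entry is not None and entry[1] == ts:
            entry[0] = VALID

    def local_read(self, key):
        """Reads are served locally, but only from Valid entries."""
        entry = self.store.get(key)
        return entry[2] if entry is not None and entry[0] == VALID else None

def replicated_write(replicas, key, value, ts):
    # Phase 1: invalidate the key everywhere (stale local reads block).
    for r in replicas:
        r.invalidate(key, ts, value)
    # Once all replicas acknowledge, the write commits.
    # Phase 2: validate, re-enabling local reads of the new value.
    for r in replicas:
        r.validate(key, ts)
```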
Capturing the loop parallelisability property: from software parallelisability metrics to a machine-learning-based approach
We live in a world where application performance matters. To fully utilise the potential of modern parallel high-performance architectures and achieve the maximum possible performance, software must be parallel as well. Today, program parallelisation is done in several different ways: manually, automatically, and through all possible combinations of the two.
While quite effective overall, these approaches have inherent limitations and restrictions. Manual parallelisation requires a wide range of skills and expertise from the programmer, while automatic parallelisation has to be conservative and does not exploit all available parallelisation opportunities.
In this talk I will present a step towards supplementing the available spectrum of software parallelisation approaches with a new, machine-learning-assisted one.
Training fast and reliable English-Chinese NMT models for large-scale e-commerce usage
Machine translation is a key aspect of modern e-commerce. During my internship at eBay last year, I was responsible for deploying models for large-scale online usage, adapted for translating titles, queries and descriptions from English to Chinese. In this presentation, I will show the progress of this task, its challenges and results, focusing on both high quality and very fast inference.
|No speaker. Meet to chat
Towards Mapping Lift to Deep Neural Network Accelerators
Deep Neural Network (DNN) accelerators enjoy a rise in popularity due to the ubiquity of DNN applications. Devices to accelerate DNNs -- CPUs, GPUs, ASICs, FPGAs -- vary significantly, making it an increasingly difficult challenge to extract performance from them. Approaches proposed to address this problem lack either portability or extensibility.
Lift is a novel approach that produces performance-portable GPU and CPU code for linear algebra, sparse matrix and stencil computations. Lift uses rewrite rules to detect and transform patterns for the parallelism, memory configuration and instruction set of the target hardware. In this talk, I present preliminary work on applying Lift to the generation of optimised code for DNN accelerators by mapping expressions to coarse-grained ISA primitives. Making a case for the extensibility of Lift, I discuss the required additions to the IR, type system, code generation and rewrite rules.
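As a tiny illustration of what a rewrite rule looks like, the sketch below encodes the classic map-fusion rule (map f ∘ map g = map (f ∘ g)) over plain Python lists. Lift's actual rules operate on its typed IR, so the names and representation here are purely illustrative.

```python
def compose(f, g):
    return lambda x: f(g(x))

def mmap(f):
    """'map' as a first-class pattern over lists."""
    return lambda xs: [f(x) for x in xs]

def fuse_maps(f, g):
    """Rewrite rule: map f . map g  ==>  map (f . g).
    Fusing removes the intermediate array -- the kind of
    semantics-preserving step a rewrite-rule system performs
    when lowering to a target's coarse-grained primitives."""
    return mmap(compose(f, g))
```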
|No lunch - Industrial Engagement Event week
ProtoGen: Automatically Generating Directory Cache Coherence Protocols from Atomic Specifications
Non-atomicity of cache coherence transactions in modern multicore processors leads to messages interleaving with other conflicting cache coherence transactions. Designing correct directory cache coherence protocols is therefore a complex and difficult task for architects. ProtoGen is an automated tool that overcomes this design challenge. It takes the description of a directory cache coherence protocol with atomic transactions and generates the corresponding protocol for a multicore processor with non-atomic transactions. ProtoGen's output is a set of finite state machines for the cache and directory controllers, including all the transient states required to handle concurrent transactions.
Using ProtoGen, the complete MSI, MESI, and MOSI protocols have been generated from their stable-state protocol specifications. The generated protocols were verified for safety and deadlock freedom with a model checker. By reducing the number of stalls and merging logically identical states, ProtoGen generates protocols that are identical to, or better than, manually generated ones.
Furthermore, to prove its generality, ProtoGen has been used to generate a non-atomic implementation of the relatively unconventional TSO-CC protocol.
Position-Dependent Arrays and Their Application for High Performance Code Generation
Data types in high-level programming languages capture semantic information about the program. Compilers for languages such as Accelerate, Futhark, Delite or Lift have shown how to exploit this information to produce high-performance code. While these approaches have the ability to handle multi-dimensional arrays, their type systems are currently limited to nested arrays of rectangular shapes.
This talk presents our work on lifting some of these limitations, enabling the representation of irregular, yet statically known, multi-dimensional array shapes at the type level. This will enable the Lift compiler to generate efficient code for applications operating on irregularly shaped structures, such as matrix-triangle multiplication or operations on compact representations of tree-like data structures. In particular, the type system is extended by allowing the array element type to depend, in a restricted way, on its position in the enclosing array. We show how existing high-level patterns, together with a couple of new primitives, naturally express computations on such data structures.
Finally, we show how classical low-level loop optimizations can be expressed in terms of operations on such irregular arrays, highlighting their utility even in applications that are expressible using more traditional methods. We conclude by demonstrating how, using these techniques, the compiler can generate high-performing code, competitive with state-of-the-art alternatives.
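A small illustration of a position-dependent array, assuming nothing beyond plain Python lists: row i of the structure has length i + 1, as in a lower-triangular matrix, yet the ordinary map pattern applies unchanged.

```python
def triangle(n):
    """An irregular but statically known shape: row i has length
    i + 1 (the lower triangle of an n x n matrix)."""
    return [[0] * (i + 1) for i in range(n)]

def map_irregular(f, arr):
    """The usual map pattern works regardless of per-row lengths."""
    return [[f(x) for x in row] for row in arr]
```

In the talk's setting this shape information lives in the type system rather than being discovered at run time, which is what lets the compiler generate efficient code for it.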
Efficient Structures and Optimisations for Neural Networks
The traditional compiler view of neural networks is that of a suspicious black box; a concoction of mysterious trickery and simple matrix operations, tenuously stacked into layers of nebulous hyperparameters and numerical stability hacks, not to be tinkered with lest they tumble. Behind their hype-clad exterior and often dubious mathematical notation, though, lies a highly approximable, very general tool with significant opportunities for acceleration. In this talk I'll give a brief tour of some of the methods I (and others) have been working on to exploit redundancies in neural networks and make them go a bit faster.
Compiler Hacking @ Google
I recently spent a six-month spell at Google as a compiler engineering intern, and I will discuss my experiences in this role. Somewhat surprisingly to me, my work at Google was reminiscent of the work I do here at the university; in other words, I found the gap between industry and academia to be small. During the internship I worked on two different compilers: the Dart Software Development Kit and Triple 20. The former is the main product developed at the site; the latter is a compiler for a domain-specific meta-programming language, which I designed and implemented from the ground up as part of my intern project.
Specialization Opportunities in Graphical Workloads
Computer games are extremely complex graphical applications which require specialized GPU hardware, as performance is critical. For this reason, GPU drivers often include many heuristics to help optimize throughput. Recently however, new APIs are emerging which sacrifice many heuristics for lower-level hardware control and more predictable driver behaviour. This shifts the burden for many optimizations from GPU driver developers to game programmers, but also provides numerous opportunities to exploit application-specific knowledge.
My current project examines different opportunities for specializing GPU code and reducing redundant data transfers. Static analysis of commercial games shows that in some games, up to 97% of the programs' uniform data inputs are unused, as well as up to 62% of those in the interface between vertex and fragment shaders, which shows potential for improving memory usage and communication overheads.
We have also explored the upper limits of specialization if all dynamic inputs are constant at run-time. For instance, if uniform inputs are constant, up to 44% of instructions can be eliminated in some games, with a further 14% becoming constant-foldable at compile time. Analysis of run-time traces reveals that 48-91% of uniform inputs are constant in real games, so values close to the upper limit may be achieved in practice.
Language Integrated Updateable Views
Language Integrated Updateable Views, which build on existing work on Relational Lenses by Bohannon, Pierce and Vaughan, offer an additional paradigm for writing applications that interface with databases in a type-safe manner, similar to Language Integrated Query (LINQ). The existing work on relational lenses lays the theoretical foundation necessary for updateable views; we try to make this work concrete, identify further requirements, and work towards a usable implementation. This talk also describes the general steps required to extend programming languages with new features.
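To make the lens idea concrete, here is a toy select lens on lists of records: `get` computes the view, and `put` writes an updated view back into the source table while keeping the rows the view never exposed. This sketches only the general shape (real relational lenses also enforce functional dependencies and other well-behavedness conditions), and the function names are mine.

```python
def select_get(table, pred):
    """get: the view is the set of rows satisfying the predicate."""
    return [r for r in table if pred(r)]

def select_put(table, new_view, pred):
    """put: keep the rows the view never saw, and replace the
    selected rows with the (possibly edited) view's rows."""
    return [r for r in table if not pred(r)] + list(new_view)
```

The two round-trip lens laws hold for this pair: putting back an unchanged view restores the table (GetPut), and the view of an updated table is the updated view (PutGet).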
Vectorization Aware Loop Unrolling
While compiler optimization passes are treated as standalone, they are not orthogonal to each other. Applying an optimization affects not only the performance of the output code but also the effectiveness of subsequent optimization passes. Loop unrolling and straight-line-code vectorization (SLP) are such a case. SLP thrives on unrolled loops, as the unrolled copies expose more isomorphic straight-line code. Currently, however, even production compilers apply these optimizations independently and without coordination, missing significant vectorization opportunities. Without coordination, unrolling is mainly focused on reducing loop management overhead while avoiding code bloat; whether it ends up helping vectorization is purely incidental. Given the significant benefits in performance, it is important that compilers exploit vector units better. We propose VALU, a novel loop unrolling heuristic that takes vectorization into account when making unrolling decisions. Our heuristic is powered by an analysis that estimates, on the rolled loop, the potential benefit of SLP vectorization for the unrolled version of the loop. The heuristic then selects the unrolling factor that maximizes the utilization of the vector units. Our evaluation on a production compiler shows that VALU uncovers many vectorization opportunities that were missed by the default loop unroller. This results in more vectorized code and significant performance speedups for 17 of the kernels of the TSVC benchmark suite, reaching up to 2× speedup over the already highly optimized O3.
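The flavour of such a heuristic can be sketched with a toy cost model: for each candidate unroll factor, estimate how many of the unrolled copies SLP could pack into vector lanes, charge a small code-bloat penalty, and pick the factor with the best estimated gain. The model below is purely illustrative, with made-up costs, and is not VALU's actual IR-level analysis.

```python
def pick_unroll_factor(vector_width, max_factor=8, bloat_penalty=0.05):
    """Toy vectorization-aware unroll-factor selection. Unrolling by
    f exposes f isomorphic copies of the loop body; SLP packs them
    into vectors of `vector_width` lanes, and leftover copies stay
    scalar. Costs are relative to 1.0 per scalar iteration."""
    best_f, best_gain = 1, float("-inf")
    for f in range(1, max_factor + 1):
        packed = f // vector_width    # vector ops after SLP packing
        leftover = f % vector_width   # copies left scalar
        cost_per_iter = (packed + leftover) / f
        gain = 1.0 - cost_per_iter - bloat_penalty * f
        if gain > best_gain:
            best_f, best_gain = f, gain
    return best_f
```

Under this model a 4-lane unit favours unrolling by 4: it is the smallest factor at which every copy lands in a full vector, before the bloat penalty starts to dominate.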
Kite: Efficient and Available Release Consistency for the Datacenter
Replicated datastores in an asynchronous environment must navigate the natural tension between consistency and performance. Strongly consistent actions invariably incur a higher performance overhead than more relaxed ones. Appreciating this reality, modern datastores often provide primitives of variable consistency, exposing the consistency/performance trade-off to the programmer. Alas, the onus is on the programmer to reason about correctness. In this work, we draw inspiration from the data-race-free programming paradigm, arguing that Release Consistency (RC) can be instrumental in seamlessly navigating the trade-offs. We present Kite, a highly available, replicated key-value store that offers RCLin, a linearizable RC variant, in an asynchronous environment with crash-stop and network failures. Kite incorporates three existing high-performance protocols (Eventual Store, ABD & Paxos), which we implement in an RDMA-enabled and heavily multi-threaded manner, and a novel fast/slow path mechanism to enforce the barrier semantics of RC. We evaluate Kite against Derecho (a state-of-the-art RDMA-based state machine replication system) and an in-house implementation of ZAB (the protocol at the heart of ZooKeeper). We show that Kite achieves orders of magnitude better performance than Derecho and significantly outperforms ZAB.
Verifying temporal properties of continuous systems under contexts and masks
In this talk we demonstrate novel methods for formal verification of continuous systems. We consider temporal properties of non-linear differential equation models expressed in Signal Temporal Logic (STL), and show how discrete jumps arising from the changing context of a system can be expressed using Contextual Operators. Then we show how Taylor-model based Validated Integration and Three-Valued Signal Monitoring may be used to soundly verify these temporal and contextual properties. Next we extend this to monitoring of signals under masks, which allows for more efficient, property-directed checking. We also investigate the effectiveness of these methods in verifying complex temporal and contextual properties of an ecological model of predator-prey interactions in the presence of concentration-dependent role-reversal.
|PPar Lunch Deluxe
|End of semester lunch series wrap-up/social.