ICSA Colloquium - 02/07/19

Hardware-Conscious Optimization of Large Hash Tables and Cuckoo Filters

A common pain point for many workloads is the memory subsystem, whose latency and bandwidth have historically improved more slowly than general-purpose CPU performance.  To combat this growing mismatch, CPU architects have integrated fast hardware memory caches onto the CPU, so that frequently used data can be read from fast on-chip caches rather than making the comparatively costly trip to main memory.  Unfortunately, a growing class of enterprise workloads lacks sufficient locality (a measure of data reuse) to make effective use of CPU caches.

A common cause of this poor locality is the use of hash-based data structures, which are ubiquitous in networking, databases, and bioinformatics.  Hash-based data structures typically offer amortized constant-time lookups and updates, but do so at the expense of hashing inputs to pseudorandom locations.  This pseudorandomness makes caching difficult once the data structure's hot working set exceeds the capacity of the on-chip caches.  Large hash-based data structures therefore make many off-chip memory requests, and because off-chip bandwidth is scarce, those requests often saturate it.
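The locality problem described above can be sketched in a few lines: even consecutive keys are scattered across a large table by a hash function, so each access tends to touch a different cache line. (This is an illustrative sketch, not code from the talk; the table and cache-line sizes are hypothetical.)

```python
# Illustrative sketch (not from the talk): hashing scatters even
# sequential keys across a table, defeating spatial locality.
import hashlib

TABLE_SLOTS = 1 << 20          # hypothetical table with ~1M slots
CACHE_LINE_SLOTS = 8           # e.g., 8 slots fit in a 64-byte cache line

def slot(key: int) -> int:
    """Map a key to a pseudorandom slot via a hash function."""
    digest = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % TABLE_SLOTS

# Ten consecutive keys land on (almost certainly) ten unrelated
# cache lines, so none of them benefits from a neighbor's fetch.
lines = {slot(k) // CACHE_LINE_SLOTS for k in range(10)}
print(len(lines), "distinct cache lines touched")
```

With a sequential array layout the same ten keys would touch only two cache lines; the pseudorandom placement is exactly what makes the hot working set uncacheable once it outgrows the on-chip caches.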

In this talk, I will present several optimizations that improve the throughput of these workloads by being cognizant of this off-chip memory bandwidth limitation.  Whereas prior hash-based data structures typically make multiple memory requests per operation (e.g., per hash table lookup), our data structures (the Horton table and the Morton filter) are carefully engineered to pare away superfluous memory accesses, so that most operations require only one memory access, and at most two.  These optimizations enable Horton tables (a modified cuckoo hash table) to achieve up to 1.8x higher lookup throughput than a standard cuckoo hash table.  Morton filters (a modified cuckoo filter) achieve up to 2.5x, 15.5x, and 1.3x higher lookup, insertion, and deletion throughput, respectively, than a stock cuckoo filter.

This work appeared at USENIX ATC’16 and VLDB’18.  The VLDB’18 paper was one of several fast-tracked to a special issue of the VLDB Journal featuring the best papers of VLDB’18.


Alex Breslow is a researcher at AMD Research, a subdivision of Advanced Micro Devices, Inc.  He is broadly interested in computer systems research and the interaction between software and hardware.  His work has spanned high-performance data structure design, general-purpose GPU computing, and supercomputer and data center scheduling.  Alex received his PhD in 2016 from the University of California, San Diego, where he was advised by Dean Tullsen.  Two of Alex’s papers have been nominated for awards.  His 2013 ACM/IEEE Supercomputing Conference (SC) paper received nominations for both the Best Student Paper and Best Paper awards, and his 2018 International Conference on Very Large Data Bases (VLDB) paper was invited to the VLDB Journal’s Special Issue on the Best Papers of the VLDB’18 Conference.  Outside of work, Alex particularly enjoys running, rock climbing, hiking, and cinema.


Alex Breslow (AMD)

4.31/33, IF