ADA participated in MICRO 2022 (October 1-5) in Chicago, IL, with a large presence this year.
Detailed information below:
Workshop and Tutorial
Driving Architecture Specialization with Benchmarks for Emerging Applications was a well-received workshop organized by ADA PIs Valeria Bertacco (Univ. of Michigan), Sharad Malik (Princeton), Todd Austin (Univ. of Michigan), and David Brooks (Harvard). Benchmark researchers from academia and industry were invited to submit proposals on benchmark suites in their area. Seven presentations were given during the event, including talks from four University of Michigan ADA members (Andrew McCrabb, Todd Austin, Reetuparna Das, and Baris Kasikci), followed by an interactive Q&A panel with all seven presenters.
The workshop brought together researchers who have embarked upon new benchmarking efforts over the past several years. The goal of the workshop was to allow benchmarking researchers to share lessons learned and best practices for identifying, developing, and gaining adoption of benchmarks, and for the broader community to learn about recent offerings that support architecture research. Ultimately, we seek to promulgate best practices in engaging with the prospective user communities for new benchmarks.
ADA PI Chris Batten co-organized the OpenROAD tutorial, Open-Source ASIC Design for Computer Architects, which took place in Chicago on the afternoon of October 1st.
The OpenROAD Project was founded in 2018 under the DARPA IDEA program to address the excessive effort, cost, and time required for hardware design. Since then, the project has shown great success in providing a free, open-source implementation flow for ASIC designs in technology nodes as small as 7nm. Notably, OpenROAD has enabled first-time chip designers, hobbyists, and students to fabricate chips through the Google/Efabless/SkyWater 130nm free shuttle program. Over 100 designs have been silicon-tested on the first two shuttle runs, and hundreds more have been taped out.
OpenROAD provides a tremendous opportunity for researchers to perform design space exploration, collaborate, and construct real chips in a free and open-source manner.
Publications by ADA researchers
Five papers co-authored by ADA researchers were presented at the 2022 IEEE/ACM International Symposium on Microarchitecture (MICRO), the premier research forum for new microarchitecture techniques that advance computing and communication systems.
The five papers covered eliminating branch mispredictions in modern data center applications, a novel accelerator architecture and programming model for mining temporal motifs efficiently, the first online code layout optimization system for unmanaged code, multi-layer in-memory computing, and an area-efficient big.VLITTLE architecture that accelerates data-parallel workloads on mobile SoCs.
Whisper: Profile-Guided Branch Misprediction Elimination for Data Center Applications (Best Paper Award Winner)
Tanvir Ahmed Khan, Muhammed Ugur (University of Michigan); Krishnendra Nathella, Dam Sunwoo (Arm Research); Heiner Litz (University of California, Santa Cruz); Daniel A. Jiménez (Texas A&M University); Baris Kasikci (University of Michigan)
Abstract: Modern data center applications experience frequent branch mispredictions – degrading performance, increasing cost, and reducing energy efficiency in data centers. Even the state-of-the-art branch predictor, TAGE-SC-L, suffers from an average branch Mispredictions Per Kilo Instructions (branch-MPKI) of 3.0 (0.5-7.2) for these applications since their large code footprints exhaust TAGE-SC-L’s intended capacity.
In this work, we propose Whisper, a novel profile-guided mechanism to avoid branch mispredictions. Whisper investigates the in-production profile of data center applications to identify precise program contexts that lead to branch mispredictions. Corresponding prediction hints are then inserted into code to strategically avoid those mispredictions during program execution. Whisper presents three novel profile-guided techniques: (1) hashed history correlation which efficiently encodes hard-to-predict correlations in branch history using lightweight Boolean formulas, (2) randomized formula testing which selects a locally optimal Boolean formula from a randomly selected subset of possible formulas to predict a branch, and (3) the extension of Read-Once Monotone Boolean Formulas with Implication and Converse Non-Implication to improve the branch history coverage of these formulas with minimal overhead.
We evaluate Whisper on 12 widely-used data center applications and demonstrate that Whisper enables traditional branch predictors to achieve a speedup close to that of an ideal branch predictor. Specifically, Whisper achieves an average speedup of 2.8% (0.4%-4.6%) by reducing 16.8% (1.7%-32.4%) of branch mispredictions over TAGE-SC-L and outperforms the state-of-the-art profile-guided branch prediction mechanisms by 7.9% on average.
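To make the numbers above concrete, the sketch below shows how branch-MPKI is computed and what a read-once monotone Boolean formula over branch-history bits might look like. This is a toy illustration, not Whisper's implementation; the specific formula and history values are hypothetical.

```python
# Branch MPKI: mispredictions per thousand (kilo) instructions executed.
def branch_mpki(mispredictions, instructions):
    return mispredictions / (instructions / 1000)

# A read-once monotone Boolean formula over selected branch-history bits,
# in the spirit of Whisper's hashed history correlation: each history bit
# appears at most once, and only AND/OR connectives are used.
def formula(history):
    # hint = (h0 AND h3) OR h7  -- a hypothetical formula
    return bool((history[0] and history[3]) or history[7])

history = [True, False, False, True, False, True, False, False]
print(branch_mpki(3_000_000, 1_000_000_000))  # 3.0, the paper's average MPKI
print(formula(history))  # True -> e.g., emit a "predict taken" hint
```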
Mint: An Accelerator For Mining Temporal Motifs
Nishil Talati, Haojie Ye (University of Michigan); Sanketh Vedula (Technion); Kuan-Yu Chen, Yuhan Chen, Daniel Liu, Yichao Yuan, David Blaauw (University of Michigan); Alex Bronstein (Technion); Trevor Mudge, Ronald Dreslinski (University of Michigan)
Abstract: A variety of complex systems, including social and communication networks, financial markets, biology, and neuroscience, are modeled using temporal graphs that contain a set of nodes and directed, timestamped edges. Temporal motifs in temporal graphs generalize subgraph patterns in static graphs in that they also account for edge ordering and time duration, in addition to the graph structure. Mining temporal motifs is a fundamental problem used in several application domains. However, existing software frameworks offer suboptimal performance due to the high algorithmic complexity and irregular memory accesses of temporal motif mining.
This paper presents Mint—a novel accelerator architecture and a programming model for mining temporal motifs efficiently. We first divide this workload into three fundamental tasks: search, book-keeping, and backtracking. Based on this, we propose a task-centric programming model that enables decoupled, asynchronous execution. This model unlocks massive opportunities for parallelism and allows storing task context information on-chip. To best utilize the proposed programming model, we design a domain-specific hardware accelerator whose data path and memory subsystem are designed to cater to the unique workload characteristics of temporal motif mining. To further improve performance, we propose a novel optimization called search index memoization that significantly reduces memory traffic. We comprehensively compare the performance of Mint with state-of-the-art temporal motif mining software frameworks (both approximate and exact) running on both CPU and GPU, and show a 9x-2576x performance benefit.
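To illustrate what a temporal motif adds over a static subgraph pattern, here is a minimal sketch of a matching check that enforces both edge ordering and a time window. The node names, the triangle pattern, and the window are hypothetical examples; Mint's actual search, book-keeping, and backtracking machinery is far more involved.

```python
# A temporal motif is a small ordered edge pattern: candidate edges must
# appear in the given order, instantiate the pattern's structure, and all
# fall within a time window delta.
def matches_motif(edges, pattern, delta):
    """edges: list of (src, dst, t) sorted by timestamp t.
    pattern: list of (src_label, dst_label) with symbolic node labels.
    Returns True if the edge sequence instantiates the pattern within delta."""
    if len(edges) != len(pattern):
        return False
    binding = {}  # symbolic label -> concrete node
    for (u, v, _), (pu, pv) in zip(edges, pattern):
        for node, label in ((u, pu), (v, pv)):
            # bind label on first sight; later sightings must agree
            if binding.setdefault(label, node) != node:
                return False
    return edges[-1][2] - edges[0][2] <= delta

# a->b, b->c, c->a within 10 time units: a "temporal triangle"
edges = [("a", "b", 1), ("b", "c", 4), ("c", "a", 9)]
pattern = [("X", "Y"), ("Y", "Z"), ("Z", "X")]
print(matches_motif(edges, pattern, delta=10))  # True
print(matches_motif(edges, pattern, delta=5))   # False: spans 8 time units
```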
OCOLOS: Online COde Layout OptimizationS
Yuxuan Zhang (University of Pennsylvania); Tanvir Ahmed Khan (University of Michigan); Gilles A. Pokam (Intel); Baris Kasikci (University of Michigan); Heiner Litz (University of California, Santa Cruz); Joseph Devietti (University of Pennsylvania)
Abstract: The processor front-end has become an increasingly important bottleneck in recent years due to growing application code footprints, particularly in data centers. First-level instruction caches and branch prediction engines have not been able to keep up with this code growth, leading to more front-end stalls and lower Instructions Per Cycle (IPC). Profile-guided optimizations performed by compilers represent a promising approach, as they rearrange code to maximize instruction cache locality and branch prediction efficiency along a relatively small number of hot code paths. However, these optimizations require continuous profiling and rebuilding of applications to ensure that the code layout matches the collected profiles. If an application’s code is frequently updated, it becomes challenging to map profiling data from a previous version onto the latest version, leading to ignored profiling data and missed optimization opportunities.
In this paper, we propose OCOLOS, the first online code layout optimization system for unmodified applications written in unmanaged languages. OCOLOS allows profile-guided optimization to be performed on a running process, instead of being performed offline and requiring the application to be relaunched. By running online, profile data is always relevant to the current execution and always maps perfectly to the running code. OCOLOS demonstrates how to achieve robust online code replacement in complex multithreaded applications like MySQL and MongoDB, without requiring any application changes. Our experiments show that OCOLOS can accelerate MySQL by up to 1.41x, the Verilator hardware simulator by up to 2.20x, and a build of the Clang compiler by up to 1.14x.
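The payoff OCOLOS targets comes from the classic profile-guided layout idea: pack hot code together so it occupies fewer instruction-cache lines. The toy sketch below shows only that ordering step; all function names, sizes, and call counts are hypothetical, and OCOLOS itself goes much further by rewriting machine code inside a live process.

```python
# Profile-guided code layout in miniature: order functions hottest-first so
# frequently executed code is contiguous and i-cache friendly.
def layout_by_hotness(functions):
    """functions: dict name -> (size_bytes, call_count).
    Returns function names ordered by descending call count."""
    return sorted(functions, key=lambda name: functions[name][1], reverse=True)

# Hypothetical profile collected while the program runs.
profile = {
    "parse":      (4096, 10),
    "exec_query": (2048, 9000),
    "log_slow":   (1024, 3),
    "hash_row":   (512,  8800),
}
print(layout_by_hotness(profile))
# ['exec_query', 'hash_row', 'parse', 'log_slow']
```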
Multi-Layer In-Memory Processing
Daichi Fujiki (Keio University); Alireza Khadem (University of Michigan); Scott Mahlke (University of Michigan / NVIDIA Research); Reetuparna Das (University of Michigan)
Abstract: In-memory computing provides revolutionary changes to computer architecture by fusing memory and computation, allowing data-intensive computations to reduce data communications. Despite promising results of in-memory computing in each layer of the memory hierarchy, an integrated approach to a system with multiple computable memories has not been examined. This paper presents a holistic and application-driven approach to building Multi-Layer In-Memory Processing (MLIMP) systems, enabling applications with variable computation demands to reap the benefits of heterogeneous compute resources in an integrated MLIMP system. By introducing concurrent task scheduling to MLIMP, we achieve improved performance and energy efficiency for graph neural networks and multiprogramming of data-parallel applications.
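To give a flavor of concurrent task scheduling across heterogeneous compute layers, the sketch below greedily dispatches each task to whichever layer frees up first, weighted by a per-layer speed factor. The layer names and speeds are hypothetical; this is a generic greedy scheduler, not the MLIMP scheduler itself.

```python
import heapq

# Greedy list scheduling across heterogeneous compute layers: each task goes
# to the layer that becomes available soonest; its runtime is work / speed.
def schedule(tasks, layers):
    """tasks: list of work amounts; layers: dict name -> speed factor.
    Returns (makespan, assignment list of (task_index, layer_name))."""
    heap = [(0.0, name) for name in layers]  # (time the layer frees up, name)
    heapq.heapify(heap)
    assignment = []
    for i, work in enumerate(tasks):
        free_at, name = heapq.heappop(heap)
        finish = free_at + work / layers[name]
        assignment.append((i, name))
        heapq.heappush(heap, (finish, name))
    return max(t for t, _ in heap), assignment

# Hypothetical system: a fast in-cache layer and a slower in-DRAM layer.
makespan, plan = schedule([4, 4, 2, 2], {"cache-layer": 2.0, "dram-layer": 1.0})
print(makespan)  # 4.0
```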
big.VLITTLE: On-Demand Data-Parallel Acceleration for Mobile Systems on Chip
Tuan Ta, Khalid Al-Hawaj, Nick Cebry, Yanghui Ou, Eric Hall, Courtney Golden, Christopher Batten (Cornell University)
Abstract: Single-ISA heterogeneous multi-core architectures offer a compelling high-performance and high-efficiency solution to executing task-parallel workloads in mobile systems on chip (SoCs). In addition to task-parallel workloads, many data-parallel applications, such as machine learning, computer vision, and data analytics, increasingly run on mobile SoCs to provide real-time user interactions. Next-generation scalable vector architectures, such as the RISC-V Vector Extension and Arm SVE, have recently emerged as unified vector abstractions for both large- and small-scale systems. In this paper, we propose novel area-efficient high-performance architectures called big.VLITTLE that support next-generation vector architectures to efficiently accelerate data-parallel workloads in conventional big.LITTLE systems. big.VLITTLE architectures reconfigure multiple little cores on demand to work as a decoupled vector engine when executing data-parallel workloads. Our results show that a big.VLITTLE system can achieve 1.6x performance speedup over an area-comparable big.LITTLE system equipped with an integrated vector unit across multiple data-parallel applications and 1.7x speedup compared to an aggressive decoupled vector engine for task-parallel workloads.
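The reconfiguration idea can be pictured with a toy model: when a data-parallel region begins, each "little" core takes a stripe of the vector work, together acting like lanes of one vector engine. The sketch below is purely illustrative and does not model big.VLITTLE's decoupled microarchitecture.

```python
# Toy model of ganged little cores: a vector add striped across four "lanes",
# where each lane stands in for one reconfigured little core.
def vector_add(a, b, lanes=4):
    assert len(a) == len(b)
    out = [0] * len(a)
    for lane in range(lanes):              # each little core handles one stripe
        for i in range(lane, len(a), lanes):
            out[i] = a[i] + b[i]
    return out

print(vector_add([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8))
# [11, 12, 13, 14, 15, 16, 17, 18]
```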