FORMS: Fine-grained Polarized ReRAM-based In-situ Computation for Mixed-signal DNN Accelerator


Abstract: "Recent work demonstrated the promise of using resistive random access memory (ReRAM) as an emerging technology to perform inherently parallel analog domain in-situ matrix-vector multiplication—the intensive and key computation in deep neural networks (DNNs). One key problem is the weights that are signed values. However, in a ReRAM crossbar, weights are stored as conductance of... » read more

Vector Runahead


Abstract: "The memory wall places a significant limit on performance for many modern workloads. These applications feature complex chains of dependent, indirect memory accesses, which cannot be picked up by even the most advanced microarchitectural prefetchers. The result is that current out-of-order superscalar processors spend the majority of their time stalled. While it is possible to bui... » read more

Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology


Abstract: "Emerging applications such as deep neural network demand high off-chip memory bandwidth. However, under stringent physical constraints of chip packages and system boards, it becomes very expensive to further increase the bandwidth of off-chip memory. Besides, transferring data across the memory hierarchy constitutes a large fraction of total energy consumption of systems, and the ... » read more

IChannels: Exploiting Current Management Mechanisms to Create Covert Channels in Modern Processors


Find the technical paper link here. Abstract: "To operate efficiently across a wide range of workloads with varying power requirements, a modern processor applies different current management mechanisms, which briefly throttle instruction execution while they adjust voltage and frequency to accommodate power-hungry instructions (PHIs) in the instruction stream. Doing so 1) reduces the pow... » read more
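
The throttling the abstract describes is visible from software as latency variation: the same instruction sequence takes measurably longer while the processor adjusts voltage and frequency around power-hungry instructions. The C sketch below shows only the generic measurement primitive (repeatedly timing a fixed block of work); it is an illustration under stated assumptions, not the covert-channel construction from the paper, and the plain scalar loop stands in for the PHI-heavy code a real experiment would use.

    /* Sketch of the measurement primitive behind such timing channels, not
     * the paper's attack: time a fixed block of work repeatedly and look for
     * the latency jumps that current/voltage management introduces. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    static double now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e9 + ts.tv_nsec;
    }

    int main(void) {
        volatile double acc = 1.0;   /* volatile keeps the work from being optimized away */
        for (int trial = 0; trial < 10; trial++) {
            double t0 = now_ns();
            for (int i = 0; i < 1000000; i++)
                acc = acc * 1.000001 + 0.000001;   /* fixed work per trial */
            double t1 = now_ns();
            printf("trial %d: %.0f ns\n", trial, t1 - t0);
        }
        return 0;
    }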

A RISC-V in-network accelerator for flexible high-performance low-power packet processing


Find the technical paper link here. Abstract: "The capacity of offloading data and control tasks to the network is becoming increasingly important, especially if we consider the faster growth of network speed when compared to CPU frequencies. In-network compute alleviates the host CPU load by running tasks directly in the network, enabling additional computation/communication overlap and pot... » read more

Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-Grained Transfers


Harini Muthukrishnan (U of Michigan); David Nellans, Daniel Lustig (NVIDIA); Jeffrey A. Fessler, Thomas Wenisch (U of Michigan). Abstract—"Despite continuing research into inter-GPU communication mechanisms, extracting performance from multi-GPU systems remains a significant challenge. Inter-GPU communication via bulk DMA-based transfers exposes data transfer latency on the GPU’s critical... » read more
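
Why bulk DMA exposes latency is easiest to see with a back-of-the-envelope pipelining model (an illustration, not a model from the paper; the symbols T_compute, T_transfer, and k below are mine):

    % Rough cost model (illustrative, not taken from the paper).
    % Bulk transfer: the consumer GPU waits for the whole copy before computing.
    T_{\text{bulk}} \approx T_{\text{transfer}} + T_{\text{compute}}
    % Fine-grained transfer in k chunks, each forwarded as soon as it is
    % produced, overlaps copy and compute in a pipeline:
    T_{\text{fine}} \approx \max(T_{\text{transfer}}, T_{\text{compute}})
                    + \tfrac{1}{k}\,\min(T_{\text{transfer}}, T_{\text{compute}})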

SARA: Scaling a Reconfigurable Dataflow Accelerator


Yaqi Zhang, Nathan Zhang, Tian Zhao, Matt Vilim, Muhammad Shahbaz, Kunle Olukotun (Stanford). Abstract—"The need for speed in modern data-intensive workloads and the rise of “dark silicon” in the semiconductor industry are pushing for larger, faster, and more energy- and area-efficient architectures, such as Reconfigurable Dataflow Accelerators (RDAs). Nevertheless, challenges remain in d... » read more

TimeCache: Using Time to Eliminate Cache Side Channels when Sharing Software


"Abstract—Timing side channels have been used to extract cryptographic keys and sensitive documents even from trusted enclaves. Specifically, cache side channels created by reuse of shared code or data in the memory hierarchy have been exploited by several known attacks, e.g., evict+reload for recovering an RSA key and Spectre variants for leaking speculatively loaded data. In this paper, we ... » read more

Communication Algorithm-Architecture Co-Design for Distributed Deep Learning


"Abstract—Large-scale distributed deep learning training has enabled developments of more complex deep neural network models to learn from larger datasets for sophisticated tasks. In particular, distributed stochastic gradient descent intensively invokes all-reduce operations for gradient update, which dominates communication time during iterative training epochs. In this work, we identify th... » read more

Don’t Forget the I/O When Allocating Your Last-Level Cache


Source/Authors: Yifan Yuan (UIUC); Mohammad Alian (Kansas); Yipeng Wang, Ren Wang (Intel Labs); Ilia Kurakin (Intel); Charlie Tai (Intel Labs); Nam Sung Kim (UIUC). Find technical paper here. 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). "Abstract—In modern server CPUs, last-level cache (LLC) is a critical hardware resource that exerts significant... » read more