Author's Latest Posts


SARA: Scaling a Reconfigurable Dataflow Accelerator


Yaqi Zhang, Nathan Zhang, Tian Zhao, Matt Vilim, Muhammad Shahbaz, Kunle Olukotun (Stanford)

Abstract—"The need for speed in modern data-intensive workloads and the rise of “dark silicon” in the semiconductor industry are pushing for larger, faster, and more energy- and area-efficient architectures, such as Reconfigurable Dataflow Accelerators (RDAs). Nevertheless, challenges remain in d... » read more

REDUCT: Keep It Close, Keep It Cool – Scaling DNN Inference on Multi-Core CPUs with Near-Cache Compute


Abstract—"Deep Neural Networks (DNN) are used in a variety of applications and services. With the evolving nature of DNNs, the race to build optimal hardware (both in datacenter and edge) continues. General purpose multi-core CPUs offer unique attractive advantages for DNN inference at both datacenter [60] and edge [71]. Most of the CPU pipeline design complexity is targeted towards optimizin... » read more

RaPiD: AI Accelerator for Ultra-low Precision Training and Inference


Abstract—"The growing prevalence and computational demands of Artificial Intelligence (AI) workloads has led to widespread use of hardware accelerators in their execution. Scaling the performance of AI accelerators across generations is pivotal to their success in commercial deployments. The intrinsic error-resilient nature of AI workloads present a unique opportunity for performance/energy i... » read more

PF-DRAM: A Precharge-Free DRAM Structure


Authors: Nezam Rohbani † (IPM); Sina Darabi § (Sharif); Hamid Sarbazi-Azad †,§ (Sharif / IPM). † School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran; § Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

Abstract: "Although DRAM capacity and bandwidth have increased sharply by the advances in technology ... » read more

NN-Baton: DNN Workload Orchestration & Chiplet Granularity Exploration for Multichip Accelerators


"Abstract—The revolution of machine learning poses an unprecedented demand for computation resources, urging more transistors on a single monolithic chip, which is not sustainable in the Post-Moore era. The multichip integration with small functional dies, called chiplets, can reduce the manufacturing cost, improve the fabrication yield, and achieve die-level reuse for different system scales... » read more

TimeCache: Using Time to Eliminate Cache Side Channels when Sharing Software


"Abstract—Timing side channels have been used to extract cryptographic keys and sensitive documents even from trusted enclaves. Specifically, cache side channels created by reuse of shared code or data in the memory hierarchy have been exploited by several known attacks, e.g., evict+reload for recovering an RSA key and Spectre variants for leaking speculatively loaded data. In this paper, we ... » read more

Communication Algorithm-Architecture Co-Design for Distributed Deep Learning


"Abstract—Large-scale distributed deep learning training has enabled developments of more complex deep neural network models to learn from larger datasets for sophisticated tasks. In particular, distributed stochastic gradient descent intensively invokes all-reduce operations for gradient update, which dominates communication time during iterative training epochs. In this work, we identify th... » read more

Don’t Forget the I/O When Allocating Your Last-Level Cache


Source/Authors: Yifan Yuan (UIUC); Mohammad Alian (Kansas); Yipeng Wang, Ren Wang (Intel Labs); Ilia Kurakin (Intel); Charlie Tai (Intel Labs); Nam Sung Kim (UIUC) Find technical paper here. 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA).

Abstract—"In modern server CPUs, last-level cache (LLC) is a critical hardware resource that exerts significant... » read more

Ten Lessons From Three Generations Shaped Google’s TPUv4i


Source: Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, Nishant Patil, Sushma Prasad, Clifford Young, Zongwei Zhou (Google); David Patterson (Google / Berkeley) Find technical paper here. 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA).

Abstract—"Google de... » read more

HyperRec: Efficient Recommender Systems with Hyperdimensional Computing


A group of researchers is taking a different approach to AI. The University of California at San Diego, the University of California at Irvine, San Diego State University, and DGIST recently presented a paper on a new hardware algorithm based on hyperdimensional (HD) computing, a brain-inspired computing model. The new algorithm, called HyperRec, uses data that is modeled with bina... » read more
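A minimal C sketch of the binary-hypervector encoding style the article describes: each item gets a random D-bit hypervector, a user profile is the bitwise majority (bundle) of the items the user interacted with, and candidates are ranked by Hamming similarity. The dimensions, seed, and pipeline below are illustrative assumptions, not HyperRec's exact algorithm.

```c
/* Generic binary hyperdimensional-computing sketch for recommendation.
 * Assumes GCC/Clang for __builtin_popcountll. Not HyperRec's exact pipeline. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define D_WORDS 16                 /* 16 * 64 = 1024-bit hypervectors */
#define ITEMS 6

typedef struct { uint64_t bits[D_WORDS]; } HV;

/* Random hypervector; XOR of shifted rand() calls covers all 64 bits. */
static HV random_hv(void) {
    HV v;
    for (int i = 0; i < D_WORDS; i++)
        v.bits[i] = ((uint64_t)rand() << 33) ^ ((uint64_t)rand() << 11)
                  ^ (uint64_t)rand();
    return v;
}

/* Bundle: per-bit majority vote across n hypervectors. */
static HV bundle(const HV *vs, int n) {
    HV out = {{0}};
    for (int w = 0; w < D_WORDS; w++)
        for (int b = 0; b < 64; b++) {
            int ones = 0;
            for (int i = 0; i < n; i++)
                ones += (vs[i].bits[w] >> b) & 1;
            if (2 * ones > n)
                out.bits[w] |= (uint64_t)1 << b;
        }
    return out;
}

/* Similarity: matching bits out of 1024 (i.e., 1024 - Hamming distance). */
static int similarity(const HV *a, const HV *b) {
    int match = 0;
    for (int w = 0; w < D_WORDS; w++)
        match += 64 - __builtin_popcountll(a->bits[w] ^ b->bits[w]);
    return match;
}

int main(void) {
    srand(42);
    HV item[ITEMS];
    for (int i = 0; i < ITEMS; i++) item[i] = random_hv();

    /* The user interacted with items 0, 1, 2: bundle them into a profile. */
    HV history[3] = {item[0], item[1], item[2]};
    HV user = bundle(history, 3);

    /* Seen items score ~768/1024; unrelated items sit near the ~512 chance
     * level for random vectors, so ranking by similarity recommends. */
    for (int i = 0; i < ITEMS; i++)
        printf("item %d similarity: %d / 1024\n", i, similarity(&user, &item[i]));
    return 0;
}
```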
