Scalable Chiplet System for LLM Training and Finetuning with Reduced DRAM Accesses (Tsinghua University)


A new technical paper titled "Hecaton: Training and Finetuning Large Language Models with Scalable Chiplet Systems" was published by researchers at Tsinghua University. Abstract "Large Language Models (LLMs) have achieved remarkable success in various fields, but their training and finetuning require massive computation and memory, necessitating parallelism which introduces heavy communicat... » read more

DL Compiler for Efficiently Utilizing Inter-Core Connected AI Chips (UIUC, Microsoft)


A new technical paper titled "Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor" was published by researchers at UIUC and Microsoft Research. Abstract "As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication is enabled recently by employing high-bandwidth and low-latency interconnect links on th... » read more

GPU Microarchitecture Integrating Dedicated Matrix Units At The Cluster Level (UC Berkeley)


A new technical paper titled "Virgo: Cluster-level Matrix Unit Integration in GPUs for Scalability and Energy Efficiency" was published by UC Berkeley. Abstract "Modern GPUs incorporate specialized matrix units such as Tensor Cores to accelerate GEMM operations central to deep learning workloads. However, existing matrix unit designs are tightly coupled to the SIMT core, limiting the size a... » read more

Benefits Of Ultra-Low Leakage Currents From IGZO TFTs For Neuromorphic Applications


A new technical paper titled "A tunable multi-timescale Indium-Gallium-Zinc-Oxide thin-film transistor neuron towards hybrid solutions for spiking neuromorphic applications" was published by researchers at imec, CSIC Universidad de Sevilla, and Sungkyunkwan University. Abstract "Spiking neural network algorithms require fine-tuned neuromorphic hardware to increase their effectiveness. Such ... » read more

Characterizing Three Supercomputers: Multi-GPU Interconnect Performance


A new technical paper titled "Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects" was published by researchers at Sapienza University of Rome, University of Trento, Vrije Universiteit Amsterdam, ETH Zurich, CINECA, University of Antwerp, IBM Research Europe, HPE Cray, and NVIDIA. Abstract "Multi-GPU nodes are increasingly common in the rapidly evolving landscape... » read more

Survey of Deep Learning Accelerators for Edge and Emerging Computing (University of Dayton, AFRL)


A new technical paper titled "Survey of Deep Learning Accelerators for Edge and Emerging Computing" was published by researchers at University of Dayton and the Air Force Research Laboratory. Abstract "The unprecedented progress in artificial intelligence (AI), particularly in deep learning algorithms with ubiquitous internet connected smart devices, has created a high demand for AI compu... » read more

Co-optimizing HW Architecture, Memory Footprint, Device Placement And Per-Chip Operator Scheduling (Georgia Tech, Microsoft)


A technical paper titled “Integrated Hardware Architecture and Device Placement Search” was published by researchers at Georgia Institute of Technology and Microsoft Research. Abstract: "Distributed execution of deep learning training involves a dynamic interplay between hardware accelerator architecture and device placement strategy. This is the first work to explore the co-optimization ... » read more
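
Operationally, co-optimization means searching a joint space rather than fixing one axis and tuning the other. The toy sketch below exhaustively scores hypothetical (core count, device placement) pairs with a made-up analytical cost model; it shows the shape of the problem, not the paper's search algorithm or cost model.

```python
# Toy joint search over accelerator sizing and device placement with a
# made-up cost model.  Illustrates the shape of the co-optimization problem,
# not the paper's method.
import itertools

layers_gflop = [120, 240, 240, 60]           # hypothetical per-stage compute

def step_time_ms(cores, placement, gflops_per_core=0.5, link_ms=2.0):
    # placement maps each stage to a device; time ~ busiest device plus
    # a fixed penalty per inter-device boundary crossed.
    per_dev = {}
    for stage, dev in enumerate(placement):
        per_dev[dev] = per_dev.get(dev, 0.0) + layers_gflop[stage] / (cores * gflops_per_core)
    crossings = sum(placement[i] != placement[i + 1] for i in range(len(placement) - 1))
    return max(per_dev.values()) + crossings * link_ms

best = min(
    ((cores, placement) for cores in (64, 128, 256)
     for placement in itertools.product(range(2), repeat=len(layers_gflop))),
    key=lambda cfg: step_time_ms(*cfg),
)
print("best (cores, placement):", best, f"-> {step_time_ms(*best):.1f} ms")
```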

CHERI RISC-V: HW Extension for Conditional Capabilities


A technical paper titled “Mon CHÈRI <3 Adapting Capability Hardware Enhanced RISC with Conditional Capabilities” was published by researchers at Ericsson Security Research, Université Libre de Bruxelles, and KU Leuven. Abstract: "Up to 10% of memory-safety vulnerabilities in languages like C and C++ stem from uninitialized variables. This work addresses the prevalence and lack of ade... » read more
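
As a conceptual illustration of attaching a condition to a capability (written in Python only to match the other sketches here, and not the paper's ISA mechanism), the sketch below gates loads on the target memory having been stored to first, the kind of policy that would catch reads of uninitialized variables.

```python
# Conceptual model only: a capability whose load permission is conditioned on
# the target byte having been written first.  Not the paper's hardware design.

class ConditionalCapability:
    def __init__(self, size: int):
        self.mem = bytearray(size)
        self.initialized = [False] * size    # per-byte "has been written" tag

    def store(self, addr: int, value: int):
        self.mem[addr] = value
        self.initialized[addr] = True        # writing discharges the condition

    def load(self, addr: int) -> int:
        if not self.initialized[addr]:
            raise PermissionError(f"load of uninitialized byte at {addr}")
        return self.mem[addr]

cap = ConditionalCapability(16)
cap.store(0, 42)
print(cap.load(0))                 # fine: byte 0 was initialized
try:
    cap.load(1)                    # byte 1 was never written
except PermissionError as err:
    print("trapped:", err)
```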

Using Diffusion Models to Generate Chip Placements (UC Berkeley)


A technical paper titled “Chip Placement with Diffusion” was published by researchers at UC Berkeley. Abstract: "Macro placement is a vital step in digital circuit design that defines the physical location of large collections of components, known as macros, on a 2-dimensional chip. The physical layout obtained during placement determines key performance metrics of the chip, such as power... » read more
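
The mechanics of producing a placement by iterative denoising can be shown with a minimal DDPM-style reverse loop over macro (x, y) coordinates. The denoiser below is an untrained stand-in and the noise schedule is arbitrary; this is not the paper's trained model.

```python
# Minimal DDPM-style reverse process over macro (x, y) coordinates with an
# untrained stand-in denoiser -- it shows only the sampling mechanics, not
# the paper's model.
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def fake_denoiser(x, t):
    """Stand-in for a trained noise-prediction network."""
    return 0.1 * x

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 2))                  # 32 macros, (x, y), start from noise
for t in reversed(range(T)):
    eps = fake_denoiser(x, t)
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    x = (x - coef * eps) / np.sqrt(alphas[t])     # DDPM reverse-mean update
    if t > 0:
        x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)

print("denoised macro coordinates (first 3):\n", np.round(x[:3], 3))
```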

Survey of CXL Implementations and Standards (Intel, Microsoft)


A new technical paper titled "An Introduction to the Compute Express Link (CXL) Interconnect" was published by researchers at Intel Corporation, Microsoft, and University of Washington. Abstract "The Compute Express Link (CXL) is an open industry-standard interconnect between processors and devices such as accelerators, memory buffers, smart network interfaces, persistent memory, and solid-... » read more
