Author's Latest Posts


PLANAR: A Programmable Accelerator For Near-Memory Data Rearrangement


Many applications employ irregular and sparse memory accesses that cannot take advantage of existing cache hierarchies in high performance processors. To solve this problem, Data Layout Transformation (DLT) techniques rearrange sparse data into a dense representation, improving locality and cache utilization. However, prior proposals in this space fail to provide a design that (i) scales with m... » read more

Keyword Transformer: A Self-Attention Model For Keyword Spotting


The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully... » read more

Synchronization Overview And Case Study on Arm Architecture


This white paper shares knowledge about the Arm architecture and is aimed at readers who work on synchronization on Arm. Warning: locking optimizations demand extreme care about correctness. Bugs caused by synchronization are usually hard to root-cause, and the optimized code may crash on other CPUs wit... » read more

Introduction To The Arm Cortex-M55 Processor


This white paper covers the technical details of the Arm Cortex-M55 processor, including its pipeline, floating-point support and features. The Arm Cortex-M55 processor is Arm’s most AI-capable Cortex-M processor and the first to feature Arm Helium vector processing technology, bringing enhanced, energy-efficient signal processing and machine learning (ML) performance. » read more

Every Walk’s A Hit: Making Page Walks Single-Access Cache Hits


As memory capacity has outstripped TLB coverage, large data applications suffer from frequent page table walks. We investigate two complementary techniques for addressing this cost: reducing the number of accesses required and reducing the latency of each access. The first approach is accomplished by opportunistically "flattening" the page table: merging two levels of traditional 4 KB p... » read more

Components And Tools for Functional Safety Applications


Functional safety is important across a variety of markets, including the automotive, industrial, medical, and railway sectors, and is often prevalent in consumer electronics. However, the complexity of the embedded software required for functional safety is growing and security issues are rising due to connectivity requirements. This can result in the failure of a safety-critical system and lead to ... » read more

Arm Neoverse N1 Core: Performance Analysis Methodology


The Arm Neoverse ecosystem is growing substantially with many Arm hardware and software partners developing applications and porting their workloads onto Arm-based cloud instances. With Neoverse N1 based systems becoming widely available, many real-world workloads are showing very competitive performance and significant cost savings when compared to legacy systems. Some recent examples include:... » read more

Bandwidth Utilization Side-Channel On ML Inference Accelerators


Accelerators used for machine learning (ML) inference provide great performance benefits over CPUs. Securing confidential models during inference against off-chip side-channel attacks is critical in harnessing the performance advantage in practice. Data and memory address encryption has recently been proposed to defend against off-chip attacks. In this paper, we demonstrate that bandwidth... » read more

Post-Quantum Cryptography


Quantum computing is increasingly seen as a threat to communications security: rapid progress towards realizing practical quantum computers has drawn attention to the long-understood potential of such machines to break the fundamentals of contemporary cryptographic infrastructure. While this potential is so far firmly theoretical, the cryptography community is preparing for this possibility by deve... » read more

Understanding Write Combining On Arm


Write Combining (WC) is a specialized memory type defined by the x86-64 architecture that is used for gathering multiple stores into burst transactions over the system bus. WC is commonly used on x86-64 platforms for interaction with I/O and other peripheral devices. In this whitepaper we provide an overview of the Arm architecture memory types that provide WC-like capabilities. In addition, t... » read more
