Flexible AI-MCU For Fast Inference of Transformer Models At The Ultra-Low-Power Edge (ETH Zurich, U. Bologna)


Researchers from ETH Zurich and University of Bologna have released “CHIMERA: A Flexible and Scalable 3.1 TOPS/W AI-MCU with Transformer Accelerator and 563 Gb/s Shared-L2 Memory Subsystem with QoS Guarantees”. Abstract “We present Chimera, a flexible and scalable Microcontroller Unit (MCU) designed to accelerate real-time inference of rapidly evolving transformer-based models a... » read more

Analog IMC Attention Mechanism For Fast And Energy-Efficient LLMs (FZJ, RWTH Aachen)


A new technical paper titled "Analog in-memory computing attention mechanism for fast and energy-efficient large language models" was published by researchers at Forschungszentrum Jülich and RWTH Aachen. Abstract "Transformer networks, driven by self-attention, are central to large language models. In generative transformers, self-attention uses cache memory to store token projec... » read more

Transformers At The Edge: Efficient LLM Deployment


Since the groundbreaking 2017 publication of “Attention Is All You Need,” the transformer architecture has fundamentally reshaped artificial intelligence research and development. This innovation laid the foundation for Large Language Models (LLMs) and Video Language Models (VLMs), fueling a wave of productization across the industry. A defining milestone was the public launch of ChatGPT in... » read more

Shrinking LLMs With Self-Compression


Language models are becoming ever larger, making on-device inference slow and energy-intensive. A direct and surprisingly effective remedy is to prune complete channels whose contribution to the task is negligible. Our earlier work introduced a training-time procedure – Self-Compression [1, 4] – that lets back-propagation decide the bit-width of every channel, so unhelpful ones fade away. T... » read more

The Rise Of Generative AI On The Edge


Artificial intelligence (AI) and machine learning (ML) have undergone significant transformations over the past decade. The revolution of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) is evolving toward the adoption of transformers and generative AI (GenAI), marking a pivotal shift in the field. This transition is driven by the need for more accurate, efficient, and ... » read more

No Fooling With Voxel Pooling


A variety of new and complicated transformer models have emerged in the past 18 to 24 months as new “must have” networks in advanced automotive use cases. These novel architectures often introduce new network operators or novel ways of combining tensors – often from different types of sensors – in ways to enhance detection and recognition of objects in L3 / L4 / L5 ADAS and autonomous d... » read more

NPU Acceleration For Multimodal LLMs


Transformer-based models have rapidly spread from text to speech, vision, and other modalities. This has created challenges for the development of Neural Processing Units (NPUs). NPUs must now efficiently support the computation of weights and propagation of activations through a series of attention blocks. Increasingly, NPUs must be able to process models with multiple input modalities with ac... » read more

HW and SW Architecture Approaches For Running AI Models


How best to run AI inference models is a current topic of much debate as a wide breadth of systems companies look to add AI to a variety of systems, spurring both hardware innovation and the need to revamp models. Hardware developers are making progress with AI accelerators and SoCs. But on the model side, questions abound about whether the answer might come from revisiting older, less compl... » read more

Vision Is Why LLMs Matter On The Edge


Large Language Models (LLMs) have taken the world by storm since the 2017 Transformers paper, but pushing them to the edge has proved problematic. Just this year, Google had to revise its plans to roll out Gemini Nano on all new Pixel models — the down-spec’d hardware options proved unable to host the model as part of a positive user experience. But the implementation of language-focused mo... » read more

How To Successfully Deploy GenAI On Edge Devices


Generative AI (GenAI) burst onto the scene and into the public’s imagination with the launch of ChatGPT in late 2022. Users were amazed at the natural language processing chatbot’s ability to turn a short text prompt into coherent humanlike text including essays, language translations, and code examples. Technology companies – impressed with ChatGPT’s abilities – have started looking ... » read more

← Older posts