Lower Energy, High Performance LLM on FPGA Without Matrix Multiplication

June 27th, 2024 - By: Technical Paper Link

A new technical paper titled “Scalable MatMul-free Language Modeling” was published by UC Santa Cruz, Soochow University, UC Davis, and LuxiTech.

Abstract

“Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model’s memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at this https URL.”
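The abstract does not spell out how the multiplications are removed. One common route, and the one the paper's dense layers are understood to rely on, is to constrain weights to the ternary set {-1, 0, +1}, so that each output element becomes a signed sum of inputs. The following is a minimal sketch of that idea only, not the authors' implementation; the function name `ternary_matvec` and the sample values are hypothetical.

```python
def ternary_matvec(weights, x):
    """Compute y = W x for W with entries in {-1, 0, +1}.

    Only additions and subtractions are used; no multiplications.
    `weights` is a list of rows, `x` is a list of floats.
    """
    out = []
    for row in weights:
        if len(row) != len(x):
            raise ValueError("row length must match input length")
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the input
            elif w == -1:
                acc -= xi      # -1 weight: subtract the input
            elif w != 0:
                raise ValueError("weights must be -1, 0, or +1")
            # 0 weight: the input is skipped entirely
        out.append(acc)
    return out


# Hypothetical example values, chosen only for illustration.
W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.0]
y = ternary_matvec(W, x)
```

Because every weight is a sign or zero, hardware needs only adders and a small weight encoding, which fits the abstract's point about lightweight operations on an FPGA.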

Find the technical paper as a preprint on arXiv (arXiv:2406.02528). Published June 2024. A news summary from the university is also available.

Zhu, Rui-Jie, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason K. Eshraghian. “Scalable MatMul-free Language Modeling.” arXiv preprint arXiv:2406.02528 (2024).
